This semester, I am enrolled in a Digital Humanities seminar. I wanted to get a theoretical foundation for how evolving technology is shaping attitudes about research and special collections in other disciplines. To that end, this seminar did not disappoint.
Over the past sixteen weeks, our seminar touched upon most of the major conversations about DH, from the nuances of ontology and computational hermeneutics to the boundaries of disciplinarity. We also got our hands dirty with established DH tools and methods. For one such exercise, we were tasked with finding two collections of text to analyze in a Jupyter notebook, a Python-based environment for literate programming. Using a program written by the seminar’s instructor, Ted Underwood, I compared the texts of two popular health magazines. Here are a few lessons learned about data modeling and collection analysis.
Choose Compatible Collections
The first challenge I addressed was identifying two collections that were different enough to draw a contrast, but similar enough that any substantial results could be interpreted as reasonably correlated. Initially I wanted to contrast the reflections of two demographics responding to the same social or political phenomenon, one being the demographic most closely associated with that phenomenon and the other being the “outsiders” looking in. So, I sought out transcripts of YouTube videos about bullying, one collection featuring young people and another featuring adults or parents reflecting on bullying. After some digging, I concluded that it would be easier to work with collections that had already been curated to some degree. So, I settled on comparing content from two fitness magazines, one for men (Men’s Health) and one for women (Women’s Health). Using gender to distinguish between the two collections provided a predetermined contrast (despite the acknowledged oversimplification of gender for the purpose of this project). The samples were easy to find through the library’s database interface, and the actual text was already packaged. I simply had to copy and paste the words into separate text files.
Tinker–Clean Your Data and Cross-Validate
After finding my samples, the next lesson centered on making the texts work with the program. This required patience and problem solving. Some data cleaning was required, and some pre-programmed variables within the script needed to be removed to prevent skewing. For example, the first run of the script revealed “photo” to be one of the more common words, because the word had been inserted as a placeholder when the documents were converted from PDF to HTML full text. I also noticed that words from certain articles were being weighted unusually heavily. In this case, the word “chicken,” which came from an article about cooking, ranked highly even though it was not mentioned in any of the other articles in the collection. So once again I returned to the text files and tweaked the samples, cutting and pasting to ensure all the articles were roughly the same length so that each article carried more equal weight within the collection. I also did a word count in order to provide the num_variables input with a reasonable integer.
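The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration and not the seminar program itself: the function names, the placeholder list, and the choice to trim every article to the length of the shortest one are my own assumptions, and I show both a total and a distinct word count because either could serve as a starting point for an input like num_variables.

```python
import re
from collections import Counter

# Assumption: "photo" is the only conversion placeholder that needs removing.
PLACEHOLDERS = {"photo"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def clean_collection(articles, placeholders=PLACEHOLDERS):
    """Drop placeholder words, then trim every article to the shortest length."""
    tokenized = [
        [tok for tok in tokenize(article) if tok not in placeholders]
        for article in articles
    ]
    if not tokenized:
        return []
    shortest = min(len(tokens) for tokens in tokenized)
    return [tokens[:shortest] for tokens in tokenized]


def document_frequency(cleaned):
    """Count how many articles each word appears in (not how often overall)."""
    counts = Counter()
    for tokens in cleaned:
        counts.update(set(tokens))
    return counts


def word_counts(cleaned):
    """Return (total words, distinct words) across the cleaned collection."""
    all_tokens = [tok for tokens in cleaned for tok in tokens]
    return len(all_tokens), len(set(all_tokens))
```

A word with a document frequency of 1 that still ranks near the top, like “chicken” in my first run, is a sign that a single article is dominating the sample.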
Once I ran the numbers, I did make some interesting observations, although not necessarily about the content. As expected, there was a gendering of common words. Most notably, the women’s health magazine featured words that imply a reflection of self: “beauty,” “feeling,” and “personal.” The men’s magazine, on the other hand, featured active, motion-based words: “power,” “faster,” “say,” “do,” and “enjoy.” However, what I found more interesting was not so much what the words may mean as what we might be able to deduce by analyzing the linguistic features of the words used (doing data on data, so to speak). For example, of the nineteen most common words identified in Women’s Health, 73% had more than one syllable. In contrast, only 31% of the most common words in the men’s magazine were multi-syllabic. I am curious what these findings imply about the perceived audience of these two magazines. Could the high percentage of multi-syllabic words in Women’s Health reflect assumptions about the typical level of educational attainment and socioeconomic status of its readers? By looking at these collections side by side, are there things we can deduce about how these magazines may do different things, despite being presented as two sides of the same coin?
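For anyone who wants to try the same “data on data” count, here is a rough sketch of how the share of multi-syllabic words in a list could be estimated. It is not how I arrived at the figures above. The syllable counter is a simple vowel-group heuristic of my own choosing, it will miscount some English words, and the function names are hypothetical.

```python
import re

VOWEL_GROUP = re.compile(r"[aeiouy]+")


def estimate_syllables(word):
    """Estimate syllables by counting runs of vowels (a rough heuristic)."""
    word = word.lower()
    count = len(VOWEL_GROUP.findall(word))
    # Treat a trailing "e" as silent ("like"), except in "-le" endings ("table").
    if count > 1 and word.endswith("e") and not word.endswith("le"):
        count -= 1
    return max(count, 1)


def share_multisyllabic(words):
    """Return the fraction of words estimated to have more than one syllable."""
    if not words:
        return 0.0
    multi = sum(1 for word in words if estimate_syllables(word) > 1)
    return multi / len(words)
```

Passing in the full list of top words for each magazine gives a fraction that can be compared across the two collections. The handful of words quoted in this post are not the full lists, so they will not reproduce the percentages above.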
Digital Humanities and Librarians
This single activity by no means makes me a data wizard or Jupyter aficionado. But as I’ve mentioned in other posts, librarians must be multidisciplinary beasts. As more collections are digitized by academic institutions, and as certain publications forgo the printing process altogether in favor of cheaper and more accessible electronic versions, librarians should be expected to understand the new research opportunities made possible by digital collections. In order to serve academic communities, we need to be able to speak across a wide range of disciplines and understand the inquiries, scope, and challenges that go into designing DH projects.
Cover photo. Men’s Health. Jul. 2012. Web. 8 Dec. 2016.
Cover photo. Women’s Health. South Africa. Apr. 2013. Web. 8 Dec. 2016.