The Glamorous World of Data Modelling

[Cover images: Men’s Health, July 2012; Women’s Health (South Africa), April 2013]

This semester, I am enrolled in a Digital Humanities seminar. I wanted to get a theoretical foundation for how evolving technology is shaping attitudes about research and special collections in other disciplines. To that end, this seminar did not disappoint.

Over the past sixteen weeks, our seminar touched on most of the major conversations about DH, from the nuances of ontology and computational hermeneutics to the boundaries of disciplinarity. We also got our hands dirty with established DH tools and methods. For one such exercise, we were asked to find two collections of text to analyze in Jupyter Notebook, a Python-based environment for literate programming. Using a program written by Ted Underwood, a member of the seminar faculty, I compared the texts of two popular health magazines. Here are a few lessons learned about data modeling and collection analysis.

Choose Compatible Collections

The first challenge was identifying two collections that were different enough to draw a contrast, but similar enough that any substantial results could be interpreted as reasonably correlated. Initially I wanted to contrast the reflections of two demographics responding to the same social or political phenomenon, one being the demographic most closely associated with that phenomenon and the other being the “outsiders” looking in. So, I sought out transcripts for YouTube videos about bullying, one collection being videos featuring young persons and another collection featuring adults or parents reflecting on bullying. After some digging, I came to the conclusion that it would be easier to work with collections that had already been curated to some degree. So, I settled on comparing content from two fitness magazines, one for men (Men’s Health) and one for women (Women’s Health). Using gender to distinguish between the two collections provided a predetermined contrast (despite the acknowledged oversimplification of gender for the purpose of this project). The samples were pretty easy to find through the library’s database interface, and the actual text was already packaged. I simply had to copy and paste the words into separate text files.

Tinker–Clean Your Data and Cross-Validate

After finding my samples, the next lesson centered on making the texts work with the program. This required patience and problem solving. Some data cleaning was required, and some pre-programmed variables within the script needed to be removed to prevent skewing. For example, the first run of the script revealed “photo” to be one of the more common words, because the word had been used as a placeholder when the document was converted from PDF to HTML full text. I also noticed that words from certain articles were being given unusually high weight (in this case, the word “chicken,” which came from an article about cooking) even though they did not appear in any of the other articles in the collection. So once again I returned to the text files and tweaked the samples, cutting and pasting to ensure all the articles were roughly the same length so that each article carried more equal weight within the collection. I also did a word count so that I could give the num_variables input a reasonable integer.
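
The cleaning steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the seminar script: the function names, the PLACEHOLDERS set, and the two sample strings are all hypothetical, and trimming every article to the length of the shortest one is just one way to make articles “roughly the same length.”

```python
import re
from collections import Counter

# Hypothetical set of conversion artifacts to drop. "photo" is the
# placeholder word mentioned in the post; extend as needed.
PLACEHOLDERS = {"photo"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def clean_article(text):
    """Tokenize one article and remove placeholder tokens."""
    return [t for t in tokenize(text) if t not in PLACEHOLDERS]


def build_collection(articles):
    """Clean each article, trim all of them to the length of the
    shortest, and count word frequencies across the collection."""
    cleaned = [clean_article(a) for a in articles]
    shortest = min(len(tokens) for tokens in cleaned)
    trimmed = [tokens[:shortest] for tokens in cleaned]
    counts = Counter(word for tokens in trimmed for word in tokens)
    return trimmed, counts


# Made-up sample articles, for illustration only.
articles = [
    "Photo Grill the chicken chicken chicken chicken and serve",
    "Photo Run faster today",
]
trimmed, counts = build_collection(articles)
vocabulary_size = len(counts)  # number of distinct words after cleaning
```

After trimming, a word that dominates one long article (like “chicken” here) no longer swamps the counts. The distinct-word count is one number that could inform an input such as num_variables, though how the actual script uses that input is not covered here.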



Once I ran the numbers, I did make some interesting observations, although not necessarily about the content. As expected, there was a gendering of common words. Most notably, the women’s health magazine featured words that imply a reflection of self: “beauty,” “feeling,” and “personal.” The men’s magazine, on the other hand, featured active, motion-based words: “power,” “faster,” “say,” “do,” and “enjoy.” However, what I found more interesting was not so much what the words may mean as what we might be able to deduce by analyzing the linguistic features of the words used (doing data on data, so to speak). For example, of the nineteen most common words identified in Women’s Health, 73% were multi-syllabic. In contrast, only 31% of the most common words in the men’s magazine were multi-syllabic. I am curious what these findings imply about the perceived audience of these two magazines. Could the high percentage of multi-syllabic words in Women’s Health reflect assumptions about the typical level of educational attainment and socioeconomic status of its readers? By looking at these collections side by side, are there things we can deduce about how these magazines may do different things, despite being presented as two sides of the same coin?
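
For anyone who wants to try the syllable comparison, here is a minimal sketch. It assumes a crude heuristic (counting groups of consecutive vowels, with a small adjustment for a silent final “e”), which will miscount some English words; the function names are hypothetical. The word lists are only the handful of examples quoted above, not the full lists of most common words, so the shares it produces are not the 73% and 31% figures.

```python
import re


def estimate_syllables(word):
    """Rough syllable estimate: count groups of consecutive vowels."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Heuristic: a final "e" is usually silent (as in "time"),
    # except in "-le" endings (as in "people").
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def multisyllabic_share(words):
    """Fraction of words estimated to have more than one syllable."""
    multi = [w for w in words if estimate_syllables(w) > 1]
    return len(multi) / len(words)


# Only the example words quoted in the post, not the complete lists.
womens_examples = ["beauty", "feeling", "personal"]
mens_examples = ["power", "faster", "say", "do", "enjoy"]

womens_share = multisyllabic_share(womens_examples)
mens_share = multisyllabic_share(mens_examples)
```

A pronunciation dictionary would give more reliable syllable counts than a vowel-group heuristic, but for a short list of common words the rough estimate is easy to check by hand.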

Digital Humanities and Librarians

This single activity by no means makes me a data wizard or Jupyter aficionado. But as I’ve mentioned in other posts, librarians must be multidisciplinary beasts. As more collections are digitized by academic institutions, and as certain publications forgo printing altogether in favor of cheaper and more accessible electronic versions, librarians should be expected to understand the new research opportunities made possible by digital collections. In order to serve academic communities, we need to be able to speak across a wide range of disciplines and understand the inquiries, scope, and challenges that go into designing DH projects.

Image Credits:
Cover photo. Men’s Health. Jul. 2012. Web. 8 Dec. 2016.
Cover photo. Women’s Health. South Africa. Apr. 2013. Web. 8 Dec. 2016.

2 replies

  1. Thanks, Kristina, this was a great, compact introduction to how scholars will actually be using library data for digital humanities research. I was familiar with the approach in theory but feel like I have a much more concrete understanding now.


  2. Hi Leslie, thanks for the feedback! I’ve been very lucky to have an excellent teacher who is interested in understanding DH from the perspective of the academic disciplines (For whom does DH work?). As a result, I think my class came to think about DH as a way of better understanding the status of scholarship in our respective fields–Film, Music, Art History, Communication, English, Library Science. It’s sort of flipped my understanding of DH. We are always thinking about what DH can do for us, but the more I think about it, the more I think DH needs subject specialists to help articulate a DH agenda.

