Right at the end of the last semester, a professor dropped The Accidental Data Scientist on my desk. The book, by corporate librarian Amy Affelt, is a relatively recent publication, and it inspired me to take a second look at what I had come to consider a hackneyed buzz-term: Big Data.
I started by consulting the OED. (Don’t you?)
According to the Oxford English Dictionary, “big data” is a noun, “also [written] with capital initials,” that entered the lexicon in its present form circa 1980, as exemplified by Charles Tilly in his essay on trends in modern historiography. Big Data—which I render with the fancy optional capital initials—is defined by the self-proclaimed “definitive record of the English language” as—wait for it—“data of a very large size.”
Thanks, OED. Penetrating. (Don’t you also sass the OED in your head sometimes?)
I freely admit that I was disappointed by what seemed a myopic definition. Big Data is a compound noun, as the OED hints, but it is also a collective noun. Like Big Government, Big Business, Big Ag, and Big Pharma, Big Data is a single term used to refer to a whole industry of individuals and companies and goods and services, and its use is becoming increasingly widespread. While Google Trends shows that “big bang theory” is far more frequently searched than “Big Data,” I wonder whether it is the scientific theory or the television show that is holding its own against our increasing fascination with Big Data. After all, searches for “Big Data” now outstrip not only searches for “Big Government,” “Big Business,” “Big Ag,” and “Big Pharma,” but also searches for “Big Sur” and “big yellow taxi.” Even information on the “Big Ten” is more sought after only during the month of March. Big Data is a big deal. Given the widespread fascination with the term, I expected the OED to offer a more thorough explanation.
Despite my initial disappointment, however, two things struck me about the OED entry on Big Data. Both festered into full-blown ideas that I thought worth sharing here:
1. Big Data is probably better understood as two words rather than one term.
The grammar nerd in me was initially a little uncomfortable with the OED’s identification of “big data” as a noun. I don’t like to pick fights with the OED, but I can’t help but think that Big Data—even with the fancy optional capital initials—is better understood as two things rather than one: a noun preceded by an adjective rather than a single compound noun.
Fundamentally, Big Data is just a lot of data. Even the OED definition implies that the major distinguishing feature of Big Data is its bigness. Though the term is used generally to refer to the industry that collects, manages, and manipulates the large datasets generated by contemporary technology, librarians do well to remember that there is nothing unique about the information that falls under the heading of Big Data except its amount and complexity. We need to remember this because:
2. Charles Tilly really needed the help of a good data librarian.
The 1980 paper by Charles Tilly that the OED now cites as its earliest example of the current meaning of Big Data was originally delivered as an address to the “Conference on New Directions in History” at the State University of New York in Buffalo. Wryly humorous throughout, the paper, entitled “The Old New Social History and the New Old Social History,” documents the rise and fall in popularity of data-driven historiography, recalling first its ascent in the early 1960s, then its decline in the late 1970s, before finally offering comments on potential directions for future historiographical work.
Describing the shift away from data-driven methods of analysis, Mr. Tilly records a litany of reasons that one of his colleagues, an “eminent European social historian,” has become disillusioned with the “cliometrics” and “prosopography” he once championed. Among his colleague’s complaints about the lack of reliability in the data and the lack of consistency in the work of the research assistants, Mr. Tilly also notes “that mathematical results are incomprehensible to the historians they are meant to persuade,” and “that the investigations tend to lose their wit, grace, and sense of proportion in the pursuit of statistical results.” When Mr. Tilly finally uses the term for which the OED has made his essay enduringly famous, it is with derision: “none of the big questions has actually yielded to the bludgeoning of the big-data people.” A quote from the colleague closes the section, claiming that “the usefulness of the results seems […] to be in inverse correlation to the mathematical complexity of the methodology and the grandiose scale of data-collection.” For Mr. Tilly and his colleague, the bigger the data, the less useful the results.
Big Data is one of the side-effects of the Digital Revolution. In 1980, the personal computer was, even for the most forward-looking of librarians, still a device more likely found in science fiction than on a reference desk. The microprocessor, the internet, wireless technology: all contributed to the creation, the storage, and the sharing of data. Now, smartphones with a suite of sensors are standard technology, and they interface with so-called “wearables” that gather even more data. The Internet of Things connects them all, providing a virtually endless supply of information to those who have the access and the ability to read and to interpret it. Therein lies the problem and the promise of Big Data for librarians: the access and the ability to read and to interpret.
Big Data is a librarian’s issue. The collection, organization, storage, and retrieval of information are the librarian’s specialty. The recent trend among librarians and librarian education programs to identify as information scientists or schools of information science reflects a growing awareness that the antiquated image of the librarian as guardian of a library full of books and other printed items is no longer either particularly relevant or particularly accurate. Librarians, as information scientists, now specialize in solving the problems identified by Charles Tilly and his fellow historiographers 35 years ago. Even with advances in technology, it remains all too easy for the application of data to questions of human significance to feel like “bludgeoning.” The job of the librarian—whether employed in a library or in another context as an information scientist—is to make sure that the “big questions” do yield to inquiry, ideally without becoming “incomprehensible” and without losing their “wit, grace, and sense of proportion.” Big Data is not new, but it is newly accessible to those who have the ability to read and to interpret it. The librarian’s job is to ensure that access to the resources of Big Data is as free of restraint and penalty as is reading the OED.
If you, like me, are interested in a career specializing in or around Big Data, I highly recommend Chris Eaker and Christina Czuhajewski on working in data curation and data visualization. And Courtney Baron’s contribution gives some great tips to us all on what we can do as students to prepare for a career in data. Share any resources of your own in the comments.