First steps in Information Retrieval

Before starting an MLIS in digital librarianship I had worked in libraries for around nine years. Despite that experience and my enthusiasm, a lot of the course is new to me. One of the new topics is information retrieval (IR).

At work I have seen information retrieval mostly from the point of view of the end user: advising on the best search terms or search filters to bring back the best results, but without a real understanding of why they worked. With no real prior expectations about this module I was certainly surprised (and, to start with, a little intimidated) to be working with formulas like:

tfidf(k_i, d_j) := tf(k_i, d_j) × idf(k_i)

(to do with term weighting…), and I initially assumed that this topic was relevant only to a very behind-the-scenes area of library and information science: the people actually designing library systems. And whilst it is true that IR in the automated, computerised sense typically falls under the umbrella of computer science, it has always existed in librarianship. At one time it took a simpler form, with a librarian interpreting a user’s search query and then manually retrieving and cross-referencing material.

I’m part way through this module and I wanted to share my initial thoughts and experiences of it.

This is a great definition of IR from the online edition of Manning, Raghavan, and Schütze’s Introduction to Information Retrieval:

“Information retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers)” (Manning, Raghavan, & Schütze, 2009, p. 1).

You will note that this definition describes the material being retrieved as usually documents and text. Increasingly, though, the information people are looking for is contained not only in text documents but also in multimedia such as video, audio, and images, and I believe this is one interesting and challenging area for IR.

One of the most important aspects of IR is the notion of relevance, described in the text Modern Information Retrieval as the fundamental ‘IR Problem’:

“The IR Problem: the primary goal of an IR system is to retrieve all the documents that are relevant to a user query while retrieving as few nonrelevant documents as possible” (Baeza-Yates & Ribeiro-Neto, 2010, p. 4).
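The two halves of the IR Problem are usually measured as recall (how many of the relevant documents were retrieved) and precision (how many of the retrieved documents were relevant). As a minimal sketch of that idea, here is how both could be computed for a single query. The document IDs and the function name are made up purely for illustration; they do not come from either textbook.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for the results of one query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical judgements: four documents are relevant, the system returned three.
relevant = {"d1", "d2", "d3", "d4"}
retrieved = ["d1", "d2", "d5"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
# prints "precision=0.67 recall=0.50"
```

In this invented case the system found half of what it should have (recall) and one of its three results was noise (precision), which is the trade-off the quotation describes.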

A simple way to envisage one form of information retrieval is to imagine reading an entire work from start to finish, noting where each search term does or does not occur. This straightforward, if time-consuming, process is called ‘grepping’ (after the Unix command ‘grep’). It is effective for small collections and simple searches, but try to superimpose this thought experiment onto vast collections and you can appreciate that other methods are required, not least because grepping doesn’t rank results by relevance or allow for more advanced operations.

It’s not enough just to bring back information; that information has to be relevant. We tend to see this illustrated by the ranked lists of results we get back when we perform a search in a library catalogue or on any online search engine. The top result should be the most relevant to our information need, with each result becoming less relevant as we scan down the list. How the IR system actually works this out is complex: it involves the way the information is described by its metadata, the way this is scanned, and the weighting given to terms. Suffice to say the formula above was the tip of the iceberg in this respect.

Despite being daunted at first by the mathematical content of the course, once I got into the right frame of mind it all started to seem pretty straightforward, and very illuminating in understanding how searches bring back the results they do.
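To make the contrast between grepping and ranked retrieval a bit more concrete, here is a small Python sketch. It assumes one common variant of the weighting formula from earlier (raw term counts for tf, and the natural log of N/df for idf); real systems use many refinements. The three tiny "documents" and all function names are invented for illustration.

```python
import math
from collections import Counter

# Hypothetical three-document collection, invented for illustration.
docs = {
    "d1": "the library catalogue lists every book in the library",
    "d2": "search the catalogue to find a book",
    "d3": "the cat sat on the mat",
}


def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()


def grep_scan(term, docs):
    """Linear scan: IDs of documents containing the term, in no ranked order."""
    return [doc_id for doc_id, text in docs.items() if term in tokenize(text)]


def tf_idf_scores(query, docs):
    """Rank documents by the sum of tf * idf over the query terms."""
    n_docs = len(docs)
    counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    scores = {}
    for doc_id, tf in counts.items():
        score = 0.0
        for term in tokenize(query):
            # Document frequency: how many documents contain this term.
            df = sum(1 for c in counts.values() if term in c)
            if df == 0:
                continue  # term appears nowhere in the collection
            idf = math.log(n_docs / df)  # rarer terms get a higher weight
            score += tf[term] * idf
        scores[doc_id] = score
    # Highest score first, i.e. most relevant at the top of the list.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


print(grep_scan("catalogue", docs))  # matching documents, unranked
for doc_id, score in tf_idf_scores("library catalogue", docs):
    print(doc_id, round(score, 3))
```

The scan only says which documents contain a word, whereas the weighted version puts d1 first because ‘library’ is both frequent in that document and rare in the collection. A word like ‘the’, which appears in every document, gets an idf of zero and so contributes nothing to the ranking.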

As the course has progressed I have begun to appreciate that information retrieval and information searching are necessarily tied together, so the experience of IR I described at the start of this post is far from being as irrelevant as I’d first thought. I’d say it is essential for the study of each area to encompass the study of the other: IR will only be effective when a user’s information seeking behaviour is understood, and a user’s information seeking will be much more effective with some understanding of how the IR system works and how to use it well.

Furthermore, this all ties in with the way information is described in the first place. With the development of the Semantic Web and machine-to-machine information exchange, it is becoming even more essential to have a reliable and accurate system of information retrieval. In a Scientific American article from 2001 introducing the concept of the Semantic Web, the authors emphasised that:

“properly designed, the Semantic Web can assist the evolution of human knowledge as a whole” (Berners-Lee, Hendler, & Lassila, 2001).

I see this as one of the most exciting opportunities in IR for librarians. As experts in describing, cataloguing, and sharing information, they are ideally placed to make the most of this next extension of the Web and to be among those involved in ‘properly designing’ it. In that article the authors enthusiastically looked towards a future where personal devices could communicate with one another and the Web to automatically set up, for example, necessary medical appointments for a person at a convenient location and time. All of this would need only the most minimal input from the end user, and would be possible only because of accurately described and organised data, available and linked via the Web: the kind of thing librarians are expert in. These kinds of technologies have huge potential for libraries to enrich their information, enable more relevant and efficient information retrieval, and enhance their ability to share knowledge. If the Semantic Web is properly designed, users will be able to find the information they need without having to be experts in formulating information queries or filtering results.
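Semantic Web data is commonly expressed as subject-predicate-object statements ("triples"). As a very rough sketch of why well-described data makes this kind of automation possible, here is a toy triple list and a pattern-matching lookup in plain Python. The clinics, places, predicate names, and functions are all hypothetical; this is not real RDF tooling, just the underlying idea.

```python
# Hypothetical subject-predicate-object triples, invented for illustration.
triples = [
    ("clinic:riverside", "offers", "service:physiotherapy"),
    ("clinic:riverside", "locatedIn", "place:northside"),
    ("clinic:hilltop", "offers", "service:physiotherapy"),
    ("clinic:hilltop", "locatedIn", "place:southside"),
    ("clinic:hilltop", "offers", "service:radiology"),
]


def match(triples, subject=None, predicate=None, obj=None):
    """Return the triples that fit the pattern; None acts as a wildcard."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]


def clinics_offering(service, place):
    """Find clinics that offer a service AND are located in a given place."""
    offering = {s for s, _, _ in match(triples, predicate="offers", obj=service)}
    located = {s for s, _, _ in match(triples, predicate="locatedIn", obj=place)}
    return sorted(offering & located)


print(clinics_offering("service:physiotherapy", "place:southside"))
# prints ['clinic:hilltop']
```

No keyword guessing is involved here: the answer falls out directly because each fact was described consistently in the first place, which is where cataloguing expertise comes in.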

Hopefully this has given you a brief glimpse into the world of IR if you’re thinking about studying it, as well as some useful links to follow up and learn more. I’m still taking this module and learning as I go, but this is how I understand it at the moment, and despite feeling slightly overwhelmed to start with I’m becoming more and more enthusiastic about this topic. Let me know in the comments what you think about IR and how you see it being shaped in the future…

References
Baeza-Yates, R. & Ribeiro-Neto, B. 2010. Modern Information Retrieval. 2nd ed. Online edition: Addison-Wesley. Retrieved from http://www.mir2ed.org/

Berners-Lee, T., Hendler, J. & Lassila, O. 2001. The Semantic Web. Scientific American, 284(5).

Manning, C. D., Raghavan, P. & Schütze, H. 2009. An Introduction to Information Retrieval. Online edition: Cambridge University Press. Retrieved from http://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

 

Featured Image: Linked Data (Semantic Web) candies by Reed Sturtevant licensed under CC BY-SA 2.0
