1. Module Objectives
In this module we will be looking at the basic concepts of Information Retrieval.
Module Objectives: By the end of this module, students will be able to:
Module Topics:
2. Introduction to Module
Have you ever wondered how web search tools can find the information you request so quickly, out of millions and millions of web pages? Have you ever wondered why it’s so easy to find some things on the WWW and all but impossible to find others? The answers to all these questions fall into the domain of what’s called “Information Retrieval [IR]”.

3. How Big is the WWW & How Is it Used?
Since this module deals with the retrieval of information, an interesting question comes up: just how much information is out there? Take a look at the “How Much Information” report from Berkeley: http://www.sims.berkeley.edu/research/projects/how-much-info-2003/ .

Questions:
How much information is produced worldwide each year?
How much information per woman, man, and child does this equate to?
What is 5 exabytes equivalent to?
Of all the information produced each year, how much is in print?
Let’s take it down a notch. Instead of asking how much information exists in the world, let’s limit the question to just the World Wide Web. In a well-known study published in Nature, Steve Lawrence and Lee Giles stated:
It's important to understand that even when the study was done in 1999, the actual size of the WWW was considerably larger than the quoted 15 TB. There is an enormous amount of information stored on the web that hasn't been indexed by search engines. This is often referred to as the "invisible web" and we'll learn more about it later in this module.

Still having trouble visualizing 15 TB? Maybe this will help. Back in Module 2 we found that, based on data from 2000, the UBC library contains approximately 4,000,000 volumes, which worked out to about 4 TB [Terabytes] of information.

Since the Lawrence and Giles study was released, other size estimates have emerged. For example, SocioSite from the University of Amsterdam states:
OK, I think we can all agree that the Web is BIG. Beyond that, since this module deals with Information Retrieval, we might ask "just what is it on the Web that people are retrieving?"

A variety of sites on the Web offer statistics on how people search the Web and what they search for. A really interesting one, produced by the Google search engine, is called the “Google Zeitgeist”. The summary page for 2001 is located at: http://www.google.com/press/zeitgeist2001.html. If you look at the “Top 20 Gaining Queries” it’s easy to see the impact of the events of September 11. Do you have any idea why “nostradamus” is the highest-rated query word? Take a look at the Timeline and click on the "nostradamus" link for September [this is a required link]. If you’d like to see the results for 2004, including an interactive Flash-based 2004 zeitgeist, go to: http://www.google.com/press/zeitgeist2004.html [optional link].

Additional information on web usage is available from a number of sites. Some of the most interesting are Nua: http://www.nua.com, Nielsen: http://www.nielsen-netratings.com/news.jsp, and Internet Traffic Report: http://www.internettrafficreport.com/main.htm [optional links].

Beyond this, have you ever wondered what people are searching for right now? There are a number of search sites that offer constantly updated lists of current search questions and terms. For examples, go to [but keep in mind these are unfiltered and can get a bit racy at times]
Don’t know about you, but I never cease to be both amazed and impressed [and sometimes just plain spooked] by the topics in which people are interested.

Link Rot
Aside from the challenges of using the actual search engines, there is another experience with which we're probably all too familiar. You use a search engine and seem to find a link to the "perfect" site for your information need. You eagerly click on it and up pops some variation of a "page not found" message. The horror, the horror... Welcome to the land of "link rot" [defined as the natural decay of web links as the sites they're connected to change or die]. Not finding what you want is bad enough when you're just doing some everyday surfing, but it becomes a BIG problem when the missing page has been cited in the footnotes of published scientific research.

Self-reflection Activity
Choose one item from the list and come up with a list of questions each object might supply.