1. Module Objectives

This module looks at the basic concepts of Information Retrieval.

Module Objectives:

By the end of this module, students will be able to:

  • Appreciate the capabilities and limitations of information retrieval systems
  • Understand the design and implementation of retrieval systems for text and other media
  • Evaluate the performance of an information retrieval system

Module Topics:

  • Introduction to Module
  • How Big is the WWW & How Is it Used?
  • Search Engines
    • Spider-Based Search Engines
    • Human-Based Search Engines
    • Hybrid Search Engines & Directories
  • Portals
  • Relevance
    • Relevance Ranking by Search Engines
  • The Semantic Web
  • The Invisible Web
  • Information Retrieval
    • Boolean Keyword Exact Match
    • Best Match Relevance

2. Introduction to Module

Have you ever wondered how web search tools can find the information you request so quickly, out of millions and millions of web pages? Have you ever wondered why it’s so easy to find some things on the WWW and all but impossible to find others? The answers to these questions fall into the domain of what’s called “Information Retrieval” [IR].


3. How Big is the WWW & How Is it Used?

Since this module deals with the retrieval of information, it raises an interesting question: just how much information is out there? Take a look at the “How Much Information” report from Berkeley: http://www.sims.berkeley.edu/research/projects/how-much-info-2003/

Questions:

How much information is produced worldwide each year?

How much information per woman, man, and child does this equate to?

What is 5 exabytes equivalent to?

Of all the information produced each year, how much is in print?

 

Let’s take it down a notch.  Instead of asking how much information exists in the world, let’s limit the question to just the World Wide Web.  In a well-known study published in Nature, Steve Lawrence and Lee Giles stated:

The publicly indexable web contains an estimated 800 million pages as of February 1999, encompassing about 15 terabytes of information or about 6 terabytes of text after removing HTML tags, comments, and extra white-space.

It's important to understand that even when the study was done in 1999, the actual size of the WWW was considerably larger than the quoted 15 TB. There is an enormous amount of information stored on the web that hasn't been indexed by search engines. This is often referred to as the "invisible web", and we'll learn more about it later in this module.

Still having trouble visualizing 15 TB? Maybe this will help. Back in Module 2 we found that, based on data from 2000, the UBC library contains approximately 4,000,000 volumes, which works out to about 4 TB [terabytes] of information.
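
To make these numbers a little more concrete, here is a quick back-of-the-envelope check in Python. It is only a sketch: the variable names are made up for illustration, and it assumes decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes), which the figures quoted above do not spell out.

```python
# Assumption: decimal units, not stated in the quoted figures.
TB = 10**12
KB = 10**3

pages = 800_000_000          # Lawrence and Giles estimate, February 1999
web_bytes = 15 * TB          # all publicly indexable content
text_bytes = 6 * TB          # text only, after removing HTML tags etc.
ubc_library_bytes = 4 * TB   # UBC library figure recalled from Module 2

# Average size of one indexable page, with and without markup
avg_page_kb = web_bytes / pages / KB
avg_text_kb = text_bytes / pages / KB

# How many UBC libraries would hold the same amount of information?
libraries = web_bytes / ubc_library_bytes

print(f"Average page size:      {avg_page_kb:.2f} KB")
print(f"Average text per page:  {avg_text_kb:.2f} KB")
print(f"Equivalent UBC libraries: {libraries:.2f}")
```

Under those assumptions, the average indexable page works out to just under 19 KB, and the 1999 indexable web comes to a little under four UBC libraries' worth of information.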

Since the Lawrence and Giles study was released, other size estimates have emerged. For example, SocioSite from the University of Amsterdam states:

The surface web contains some 2.5 billion documents, with a daily growth rate of 7.3 million pages. The average size of the surface pages varies from 10 kbytes per page to 20 kbytes per page. The total amount of information on the surface web varies between 25 to 50 terabytes (including html codes and pictures). The textual information is estimated at 10 to 20 terabytes. The 7.3 million new pages that are added to the surface web are responsible for the fact that each day 0.1 terabytes of new information (including html) is available.

To this we have to add the information that is made available by the deep web - online databases, dynamic web pages, intranet sites etc. It is estimated that the deep web contains 550 billion web related documents with an average size of 14 kbytes per page. The major part of this information - 95% - is public information. When all this information would be stored at one place, it would need 7,500 terabytes storage room. That is 150 times more storage room than would be necessary for the whole surface web (even if one starts from the highest estimation of 50 terabytes for the surface web). Because 56% of this information is actual content (excluding html), it is estimated that there are 4,200 terabytes of data. [http://www2.fmg.uva.nl/sociosite/websoc/demography.html]
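
The SocioSite figures can be cross-checked with a few lines of Python. This is a rough sketch rather than anything from the source: the helper `to_tb` and the variable names are hypothetical, and decimal units (1 TB = 10^12 bytes, 1 KB = 10^3 bytes) are assumed.

```python
# Assumption: decimal units; the quoted passage does not say which it uses.
TB = 10**12
KB = 10**3


def to_tb(documents, kb_per_document):
    """Total size in terabytes for a number of documents of a given average size."""
    return documents * kb_per_document * KB / TB


# Surface web: 2.5 billion documents at 10 to 20 KB each
surface_low_tb = to_tb(2_500_000_000, 10)
surface_high_tb = to_tb(2_500_000_000, 20)

# Daily growth: 7.3 million new pages at 10 to 20 KB each
daily_low_tb = to_tb(7_300_000, 10)
daily_high_tb = to_tb(7_300_000, 20)

# Deep web: 550 billion documents at 14 KB each
deep_tb = to_tb(550_000_000_000, 14)

# Figures derived from the quoted 7,500 TB
quoted_deep_tb = 7_500
ratio = quoted_deep_tb / 50            # deep web vs. highest surface estimate
content_tb = 0.56 * quoted_deep_tb     # share that is actual content

print(f"Surface web: {surface_low_tb:.0f} to {surface_high_tb:.0f} TB")
print(f"Daily growth: {daily_low_tb:.3f} to {daily_high_tb:.3f} TB")
print(f"Deep web (computed): {deep_tb:.0f} TB")
print(f"Deep/surface ratio: {ratio:.0f}x")
print(f"Deep web content: {content_tb:.0f} TB")
```

The surface-web range and the roughly 0.1 TB of daily growth match the quote. The computed deep-web total comes out near 7,700 TB, slightly above the 7,500 TB quoted, which suggests the source rounded its figure; the 150-fold ratio and the 4,200 TB of content both follow from the quoted 7,500 TB.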


OK, I think we can all agree that the Web is BIG. Beyond that, since this module deals with Information Retrieval, we might ask, "Just what is it on the Web that people are retrieving?" A variety of sites on the Web offer statistics on how people search the Web and what they search for. A really interesting one is produced by the Google search engine: it’s called the “Google Zeitgeist”. The summary page for 2001 is located at: http://www.google.com/press/zeitgeist2001.html. If you look at the “Top 20 Gaining Queries”, it’s easy to see the impact of the events of September 11. Do you have any idea why “nostradamus” is the highest-ranked query term? Take a look at the Timeline and click on the "nostradamus" link for September [this is a required link]. If you’d like to see the results for 2004, including an interactive Flash-based 2004 zeitgeist, go to: http://www.google.com/press/zeitgeist2004.html [optional link].

Additional information on web usage is available from a number of sites. Some of the most interesting are Nua: http://www.nua.com, Nielsen: http://www.nielsen-netratings.com/news.jsp, and Internet Traffic Report: http://www.internettrafficreport.com/main.htm [optional links].

Beyond this, have you ever wondered what people are searching for right now? A number of search sites offer constantly updated lists of current search queries and terms. For examples go to [but keep in mind these are unfiltered and can get a bit racy at times]

Don’t know about you, but I never cease to be both amazed and impressed [and sometimes just plain spooked] by the topics in which people are interested.

Link Rot

Aside from the challenges of using the actual search engines, there is another experience with which we're probably all too familiar. You use a search engine and seem to find a link to the "perfect" site for your information need. You eagerly click on it and up pops some variation of a "page not found" message. The horror, the horror... Welcome to the land of "link rot" [defined as the natural decay of web links as the sites they're connected to change or die]. Not finding what you want is bad enough when someone is doing some everyday surfing, but it becomes a BIG problem when the missing page has been used in the footnotes of published scientific research.
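
For the curious, here is one way a "page not found" check might be automated. It is a minimal sketch using only Python's standard library, not a description of how any search engine actually does it. The function names `classify_status` and `check_link`, and the labels they return, are made up for illustration; treating only 404 and 410 as "rotten" is a simplifying assumption.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Label an HTTP status code (the labels are our own convention)."""
    if 200 <= status < 400:
        return "ok"
    if status in (404, 410):   # Not Found, Gone
        return "rotten"
    return "uncertain"         # e.g. server errors, access denied


def check_link(url, timeout=10):
    """Request only the headers of a page and report whether the link works."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as error:
        # The server answered, but with an error code such as 404
        return classify_status(error.code)
    except OSError:
        # No answer at all: DNS failure, refused connection, timeout
        return "unreachable"


if __name__ == "__main__":
    # Status codes only; no network access is needed for this part.
    for code in (200, 404, 410, 503):
        print(code, classify_status(code))
```

Calling `check_link` on each URL in a reference list would give a rough first pass, but a page can also "rot" by changing its content while still returning a normal response, which a check like this cannot detect.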

Self-reflection Activity

Concept: What is information?

Pick one item and tell why it is or is not information. Be prepared to discuss your response.

  • audio recording of a political speech
  • music
  • interviews
  • fine art
  • photographs
  • film
  • video
  • tombstones
  • tools
  • weapons
  • inventions
  • uniforms
  • fashion
  • census data
  • maps
  • architectural drawings
  • cookbooks
  • advertisements
  • journals
  • souvenirs
  • ancestors' clothes
  • buildings
  • ancestors' papers
  • letters

Choose one item from the list and come up with a list of questions that the item might help answer.
