Information Retrieval

After completing the course, students should be able to:

CO1: Understand the concept of information seeking.

CO5: Design and implement innovative features in search engines.

CO6: Analyze various real-world applications of information retrieval.

An information retrieval (IR) system can be defined as a software program that deals with the organization, storage, retrieval and evaluation of information from document repositories, especially textual information.

Information retrieval is the activity of finding material, usually documents of an unstructured nature such as text, that satisfies an information need. Information retrieval also supports users in browsing or filtering a collection of documents, or in further processing a set of retrieved documents. It creates an essential functionality of the IR process as it is the first step in IR and helps in efficient information retrieval.

Evaluation in information retrieval is the process of systematically determining the merit, value and significance of a subject using certain criteria governed by a set of standards. The main topics of Information Retrieval (IR) are document and query indexing, query evaluation, and system evaluation.

Major challenges in IR

Features of an IR system

Stop words are high-frequency words that are considered unlikely to be useful for searching. According to Zipf's law, a stoplist covering a few dozen words reduces the size of the inverted index by almost half. On the other hand, eliminating stop words can sometimes remove a term that is useful for searching.

For example, if we remove the letter “A” from “Vitamin A”, the phrase loses its meaning. Stemming, a simplified form of morphological analysis, is the heuristic process of extracting the base form of words by chopping off their endings. For example, the words laughing, laughs and laughed are all reduced to the root word laugh.
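The two preprocessing steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stoplist, the suffix list, and the function names `remove_stop_words` and `naive_stem` are hypothetical, and real systems typically use a more careful stemmer such as Porter's.

```python
# Hypothetical stoplist and suffix list, chosen only for illustration.
STOP_WORDS = {"a", "an", "the", "of", "is", "in", "and", "to"}
SUFFIXES = ("ing", "ed", "s")  # checked in this order


def remove_stop_words(tokens):
    """Drop tokens that appear in the stoplist."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(word):
    """Chop the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


tokens = "vitamin a is found in the liver".split()
filtered = remove_stop_words(tokens)
stems = [naive_stem(w) for w in ["laughing", "laughs", "laughed", "laugh"]]

print(filtered)  # the "a" of "vitamin a" is removed along with the other stop words
print(stems)     # all four word forms collapse to the same stem
```

Note how the stoplist removes the "a" in "vitamin a", which is exactly the risk mentioned above.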

Components of an IR model

Representation: This consists of indexing with free-text terms or a controlled vocabulary, using manual as well as automatic techniques. Example: an abstract contains a summary, and a bibliographic description contains the author, title, sources, data and metadata.

IR system block diagram

The system helps users find the information they are looking for, but does not explicitly return answers to questions. It informs the user of the existence and location of documents that may contain the requested information. From the diagram above it is clear that a user who needs information has to formulate a request in the form of a query in natural language.

Then the IR system will respond by retrieving the relevant output, in the form of documents, about the required information.

Boolean retrieval

The Boolean model searches an inverted index to determine whether a document is relevant or not.

The inverted index can be created for this corpus as
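The corpus referred to above is not reproduced here, so the following sketch uses a hypothetical three-document corpus instead. It builds an inverted index (term to set of document ids) and evaluates Boolean queries as set operations; the names `corpus` and `build_inverted_index` are assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical three-document corpus; the original corpus is not shown.
corpus = {
    1: "information retrieval deals with documents",
    2: "boolean retrieval uses an inverted index",
    3: "an inverted index maps terms to documents",
}


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


index = build_inverted_index(corpus)
all_ids = set(corpus)

# Boolean operators become set operations on the posting sets.
and_result = sorted(index["inverted"] & index["documents"])  # inverted AND documents
or_result = sorted(index["retrieval"] | index["index"])      # retrieval OR index
not_result = sorted(all_ids - index["retrieval"])            # NOT retrieval

print(and_result, or_result, not_result)
```

Each document either matches the query or does not; the model produces no ranking among the matches.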

Information versus Data Retrieval

Text Categorization

Instead of manually classifying documents or manually designing automatic classification rules, statistical text categorization uses machine learning methods to learn automatic classification rules from human-labeled training documents. A standard method for computing the value of a feature x(i) for a given document d is called the bag-of-words approach. In the simplest form, x(i) = TF(i,d)⋅IDF(i), where TF(i,d) (term frequency) is the number of times term i appears in document d.

IDF(i) = log(N / n_i) is the inverse document frequency, where N is the total number of documents in a collection, and n_i is the number of documents containing term i. To make documents of different lengths comparable, each feature vector x is typically normalized to Euclidean length 1, by dividing each feature value by the Euclidean length ||x|| = √(x⋅x) of the original vector.
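As a minimal sketch of the weighting and normalization just described, the following Python code computes TF(i,d)⋅IDF(i) for each term of a document and scales the result to Euclidean length 1. The toy collection `docs` and the function names are hypothetical; the natural logarithm is assumed for IDF, since the base is not specified.

```python
import math
from collections import Counter

# Hypothetical tokenized collection, for illustration only.
docs = [
    ["information", "retrieval", "system"],
    ["text", "categorization", "system"],
    ["information", "information", "need"],
]


def idf(term, collection):
    """IDF(i) = log(N / n_i); assumes the term occurs in at least one document."""
    n_i = sum(1 for d in collection if term in d)
    return math.log(len(collection) / n_i)


def tfidf_vector(doc, collection):
    """Return a unit-length sparse vector {term: TF(i,d) * IDF(i)}."""
    tf = Counter(doc)
    x = {t: tf[t] * idf(t, collection) for t in tf}
    norm = math.sqrt(sum(v * v for v in x.values()))
    # If every term occurs in all documents the norm is 0; leave the vector unscaled.
    return {t: v / norm for t, v in x.items()} if norm > 0 else x


vec = tfidf_vector(docs[2], docs)
length = math.sqrt(sum(v * v for v in vec.values()))

print(vec)
print(length)  # Euclidean length after normalization
```

In this toy collection the rarer term "need" ends up with a higher weight than "information", even though "information" occurs twice in the document.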

IR processes and fields

Vector Model

The inverse document frequency considers the i-th term across all documents in the collection. The final weight for the i-th term in the j-th document is a simple multiplication of the term frequency by the inverse document frequency. Since a document or query contains only a subset of all the distinct terms in the collection, the term frequency is zero for a large number of terms, which means that a sparse vector representation is needed.

It can be shown that ranking by cosine similarity is equivalent to ranking by Euclidean distance when the vectors are normalized to unit length, because for unit vectors a and b, ||a − b||² = 2 − 2 cos(a, b). The idea of pivoted normalization is to make documents shorter than an empirical value (the pivot length) less relevant and longer documents more relevant. (Figure: pivoted normalization.) A major issue that is not taken into account in the VSM is synonymy: there is no semantic relationship between terms, as it is captured neither by the term frequency nor by the inverse document frequency.
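The relationship between cosine similarity and Euclidean distance mentioned above can be checked numerically. For unit-length vectors a and b, the squared distance satisfies ||a − b||² = 2 − 2 cos(a, b), so a higher cosine similarity always corresponds to a smaller distance. The vectors below are hypothetical term weights chosen for illustration.

```python
import math


def normalize(v):
    """Scale a vector to Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean(a, b):
    """Euclidean distance between two dense vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical term-weight vectors, normalized to unit length.
a = normalize([1.0, 2.0, 0.0])
b = normalize([2.0, 1.0, 1.0])

lhs = euclidean(a, b) ** 2
rhs = 2 - 2 * cosine(a, b)
print(lhs, rhs)  # the two values agree up to floating-point rounding
```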

The vector space model (VSM) is a widely used information retrieval model that represents documents as vectors in a high-dimensional space, where each dimension corresponds to a term in a vocabulary. VSM is based on the assumption that the meaning of a document can be inferred from the distribution of its terms and that documents with similar content will have a similar distribution of terms. The term-document matrix contains the frequency of each term in each document, or some variant thereof (e.g., the term frequency weighted by the inverse document frequency, TF-IDF).

The query is also preprocessed and represented as a vector in the same space as the documents. VSM has many advantages, such as its simplicity, efficiency and ability to process large collections of documents. Its limitations, such as the missing semantic relationship between terms, can be addressed by using more sophisticated models, such as probabilistic models or neural models that take into account the semantic relationships between words and documents.
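To make the idea of a query living in the same space as the documents concrete, here is a small sketch that ranks documents by cosine similarity to a query. It is not a full implementation: the documents, the query, and the helper names are hypothetical, and raw term frequencies are used as weights for brevity where TF-IDF weights could be substituted.

```python
import math
from collections import Counter

# Hypothetical preprocessed documents and query.
documents = {
    "d1": ["vector", "space", "model"],
    "d2": ["probabilistic", "model", "ranking"],
    "d3": ["vector", "ranking", "vector"],
}
query = ["vector", "model"]


def to_vector(tokens):
    """Sparse raw term-frequency vector; only non-zero entries are stored."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


q_vec = to_vector(query)
scores = {doc_id: cosine(q_vec, to_vector(toks)) for doc_id, toks in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(ranking)  # document ids ordered from most to least similar to the query
```

Storing only the non-zero entries in a dictionary is one simple form of the sparse representation the model calls for.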

Access to vast amounts of information: Web information retrieval (WIR) provides access to a vast amount of information available on the Internet, making it a valuable resource for research, decision-making, and entertainment.

Quality of information: The quality of information retrieved by WIR can vary widely, with some sources being unreliable, outdated or biased.

Search overload: Due to the sheer amount of information available on the Internet, WIR can be overwhelming, leading to information overload and difficulty finding the most relevant information.

Probabilistic Model and Latent Semantic Indexing Model