A Text Retrieval System Using Latent Semantic Analysis

Accepted and approved in partial fulfillment of the requirements for the degree of Bachelor of Science in Computer Science. Mathematical and Computer Science Unit, Department of Physical Sciences and Mathematics, University of the Philippines, Manila.

The recombination of the truncated matrices forms the basis for calculating the distance of each document from a query vector, which is obtained by treating the query as a pseudo-document.

INTRODUCTION

Background of the Study

Current information retrieval systems make use of keyword matching, which relies heavily on exact matches between keywords and words in document titles and/or document summaries [1]. Typically, a result set is created by checking whether a document contains certain keywords or not [2]. In addition, search engines usually return documents that contain one or more keywords, even if these documents are not relevant to the user.

Basing search results on keywords alone results in poor relevance, so desired documents may not appear at the top of the list [3]. To improve document retrieval, there is a need for a system that analyzes the latent semantic structure in the relationships between terms and documents. This process is known as Latent Semantic Analysis (LSA) with the truncated Singular Value Decomposition (SVD) as the model.

This is a purely mathematical approach for finding the relationships between terms and documents in a given collection [2, 4], in which documents are represented as vectors in a multidimensional term space [1, 5-7].
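The vector representation described above can be illustrated with a minimal sketch that builds a term-by-document matrix of raw counts. The class name `TermDocumentMatrix`, the simple tokenizer, and the two sample titles are assumptions made for illustration; they are not taken from the system described in this study.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: builds a term-by-document count matrix.
final class TermDocumentMatrix {
    /** Splits a text into lowercase alphanumeric tokens. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) tokens.add(t);
        }
        return tokens;
    }

    /** Collects the alphabetically sorted vocabulary of the collection. */
    static List<String> vocabulary(List<String> docs) {
        TreeSet<String> terms = new TreeSet<>();
        for (String d : docs) terms.addAll(tokenize(d));
        return new ArrayList<>(terms);
    }

    /** Builds a matrix a where a[i][j] is the count of term i in document j. */
    static double[][] build(List<String> vocab, List<String> docs) {
        double[][] a = new double[vocab.size()][docs.size()];
        for (int j = 0; j < docs.size(); j++) {
            for (String t : tokenize(docs.get(j))) {
                int i = vocab.indexOf(t);
                if (i >= 0) a[i][j] += 1.0;
            }
        }
        return a;
    }

    public static void main(String[] args) {
        // Sample titles are illustrative only.
        List<String> docs = Arrays.asList(
            "latent semantic analysis of text",
            "text retrieval by keyword matching");
        List<String> vocab = vocabulary(docs);
        double[][] a = build(vocab, docs);
        for (int i = 0; i < vocab.size(); i++) {
            System.out.println(vocab.get(i) + " " + Arrays.toString(a[i]));
        }
    }
}
```

Each column of the resulting matrix is one document expressed as a vector in term space, and each row shows how one term is distributed over the collection.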

Statement of the Problem

Keyword matching fails to retrieve relevant material that does not contain the words in users' queries [1]. A further, more technical factor that causes keyword matching in texts to fail is the use of models such as the Boolean, standard vector, and probabilistic models, which treat words as if they were independent [7] when in fact they are not [1, 7].

Objectives of Dealing with the Proposed Problem

Significance of the Study

Scope and Limitations

REVIEW OF RELATED LITERATURE

The results showed that the average difference in accuracy between LSI and the term-matching method was 0.06, an improvement of 13% over keyword matching. LSI also showed a 26% improvement in accuracy over presenting articles in the order received. The results indicated that user preferences for articles tend to cluster based on the semantic similarities between articles.

When a customer adds an item to his or her shopping cart, a certain number (based on a threshold) of the spatially nearest neighbors of the newly added item are suggested to the customer as items that he or she may find helpful. An empirical study of its effectiveness showed that 8% of items added to shopping carts were LSI recommendations, which resulted in a 7% increase in revenue. Results indicated that LSA was at least as valid and sensitive as traditional measures.

While latent semantic analysis using singular value decomposition is a strictly mathematical approach [2] and does not use prior linguistic or perceptual knowledge of a particular vocabulary [4], some preparatory work that may be considered before document indexing, such as stemming, is very specific to language.

CONCEPTUAL THEORETICAL FRAMEWORK

Other applications of Latent Semantic Analysis are automatic cross-linguistic information retrieval and information filtering [9]. Words in a given document may be considered more meaningful than others based on a number of criteria, which may include the frequency of a word. Unlike stemming, term weighting is not language-specific and does not require any linguistic knowledge to calculate the weights.
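As an illustration of frequency-based weighting, the sketch below applies a common tf-idf scheme to a term-by-document count matrix. The source does not state which weighting function was used, so the tf-idf formula and the class name `TermWeighting` are assumptions made for this example only.

```java
import java.util.Arrays;

// Hypothetical helper: tf-idf is assumed here; the study's actual scheme is not stated.
final class TermWeighting {
    /** Returns a weighted copy of a term-by-document count matrix. */
    static double[][] tfIdf(double[][] counts) {
        int terms = counts.length;
        int docs = terms == 0 ? 0 : counts[0].length;
        double[][] w = new double[terms][docs];
        for (int i = 0; i < terms; i++) {
            // Document frequency: number of documents containing term i.
            int df = 0;
            for (int j = 0; j < docs; j++) {
                if (counts[i][j] > 0) df++;
            }
            // Terms found in every document get weight 0; rare terms get more.
            double idf = df == 0 ? 0.0 : Math.log((double) docs / df);
            for (int j = 0; j < docs; j++) {
                w[i][j] = counts[i][j] * idf;
            }
        }
        return w;
    }

    public static void main(String[] args) {
        double[][] counts = {{1, 1}, {2, 0}};
        System.out.println(Arrays.deepToString(tfIdf(counts)));
    }
}
```

The calculation uses only counts, which is why a scheme of this kind needs no knowledge of the language of the documents.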

The process is not critical to LSI, but is a potentially good mechanism to improve retrieval accuracy.

Keyword Matching

The conventional keyword matching approach is based on the idea that a document either contains a given word or does not. Documents that do not contain the query keywords are ignored, and the rest are ranked and returned to the user. Relevant documents can be missed because of the difference between the words that searchers use and the words with which the information they seek is indexed [2, 3].

An "R" in the column labeled REL (relevant) indicates that the user would have judged the document relevant to the query. Terms that occur in both the query and a document are indicated by an asterisk in the corresponding cell; an "M" in the MATCH column indicates that the document matches the query based on their keywords and would have been returned to the user. In this example, the semantic conditional term "audio-based" in the query is not found in the index.
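The matching behavior described in this section can be sketched as follows. The class name `KeywordMatcher`, the ranking by number of shared terms, and the sample documents are assumptions chosen for illustration; they do not reproduce the example table referred to above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper illustrating plain keyword matching.
final class KeywordMatcher {
    /** Splits a text into a set of lowercase alphanumeric terms. */
    static Set<String> terms(String text) {
        Set<String> result = new HashSet<>();
        for (String t : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) result.add(t);
        }
        return result;
    }

    /**
     * Returns the indices of documents that share at least one term with
     * the query, ordered by decreasing number of shared terms.
     */
    static List<Integer> match(String query, List<String> docs) {
        Set<String> queryTerms = terms(query);
        int[] shared = new int[docs.size()];
        List<Integer> hits = new ArrayList<>();
        for (int j = 0; j < docs.size(); j++) {
            Set<String> overlap = terms(docs.get(j));
            overlap.retainAll(queryTerms);
            shared[j] = overlap.size();
            if (shared[j] > 0) hits.add(j);
        }
        hits.sort((x, y) -> Integer.compare(shared[y], shared[x]));
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "car engine repair",
            "automobile motor maintenance",
            "engine oil");
        // The second document is never matched: it shares no word with the query.
        System.out.println(match("car engine", docs));
    }
}
```

In this sample, the second document is about the same topic as the query but uses different words, so keyword matching never returns it. This is the vocabulary mismatch that Latent Semantic Analysis is intended to address.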

Latent Semantic Analysis

This method is based on the idea that the observed term-document association data is unreliable and must be treated as a statistical problem. The assumption is that some latent semantic structure exists in the data, which is partially obscured by the diversity of word usage in terms of retrieval [7]. In their search for a suitable model, Deerwester et al. limited their consideration to proximity models such as hierarchical, partitioned, and overlapping clusters; ultrametric and additive trees; and factor analytic and multidimensional models.

In contrast, previous factor analyses failed to represent both terms and documents in the same space. In their article, Deerwester et al. proposed a two-mode analysis method based on a set of criteria. SVD keeps as much information as possible about the relative distances between the document vectors, while compressing them into a smaller number of dimensions.

The noise from the original term-to-document matrix is minimized, revealing similarities that were hidden in the document collection [1, 2].
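The truncation described above can be written compactly. The block below states the standard truncated SVD; the symbols ($A$ for the $m \times n$ term-by-document matrix, $r$ for its rank, $k$ for the number of retained dimensions) are notation assumed here and are not taken from the original text.

```latex
% Full SVD of the term-by-document matrix A (m terms, n documents, rank r)
A = U \Sigma V^{T}, \qquad
\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_r), \qquad
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0

% Rank-k truncation: keep only the k largest singular values (k < r)
A_k = U_k \Sigma_k V_k^{T} = \sum_{i=1}^{k} \sigma_i \, u_i v_i^{T}
```

Among all matrices of rank at most $k$, $A_k$ is the closest to $A$ in the Frobenius norm, which is the sense in which the truncation discards the least information.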

Fundamental Concepts in Linear Algebra

The main actor of the system is the user, who connects to the system and its database.

Part of the Term-by-Document Array from the Matlab Command Window, A Text Retrieval System Using Latent Semantic Analysis

Matrix Partitioning by LSA Using Truncated SVD, A Text Retrieval System Using Latent Semantic Analysis

If the system does not encounter any problems, the user is informed that the new document title was added successfully, as shown in Figure 15. Likewise, if no problems are encountered while editing, the user is informed that the document title was edited successfully, as shown in Figure 20.

The user is informed that the document title was deleted successfully, as shown in Figure 23.

Definition of Terms

DESIGN AND IMPLEMENTATION

These indices are used to represent the document as a vector. Otherwise, the distance of the query vector from each column in the updated matrix is calculated. If a problem is encountered, the system informs the user of the failure with dialogs similar to Figures 16 and 17.
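The distance calculation mentioned above can be sketched as a ranking of matrix columns against a query vector. Cosine similarity is assumed here as the measure, since the text does not name one, and the class name `CosineRanker` and the sample values are hypothetical. The matrix passed in stands for the recombined (truncated) term-by-document matrix, which is assumed to have been computed elsewhere.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: ranks documents (columns) by cosine similarity to a query.
final class CosineRanker {
    /** Cosine of the angle between two vectors; 0 if either has zero length. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0, na = 0.0, nb = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        if (na == 0.0 || nb == 0.0) return 0.0;
        return dot / Math.sqrt(na * nb);
    }

    /**
     * Returns document indices ordered by decreasing similarity to the query.
     * matrix[i][j] is the entry for term i in document j; query[i] is the
     * entry for term i in the query, treated as a pseudo-document.
     */
    static List<Integer> rank(double[][] matrix, double[] query) {
        int docs = matrix.length == 0 ? 0 : matrix[0].length;
        double[] score = new double[docs];
        List<Integer> order = new ArrayList<>();
        for (int j = 0; j < docs; j++) {
            double[] column = new double[matrix.length];
            for (int i = 0; i < matrix.length; i++) column[i] = matrix[i][j];
            score[j] = cosine(column, query);
            order.add(j);
        }
        order.sort((x, y) -> Double.compare(score[y], score[x]));
        return order;
    }

    public static void main(String[] args) {
        // Illustrative values only: 3 terms (rows) by 3 documents (columns).
        double[][] matrix = {
            {1.0, 0.5, 0.0},
            {0.0, 0.5, 0.0},
            {0.0, 0.0, 1.0}};
        double[] query = {1.0, 0.0, 0.0};
        System.out.println(rank(matrix, query));
    }
}
```

Because the columns come from the truncated matrix rather than the raw counts, a document can score above zero even when it shares no literal term with the query.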

If the user clicks Delete and the system does not encounter any problems, the user is informed about the successful deletion of the document title. Applications that use singular value decomposition for latent semantic analysis were found to typically occupy about 10% of the total number of documents in a document collection, as in the works of Dumais et al. This may be attributed to the small size of the document collection, which may not adequately sample a typical document collection.

Judgments of importance may also have played a vital role in the outcome of the above numbers.

Landauer, Richard Harshman, "Indexing by Latent Semantic Analysis", Journal of the Society for Information Science, 41(6), pp.

Figure 1. Context-free Diagram, Text Retrieval System Using Latent Semantic Analysis

TECHNICAL ARCHITECTURE

RESULTS

It allows the user to submit a query using Search, add a document title using Insert Document, edit a document title using Edit Document, and delete a document using Delete Document. After typing the query, the user can start the search by clicking Go or cancel the search by selecting Stop. An empty query causes a dialog to appear telling the user to type a query, as shown in Figure 8.

The user is prompted to enter a document title and has the option to cancel adding a new document title. Adding a new document title, as shown in Figure 12, results in a new document record that has the generated file name and the user-entered document title as fields. A database transaction error causes a failure to add the new document title.

If the user chooses to edit a document title, all the documents in the document table are displayed, as shown in Figure 18. When a document is selected, as shown in Figure 21, a document record is shown with the document file name and the document title selected by the user.

When the OK button is pressed, the user is prompted to confirm the delete action as shown in Figure 22. Selecting No returns the user to the main menu, while selecting Yes closes the program.

Figure 7. Interface for Search, A Text Retrieval System Using Latent Semantic Analysis

DISCUSSION

Our study produced a different result, with precision almost perfect at the lower levels of recall. It can be expected that collections containing tens of thousands of documents will be extremely difficult to manage, since computing the singular value decomposition of a very large matrix can be very expensive. This is why the technique is hardly used on the Internet, where search engines index millions of documents and the collections need to be updated from time to time.
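The precision and recall measures referred to above can be computed from a ranked result list and a set of relevance judgments. The sketch below is a minimal illustration; the class name `RetrievalMetrics`, the cutoff parameter, and the sample data are assumptions and do not reproduce the figures obtained in this study.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: precision and recall over the top-ranked results.
final class RetrievalMetrics {
    /** Counts relevant documents among the first 'cutoff' ranked results. */
    static int hits(List<Integer> ranked, Set<Integer> relevant, int cutoff) {
        int n = Math.min(cutoff, ranked.size());
        int count = 0;
        for (int i = 0; i < n; i++) {
            if (relevant.contains(ranked.get(i))) count++;
        }
        return count;
    }

    /** Fraction of the examined results that are relevant. */
    static double precision(List<Integer> ranked, Set<Integer> relevant, int cutoff) {
        int n = Math.min(cutoff, ranked.size());
        return n <= 0 ? 0.0 : (double) hits(ranked, relevant, cutoff) / n;
    }

    /** Fraction of all relevant documents found in the examined results. */
    static double recall(List<Integer> ranked, Set<Integer> relevant, int cutoff) {
        return relevant.isEmpty()
            ? 0.0
            : (double) hits(ranked, relevant, cutoff) / relevant.size();
    }

    public static void main(String[] args) {
        // Illustrative data: document ids in ranked order, and the relevant set.
        List<Integer> ranked = Arrays.asList(0, 2, 5, 1);
        Set<Integer> relevant = new HashSet<>(Arrays.asList(0, 1, 2, 3));
        for (int cutoff = 1; cutoff <= ranked.size(); cutoff++) {
            System.out.println("cutoff " + cutoff
                + " precision " + precision(ranked, relevant, cutoff)
                + " recall " + recall(ranked, relevant, cutoff));
        }
    }
}
```

With this sample ranking, precision is highest at the small cutoffs, where recall is still low, and falls as more results are examined.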

It can be inferred that LSA is indeed a promising tool for uncovering the latent semantic structure to improve information retrieval.

CONCLUSION

RECOMMENDATIONS

BIBLIOGRAPHY

Dumais, “Using Latent Semantic Indexing (LSI) for Information Retrieval, Information Filtering, and Other Things,” Cognitive Technology Conference, April 1997.

APPENDICES

JOptionPane.showMessageDialog(this, "Type your query and then click Go to start your search", "query is missing", JOptionPane.INFORMATION_MESSAGE, null);

ACKNOWLEDGEMENT