Text Information Retrieval Systems

Preface

The purpose of this book is not to teach readers how to become searchers, but to teach people who search how the systems they use work. While this is not primarily a book on how to search, we have included a chapter on search strategy.

Introduction

The User Sequence

Interpreting or expressing the need - The next step for the user is to express the need for information. The doctor seeking details of treatment information may use the same terms as the layman - the name of the disease - and for the same reason, namely that it is not clear whether or not anything else is required.

The Database Producer Sequence

Decide on the design of the individual pages—what information should be displayed where, in what format. Selection of the new items for inclusion and review—What changes should be made, and how often.

System Design and Functioning

Deciding on the scope of the site - about which aspects of an individual's life to collect information. It will have different meanings on websites depending on the expertise of the website administrator and owners.

Why the Process Is Not Perfect

Each record would have an established set of information elements describing the stock or part. The meaning of the names of the data elements present will be well established, at least among the database user community.

Information Specialists

Differences between these groups are often based on experience with different aspects of computer use or information content, but the very notion of experience is difficult to operationalize (Meadow et al., 1994).

Subject Specialist End Users

Non-Subject Specialist End Users

Even this cannot be done completely because many design problems depend on understanding user behavior and on our ability to train users in skillful use of the systems. Here are some of the problems outstanding in the design of IRS, or in its use.

Design

The aim should be to make the best possible use of the user's own perceptions and understanding of what is being covered. Question management programs: These programs interpret what the user says, whether in natural language or one of the many artificial language approaches.

Understanding User Behavior

The ideal seems to be to let the language be most relevant to the user and let the program determine what additional information it needs to search effectively. Presentation (display) of information to the user — There are technical and behavioral problems with display: what to display, with what symbols, how they are arranged on the screen and how they are related to each other.

Traditional Information Retrieval Methods

Unfortunately, the amount of effort involved is directly proportional to the number of items in the collection, and today's collections are often very large. In the immediate aftermath of the French Revolution, the new government confiscated and cataloged many book collections.

Pre-Computer IR Systems

The document numbers are listed in 10 columns according to the lowest digit of the number. Optical coincidence system: this method inverts the logic of the Uniterm system by having a card for each document and recording on it the numbers corresponding to the index terms that relate to the document.

Special Purpose Computer Systems

Each uses the algebra of sets developed by George Boole (1958) to create sets of items during a search, and they are clear predecessors of the automated retrieval systems that followed. We will briefly discuss examples of such machines before returning to the use of the general purpose computer, the evolution of which soon overcame the problems Bagley identified.

General Purpose Computer Systems

Bar-Hillel (1959) soon proposed a selection based on the ratio of the frequency of words in a document to the frequency of words in the language as a whole. The most common modern term selection method uses the occurrence of a term in a document and the number of documents in the file to which the term has been assigned to identify significant discriminating terms.
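The weighting idea described above can be sketched in a few lines. This is only an illustration under stated assumptions: the formula tf * log(N / df), the whitespace tokenizer, and the sample documents are not taken from the text, which names the two ingredients (occurrences in a document, number of documents containing the term) but no specific formula.

```python
import math
from collections import Counter

def term_weights(documents):
    """Return one {term: weight} dict per document.

    Assumed formula: weight = tf * log(N / df), where tf is the number of
    occurrences of the term in the document, N the number of documents,
    and df the number of documents that contain the term.
    """
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    weights = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / doc_freq[t]) for t in tf})
    return weights

# Hypothetical three-document collection.
docs = ["the cat sat", "the dog barked", "the cat purred"]
w = term_weights(docs)
```

Under this assumed formula a word that occurs in every document (here "the") receives weight zero, while a word confined to one document receives the highest weight, which is the discriminating behavior the passage describes.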

Online Database Services

A term's importance can be gauged by the number of documents in the collection that contain that term. In the next section, we will discuss the development of the most significant piece of such software.

The World Wide Web

Such relatively inexpensive communication technology allowed for widespread use and led to the birth of the Internet. These links are like citations embedded in the text, but the difference is that the hypertext reader can "open" the cited work immediately, at any time.

Data, Information, and Knowledge

Data
Information
News
Knowledge
Intelligence
Meaning
Wisdom
Relevance and Value

The card can also tell in which part of the library the book is located. It is also true that the nature of the action will be determined by the receiver's level of belief in the information or understanding of its meaning.

Representation of Information

Natural Language
Restricted Natural Language
Artificial Language
Codes, Measures, and Descriptors
Mathematical Models of Text
Discriminating Power
Identification of Similarity
Descriptiveness
Ambiguity
Conciseness
Hierarchical Codes
Measurements
Nominal Descriptors
Inflected Language
Full Text
Explicit Pointers and Links

In the first round of decision making, the information that users of the database are likely to want should be considered first. While user training in the use of the language is required, the chance of usage errors can be minimized.

Attribute Content and Values

Numbers

Real numbers, as the term is used in computing, consist of a set of numeric digits with one and only one decimal point and a sign. They are commonly represented in computer storage as two numbers plus a sign: a mantissa between 0 and 1 and an exponent understood as applied to 10.

Character Strings: Names

This gives only two in common out of four or five, but if we recognize the similarity of I and Y, the count goes to four out of five. TSCHAIKOWSKY and CZAJKOWSKI (a transliteration from Russian and a Polish spelling, both pronounced nearly identically) share 4 of 11 or 9 digrams, but if we equate Y with I and I with J, which look different but are pronounced similarly, the count goes to 7 of 11 or 9.
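The digram counts quoted for TSCHAIKOWSKY and CZAJKOWSKI can be checked with a short sketch. The function names and the `equivalents` parameter are illustrative assumptions; the text specifies only the idea of counting shared letter pairs and of treating Y, I, and J as equivalent.

```python
def digrams(name):
    """Return the set of adjacent letter pairs in a name."""
    return {name[i:i + 2] for i in range(len(name) - 1)}

def shared_digrams(a, b, equivalents=None):
    """Count digrams common to both names.

    `equivalents` optionally maps letters onto a common representative,
    e.g. {"Y": "I", "J": "I"}, before the digrams are formed.
    """
    if equivalents:
        table = str.maketrans(equivalents)
        a, b = a.translate(table), b.translate(table)
    return len(digrams(a) & digrams(b))

plain = shared_digrams("TSCHAIKOWSKY", "CZAJKOWSKI")
merged = shared_digrams("TSCHAIKOWSKY", "CZAJKOWSKI", {"Y": "I", "J": "I"})
```

Here `plain` is 4 and `merged` is 7, out of 11 digrams in the first name and 9 in the second.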

Other Character Strings

We conclude with an examination of some of the problems of ambiguity in attribute value systems and methods of controlling it. Whether the color applied is the color of the sand in any real desert is not the point.

Hierarchical Classification

Both of these examples illustrate that true precision is not achieved by simply writing exact values, and that overly precise values may not be what is required in the search. So, as with most IR, the issue is not whether the classification is "good" or "bad", but how well users understand what it means and what its limits are.

Network Relationships

Station A, on the other hand, may similarly be subordinate to another station X, but other network configurations may allow A1 to talk directly to A2, or to B, a member of another branch of the hierarchy. . In these cases, we use a structure called a grid, although other forms are also grids—an unfortunate multiple meaning of the term.

Class Membership: Binary, Probabilistic, or Fuzzy

With the exception of fraudulent credentials, there is no doubt about the degree earned, which is a different matter from what the person knows, although the former is often used as a measure of the latter. This is usually done in libraries that use a cut-down version of the classification code to inform users where they can find books in different classes.

Transformation of Words by Stemming

If a word ends in a consonant other than S, followed by the letter S, delete the final S. If the word ends in ES, drop the final S. If the word ends in ION, remove the ION unless the rest of the word has two or fewer letters.
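The three rules quoted above can be sketched as a small function. Two points are assumptions rather than anything the text states: at most one rule is applied, in the order given, and every letter other than A, E, I, O, U counts as a consonant.

```python
VOWELS = set("AEIOU")

def stem(word):
    """Apply the first matching suffix rule (assumed order: S, ES, ION)."""
    w = word.upper()
    # Rule 1: consonant other than S, followed by S -> delete the final S.
    if len(w) >= 2 and w.endswith("S") and w[-2] not in VOWELS and w[-2] != "S":
        return w[:-1]
    # Rule 2: word ends in ES -> drop the final S.
    if w.endswith("ES"):
        return w[:-1]
    # Rule 3: word ends in ION -> remove ION unless two or fewer letters remain.
    if w.endswith("ION") and len(w) - 3 > 2:
        return w[:-3]
    return w

examples = {w: stem(w) for w in ["CATS", "HORSES", "GLASS", "ACTION", "LION"]}
```

With these assumptions CATS becomes CAT, HORSES becomes HORSE, and ACTION becomes ACT, while GLASS and LION are left unchanged.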

Sound-Based Transformation of Words

In general, the more rules of the types illustrated, the greater the chance that a word is stemmed correctly. The code is not unique, but the stored memory may be sufficient to correct this defect.

Transformation of Words by Meaning

DATABASE and RETRIEVAL are related terms in the context of COMPUTING, but RETRIEVAL also has other, unrelated meanings. We might have an entry like "MICROCOMPUTER: use PERSONAL COMPUTER", which simply indicates that the former, although often used in natural language, is not an approved term in the database.

Transformation of Graphics

The minutiae will vary, even on prints of the same person taken at different times and under different circumstances. Characteristics of Fingerprints: Some common general shapes are: tented arch, loop, and whorl.

Transformation of Sound

In the sentence "The dog wagged its tail", the meaning of "its" is clear, and its use here is grammatically correct: it refers to the preceding noun. These, the more complete information plus the intelligence that can be built into a computer program, make it less and less important for the seeker to rely on formal index terms.

Elements of Control

Controlling a vocabulary means limiting the number of possible values that can be used for attributes. It is not common for a database to be changed retroactively every time its thesaurus is changed.

Dissemination of Controlled Vocabularies

It is possible to try to control this problem by using the precision device known as the role indicator. The assumption is that there is an identifiable set of views that will apply to all dictionary terms, and that indexers can apply them consistently.

Models of Virtual Data Structure

Scalar Variables and Constants
Vector Variables
Structures
Arrays
Tuples
Relations
Text
Linear Sequential Model
Relational Model
Hierarchical and Network Models
Hypertext
Spreadsheet Files

SELECT—The SELECT operator creates a subset of the tuples of a relation that includes only those that meet the criteria stated in the command. CROSS - The CROSS operator allows extending a relation by adding information from another relation to it.
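A minimal sketch of the two operators follows, with relations modeled as lists of dictionaries. The sample relations and attribute names are hypothetical, and CROSS is interpreted here as a Cartesian product, which is one reading of "extending a relation by adding information from another relation"; the text itself does not define it further.

```python
def select(relation, predicate):
    """SELECT: keep only the tuples that satisfy the stated criterion."""
    return [row for row in relation if predicate(row)]

def cross(left, right):
    """CROSS (assumed Cartesian product): pair every left tuple with every right tuple."""
    return [{**l, **r} for l in left for r in right]

# Hypothetical relations with distinct attribute names.
parts = [
    {"part_no": 1, "part_name": "bolt", "qty": 120},
    {"part_no": 2, "part_name": "nut", "qty": 15},
]
bins = [{"bin": "A"}, {"bin": "B"}]

low_stock = select(parts, lambda row: row["qty"] < 50)
combined = cross(parts, bins)
```

A practical system would usually follow CROSS with a SELECT on matching attribute values, so that only meaningful pairings survive.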

The Physical Structure of Data

Basic Structures

Variable-length record: each field or attribute is of fixed length, but there can be a variable number of occurrences of the structure Transactions. The number of bytes for each field is given explicitly in the table and the number of transactions is given (no_transactions).
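The layout described above, fixed-length fields plus a count of repeating transactions, might be sketched with Python's `struct` module. The field choices (an account number, a date, an amount) and their byte widths are assumptions made for illustration; only the count field `no_transactions` comes from the text.

```python
import struct

# Assumed fixed part: 4-byte account number, 2-byte no_transactions.
HEADER = struct.Struct("<IH")
# Assumed repeating part: 4-byte date (yyyymmdd), 4-byte signed amount in cents.
TRANSACTION = struct.Struct("<Ii")

def pack_record(account, transactions):
    """Serialize one record: header, then each fixed-length transaction."""
    data = HEADER.pack(account, len(transactions))
    for date, amount in transactions:
        data += TRANSACTION.pack(date, amount)
    return data

def unpack_record(data):
    """Read the header, then exactly no_transactions transaction entries."""
    account, no_transactions = HEADER.unpack_from(data, 0)
    offset = HEADER.size
    transactions = []
    for _ in range(no_transactions):
        transactions.append(TRANSACTION.unpack_from(data, offset))
        offset += TRANSACTION.size
    return account, transactions

raw = pack_record(1001, [(20240105, 2500), (20240212, -400)])
account, transactions = unpack_record(raw)
```

Because every field has a known width, the reader needs only the count to know where the record ends: here 6 header bytes plus 8 bytes per transaction.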

Space-Time and Transaction Rate

The Order of Records

Finding Records

Each of the methods described below has variations in how it is implemented by a given computer operating system. It can be difficult to learn exactly how a favorite retrieval or database system organizes records, and it may not matter until a file becomes very large and has a high level of activity.

Sequential Files

If the middle record is not the one required, then only the remaining half of the file needs to be searched further (Fig. 6.5), because the program would know whether the key of the desired record was higher or lower than the key of the record tested. If the desired record key is lower than the center key (512), look further only in the first half of the file.
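The halving procedure described above is a binary search. The sketch below assumes a file of 1,023 records with keys 1 through 1023, chosen only so that the center key is 512 as in the text; the probe counter is added for illustration.

```python
def binary_search(keys, target):
    """Search a sorted list of keys; return (position or None, probes used)."""
    low, high = 0, len(keys) - 1
    probes = 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        if keys[mid] == target:
            return mid, probes
        if target < keys[mid]:
            high = mid - 1      # look further only in the first half
        else:
            low = mid + 1       # look further only in the second half
    return None, probes

keys = list(range(1, 1024))     # assumed file: keys 1..1023, center key 512
center = binary_search(keys, 512)
lower = binary_search(keys, 100)
missing = binary_search(keys, 2000)
```

For a file of this size no search needs more than 10 probes, compared with an average of about 512 for a sequential scan.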

Index-File Structures

If the inverted file can be held in RAM, finding a main file entry requires only one disk access. Adding a new record to the main file also requires adding one to each index file.

Lists

This new record should point to the record with key value 324, so the old first record pointer (1) is placed (d) in the pointer of record 2. However, one of the attributes is a pointer to the record with the next higher key value.
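The pointer handling can be sketched as a list in which records stay in arrival order and each carries a pointer to the record with the next higher key. The class and attribute names are hypothetical, and positions are counted from 0 here rather than from 1. The example adds a record with key 324 and then one with key 287, which must logically precede it.

```python
class ChainedFile:
    """Records kept in arrival order, linked in key order by 'next' pointers."""

    def __init__(self):
        self.records = []   # physical storage, in order of arrival
        self.head = None    # position of the record with the lowest key

    def add(self, key):
        new_pos = len(self.records)
        # Walk the chain to find the first record whose key is not lower.
        prev, cur = None, self.head
        while cur is not None and self.records[cur]["key"] < key:
            prev, cur = cur, self.records[cur]["next"]
        # The new record points to the record with the next higher key.
        self.records.append({"key": key, "next": cur})
        if prev is None:
            self.head = new_pos                    # new logical first record
        else:
            self.records[prev]["next"] = new_pos   # splice into the chain
        return new_pos

    def keys_in_order(self):
        out, cur = [], self.head
        while cur is not None:
            out.append(self.records[cur]["key"])
            cur = self.records[cur]["next"]
        return out

f = ChainedFile()
f.add(324)
f.add(287)
ordered = f.keys_in_order()
```

No record moves physically: adding key 287 only changes the first-record pointer and sets the new record's own pointer to the position of key 324.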

Trees

The advantage of this method is fast searching compared to both sequential file searching and conventional list searching. Adding another record to a list: now, add a second record with key 287, which is less than the first record's key of 324, so logically it should precede that record.

Direct-Access Structures

Each points to the first relevant record in one of the other files that contains specific information about the patient. In a personnel file for which an index must be created on the attribute, ssn, we would expect to include each ssn and the number of the record in which it appeared in this index.

Phrase Parsing

Dialog Corporation, with one of the largest online file collections in the world, uses only AN, AND, BY, FOR,. If the index entry consists of the complete set of subject headings used in a record, as a unit, together with the record number.

Word Parsing

Word and Phrase Parsing

Nested Indexes

The index can be searched faster because more index records than main-file records can be brought from disk storage into RAM at once; hence we can find the desired key more quickly. If the file is to be searchable on any one of n keys (e.g., a sort key and any of the n-1 other attributes found in the record), then an index is needed to avoid searching the entire main file each time.

Direct Structure with Chains

An effective way to do this is to determine how many index records can fit in RAM at one time. This means that we only need two disk accesses to find any record: one for the portion of the main index, which the in-RAM index points to, and one for the record itself.

Indexed Sequential Access Method

Random order means that the location of the record is not determined by the value of an attribute in the record. The index gives the highest key value of the records in its sector and the address of the first of any excess records.

Querying the Information Retrieval System

Sets and Subsets

It is a list of the identifying accession numbers of the retrieved records that match the query or component statement. The user is usually informed only of the set number assigned by the IRS and the number of records in the set.

Relational Statements

Until recently, the design of most text-oriented search systems assumed frequent revision of query phrases and required the IRS to create a set for each query, or even for each attribute value specification or logical combination within the query. Each such set represented a subset of the database, but the word set is a general usage.

Boolean Query Logic

Contains string (chemical name CONTAINS ETHYL): a record will be in the set if ETHYL occurs anywhere in its chemical name. The fact that the term JAPANESE appears in the text of a record is not an indication that the record is in the Japanese language.

Ranked and Fuzzy Sets

It is a clearly defined set; each record of the database is clearly in the set or not in it. A record's score will be calculated by adding 1 for each time any of the words appears in the record.
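The scoring rule just stated, one point for each occurrence of any query word, can be sketched as follows. The whitespace tokenizer, the tie-breaking by record number, and the sample records are assumptions added for illustration.

```python
def score(record_text, query_words):
    """Add 1 for each occurrence of any query word in the record."""
    wanted = {w.lower() for w in query_words}
    return sum(1 for token in record_text.lower().split() if token in wanted)

def rank(records, query_words):
    """Return (score, record_id) pairs for matching records, best first."""
    scored = [(score(text, query_words), rec_id) for rec_id, text in records.items()]
    hits = [pair for pair in scored if pair[0] > 0]
    # Highest score first; ties broken by record number (an assumption).
    return sorted(hits, key=lambda pair: (-pair[0], pair[1]))

# Hypothetical records.
records = {
    1: "online retrieval of text",
    2: "text retrieval and text indexing",
    3: "fingerprint classification",
}
ranking = rank(records, ["text", "retrieval"])
```

Unlike the clearly defined Boolean set, the result here is ordered: record 2 scores 3, record 1 scores 2, and record 3 is left out with a score of 0.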

Similarity Measures

Connect to an IRS

Connecting to the Locally Subsidized IRS – It is common practice in libraries today to subscribe to services that provide access to various databases and search software for use by their patrons at no direct cost. Connection to the Remotely Subsidized IRS - Primary examples of this are databases and the IRS supported by the federal government of the United States.

Select a Database

Search the Inverted File or Thesaurus

Look for the inverted file or thesaurus within the locally subsidized IRS - Providers of these systems will normally provide access to a thesaurus if available and allow access, sometimes from a pull-down menu, to certain parts of the inverted file, usually the author name and journal title, but sometimes keywords, descriptors, and other fields. The user has requested to see information about terms related to TESTING (by pressing EXPAND E3).

Create a Subset of the Database

Search an inverted file or thesaurus using an online search engine - most online search engines do not support a thesaurus, nor will the inverted view of the file be available for users to search. The terms are searched in the inverted index, and the conjunction or disjunction (depending on the mechanism) of the retrieved document numbers is formed and sorted by some algorithm.
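The lookup-and-combine step can be sketched with a small inverted index. The function names, the sample documents, and the use of ascending document number as the sort order are assumptions; a real search engine would apply its own ranking algorithm.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each term to the set of document numbers that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, terms, mode="and"):
    """Combine the postings of all terms by conjunction ("and") or disjunction ("or")."""
    postings = [index.get(term.lower(), set()) for term in terms]
    if not postings:
        return []
    combine = set.intersection if mode == "and" else set.union
    # Sorting by document number stands in for the engine's own ordering.
    return sorted(combine(*postings))

# Hypothetical documents.
documents = {
    1: "boolean query logic",
    2: "fuzzy query ranking",
    3: "boolean algebra",
}
index = build_inverted_index(documents)
both = search(index, ["boolean", "query"], mode="and")
either = search(index, ["boolean", "query"], mode="or")
```

The conjunction returns only document 1, while the disjunction returns all three documents.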

Search for Strings

The purpose here is not to truncate, but to indicate that any single letter or character is acceptable in the specified position. Searching for strings within the local CD-ROM IRS is usually the same as one of the traditional methods.
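A single-character wildcard of this kind can be sketched by translating the pattern into a regular expression. The choice of "?" as the wildcard symbol is an assumption; the text does not say which symbol any particular system uses.

```python
import re

def wildcard_to_regex(pattern, wildcard="?"):
    """Turn a pattern with single-character wildcards into a compiled regex."""
    parts = ["." if ch == wildcard else re.escape(ch) for ch in pattern]
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, term):
    """True if the whole term fits the pattern, one character per wildcard."""
    return wildcard_to_regex(pattern).fullmatch(term) is not None

hits = [t for t in ["WOMAN", "WOMEN", "WOMN", "WOMEEN"] if matches("WOM?N", t)]
```

Because each wildcard stands for exactly one character, the pattern accepts WOMAN and WOMEN but neither the shorter WOMN nor the longer WOMEEN, which is what separates it from truncation.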

Analyze a Set

Searching for strings within the locally subsidized IRS also usually follows the traditional methods. A phrase function, where a complete multi-word phrase can be specified, is the most common of these.

Sort, Display, and Format Records

Handle the Unstructured Record

Download

There is some debate as to whether databases in the United States are subject to copyright law. In any case, World Wide Web sites and the full texts of documents found in databases are protected.

Order Documents

Copyright law, and in particular the question of which exceptions fall under the fair use doctrine, is subject to different interpretations resolved only by case law, and varies from country to country. Downloading or printing a single copy for personal use is very likely fair use, but beyond that the boundaries are less clear.

Save, Recall, and Edit Searches

The normal World Wide Web search engine will provide an entry box for a single set of search terms or a Boolean statement. An entered search statement can be edited to change its form and resubmitted, but previous sets are not retained.

Current Awareness Search

Some sites, such as the Custom News Service of the National Science Foundation, allow the user to create a web page that will automatically collect new foundation publications that match a profile entered on the service-provided profile settings page and then display their citations. These citations are hyperlinks to the actual publications maintained on the site.

Cost Summary

Narrowing the search down to the point where it retrieves only the number of records the user will take time to read calls for focus and skill from the user. A relatively brief appearance on the scene was made by PointCast, a company that provided an Internet push service.

Terminate a Session

There is even a movement to start charging based on the number of terms in a query, regardless of whether they retrieve anything or not, on the grounds that the service being sold is the search itself, not merely the delivery of results. In addition, the cost of publishing material online and the potential loss of revenue from print versions will likely increase the prices of what we have had for free until now.

Interpretation and Execution of Query Statements

Parsing Command Language

The parser will start on the left side of the statement and look for a substring that constitutes a valid command in the language. What follows is not a set number or a parenthetical expression, so it is treated as a search term, even though it was intended as part of the command.

Parsing Natural Language