Informedia: multi-modal information retrieval

Chapter 8 introduced the Informedia digital library of segments of digitized video and described some of the user interface concepts. Much of the research work of the project aims at indexing and searching video segments, with no assistance from human indexers or catalogers.

The multi-modal approach to information retrieval

The key word in Informedia is "multi-modal". Many of the techniques used, such as identifying changes of scene, use computer programs to analyze the video materials for clues. They analyze the video track, the sound track, the closed captioning if present, and any other information. Individually, the analysis of each mode gives imperfect information but combining the evidence from all can be surprisingly effective.

Informedia builds on a number of methods from artificial intelligence, such as speech recognition, natural language processing, and image recognition. Research in these fields has been carried out by separate research projects; Informedia brings them together to create something that is greater than the sum of its parts.

Adding material to the library

The first stage in adding new materials to the Informedia collection is to take the incoming video material and divide it into segments by topic. The computer program uses a variety of image and sound processing techniques to look for clues as to when one topic ends and another begins. For example, with materials from broadcast television, the gap intended for advertisements often coincides with a change of topic.

The next stage is to identify any text associated with the segment. This is obtained by speech recognition on the sound track, by identifying any captions within the video stream, and from closed captioning if present. Each of these inputs is prone to error.

The next phase is to process the raw text with a variety of tools from natural language processing to create an approximate index record that is loaded into the search system.

Speech recognition

The methods of information discovery discussed in Chapters 10 and 11 can be applied to audio material, such as audio tapes and the sound track of video, if the spoken word can be converted to computer text. This conversion proves to be a tough computing problem, but steady progress has been made over the years, helped by ever-increasing computer power.

Informedia has to tackle some of the hardest problems in speech recognition, including speaker independence, indistinct speech, noise, unlimited vocabulary, and accented speech. The computer program has to be independent of who is speaking.

The speech on the sound track may be indistinct, perhaps because of background noise or music. It may contain any word in the English language, including proper nouns, slang, and even foreign words. Under these circumstances, even a human listener misses some words. Informedia successfully recognizes about 50 to 80 percent of the words, depending on the characteristics of the specific video segment.

Searching the Informedia collection

To search the Informedia collection, a user provides a query either by typing or by speaking it aloud to be processed by the speech recognition system. Since there may be errors in the recognition of a spoken query and since the index is known to be built from inexact data, the information retrieval uses a ranking method that identifies the best apparent matches. The actual algorithm is based on the same research as the Lycos web search program and the index uses the same underlying retrieval system.
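The idea of ranking best apparent matches over inexact text can be illustrated with a minimal sketch. This is not the Informedia or Lycos algorithm, which the text does not describe; it assumes a simple weighting in which rarer query terms count for more, and the segment names and transcripts are invented for illustration.

```python
import math
from collections import Counter


def rank(query, transcripts):
    """Return (segment_id, score) pairs for matching segments, best first."""
    # Count the words in each transcript (which may contain recognition errors).
    docs = {seg: Counter(text.lower().split()) for seg, text in transcripts.items()}
    n = len(docs)
    scored = []
    for seg, counts in docs.items():
        score = 0.0
        for term in set(query.lower().split()):
            # Number of segments that contain this query term.
            df = sum(1 for c in docs.values() if term in c)
            if df == 0:
                continue
            # Terms found in fewer segments receive a higher weight.
            score += counts[term] * math.log(1 + n / df)
        if score > 0:
            scored.append((seg, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical transcripts; "whether" stands in for a misrecognized "weather".
transcripts = {
    "seg1": "the shuttle launch was delayed by whether",
    "seg2": "weather forecast rain and wind",
    "seg3": "stock markets closed higher today",
}

results = rank("shuttle launch weather", transcripts)
for seg, score in results:
    print(seg, round(score, 2))
```

Because the method ranks partial matches rather than demanding that every query word appear, the first segment is still retrieved, and ranked highest, even though one of its words was recognized incorrectly.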

The final example in this section looks at the problems of delivering real-time information, such as sound recordings, to users. Digitized sound recordings are an example of continuous streams of data, requiring a special method of dissemination, so that the data is presented to the user at the appropriate pace. Sound recordings are on the boundary of what can reasonably be transmitted over the Internet as it exists today. User interfaces have a choice between real-time transmission, usually of indifferent quality, and batch delivery, requiring the user to wait for higher quality sound to be transmitted more slowly. Panel 12.3 describes RealAudio, one way to disseminate low-quality sound recordings within the constraints of today's Internet.

Panel 12.3 RealAudio

One hour of digitized sound of CD quality requires 635 megabytes of storage if uncompressed. This poses problems for digital libraries. The storage requirements for any substantial collection are huge and transmission needs high-speed networks.

Uncompressed sound of this quality challenges even links that run at 155 megabits/second. Since most local area networks share Ethernets that run at less than a tenth of this speed and dial-up links are much slower, some form of compression is needed.

RealAudio is a method of compression and an associated protocol for transmitting digitized sound. In RealAudio format, one hour of sound requires about 5 megabytes of storage. Transmission uses a streaming protocol between the repository where the information is stored and a program running on the user's computer. When the user is ready, the repository transmits a steady sequence of sound packets. As they arrive at the user's computer, they are converted to sound and played by the computer. This is carried out at strict time intervals. There is no error checking. If a packet has not arrived when the time to play it is reached, it is ignored and the user hears a short gap in the sound.
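The playback rule described above, play each packet at its scheduled time and skip any packet that has not arrived, can be sketched as a small simulation. This is an illustration of the principle only, not the RealAudio protocol; the function name, the packet interval, and the arrival times are all hypothetical.

```python
def play_stream(arrivals, packet_count, interval_ms):
    """Simulate playback at strict time intervals with no error checking.

    arrivals maps a packet's sequence number to its arrival time in
    milliseconds; a packet missing from the mapping was lost in transit.
    Packet n is due to be played at n * interval_ms.
    """
    timeline = []
    for seq in range(packet_count):
        deadline = seq * interval_ms
        arrived = arrivals.get(seq)
        if arrived is not None and arrived <= deadline:
            timeline.append("play")
        else:
            # Late or lost: the packet is ignored and the listener hears a gap.
            timeline.append("gap")
    return timeline


# Hypothetical arrival times: packet 2 is late and packet 3 never arrives.
arrivals = {0: 0, 1: 90, 2: 350, 4: 380}
timeline = play_stream(arrivals, packet_count=5, interval_ms=100)
print(timeline)  # ['play', 'play', 'gap', 'gap', 'play']
```

The sketch never requests retransmission: waiting for a late packet would stall everything behind it, so a short gap is accepted in exchange for a steady pace.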

This process seems crude, but, if the network connection is reasonably clear, spoken sound in RealAudio is quite acceptable over dial-up lines at 28.8 thousand bits per second. An early experiment with RealAudio was to provide a collection of broadcast segments from the programs of National Public Radio.
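The figures quoted in this panel can be checked with a little arithmetic. The sketch below assumes the usual CD audio parameters, which the text does not state (44,100 samples per second, 2 bytes per sample, 2 channels), and treats "about 5 megabytes" as exactly 5,000,000 bytes.

```python
SECONDS_PER_HOUR = 3_600

# Assumed CD audio parameters (not given in the text).
SAMPLE_RATE_HZ = 44_100
BYTES_PER_SAMPLE = 2
CHANNELS = 2

# Figures taken from the panel.
REALAUDIO_BYTES_PER_HOUR = 5_000_000   # "about 5 megabytes", rounded
DIAL_UP_BITS_PER_SECOND = 28_800

# One hour of uncompressed CD-quality sound.
cd_bytes_per_hour = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * SECONDS_PER_HOUR

# Average data rate needed to stream one hour of RealAudio in real time.
realaudio_bits_per_second = REALAUDIO_BYTES_PER_HOUR * 8 / SECONDS_PER_HOUR

# How much smaller the RealAudio version is than the uncompressed sound.
compression_ratio = cd_bytes_per_hour / REALAUDIO_BYTES_PER_HOUR

print(f"{cd_bytes_per_hour / 1_000_000:.0f} megabytes per hour uncompressed")     # 635
print(f"{realaudio_bits_per_second / 1_000:.1f} thousand bits per second")        # 11.1
print(f"about {compression_ratio:.0f} to 1 compression")                          # 127
print(realaudio_bits_per_second < DIAL_UP_BITS_PER_SECOND)                        # True
```

Under these assumptions the compressed stream needs roughly 11 thousand bits per second, which leaves headroom on a 28.8 thousand bits per second dial-up line.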

The service uses completely standard web methods, except in two particulars, both of which are needed to transmit audio signals over the Internet in real time. The first is that the user's browser must accept a stream of audio data in RealAudio format. This requires adding a special player to the browser, which can be downloaded over the Internet. The second is that, to achieve a steady flow of data, the library sends data using the UDP protocol instead of TCP. Since some network security systems do not accept UDP data packets, RealAudio cannot be delivered everywhere.

Dynamic and complex objects

Many of the digital objects that are now being considered for digital library collections cannot be represented as static files of data.

• Dynamic objects. Dynamic or active library objects include computer programs, Java applets, simulations, data from scientific sensors, or video games. With these types of object, what is presented to the user depends upon the execution of computer programs or other external activities, so that the user gets different results every time the object is accessed.

• Complex objects. Library objects can be made up from many inter-related elements. These elements can have various relationships to each other. They can be complementary elements of content, such as the audio and picture channels of a video recording. They can be alternative manifestations, such as a high-resolution or low-resolution satellite image, or they can be surrogates, such as data and metadata. In practice these distinctions are often blurred. Is a thumbnail photograph an alternative manifestation, or is it metadata about a larger image?

• Alternate disseminations. Digital objects may offer the user a choice of access methods. Thus a library object might provide the weather conditions at San Francisco Airport. When the user accesses this object, the information returned might be data, such as the temperature, precipitation, wind speed and direction, and humidity, or it might be a photograph to show cloud cover. Notice that this information might be read directly from sensors, when requested, or from tables that are updated at regular intervals.

• Databases. A database comprises many alternative records, with different records selected each time the database is accessed. Some databases can be best thought of as complete digital library collections, with the individual records as digital objects within the collections. Other databases, such as directories, are library objects in their own right.
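One way to picture an object with alternate disseminations is as an identifier plus a set of named access methods, each backed by a program rather than a static file. The sketch below is only an illustration under that assumption; the class, the method names, the identifier, and the sample readings are all hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class LibraryObject:
    """A digital object that offers several named disseminations."""
    identifier: str
    disseminators: Dict[str, Callable[[], object]] = field(default_factory=dict)

    def methods(self):
        """List the access methods that this object offers."""
        return sorted(self.disseminators)

    def disseminate(self, method: str):
        """Run the program behind the chosen access method."""
        if method not in self.disseminators:
            raise KeyError(f"{self.identifier} has no dissemination {method!r}")
        return self.disseminators[method]()


def read_conditions():
    # Placeholder values; a real object might query sensors when requested
    # or read a table that is updated at regular intervals.
    return {"temperature_c": 15.0, "wind_speed_kmh": 20.0, "humidity_percent": 70}


def cloud_photograph():
    # Placeholder for image data showing cloud cover.
    return b"placeholder image bytes"


weather = LibraryObject(
    identifier="example:sfo-weather",
    disseminators={"conditions": read_conditions, "photograph": cloud_photograph},
)

print(weather.methods())                                    # ['conditions', 'photograph']
print(weather.disseminate("conditions")["temperature_c"])   # 15.0
```

Because each dissemination is computed when it is requested, the same identifier can return different results on different occasions, which is what separates such objects from static files.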

The methods for managing these more general objects are still subjects for debate. Whereas the web provides a unifying framework that most people use for static files, there is no widely accepted framework for general objects. Even the terminology is rife with dispute. A set of conventions that relate the intellectual view of library materials to the internal structure is sometimes called a "document model", but, since it applies to all aspects of digital libraries, "object model" seems a better term.

Identification

The first stage in building an object model is to have a method to identify the materials. The identifier is used to refer to objects in catalogs and citations, to store and access them, to provide access management, and to archive them for the long term. This sounds simple, but identifiers must meet requirements that overlap and frequently contradict each other. Few topics in digital libraries cause as much heated discussion as names and identifiers.

One controversy is whether semantic information should be embedded in a name. Some people advocate completely semantic names. An example is the Serial Item and Contribution Identifier standard (SICI). By a precisely defined set of rules, a SICI identifies either an issue of a serial or an article contained within a serial. It is possible to derive the SICI directly from a journal article or citation. This is a daunting objective and the SICI succeeds only because there is already a standard for identifying serial titles uniquely. The following is a typical SICI; it identifies a journal article published by John Wiley & Sons:

0002-8231(199601)47:1<23:TDOMII>2.0.TX;2-2
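To show how much information such a name carries, the sketch below splits the example into its parts with a regular expression. The pattern is written to fit this one example, not to validate SICIs in general, and the field names and their interpretation in the comments are assumptions based on the usual description of the standard's segments.

```python
import re

# Pattern fitted to the example above; not a full SICI validator.
SICI_PATTERN = re.compile(
    r"^(?P<issn>\d{4}-\d{3}[\dX])"      # serial title identifier (ISSN)
    r"\((?P<chronology>[^)]*)\)"        # date of the issue, here 199601
    r"(?P<enumeration>[^<]*)"           # volume and number, here 47:1
    r"<(?P<location>[^:>]*)"            # location within the issue, here 23
    r":(?P<title_code>[^>]*)>"          # code derived from the article title
    r"(?P<control>\d\.\d\.[A-Z]{2})"    # control codes, here 2.0.TX
    r";(?P<version>\d)"                 # version of the standard
    r"-(?P<check>\S)$"                  # check character
)


def parse_sici(sici):
    """Return the named parts of a SICI as a dictionary, or None if no match."""
    match = SICI_PATTERN.match(sici)
    return match.groupdict() if match else None


parts = parse_sici("0002-8231(199601)47:1<23:TDOMII>2.0.TX;2-2")
for name, value in parts.items():
    print(f"{name:12} {value}")
```

Every field is derived from the item itself or from an existing standard, which is why the name can be reconstructed from a citation and also why it is so long.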

Fully semantic names, such as SICIs, are inevitably restricted to narrow classes of information; they tend to be long and ugly because of the complexity of the rules that are used to generate them. Because of the difficulties in creating semantic identifiers for more general classes of objects, compounded by arguments over trademarks and other names, some people advocate the opposite: random identifiers that contain no semantic information about who assigned the name and what it references. Random strings used as identifiers can be shorter, but without any embedded information they are hard for people to remember and may be difficult for computers to process.

In practice, many names are mnemonic; they contain information that makes them easy to remember. Consider the name "www.apple.com". At first glance this appears to be a semantic name, the web site of a commercial organization called Apple, but this is just an informed guess. The prefix "www" is conventionally used for web sites, but this is merely a convention. There are several commercial organizations called Apple and the name gives no hint whether this web site is managed by Apple Computer or some other company.

Another difficulty is to decide what a name refers to: work, expression, manifestation, or item. As an example, consider the International Standard Book Number (ISBN). This was developed by publishers and the book trade for their own use. Therefore ISBNs distinguish separate products that can be bought or sold; a hardback book will usually have a different ISBN from a paperback version, even if the contents are identical. Libraries, however, may find this distinction to be unhelpful. For bibliographic purposes, the natural distinction is between versions where the content differs, not the format. For managing a collection or in a rare book library, each individual copy is distinct and needs its own identifier. There is no universal approach to naming that satisfies every need.
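The different levels at which a name can apply can be sketched as a small data structure. This is an illustration only: it omits the expression level for brevity, the class names are chosen for this sketch, and every identifier shown is a hypothetical placeholder, not a real ISBN or shelf mark.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    """One physical copy, as a collection manager or rare book library sees it."""
    copy_id: str


@dataclass
class Manifestation:
    """One product that can be bought or sold; the level an ISBN names."""
    isbn: str
    binding: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Work:
    """The content itself; the level that matters for bibliographic purposes."""
    content_id: str
    manifestations: List[Manifestation] = field(default_factory=list)

    def isbns(self):
        return [m.isbn for m in self.manifestations]

    def copies(self):
        return [i.copy_id for m in self.manifestations for i in m.items]


# Placeholder identifiers: identical content in two bindings, three copies.
book = Work(
    content_id="work-placeholder-1",
    manifestations=[
        Manifestation("isbn-placeholder-hardback", "hardback",
                      [Item("copy-1"), Item("copy-2")]),
        Manifestation("isbn-placeholder-paperback", "paperback",
                      [Item("copy-3")]),
    ],
)

print(len(book.isbns()))    # 2: the book trade sees two products
print(len(book.copies()))   # 3: a collection manager sees three copies
```

The same content thus carries one, two, or three identifiers depending on who is counting, which is the difficulty the paragraph above describes.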
