
4 Metadata, Citation, and Search

In CLARIN, metadata is usually represented in a component metadata infrastructure (CMDI).4 The underlying technology of CMDI is XML Schema (components, profiles), XML (instances), and REST (component registry). CMDI addresses the problem of various specialized metadata standards used for specific purposes by different research communities. Instead of introducing yet another standard, CMDI

1http://de.clarin.eu.

2http://ec.europa.eu/research/index.cfm?pg=newsalert&lg=en&year=2012&na=na-290212-1.

3http://ec.europa.eu/research/infrastructures/index_en.cfm?pg=eric.

4https://www.clarin.eu/content/component-metadata.

Fig. 1 Components, profiles, and component registry

aims at describing and reusing, and (when used in combination with ISOcat5) interpreting and supporting the integration of existing metadata standards. CMDI components act as basic building blocks that define groups of field definitions. These components can be combined into profiles that define the syntax and semantics of a certain class of resources and act as blueprints for metadata instances describing items of this class. These components are managed in a component registry, which allows users to archive and share existing components, thus enabling their reuse (see Fig. 1). Through this approach, CMDI supports the free definition and usage of metadata standards dedicated to specific use cases. As long as metadata is stored in XML, CMDI is able to “embrace” other standards. By combining the data itself with semantic information stored in the ISOcat data-category registry, CMDI forms a solid basis for using sophisticated exploration and search algorithms.

Metadata is the backbone of the infrastructure and is publicly available in CLARIN from the resource centers (cf. Boehlke et al. 2012) via the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH).6 The openness of metadata is important to CLARIN since it guarantees high visibility of the provided resources in the research community.

OAI-PMH is a well-established standard and is supported by numerous repository systems such as DSpace7 and Fedora.8 The OAI-PMH protocol is based on REST and XML and provides two things: full access to the metadata provided by the resource centers, and selective harvesting of metadata (see Fig. 2) for search portals like the Virtual Language Observatory (VLO). The VLO enables users to perform a faceted search on the metadata harvested from the repositories of all CLARIN centers. By using the information stored in the ISOcat data-category registry (cf. Kemps-Snijders et al. 2008) and the CMDI profiles (see Fig. 3) associated with the CMDI metadata instances, the VLO maps the information stored in these instances onto a predefined set of facets (see Fig. 4). The VLO also supports the extraction and usage of additional, CLARIN/CMDI-specific metadata
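The selective-harvesting loop can be sketched in a few lines of Python; the endpoint URL passed to `harvest` and the `metadataPrefix` value are illustrative placeholders, not the configuration of an actual CLARIN center:

```python
from urllib.parse import urlencode
import urllib.request
import xml.etree.ElementTree as ET

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

def parse_list_records(xml_bytes):
    """Return (records, resumption_token) from one ListRecords response."""
    root = ET.fromstring(xml_bytes)
    records = root.findall(f".//{OAI_NS}record")
    token_el = root.find(f".//{OAI_NS}resumptionToken")
    token = (token_el.text or "").strip() if token_el is not None else ""
    return records, token or None

def harvest(base_url, metadata_prefix="cmdi", oai_set=None):
    """Yield all records, following OAI-PMH resumption tokens page by page."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if oai_set:
        params["set"] = oai_set  # selective harvesting by set
    while True:
        with urllib.request.urlopen(base_url + "?" + urlencode(params)) as resp:
            records, token = parse_list_records(resp.read())
        yield from records
        if token is None:
            return  # no more pages
        params = {"verb": "ListRecords", "resumptionToken": token}
```

A portal like the VLO would then extract the CMDI payload from each harvested record and feed it into its facet mapping.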

5http://www.isocat.org/.

6http://www.openarchives.org/pmh/.

7http://www.dspace.org/.

8http://fedora-commons.org/.

Fig. 2 OAI-PMH harvesting

Fig. 3 Metadata records, profiles, and ISOcat. Source: https://www.clarin.eu/sites/default/files/styles/openscience_3col/public/cmdi-overview.png

such as ResourceProxy (e.g., link to download, dedicated search portal) and federated content search (FCS) interfaces.

CLARIN also provides support for content-based search. The CLARIN-D FCS9 is based on Search/Retrieve via URL (SRU) and the Contextual Query Language (CQL) and allows users to perform a CLARIN-wide search over all repositories that offer an FCS interface by using a simple Web application. This Web application and external applications send a request to an aggregator service. This service first queries a repository registry and searches for compatible interfaces. The initial query is then

9https://www.clarin.eu/content/federated-content-search.

Fig. 4 VLO

sent to all of these interfaces, and the individual results are aggregated and sent back to the user or application (see Figs. 5 and 6). Since CLARIN is designed as an open infrastructure, third-party content providers may easily plug their own repository and FCS interface into this process by registering it with the CLARIN repository registry.
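The aggregator's fan-out step might be sketched as follows; the `fetch` callback, the endpoint list, and the simple list merge are assumptions for illustration, not the actual CLARIN aggregator code:

```python
from concurrent.futures import ThreadPoolExecutor

def aggregate(query, endpoints, fetch):
    """Send the same CQL query to every registered FCS endpoint and merge
    the per-endpoint hit lists, tagging each hit with its origin."""
    def ask(endpoint):
        try:
            return [(endpoint, hit) for hit in fetch(endpoint, query)]
        except Exception:
            return []  # an unreachable endpoint must not break the search
    with ThreadPoolExecutor(max_workers=8) as pool:
        results = pool.map(ask, endpoints)  # query all endpoints in parallel
    return [hit for sub in results for hit in sub]
```

In the real infrastructure, `fetch` would issue an SRU `searchRetrieve` request and parse the XML response; here it is left pluggable so the aggregation logic stands on its own.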

Web services in CLARIN are also described via CMDI (which may very well contain a link to a WSDL file). If more specific metadata is provided (i.e., the information enforced by a certain CMDI profile is given), these Web services can be used in a workflow system called WebLicht (cf. Hinrichs et al. 2010). WebLicht allows users to build and execute chains of Web services by analyzing the metadata available for each service and ensuring that the format of the data is compatible; that is, that the output of a predecessor service satisfies the specification of a successor service.

Fig. 5 Federated content search. Source: http://www.clarin.eu/sites/default/files/FCS_components.png

Fig. 6 CLARIN-D FCS Web application

Table 1 Example input specification for a POS tagger Web service

Format: MyFormat
Input: text=UTF-8, language=German, tokens=present

When thinking about interchanging natural language processing (NLP) data like text, there are several established standards defining how texts can be encoded and how annotations like POS tags may be added. These standardization efforts are supported by WebLicht; hence the following is a possible interface definition of a Web service compatible with WebLicht:

• the format used is TCF (or TEI10 P5, etc.);

• the document contains German text and is annotated with POS tags;

• the POS tags are encoded according to the STTS11 tagset.

A complete interface definition of a WebLicht Web service consists of two identically structured specifications for input and output. Each of these specifications defines the format of a document that is used to represent the data. Additionally, each specification contains a set of parameter-type pairs: in the input specification, these are mandatory to invoke the service; in the output specification, they are computed and added by the service.

Each of these parameter types is bound to a standard definition, which ties it to a standardized encoding of the information.
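This two-part structure can be expressed as a small data model; the class and field names below are illustrative, not WebLicht's actual schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Specification:
    """A document format plus a set of parameter-type/standard pairs."""
    format: str            # e.g. "TCF" or "TEI P5"
    parameters: frozenset  # pairs such as ("tokens", "present")

# The input specification of Table 1, expressed in this model:
pos_tagger_input = Specification(
    format="MyFormat",
    parameters=frozenset({("text", "UTF-8"),
                          ("language", "German"),
                          ("tokens", "present")}),
)
```

An interface definition would then simply be a pair of such objects, one for input and one for output.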

Tables 1 and 2 give example input and output specifications of a POS tagger Web service. This service consumes documents that contain German text that was split

10An organization which maintains a format for digital text representation. See http://www.tei-c.org/index.xml.

11Stuttgart-Tübingen Tagset. See http://www.sfb441.uni-tuebingen.de/a5/codii/info-stts-en.xhtml.

Table 2 Example output specification for a POS tagger Web service

Format: MyFormat
Output: POS tags=STTS

into tokens encoded in an imaginary format. It produces a document of the same format by adding POS tags based on the STTS tagset.

The chaining algorithm of WebLicht (cf. Boehlke 2010) is based on the idea that NLP services usually consume a document of a well-defined standard and will also return such a document. The successful invocation of a service for an input document hence depends on which information is available in that document. A POS tagger Web service may only work if sufficient information on sentence and token boundaries is available, while a named entity recognizer (NER) requires appropriate POS tags. Therefore, the standard used for the input document needs to allow for a representation of this kind of information, and, of course, this information needs to be present in the input document itself. This fact is also represented in the interface definition. Thus, for service chaining to work, it must be ensured that this information is available by using a type checker at each step of the chain.

This check can be done when building the chain, since all the necessary information is already available. Based on a formal Web service description according to the proposed structure, a chaining algorithm, which is basically a type checker, can be implemented. A service can be executed if the previous services in the chain meet the following constraints:

• the format specified in the output is equal to the format specified in the input specification of the service;

• every parameter-type/standard pair defined in the input specification needs to be one of the pairs in the output specifications of services that have already been executed (or, at build time, are scheduled for execution earlier in the chain).

These two constraints are of course a simplification. But in many simple cases, an algorithm like this will be sufficient. A short and simplified example of the chaining logic is given in Figs. 7 and 8, which show part of a chain consisting of Web services A (a tokenizer) and B (a POS tagger). In Fig. 7, Service A can be executed since all constraints defined in its input specification are met. The format of the input document is compatible and its content fulfills the requirements because it contains German text encoded in UTF-8. The tokenizer segments the text into sentences and tokens. After its execution, this information is added to the resulting output document. Service B is checked against this updated knowledge about the content of the output document of Service A (see current metadata in Fig. 8). Service B is compatible since all of its input requirements, format and parameters, are available in the output document of Service A.
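A minimal version of this type check, using the tokenizer/POS tagger pair as data, might look as follows. The dictionary-based service descriptions are an illustrative assumption, not WebLicht's actual representation, and for simplicity the sketch assumes every service preserves the document format:

```python
def can_execute(service_input, available_format, available_params):
    """A service is executable iff the current document format matches its
    input format and every required parameter-type/standard pair is present."""
    return (service_input["format"] == available_format
            and service_input["params"] <= available_params)

def check_chain(initial_format, initial_params, services):
    """Walk the chain, accumulating the parameters each service adds."""
    fmt, params = initial_format, set(initial_params)
    for svc in services:
        if not can_execute(svc["input"], fmt, params):
            return False
        params |= svc["output"]["params"]  # annotations added by the service
    return True

# Tokenizer (A) and POS tagger (B), roughly as in Figs. 7 and 8:
tokenizer = {
    "input":  {"format": "MyFormat", "params": {("text", "UTF-8"),
                                                ("language", "German")}},
    "output": {"params": {("sentences", "present"), ("tokens", "present")}},
}
pos_tagger = {
    "input":  {"format": "MyFormat", "params": {("language", "German"),
                                                ("tokens", "present")}},
    "output": {"params": {("POS tags", "STTS")}},
}
```

Running the checker on the chain A → B succeeds, while the reversed chain B → A fails because the POS tagger's token requirement is not yet satisfied.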

Fig. 7 Tokenizer service specification

Fig. 8 POS tagger service specification