Initially, temporal expressions were treated as a kind of named entity, and their identification was part of the named entity recognition task. One such system was developed as part of the EVENT/TIMEX3 track of the 2012 i2b2 clinical temporal relations challenge: a simple rule-based system with limited normalization functionality that annotates documents using TIMEX2 tags.
CRF Based Models
Feature Engineering
For example, two-digit and four-digit numbers can stand for years, and when followed by an "s" they can stand for a decade. An example feature can be created by applying a pattern function over the word, producing word-level features like nonAlpha(I.B.M.). Such a pattern function might map all uppercase letters to "A", all lowercase letters to "a", all digits to "0", and leave punctuation unchanged, so that "I.B.M." becomes "A.A.A.".
Aggregate pattern features are a condensed form of the above pattern features in which consecutive repetitions of the same pattern character are collapsed into one. The surrounding words help narrow down the meaning of the word and therefore make the decision more deterministic. For example, "May" followed by a number is more likely to be a time expression than "May" followed by a non-numeric token.
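The pattern and aggregate pattern features described above can be sketched in a few lines. This is a minimal illustration, not the challenge systems' actual code; the function names `word_shape` and `aggregate_shape` are our own.

```python
import re

def word_shape(token):
    """Pattern feature: uppercase -> 'A', lowercase -> 'a',
    digits -> '0'; punctuation is left unchanged."""
    out = []
    for ch in token:
        if ch.isupper():
            out.append("A")
        elif ch.islower():
            out.append("a")
        elif ch.isdigit():
            out.append("0")
        else:
            out.append(ch)
    return "".join(out)

def aggregate_shape(token):
    """Aggregate pattern feature: collapse consecutive repeats
    of the same pattern character into a single occurrence."""
    return re.sub(r"(.)\1+", r"\1", word_shape(token))

print(word_shape("I.B.M."))      # -> A.A.A.
print(word_shape("1990s"))       # -> 0000a
print(aggregate_shape("1990s"))  # -> 0a
```

Both shapes would typically be emitted as string-valued features of the current token (and its neighbours) in a CRF feature dictionary.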
In the most basic sense, a distributed representation is one in which information is spread over a set of features, as opposed to a localized approach in which each feature is independent of the others. We will use distributed representations of words in our Neural Network Model for Temporal Expression Recognition.
Distributed representation for words
As in the above example, if we want the correlation between "home" and "house", a one-hot representation shows no correlation between the terms. Likewise, an algorithm that only saw "dog" during training would not be able to tag "cat" during testing. A distributed representation instead represents these words as lower-dimensional real-valued dense vectors, where each dimension represents a latent feature of the word.
For words that are rare in the labeled training corpus, the parameters estimated with a one-hot representation will be poor. Moreover, the model cannot handle words that are not in the corpus at all. The hope is that a distributed representation captures the semantic and syntactic properties of a word and assigns similar representations to syntactically and semantically related words.
For example, in the POS tagging example above, even if we have not seen "cat" in training, the distributed representation of "cat" will be close to that of "dog", so the algorithm will be able to assign "cat" the tag it learned for "dog".
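This generalization can be made concrete with cosine similarity over word vectors. The three-dimensional vectors below are invented purely for illustration; real embeddings are learned from a corpus and have hundreds of dimensions.

```python
import math

# Toy embeddings, invented for illustration only; real vectors
# would be learned from text (e.g. with word2vec).
emb = {
    "dog":   [0.90, 0.80, 0.10],
    "cat":   [0.85, 0.75, 0.15],
    "house": [0.10, 0.20, 0.90],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# "cat" lies much closer to "dog" than to "house", so a tagger that
# learned a label for "dog" can transfer it to the unseen word "cat".
print(cosine(emb["cat"], emb["dog"]) > cosine(emb["cat"], emb["house"]))
```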
Training distributed word representation
- Neural Probabilistic Language Model
- Contrastive criterion learning
- Continuous Bag of Words approach (CBOW)
- Continuous Skip-gram approach
The primary goal of the paper is to develop a language model that overcomes the curse of dimensionality. The motivation of the paper is to develop a unified architecture that can perform various NLP tasks such as POS tagging, parsing, named entity recognition, and semantic role labeling. The basis of multitask learning is learning a very good representation of words from which higher-level features can be extracted for the specific needs of each task. A detailed explanation of the unsupervised training used in Collobert et al. (2011) is first presented in Collobert and Weston (2008).
If the corrupted sample is not ranked sufficiently below the genuine one, the error is propagated back into the network to adjust the network parameters. In addition, the contrastive estimation method avoids having the output layer compute a probability distribution over all words in the dictionary. In the above equation, H × V is the dominant term owing to the size of the vocabulary V.
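The contrastive criterion can be sketched as a ranking hinge loss: the score of an observed text window should exceed the score of a corrupted window (with its centre word replaced by a random word) by a margin, and only margin violations back-propagate an error. The scoring network itself is omitted here; this is a sketch of the loss alone.

```python
def contrastive_loss(score_true, score_corrupt, margin=1.0):
    """Ranking criterion: we want score_true >= score_corrupt + margin.
    A positive loss triggers back-propagation; a zero loss means the
    pair is already separated and the example contributes no update."""
    return max(0.0, margin - score_true + score_corrupt)

print(contrastive_loss(2.0, 0.5))  # margin satisfied -> 0.0, no update
print(contrastive_loss(0.5, 0.2))  # margin violated -> 0.7, propagate error
```

Because the loss compares only two scores, it sidesteps the expensive normalization over the full vocabulary that dominates the language-model complexity.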
A log-linear classifier is used to predict the current word given a window of w words from the past and future of the word in question. Since context words are chosen at random within the window, some words in the context are skipped, hence the name skip-gram for the model.
Semantic and syntactic information in representation
In the same article, Mikolov et al. (2013) propose a second efficient model to generate word representations. The method uses the current word as input to the projection layer and tries to predict the words within a certain range before and after the current word. As we will show in the results section of the report, semantic and syntactic similarities between words are well captured by the model.
We can answer some analogy questions using simple algebraic operations on the vector representations of the words. For example, to find a word that is similar to "small" in the same sense that "largest" is similar to "large", we can simply calculate X = vector("largest") − vector("large") + vector("small"). If we then search for the words whose representations are closest to X under cosine similarity, it is likely that "smallest" will be among them.
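The analogy computation can be demonstrated end to end with toy vectors. The vectors below are hand-crafted for illustration (dimension 2 encoding a rough "size polarity" and "superlative" axis); learned embeddings exhibit this structure only approximately.

```python
# Toy vectors, invented for illustration; real embeddings are learned.
vec = {
    "large":    [1.0, 0.0, 0.1],
    "largest":  [1.0, 1.0, 0.1],
    "small":    [-1.0, 0.0, 0.1],
    "smallest": [-1.0, 1.0, 0.1],
    "house":    [0.0, 0.0, 1.0],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda w: sum(a * a for a in w) ** 0.5
    return dot / (norm(u) * norm(v))

# X = vector("largest") - vector("large") + vector("small")
x = [lg - l + s for lg, l, s in zip(vec["largest"], vec["large"], vec["small"])]

# Nearest word to X under cosine similarity answers the analogy.
best = max(vec, key=lambda w: cosine(vec[w], x))
print(best)  # -> smallest
```

In practice the query words themselves are excluded from the candidate set before taking the nearest neighbour.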
First, there are data-driven approaches as discussed in Section 4.1 that exploit statistics, machine learning, and linear algebra to convert data into knowledge. Second, there are knowledge-driven methods as discussed in Section 4.2 that use expert knowledge encoded in the form of patterns or rules to extract knowledge. Finally, hybrid event extraction approaches as discussed in Section 4.3 combine data-driven and knowledge-based methods.
Data-Driven Event Extraction
Document vectors are generated from the collected entries using morphological analysis, named entity recognition, and an IDF-based weighting function. A topic word is then extracted from each topic cluster using a modified version of the C-value method.
A disadvantage of the discussed data-driven methods for extracting events is that they do not explicitly address meaning. Another drawback of statistics-based text mining is that it requires a large amount of data to get statistically significant results. However, since these approaches are not knowledge-based, neither linguistic resources nor expert (domain) knowledge are required.
Knowledge-Driven Event Extraction
Finally, the events extracted from individual documents are clustered so that related events can be viewed across documents. Here, lexical-syntactic patterns are used to detect a wide range of relationships and events: the Tagger module applies pattern-matching rules to extract an event using lexicon-based and syntax-based generic patterns.
The co-reference resolution module only resolves definite noun phrases of the Organization, Person, and Location types, and the singular person pronouns he and she. A 2010 study presented a method to extract event-based common sense knowledge using lexico-syntactic pattern matching and semantic role labeling. Patterns such as subject + "are able to" + verb, subject + "is able to" + verb, and many others are used to extract sentences from the web. Each sentence is analyzed by a Semantic Role Tagging module to extract verbs and their arguments.
Verbs together with their arguments form the knowledge items, but this method is prone to errors, so the authors proposed a plausibility check based on a semantic role replacement strategy, which significantly reduced the number of knowledge items with incorrectly parsed semantic roles. A 2009 system extracted events describing personal experiences from blogs. It exploits syntactic and semantic patterns encoded in the form of rules to extract the desired information.
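A surface form of such lexico-syntactic patterns can be approximated with a regular expression. This is only a sketch: the pattern, the helper name `ABLE_TO`, and the example sentence are ours, and a real system would operate on parsed text with semantic role labels rather than raw strings.

```python
import re

# Sketch of a subject + "is/are able to" + verb lexico-syntactic pattern.
# Named groups capture a one- or two-word subject and the following verb.
ABLE_TO = re.compile(
    r"(?P<subject>\w+(?:\s\w+)?)\s+(?:is|are)\s+able\s+to\s+(?P<verb>\w+)"
)

m = ABLE_TO.search("Dogs are able to swim across small rivers.")
if m:
    print(m.group("subject"), "->", m.group("verb"))  # Dogs -> swim
```

The matched (subject, verb) pairs would then be passed through semantic role labeling and the plausibility check before being stored as knowledge items.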
Hybrid Event Extraction
In summary, the advantage of pattern-based approaches is that they require only a very small amount of data. Semantic roles were used to detect events that are nominalizations of verbs, such as "agreement" for "agree" or "construction" for "construct". The authors proposed a new algorithm for extracting topic sentences that emphasizes the importance of the news headline; event facts (i.e., 5W1H) were then extracted from these topic sentences using a rule-based (verb-based) and a supervised machine learning (SVM) method.
Next, an SVM classifier is trained with morphological features such as POS tag, sentence length, and word position to extract candidate words. Finally, rules based on valency grammar, together with the previous two stages, are used to find the 5W1H elements. The authors used the concept of valency grammar to construct syntactic rules for extracting verbs and their arguments.
In hybrid event extraction systems, the use of data-driven methods increases the amount of data required, although it usually remains smaller than for purely data-driven methods. On the other hand, the amount of expert knowledge needed for effective and efficient event detection is generally less than for purely knowledge-driven methods, because the lack of domain knowledge can be compensated for with statistical methods.
TempEval 2013 Task of Event Extraction
Participants and their approaches
The authors of the system were primarily interested in evaluating how useful the various corpora are. From the results it can be concluded that, while the minimalist feature set was useful for tasks such as detecting relationships between entities, it could not achieve good results in the event identification and classification tasks. The NavyTime system (Chambers, 2013) also used a minimalist feature set derived from tokens, part-of-speech tags, constituency parse trees, and dependency trees to train a MaxEnt classifier.
The system performed well on the event identification task but could not achieve high scores in event classification. KUL (Kolomiyets and Moens, 2013) used a multi-label logistic regression classifier for event detection and classification, with features derived from dependency and constituency parse trees and shallow parsing. The authors of the Temp:ESAFeature system experimented with Explicit Semantic Analysis scores and WordNet hypernyms as features for classifying events and their types.
The use of features such as semantic roles, ESA, and WordNet lexical and semantic relations proved to be beneficial for event identification, but failed to contribute significantly to the event classification task. To determine the Class attribute for the event extraction task, the authors experimented with using a language-
Observations
Web mining for event-based common sense knowledge using lexico-syntactic pattern matching and semantic role labeling.
In Proceedings of the Second Joint Conference on Lexical and Computational Semantics (*SEM), Volume 2: Proceedings of the Seventh International Workshop on Semantic Evaluation (SemEval 2013).
In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 282–289, San Francisco, CA, USA.
James Pustejovsky, Bob Ingria, Roser Sauri, Jose Castano, Jessica Littman, Rob Gaizauskas, Andrea Setzer, Graham Katz, and Inderjeet Mani. Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 700–707.