Text Mining Team
What is Part-of-Speech (POS)?
• Generally speaking, word classes (= POS):
• Verb, Noun, Adjective, Adverb, Article, …
• We can also include inflection:
• Verbs: Tense, number, …
Parts of Speech
• 8 (ish) traditional parts of speech
• Noun, verb, adjective, preposition, adverb, article, interjection,
pronoun, conjunction, etc.
• Called: parts-of-speech, lexical categories, word classes,
morphological classes, lexical tags...
• Lots of debate within linguistics about the number, nature, and
universality of these
7 Traditional POS Categories
• N    noun         chair, bandwidth, pacing
• V    verb         study, debate, munch
• ADJ  adjective    purple, tall, ridiculous
• ADV  adverb       unfortunately, slowly
• P    preposition  of, by, to
• PRO  pronoun      I, me, mine
• DET  determiner   the, a, that, those
POS Tagging
• The process of assigning a part-of-speech or lexical
class marker to each word in a collection.
WORD    TAG
the     DET
koala   N
put     V
the     DET
keys    N
on      P
the     DET
Penn TreeBank POS Tag Set
• Penn Treebank: hand-annotated corpus of Wall Street
Journal, 1M words
• 45 tags
• Some particularities:
• to/TO is not disambiguated: the same TO tag covers both infinitival and prepositional “to”
Why is POS tagging useful?
• Speech synthesis:
• How do you pronounce “lead”?
• INsult vs. inSULT
• OBject vs. obJECT
• OVERflow vs. overFLOW
• DIScount vs. disCOUNT
• CONtent vs. conTENT
• Stemming for information retrieval:
• A search for “aardvarks” can also match “aardvark”
• Parsing, speech recognition, etc.:
• Possessive pronouns (my, your, her) are likely to be followed by nouns
• Personal pronouns (I, you, he) are likely to be followed by verbs
• You need to know whether a word is an N or a V before you can parse
• Information extraction:
• Finding names, relations, etc.
Open and Closed Classes
• Closed class: a small, fixed membership
• Prepositions: of, in, by, …
• Auxiliaries: may, can, will, had, been, …
• Pronouns: I, you, she, mine, his, them, …
• Usually function words (short, common words which play a grammatical role)
Open Class Words
• Nouns
• Proper nouns (Boulder, Granby, Eli Manning); English capitalizes these
• Common nouns (the rest)
• Count nouns and mass nouns
• Count: have plurals, get counted: goat/goats, one goat, two goats
• Mass: don’t get counted (snow, salt, communism) (*two snows)
• Adverbs: tend to modify things
• Unfortunately, John walked home extremely slowly yesterday
• Directional/locative adverbs (here, home, downhill)
• Degree adverbs (extremely, very, somewhat)
• Manner adverbs (slowly, slinkily, delicately)
• Verbs
Closed Class Words
Examples:
• Prepositions: on, under, over, …
• Particles: up, down, on, off, …
• Determiners: a, an, the, …
• Pronouns: she, who, I, …
• Conjunctions: and, but, or, …
POS Tagging
Choosing a Tagset
• There are many parts of speech and many potential distinctions we could draw
• To do POS tagging, we need to choose a standard set of tags to work with
• We could pick a very coarse tagset:
• N, V, Adj, Adv
• More commonly used sets are finer grained, e.g. the “Penn TreeBank tagset” (45 tags):
• PRP$, WRB, WP$, VBG, …
Using the Penn Tagset
• The/DT grand/JJ jury/NN commented/VBD on/IN a/DT number/NN of/IN other/JJ topics/NNS ./.
• Prepositions and subordinating conjunctions are marked IN (“although/IN I/PRP …”)
• Except the preposition/complementizer “to”, which is just marked TO
POS Tagging
• Words often have more than one POS: back
• The back door: back = JJ
• On my back: back = NN
• Win the voters back: back = RB
• Promised to back the bill: back = VB
• The POS tagging problem is to determine the POS tag for
a particular instance of a word.
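To see this ambiguity resolved in practice, here is a minimal sketch using NLTK’s off-the-shelf tagger. NLTK is an assumption of this example, not a tool these slides prescribe; it needs its tokenizer and tagger resources downloaded once, and resource names can vary by NLTK version.

import nltk

# One-time downloads (names may differ slightly across NLTK versions):
# nltk.download("punkt")
# nltk.download("averaged_perceptron_tagger")

sentences = [
    "The back door was open.",
    "Promised to back the bill.",
]

for sent in sentences:
    tokens = nltk.word_tokenize(sent)
    # pos_tag returns a list of (word, Penn-Treebank-tag) pairs
    print(nltk.pos_tag(tokens))

# “back” should come out as JJ (adjective) in the first sentence
# and VB (verb) in the second.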
Current Performance
• How many tags are correct?
• About 97% currently
• But the baseline is already 90%
• Baseline algorithm:
• Tag every word with its most frequent tag
• Tag unknown words as nouns
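A minimal sketch of this baseline in Python. The function names (train, baseline_tag) and the tiny corpus are made up for illustration:

from collections import Counter, defaultdict

def train(tagged_sentences):
    """Learn each word's most frequent tag from [(word, tag), ...] sentences."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}

def baseline_tag(words, most_frequent_tag, unknown_tag="NN"):
    """Tag each word with its most frequent training tag; unknown words get NN."""
    return [(w, most_frequent_tag.get(w.lower(), unknown_tag)) for w in words]

# Toy usage:
corpus = [[("the", "DT"), ("back", "NN")], [("back", "NN"), ("the", "DT"), ("bill", "NN")]]
model = train(corpus)
print(baseline_tag(["back", "the", "aardvark"], model))
# [('back', 'NN'), ('the', 'DT'), ('aardvark', 'NN')]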
Quick Test: Agreement?
• the students went to class
• plays well with others
• fruit flies like a banana
Key: DT = determiner (the, this, that); NN = noun; VB = verb
Quick Test
• the/DT students/NN went/VB to/P class/NN
• plays/VB well/ADV with/P others/NN
• fruit/NN flies/NN like/P a/DT banana/NN
POS Tagging Timeline, 1960–2000
• Brown Corpus created (EN-US): 1 million words
• Brown Corpus tagged; rule-based tagging (Greene and Rubin): ~70%
• LOB Corpus created (EN-UK): 1 million words
• HMM tagging (CLAWS): 93%–95%
• Efficient HMM handling sparse data (DeRose/Church): 95%+
• British National Corpus tagged by CLAWS; POS tagging becomes separated from other NLP
• Penn Treebank Corpus created (WSJ, 4.5M words)
• Transformation-based tagging (Eric Brill), rule-based: 95%+
• Tree-based statistics (Helmut Schmid): 96%+
• Neural network taggers: 96%+
• Trigram tagger (Kempe): 96%+
• Combined methods: 98%+
Two Methods for POS Tagging
1. Rule-based tagging
• e.g., ENGTWOL
2. Stochastic tagging
• Probabilistic sequence models
• HMM (Hidden Markov Model) tagging
Rule-Based Tagging
• Start with a dictionary
• Assign all possible tags to words from the dictionary
• Write rules by hand to selectively remove tags
Rule-based taggers
• Early POS taggers were all hand-coded
• Most of these (Harris, 1962; Greene and Rubin, 1971), and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture:
• Stage 1: look the word up in a lexicon to get a list of potential POSs
• Stage 2: apply rules which certify or disallow tag sequences
• Rules were originally handwritten; more recently they have been machine-learned
Start With a Dictionary
• she: PRP
• promised: VBN, VBD
• to: TO
• back: VB, JJ, RB, NN
• the: DT
• bill: NN, VB
Assign Every Possible Tag
                          NN
                          RB
      VBN                 JJ          VB
PRP   VBD         TO      VB     DT   NN
She   promised    to      back   the  bill
Write Rules to Eliminate Tags
Rule: Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”
                          NN
                          RB
                          JJ          VB
PRP   VBD         TO      VB     DT   NN
She   promised    to      back   the  bill
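A toy sketch of these two stages in Python. The lexicon and the single rule come straight from the slides; the function names are illustrative, not from any real tagger:

# Stage 1: the hand-built lexicon from the slides.
LEXICON = {
    "she": ["PRP"],
    "promised": ["VBN", "VBD"],
    "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"],
    "the": ["DT"],
    "bill": ["NN", "VB"],
}

def assign_all_tags(words):
    """Stage 1: attach every dictionary tag to each word."""
    return [list(LEXICON[w.lower()]) for w in words]

def apply_rules(words, candidates):
    """Stage 2: eliminate VBN when VBD is an option and the word follows <start> PRP."""
    for i in range(1, len(words)):
        follows_start_prp = i == 1 and candidates[0] == ["PRP"]
        if follows_start_prp and "VBN" in candidates[i] and "VBD" in candidates[i]:
            candidates[i].remove("VBN")
    return candidates

words = "She promised to back the bill".split()
print(apply_rules(words, assign_all_tags(words)))
# [['PRP'], ['VBD'], ['TO'], ['VB', 'JJ', 'RB', 'NN'], ['DT'], ['NN', 'VB']]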
POS tagging
The/DT involvement/NN of/IN ion/NN channels/NNS in/IN B/NN and/CC T/NN lymphocyte/NN activation/NN is/VBZ supported/VBN by/IN many/JJ reports/NNS of/IN changes/NNS in/IN ion/NN fluxes/NNS and/CC membrane/NN …
Machine Learning Algorithm
• Training: the tagger is trained on hand-tagged text (“We demonstrate that …”)
• Tagging: the trained model is then applied to unseen text, e.g.:
• We/PRP demonstrate/VBP …
We want the best set of tags for a sequence of words (a sentence):
W — a sequence of words
T — a sequence of tags
Goal: T̂ = argmax_T P(T | W)
But, the Sparse Data Problem …
• Rich models often require vast amounts of data
• Naive idea: count up instances of the string “heat oil in a large pot” in the training corpus, and pick the most common tag assignment to the string. But most sentences never occur in any training corpus, so such counts are almost always zero.
POS Tagging as Sequence Classification
• We are given a sentence (an “observation” or “sequence of
observations”)
• Secretariat is expected to race tomorrow
• What is the best sequence of tags that corresponds to this
sequence of observations?
• Probabilistic view:
• Consider all possible sequences of tags
• Out of this universe of sequences, choose the tag sequence which is most probable given the observation sequence of n words w1…wn
Getting to HMMs
• We want, out of all sequences of n tags t1…tn, the single tag sequence such that P(t1…tn | w1…wn) is highest:
• t̂1…t̂n = argmax P(t1…tn | w1…wn), maximizing over all tag sequences t1…tn
• The hat ^ means “our estimate of the best one”
Getting to HMMs
• This equation is guaranteed to give us the best tag
sequence
• But how do we make it operational? How do we compute this value?
• Intuition of Bayesian classification:
• Use Bayes’ rule to transform this equation into a set of other probabilities that are easier to compute
Bayes’ Theorem (1763)
P(T | W) = P(W | T) · P(T) / P(W)
posterior = likelihood × prior / marginal likelihood
Reverend Thomas Bayes, Presbyterian minister (1702–1761)
• Since P(W) is the same for every candidate tag sequence, we can ignore it:
• T̂ = argmax_T P(W | T) · P(T)
• P(W | T) and P(T) can be counted from a large hand-tagged corpus
• Assume each word in the sequence depends only on its corresponding tag:
• P(W | T) ≈ ∏ P(wi | ti)
• Make a Markov assumption and use N-grams over tags:
• P(T) is a product of the probabilities of the N-grams that make it up:
• P(T) ≈ P(t1) · ∏ P(ti | ti−1)
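Chaining these two assumptions with Bayes’ rule gives the standard bigram HMM tagging objective (the textbook formulation these slides are building toward):

T̂ = argmax_T P(W | T) · P(T) ≈ argmax over t1…tn of ∏ P(wi | ti) · P(ti | ti−1)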
Count P(T)
• Tag N-gram probabilities are estimated by counting in a hand-tagged corpus:
• P(ti | ti−1) = C(ti−1, ti) / C(ti−1)
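A minimal sketch of computing these counts in Python. estimate_probs and the toy corpus are illustrative names, not from any particular toolkit:

from collections import Counter

def estimate_probs(tagged_sentences):
    """MLE estimates of transition P(t_i | t_{i-1}) and emission P(w_i | t_i)."""
    bigram_counts, context_counts = Counter(), Counter()  # C(t_{i-1}, t_i), C(t_{i-1})
    emit_counts, tag_counts = Counter(), Counter()        # C(w_i, t_i), C(t_i)
    for sentence in tagged_sentences:
        prev = "<s>"  # sentence-start pseudo-tag
        for word, tag in sentence:
            bigram_counts[(prev, tag)] += 1
            context_counts[prev] += 1
            emit_counts[(word.lower(), tag)] += 1
            tag_counts[tag] += 1
            prev = tag
    trans_p = {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}
    emit_p = {we: c / tag_counts[we[1]] for we, c in emit_counts.items()}
    return trans_p, emit_p

corpus = [[("fish", "noun"), ("sleep", "verb")], [("fish", "verb")]]
trans_p, emit_p = estimate_probs(corpus)
print(trans_p[("<s>", "noun")])  # 0.5
print(emit_p[("fish", "noun")])  # 1.0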
Part-of-speech tagging with Hidden Markov Models
• The tags are the hidden states; the words are the observed outputs
• Transition probabilities P(ti | ti−1) link tag to tag
• Output (emission) probabilities P(wi | ti) link each tag to the words it can produce
A Simple POS HMM
Transition probabilities:
• start → noun: 0.8; start → verb: 0.2
• noun → verb: 0.8; noun → noun: 0.1; noun → end: 0.1
• verb → noun: 0.2; verb → verb: 0.1; verb → end: 0.7
P(word | state)
• A two-word language: “fish” and “sleep”
• Suppose in our training corpus:
• “fish” appears 8 times as a noun and 5 times as a verb
• “sleep” appears twice as a noun and 5 times as a verb
• Emission probabilities:
• Noun: P(fish | noun) = 8/10 = 0.8; P(sleep | noun) = 2/10 = 0.2
• Verb: P(fish | verb) = 5/10 = 0.5; P(sleep | verb) = 5/10 = 0.5
Viterbi Probabilities
Trellis of best path probabilities for the input “fish sleep”:

         t=0    t=1 (fish)   t=2 (sleep)   t=3 (end)
start    1      0            0             0
noun     0      .64          .0128         -
verb     0      .1           .256          -

Token 1: fish
• noun: .8 (start → noun) × .8 P(fish | noun) = .64
• verb: .2 (start → verb) × .5 P(fish | verb) = .1

Token 2: sleep (take the maximum over predecessors, set back pointers)
• If “fish” is a verb: verb = .1 × .1 × .5 = .005; noun = .1 × .2 × .2 = .004
• If “fish” is a noun: verb = .64 × .8 × .5 = .256; noun = .64 × .1 × .2 = .0128
• Maxima: verb = .256 and noun = .0128, both with a back pointer to noun

Token 3: end (take the maximum, set back pointers)
• end: max(.256 × .7 via verb → end, .0128 × .1 via noun → end) = .1792, back pointer to verb

Decode (follow the back pointers from end): fish = noun, sleep = verb
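The same computation in code: a compact Viterbi sketch in Python, hard-coding the toy model above. The dictionary layout and the viterbi function are illustrative, not from a library:

# The toy HMM from the slides.
TRANS = {("start", "noun"): 0.8, ("start", "verb"): 0.2,
         ("noun", "noun"): 0.1, ("noun", "verb"): 0.8, ("noun", "end"): 0.1,
         ("verb", "noun"): 0.2, ("verb", "verb"): 0.1, ("verb", "end"): 0.7}
EMIT = {("noun", "fish"): 0.8, ("noun", "sleep"): 0.2,
        ("verb", "fish"): 0.5, ("verb", "sleep"): 0.5}
STATES = ["noun", "verb"]

def viterbi(words):
    """Return the most probable tag sequence and its probability."""
    # delta[t][s]: best probability of any path ending in state s after t words.
    delta = [{"start": 1.0}]
    backptr = [{}]
    for t, word in enumerate(words, start=1):
        delta.append({})
        backptr.append({})
        for s in STATES:
            # Best predecessor for state s, scored by path prob * transition prob.
            prev = max(delta[t - 1], key=lambda p: delta[t - 1][p] * TRANS.get((p, s), 0.0))
            delta[t][s] = delta[t - 1][prev] * TRANS.get((prev, s), 0.0) * EMIT[(s, word)]
            backptr[t][s] = prev
    # Transition into the end state, then follow the back pointers.
    last = max(STATES, key=lambda s: delta[-1][s] * TRANS.get((s, "end"), 0.0))
    best_prob = delta[-1][last] * TRANS.get((last, "end"), 0.0)
    tags = [last]
    for t in range(len(words), 1, -1):
        tags.append(backptr[t][tags[-1]])
    return list(reversed(tags)), best_prob

print(viterbi(["fish", "sleep"]))
# (['noun', 'verb'], 0.1792) up to float rounding, matching the trellis above.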
Exercise
POS taggers
• Brill’s tagger: http://www.cs.jhu.edu/~brill/
• TnT tagger: http://www.coli.uni-saarland.de/~thorsten/tnt/
• Stanford tagger: http://nlp.stanford.edu/software/tagger.shtml
• SVMTool: http://www.lsi.upc.es/~nlp/SVMTool/
• GENIA tagger: http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/
• More complete list at: