POS Tagging and HMM

Academic year: 2018

(1)

Text Mining Team

(2)
(3)

What is Part-of-Speech (POS)?

Generally speaking, word classes (= POS):

Verb, Noun, Adjective, Adverb, Article, …

We can also include inflection:

Verbs: tense, number, …

(4)

Parts of Speech

8 (ish) traditional parts of speech

Noun, verb, adjective, preposition, adverb, article, interjection,

pronoun, conjunction, etc

Called: parts-of-speech, lexical categories, word classes,

morphological classes, lexical tags...

Lots of debate within linguistics about the number, nature, and

universality of these

(5)

7 Traditional POS Categories

N    noun         chair, bandwidth, pacing
V    verb         study, debate, munch
ADJ  adjective    purple, tall, ridiculous
ADV  adverb       unfortunately, slowly
P    preposition  of, by, to
PRO  pronoun      I, me, mine

(6)

POS Tagging

The process of assigning a part-of-speech or lexical

class marker to each word in a collection.

WORD   TAG

the    DET
koala  N
put    V
the    DET
keys   N
on     P
the    DET

(7)

Penn TreeBank POS Tag Set

Penn Treebank: hand-annotated corpus of Wall Street

Journal, 1M words

46 tags

Some particularities:

to /TO not disambiguated

(8)
(9)

Why is POS tagging useful?

Speech synthesis:

How to pronounce “lead”?

INsult    inSULT
OBject    obJECT
OVERflow  overFLOW
DIScount  disCOUNT
CONtent   conTENT

Stemming for information retrieval

A search for “aardvarks” can also retrieve “aardvark”

Parsing, speech recognition, etc.

Possessive pronouns (my, your, her) followed by nouns

Personal pronouns (I, you, he) likely to be followed by verbs

Need to know if a word is an N or V before you can parse

Information extraction

Finding names, relations, etc.

(10)

Open and Closed Classes

Closed class: a small, fixed membership

Prepositions: of, in, by, …

Auxiliaries: may, can, will, had, been, …

Pronouns: I, you, she, mine, his, them, …

Usually function words (short common words which play a role in grammar)

(11)

Open Class Words

Nouns

Proper nouns (Boulder, Granby, Eli Manning); English capitalizes these.

Common nouns (the rest). Count nouns and mass nouns

• Count: have plurals, get counted: goat/goats, one goat, two goats
• Mass: don’t get counted (snow, salt, communism) (*two snows)

Adverbs: tend to modify things

• Unfortunately, John walked home extremely slowly yesterday

Directional/locative adverbs (here, home, downhill)

Degree adverbs (extremely, very, somewhat)

Manner adverbs (slowly, slinkily, delicately)

Verbs

(12)

Closed Class Words

Examples:

prepositions: on, under, over, …

particles: up, down, on, off, …

determiners: a, an, the, …

pronouns: she, who, I, …

conjunctions: and, but, or, …

(13)
(14)
(15)
(16)

POS Tagging

Choosing a Tagset

There are so many parts of speech, potential distinctions we can

draw

To do POS tagging, we need to choose a standard set of tags to

work with

Could pick very coarse tagsets: N, V, Adj, Adv.

More commonly used set is finer grained, the “Penn TreeBank

tagset”, 45 tags

PRP$, WRB, WP$, VBG

(17)

Using the Penn Tagset

The/DT grand/JJ jury/NN commented/VBD on/IN a/DT

number/NN of/IN other/JJ topics/NNS ./.

Prepositions and subordinating conjunctions marked IN

(“although/IN I/PRP..”)

Except the preposition/complementizer “to” is just marked TO

(18)

POS Tagging

Words often have more than one POS: back

The back door = JJ

On my back = NN

Win the voters back = RB

Promised to back the bill = VB

The POS tagging problem is to determine the POS tag for

a particular instance of a word.

(19)
(20)

Current Performance

How many tags are correct?

About 97% currently

But the baseline is already 90%

Baseline algorithm:

• Tag every word with its most frequent tag
• Tag unknown words as nouns
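The baseline above can be sketched in a few lines of Python. The toy corpus here is hypothetical, purely for illustration:

```python
from collections import Counter, defaultdict

def train_baseline(tagged_corpus):
    """Learn each word's most frequent tag from (word, tag) pairs."""
    counts = defaultdict(Counter)
    for word, tag in tagged_corpus:
        counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}

def baseline_tag(words, most_frequent_tag):
    # Unknown words default to NN (noun), as in the baseline above.
    return [(w, most_frequent_tag.get(w, "NN")) for w in words]

# Toy training data (hypothetical counts, not from a real corpus):
corpus = [("the", "DT"), ("back", "NN"), ("back", "NN"), ("back", "VB"),
          ("bill", "NN"), ("promised", "VBD")]
model = train_baseline(corpus)
print(baseline_tag(["the", "back", "zyzzyva"], model))
# "back" gets its most frequent tag NN; unseen "zyzzyva" falls back to NN
```

Note that this tagger ignores context entirely, which is exactly why it tops out around 90%.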

(21)

Quick Test: Agreement?

the students went to class

plays well with others

fruit flies like a banana

DT: the, this, that

NN: noun

VB: verb

(22)

Quick Test

the students went to class
DT  NN       VB   P  NN

plays well with others
VB    ADV  P    NN

fruit flies like a banana
NN    NN    P    DT NN

(23)

Timeline (1960–2000):

• Brown Corpus created (EN-US), 1 million words
• Greene and Rubin: rule-based tagging of the Brown Corpus, ~70%
• LOB Corpus created (EN-UK), 1 million words
• HMM tagging (CLAWS), 93%–95%
• DeRose/Church: efficient HMMs handling sparse data, 95%+
• British National Corpus (tagged by CLAWS); POS tagging separated from other NLP
• Penn Treebank Corpus (WSJ, 4.5M words)
• Transformation-based tagging (Eric Brill), rule-based, 95%+
• Tree-based statistics (Helmut Schmid), 96%+
• Neural network taggers, 96%+
• Trigram tagger (Kempe), 96%+
• Combined methods, 98%+

(24)

Two Methods for POS Tagging

1. Rule-based tagging (ENGTWOL)

2. Stochastic

Probabilistic sequence models: HMM (Hidden Markov Model) tagging

(25)

Rule-Based Tagging

Start with a dictionary

Assign all possible tags to words from the dictionary

Write rules by hand to selectively remove tags

(26)

Rule-based taggers

Early POS taggers all hand-coded

Most of these (Harris, 1962; Greene and Rubin, 1971)

and the best of the recent ones, ENGTWOL (Voutilainen, 1995), are based on a two-stage architecture

Stage 1: look up the word in the lexicon to get a list of potential POSs

Stage 2: apply rules which certify or disallow tag sequences

Rules were originally handwritten; more recently they are machine-learned

(27)

Start With a Dictionary

• she: PRP

• promised: VBN,VBD

• to: TO

• back: VB, JJ, RB, NN

• the: DT

• bill: NN, VB

(28)

Assign Every Possible Tag

                     NN
                     RB
      VBN            JJ              VB
PRP   VBD      TO    VB      DT      NN
She   promised to    back    the     bill

(29)

Write Rules to Eliminate Tags

Eliminate VBN if VBD is an option when VBN|VBD follows “<start> PRP”

                     NN
                     RB
                     JJ              VB
PRP   VBD      TO    VB      DT      NN
She   promised to    back    the     bill
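The two stages can be sketched in Python. The dictionary is the one from the slides; the single hand-written rule mirrors the VBN-elimination rule, with the sentence-initial-PRP condition simplified to a position check:

```python
# A minimal sketch of two-stage rule-based tagging for
# "She promised to back the bill".

lexicon = {
    "she": ["PRP"], "promised": ["VBN", "VBD"], "to": ["TO"],
    "back": ["VB", "JJ", "RB", "NN"], "the": ["DT"], "bill": ["NN", "VB"],
}

def tag_candidates(words):
    # Stage 1: assign every tag the dictionary allows.
    return [list(lexicon[w.lower()]) for w in words]

def apply_rules(words, candidates):
    # Stage 2: hand-written constraint -- drop VBN when VBD is also
    # possible and the word directly follows a sentence-initial PRP.
    for i, tags in enumerate(candidates):
        if ("VBN" in tags and "VBD" in tags
                and i == 1 and candidates[0] == ["PRP"]):
            tags.remove("VBN")
    return candidates

words = "She promised to back the bill".split()
print(apply_rules(words, tag_candidates(words)))
# "promised" is narrowed to VBD; "back" still has four candidates,
# so a real system would need further rules.
```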

(30)

POS tagging

The involvement of ion channels in B and T lymphocyte activation is
DT  NN          IN NN  NNS      IN NN CC NN NN        NN         VBZ

supported by many reports of changes in ion fluxes and membrane
VBN       IN JJ   NNS     IN NNS     IN NN  NNS    CC NN

……….

(tagged text) → training → Machine Learning Algorithm

Unseen text: “We demonstrate that …”

We  demonstrate
PRP VBP

(31)

We want the best set of tags for a sequence of words (a sentence)

W — a sequence of words
T — a sequence of tags

Example:

(32)

But, the Sparse Data Problem …

Rich Models often require vast amounts of data

Count up instances of the string "heat oil in a large pot" in

the training corpus, and pick the most common tag assignment to the string.

(33)

POS Tagging as Sequence Classification

We are given a sentence (an “observation” or “sequence of

observations”)

Secretariat is expected to race tomorrow

What is the best sequence of tags that corresponds to this

sequence of observations?

Probabilistic view:

Consider all possible sequences of tags

Out of this universe of sequences, choose the tag sequence which is most probable given the sentence
(34)

Getting to HMMs

• We want, out of all sequences of n tags t1…tn the single tag sequence

such that P(t1…tn|w1…wn) is highest.

Hat ^ means “our estimate of the best one”
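This objective can be written compactly (a standard formulation of the slide's statement):

```latex
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(t_1^n \mid w_1^n)
```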

(35)

Getting to HMMs

This equation is guaranteed to give us the best tag

sequence

But how to make it operational? How to compute this

value?

Intuition of Bayesian classification:

Use Bayes' rule to transform this equation into a set of other probabilities that are easier to compute
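Concretely, applying Bayes' rule and dropping the denominator, which is the same for every candidate tag sequence, gives the standard operational form:

```latex
P(t_1^n \mid w_1^n) = \frac{P(w_1^n \mid t_1^n)\,P(t_1^n)}{P(w_1^n)}
\qquad\Longrightarrow\qquad
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} P(w_1^n \mid t_1^n)\,P(t_1^n)
```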

(36)

Bayes’ Theorem (1763)

posterior = likelihood × prior / marginal likelihood

P(T | W) = P(W | T) P(T) / P(W)

Reverend Thomas Bayes — Presbyterian minister (1702–1761)

(37)

P(W|T) and P(T) can be counted from a large annotated corpus

(38)


Assume each word in the sequence depends only on

its corresponding tag:
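This independence assumption factorizes the likelihood (the standard HMM emission model):

```latex
P(W \mid T) = P(w_1^n \mid t_1^n) \approx \prod_{i=1}^{n} P(w_i \mid t_i)
```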

(39)

Make a Markov assumption and use N-grams over tags ...

P(T) is a product of the probabilities of the N-grams that make it up

Count P(T)
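With a bigram (first-order Markov) assumption over tags, and combined with the emission factorization, the tagging objective becomes (the standard HMM formulation):

```latex
P(T) = P(t_1^n) \approx \prod_{i=1}^{n} P(t_i \mid t_{i-1}),
\qquad
\hat{t}_1^n = \operatorname*{argmax}_{t_1^n} \prod_{i=1}^{n} P(w_i \mid t_i)\,P(t_i \mid t_{i-1})
```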

(40)

Part-of-speech tagging with Hidden Markov

Models

[Diagram: tags are the hidden states, linked by transition probabilities; words are the observed outputs, generated with output (emission) probabilities]

(41)

Analyzing

(42)

A Simple POS HMM

Transition probabilities:

start → noun : 0.8     start → verb : 0.2
noun  → noun : 0.1     noun  → verb : 0.8     noun → end : 0.1
verb  → noun : 0.2     verb  → verb : 0.1     verb → end : 0.7

(43)

P ( word | state )

A two-word language: “fish” and “sleep”

Suppose in our training corpus,

fish appears 8 times as a noun and 5 times as a verb

sleep appears twice as a noun and 5 times as a verb

Emission probabilities:

Noun
• P(fish | noun) : 0.8
• P(sleep | noun) : 0.2

Verb
• P(fish | verb) : 0.5
• P(sleep | verb) : 0.5
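These emission probabilities are just relative frequencies, so they can be computed directly from the counts above (a minimal sketch):

```python
from collections import Counter

# Tag-conditioned counts from the toy corpus on the slide:
# "fish" appears 8 times as a noun and 5 as a verb;
# "sleep" appears 2 times as a noun and 5 as a verb.
counts = {
    "noun": Counter({"fish": 8, "sleep": 2}),
    "verb": Counter({"fish": 5, "sleep": 5}),
}

def emission_probs(counts):
    # P(word | tag) = count(word, tag) / count(tag)
    return {tag: {w: c / sum(ctr.values()) for w, c in ctr.items()}
            for tag, ctr in counts.items()}

print(emission_probs(counts))
# noun: fish 0.8, sleep 0.2; verb: fish 0.5, sleep 0.5
```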

(44)

Viterbi Probabilities

         0     1     2     3
start
verb
noun

(45)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

         0     1     2     3
start    1
verb     0
noun     0

(46)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 1: fish

         0     1         2     3
start    1     0
verb     0     .2 * .5
noun     0     .8 * .8

(47)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 1: fish

         0     1     2     3
start    1     0
verb     0     .1
noun     0     .64

(48)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 2: sleep  (if ‘fish’ is a verb)

         0     1     2             3
start    1     0     0
verb     0     .1    .1 * .1 * .5
noun     0     .64   .1 * .2 * .2

(49)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 2: sleep  (if ‘fish’ is a verb)

         0     1     2      3
start    1     0     0
verb     0     .1    .005
noun     0     .64   .004

(50)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 2: sleep  (if ‘fish’ is a noun)

         0     1     2                    3
start    1     0     0
verb     0     .1    .005 | .64 * .8 * .5
noun     0     .64   .004 | .64 * .1 * .2

(51)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 2: sleep  (if ‘fish’ is a noun)

         0     1     2             3
start    1     0     0
verb     0     .1    .005 | .256
noun     0     .64   .004 | .0128

(52)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 2: sleep — take maximum, set back pointers

         0     1     2             3
start    1     0     0
verb     0     .1    .005 | .256
noun     0     .64   .004 | .0128

(53)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 2: sleep — take maximum, set back pointers

         0     1     2      3
start    1     0     0
verb     0     .1    .256
noun     0     .64   .0128

(54)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 3: end

         0     1     2      3
start    1     0     0      0
verb     0     .1    .256
noun     0     .64   .0128

(55)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Token 3: end — take maximum, set back pointers

         0     1     2      3
start    1     0     0      0
verb     0     .1    .256
noun     0     .64   .0128

(56)

HMM: start —0.8→ noun —0.8→ verb —0.7→ end  (other arcs: 0.2, 0.1, 0.1)

Decode: fish = noun, sleep = verb

         0     1     2      3
start    1     0     0      0
verb     0     .1    .256
noun     0     .64   .0128
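The whole trellis computation above can be reproduced with a short Viterbi implementation. The transition and emission probabilities are the ones used in the worked example; this is a sketch, not the original course code:

```python
trans = {  # P(next_state | state), from the worked example
    ("start", "noun"): 0.8, ("start", "verb"): 0.2,
    ("noun", "noun"): 0.1, ("noun", "verb"): 0.8, ("noun", "end"): 0.1,
    ("verb", "noun"): 0.2, ("verb", "verb"): 0.1, ("verb", "end"): 0.7,
}
emit = {  # P(word | state)
    "noun": {"fish": 0.8, "sleep": 0.2},
    "verb": {"fish": 0.5, "sleep": 0.5},
}

def viterbi(words, states=("noun", "verb")):
    # v[s] holds (best probability of any path ending in s, that path)
    v = {s: (trans[("start", s)] * emit[s][words[0]], [s]) for s in states}
    for w in words[1:]:
        # Extend each state's best path by one more token.
        v = {s: max((v[p][0] * trans[(p, s)] * emit[s][w], v[p][1] + [s])
                    for p in states)
             for s in states}
    # Fold in the transition to the end state, then take the best path.
    prob, path = max((v[s][0] * trans[(s, "end")], v[s][1]) for s in states)
    return path, prob

path, prob = viterbi(["fish", "sleep"])
print(path, round(prob, 4))  # ['noun', 'verb'] 0.1792
```

The final probability is the .256 cell for verb multiplied by the verb → end transition 0.7, matching the trellis values above.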

(57)

Transition Probability

[Diagram: a state chain from START to END annotated with transition probabilities]

(58)

Exercise

(59)

POS taggers

Brill's tagger
  http://www.cs.jhu.edu/~brill/

TnT tagger
  http://www.coli.uni-saarland.de/~thorsten/tnt/

Stanford tagger
  http://nlp.stanford.edu/software/tagger.shtml

SVMTool
  http://www.lsi.upc.es/~nlp/SVMTool/

GENIA tagger
  http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger/

More complete list at:
