Combining Chart Parsing and Finite State

(1)

Combining Chart-Parsing and Finite State

Parsing.

Conference Paper

· January 1999

Source: DBLP

CITATIONS

8 READS

14 3 authors:

Some of the authors of this publication are also working on these related projects:

PROSA - MED: Advanced semantic textual processing for the detection of diagnostic codes, procedures,

concepts and their relationships in health records

View project

Koldo Gojenola

Universidad del País Vasco / Euskal Herriko Uni…

75 PUBLICATIONS

404 CITATIONS

SEE PROFILE

Izaskun Aldezabal

Universidad del País Vasco / Euskal Herriko Uni…

42 PUBLICATIONS

228 CITATIONS

SEE PROFILE

Maite Oronoz

Universidad del País Vasco / Euskal Herriko Uni…

55 PUBLICATIONS

245 CITATIONS

SEE PROFILE

All content following this page was uploaded by Koldo Gojenola on 04 July 2014.

(2)

of the

ESSLLI Student Session

1999

(3)

(4)

Combining Chart-Parsing and Finite State

Parsing

I. Aldezabal,K. Gojenola, M. Oronoz

InformatikaFakultatea,649 P. K.Euskal Herriko Unibertsitatea,

20080 Donostia (Euskal Herria)

E-mail: jipgogaksi.ehu.es

Abstrat

Thispaperpresentsthedevelopmentofaparsingsystemthatombinesauniation-basedpartialhart-parser

with nitestatetehnology. It isbeingappliedtoBasque, anagglutinativelanguage. Ina rst phase,a

uniationgrammarisappliedtoeahsentene,givingahartasaresult. Thegrammarispartialandgives

agoodoverageofthemainelementsofthesentene,buttheoutputisambiguous. Afterthat,theresulting

hartistreatedasanautomatontowhihnitestatedisambiguationonstraintsandltersanbeappliedto

hoosethebestsingleanalysis. Thesystemhasbeentestedontwodierentappliations: theaquisitionof

subategorizationinformationforverbs,andthedetetionofsyntatierrors.

0.1 Introdution

Thispaperpresentsa projetforthedevelopment of aparsing systemthat

ombines a uniation-based partial hart-parser with nite state

tehnol-ogy. ThesystemisbeingappliedtoBasque,anagglutinativelanguage,with

freeorder among sentene omponents.

Inarstphase,auniationgrammarisappliedtoeahsentene,giving

a hart as a result. The grammar is partialbut gives a ompleteoverage

of the main elements of the sentene, suh as noun phrases, prepositional

phrases, sentential omplements and simple sentenes. It an be seen as a

shallowparser[2 ,3℄thatanbeusedforsubsequentproessing. However,it

ontainsbothmorphologialansyntatiambiguities,givingahugenumber

ofdierent interpretations persentene.

After that, the resulting hart is treated as an automaton to whih

-nite state disambiguation onstraints and lters an be applied, so that

unwanted readings or information an be disarded or parts of a sentene

anbehosen,dependingontheappliation. Thesystemhasbeentestedon

two dierent appliations: the extration of subategorization information

fora givenverb, and thedetetion of syntatierrors.

In this manner, we ombine onstrutive and redutionistiapproahes

to parsing: inarstphase theuniation-grammarobtainssyntatiunits,

usually with many dierent interpretations, while in a seond phase the

analysesarerestritedaordingto theontext andthekindofappliation.

The remainder of this paper is organized as follows. In Setion 2 we

presentadesriptionofthehart-parser. Setion3desribesthenitestate

(5)

applia-0.2 The hart parser

0.2.1 Previous work

Thesystemreliesondierentwide-overage toolsthathavebeendeveloped

forBasque:

The LexialDatabaseforBasque [4 ℄. Itisa largerepositoryof lexial

informationwithabout70.000entries(60.000lemmas),eahone with

its assoiatedlinguistifeatures: ategory,subategory,ase, number,

deniteness, mood and tense, among others. As this database is the

basisofthesyntatianalyzer,wemustsaythatthereisnoinformation

related to verbsubategorization.

A morphologial analyzer [4℄. The analyzer applies Two-Level

Mor-phology forthemorphologial desriptionand obtains,foreah word,

itssegmentation(s)intoomponentmorphemes. Themoduleisrobust

and hasfulloverage of free-runningtexts inBasque.

A morphologial disambiguator based on boththe Constraint

Gram-mar formalism [13 , 5℄ and a statistial tagger [9℄. This tool redues

the high word-level ambiguity from 2.65 to 1.19 interpretations, but

stillleavesa numberofdierent interpretationspersentene.

0.2.2 The syntati grammar

Basque being an agglutinative language, linguisti information like ase,

number and determination are given by means of morphemes appended

to the last element of a noun/prepositional phrase. Following the most

extendedlinguistidesriptionsforBasque,we tookmorphemes(seeFigure

1)asthebasiunitsuponwhihsyntatianalysisisbased,departingfrom

thetraditionaluseofthegrammatialwordasthesyntatiunit,asisdone

innon agglutinativelanguages likeEnglish. Althoughthegureonlyshows

the ategory of eah lexial/syntati unit, there is a rih information for

eahofthem,enodedintheformoffeaturestrutures. Moreover, afterthe

syntati unitsareobtained, theanalysis tree isnotusedanymore.

ThePATR-IIformalismwasusedforthedenitionofthesyntatirules.

Therewere two mainreasonsforthiseletion:

Simpliity. Thegrammaris notlinkedto alinguistitheory,likeLFG

or HPSG,asthere hasnotbeenabroad desriptionfor Basqueusing

those formalisms [1℄. Moreover, theirappliation would require

infor-mation notavailableat the moment, suh as verb subategorization.

PATR-II is more exibleat the ost of extra writing, as it is dened

(6)

mendi 0 ko etxe polit 0 an

mountain the of/at house nice the in

noun sing genitive noun adj sing inessive

DECLENSION

NMODIFIER

NP1

PP

NP2

DECLENSION

Figure 1: Parse tree for mendiko etxe politan ('in the nie house at the

mountain')

The formalism is based on uniation. This is useful for the

ma-nipulation of omplex linguisti strutures for the representation of

grammatial onstituents. We must stress that the grammar uses

good linguisti granularity, in the sense that we keep all the

linguis-tially relevant morphosyntati information, very rih ompared to

mosthunkingsystems. Thisfat willenableusto useitasa general

toolfordierent appliations.

Thegrammaratthemomentontains120rules. Thereisanaveragenumber

of15equationsperrule,some ofthemfortestingonditionslikeagreement,

and othersforstruturebuilding. The mainphenomenaovered are:

Nounphrases and prepositional phrases. Agreement among the

om-ponentelementsis veried, addedto theproperuse ofdeterminers.

Subordinate sentenes, suh as sentential omplements (ompletive

lauses,indiretquestions,...) and relativelauses.

Simplesentenesusingall thepreviouselements. Therih agreement

betweentheverband the mainsenteneonstituents(subjet, objet

and seondobjet)inase, numberand personis veried.

The omponentsfound by theparser are very reliable,as onlywell-formed

elementsareobtained,whihareneessaryforafullsenteneinterpretation.

Duetothevarietyof syntatistruturesfoundinreal sentenes,the

gram-mar is limited in the sensethat onlya smallfration of the sentenes will

reeive a full analysis. However, onstruting a omplete grammar would

be a ostly enterprise. For that reason, this grammar is applied

bottom-up,resulting ina parser that obtainspiees of analysesor hunks, and the

resultinghart an be usedforsubsequent proessing.

(7)

...

Figure2: State of thehartafter theanalysisof Mendiko etxe politan ikusi

dutnik ('Ihave seen (it)inthenie houseat themountain')

is found, with the aim of orreting a syntati error. In fat, this

sys-temassumedthat every orretsentene getsa ompleteanalysis,whilewe

onsiderthat quite often the grammar will notbe able to parse the whole

sentene. Even inthease whenwe want todetet agrammatial error,we

donotassume thatthe orreted sentene wouldgive a ompleteparse.

0.3 The Finite State Parser

In reent years the eld of nite state systems has gained muh attention

[12 ,13 ℄. Aharateristithatallnitestatesystemshaveinommonisthat

theyworkonrealtextsdealingwithunitsbiggerthanthegrammatialword,

usingmore elaborated information,butwithoutreahing theomplexityof

fullsenteneanalysis.

Astheoutputofourhartparserisambiguous,weobtainalargenumber

ofpotentialanalyses(onatenationsofhunksoveringthewholesentene).

Weneededatoolfortheseletionofpatternsoverthenalhart,andnite

state tehnology is adequate for this task. Currently we use the Xerox

FiniteState Tool(XFST 1

). Asalinkbetweenbothparsers,thehartmust

beonvertedintoanautomaton,whihanthenbeproessedbyXFST(see

Figure 2). In the gure, dashed linesare used to indiate lexial elements,

whileplainlines denesyntati units(thoseobtained by thehart-parser,

see Figure 1). The bold irles represent word-boundaries, and the plain

ones delimit morpheme-boundaries. As before, eah ar in the gure is

represented by its morphosyntati information, in the form of a sequene

offeature-valuepairs.

Inthe dierent appliations, we have usedsome ommonoperations:

Disambiguation. Twokindsofambiguitywereonsidered:

morphosyn-tati ambiguityleft by the Constraint Grammar and stohasti

dis-ambiguation proesses and that introdued by the hart parser. As

(8)

wholesyntatiunitsanbeused,thisproessissimilartothatofCG

disambiguationwiththeadvantageofbeingabletoreferenesyntati

unitslargerthanthe word.

Filtering. Dependingoneahappliation,sometimesnotallthe

avail-able informationis relevant. For example, one diÆultkindof

ambi-guity,noun/adjetive,asinzuriekin (zuri='white','withthewhites'/

'with the white ones') an be ignored when aquiring verb

subate-gorization information, as we are mainly interested in the syntati

ategory and thegrammatial ase (prepositionalphraseand

ommi-tative, respetively), thesame inbothalternatives.

Extrating partsof a sentene. Theglobal ambiguityof a sentene is

onsiderablyredued ifonly part of it is onsidered. Forinstane, in

thease of extrating verb subategorization information, some rules

examine the ontext of the target verb and dene the sope of the

subsenteneto whihtheprevious operationswillbeapplied.

Among the nite state operators used (see Figure 3 2

), we apply

omposi-tion,intersetionandunionofregularexpressionsandtransduers 3

. Weuse

bothordinaryompositionandthereentlydenedlenientomposition [10 ℄.

This operator allows the appliation of dierent eliminating onstraints to

asentene,alwayswiththeertaintythatwhensome givenonstraint

elim-inatesall the interpretations, thenthe onstraint is notappliedat all, that

is,theinterpretationsare"resued". Astheoperatoris denedinterms of

regularrelations itan be omposed withother automata.

0.4 Appliations

0.4.1 Aquisition of subategorization information

We are interested in automatially obtaining verb subategorization

infor-mationfrom orpora[7, 8℄. Ourobjetive is to enrih theverbentries

on-tainedinthelexialdatabasewithinformationaboutthemainonstituents

(noun phrases, prepositional omplements and subordinate sentenes)

ap-pearingwitheah verb.

Thehartparseranalyzesthemainunitsineahsenteneandthennite

state rulesarry outthefollowingtasks(see Figure 3):

Reognition of therelevant subsentene. The mean numberof words

persenteneis22,rangingfromone tosixsubsentenes. Theproblem

isthat ofdelimiting thepartorrespondingto the target verb.

2

The .o. and & operators denote respetively the omposition and intersetion of

regularlanguagesandrelations. 3

(9)

Sentence

Chart parser

chart (automaton)

.o.

SubSentenceRecognizers

.o.

DisambiguatingFilters

.o.

LongestPhraseFilters

.o.

OutputCleaningFilters

verb + elements

No Error / Error Type(s)

.o.

ErrorPattern & NotException

.o.

OutputCleaningFilters

Figure 3: Dierent appliationsof thesystem

Disambiguation. Some heuristisareused to solve theremaining

am-biguities, related,amongothers, to dierent typesof subordination.

Seletion of the longest NP/PPs. This heuristi proved to be very

reliable to exlude multiplereadingsof asingle NP/PP.

Cleaning the output. The result of the previous phase ontains a

lot of morphsyntati information, but most of it is superuous for

thisappliation,as we aremainlyinterested inthegrammatial ase,

numberand subordinationtype of eahelement.

Thedesignofthenitestaterulesisanon-trivialtaskwhendealingwith

real texts. Nevertheless, the delarativeness of nite state tools improves

signiantly our previous version of the system implemented by means of

ad ho programs, diÆult to orret and maintain. Moreover, XFST also

improves eÆieny. About 350 nite state denitions were made for this

appliation.

As a rst experiment, 500 sentenes for eah of 5 dierent verbs were

seleted. Figure 4 shows theases that have mostlyappeared around eah

of the seleted verbs. When there was more than a single analysis for a

subsentene, it was not taken into aount. This eliminated about 5% of

thesentenes. Thisisnotaproblemaslongastheorpusisbigenough[7℄.

Theabsolutive ase (theone that usuallyrepresentsboththeobjetof the

(10)

0

Figure4: Frequenyofelementsforeah verb

to haraterize the dierent behaviour of the verbs. On the other hand,

we ndases that seem to have lose relationship to a given verb, suh as

the ablative ase ('from') in the atera verb ('to go out'), the alative ase

('to') in the joan verb ('to go') and the inessive ase ('in') in the agertu

verb ('to appear'). Both the ablative and alative ases usually represent

respetively the soure and the goal of an ation, whih means that both

represent a movement. As for the inessive ase, it represents the

spatio-temporal oordinates of an entity or event. That is why it appears in all

theverbs. Finally,thereareases thatshowanexlusivetendenytowarda

givenverb,suh astheompletivesubordinate-la ('that') intheikusi verb

('to see'), and theadverbial subordinate-tzeko ('for', 'so that').

Allthesepropertiesmakelearthatverbstakeotherrelevantasesapart

from the ones representing the objet and subjet of the sentenes, and

thereforeit doesnotstrike usasimplausibleto assume thattheyshouldbe

taken into aount inthe mainstruture of the verbs. In other words, the

systemanhelp usindeidingwhether agivenphraseanbeonsideredas

an argument forthetreatedverb.

Added to this experiment, we are also working on getting statistial

patterns from thedierent ombinations ofthe elementsappearingineah

sentene. Wewillapplythesystemto8.000sentenes(about200.000words)

orresponding to 10 new verbs. In thisway, we expet to obtain patterns

that willlet us groupall verbs indierent lasses. It will be also a wayto

verifythe suggestionsand similaritiesgot from theaboveexperiment.

Regarding evaluation, there are not already existing resoures (in the

form of annotated orpus or ditionaries) with subategorization

informa-tionto automatiallyomparewiththeresultsofthetool, asin[7,8℄. Asa

(11)

sen-Conerningthereognitionofsubsentenesto whiheahverb applies,it is

done orretly 82% of the times. Sometimesthe subsentene is easily

re-ognizable(forexample, whendelimitedbypuntuationmarks), whileother

ases are diÆult due to the nested struture of omplex, long sentenes.

We also ounted the number of times that the verb is orretly returned

withall its relevant syntati omponents (without examiningtheir role as

arguments or adjunts). This was performed orretly in 70% of the

sen-tenes. Afterexaminingtheglobalerrorrate(30%,thatis,thetotalnumber

oferrorsinludingtheinorretlyreognizedsubsentenes), wefoundroom

forimprovement: halfoftheerrorswereduetothepreviousdisambiguation

proess (either the inorret seletion was hosen or there was more than

an alternative), while a third of errors were aused by the inompleteness

of the uniation grammar. Although these problems will always be error

soures, we feel that after adding some simple and general linguisti rules

theywillorret mostof theerrors.

Subsentence

recognition

Syntactic elements

Correct

Incorrect

Correct

Incorrect

82%

18%

70%

30%

Table1. Resultsofthesystemapplied to50sentenes.

0.4.2 Syntati error detetion

The system has been tested on errors in real texts written by learners of

Basque. As a preliminary experiment, we hose errors related with

date-expressionsfordierent reasons:

The ontext ofappliation iswideand well-denedinBasque,aswell

as themost ommonerrors. In Basque,date-expressions like

'Donos-tian,1995ekomaiatzaren15ean'(Donostia, 15thofJuly,1995)require

that some of the elements are ineted, sometimes aording to

an-other ontiguous element.

As it is relatively easy to obtain test data we eliminate one of the

mainproblemswhendealingwithsyntatierrorsinrealtexts: nding

erroneous sentenesforeah phenomena.

Inorder to reognize 6 dierenterror types, we needed more than 100

def-initionsof nitestate patterns fortheir treatment. We have useda orpus

onsisting of 70 dates (inluding orret and inorret ones) for the

de-nitionof the patterns. Apart from the sixkinds of errors, we also tried to

aountforthemostfrequentombinationsoferrors. Atthemomentweare

(12)

again a exibleand systemati way to solve it. We plan to extend the

sys-tem to other typesof errors, like agreement between themainomponents

ofthesentene,whihisvery rihinBasque,beingasoureofmanyerrors.

0.5 Conlusions

Thisworkpresentsthedevelopment of asystemwhihombines:

AsyntatigrammarforBasque. Itoversthemainomponentsofthe

sentene. Dueto theagglutinativenature ofBasque,theintrodution

of the syntati level greatly improves the abilityto treat the wealth

of information ontained in eah word/morpheme. The reognized

omponentsan beusedfor posteriorproessing.

Finite state rules. They provide a modular, delarative and exible

workbenhtodealwiththeresultinghart,makinguseoftheavailable

informationfordierenttasks,suhassyntatierrordetetionorthe

aquisitionof subategorizationinformation.

Thisombinationresultsvery adequate forshallowparsingappliations

to languages like Basque, with limited grammatial resoures ompared to

otherlanguages. Thefreeorderofonstituents,amongotheraspets,would

makethedesignofafullgrammaraveryomplextask. Moreover,thepartial

grammaris enoughfor theintended appliations. The onsiderationof the

hart asthe interfae betweenboth subsystems also adds to the simpliity

ofthe ombined tool. Finally,wemustemphasizetwo mainonlusions:

Thestratiedpartialparsingapproahprovidesapowerfulwayto

on-sidersimultaneouslyinformationatmorpheme,wordandphraselevel,

adequate for agglutinative languages where the grammatial word is

notthestartingpoint ofthe analysis.

The uniationgrammar and thenite state system are

omplemen-tary. The grammar is neessary to treat aspets like omplex

agree-mentandwordordervariations,urrentlyunsolvableusingnitestate

networks, due to the exponential growth in size of the resulting

au-tomata [6℄. On the other hand, regular expressions, in the form of

automataandtransduers,aresuitableforoperationslike

disambigua-tionand ltering.

Aknowledgements

Thisworkwas partiallysupported bytheBasque Government,the

Univer-sityoftheBasqueCountry andtheCICYT. Thanksto GorkaElordietafor

(13)

(14)

[1℄ J.Abaitua. Complexprediates inBasque: from lexialformsto

fun-tionalstrutures. PhDThesis, Univ.of Manhester, 1988.

[2℄ S.Abney.Part-of-SpeehTaggingandPartialParsing.InCorpus-Based

Methods in Language and Speeh Proessing, Kluwer, Dordreht, 1997.

[3℄ S. Ait-Mokhtar, J-P. Chanod. Inremental Finite-State Parsing. In

Proeedings of ANLP-97,Washington,1997.

[4℄ I.Alegria, X.Artola, K.Sarasola,M. Urkia. Automatimorphologial

analysisofBasque. InLiterary &LinguistiComputing,Vol. 11, 1996.

[5℄ I. Alegria, J. M. Arriola, X. Artola, A.Diaz de Ilarraza, K. Gojenola,

M. Maritxalar,I. Aduriz. Morphosyntatidisambiguationfor Basque

basedon theConstraintGrammar Formalism. InRANLP, 1997.

[6℄ K. Beesley. Constraining Separate Morphotati Dependenies in

Finite-StateGrammars. InProeedings of the International Workshop

onFiniteStateMethodsinNatural Language Proessing,Ankara,1998.

[7℄ M.Brent. From Grammarto Lexion: UnsupervisedLearningof

Lexi-alSyntax. In Computational Linguistis, vol. 19(2),1993.

[8℄ T.Brisoe,J.Carroll. AutomatiExtrationofSubategorizationfrom

Corpora. InProeedings of ANLP-97,Washington,1997.

[9℄ N. Ezeiza, I. Alegria, J.M. Arriola, R. Urizar, I. Aduriz. Combining

Stohasti and Rule-Based Methods forDisambiguation in

Agglutina-tiveLanguages. InProeedings of COLING-ACL-98,Montreal, 1998.

[10℄ L.Karttunen. TheProperTreatment of OptimalityinComputational

Phonology. In Proeedings of the International Workshop on Finite

StateMethods in Natural Language Proessing, Ankara, 1998.

[11℄ C.Mellish.SomeChart-BasedTehniquesforParsingIll-FormedInput.

InProeedings of EACL-89,1989.

[12℄ E.Rohe,Y. Shabes. Finite-State Language Proessing. MIT,1997.

[13℄ A.Voutilainen. Asyntax-basedpart-of-speehanalyser. InProeedings

of EACL-95,Dublin,1995.

11

ProeedingsoftheESSLLIStudentSession1999

AmaliaTodirasu(editor).