Combining Chart-Parsing and Finite State
Parsing.
Conference Paper
· January 1999
Source: DBLP
CITATIONS
8
READS
14
3 authors:
Some of the authors of this publication are also working on these related projects:
PROSA - MED: Advanced semantic textual processing for the detection of diagnostic codes, procedures,
concepts and their relationships in health records
View project
Koldo Gojenola
Universidad del País Vasco / Euskal Herriko Uni…
75
PUBLICATIONS
404
CITATIONS
SEE PROFILE
Izaskun Aldezabal
Universidad del País Vasco / Euskal Herriko Uni…
42
PUBLICATIONS
228
CITATIONS
SEE PROFILE
Maite Oronoz
Universidad del País Vasco / Euskal Herriko Uni…
55
PUBLICATIONS
245
CITATIONS
SEE PROFILE
All content following this page was uploaded by Koldo Gojenola on 04 July 2014.
of the
ESSLLI Student Session
1999
Combining Chart-Parsing and Finite State
Parsing
I. Aldezabal,K. Gojenola, M. Oronoz
InformatikaFakultatea,649 P. K.Euskal Herriko Unibertsitatea,
20080 Donostia (Euskal Herria)
E-mail: jipgogaksi.ehu.es
Abstrat
Thispaperpresentsthedevelopmentofaparsingsystemthatombinesauniation-basedpartialhart-parser
with nitestatetehnology. It isbeingappliedtoBasque, anagglutinativelanguage. Ina rst phase,a
uniationgrammarisappliedtoeahsentene,givingahartasaresult. Thegrammarispartialandgives
agoodoverageofthemainelementsofthesentene,buttheoutputisambiguous. Afterthat,theresulting
hartistreatedasanautomatontowhihnitestatedisambiguationonstraintsandltersanbeappliedto
hoosethebestsingleanalysis. Thesystemhasbeentestedontwodierentappliations: theaquisitionof
subategorizationinformationforverbs,andthedetetionofsyntatierrors.
0.1 Introdution
Thispaperpresentsa projetforthedevelopment of aparsing systemthat
ombines a uniation-based partial hart-parser with nite state
tehnol-ogy. ThesystemisbeingappliedtoBasque,anagglutinativelanguage,with
freeorder among sentene omponents.
Inarstphase,auniationgrammarisappliedtoeahsentene,giving
a hart as a result. The grammar is partialbut gives a ompleteoverage
of the main elements of the sentene, suh as noun phrases, prepositional
phrases, sentential omplements and simple sentenes. It an be seen as a
shallowparser[2 ,3℄thatanbeusedforsubsequentproessing. However,it
ontainsbothmorphologialansyntatiambiguities,givingahugenumber
ofdierent interpretations persentene.
After that, the resulting hart is treated as an automaton to whih
-nite state disambiguation onstraints and lters an be applied, so that
unwanted readings or information an be disarded or parts of a sentene
anbehosen,dependingontheappliation. Thesystemhasbeentestedon
two dierent appliations: the extration of subategorization information
fora givenverb, and thedetetion of syntatierrors.
In this manner, we ombine onstrutive and redutionistiapproahes
to parsing: inarstphase theuniation-grammarobtainssyntatiunits,
usually with many dierent interpretations, while in a seond phase the
analysesarerestritedaordingto theontext andthekindofappliation.
The remainder of this paper is organized as follows. In Setion 2 we
presentadesriptionofthehart-parser. Setion3desribesthenitestate
applia-0.2 The hart parser
0.2.1 Previous work
Thesystemreliesondierentwide-overage toolsthathavebeendeveloped
forBasque:
The LexialDatabaseforBasque [4 ℄. Itisa largerepositoryof lexial
informationwithabout70.000entries(60.000lemmas),eahone with
its assoiatedlinguistifeatures: ategory,subategory,ase, number,
deniteness, mood and tense, among others. As this database is the
basisofthesyntatianalyzer,wemustsaythatthereisnoinformation
related to verbsubategorization.
A morphologial analyzer [4℄. The analyzer applies Two-Level
Mor-phology forthemorphologial desriptionand obtains,foreah word,
itssegmentation(s)intoomponentmorphemes. Themoduleisrobust
and hasfulloverage of free-runningtexts inBasque.
A morphologial disambiguator based on boththe Constraint
Gram-mar formalism [13 , 5℄ and a statistial tagger [9℄. This tool redues
the high word-level ambiguity from 2.65 to 1.19 interpretations, but
stillleavesa numberofdierent interpretationspersentene.
0.2.2 The syntati grammar
Basque being an agglutinative language, linguisti information like ase,
number and determination are given by means of morphemes appended
to the last element of a noun/prepositional phrase. Following the most
extendedlinguistidesriptionsforBasque,we tookmorphemes(seeFigure
1)asthebasiunitsuponwhihsyntatianalysisisbased,departingfrom
thetraditionaluseofthegrammatialwordasthesyntatiunit,asisdone
innon agglutinativelanguages likeEnglish. Althoughthegureonlyshows
the ategory of eah lexial/syntati unit, there is a rih information for
eahofthem,enodedintheformoffeaturestrutures. Moreover, afterthe
syntati unitsareobtained, theanalysis tree isnotusedanymore.
ThePATR-IIformalismwasusedforthedenitionofthesyntatirules.
Therewere two mainreasonsforthiseletion:
Simpliity. Thegrammaris notlinkedto alinguistitheory,likeLFG
or HPSG,asthere hasnotbeenabroad desriptionfor Basqueusing
those formalisms [1℄. Moreover, theirappliation would require
infor-mation notavailableat the moment, suh as verb subategorization.
PATR-II is more exibleat the ost of extra writing, as it is dened
mendi 0 ko etxe polit 0 an
mountain the of/at house nice the in
noun sing genitive noun adj sing inessive
DECLENSION
NMODIFIER
NP1
PP
NP2
DECLENSION
Figure 1: Parse tree for mendiko etxe politan ('in the nie house at the
mountain')
The formalism is based on uniation. This is useful for the
ma-nipulation of omplex linguisti strutures for the representation of
grammatial onstituents. We must stress that the grammar uses
good linguisti granularity, in the sense that we keep all the
linguis-tially relevant morphosyntati information, very rih ompared to
mosthunkingsystems. Thisfat willenableusto useitasa general
toolfordierent appliations.
Thegrammaratthemomentontains120rules. Thereisanaveragenumber
of15equationsperrule,some ofthemfortestingonditionslikeagreement,
and othersforstruturebuilding. The mainphenomenaovered are:
Nounphrases and prepositional phrases. Agreement among the
om-ponentelementsis veried, addedto theproperuse ofdeterminers.
Subordinate sentenes, suh as sentential omplements (ompletive
lauses,indiretquestions,...) and relativelauses.
Simplesentenesusingall thepreviouselements. Therih agreement
betweentheverband the mainsenteneonstituents(subjet, objet
and seondobjet)inase, numberand personis veried.
The omponentsfound by theparser are very reliable,as onlywell-formed
elementsareobtained,whihareneessaryforafullsenteneinterpretation.
Duetothevarietyof syntatistruturesfoundinreal sentenes,the
gram-mar is limited in the sensethat onlya smallfration of the sentenes will
reeive a full analysis. However, onstruting a omplete grammar would
be a ostly enterprise. For that reason, this grammar is applied
bottom-up,resulting ina parser that obtainspiees of analysesor hunks, and the
resultinghart an be usedforsubsequent proessing.
...
Figure2: State of thehartafter theanalysisof Mendiko etxe politan ikusi
dutnik ('Ihave seen (it)inthenie houseat themountain')
is found, with the aim of orreting a syntati error. In fat, this
sys-temassumedthat every orretsentene getsa ompleteanalysis,whilewe
onsiderthat quite often the grammar will notbe able to parse the whole
sentene. Even inthease whenwe want todetet agrammatial error,we
donotassume thatthe orreted sentene wouldgive a ompleteparse.
0.3 The Finite State Parser
In reent years the eld of nite state systems has gained muh attention
[12 ,13 ℄. Aharateristithatallnitestatesystemshaveinommonisthat
theyworkonrealtextsdealingwithunitsbiggerthanthegrammatialword,
usingmore elaborated information,butwithoutreahing theomplexityof
fullsenteneanalysis.
Astheoutputofourhartparserisambiguous,weobtainalargenumber
ofpotentialanalyses(onatenationsofhunksoveringthewholesentene).
Weneededatoolfortheseletionofpatternsoverthenalhart,andnite
state tehnology is adequate for this task. Currently we use the Xerox
FiniteState Tool(XFST 1
). Asalinkbetweenbothparsers,thehartmust
beonvertedintoanautomaton,whihanthenbeproessedbyXFST(see
Figure 2). In the gure, dashed linesare used to indiate lexial elements,
whileplainlines denesyntati units(thoseobtained by thehart-parser,
see Figure 1). The bold irles represent word-boundaries, and the plain
ones delimit morpheme-boundaries. As before, eah ar in the gure is
represented by its morphosyntati information, in the form of a sequene
offeature-valuepairs.
Inthe dierent appliations, we have usedsome ommonoperations:
Disambiguation. Twokindsofambiguitywereonsidered:
morphosyn-tati ambiguityleft by the Constraint Grammar and stohasti
dis-ambiguation proesses and that introdued by the hart parser. As
wholesyntatiunitsanbeused,thisproessissimilartothatofCG
disambiguationwiththeadvantageofbeingabletoreferenesyntati
unitslargerthanthe word.
Filtering. Dependingoneahappliation,sometimesnotallthe
avail-able informationis relevant. For example, one diÆultkindof
ambi-guity,noun/adjetive,asinzuriekin (zuri='white','withthewhites'/
'with the white ones') an be ignored when aquiring verb
subate-gorization information, as we are mainly interested in the syntati
ategory and thegrammatial ase (prepositionalphraseand
ommi-tative, respetively), thesame inbothalternatives.
Extrating partsof a sentene. Theglobal ambiguityof a sentene is
onsiderablyredued ifonly part of it is onsidered. Forinstane, in
thease of extrating verb subategorization information, some rules
examine the ontext of the target verb and dene the sope of the
subsenteneto whihtheprevious operationswillbeapplied.
Among the nite state operators used (see Figure 3 2
), we apply
omposi-tion,intersetionandunionofregularexpressionsandtransduers 3
. Weuse
bothordinaryompositionandthereentlydenedlenientomposition [10 ℄.
This operator allows the appliation of dierent eliminating onstraints to
asentene,alwayswiththeertaintythatwhensome givenonstraint
elim-inatesall the interpretations, thenthe onstraint is notappliedat all, that
is,theinterpretationsare"resued". Astheoperatoris denedinterms of
regularrelations itan be omposed withother automata.
0.4 Appliations
0.4.1 Aquisition of subategorization information
We are interested in automatially obtaining verb subategorization
infor-mationfrom orpora[7, 8℄. Ourobjetive is to enrih theverbentries
on-tainedinthelexialdatabasewithinformationaboutthemainonstituents
(noun phrases, prepositional omplements and subordinate sentenes)
ap-pearingwitheah verb.
Thehartparseranalyzesthemainunitsineahsenteneandthennite
state rulesarry outthefollowingtasks(see Figure 3):
Reognition of therelevant subsentene. The mean numberof words
persenteneis22,rangingfromone tosixsubsentenes. Theproblem
isthat ofdelimiting thepartorrespondingto the target verb.
2
The .o. and & operators denote respetively the omposition and intersetion of
regularlanguagesandrelations. 3
Sentence
Chart parser
chart (automaton)
.o.
SubSentenceRecognizers
.o.
DisambiguatingFilters
.o.
LongestPhraseFilters
.o.
OutputCleaningFilters
verb + elements
No Error / Error Type(s)
.o.
ErrorPattern & NotException
.o.
OutputCleaningFilters
Figure 3: Dierent appliationsof thesystem
Disambiguation. Some heuristisareused to solve theremaining
am-biguities, related,amongothers, to dierent typesof subordination.
Seletion of the longest NP/PPs. This heuristi proved to be very
reliable to exlude multiplereadingsof asingle NP/PP.
Cleaning the output. The result of the previous phase ontains a
lot of morphsyntati information, but most of it is superuous for
thisappliation,as we aremainlyinterested inthegrammatial ase,
numberand subordinationtype of eahelement.
Thedesignofthenitestaterulesisanon-trivialtaskwhendealingwith
real texts. Nevertheless, the delarativeness of nite state tools improves
signiantly our previous version of the system implemented by means of
ad ho programs, diÆult to orret and maintain. Moreover, XFST also
improves eÆieny. About 350 nite state denitions were made for this
appliation.
As a rst experiment, 500 sentenes for eah of 5 dierent verbs were
seleted. Figure 4 shows theases that have mostlyappeared around eah
of the seleted verbs. When there was more than a single analysis for a
subsentene, it was not taken into aount. This eliminated about 5% of
thesentenes. Thisisnotaproblemaslongastheorpusisbigenough[7℄.
Theabsolutive ase (theone that usuallyrepresentsboththeobjetof the
0
Figure4: Frequenyofelementsforeah verb
to haraterize the dierent behaviour of the verbs. On the other hand,
we ndases that seem to have lose relationship to a given verb, suh as
the ablative ase ('from') in the atera verb ('to go out'), the alative ase
('to') in the joan verb ('to go') and the inessive ase ('in') in the agertu
verb ('to appear'). Both the ablative and alative ases usually represent
respetively the soure and the goal of an ation, whih means that both
represent a movement. As for the inessive ase, it represents the
spatio-temporal oordinates of an entity or event. That is why it appears in all
theverbs. Finally,thereareases thatshowanexlusivetendenytowarda
givenverb,suh astheompletivesubordinate-la ('that') intheikusi verb
('to see'), and theadverbial subordinate-tzeko ('for', 'so that').
Allthesepropertiesmakelearthatverbstakeotherrelevantasesapart
from the ones representing the objet and subjet of the sentenes, and
thereforeit doesnotstrike usasimplausibleto assume thattheyshouldbe
taken into aount inthe mainstruture of the verbs. In other words, the
systemanhelp usindeidingwhether agivenphraseanbeonsideredas
an argument forthetreatedverb.
Added to this experiment, we are also working on getting statistial
patterns from thedierent ombinations ofthe elementsappearingineah
sentene. Wewillapplythesystemto8.000sentenes(about200.000words)
orresponding to 10 new verbs. In thisway, we expet to obtain patterns
that willlet us groupall verbs indierent lasses. It will be also a wayto
verifythe suggestionsand similaritiesgot from theaboveexperiment.
Regarding evaluation, there are not already existing resoures (in the
form of annotated orpus or ditionaries) with subategorization
informa-tionto automatiallyomparewiththeresultsofthetool, asin[7,8℄. Asa
sen-Conerningthereognitionofsubsentenesto whiheahverb applies,it is
done orretly 82% of the times. Sometimesthe subsentene is easily
re-ognizable(forexample, whendelimitedbypuntuationmarks), whileother
ases are diÆult due to the nested struture of omplex, long sentenes.
We also ounted the number of times that the verb is orretly returned
withall its relevant syntati omponents (without examiningtheir role as
arguments or adjunts). This was performed orretly in 70% of the
sen-tenes. Afterexaminingtheglobalerrorrate(30%,thatis,thetotalnumber
oferrorsinludingtheinorretlyreognizedsubsentenes), wefoundroom
forimprovement: halfoftheerrorswereduetothepreviousdisambiguation
proess (either the inorret seletion was hosen or there was more than
an alternative), while a third of errors were aused by the inompleteness
of the uniation grammar. Although these problems will always be error
soures, we feel that after adding some simple and general linguisti rules
theywillorret mostof theerrors.
Subsentence
recognition
Syntactic elements
Correct
Incorrect
Correct
Incorrect
82%
18%
70%
30%
Table1. Resultsofthesystemapplied to50sentenes.
0.4.2 Syntati error detetion
The system has been tested on errors in real texts written by learners of
Basque. As a preliminary experiment, we hose errors related with
date-expressionsfordierent reasons:
The ontext ofappliation iswideand well-denedinBasque,aswell
as themost ommonerrors. In Basque,date-expressions like
'Donos-tian,1995ekomaiatzaren15ean'(Donostia, 15thofJuly,1995)require
that some of the elements are ineted, sometimes aording to
an-other ontiguous element.
As it is relatively easy to obtain test data we eliminate one of the
mainproblemswhendealingwithsyntatierrorsinrealtexts: nding
erroneous sentenesforeah phenomena.
Inorder to reognize 6 dierenterror types, we needed more than 100
def-initionsof nitestate patterns fortheir treatment. We have useda orpus
onsisting of 70 dates (inluding orret and inorret ones) for the
de-nitionof the patterns. Apart from the sixkinds of errors, we also tried to
aountforthemostfrequentombinationsoferrors. Atthemomentweare
again a exibleand systemati way to solve it. We plan to extend the
sys-tem to other typesof errors, like agreement between themainomponents
ofthesentene,whihisvery rihinBasque,beingasoureofmanyerrors.
0.5 Conlusions
Thisworkpresentsthedevelopment of asystemwhihombines:
AsyntatigrammarforBasque. Itoversthemainomponentsofthe
sentene. Dueto theagglutinativenature ofBasque,theintrodution
of the syntati level greatly improves the abilityto treat the wealth
of information ontained in eah word/morpheme. The reognized
omponentsan beusedfor posteriorproessing.
Finite state rules. They provide a modular, delarative and exible
workbenhtodealwiththeresultinghart,makinguseoftheavailable
informationfordierenttasks,suhassyntatierrordetetionorthe
aquisitionof subategorizationinformation.
Thisombinationresultsvery adequate forshallowparsingappliations
to languages like Basque, with limited grammatial resoures ompared to
otherlanguages. Thefreeorderofonstituents,amongotheraspets,would
makethedesignofafullgrammaraveryomplextask. Moreover,thepartial
grammaris enoughfor theintended appliations. The onsiderationof the
hart asthe interfae betweenboth subsystems also adds to the simpliity
ofthe ombined tool. Finally,wemustemphasizetwo mainonlusions:
Thestratiedpartialparsingapproahprovidesapowerfulwayto
on-sidersimultaneouslyinformationatmorpheme,wordandphraselevel,
adequate for agglutinative languages where the grammatial word is
notthestartingpoint ofthe analysis.
The uniationgrammar and thenite state system are
omplemen-tary. The grammar is neessary to treat aspets like omplex
agree-mentandwordordervariations,urrentlyunsolvableusingnitestate
networks, due to the exponential growth in size of the resulting
au-tomata [6℄. On the other hand, regular expressions, in the form of
automataandtransduers,aresuitableforoperationslike
disambigua-tionand ltering.
Aknowledgements
Thisworkwas partiallysupported bytheBasque Government,the
Univer-sityoftheBasqueCountry andtheCICYT. Thanksto GorkaElordietafor
[1℄ J.Abaitua. Complexprediates inBasque: from lexialformsto
fun-tionalstrutures. PhDThesis, Univ.of Manhester, 1988.
[2℄ S.Abney.Part-of-SpeehTaggingandPartialParsing.InCorpus-Based
Methods in Language and Speeh Proessing, Kluwer, Dordreht, 1997.
[3℄ S. Ait-Mokhtar, J-P. Chanod. Inremental Finite-State Parsing. In
Proeedings of ANLP-97,Washington,1997.
[4℄ I.Alegria, X.Artola, K.Sarasola,M. Urkia. Automatimorphologial
analysisofBasque. InLiterary &LinguistiComputing,Vol. 11, 1996.
[5℄ I. Alegria, J. M. Arriola, X. Artola, A.Diaz de Ilarraza, K. Gojenola,
M. Maritxalar,I. Aduriz. Morphosyntatidisambiguationfor Basque
basedon theConstraintGrammar Formalism. InRANLP, 1997.
[6℄ K. Beesley. Constraining Separate Morphotati Dependenies in
Finite-StateGrammars. InProeedings of the International Workshop
onFiniteStateMethodsinNatural Language Proessing,Ankara,1998.
[7℄ M.Brent. From Grammarto Lexion: UnsupervisedLearningof
Lexi-alSyntax. In Computational Linguistis, vol. 19(2),1993.
[8℄ T.Brisoe,J.Carroll. AutomatiExtrationofSubategorizationfrom
Corpora. InProeedings of ANLP-97,Washington,1997.
[9℄ N. Ezeiza, I. Alegria, J.M. Arriola, R. Urizar, I. Aduriz. Combining
Stohasti and Rule-Based Methods forDisambiguation in
Agglutina-tiveLanguages. InProeedings of COLING-ACL-98,Montreal, 1998.
[10℄ L.Karttunen. TheProperTreatment of OptimalityinComputational
Phonology. In Proeedings of the International Workshop on Finite
StateMethods in Natural Language Proessing, Ankara, 1998.
[11℄ C.Mellish.SomeChart-BasedTehniquesforParsingIll-FormedInput.
InProeedings of EACL-89,1989.
[12℄ E.Rohe,Y. Shabes. Finite-State Language Proessing. MIT,1997.
[13℄ A.Voutilainen. Asyntax-basedpart-of-speehanalyser. InProeedings
of EACL-95,Dublin,1995.
11
ProeedingsoftheESSLLIStudentSession1999
AmaliaTodirasu(editor).
Chapter0,Copyright1999, .