Tạp chí Khoa học Trường Đại học Cần Thơ (Can Tho University Journal of Science), Part A: Natural Sciences, Technology and Environment: 34 (2014): 66-73
Can Tho University Journal of Science, website: sj.ctu.edu.vn
COMBINING SEMANTICS WITH THE BAG-OF-WORDS MODEL TO IMPROVE THE K NEAREST NEIGHBORS ALGORITHM IN SHORT TEXT CLASSIFICATION
Đỗ Thanh Nghị¹ and Trần Cao Đệ¹
¹ College of Information and Communication Technology, Can Tho University
Article info:
Received: 09/05/2014
Accepted: 30/10/2014
Title:
Semantic smoothing of the Bag-of-Words model for improving short text classification using k nearest neighbors
Từ khóa (Vietnamese keywords, translated):
Short text classification, bag-of-words model, semantics, k nearest neighbors
Keywords:
Text classification, Bag-of-Words, semantic smoothing, k nearest neighbors
ABSTRACT
This paper presents semantic smoothing of the Bag-of-Words (BoW) model to improve the positive class prediction of k nearest neighbors (kNN) in short text classification. The BoW model, which represents a text by counting the occurrences of each word it contains, is widely used in text classification. Its drawback is that it does not take the semantic similarity of words into account, so kNN often fails to match texts that express the same idea with different vocabulary. This leads to poor prediction of the positive class in short text classification.
We propose semantic smoothing of BoW to improve the positive class prediction of kNN. Numerical test results on a real dataset show that our approach improves positive class prediction by more than 8% while degrading negative class prediction by less than 1% for the kNN algorithm in short text classification.
SUMMARY
In this paper, we introduce an approach that integrates semantics into the bag-of-words model in order to improve the positive class prediction of the k nearest neighbors algorithm in short text classification. The bag-of-words model represents a text as a vector of the frequencies of the words it contains and is widely used today in text classification. However, the model ignores the synonymy of words, which degrades the prediction of the positive class (the class of interest) by k nearest neighbors in short text classification. We propose integrating semantics into the bag-of-words model to improve the positive class prediction of k nearest neighbors. Experimental results on a real dataset show that the proposed methods improve positive class prediction by more than 8% while reducing negative class prediction by less than 1% for the k nearest neighbors algorithm in short text classification.
1 INTRODUCTION
Text classification (Manning, 2008), (Sebastiani, 1999) automatically assigns each document to predefined topics based on its content. It is widely used in applications such as automatically labeling news items, classifying user opinions on social networks, automatically replying to e-mail, sorting mailboxes, detecting spam, and analyzing content to detect terrorist groups.
In this paper, we consider the problem of short text classification, which commonly arises in applications such as opinion classification on the Twitter social network (Liu, 2012) and checking questions/answers from interviews in enterprise resource planning applications (Do et al., 2014). These texts are usually very short (at most about 20 words) and carry very little information for machine learning models to classify them. Moreover, the bag-of-words model ignores the semantic similarity of words, so finding two similar documents in the kNN classifier requires their vocabulary to match. This degrades the prediction of the positive class (the class of interest). We propose integrating semantics into the bag-of-words model to improve the positive class prediction of kNN in short text classification. The semantic models are based on the WordNet synonym dictionary (Fellbaum, 1998), latent semantic analysis LSA (Dumais, 2004) and latent topics LDA (Blei et al., 2003). Experimental results on the dataset collected in (Do et al., 2014) show that the proposed methods improve positive class prediction by more than 8% while reducing negative class prediction by less than 1% for the kNN algorithm in short text classification.
The rest of the paper is organized as follows: Section 2 briefly presents text classification with the bag-of-words model and the kNN algorithm; Section 3 presents the methods that combine semantics with the bag-of-words model to improve kNN classification. Section 4 presents the experimental results, followed by a discussion of related work on text classification, before the conclusion and future work.
2 TEXT CLASSIFICATION WITH THE BAG-OF-WORDS MODEL AND THE KNN ALGORITHM
Text classification methods are usually based on statistical word models and machine learning algorithms (Manning, 2008), (Sebastiani, 1999).
2.1 Bag-of-words model
The input text data is unstructured, whereas the machine learning algorithms used in the next stage can usually only process data in a tabular structure (each row is a data item, each column is a dimension or attribute). To solve this problem, the bag-of-words model (Harris, 1954), (Salton et al., 1975) allows us to represent a text dataset in tabular form.
Table 1: Example of a text dataset
No.   Content                                        Class
1     support vector machines for classifying ...    Positive
2     databases management systems                   Negative
...
m     software testing                               Negative
This preprocessing step consists of lexical analysis and tokenization of the documents, followed by selecting the set of words that are important for classification and representing the text data as a table from which machine learning algorithms can learn to classify. During lexical analysis, inflected words can be reduced to their stems, and words that carry no meaning for classification, such as articles and conjunctions, can be removed. Words are then segmented using a dictionary. A document is represented as a vector with n components (dimensions), where the value of the j-th component is the frequency of the j-th word in the document. Given a set T of m documents and a dictionary of n words, T can be represented as a table D of size m×n, where the i-th row is the vector representing the i-th document.
Consider the example text dataset in Table 1. After preprocessing and representation with the bag-of-words model, we obtain a data table D structured as in Table 2, from which machine learning algorithms such as kNN can handle the classification problem.
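The construction of table D can be sketched in a few lines of Python. This is only an illustration, not the authors' implementation. The function name `build_bow` and the toy documents are hypothetical, and the documents are assumed to be already stemmed and stripped of stop words.

```python
from collections import Counter

def build_bow(docs, vocab):
    """Return the m x n term-frequency table D (one row per document)."""
    table = []
    for text in docs:
        counts = Counter(text.lower().split())
        # Component j is the frequency of the j-th vocabulary word.
        table.append([counts[word] for word in vocab])
    return table

# Toy documents, assumed to be preprocessed (stemmed, stop words removed).
docs = [
    "support vector machine",
    "database management system",
    "software testing",
]
vocab = ["support", "machine", "software", "system"]

D = build_bow(docs, vocab)
for row in D:
    print(row)
```

Each row of `D` is the bag-of-words vector of one document, and words outside the chosen vocabulary are simply ignored.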
Table 2: Bag-of-words representation of the text dataset
No.   1 (support)   2 (machine)   ...   n-1 (software)   n (system)   Class
1     1             1                   0                0            Positive
2     0             0                   0                1            Negative
m     0             0                   1                0            Negative
2.2 kNN algorithm
The kNN algorithm was proposed by Fix and Hodges in 1952. It is a very simple method that is nevertheless highly effective in data mining (Wu and Kumar, 2009). kNN is commonly used in text retrieval and text classification (Manning, 2008). Suppose we have an initial training set with 2 classes (circle, square), as in the example of Figure 1. kNN requires no training phase. When a new data item x arrives whose class must be predicted, the algorithm searches the training set for the k nearest neighbors of x (k = 5) and predicts the class of x by majority vote among the classes of these k neighbors (here the class of x is predicted to be circle).
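The majority vote described above can be sketched as follows. This is a minimal illustration rather than the authors' code. The 2-D points, the labels and the function names are made up for the example, and Manhattan distance is assumed as the distance measure.

```python
from collections import Counter

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

def knn_predict(train_X, train_y, x, k=5, dist=manhattan):
    """Predict the class of x by majority vote among its k nearest neighbors."""
    # kNN has no training phase: rank the training items by distance to x.
    order = sorted(range(len(train_X)), key=lambda i: dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training set with two classes (circle, square).
train_X = [(1, 1), (1, 2), (2, 1), (2, 2), (6, 6), (6, 7), (7, 6)]
train_y = ["circle", "circle", "circle", "circle", "square", "square", "square"]

prediction = knn_predict(train_X, train_y, (3, 3), k=5)
print(prediction)
```

In this toy set, four of the five nearest neighbors of the query point are circles, so the vote returns `circle`.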
Figure 1: kNN algorithm
During classification, kNN must find the k nearest neighbors in the training set, so the classification result depends on the distance measure. However, with the bag-of-words model described above, the neighbor search does not take semantic similarity into account, because the model ignores the synonymy of the words in a text and requires exact word matches. To see this clearly, consider an example dataset of 3 short documents (2 labeled documents d1 and d2, and a document x to be classified), shown in Table 3.
Table 3: Example dataset with 2 documents d1, d2 and a document x to be classified
No.   Content                    Class
d1    kernel machines            Positive
d2    Linux machine              Negative
x     support vector machines    ?
Table 4: Bag-of-words model of the dataset with 2 documents d1, d2 and a document x to be classified
No.   support   vector   machine   linux   kernel   Class
d1    0         0        1         0       1        Positive
d2    0         0        1         1       0        Negative
x     1         1        1         0       0        ?
Table 4 is the bag-of-words representation of the 3 documents above. To decide whether x belongs to the Positive or the Negative class, kNN finds the 1 nearest neighbor of x by computing the distance (for example, Manhattan) from x to d1 and to d2; the neighbor is the document with the smallest distance. The result is d(x, d1) = d(x, d2) = 3, so kNN cannot classify x reliably. Semantically, however, x is more similar to d1 than to d2, because x and d1 both belong to the machine learning topic.
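The tie can be checked directly with a short Python sketch. The vectors are the rows of Table 4, and the helper name `manhattan` is our own.

```python
def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Columns: support, vector, machine, linux, kernel (Table 4).
d1 = [0, 0, 1, 0, 1]
d2 = [0, 0, 1, 1, 0]
x = [1, 1, 1, 0, 0]

dist_d1 = manhattan(x, d1)
dist_d2 = manhattan(x, d2)
print(dist_d1, dist_d2)  # both distances are 3, so 1-NN cannot separate d1 from d2
```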
3 INTEGRATING SEMANTICS INTO THE BAG-OF-WORDS MODEL
To overcome the bag-of-words model's disregard for the synonymy of words, semantics must be integrated into it to improve the classification performance of kNN. Two approaches are proposed here. The first combines the WordNet synonym dictionary (Fellbaum, 1998) with the bag-of-words model. The second uses latent semantic analysis (Latent Semantic Analysis, LSA (Dumais, 2004)) and latent topics (Latent Dirichlet Allocation, LDA (Blei et al., 2003)).
3.1 Semantic model using the WordNet dictionary
Starting from the bag-of-words model with a vocabulary Voc of size n, we build a synonymy matrix S of size n×n in which each element S_ij is the synonymy relation between word w_i and word w_j in Voc.
Figure 2: BoW-WordNet semantic model and kNN for text classification
[Diagram: Text -> Pre-processing -> BoW D[m,n] -> WordNet semantic step -> Sem[m,n] -> kNN algorithm]
The synonymy matrix S is built from the WordNet dictionary. WordNet is a hierarchical semantic database of English words that provides synonymy relations between words. The semantic similarity between two words takes a value in the interval [0, 1]. If two words are absolute synonyms, WordNet returns a semantic similarity of 1.
Synonymy is integrated into the bag-of-words model by multiplying the data table D of size m×n from the bag-of-words model by the synonymy matrix S, which yields a new data table Sem of size m×n. kNN then classifies using table Sem instead of the table D of the usual bag-of-words model. Figure 2 describes the BoW-WordNet semantic model and kNN for text classification.
Returning to the example above, with a vocabulary of 5 words {support; vector; machine; linux; kernel}, the synonymy matrix S built from the WordNet dictionary is shown in Table 5.
Table 5: Synonymy matrix S of the vocabulary {support; vector; machine; linux; kernel}
S         support   vector   machine   linux   kernel
support   1.0       0.38     0.78      0.30    0.39
vector    0.38      1.0      0.38      0.21    0.47
machine   0.78      0.38     1.0       0.0     0.36
linux     0.30      0.21     0.0       1.0     0.0
kernel    0.39      0.47     0.36      0.0     1.0
Next, multiplying the data table of Table 4 (bag-of-words model) by the synonymy matrix S of Table 5 yields the data table Sem shown in Table 6. Comparing Table 4 (bag-of-words model) with Table 6 (bag-of-words model with integrated semantics), we see that in Table 4 the vector x has kernel = 0 because x does not contain the word kernel. However, the words support, vector and machine are semantically related to kernel, so after semantic integration the kernel component of x in Table 6 is greater than 0. Similarly, many 0 values in Table 4 are replaced by values greater than 0 in Table 6, because the corresponding word is semantically related to other words that appear in the document.
Table 6: Bag-of-words model with integrated semantics Sem of the dataset {d1, d2, x}
No.   support   vector   machine   linux   kernel   Class
d1    1.17      0.85     1.36      0       1.36     Positive
d2    1.08      0.58     1         1       0.36     Negative
x     2.16      1.76     2.16      0.51    1.21     ?
To classify document x using Table 6 (the data table with integrated semantics Sem), kNN finds the 1 nearest neighbor of x by computing the Manhattan distance from x to d1 and to d2. The result is d(x, d1) = 3.35 < d(x, d2) = 4.76, so x is assigned to the Positive class (the class of d1). This result shows that integrating semantics into the bag-of-words model helps kNN classify short texts more effectively.
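The worked example can be reproduced with a short Python sketch that multiplies the bag-of-words table by the synonymy matrix and compares Manhattan distances. It is an illustration only. The helper names are our own, and the values are the ones printed in Tables 4 and 5, so the last digit of a few products may differ slightly from the rounded entries of Table 6.

```python
def matmul(A, B):
    """Multiply matrix A (m x n) by matrix B (n x p), both given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length vectors."""
    return sum(abs(p - q) for p, q in zip(a, b))

# Vocabulary order: support, vector, machine, linux, kernel.
S = [
    [1.00, 0.38, 0.78, 0.30, 0.39],
    [0.38, 1.00, 0.38, 0.21, 0.47],
    [0.78, 0.38, 1.00, 0.00, 0.36],
    [0.30, 0.21, 0.00, 1.00, 0.00],
    [0.39, 0.47, 0.36, 0.00, 1.00],
]
D = [
    [0, 0, 1, 0, 1],  # d1 (Positive)
    [0, 0, 1, 1, 0],  # d2 (Negative)
    [1, 1, 1, 0, 0],  # x  (to classify)
]

Sem = matmul(D, S)  # semantic smoothing: Sem = D x S
sem_d1, sem_d2, sem_x = Sem

dist_d1 = manhattan(sem_x, sem_d1)
dist_d2 = manhattan(sem_x, sem_d2)
nearest = "d1" if dist_d1 < dist_d2 else "d2"
print(round(dist_d1, 2), round(dist_d2, 2), nearest)
```

After smoothing, the distance from x to d1 is about 3.35 and the distance to d2 is about 4.76, so the tie of the plain bag-of-words model is broken in favor of d1.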
3.2 Latent semantic analysis (LSA)
Latent semantic analysis (LSA) is a technique commonly used in natural language processing and information retrieval. LSA analyzes the relationships between a set of documents and the words they contain. It assumes that words with similar meanings tend to occur in the same contexts (similar portions of text). Starting from the data table D of size m×n (m documents, n words) obtained from the bag-of-words model, LSA uses singular value decomposition (SVD) to extract the semantic correlations between the words of the document set and reduces the number of columns (dimensions) to k latent features, producing a table R of size m×k that preserves the similarity structure among the rows.
The kNN algorithm then uses table R (latent semantics) to classify the documents. The model is described in Figure 3.
Figure 3: BoW-LSA/LDA latent semantic model and kNN for text classification
[Diagram: Text -> Pre-processing -> BoW D[m,n] -> LSA / LDA reduction -> R[m,k] -> kNN classification algorithm -> text classification]
3.3 Latent topics (LDA)
While LSA can only extract the latent synonymy of the words in a document set, another problem remains unsolved, namely the polysemy of words. To address both problems, (Blei et al., 2003) proposed the latent topic model LDA, which discovers latent semantics, including the polysemy of words, in a document set. LDA is a generative probabilistic model for discrete data (text). It assumes that each document is a mixture of several topics and that each topic is a probability distribution over words. In essence, LDA is a 3-level Bayesian model (document set, document and word) in which each of the m documents is modeled as a mixture of k latent topics and each latent topic is a multinomial distribution over the n words. The LDA model generates words and documents according to the following rule:
For each of the n_j words of document d_j,
- sample a topic z_ij from the multinomial distribution multinomial(θ_j)
- sample a word w_ij from the multinomial distribution multinomial(φ_z_ij)
where the parameters θ_j of the multinomial distribution over topics in a document and the parameters φ_k of the multinomial distribution over words in a latent topic have Dirichlet priors. Intuitively, the parameter φ_k indicates the importance of the words in topic k, and the parameter θ_j indicates the topics discovered in document d_j.
Starting from the data table D of size m×n (m documents, n words) obtained from the bag-of-words model, LDA extracts k latent topics from the set of m documents and creates a new data table R of size m×k. The kNN algorithm uses table R (latent topics) to classify the documents (see Figure 3).
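The generative rule above can be illustrated with a small Python sketch. It only simulates how LDA assumes a document is produced and does not perform the inference that extracts the topics. The vocabulary, the Dirichlet hyperparameters (0.5 and 1.0) and the function names are assumptions chosen for the example.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(n_words, theta, phi, vocab, rng):
    """Generate a document: for each word, sample a topic z, then a word from topic z."""
    words = []
    for _ in range(n_words):
        z = rng.choices(range(len(theta)), weights=theta)[0]  # z_ij ~ multinomial(theta_j)
        w = rng.choices(vocab, weights=phi[z])[0]             # w_ij ~ multinomial(phi_z)
        words.append(w)
    return words

rng = random.Random(0)
vocab = ["support", "vector", "machine", "linux", "kernel"]
k = 2  # number of latent topics (illustrative)

# phi[t]: word distribution of topic t; theta: topic mixture of one document.
phi = [sample_dirichlet([0.5] * len(vocab), rng) for _ in range(k)]
theta = sample_dirichlet([1.0] * k, rng)

doc = generate_document(8, theta, phi, vocab, rng)
print(doc)
```

Fitting LDA runs this process in reverse. Given only the observed words, it estimates θ_j for each document, and those k values form one row of table R.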
4 EXPERIMENTAL RESULTS
The experiments evaluate the effectiveness of the proposed approach, which combines semantics with the bag-of-words model to improve the positive class prediction of the kNN algorithm in short text classification.
Regarding the programs, we implemented the kNN algorithm, latent semantic analysis (LSA) and latent topics (LDA) in C/C++ on a Linux PC.
Table 7: Example of the text dataset of questions/answers from interviews in enterprise resource planning applications
No.   Content                                                     Class
1     Q: Are all your employees working at the same building?     Positive
      A: Inconsistent
      C:
2     Q: What kind of clients does the company have?              Negative
      A:
      C: Check for update
We use the short text dataset studied in (Do et al., 2014). It consists of questions/answers from interviews in enterprise resource planning applications. These texts are very short (about 20 words), as in the example of Table 7. The task is to use the content of the question (Q) and the answer (A) to check whether a Q/A pair is correct (positive) or incorrect (negative) and needs correction (C). The initial dataset has 3696 Q/A pairs; after preprocessing to remove duplicate Q/A pairs, 966 Q/A pairs remain.
We use the Libbow library of (McCallum, 1998) to preprocess the text data (removing stop words and stemming) and to build the bag-of-words model with 500 words.
The resulting dataset is a table D with 966 rows (Q/A pairs) and 500 columns (dimensions, words) in 2 classes (positive/negative): 112 items belong to the positive class (about 12%) and 854 items belong to the negative class (about 88%).
The evaluation protocol is hold-out, repeated 10 times; each time, 644 rows are drawn at random as the training set and the remaining 322 rows form the test set.
Performance is averaged over the 10 repetitions. The kNN algorithm achieves its best performance with k=3.
kNN applied to the data table obtained from the bag-of-words model is denoted kNN.
For the integration of WordNet semantics into the bag-of-words model, we use the library provided by (Seco et al., 2004); kNN applied to the resulting data table is denoted WordNet-kNN.
We use LSA and LDA to extract 50 latent features each; kNN applied to the resulting data tables is denoted LSA-kNN and LDA-kNN, respectively.
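The repeated hold-out protocol can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `holdout_splits` and the fixed seed are hypothetical, and the split is a plain random shuffle, since the text does not say whether the class proportions were preserved.

```python
import random

def holdout_splits(n_rows, n_train, repeats, seed=0):
    """Yield (train_indices, test_indices) pairs for repeated random hold-out."""
    rng = random.Random(seed)
    for _ in range(repeats):
        indices = list(range(n_rows))
        rng.shuffle(indices)
        yield indices[:n_train], indices[n_train:]

# 966 Q/A pairs, 644 for training and the remaining 322 for testing, 10 repetitions.
splits = list(holdout_splits(n_rows=966, n_train=644, repeats=10))
for train_idx, test_idx in splits[:2]:
    print(len(train_idx), len(test_idx))
```

Each classifier would be evaluated on every split, and the reported figure is the mean over the 10 test sets.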
We compare the results using the following criteria: accuracy of the positive class (Acc. Pos), accuracy of the negative class (Acc. Neg), Gmeans (the geometric mean of the positive class and negative class accuracies), global accuracy (Accuracy) and the F1 measure (van Rijsbergen, 1979). Accuracy is the number of correctly classified items of all classes divided by the total number of items. F1 is the harmonic mean of precision and recall, where precision is the number of items correctly classified as positive divided by the total number of items predicted as positive, and recall is the number of items correctly classified as positive divided by the total number of positive items.
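These criteria can be computed from the confusion counts, as in the sketch below. The function name, the label strings and the toy predictions are hypothetical. Gmeans is computed with its usual definition, the geometric mean of the two class accuracies, which is an assumption about the exact formula used in the experiments.

```python
import math

def evaluate(y_true, y_pred, positive="pos"):
    """Return Acc. Pos, Acc. Neg, Gmeans, Accuracy and F1 for a binary task."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)

    def ratio(num, den):
        return num / den if den else 0.0  # avoid division by zero on empty classes

    acc_pos = ratio(tp, tp + fn)    # equals recall of the positive class
    acc_neg = ratio(tn, tn + fp)
    precision = ratio(tp, tp + fp)
    return {
        "acc_pos": acc_pos,
        "acc_neg": acc_neg,
        "gmeans": math.sqrt(acc_pos * acc_neg),
        "accuracy": ratio(tp + tn, len(pairs)),
        "f1": ratio(2 * precision * acc_pos, precision + acc_pos),
    }

# Toy example: 4 positive and 6 negative items.
y_true = ["pos"] * 4 + ["neg"] * 6
y_pred = ["pos", "pos", "pos", "neg"] + ["neg"] * 5 + ["pos"]
scores = evaluate(y_true, y_pred)
print(scores)
```

On an imbalanced dataset such as this one (about 12% positive items), Accuracy alone is dominated by the negative class, which is why Acc. Pos, Gmeans and F1 are reported alongside it.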
The results obtained by the algorithms are shown in Figure 4. Looking at Acc. Pos, all models that use semantics (WordNet, LSA, LDA) improve positive class prediction by more than 8% compared with applying kNN directly to the bag-of-words model. However, only WordNet and LDA maintain the negative class prediction, while LSA reduces it by up to 2% compared with the original bag-of-words model. As a result, LSA-kNN scores low on F1 and Accuracy. This is because LSA does not handle the polysemy of words well. In contrast, the two other semantic models, WordNet-kNN and LDA-kNN, consistently come out ahead on F1, Gmeans and Accuracy.
Figure 4: Classification results on the short text dataset
These experimental results lead us to believe that integrating semantics into the bag-of-words model, as in WordNet-kNN and LDA-kNN, significantly improves the positive class prediction of kNN in short text classification while reducing its negative class prediction only slightly.
5 DISCUSSION OF RELATED WORK
Previously studied approaches to text classification are based on either semantic models or machine learning (Manning, 2008), (Sebastiani, 1999). M. Hearst, a leading professor of data analysis at the University of California, Berkeley, argues that the semantic approach is very complex and not very effective, whereas the machine learning approach is simple and gives good results in practice. Text classification methods are therefore usually based on statistical word models and machine learning algorithms.
Text data of varying length is represented as vectors of word frequencies (the bag-of-words model (Harris, 1954), (Salton et al., 1975)). This representation is the most common one and is used in most studies on text classification and information retrieval (Manning, 2008), (Sebastiani, 1999), (Lewis & Gale, 1994).
The resulting vocabulary can reach hundreds of thousands of words, so the text dataset becomes a table with a very large number of columns (dimensions, words).
The next step is to train a machine learning model on this data table. Commonly used models include k nearest neighbors (kNN (Fix & Hodges, 1952)), naive Bayes (NB (Good, 1965)), decision trees (Quinlan, 1993), support vector machines (SVM (Vapnik, 1995)), and ensemble methods such as Boosting (Freund & Schapire, 1995) and random forests (Breiman, 2001).
Research related to latent semantic analysis and latent topics can be found in representative works such as (Hofmann, 1999), (Dumais, 2004), (Blei et al., 2003), (Trần & Phạm, 2012).
Our own previous studies (Phạm et al., 2006, 2008), (Đỗ & Phạm, 2013) proposed ensemble algorithms based on SVM, naive Bayes and random oblique trees for the efficient classification of text data represented directly by a high-dimensional bag-of-words model.
In addition, we proposed integrating semantics through the WordNet dictionary (Fellbaum, 1998) to improve the results of expert visualization and search (Nguyen et al., 2009) and of document search (Bùi et al., 2006).
The present study, however, is set in the context of classifying short texts (containing very few words) using the bag-of-words model and simple distance-based machine learning such as kNN. (Song et al., 2014) point out the difficulties of processing short texts and survey machine learning approaches such as latent topics, SVM, kNN and NB. Our goal is not to compete with other, more complex machine learning algorithms. The main idea is to improve an approach that is common in text classification and information retrieval, namely the bag-of-words representation combined with a distance-based algorithm, which always ignores the synonymy of words.
6 CONCLUSION AND FUTURE WORK
We have presented an approach that integrates semantics to improve the short text classification performance of the kNN algorithm. Short texts carry very little information for kNN to classify them. Because the bag-of-words model currently used to represent text ignores the semantic similarity of words, two similar documents are only recognized as neighbors by kNN when their vocabulary matches. This degrades the prediction of the positive class (the class of interest). To improve the positive class prediction of kNN in short text classification, we propose integrating semantics into the bag-of-words model using the WordNet synonym dictionary, latent semantic analysis (LSA) and latent topics (LDA). Experimental results on a real short text dataset show that the proposed methods improve positive class prediction by more than 8% while reducing negative class prediction by less than 1% for the kNN algorithm in short text classification.
In the future, we intend to extend this idea to similar problems, such as the retrieval and classification of images and videos, that use a bag-of-words representation and distance-based machine learning models such as kNN.
REFERENCES
1. Blei, D., Ng, A., Jordan, M.: Latent Dirichlet allocation. Journal of Machine Learning Research 3(4-5): 993-1022 (2003).
2. Breiman, L.: Random forests. Machine Learning 45(1), 5-32 (2001).
3. Bùi, T-T., Nguyễn, D-T. and Đỗ, T-N.: Hệ thống tư vấn tài nguyên học tập. Kỷ yếu hội thảo SGK'06, Huế, pp. 1-9 (2006).
4. Do, T-N., Moga, S. and Lenca, P.: Random forest of oblique decision trees for ERP semi-automatic configuration, in Multiple Model Approach to Machine Learning, Springer (2014), pp. 25-34.
5. Đỗ, T-N., Phạm, N-K.: Phân loại văn bản: Mô hình túi từ và tập hợp mô hình máy học tự động. Tạp chí Khoa học ĐHCT, No. 28: 9-16 (2013).
6. Dumais, S.: Latent Semantic Analysis. Annual Review of Information Science and Technology Vol. 38(1): 188-230 (2004).
7. Fellbaum, C.: WordNet: An electronic lexical database. MIT Press (1998).
8. Fix, E. and Hodges, J.: Discriminatory Analysis: Small Sample Performance. Technical Report 21-49-004, USAF School of Aviation Medicine, Randolph Field, USA (1952).
9. Freund, Y. and Schapire, R.: A decision-theoretic generalization of on-line learning and an application to boosting. In: Computational Learning Theory: Proceedings of the Second European Conference, pp. 23-37 (1995).
10. Good, I.: The Estimation of Probabilities: An Essay on Modern Bayesian Methods. MIT Press (1965).
11. Harris, Z.: Distributional Structure. Word 10(2/3) (1954).
12. Hofmann, T.: Probabilistic Latent Semantic Indexing. Proceedings of the 22nd Annual International SIGIR Conference on Research and Development in Information Retrieval (1999), pp.
13. Lewis, D., Gale, W.: A sequential algorithm for training text classifiers. In: Proceedings of SIGIR (1994).
14. Liu, B.: Sentiment Analysis and Opinion Mining. Morgan & Claypool (2012).
15. Manning, C., Raghavan, P. and Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008).
16. McCallum, A.: Bow: A Toolkit for Statistical Language Modeling, Text Retrieval, Classification and Clustering. 1998. http://www-2.cs.cmu.edu/~mccallum/bow.
17. Nguyen, T-B., Lenca, P., Do, T-N. et Poulet, F.: Visualisation de réseaux d'experts. Actes du 7ème Atelier Visualisation et extraction de connaissances, EGC'09, 9èmes Journées d'Extraction et Gestion des Connaissances (2009), pp. 1-5.
18. Phạm, N-K., Đỗ, T-N., Poulet, F.: Phân loại văn bản với giải thuật Boosting PSVM. Kỷ yếu hội nghị @CNTT (2006), pp. 269-278.
19. Phạm, N-K., Đỗ, T-N., Trần, C-Đ.: Phân loại dữ liệu với giải thuật Arcx4-LSSVM. Tuyển tập công trình nghiên cứu Công nghệ Thông tin và Truyền thông, NXB KHKT (2008), pp. 72-78.
20. Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA (1993).
21. Salton, G., Wong, A., Yang, C.S.: A vector space model for automatic indexing. Communications of the ACM, Vol. 18(11): 613-620 (1975).
22. Sebastiani, F.: Machine learning in automated text categorization. ACM Computing Surveys 34(1), 1-47 (1999).
23. Seco, N., Veale, T., Hayes, J.: An Intrinsic Information Content Metric for Semantic Similarity in WordNet. Proceedings of ECAI (2004), pp. 1089-1090.
24. Song, G., Ye, Y., Du, X., Huang, X., Bie, S.: Short Text Classification: A Survey. Journal of Multimedia, Vol. 9(5): 635-643 (2014).
25. Trần, C-Đ. and Phạm, N-K.: Phân loại văn bản với máy học véc tơ hỗ trợ và cây quyết định. Tạp chí Khoa học ĐH Cần Thơ, No. (21a): 52-63 (2012).
26. Van Rijsbergen, C.J.: Information Retrieval. Butterworth (1979).
27. Vapnik, V.: The Nature of Statistical Learning Theory. Springer-Verlag (1995).
28. Wu, X. and Kumar, V.: Top 10 Algorithms in Data Mining. Chapman & Hall/CRC (2009).