
Proceedings of Research Works in Information and Communication Technology 2010 (Tuyển tập Công trình Nghiên cứu Công nghệ Thông tin và Truyền thông 2010)

SYSTEM FOR TRAINING AND EXECUTING WEBBOTS TO EXTRACT INFORMATION FROM WEBSITES

Nguyễn Huy Khánh, Nguyễn Đức Huy, Nguyễn Phạm Phương Nam, Đỗ Hoàng Cường, Trần Minh Triết
Faculty of Information Technology

University of Science, VNU-HCM
{nhkhanh, ndhuy, nppnam, dhcuong, tmtriet}@fit.hcmus.edu.vn

Abstract. Content in Web 2.0 applications is assembled from many other information sources and web components. However, existing Web 1.0 systems are not yet ready to serve as providers of information and components for building Web 2.0 applications. In this paper, we propose a method with the following features: it trains wrappers (WebBots) that are able to extract information from websites; it parameterizes the input values before a WebBot is executed; it re-executes WebBots to extract information on demand; and it exposes the extracted information through several kinds of web services. With the proposed approach, the system can easily turn existing websites into data sources, reorganize the information as required, and export the results in several standard formats (SOAP, REST).

Keywords: web information extraction; web wrapper; mashup; web automation; web service

I. INTRODUCTION

Web 2.0 [1] is no longer an unfamiliar concept, and it is more popular than ever. Today almost every web user uses some Web 2.0 application. Whether or not they know the term, users recognize that websites are no longer purely one-way sources of information like newspapers or television: users can interact with them and contribute content (Facebook, Wikipedia, blogs, ...). The content of a web page is no longer confined to its own website but can be linked with other web pages or applications.

Web 2.0 has introduced new terms such as Web API, Syndication Feed, and Mashup [1]. A Web API makes a site's information available without visiting the site itself; a Syndication Feed is a service that quickly delivers updated summaries of a site's content; and a Mashup is a tool that takes information from several different data sources in order to create a new service that combines them.

However, not every website offers a full set of web services, and the services a site does offer may not match what users need. For example, websites that present and sell consumer electronics and computer equipment, such as Thế giới di động, Phong Vũ, and Wonder Buy, all provide keyword-based product search. These search functions differ from site to site, however, and cannot be integrated into another system because they are not exposed as web service APIs. In this paper, we propose a system that extracts the required information and aggregates information on the web into API services, provided on demand to end users or to other systems.

Such systems have the following main functions: (1) locating and extracting information on web pages; (2) aggregating and organizing the collected information into some form of database that is easy to query (XML, SQL, ...); and (3) providing the corresponding services to users (SOAP and REST web services, RSS and Atom feeds, web clipping, ...). Some tools are built as an extension (add-on) of an application [2,3], so the ways they can display results and interact with the user are very limited. Others are built as APIs or frameworks [4,5]. Most of these tools aim only at re-presenting web content to the user [2,3,6] or at producing one specific kind of web service [4,5].

The proposed system lets users train a web information extraction robot (WebBot) visually, without much programming knowledge. The WebBot can then be parameterized to replay the trained actions in order to extract information from the web. The data obtained is aggregated and organized in a database that serves as the source for the web services provided later. The supported formats include common standards such as SOAP and REST web services and RSS and ATOM feeds; in addition, users can easily customize the output format for their own purposes.

II. RELATED WORK

With the explosion of Web 2.0, users play an increasingly active role on the internet.

Many systems have been developed to let users select the regions of the web they are interested in. Internet Scrapbook [6] lets users clip parts of web pages and gather them into a single personal page. However, the clipped content is only a fragment of the page's HTML code, and users cannot change the content of those components. MashMaker [2] is an AJAX application running in Firefox that allows users to modify, query, and directly display the data of a web page. MashMaker lets a user build a mashup page from data queries previously created by other people or from other websites. Marmite [3] is a Firefox extension written in JavaScript and XUL. It applies extraction operators directly to the page being displayed, links the output with other data sources in the style of Unix pipes, and can export the data to other targets such as an SQL database, a web page, or map services. However, it does not support dynamically generated pages such as the result page of a keyword search.

Other studies focus on building APIs or frameworks that transform a website into web services. H2W [4] is a framework for building the web services that can be derived from an existing web page or website. However, it cannot build every web service needed to serve all user requirements fully and accurately. Pollock [5] uses XWRAP [7] to generate wrappers that create web services from the form queries of a web page. However, Pollock does not yet provide a corresponding graphical user interface (GUI), and it requires users who are able to analyze the HTML structure of the page containing the form.

Lixto [8,9] is one of the most powerful commercial web information extraction tools available today. Lixto's Visual Developer is built on the Eclipse IDE and integrates the Mozilla browser; it is used to create wrappers and upload them to the Lixto Transformation Server managed by the company.

To address the problems above, our system provides all five functions of a web information extraction system [10]:

• Web interaction: the ability to perform operations on the web in order to reach the page containing the content to be extracted.

• Training and building wrappers (WebBots) able to extract information from one or more web pages.

• Periodically repeating the extraction to obtain the most up-to-date content.

• Transforming and aggregating data from multiple sources and storing it in a relational database.

• Delivering the extracted information in several forms of web service, such as SOAP/REST web services and ATOM/RSS feeds, or storing it in a relational database.

In the sections below, we explain the components of the system and how they are built.


III. SYSTEM OVERVIEW

To increase reusability and to make the system easy to change and upgrade, the system is built in a component-based manner, like several similar systems [11]. The whole system is divided into four layers, Data, Business, Data Presentation, and Application, as shown in Figure 1.

The Data layer, the bottom layer of the system, holds the database of the whole system. It stores the WebBot instruction files together with the results each WebBot produced for different input parameters. It also stores the identity information of the system's users, each user's list of WebBots, and other management information.

The second layer, Business, provides services to the system's components as well as basic services to users. On request, this layer interacts with the Data layer below it to supply resources to the layers above.

The Data Presentation layer converts data from the system's internal data structure into a structure that suits the user's needs. This layer also delivers services directly to users or to other systems, including Syndication Feed services, SOAP web services, and REST web services.

The top layer, Application, contains the applications that serve users as well as the system's own applications. The main user-facing application is WebBot Creator, which is used to create WebBots; these are then uploaded to the system's database through other applications such as WebBot Online Manager. The system's own application is WebBot Watcher, which, while running, continuously monitors the requests sent to the system, executes them, and writes the results back to the database for other applications to use.

Figure 1. System architecture diagram

A. Data layer

This is the main database of the whole system. It stores the following:

• The WebBots uploaded by users or by the system in order to collect data.

• The execution results of WebBots, to be served to users. Each result carries a status indicating whether the WebBot's data is accurate or must be refreshed, either because an error occurred while the WebBot was running or because the data has expired.

• User management information and the list of WebBots each user has uploaded.

• Information for managing the functions and status of the system's components.

B. Business layer

This layer provides services to the other components of the system. Its components mediate between the database and the system's applications, and they offer the following basic services:

• Uploading a user's WebBots to the system's database.

• Requesting the execution result of a WebBot for the user's parameters, and returning that result from the database.

• Determining which execution requests are due at the current time. WebBot Watcher calls this service periodically, executes the due requests, and writes the results back to the database.

• Writing WebBot execution results to the database.

C. Data Presentation layer

In this layer, the XML-structured data that holds a WebBot's result is converted into the data standards of the services offered to clients.

• ATOM Feed: converts the WebBot's result data to the ATOM feed standard.

• RSS Feed: converts the WebBot's result data to the RSS feed standard.

• SOAP Web service: the WebBot's result is provided through a SOAP web service.

• REST Web service: the WebBot's result is provided through a REST web service.
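As an illustration of the conversion step in this layer, the following Python sketch turns an XML result document into an RSS 2.0 feed. The paper does not specify the WebBot result schema, so the input layout used here (a `result` root with `item` children holding `title`, `link`, and `summary`) and the function name `result_to_rss` are assumptions made only for this example.

```python
import xml.etree.ElementTree as ET


def result_to_rss(result_xml: str, feed_title: str, feed_link: str) -> str:
    """Convert a WebBot result document (hypothetical schema) into RSS 2.0."""
    source = ET.fromstring(result_xml)

    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires title, link and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = f"Extracted from {feed_link}"

    # Each extracted record becomes one RSS <item>.
    for record in source.findall("item"):
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = record.findtext("title", default="")
        ET.SubElement(item, "link").text = record.findtext("link", default="")
        ET.SubElement(item, "description").text = record.findtext("summary", default="")

    return ET.tostring(rss, encoding="unicode")
```

An ATOM converter would follow the same pattern with ATOM's element names; keeping each output format in its own small function is one way to let users add custom formats later.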

D. Application layer

The Application layer contains the system's applications. Depending on its purpose and usage, each application is built with a different technology and runs on a different platform.

• WebBot Creator: an application with a graphical interface that lets users easily create WebBots for the layers below without much programming knowledge.

• WebBot Online Manager: an online WebBot management application that lets users upload WebBots, preview them, and read instructions for using the system's services.

• WebBot Watcher: an application running on the system's server that periodically monitors client requests, executes WebBots, and writes the results to the database when necessary.

The rest of this paper presents the important subsystems of the system.

IV. WEBBOT TRAINING SUBSYSTEM

This subsystem uses a semi-automatic method to generate WebBots [10]. Through a GUI application (Figure 2), the user performs actions on the web to reach the page containing the content to be extracted and marks the positions that hold the information. The application automatically records the user's actions so that they can be replayed. In addition, the user can parameterize the request and use more complex operations such as data grouping, condition checks, and loops, so that the information is collected completely and accurately.

A. WebBot training environment

The system must be able to identify the actions and tasks that can occur when a user interacts with an information source (that is, a website), such as typing a line of text into a textbox, choosing an option in a selectbox, or clicking a link or a button. Training a WebBot therefore requires an online environment, that is, one able to access and render the website, so that the user can work on the site itself to show the WebBot the actions it must later perform automatically to meet the user's needs.

Our solution is to make the WebBot creation tool a desktop application that records the user's actions and lets the user customize the WebBot's functions. Because all processing takes place on the user's machine, it is fast, and the tool can control almost every user action inside or even outside the web page, either by listening to the page's events or by hooking mouse and keyboard events.

B. Recording user actions

The subsystem records the user's actions so that creating a WebBot is completely transparent to the user. The user interacts with the application as with an ordinary web browser, and the application implicitly saves every action in order to build the WebBot.

The first approach is to listen to every event of every HTML element in a page. This is simple and misses no event. However, a page contains a very large number of elements while the user touches only a few of them, so attaching event handlers to all elements is slow and wastes resources. A very serious limitation of this approach concerns elements that are added automatically when the page uses AJAX [12]: no handlers are attached to them, so the user's actions on these new elements cannot be recorded.

Because the training environment runs on the user's side, a more convenient and faster alternative is to monitor mouse and keyboard events and derive from them the user's general actions on the page.

The typical actions performed while browsing are:

• Click: the user clicks an object on the page, or presses Space or Enter while an object is active. To detect this action, we look for a left-click on the object, or for an Enter or Space key press while the active object is not one used for text input.

• Select a value in a ComboBox: the user changes the value of a combobox, that is, the value held by a <SELECT> tag. The subsystem records this by listening to the onchange event of the active <SELECT> tag.

• Type text: the user fills in a value in an <INPUT> or a <TEXTAREA>. The subsystem tracks the element and compares its initial value with its new value to detect that the user has changed the content.

• Submit Form: the user presses Enter to start sending data to the website. The subsystem determines the form that contains the active element, and thus knows which form will be submitted by this action.
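The detection rules above can be sketched as a small classifier over low-level events. This is only an illustration under stated assumptions: the `RawEvent` structure, its field names, and the action labels are hypothetical, and every `<INPUT>` is treated as a text field, whereas a real recorder would also inspect the input's `type` attribute.

```python
from dataclasses import dataclass
from typing import Optional

# Tags treated as text-entry elements (simplification: ignores <input type=...>).
TEXT_TAGS = {"INPUT", "TEXTAREA"}


@dataclass
class RawEvent:
    kind: str                   # "mouse_left", "key" or "change"
    tag: str                    # upper-case tag name of the active element
    key: Optional[str] = None   # "Enter", "Space", ... for key events
    old_value: str = ""         # value before the event (change events)
    new_value: str = ""         # value after the event (change events)
    in_form: bool = False       # whether the element sits inside a <FORM>


def classify(event: RawEvent) -> Optional[str]:
    """Map a low-level event to one of the recorded actions, or None."""
    is_text_field = event.tag in TEXT_TAGS
    if event.kind == "change":
        if event.tag == "SELECT":
            return "select"                      # onchange of a <SELECT>
        if is_text_field and event.old_value != event.new_value:
            return "type"                        # value really changed
        return None
    if event.kind == "mouse_left":
        return "click"
    if event.kind == "key":
        if event.key == "Enter" and is_text_field and event.in_form:
            return "submit"                      # Enter inside a form field
        if event.key in {"Enter", "Space"} and not is_text_field:
            return "click"                       # keyboard activation
    return None
```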

C. Types of extracted data

The system supports locating and extracting most of the basic data types found on a web page:

• Text data: the text of one or more elements, or the HTML code of one or more elements.

• Images: images on a web page exist in two forms: (1) the image of an <IMG> tag, and (2) the background image of an HTML element, stored in the bgimage attribute or in the CSS background-image property. We determine the URL of the image and download its content. The same method is used to obtain the content of a file, described in the next item.

• Files: attachments that can be downloaded from the page. During extraction, sessions and cookies must be managed carefully so that content can also be collected from pages that require login or authentication.
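As a sketch of the image case, the Python example below collects image URLs from `<IMG>` tags and from inline CSS `background-image` declarations, resolving them against the page URL. It is a simplified illustration, not the system's implementation: it uses only the standard library, ignores external style sheets and the `bgimage` attribute, and the names `ImageUrlCollector` and `collect_image_urls` are invented for this example. Downloading the content would then be an ordinary HTTP request that reuses the session's cookies.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin

# Matches url(...) inside an inline "background" or "background-image" rule.
BACKGROUND_URL = re.compile(
    r"background(?:-image)?\s*:[^;]*?url\(\s*['\"]?([^'\")]+)['\"]?\s*\)",
    re.IGNORECASE,
)


class ImageUrlCollector(HTMLParser):
    """Collects absolute image URLs while the HTML is being parsed."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)  # HTMLParser lower-cases tag and attribute names
        if tag == "img" and attributes.get("src"):
            self.urls.append(urljoin(self.base_url, attributes["src"]))
        match = BACKGROUND_URL.search(attributes.get("style") or "")
        if match:
            self.urls.append(urljoin(self.base_url, match.group(1).strip()))


def collect_image_urls(html: str, base_url: str) -> list:
    collector = ImageUrlCollector(base_url)
    collector.feed(html)
    return collector.urls
```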

D. Locating HTML elements in an HTML document

1) Ordinary HTML elements

To operate on a web page or extract data from it, we must correctly locate the element concerned within the page. An element can be located in the following ways:

• By ID: if an element in a document has an ID, that ID is unique, so we can rely on it to locate the element.

• By name: in older HTML standards an element may also have a name field, but the name is not guaranteed to be unique. To locate an element by name we therefore add one more parameter: the index (order of appearance) of the element among the elements with the same name in the document.

• Through another HTML element that has already been located: if the element has neither an ID nor a name, we locate it through its ancestors. Starting from the target element, we walk up through its parents until we reach one that has already been located. The target is then identified by that located element together with the path from it down to the target.
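The three strategies can be sketched as a pair of functions: one that builds a locator for an element and one that resolves it again. This is an illustrative sketch under simplifying assumptions: the page is well-formed XHTML parsed with `xml.etree.ElementTree`, only ancestors with an ID (or the document root) are used as anchors, and the locator dictionaries and function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def build_locator(root, target):
    """Describe how to find `target` again: by id, by name+index, or by path."""
    if target.get("id"):
        return {"by": "id", "id": target.get("id")}
    name = target.get("name")
    if name:
        # Names may repeat, so also store the order of appearance.
        same_name = [e for e in root.iter() if e.get("name") == name]
        return {"by": "name", "name": name, "index": same_name.index(target)}
    # Walk up to the nearest ancestor with an id, recording child positions.
    parents = {child: parent for parent in root.iter() for child in parent}
    path, node = [], target
    while node in parents and not node.get("id"):
        parent = parents[node]
        path.append(list(parent).index(node))
        node = parent
    path.reverse()
    anchor = {"by": "id", "id": node.get("id")} if node.get("id") else {"by": "root"}
    return {"by": "path", "anchor": anchor, "path": path}


def resolve(root, locator):
    """Find the element described by a locator produced by build_locator."""
    if locator["by"] == "root":
        return root
    if locator["by"] == "id":
        return next(e for e in root.iter() if e.get("id") == locator["id"])
    if locator["by"] == "name":
        matches = [e for e in root.iter() if e.get("name") == locator["name"]]
        return matches[locator["index"]]
    node = resolve(root, locator["anchor"])
    for child_index in locator["path"]:
        node = node[child_index]
    return node
```

Storing child positions relative to a stable anchor keeps the locator valid when unrelated parts of the page change, at the cost of breaking if the anchor's own subtree is restructured.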

2) TableRow

Tabular data is very common in the presentation of web content (stock prices, gold prices, exchange rates, ...). To make it convenient to access the content of each cell, we define the following ways of locating a row of a table:

• By row index: this is the most common way to locate a row in a table.

• By the value of a designated column: this applies when the table has a column that serves as a key for its rows. For example, in a stock trading table whose first column holds the ticker symbol, we use this column to locate the other fields of the row, such as the reference price, ceiling price, and floor price.
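Both row-location strategies are easy to express once a table has been read into rows of cell texts. The sketch below is illustrative only; the function names and the sample tickers and prices in the usage note are made up.

```python
from typing import Optional, Sequence

Row = Sequence[str]


def row_by_index(table: Sequence[Row], index: int) -> Row:
    """Locate a row by its position in the table."""
    return table[index]


def row_by_key(table: Sequence[Row], key_column: int, key: str) -> Optional[Row]:
    """Locate the first row whose cell in `key_column` equals `key`."""
    for row in table:
        if len(row) > key_column and row[key_column].strip() == key:
            return row
    return None
```

For a table such as `[["AAA", "10.0", "10.7", "9.3"], ["BBB", "25.0", "26.7", "23.3"]]`, calling `row_by_key(table, 0, "BBB")` returns the second row regardless of where it appears, which is more robust than a fixed row index when the site reorders its rows.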

3) Web pages that use FRAME or IFRAME

The previous sections showed how to locate an HTML element within one HTML document. A web page, however, may contain several FRAMEs or IFRAMEs, which means that one page may consist of several HTML documents. To locate an HTML element on such a page, we must also know which document the element belongs to. We therefore additionally record the structure of the documents in the page and the way to locate the document that contains the element.

4) Interacting with AJAX components

AJAX components are components that are displayed after the page itself has finished rendering. Besides the problem of recording user actions on AJAX components, discussed above, replaying those actions is also difficult: the WebBot must know exactly when the elements have been added to the page. Furthermore, when extracting AJAX data it is hard to tell which data is the data the user actually wants.

We propose the following. To determine when an element has been added to a page whose content has already finished loading, the WebBot continuously checks for the existence of the element on the page and interacts with it only once it has been fully loaded. To determine whether the data inside the element is the data the user needs, the WebBot compares the retrieved data with a pattern of the data the user expects.
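A minimal sketch of this polling idea is shown below. It assumes a caller-supplied function `read_text` that returns the element's current text, or `None` while the element does not yet exist; that function, the regular-expression form of the "expected pattern", and the default timeout and interval are all assumptions made for the example.

```python
import re
import time
from typing import Callable, Optional


def wait_for_text(
    read_text: Callable[[], Optional[str]],
    pattern: str,
    timeout: float = 10.0,
    interval: float = 0.2,
) -> Optional[str]:
    """Poll until the element exists and its text matches `pattern`.

    Returns the matching text, or None if the timeout expires first.
    """
    expected = re.compile(pattern)
    deadline = time.monotonic() + timeout
    while True:
        text = read_text()
        # Accept only text that looks like the data the user asked for,
        # so placeholders such as "Loading..." are skipped.
        if text is not None and expected.fullmatch(text.strip()):
            return text.strip()
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)
```

Bounding the wait with a timeout keeps a WebBot from blocking forever when a site changes and the element never appears.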

5) Session and cookie

• Remembered logins: consider a WebBot that automatically sends free text messages through the Mobifone website. The WebBot must first log in with the user's phone number and then send the message. If the WebBot is run again, however, the login box is no longer there because the previous run is still logged in; only after logging out can it log in again, so the WebBot fails. We propose and support the following options: (1) while creating the WebBot, the user must click the logout button at the end of the recording; (2) cookies are deleted on every run of the WebBot; (3) the WebBot checks whether a "log out" button exists and, if so, automatically clicks it before performing its task.

• Downloading data that requires login: this issue was discussed in the item on extracting file data.

All of the issues and methods presented above are implemented in the WebBot Creator subsystem of the Application layer. Its main task is to record the user's actions while the user interacts with a website and to save those actions as an XML file.
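The paper does not publish the format of this XML file. Purely as an illustration, the sketch below saves a list of recorded actions to XML and reads it back; the element and attribute names (`webbot`, `action`, `step`, `type`) and the `{keyword}` placeholder convention for parameterized values are assumptions of this example.

```python
import xml.etree.ElementTree as ET


def actions_to_xml(bot_name, actions):
    """Serialize recorded actions (dicts with a 'type' key) to an XML string."""
    root = ET.Element("webbot", name=bot_name)
    for step, action in enumerate(actions, start=1):
        node = ET.SubElement(root, "action", step=str(step), type=action["type"])
        for key, value in action.items():
            if key != "type":
                # Remaining keys (target, value, ...) become child elements;
                # keys must therefore be valid XML tag names.
                ET.SubElement(node, key).text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_actions(document):
    """Read the actions back in recorded order."""
    actions = []
    for node in ET.fromstring(document).findall("action"):
        action = {"type": node.get("type")}
        action.update({child.tag: child.text or "" for child in node})
        actions.append(action)
    return actions
```

A recorded search might look like `[{"type": "type", "target": "q", "value": "{keyword}"}, {"type": "submit", "target": "q"}]`, where `{keyword}` marks an input that is filled in when the WebBot is executed.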

Figure 2. The WebBot Creator interface

V. EXECUTION AND RESULT EXPORT SUBSYSTEM

A. Replaying user actions

Having recorded the user's actions, a WebBot must of course be able to repeat them on later runs. The WebBot must be smart enough to perform every action a user might take while browsing the web in search of information. After any action that causes a page transition, the application must wait for the browser to switch to the new page before it can perform the next action.

The actions defined in the recording section are replayed as follows:

• Click: replaying a click is much easier than recording one. We fire the click() event on the element to be clicked.

• Select a value in a combobox: we set the value of the element to the desired value. We must also trigger the onchange event of the <SELECT> so that the page notices the change and calls the corresponding JavaScript functions.

• Type text: through the page's DOM and the IHTMLElement interface, we update the value attribute of the element.

• Submit Form: we send the form by calling the submit() function of the IHTMLFormElement interface.
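The replay rules above can be sketched as a dispatcher over the recorded actions. The `Browser` interface below is hypothetical: it stands in for whatever DOM access the host application provides (the paper mentions IHTMLElement and IHTMLFormElement) and is not a real library API. Parameter substitution with `str.format` is likewise an assumption of this example and would need escaping if values contain literal braces.

```python
from typing import Dict, List, Protocol


class Browser(Protocol):
    """Hypothetical browser-control interface used only for this sketch."""

    def click(self, target: str) -> None: ...
    def set_value(self, target: str, value: str) -> None: ...
    def fire_event(self, target: str, event: str) -> None: ...
    def submit_form(self, target: str) -> None: ...
    def wait_until_loaded(self) -> None: ...


def replay(browser: Browser, actions: List[Dict[str, str]], params: Dict[str, str]) -> None:
    """Replay recorded actions, filling {placeholders} from `params`."""
    for action in actions:
        kind, target = action["type"], action["target"]
        value = action.get("value", "").format(**params)
        if kind == "click":
            browser.click(target)
        elif kind == "select":
            browser.set_value(target, value)
            browser.fire_event(target, "onchange")  # let page scripts react
        elif kind == "type":
            browser.set_value(target, value)
        elif kind == "submit":
            browser.submit_form(target)
        else:
            raise ValueError(f"unknown action type: {kind}")
        # Any action may trigger navigation, so wait before the next step.
        browser.wait_until_loaded()
```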

B. Delivering results to users

A user here may be an ordinary person or another system. For an ordinary web user, the result must come back in a format that can be read or viewed directly in the browser. For other systems, the result must be delivered through some form of web service or another network communication protocol. In addition, the time needed to process a request and deliver the result is an important concern.

Depending on the purpose, the system can return different formats through different services, according to whether the user wants to filter relevant information or to interact with other systems.

• For web users, the system returns results as feeds, which let users quickly read headlines together with short descriptions of the topics (this is the usual form when the system takes content from the news pages of newspapers or other news and information sites).

• For other systems, the system can deliver results through web services. It supports both SOAP web services and REST web services.

To speed up processing and delivery, the system has its own database that caches frequently requested results, so that later requests are answered quickly without repeating the extraction.
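The caching idea can be sketched as follows. Note the simplifications: the paper stores cached results in a database, whereas this example keeps them in memory, and the expiry time, class name, and `run_bot` callback are assumptions introduced for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class ResultCache:
    """Caches WebBot results per (bot id, parameters) for a limited time."""

    def __init__(
        self,
        run_bot: Callable[[int, Tuple[str, ...]], str],
        ttl_seconds: float = 300.0,
        clock: Callable[[], float] = time.monotonic,
    ):
        self._run_bot = run_bot      # performs the real (slow) extraction
        self._ttl = ttl_seconds      # how long a cached result stays fresh
        self._clock = clock
        self._entries: Dict[Tuple[int, Tuple[str, ...]], Tuple[float, str]] = {}

    def get(self, bot_id: int, params: Tuple[str, ...]) -> str:
        key = (bot_id, params)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]          # fresh cached result: skip extraction
        result = self._run_bot(bot_id, params)
        self._entries[key] = (now, result)
        return result
```

The expiry corresponds to the "data has expired" status kept with each result in the Data layer: a short lifetime suits fast-changing sources such as stock prices, a longer one suits slowly changing pages.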

Figure 3. Process of delivering WebBot execution results

Following these proposals, we built the WebBot service subsystem, which exposes the following kinds of service to the outside:

• Web service: when WebBot results must be integrated into other systems, a web service is the first choice. The system supports the two most common kinds of web service in use today, SOAP web services and REST services. As soon as a WebBot has been created and uploaded, the user is given the address at which this service can be used.

• REST Web service: REST web services are increasingly popular thanks to their simplicity and high compatibility. The system provides REST services through URIs of the form www.yourwebsite.com/.../{WebBotID}/{Datatype}/{Parameters}, where

o WebBotID: the identifier of the WebBot.

o Datatype: the output data type. It can be RSS or ATOM for Syndication Feed output, Image for image output, and so on.

o Parameters: the input parameters for executing the WebBot.

• SOAP Web service: the system provides common APIs for retrieving WebBot execution results. The main APIs of the service are listed below.

Table 1. Parameters in the URI of the REST Web service

API | Description
string GetResult(int WebBot_ID, List<string> listParams) | Returns the execution result of the WebBot whose ID is WebBot_ID for the list of input parameters listParams. The result is an XML-structured string.
string GetResultXML(int WebBot_ID, string xmlParams) | Same as GetResult, but the list of input parameters is passed as an XML-structured string serialized from a List<string>.
string GetResultSingle(int WebBot_ID, string parameter) | A specialized function for retrieving simple data: it takes a single string input parameter and returns the first result, also as a string.

• Feed: ordinary users can use this service to plug results into existing feed readers or gadgets quickly and conveniently. The system turns news pages that do not support feeds into ATOM or RSS feed services, so that users can follow the news conveniently and stay up to date.

• Web clip: used to clip a region (box) of one web page and display it on another web page. After creating a WebBot, selecting the region to use, and uploading it to the system, the user receives a code snippet to place on their own website. The system then brings the chosen content from the other web page into the user's page.
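To make the REST URI structure from the list above concrete, the sketch below parses a request URI into its three parts. Several details are assumptions made for this example only: the path prefix `/webbot` stands in for the elided part of the URI, the parameters are taken to be percent-encoded path segments separated by slashes, and the names `WebBotRequest` and `parse_webbot_uri` are invented.

```python
from dataclasses import dataclass
from typing import List
from urllib.parse import unquote, urlsplit


@dataclass
class WebBotRequest:
    webbot_id: int          # {WebBotID}
    datatype: str           # {Datatype}, e.g. "RSS", "ATOM", "IMAGE"
    parameters: List[str]   # {Parameters}, one entry per path segment


def parse_webbot_uri(uri: str, prefix: str = "/webbot") -> WebBotRequest:
    """Split .../{WebBotID}/{Datatype}/{Parameters} into its components."""
    path = urlsplit(uri).path
    if not path.startswith(prefix + "/"):
        raise ValueError(f"URI is not under {prefix}: {uri}")
    segments = [unquote(part) for part in path[len(prefix) + 1:].split("/") if part]
    if len(segments) < 2:
        raise ValueError("expected at least a WebBot ID and a data type")
    return WebBotRequest(int(segments[0]), segments[1].upper(), segments[2:])
```

With these assumptions, a request for `http://www.yourwebsite.com/webbot/42/rss/funny%20cats` would be interpreted as WebBot 42, RSS output, and the single parameter `funny cats`.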

VI. EXPERIMENTS

Based on the solutions presented above, we built a complete WebBot system and put it into trial use in the following applications.

A. The Video Search website

This is a website that searches for video clips on both www.youtube.com and www.metacafe.com, using as the keyword the name of the clip the user wants to find.

To do this, we use the WebBot Training subsystem to create two WebBots, one for YouTube and one for Metacafe. For each site, the WebBot records the actions of typing into the search box and submitting the search request, and then identifies the title, thumbnail, duration, and description of each video clip in the results. The two WebBots are uploaded to the system so that they can be executed automatically and deliver results.

When a request arrives from the Video Search website, it calls the web service provided by the WebBots. With the search keyword as the parameter, the Execution and Result Export subsystem automatically runs both previously trained WebBots, one for www.youtube.com and one for www.metacafe.com, to extract the required regions of information. Finally, the combined extraction result is sent back to the Video Search website.

Figure 4. The Video Search website

B. RSS feed of world news extracted from Tuổi Trẻ

This application converts the content of the world news section of www.tuoitre.com.vn into RSS so that users can easily integrate it into other systems such as blogs and Windows gadgets. It uses the RSS output format of the WebBot system to add a feature to sites that do not yet support RSS.

The WebBot Training subsystem is used to create a WebBot that automatically navigates to the World section of the Tuổi Trẻ newspaper and extracts the list of articles, including the title, publication date, thumbnail, and summary of each.

The Execution and Result Export subsystem then runs the WebBot periodically at a fixed interval, so that the information supplied to the RSS output always reflects the latest changes on the source site www.tuoitre.com.vn.

Figure 5. RSS feed of world news from www.tuoitre.com.vn

VII. CONCLUSION

In this paper, we have presented a method for building a system that extracts information from websites, with four features: training WebBots able to extract information from websites; parameterizing the input values before a WebBot is executed; re-executing WebBots, either automatically at a fixed interval or when the user explicitly requests it, to extract information on demand; and delivering the results as several kinds of web service such as SOAP and REST web services, RSS and Atom feeds, and web clipping.

With this method, we have built a WebBot system able to provide services for applications such as extracting online stock market information while the daily trading sessions are in progress, a video search website, and market price comparison websites.

At present the WebBots still operate independently, without coordination. In the future, we will add semantic support for each WebBot and strengthen the ability of WebBots to cooperate with one another.

REFERENCES

[1] T. O'Reilly, "What Is Web 2.0," 2005.

[2] R.J. Ennals and M.N. Garofalakis, "MashMaker: mashups for the masses," Proceedings of the 2007 ACM SIGMOD international conference on Management of data, New York, NY, USA: ACM, 2007, pp. 1116-1118.

[3] J. Wong and J. Hong, "Marmite: end-user programming for the web," CHI '06 extended abstracts on Human factors in computing systems, New York, NY, USA: ACM, 2006, pp. 1541-1546.

[4] M. Tatsubori and K. Takashi, "Decomposition and Abstraction of Web Applications for Web Service Extraction and Composition," IEEE, 2006.

[5] Y. Lu, Y. Hong, J. Varia, and D. Lee, "Pollock: automatic generation of virtual web services from web sites," Proceedings of the 2005 ACM symposium on Applied computing, New York, NY, USA: ACM, 2005, pp. 1650-1655.

[6] A. Sugiura and Y. Koseki, "Internet scrapbook: automating Web browsing tasks by demonstration," Proceedings of the 11th annual ACM symposium on User interface software and technology, New York, NY, USA: ACM, 1998, pp. 9-18.

[7] L. Liu, C. Pu, and W. Han, "XWRAP: an XML-enabled wrapper construction system for Web information sources," Washington, DC, USA: IEEE Comput. Soc, 2000.

[8] R. Baumgartner, S. Flesca, and G. Gottlob, "Visual Web Information Extraction with Lixto," Proceedings of the 27th International Conference on Very Large Data Bases, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2001, pp. 119-128.

[9] R. Baumgartner, G. Gottlob, and M. Herzog, "Scalable web data extraction for online market intelligence," Proc. VLDB Endow., vol. 2, 2009, pp. 1512-1523.

[10] R. Baumgartner, W. Gatterbauer, and G. Gottlob, "Web data extraction system," Encyclopedia of Database Systems, 2009, pp. 1-9.

[11] J. Lopez, F. Bellas, A. Pan, and P. Montoto, "A Component-Based Approach for Engineering Enterprise Mashups," Proceedings of the 9th International Conference on Web Engineering, 2009, pp. 30-44.

[12] J.J. Garrett, "Ajax: A New Approach to Web Applications," AdaptivePath.com, 2005.
