Pham Minh Chuan et al., Tap chi KHOA HOC & CONG NGHE 173(13): 45-50
CO-AUTHORSHIP LINK PREDICTION USING SEMI-SUPERVISED FUZZY CLUSTERING
Pham Minh Chuan^1,2,*, Tran Dinh Khang^1, Le Thanh Huong^1, Tran Manh Tuan^3, Le Hoang Son^4
^1 Hanoi University of Science and Technology, ^2 Hung Yen University of Technology and Education, ^3 Thuyloi University, ^4 VNU University of Science, Vietnam National University, Hanoi
ABSTRACT
In this paper, we propose an approach to the link prediction problem in co-authorship networks based on semi-supervised fuzzy clustering. The study aims to identify authors who are likely to link with each other in the near future, based on the relationships that already exist among them. Authors who have written papers together, or who have written similar papers, are highly likely to form links in the future. The paper builds a new model based on semi-supervised fuzzy clustering over collected data on past collaboration among authors. The model is evaluated and compared with related algorithms. The experimental results indicate that the proposed model achieves higher quality than the algorithms it is compared with.
Keywords: prediction, co-authorship network, semi-supervised clustering, measures, co-authorship link
INTRODUCTION
With the strong growth of the Internet and of social networks, people are connected to each other even when they are far apart. Online Homogeneous Undirected Social Networks (OHUSNs) are a type of social network in which all entities belong to the same type and the links between entities are undirected and of a single kind. Like other social networks, OHUSNs build a large community of users and offer them certain utilities: entertainment, connecting with friends, sharing resources, and exchanging work. Although they have their own characteristics, OHUSNs still contain a volume of data that is exchanged every day and has easily recognizable features [1].
In the scientific community, links between authors are genuinely necessary and attract attention. Authors who have collaborated to publish results or works tend to collaborate again in the near future. Future collaborations therefore rely mainly on the results that already exist among the authors. However, new links between authors can also appear if they have collaborated with some common author. On this basis, predicting the links of a co-authorship network in the near future is entirely feasible by studying the existing collaborations among the authors in the network.
In 2010, supervised link prediction based on multiple sources was studied by Lu et al. [2]. In 2011, Hasan and Zaki [3] surveyed link prediction in social networks and identified several families of tools: feature-based link prediction, prediction based on classification models or Bayesian probabilistic models, and probabilistic relational models. Some recent studies show that semi-supervised fuzzy clustering algorithms are very effective in many fields such as image processing [4], pattern recognition, face recognition [5], risk assessment [6], and bankruptcy prediction [7].
In this paper, we propose a semi-supervised clustering method for link prediction in co-authorship networks. The steps of the algorithm are presented together with the results of an implementation on a specific dataset. The experiments evaluate the proposed algorithm using the Precision, Recall and F-Measure measures.
The rest of the paper is organized as follows: Section 2 provides the background knowledge. Section 3 presents the use of semi-supervised fuzzy clustering to solve the link prediction problem in co-authorship networks. Section 4 gives experimental results based on the collected data. Finally, Section 5 draws conclusions from the study.
BACKGROUND
The new link prediction problem
Figure 1. Illustration of a co-authorship network
Definition 1.
A co-authorship network is denoted $G^T = (V^T, E^T, P^T, T)$, where $T = \{t_1, t_2, \ldots, t_K\}$ is a set of consecutive time stamps ($t_i < t_j$, $i < j = 1..K$), $V^T = \{v_1, v_2, \ldots, v_N\}$ is a set of nodes (authors), $P^T = \{p_1, p_2, \ldots, p_M\}$ is the set of papers, and $E^T = \{(v_i, v_j, p_k, t_h) : v_i, v_j \in V^T,\ v_i \neq v_j,\ p_k \in P^T,\ t_h \in T\}$ is the set of links. K, N and M are the number of time stamps, the number of authors and the number of papers, respectively.
We illustrate Definition 1 with the example co-authorship network shown in Figure 1. In this example, the network consists of 8 authors (N = 8) and 10 papers (M = 10), and the papers were published from 2000 to 2002 (K = 3). The total number of links (collaborations) is 22.
The new link (collaboration) prediction problem is to predict whether pairs of authors who have never collaborated in the past will collaborate with each other in the future. For example, in Figure 1 the two author pairs (5, 6) and (5, 8) did not collaborate during 2000-2002; will they collaborate in the following years?
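The structure in Definition 1 and the candidate pairs of the prediction problem can be sketched in a few lines of Python. This is only an illustration under assumptions: the `papers` records below are hypothetical toy data (they are not the network of Figure 1), and the helper names `build_links` and `unlinked_pairs` are introduced here for the example.

```python
from itertools import combinations

# Hypothetical toy records (paper id, year, author ids); NOT the data of Figure 1.
papers = [
    ("p1", 2000, [1, 2, 3]),
    ("p2", 2001, [2, 4]),
    ("p3", 2002, [3, 4, 5]),
]

def build_links(papers):
    """Return the link set E as tuples (v_i, v_j, paper, year) with v_i < v_j."""
    links = []
    for pid, year, authors in papers:
        for u, v in combinations(sorted(set(authors)), 2):
            links.append((u, v, pid, year))
    return links

def unlinked_pairs(papers):
    """Author pairs with no joint paper so far: the candidates for prediction."""
    authors = sorted({a for _, _, auth in papers for a in auth})
    linked = {(u, v) for u, v, _, _ in build_links(papers)}
    return [pair for pair in combinations(authors, 2) if pair not in linked]

links = build_links(papers)
candidates = unlinked_pairs(papers)
print(len(links))   # 7
print(candidates)   # [(1, 4), (1, 5), (2, 5)]
```

Each paper contributes one link per pair of its authors, so a pair that writes several papers together appears several times in `links`, which is what the link weights in the next subsection build on.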
Link-weight similarity measures
The link-weight similarity is computed from the degree of linkage between two nodes of the co-authorship network; here $\omega(u, v)$ denotes the degree of linkage between the two nodes u and v.
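Definition 2. (Weighted Common Neighbours: WCN) [8].
$$SIM_{WCN}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{\omega(u, z) + \omega(v, z)}{2} \qquad (1)$$
where $\Gamma(u)$ denotes the set of neighbours of node u.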
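Definition 3. (Weighted Adamic-Adar: WAA) [8].
$$SIM_{WAA}(u, v) = \sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{\omega(u, z) + \omega(v, z)}{2} \cdot \frac{1}{\log \sum_{z' \in \Gamma(z)} \omega(z', z)} \qquad (2)$$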
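Definition 4. (Weighted Jaccard Coefficient: WJC) [9].
$$SIM_{WJC}(u, v) = \frac{\sum_{z \in \Gamma(u) \cap \Gamma(v)} \frac{\omega(u, z) + \omega(v, z)}{2}}{\sum_{a \in \Gamma(u)} \omega(u, a) + \sum_{b \in \Gamma(v)} \omega(v, b)} \qquad (3)$$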
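As a hedged illustration of the three measures, the sketch below computes them on a small weighted graph. It assumes the commonly used weighted forms: WCN sums the mean of the two link weights over common neighbours, WAA divides each such term by the logarithm of the common neighbour's total link weight, and WJC divides the WCN sum by the total link weight of both nodes. The graph `omega`, the function names, and the guard that skips neighbours whose total weight is at most 1 are assumptions made for this example, not details taken from the paper.

```python
import math

# Hypothetical weighted graph: omega[u][v] is the link weight between u and v.
omega = {
    "a": {"c": 2.0, "d": 1.0},
    "b": {"c": 4.0, "d": 1.0},
    "c": {"a": 2.0, "b": 4.0},
    "d": {"a": 1.0, "b": 1.0},
}

def common(omega, u, v):
    """Common neighbours of u and v."""
    return set(omega[u]) & set(omega[v])

def wcn(omega, u, v):
    """Weighted Common Neighbours: mean weight of the two links to each common neighbour."""
    return sum((omega[u][z] + omega[v][z]) / 2 for z in common(omega, u, v))

def waa(omega, u, v):
    """Weighted Adamic-Adar (assumed form): WCN terms damped by log of the neighbour's strength."""
    total = 0.0
    for z in common(omega, u, v):
        strength = sum(omega[z].values())
        if strength > 1:  # assumption: skip neighbours where log(strength) <= 0
            total += (omega[u][z] + omega[v][z]) / 2 / math.log(strength)
    return total

def wjc(omega, u, v):
    """Weighted Jaccard (assumed form): WCN sum over the total link weight of u and v."""
    denom = sum(omega[u].values()) + sum(omega[v].values())
    return wcn(omega, u, v) / denom if denom else 0.0

print(wcn(omega, "a", "b"))  # 4.0
print(wjc(omega, "a", "b"))  # 0.5
```

The three functions take the weight table as an argument, so the same code can be reused with any of the link-weight definitions discussed next.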
In a co-authorship network, the degree of linkage between two nodes u and v (denoted $\omega(u, v)$) can be determined in the following three ways:
a) Method 1
The degree of linkage between two authors u and v is determined by the number of papers that the two authors have written together. This formula was proposed by Murata and Moriyasu [8] as follows:
$$\omega(u, v) = n_{uv} \qquad (4)$$
b) Method 2
In [10], the degree of linkage between two authors is computed as the sum of the weights associated with each paper co-written by the two authors:
$$\omega(u, v) = \sum_{i} \frac{\delta_u^i \, \delta_v^i}{n_i - 1} \qquad (5)$$
where $\delta_u^i$ equals 1 if author u took part in writing paper i and 0 otherwise, and $n_i$ is the number of authors of paper i.
c) Method 3
In [11], the weight between two authors is computed from the positions of the authors in the paper and the time at which the paper was published. Consider two authors u and v in the author list of a paper, with respective positions $d_u$ and $d_v$. Assume that $d_u > d_v$ and that the paper has more than one author. The degree of linkage between the two authors u and v in the paper, DCL(u, v), is then computed by the following formula.
(6)
Assume that the two authors u and v have co-written P papers. The degree of linkage between the two authors is then computed by formula (7):
$$\omega(u, v) = \sum_{p=1}^{P} DCL(d_u^p, d_v^p) \cdot k(t_p) \qquad (7)$$
where $d_u^p$ is the position of author u in paper p, $t_p$ is the time at which paper p was reviewed or accepted for publication, and $k(t_p) = \frac{t_p - t_0}{t_c - t_0}$, with $t_0$ the first time the two authors collaborated and $t_c$ the current time.
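The three ways of weighting a link can be sketched as follows. Several points are assumptions made for the example rather than details confirmed by the text: Method 2 is taken to add 1 / (n_i - 1) for each joint paper with n_i authors, the time factor is taken to be (t_p - t_0) / (t_c - t_0), and the position-based term DCL is left as a caller-supplied function `dcl` because its formula is not reproduced here. The `papers` list and all function names are hypothetical.

```python
# Hypothetical papers: (year, ordered author list).
papers = [
    (2000, ["u", "v"]),
    (2001, ["u", "v", "w"]),
    (2002, ["v", "w"]),
]

def weight_count(papers, u, v):
    """Method 1: number of papers written jointly by u and v."""
    return sum(1 for _, authors in papers if u in authors and v in authors)

def weight_shared(papers, u, v):
    """Method 2 (assumed form): each joint paper with n_i authors adds 1 / (n_i - 1)."""
    return sum(1.0 / (len(a) - 1) for _, a in papers if u in a and v in a)

def time_factor(t_p, t_0, t_c):
    """Assumed k(t_p): 0 at the first collaboration, 1 at the current time."""
    return (t_p - t_0) / (t_c - t_0) if t_c != t_0 else 1.0

def weight_positional(papers, u, v, dcl, t_c):
    """Method 3: sum of dcl(d_u, d_v) * k(t_p) over joint papers.

    dcl is supplied by the caller; positions are 1-based ranks in the author list.
    """
    joint = [(t, a) for t, a in papers if u in a and v in a]
    if not joint:
        return 0.0
    t_0 = min(t for t, _ in joint)
    return sum(
        dcl(a.index(u) + 1, a.index(v) + 1) * time_factor(t, t_0, t_c)
        for t, a in joint
    )

print(weight_count(papers, "u", "v"))   # 2
print(weight_shared(papers, "u", "v"))  # 1.5
```

Passing `dcl` as a parameter keeps the sketch usable with whatever position-based definition is adopted; a constant function such as `lambda du, dv: 1.0` reduces Method 3 to a purely time-weighted count.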
PROPOSED METHOD
This section presents the semi-supervised fuzzy clustering method applied to link prediction in co-authorship networks.
Figure 2 shows the diagram of the link prediction model for co-authorship networks (SSSFCRC), which is based on a semi-supervised clustering algorithm. The steps are as follows:
Step 1: From the data initially collected from the co-authorship network, determine the link-weight similarity in the network using the formulas presented in the subsection on link-weight similarity measures.
Step 2: The initial data are split into two parts: training data and testing data. The training data describe the author network at present and in the past. For the training data it is known whether or not each pair of authors are co-authors (the labels of the links are fully known). The testing data describe the author network at the future time to be predicted.
[Figure 2 flow: data X and parameters → determine the link-weight similarity → training (with training labels) and testing → determine cluster centers by label → determine the additional information → semi-supervised fuzzy clustering → predict the node pairs likely to link]
Figure 2. Diagram of the link prediction model in the co-authorship network
Step 3: For each training label, the arithmetic mean of the links carrying that label is taken as the cluster center for the label. The cluster centers obtained from training are combined with the testing data to determine the additional membership matrix. Each entry of this additional information matrix, relating a link to the cluster center of a label, is the Euclidean distance from the link to the cluster center of that label divided by the sum of the Euclidean distances from the link to the cluster centers of all labels.
Step 4: The standard semi-supervised fuzzy clustering algorithm SSSFC [12], with the additional information determined in Step 3, is run on the testing set with the number of clusters set to 2. SSSFC then yields the membership matrix of the link pairs with respect to the clusters.
Step 5: The clustering result gives the membership matrix of the link pairs. For each link, the cluster is determined from the membership matrix. The additional information from training is then used to decide which cluster corresponds to "link" and which cluster corresponds to "no link".
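Steps 3 and 5 can be sketched as below; Step 4 (the SSSFC iteration itself) is deliberately left out. This is a minimal sketch under assumptions: each link is represented by a feature vector of similarity scores, the function names are hypothetical, and the additional membership is computed literally as described in Step 3 (distance to one center over the sum of distances to all centers). Note that, taken literally, this ratio grows with distance, so a link lying close to a label's center receives a small value for that label; the text does not say whether the value is inverted afterwards.

```python
import math

def label_centers(features, labels):
    """Step 3a: mean feature vector of the training links carrying each label."""
    centers = {}
    for lab in sorted(set(labels)):
        rows = [f for f, l in zip(features, labels) if l == lab]
        centers[lab] = [sum(col) / len(rows) for col in zip(*rows)]
    return centers

def additional_membership(x, centers):
    """Step 3b: distance to one center over the sum of distances to all centers."""
    dist = {lab: math.dist(x, c) for lab, c in centers.items()}
    total = sum(dist.values())
    if total == 0:  # all centers coincide with x: split evenly (assumption)
        return {lab: 1.0 / len(dist) for lab in dist}
    return {lab: d / total for lab, d in dist.items()}

def assign_clusters(membership):
    """Step 5: each link goes to the cluster with the largest membership degree."""
    return [max(range(len(row)), key=row.__getitem__) for row in membership]

# Hypothetical training features with labels 0 (no link) and 1 (link).
train_x = [[-1, 0], [1, 0], [3, 0], [5, 0]]
train_y = [0, 0, 1, 1]
centers = label_centers(train_x, train_y)
print(centers)                                  # {0: [0.0, 0.0], 1: [4.0, 0.0]}
print(additional_membership([1, 0], centers))   # {0: 0.25, 1: 0.75}
print(assign_clusters([[0.9, 0.1], [0.3, 0.7]]))  # [0, 1]
```

In a full pipeline, the rows passed to `assign_clusters` would be the membership matrix produced by SSSFC on the testing set rather than hand-written values.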
EXPERIMENTS AND EVALUATION
Data description
The experimental data form a co-authorship network built from the papers published in the "Biophysical Journal" [13] according to some specific criteria. In total, 7,529 papers, 21,151 authors and 68,706 links were collected. The data are split into two parts by time: T1 (2006-2011) and T2 (2012-2016). A total of 4,841 candidate author pairs (pairs having at least one common collaborator in T1) were selected, of which 192 (3.966%) carry label 1 and the rest carry label 0. Because of this class imbalance, 192 author pairs with label 0 were chosen at random to form a test set of 384 author pairs (with equal numbers of labels 0 and 1).
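The balancing step can be illustrated by random undersampling of the majority class. The sketch below is an assumption about how such a selection might be coded: the function name, the fixed `seed`, and the placeholder pair identifiers are hypothetical; only the counts (4,841 candidates, 192 positives) come from the text.

```python
import random

def balance_by_undersampling(pairs, labels, seed=0):
    """Keep all label-1 pairs and an equally sized random sample of label-0 pairs."""
    rng = random.Random(seed)
    pos = [p for p, l in zip(pairs, labels) if l == 1]
    neg = [p for p, l in zip(pairs, labels) if l == 0]
    neg_sample = rng.sample(neg, len(pos))
    return pos + neg_sample, [1] * len(pos) + [0] * len(pos)

# Placeholder identifiers standing in for the 4,841 candidate pairs.
pairs = list(range(4841))
labels = [1] * 192 + [0] * (4841 - 192)
bal_pairs, bal_labels = balance_by_undersampling(pairs, labels)
print(len(bal_pairs), sum(bal_labels))  # 384 192
```

Fixing the seed makes the sampled subset reproducible, which matters when several methods are compared on the same balanced set.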
The measures used are Recall, Precision, F1-Measure, and the standard deviation of the F1-Measure (F1-STD). 10-fold cross-validation is applied, and the final result is the average over the 10 folds. We experiment with the three link-weight similarity measures between two authors (WCN, WAA, WJC) described in Section 2.2.
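For reference, the measures can be computed as in the sketch below, with label 1 (link) treated as the positive class. The function names are hypothetical, and the use of the population standard deviation (`statistics.pstdev`) for F1-STD is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev

def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def summarize_folds(f1_scores):
    """Mean F1 over the folds and its standard deviation (population form, an assumption)."""
    return mean(f1_scores), pstdev(f1_scores)

# Hypothetical predictions for one fold.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
```

With 10-fold cross-validation, `precision_recall_f1` would be called once per fold and the ten F1 values passed to `summarize_folds`.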
Experimental results
The link prediction scheme for the co-authorship network uses semi-supervised fuzzy clustering with the number of clusters set to 2 (one cluster holds the node pairs with a link, the other the node pairs without a link). The results obtained with SSSFCRC are compared with SVM [14] and Gboost [15], since these are typical classification methods that many researchers have used for link prediction in social networks.
Figure 3. Experimental results with Recall
Figure 6. Experimental results with F1-STD
From the results in Figures 3, 4, 5 and 6, we see that for Recall, SSSFCRC is better than both SVM and Gboost on all 3 datasets; for Precision, Gboost is best on 2 datasets and SVM is best on 1 dataset; for F1-Measure, SSSFCRC is best on 2 datasets and Gboost is best on 1 dataset. Taking the 3 measures over the 3 datasets as a whole, SSSFCRC therefore performs better than SVM and Gboost. In terms of stability (F1-STD), SSSFCRC is the most stable method on 2 datasets, and SVM is the most stable on 1 dataset.
CONCLUSION
In this paper, a link prediction model for co-authorship networks using semi-supervised fuzzy clustering has been proposed. Along with an analysis of the procedure and of its practical meaning, the model was implemented on datasets with different parameters. The experimental results show that the proposed model gives suitable results compared with SVM and Gboost for this problem under the specific evaluation criteria used.
Building on the results of this paper, in the future we will continue the study to find the most suitable parameter sets. We will also make appropriate improvements and compare the prediction results with strong machine learning methods such as data classification and random forest.
REFERENCES
1. Wu, Y., & Zhou, X. (2015). Link prediction in social networks: the state-of-the-art. Science China Information Sciences, 58(1), 1-38.
2. Lu, Z., Savas, B., Tang, W., & Dhillon, I. S. (2010). Supervised link prediction using multiple sources. In Data Mining (ICDM), 2010 IEEE 10th International Conference on (pp. 923-928). IEEE.
3. Al Hasan, M., & Zaki, M. J. (2011). A survey of link prediction in social networks. In Social network data analytics (pp. 243-275). Springer US.
4. Chuang, K. S., Tzeng, H. L., Chen, S., Wu, J., & Chen, T. J. (2006). Fuzzy c-means clustering with spatial information for image segmentation. Computerized Medical Imaging and Graphics, 30(1), 9-15.
5. Agarwal, M., Agrawal, H., Jain, N., & Kumar, M. (2010). Face recognition using principle component analysis, eigenface and neural network. In Signal Acquisition and Processing, 2010. ICSAP'10. International Conference on (pp. 310-314). IEEE.
6. Chen, J., Zhao, S., & Wang, H. (2011). Risk analysis of flood disaster based on fuzzy clustering method. Energy Procedia, 5, 1915-1919.
7. Martin, A., Gayathri, V., Saranya, G., Gayathri, P., & Venkatesan, P. (2011). A hybrid model for bankruptcy prediction using genetic algorithm, fuzzy c-means and MARS. arXiv preprint arXiv:1103.2110.
8. Xia, F., Chen, Z., Wang, W., Li, J., & Yang, L. T. (2014). MVCWalker: Random walk-based most valuable collaborators recommendation exploiting academic factors. IEEE Transactions on Emerging Topics in Computing, 2(3), 364-375.
9. Bezdek, J. C. (1981). Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers.
10. Zhang, H., & Lu, J. (2009). Semi-supervised fuzzy clustering: A kernel-based approach. Knowledge-Based Systems, 22(6), 477-481.
11. Yasunori, E., Yukihiro, H., Makito, Y., & Sadaaki, M. (2009, August). On semi-supervised fuzzy c-means clustering. In Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference on (pp. 1119-1124). IEEE.
12. Yasunori, E., Yukihiro, H., Makito, Y., & Sadaaki, M. (2009, August). On semi-supervised fuzzy c-means clustering. In Fuzzy Systems, 2009. FUZZ-IEEE 2009. IEEE International Conference on (pp. 1119-1124). IEEE.
13. Biophysical Journal (2017). Retrieved from "https://www.journals.elsevier.com/biophysical-journal/". Accessed on 10/07/2017.
14. Cortes, C., & Vapnik, V. (1995). Support-vector networks. Machine Learning, 20(3), 273-297.
15. Becker, C., Rigamonti, R., Lepetit, V., & Fua, P. (2013). Supervised feature learning for curvilinear structure segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (pp. 526-533).
SUMMARY
USING SEMI-SUPERVISED FUZZY CLUSTERING METHOD IN CO-AUTHORSHIP LINK PREDICTION
Pham Minh Chuan^1,2,*, Tran Dinh Khang^1, Le Thanh Huong^1, Tran Manh Tuan^3, Le Hoang Son^4
^1 Hanoi University of Science and Technology, ^2 Hung Yen University of Technology and Education, ^3 Thuyloi University, ^4 VNU University of Science, Vietnam National University
In this paper, we propose a new approach for link prediction in the co-authorship network using semi-supervised fuzzy clustering. Link prediction aims to determine possible interactions between authors in the future based on the existing links of a co-authorship network representing joint papers in a specific research domain. It is worth remarking that authors who had joint or similar papers are likely to continue writing together. Since the evaluation consists of both quantitative and qualitative information, fuzzy models in the form of semi-supervised learning are used to judge the authors most similar to the considered one before deciding on an interaction. A new semi-supervised fuzzy clustering model on the authorship network datasets is proposed. Data labels in the training set are grouped to specify the cluster centers, which are then used to construct an additional matrix for the semi-supervised fuzzy clustering. The clustering algorithm produces a membership matrix of links in a cluster and the final recommendation of outputs. The model is implemented and compared against the relevant methods on the Biophysical Journal datasets. The results suggest that the proposed method performs better than the related ones.
Keywords: prediction, co-authorship network, semi-supervised clustering, validity index, co-authorship link
Received: 26/9/2017; Reviewed: 29/9/2017; Accepted: 30/11/2017
* Tel: 0983 081120