New Method for Extracting Keyword for the Social Actor

(1)

All sources 49  Internet sources 41  Own documents 8

[0]  https://archive.org/stream/arxiv-1303.3964/1303.3964_djvu.txt 14.1% 60 matches

[1]  https://arxiv.org/pdf/1303.3964.pdf 13.0% 52 matches

[2]  "CR-INT110-Semantic interpretation ...ot; dated 2017-10-09 6.2% 26 matches

[3]  https://link.springer.com/chapter/10.1007/978-3-319-05476-6_9 10.7% 40 matches

 1 documents with identical matches

[5]  https://mafiadoc.com/keyword-extraction-...723dd05e32a77ea.html 6.0% 26 matches

[6]  https://archive.org/stream/arxiv-1212.4702/1212.4702_djvu.txt 5.8% 22 matches

[9]  eprints.binadarma.ac.id/2778/1/ICIBA2016-13-086-091-Nasution-SocialNetworkMining.pdf 3.4% 15 matches

[10]  https://arxiv.org/pdf/1604.06976.pdf 3.2% 15 matches

[11]  www.academia.edu/3144197/Simple_Search_Engine_Model_Adaptive_Properties_for_Doubleton 3.2% 10 matches

[12]  dl.acm.org/citation.cfm?id=2961663 3.3% 11 matches

[13]  https://link.springer.com/content/pdf/10.1007/978-3-319-05476-6_9.pdf 2.2% 10 matches

[14]  www.academia.edu/3144214/Simple_Search_Engine_Model_Adaptive_Properties 1.5% 5 matches

[15]  https://www.researchgate.net/publication...for_the_Social_Actor 1.7% 4 matches

[16]  "CR-INT136-Social network extractio...ot; dated 2017-10-09 1.8% 11 matches

[17]  https://link.springer.com/chapter/10.1007/978-3-319-67621-0_20 1.8% 8 matches

[18]  "CR-INT137-Enhancing to method for ...ot; dated 2017-10-09 1.5% 8 matches

[19]  "CR-INT135-Information Retrieval on...ot; dated 2017-10-09 1.5% 9 matches

[20]  iopscience.iop.org/article/10.1088/1742-6596/801/1/012020 1.5% 7 matches

[21]  https://www.researchgate.net/profile/Sha...tion-perspective.pdf 1.1% 5 matches

[22]  https://www.researchgate.net/publication/234059964_Knowledge_Sharing_A_Model 1.2% 2 matches

[23]  iopscience.iop.org/article/10.1088/1742-6596/801/1/012022 1.2% 6 matches

[24]  "3028-3639-1-RV.pdf" dated 2017-10-30 1.2% 6 matches

[25]  www.springer.com/cda/content/document/cda_downloaddocument/typeinst.pdf

33.6%

Results of plagiarism analysis from 2017-12-28 15:20 UTC

New Method for Extracting Keyword for the Social Actor.pdf

(2)

[25] www.springer.com/cda/content/document/cda_downloaddocument/typeinst.pdf 0.9% 2 matches

[26]  https://www.overleaf.com/latex/templates/springer-book-chapter/hrdcrfynnzjn 0.9% 2 matches

[27]  "3472-4529-1-SM.pdf" dated 2017-10-30 1.1% 6 matches

[28]  www.academia.edu/3144207/Keyword_Extraction_for_Identifying_Social_Actors 0.7% 4 matches

[29]  https://www.coursehero.com/file/9686843/typeinst/ 0.7% 2 matches

[30]  https://www.researchgate.net/profile/Mahyuddin_Nasution 0.7% 2 matches

[32]  https://vdocuments.site/vision-systems-segmentation-and-pattern-recognition.html 0.3% 3 matches

[33]  dl.acm.org/citation.cfm?id=2130616 0.6% 2 matches

[34]  https://archive.org/details/arxiv-1212.4702 0.4% 1 matches

[36]  www.finedictionary.com/semantic relation.html 0.3% 1 matches

[37]  "ICTS_2017_paper_18.pdf" dated 2017-09-24 0.3% 1 matches

[38]  https://ai.arizona.edu/research/isi 0.3% 1 matches

[39]  https://rd.springer.com/content/pdf/10.1007/978-3-319-05476-6_9.pdf 0.2% 1 matches

[40]  doi.acm.org/10.1145/215206.215383 0.4% 1 matches

[41]  https://www.researchgate.net/publication...n_Positive_Documents 0.2% 1 matches

[43]  ieeexplore.ieee.org/xpl/conhome.jsp?punumber=1001810 0.3% 1 matches

[44]  hinet.lindayi.me/paper.php?id=220711 0.4% 1 matches

[46]  doi.acm.org/10.1145/138859.138867 0.4% 1 matches

[47]  ingenst.academia.edu/MahyuddinKMNasution 0.2% 1 matches

[49]  www.refdoc.fr/Detailnotice?cpsidt=17135176 0.3% 1 matches

[51]  "1733-1956-1-RV.pdf" dated 2017-09-26 0.1% 1 matches

(3)

10 pages, 5254 words

 The document contains a suspicious mixture of alphabets. This could be an attempt of cheating.

PlagLevel: selected / overall

133 matches from 53 sources, of which 45 are online sources.

Settings

Data policy: Compare with web sources, Check against my documents, Check against my documents in the organization repository, Check against organization repository, Check against the Plagiarism Prevention Pool

Sensitivity: Medium

Bibliography: Consider text

Citation detection: Reduce PlagLevel

(4)

New Method for Extracting Keyword for the Social Actor

Mahyuddin K.M. Nasution

Information Technology Department

Fakultas Ilmu Komputer dan Teknologi Informasi (Fasilkom-TI) and

Centre of Information System

Universitas Sumatera Utara, Medan 20155 USU, Sumatera Utara, Indonesia

[email protected], [email protected]

Abstract. In this paper we study the relationship between query and search engine by exploring some properties and applying their relations to extract keywords for any social actor, by proposing a new method. The proposed approach is based on considering the results of the search engine in the singleton and doubleton. We develop a novel method for extracting keywords automatically from the Web with the mirror shade concept (M2M). Results show the potential of the proposed approach: in the experiment, the performance (recall and precision) of a keyword depends on both weights (singleton and tf.idf) and the distance between them.

Keywords: singleton, doubleton, search engine, query, information retrieval.

1 Introduction

In a search space with a large repository such as the Web, it is difficult to obtain accurate information about any social actor or social agent, that is, one endowed with human agency, which means recognising individually the attempt to come to grips with the challenge of changing the world around the agent into a good world. In this case, major obstacles often accompany search engine capabilities, such as ambiguity [2] and bias [5]. Therefore, it is always necessary to have corresponding

[51]

keywords to pry information out of the heaps of data or documents on the Web. In this paper we propose a new method for automatically generating and selecting the keyword for someone as a social actor on the Web, based on the principles of information retrieval (IR) and a model of the search engine.




N.T. Nguyen et al. (Eds.): ACIIDS 2014, Part I, LNAI 8397, pp. 83–92, 2014.


[3]

(5)

84 M.K.M. Nasution

2 Problem Definition

We define some terminology and the properties of a search engine model [13,14,15,19].

event (singleton space of event) of web pages that contain an occurrence of

tx∈ωx

is a set of singleton search terms of the search engine. A doubleton search term is $D = \{\{t_x, t_y\} : t_x, t_y \in \Sigma\}$ and its vector space, denoted by $\Omega_x \cap \Omega_y$, is a double search engine event (doubleton space of event) of web pages that

conditions we obtain other properties. Those properties are as follows.

(6)


starting from Eq.(7), based on Eq. (5) and then P1, i.e.,

|Ωx∩Ωy|=|Ωx|+|Ωy|+|Ωx∩Ωy| (8)

and we know that |Ωx| = |Ωx∩Ωx| and |Ωy| = |Ωy∩Ωy|, then Eq. (8) becomes

|Ωx∩Ωy|=|Ωx∩Ωy|+|Ωx∩Ωx|+|Ωy∩Ωy| (9)

As information about any social actor, the singleton and the doubleton are the basis of some statistical properties of the search engine that relate to the social actor. However, either a singleton or a doubleton depends on formulating a query, i.e. where and how the keyword is placed there. Some techniques for mining keywords from information sources have been proposed, for example estimating classifiers for labelling some messages by using simple [1] and more sophisticated [3,4] approaches. For feature extraction, several methods have been developed [7]. Some of them are the substring search method [20], a model and prototype system [8], peer clustering [11], co-occurrence analysis [9], lexical chains [6], PageRank [21], latent semantic analysis [10], etc. In this case, the singleton [16] and the doubleton [17] are the necessary condition for gaining the information of a social actor from the Web, because both singleton and doubleton contain bias and ambiguity, while other purposes require a sufficient condition [18] so that the major obstacles can be reduced or eliminated. Some of the following formulas will be evidence against some of the approaches and

(7)

[0]


3 The Proposed Approach

If the singleton is accompanied by a summary of the Web, then the involvement of the singleton and doubleton in the computation generates descriptions (as keyword candidates) of a social actor as follows.

De

We construct a relationship of actors-snippets-words based on the frequency of words in Web pages as environments of a social actor as follows.

[0]

Definition 2. A relationship between social actors, web snippets and words is

[0]

j are the weights of

word.

Statistically, the task of the relationship in Definition 2 is simply to gather and record information about words, features, and web pages, where term weights reflect the relative importance of words in web pages. One of the most common types used in older retrieval models is known as tf.idf weighting [12], whereby we can generate the vector ν for each word/term w. This information is then used for recognizing the different social actors in web pages, based on clustering all words by using one of the similarity measurements such as the Jaccard coefficient $j_c = |\Omega_a \cap \Omega_b| / (|\Omega_a| + |\Omega_b| - |\Omega_a \cap \Omega_b|)$. For this purpose, we define the words undirected graph $G = (V, E)$ to describe the relations between words [12].
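The Jaccard coefficient above can be computed directly from hit counts. The following is a minimal sketch, assuming the two singleton counts |Ωa|, |Ωb| and the doubleton count |Ωa ∩ Ωb| have already been obtained from a search engine; the function name `jaccard` and the sample counts are illustrative and do not come from the paper.

```python
def jaccard(hits_a: int, hits_b: int, hits_ab: int) -> float:
    """Jaccard coefficient from two singleton hit counts and one doubleton hit count.

    hits_a  ~ |Omega_a|, hits_b ~ |Omega_b|, hits_ab ~ |Omega_a ∩ Omega_b|.
    """
    union = hits_a + hits_b - hits_ab  # |Omega_a ∪ Omega_b| by inclusion-exclusion
    if union <= 0:
        return 0.0  # no pages at all: treat similarity as zero
    return hits_ab / union


# Illustrative counts only (not taken from the experiment).
similarity = jaccard(100, 50, 25)
```

Because real search engine hit counts are estimates, the doubleton count can occasionally exceed a singleton count; a production version would probably clamp the result to [0, 1].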

[11]

The micro-cluster is denoted by $G = \langle V, E, ww, f, \rho, \alpha \rangle$.

A micro-cluster is maximal cliquesub graph oft

[0]

o where the node represents

[0]

word has the highest score in the document. However, the collection of mentioned words does not exactly refer to the same social actors. To group the words into the appropriate cluster, we construct the trees of words. This is based on an assumption that the words that appear in the same domains have the closest

[0]

(8)

[11]

Definition 4. A tree T is an optimal micro-cluster if and only if T is a sub-graph of micro-cluster G, and is denoted by $T = \langle V_T, E_T, ww_T, f, \rho, \alpha \rangle$, where $V_T \subseteq V$, $E_T \subset E$, and $ww_T \subseteq ww$.

In building the optimal micro-cluster, we save the strongest relations in T between a word and another in G until T has no cycle. We introduce an intrusive word about the social actor, such that at least one word of the optimal micro-cluster has the strongest relation with the intrusive word, and an optimal micro-cluster is a group of words that refer to that social actor. However, the overlap keyword also exists in the same list. We define a strategy to select relevant words among all candidate lists. In this case, there are a few potential words as keyword candidates.
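One way to read "save the strongest relations until T has no cycle" is as a maximum spanning forest over the word graph. The sketch below makes that assumption explicit and uses Kruskal's algorithm with a union-find structure; the paper does not name the procedure, so `optimal_micro_cluster`, the edge format and the sample strengths are all hypothetical.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str, float]  # (word_u, word_v, relation strength)


def optimal_micro_cluster(words: List[str], edges: List[Edge]) -> List[Edge]:
    """Keep the strongest word-word relations that do not close a cycle.

    Assumption: this is Kruskal's maximum spanning forest, used here as one
    possible interpretation of the tree construction described in the text.
    """
    parent: Dict[str, str] = {w: w for w in words}

    def find(w: str) -> str:
        # Find the component representative, halving paths along the way.
        while parent[w] != w:
            parent[w] = parent[parent[w]]
            w = parent[w]
        return w

    tree: List[Edge] = []
    # Visit relations from strongest to weakest.
    for u, v, strength in sorted(edges, key=lambda e: e[2], reverse=True):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # adding this edge creates no cycle
            parent[root_u] = root_v
            tree.append((u, v, strength))
    return tree


# Illustrative graph: the weakest edge would close a cycle and is dropped.
sample_tree = optimal_micro_cluster(
    ["faculty", "journal", "network"],
    [("faculty", "journal", 0.9), ("journal", "network", 0.5), ("faculty", "network", 0.2)],
)
```

If the word graph is disconnected, the result is a forest, which matches the text's later reference to "trees of words".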

We also can generate for example another vectors fromΩ

[0]

because off−1_g _{is also one-one function, this means that}_V

T ⊆V hasssT as a

Lemma 2 states that words that appear frequently in certain snippets but rarely in the remaining snippets are words strongly associated with only one of the social actors: if $\nu_x \ge \nu_y$ then $|\Omega_x \cap \Omega_a| \ge |\Omega_y \cap \Omega_a|$. Otherwise, if the words do not appear frequently in Web pages except for only one social actor, then the words are strongly associated with that social actor. The last

(9)


of P1 and Theorem 1, we obtain

$|\Omega_x \cap \Omega_a| \le |\Omega_y \cap \Omega_a|$

Lemma 3 explains that the distance between a social actor $t_a$ and candidate words $t_x$ and $t_y$ can be used to select an appropriate keyword, or if $\mu_x \ge \mu_y$

Proposition 1. If the interval [0,1] is divided by a straight line into two areas, $[0, \frac{1}{2})$ and $[\frac{1}{2}, 1]$, then there are six patterns of conditions satisfying the relation

b

Proof. Let us summarise the conditions of the relation between $\nu$ and $\mu$ into (i) $\nu \ge \mu$ and (ii) $\nu < \mu$. Based on the condition $\nu \ge \mu$ from Lemma 2 and Lemma 3, we can determine the value in {TRUE, FALSE} for the relation patterns between $\nu$

andμ: 1) fν≥μ,( henI ν≥ 1 t

2 (FALSE). Thus, there are

only six patterns with TRUE value.

We can sort the candidate words by using the six patterns of conditions for satisfying Proposition 1, and the selected word as keyword is a candidate word with

(10)

Theorem 2. Let T be an optimal micro-cluster containing the keyword candidates, then the suitable keyword is a keyword candidate with the highest value of vector space of p(a, S, w) and the lowest value of mirror shade, where the distance b

Lemma 3 the mentioned word has a lowest value of mirror shade in [0,1], i.e.,

2, and with max-min values:

gives the maximum value, i.e., $\delta = 1$. It means that there is a keyword candidate in $ww_T$ as an optimal keyword, where $\nu$ = max value and $\mu$ = min value, or $\nu - \mu$ = max − min values.
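The selection rule can be illustrated with a small sketch. It assumes that each candidate word carries a weight ν and a mirror-shade distance μ, both normalised to [0, 1], and that δ is the difference ν − μ, so that δ = 1 is reached when ν is maximal and μ is minimal. The exact definition of δ is not recoverable from the text here, so this reading, the function name `select_keyword` and the sample scores are assumptions.

```python
from typing import Dict, Tuple


def select_keyword(candidates: Dict[str, Tuple[float, float]]) -> Tuple[str, float]:
    """Return the candidate with the largest delta = nu - mu.

    candidates maps word -> (nu, mu); both values are assumed to lie in [0, 1].
    """
    if not candidates:
        raise ValueError("no keyword candidates given")
    best_word, best_delta = "", float("-inf")
    for word, (nu, mu) in candidates.items():
        delta = nu - mu  # assumed reading of 'max - min values'
        if delta > best_delta:
            best_word, best_delta = word, delta
    return best_word, best_delta


# Hypothetical scores, not results from the paper.
word, delta = select_keyword({"network": (0.9, 0.1), "career": (0.6, 0.4)})
```

Ties are resolved in favour of the first candidate encountered; a different tie-break (for example, by raw tf.idf) would be equally compatible with the text.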

Three values of each word $w \in ww$ determine a relationship between $ww$ and any social actor. The last theorem expresses that the suitable keyword will provide a query with enriched information through the semantic relations of its contents, and this gives more effective retrieval of information. The effectiveness of using keywords depends on the query levels, generally based on $\delta_x > \delta_y$ if and only if $t_x$ is the suitable top keyword. The following is an algorithm that uses the micro-cluster and the mirror-shade (therefore we call it the MM method (M2M)) for generating keywords.

generate(keyword)

INPUT : A set of social actors

OUTPUT : keyword(s) of each social actor

STEPS : for separating T be trees.

7. Select a cluster from trees of T by using a predefined stable attribute.

(11)


Fig. 1. The optimal micro-cluster

4 Experiment

Let us consider the information context of social actors that includes all relevant relationships with their interaction history, where Yahoo! search engines fall short of utilizing any specific information, especially micro-cluster information, and therefore we use full text index search in web snippets. In the experiment, we use a maximum of 500 web snippets for the search term $t_a$ representing an actor, and we consider words where the TF.IDF value > 0.3 × highest value of TF.IDF, or the maximum number is 30 words. For example, Fig. 1 is a set of 30 words from web snippets for an actor = "Abdullah Mohd Zin". We test for 143 names, and we obtain 8 (5.59%) actors without a cluster of candidate words, 13 (9.09%) actors with only one cluster, and 122 (85.32%) persons have two or more keywords. In the case of "Abdullah Mohd Zin" we have trees of words as micro-clusters of words.

We can arrange the keyword candidates individually according to their proximity to the stable attribute "academic", i.e. a set of words in SK = {sciences, faculty, associate, economic, prof, environment, career, journal, network, university, report, relationship, context, ...}. SK and the maximum δ exactly determine that "network" is a keyword for the actor "Abdullah Mohd Zin" as an academic (not a politician). In this case, the first keyword is "computer", while the second keyword is "university", and for the dataset of "Abdullah Mohd Zin" with 143 files we obtain

(12)

[6]

Fig. 2. Recall and precision of the optimal micro-cluster
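For reference, recall and precision of an extracted keyword set are usually computed against a set of keywords judged relevant for the actor. The sketch below shows the standard definitions only; how the relevant set was built for Fig. 2 is not stated in the text, and the function name `precision_recall` and the sample sets are illustrative assumptions.

```python
from typing import Set, Tuple


def precision_recall(extracted: Set[str], relevant: Set[str]) -> Tuple[float, float]:
    """Standard IR precision and recall for a set of extracted keywords."""
    hits = len(extracted & relevant)  # keywords that are both extracted and relevant
    precision = hits / len(extracted) if extracted else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical keyword sets, not the paper's data.
p, r = precision_recall(
    {"network", "computer", "car"},
    {"network", "computer", "university", "faculty"},
)
```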

5 Conclusion and Future Work

Studying the properties of the relation between query and search engine gave semantic meaning to the social actors. One of them is to provide a keyword for any social actor, or a clue about the social actor. The mirror-shade approach played a role in selecting the top keyword from the summary of web pages about the actor. Our near future work is to experiment further and look into IR performance.

[8]

References

1. Abu-Nimeh, S., Nappa, D., Wang, X., Nair, S.: A comparison of machine learning techniques for phishing detection. In: Proceedings of the Anti-Phishing Working Groups 2nd Annual eCrime Researchers Summit, pp. 60–69 (2007)

2. Adriani, M.: Using statistical term similarity for sense disambiguation in cross-language information retrieval. Information Retrieval 2, 69–80 (2000)

3. Bergholz, A., Chang, J.-H., Paass, G., Reichartz, F., Strobel, S.: Improved phishing detection using model-based features. In: Proceedings of Fifth Conference on Email and Anti-Spam (2008)

4. Bergholz, A., Beer, J.D., Glahn, S., Moens, M.-F., Paass, G., Strobel, S.: New filtering approaches for phishing email. Journal of Computer Security (2009)

5. Buckley, C., Dimmick, D., Soboroff, I., Voorhees, E.: Bias and the limits of pooling for large collections. Information Retrieval 10, 491–508 (2007)

6. Ercan, G., Cicekli, I.: Using lexical chains for keyword extraction. Information Processing and Management 43, 1705–1714 (2007)

7. Fette, I., Sadeh, N., Tomasic, A.: Learning to detect phishing emails. In: ACM Proceedings of the 16th International Conference on World Wide Web, pp.

(13)


8. HaCohen-Kerner, Y.: Automatic extraction of keywords from abstracts. In: Palade, V., Howlett, R.J., Jain, L. (eds.) KES 2003. LNCS, vol. 2773, pp. 843–849. Springer, Heidelberg (2003)

9. Kim, B.-M., Li, Q., Lee, K., Kang, B.-Y.: Extraction of representative keywords considering co-occurrence in positive documents. In: Wang, L., Jin, Y. (eds.) FSKD 2005. LNCS (LNAI), vol. 3614, pp. 752–761. Springer, Heidelberg (2005)

10. L'Huillier, G., Hevia, A., Weber, R., Ríos, S.: Latent semantic analysis and keyword extraction for phishing classification. In: IEEE International Conference on Intelligence and Security Informatics (ISI), Vancouver, BC, Canada, pp. 129–131 (2010)

11. Liang, B., Tang, J., Li, J., Wang, K.-H.: Keyword extraction based peer clustering. In: Jin, H., Pan, Y., Xiao, N., Sun, J. (eds.) GCC 2004. LNCS, vol. 3251, pp. 827–830. Springer, Heidelberg (2004)

12. Nasution, M.K.M., Noah, S.A.: Superficial method for extracting social network for academic using web snippets. In: Yu, J., Greco, S., Lingras, P., Wang, G., Skowron, A. (eds.) RSKT 2010. LNCS (LNAI), vol. 6401, pp. 483–490. Springer, Heidelberg (2010)

13. Nasution, M.K.M., Noah, S.A.M.: Extraction of academic social network from online database. In: Noah, S.A.M., et al. (eds.) IEEE Proceeding of 2011 International Conference on Semantic Technology and Information Retrieval, Putrajaya, Malaysia, pp. 64–69. IEEE (2011)

14. Nasution, M.K.M., Noah, S.A.M., Saad, S.: Social network extraction: Superficial method and information retrieval. In: Proceeding of International Conference on Informatics for Development (ICID 2011), pp. c2-110-c2-115 (2011)

15. Nasution, M.K.M., Noah, S.A.M.: Information Retrieval Model: A Social Network Extraction Perspective. In: IEEE Proc. of CAMP 2012 (2012)

16. Nasution, M.K.M.: Simple search engine model: Adaptive properties. Cornell University Library (arXiv:1212.3906v1) (2012)

17. Nasution, M.K.M.: Simple search engine model: Adaptive properties for doubleton. Cornell University Library (arXiv:1212.4702v1) (2012)

18. Nasution, M.K.M.: Simple search engine model: Selective properties. Cornell University Library (arXiv:1303.3964v1) (2012)

19. Nasution, M.K.M. (Mahyuddin): Kaedah dangkal bagi pengekstrakan rangkaian sosial akademik dari Web, Ph.D. Dissertation, Universiti Kebangsaan Malaysia (2013) (in Malay)

20. Okada, M., Ando, K., Le, S.S., Hayashi, Y., Aoe, J.-I.: An efficient substring search method by using delayed keyword extraction. Information Processing and Management 37, 741–761 (2001)

21. Wang, J., Liu, J., Wang, C.: Keyword extraction based on PageRank. In: Zhou,