
3. The third level of the contraction deals with conflicting transitions. Before discussing it, we need to introduce the following notation:

• We denote by AT the set of transitions of T that are not connected to any inhibitor arc: t∈AT if ∀p∈P, IH(p,t)=0.

• Two transitions ti and tj are said to be twin if ∀p∈P, IH(p,ti)=IH(p,tj). This means that if both transitions are enabled for a given marking, then both are either inhibited or activated for this marking.

We denote hereafter by Twin the relation defined on T such that AT²⊆Twin and such that if (ti,tj)∈Twin, then (tj,ti)∈Twin.

The idea is to leave out the comparison of the firing distance between two conflicting transitions t and t’ when its value is positive. More concretely, if two transitions cannot be inhibited and Dc[t,t’]≥0, then the transition t’ has no impact on the firing of t as long as both remain persistent. However, if t is fired, then t’ will be disabled afterwards, since both transitions are conflicting for the same markings. Therefore, there is no need even to recompute the value of positive distances, since they keep their status as long as the conflicting transitions are not disabled in the run. However, when dealing with inhibited conflicting transitions, this property holds only when the transitions are twin.

To make these concepts clearer, let us consider the ITPN of Fig.1. We have Inhib={(t,t),(t,t)}; hence, by the last properties, the distances Dc[t,t], Dc[t,t], Dc[t,t] and Dc[t,t] should be left out during the computation of a class as well as when performing the equivalence test.

Furthermore, we have AT={t,t,t,t} and Twin=AT²∪{(t₄,t₅),(t₅,t₄)}. However, among the elements of Twin, only transitions t₁ and t₂, on the one hand, and t₄ and t₅, on the other hand, are in conflict for the initial marking. Therefore, since the distances Dc⁰[t₁,t₂], Dc⁰[t₂,t₁], Dc⁰[t₄,t₅] and Dc⁰[t₅,t₄] are positive, we do not need to recompute their values as long as the related transitions remain persistent. Furthermore, as the firing of t₁ (resp. t₄) disables t₂ (resp. t₅), and conversely, these distances are useless for the equivalence test.

The exact construction GR [4], the tightest DBM over-approximation GR [1] and the abstraction GRC all produce the same graph, shown in Fig.2.a. However, applying the last properties makes it possible to contract the graph GRC further, as depicted in Fig.2.b. Although it is smaller, the obtained graph is bisimilar to the former, as it gathers classes that derive the same firing sequences.

For instance, firing t₁ (resp. t₂) in GRC from the initial class Ec⁰ leads to the class Ec¹ (resp. Ec⁶). These classes are not equal, but they are bisimilar. Indeed, the distances Dc[•,t₃], Dc[t₄,t₃] and Dc[t₅,t₃], which prevent the equality from holding, are useless since t₃ is inhibiting t₄ and t₅. Furthermore, the classes Ec² and Ec⁸ are equivalent since the coefficient Dc[t₄,•] can be left out.

Remark: The class Eci corresponds to the node numbered (i) in the graph.

Fig.2. The exact graph and its DBM approximations.

Finally, the classes Ec¹⁰ and Ec⁹ are gathered since the positive distances Dc[t₁,t₂] can be ignored: t₁ and t₂ are two twin conflicting transitions. Hence we obtain a much more compact graph of 8 nodes and 16 edges, whereas the other constructions produce a graph of 11 classes and 22 edges. More formally, we introduce this contraction as an equivalence relation, defined as follows:

Definition 5. We define a relation over the state classes of the graph GRC as follows: two classes (M,Dc) and (M’,Dc’) are in the relation if:

(i) M=M’;

(ii) ∀t∈Ti(M): Dc[•,t]=Dc’[•,t] and Dc[t,•]=Dc’[t,•];

(iii) ∀(t,t’)∈Twin∩Conf(M): sg(Dc[t,t’])=sg(Dc’[t,t’]), and Dc[t,t’]=Dc’[t,t’] whenever sg(Dc[t,t’]) is <;

(iv) ∀(t,t’)∈Te(M)²−(Twin∩Conf(M)) such that (t’,t),(t,t’)∉Inhib: Dc[t,t’]=Dc’[t,t’];

where sg(v) is a function that gives the sign of the value v, taking its values in {≥₀,<}, such that ≥₀ (resp. <) denotes "positive or null" (resp. "strictly negative").

In concrete terms, two classes (M, Dc) and (M’, Dc’) are in this relation if: (i) they have the same marking; (ii) the maximum and minimum residual times of any inhibited transition are identical in both classes; (iii) for any pair of conflicting twin enabled transitions, the firing distance involving both transitions has the same sign in both classes, and this distance must be equal in both classes only when it is negative; (iv) for all other pairs of enabled transitions that are not in the relation Inhib, the firing distance involving both transitions must be equal. Let us now state that this relation is a bisimulation over the classes of the graph GRC.

Theorem 1. The relation of Definition 5 is a bisimulation over the graph GRC.
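The four conditions of Definition 5 can be read as a pairwise test between two classes. The following Python fragment is only a minimal sketch of that reading, not the ITPNT implementation: the representation of a class as a pair (marking, dictionary of distances), the `ORIGIN` key standing for the • index, and the way the sets Ti(M), Te(M), Twin∩Conf(M) and Inhib are passed in are all assumptions made for illustration. Diagonal pairs (t,t) are skipped, which is also an assumption.

```python
ORIGIN = "o"  # hypothetical key standing for the "•" index of the distance matrix


def sign(value):
    """Return the sign class of a distance: '>=0' or '<'."""
    return ">=0" if value >= 0 else "<"


def equivalent(class_a, class_b, inhibited, enabled, twin_conf, inhib):
    """Check conditions (i)-(iv) of the relation between two state classes.

    class_a, class_b: (marking, dc) pairs, dc maps (t, t') to a distance.
    inhibited: Ti(M); enabled: Te(M); twin_conf: pairs in Twin ∩ Conf(M);
    inhib: pairs in Inhib.  All four sets are supplied by the caller.
    """
    (marking_a, dc_a), (marking_b, dc_b) = class_a, class_b
    # (i) same marking
    if marking_a != marking_b:
        return False
    # (ii) same residual-time bounds for every inhibited transition
    for t in inhibited:
        if dc_a[(ORIGIN, t)] != dc_b[(ORIGIN, t)]:
            return False
        if dc_a[(t, ORIGIN)] != dc_b[(t, ORIGIN)]:
            return False
    for t in enabled:
        for u in enabled:
            if t == u:
                continue  # diagonal entries are not compared in this sketch
            a, b = dc_a[(t, u)], dc_b[(t, u)]
            if (t, u) in twin_conf:
                # (iii) same sign; equal values only required when negative
                if sign(a) != sign(b) or (a < 0 and a != b):
                    return False
            elif (t, u) not in inhib and (u, t) not in inhib:
                # (iv) all remaining non-inhibiting pairs must be equal
                if a != b:
                    return False
    return True
```

With two twin conflicting transitions whose distances are positive but different in the two classes, the test succeeds, whereas it fails as soon as the pair is not declared twin and conflicting.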

By avoiding, on the one hand, the computation of some distances when working out each accessible class and, on the other hand, their comparison during the equivalence test, we reduce the computation effort of the approximated graph GRC. This construction generally reduces the size of the graphs significantly, at the cost of a small loss of precision in the approximation. The last abstraction over the classes of the graph GRC is the quotient graph of GRC w.r.t. this equivalence relation. It preserves both markings and firing sequences while being, in general, smaller. The GRC may be more appropriate than GR for checking linear properties of the model, especially when the number of additional sequences introduced by the constraint relaxation remains limited.

However, when the GRC provides too coarse an approximation, it may yield a larger graph than GR: the additional sequences are too numerous to be absorbed by the contraction. In practice, the abstraction GRC is more convenient to build when the number of inhibiting and conflicting transitions in the net is large; otherwise the construction of GR should be considered.

IV. EXPERIMENTAL RESULTS

The tests have been performed on a Pentium V with a processor speed of 2.7 GHz and 1.9 GB of memory. They have been carried out using different tools: the TINA tool [10], the ROMEO tool [9] and our tool, named ITPNT. Their performance is assessed according to three parameters: the number of classes, the number of edges, and the computation time. It is noteworthy that ROMEO does not report some parameters; we

of using the GRC construction when dealing with conflicting and inhibiting transitions. To this end, we have considered the ITPN given in Fig.3 while varying the intervals of transitions t,t₇ and t₈; the results of these experiments are reported in Tab.1. All the experiments show that the graph computation times are in favour of our algorithm. The construction of GRC significantly reduces the size of the graphs as well as their computation effort.

Fig. 3. ITPN used in the experiments.

TABLE I. RESULTS OF EXPERIMENTS

V. CONCLUSIONS

We have proposed in this paper an efficient algorithm to contract the DBM over-approximation of the state class graph of preemptive systems. To this end, we have shown in [2] that by slightly relaxing the precision of the DBM approximation, we can compute graphs that are, in certain cases, more appropriate for model-checking the linear properties of the model. We have discussed in this paper how this construction can be improved further by leaving out all the distances that are useless for the class computation process. Hence, we have put forward an equivalence relation that significantly reduces the size of the graphs as well as the effort of their computation. Experimental results have been reported to support the benefits of this approach.

approximation de l'espace d'état des systèmes préemptifs". TSI, Hermes editions, Vol. 28:9, 2009, pp. 1143-1170.

[2] A. Abdelli and D. Yahiatene. "Efficient computation of state space over approximation of preemptive real time systems". IEEE AICCSA 2008: 726-733.

[3] B. Berthomieu and M. Diaz. "Modeling and verification of time dependant systems using Time Petri Nets". IEEE TSE, 17(3): 259-273, March 1991.

[4] B. Berthomieu, D. Lime, O. H. Roux and F. Vernadat. "Reachability Problems and Abstract State Spaces for Time Petri Nets with Stopwatches". Discrete Event Dynamic Systems 17(2): 133-158 (2007).

[5] G. Bucci, A. Fedeli, L. Sassoli and E. Vicario. "Timed State Space Analysis of Real-Time Preemptive Systems". IEEE TSE, Vol. 30, No. 2, Feb. 2004.

[6] D. L. Dill. "Timing assumptions and verification of finite-state concurrent systems". Workshop AVMFSS, Vol. 407 (1989), 197-212.

[7] F. Cassez and K. G. Larsen. "The Impressive Power of Stopwatches". LNCS, Vol. 1877, pp. 138-152, Aug. 2000.

[8] O. H. Roux and D. Lime. "Time Petri Nets with Inhibitor Hyperarcs. Formal Semantics and State Space Computation". ICATPN 2004: 371-390.

[9] ROMEO Tool. http://romeo.rts-software.org.

[10] TINA Tool. http://www.laas.fr/tina/.

Platect: detecting plagiarism among a set of programs that solve identical problems

Tisha Melia, Ricky Suryadharma, Denvil Prasetya, Budianto, and Mario R. Mahardhika

Faculty of Computer Science, University of Indonesia

Email: [email protected], {ricky.suryadharma, denvil.prasetya, budianto71, mario.ray}@ui.ac.id

Abstract—We have developed a new method to detect plagiarism among a set of highly similar programs. We start by tokenizing each source code, aligning these tokens, and computing the statistical significance of the reported alignment. We propose a new scoring matrix that differentiates conservative substitutions of tokens from non-conservative ones, and we introduce random shuffling of source codes to construct a statistical framework for interpreting the result of a source code alignment. The results of our experiments show that our algorithm is appropriate for detecting trivial to non-trivial plagiarism techniques in source codes that are roughly equal in size. The assessment of our random shuffling process shows limited applicability but sheds light on future improvements.

I. INTRODUCTION

Detecting plagiarism is a well-established problem in the field of computer science. Although numerous approaches have been proposed [3][5][11][2][1][4][6][8][7][9] to solve it, only a few address the issue of detecting plagiarism within a set of similar source codes. Furthermore, most of the previous approaches overlook the need for a statistical analysis of how significant their calculated similarity is (i.e., are the two source codes similar due to plagiarism or by chance). In this study, we focus on developing an algorithm to address these issues.

An algorithm to detect similarity in a set of already similar source codes is important for many educational institutions. In programming courses, students are asked to solve a specific problem. Thus, all of the submitted programs already have an inherent similarity: they are written to solve the same problem. Therefore, there is a need for an algorithm that detects plagiarism within a set of already similar source codes and assesses the statistical significance of the calculated similarity.

II. PROPOSED APPROACH

Previous approaches in detecting code similarity can be divided into two main flavors:

1) token-based approach, detecting similarity by token-to-token comparison between the two source codes.

2) structure-based approach, detecting similarity by aligning partial or complete syntax tree structures of the two source codes.

These authors contributed equally to this work

For detecting similarity in a set of already similar source codes, a structure-based approach may not be appropriate: a set of source codes that implement merge sort will likely give similar syntax tree structures. Hence, two programs that are developed independently can be detected as very similar.

Our proposed approach is based on the work of Ji et al. [5], which borrows the concept of pairwise alignment from the field of computational biology to detect program similarity. Unfortunately, this existing approach focuses on finding alignments locally (i.e., it finds only similar contiguous regions between two programs) instead of globally (i.e., trying to align the programs from start to end). The latter approach is appropriate for detecting similarities in programs that are roughly equal in size and is sensitive in detecting dissimilarities among a highly similar set of programs.

The first contribution of our proposed research is applying the concept of global pairwise alignment to two source codes. The other two contributions are the development of a weighted scoring matrix to score an alignment (of source codes) and a statistical framework to determine how significant the calculated similarity of two source codes is.

Our method starts by tokenizing each source code into a string of tokens (Figure 1, box 1). These strings of tokens are grouped by the methods from which they originate. These methods are rearranged such that two methods that have similar return values, parameters, and size are placed in the same position in their respective documents. In other words, if method a from document x is similar to method b from document y, then each should be placed as the ith method in its respective document. The value of i does not matter as long as it is the same for both methods (i.e., a and b). This rearrangement causes the methods that are unique to either document to be placed at the end of that document. The placement is important because we give less weight to mismatches that occur at the end of the document. The next step is to align a string of tokens (from one source code) with another string of tokens (from another source code) by allowing gaps to occur in the beginning, middle, or last positions (Figure 1, box 2). The output is an alignment score, that is, the sum over the positions that are either matched, mismatched, or skipped.

Fig. 1. The outline of our proposed approach to detect similarity between two source codes

Each match, mismatch, and gap is scored differently according to the developed scoring matrix (Figure 1, box a). Using only the raw alignment score, we cannot determine whether the two source codes are similar due to plagiarism or by chance. Therefore, it is important to assess whether an observed alignment score is statistically significant (Figure 1, box 3). To perform such a statistical test, we develop a distribution model that is calculated from many alignment scores of unrelated source codes of the same length (Figure 1, box b). The original alignment score of the two source codes is compared to the distribution model. If the alignment score is located in the extreme positions of the distribution, it is likely that the similarity between the two source codes is due to plagiarism rather than random chance. Further explanation of each step is provided below:

A. Tokenizing each source code into a string of tokens

In this step, we parse the source codes and replace the appropriate language-dependent keywords with specific tokens. This is done to avoid misalignment due to differences in variable names, function names, etc. Any spaces and comments are also discarded.
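The tokenizing step can be illustrated with a small Python sketch. It is not the tokenizer used in our experiments: the keyword list, the token names `ID` and `NUM`, and the C/Java-style comment syntax are assumptions chosen for illustration, and string literals are not handled.

```python
import re

# Hypothetical keyword subset; a real tokenizer would list every keyword
# of the target language.
KEYWORDS = {"for", "while", "do", "if", "switch", "return",
            "int", "short", "byte", "float", "double"}

# C/Java-style line and block comments (an assumption about the language).
COMMENT = re.compile(r"//[^\n]*|/\*.*?\*/", re.DOTALL)
# Identifiers/keywords, numbers, or any single non-space symbol.
LEXEME = re.compile(r"[A-Za-z_]\w*|\d+(?:\.\d+)?|\S")


def tokenize(source):
    """Turn source text into a list of name-independent tokens."""
    stripped = COMMENT.sub(" ", source)  # discard comments
    tokens = []
    for lexeme in LEXEME.findall(stripped):  # whitespace is skipped
        if lexeme in KEYWORDS:
            tokens.append(lexeme.upper())
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            tokens.append("ID")   # variable, method and class names
        elif lexeme[0].isdigit():
            tokens.append("NUM")  # numeric literals
        else:
            tokens.append(lexeme)  # operators and punctuation
    return tokens
```

Under these assumptions, `int total = 0; // sum` and `int count = 0;` produce the same token string, so renaming a variable or adding a comment does not affect the alignment.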

B. Pairwise alignment of two strings of tokens

Each string of tokens (from one source code) is aligned with another string of tokens by using the Needleman-Wunsch algorithm [10], which is based on dynamic programming. Each alignment receives a score that is the sum of the scores of the matches, mismatches, and gaps over all positions. Typically, a match has a positive weight, whereas a mismatch or a gap has a negative weight.

In programming, there are many keywords that are similar in function (e.g., repetition can be done using a for loop, a do-while loop, a while loop, etc.). Therefore, a mismatch between these similar keywords should be weighted positively. For example, a mismatch between a for loop and a while loop should be weighted more positively than a mismatch between a for loop and a variable declaration. In what follows, we refer to such substitutions between similar keywords as conservative substitutions. The full description of our substitution matrix is given in Table II-B.
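The global alignment with conservative substitutions described above can be sketched as follows. This is a minimal illustration rather than our implementation: the token names, the match, mismatch and gap weights of 1.0, -1.0 and -1.0 are assumptions, only the 0.8 score for conservative substitutions is taken from the text, and the reduced weight for mismatches at the end of the document is not modeled.

```python
MATCH = 1.0          # assumed weight for identical tokens
MISMATCH = -1.0      # assumed weight for unrelated tokens
GAP = -1.0           # assumed weight for a skipped position
CONSERVATIVE = 0.8   # score of a conservative substitution

# Groups of tokens considered similar in function (hypothetical token names).
GROUPS = [
    {"FOR", "WHILE", "DO"},
    {"IF", "SWITCH"},
    {"INT", "SHORT", "BYTE"},
    {"FLOAT", "DOUBLE"},
]


def score(a, b):
    """Score of aligning token a with token b (one entry of the matrix S)."""
    if a == b:
        return MATCH
    if any(a in group and b in group for group in GROUPS):
        return CONSERVATIVE
    return MISMATCH


def global_alignment_score(x, y):
    """Needleman-Wunsch score of two token lists, keeping two DP rows."""
    prev = [j * GAP for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        curr = [i * GAP]
        for j in range(1, len(y) + 1):
            curr.append(max(
                prev[j - 1] + score(x[i - 1], y[j - 1]),  # align x[i-1], y[j-1]
                prev[j] + GAP,                            # gap in y
                curr[j - 1] + GAP,                        # gap in x
            ))
        prev = curr
    return prev[-1]
```

With these weights, replacing a for loop by a while loop lowers the score only slightly, whereas replacing it by an unrelated token is penalized as a full mismatch.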

Let n be the number of keywords defined for a certain programming language. Then, our scoring matrix S is of size n×n, in which each element Sij is the score of aligning token i with token j.

C. Statistical significance of an alignment score

Given an alignment score, we would like to know how likely it is that the observed score occurs by chance. To test the significance of a similarity score, we follow this procedure:

1) Compute an alignment score between two source codes of approximately n tokens each.

2) Generate a distribution model from pairwise alignments of many unrelated source codes of size n. The unrelated source codes can be generated through random shuffling of one of the source codes. The shuffled source codes are then aligned to produce another alignment score. More alignment scores can be produced by repeated shuffling. We characterize our shuffling process with two parameters: the number of new documents created by the random shuffling process (Sn) and the degree of shuffling (Sd), expressed as a percentage. The latter parameter signifies how many lines of code are shuffled in a document. For example, if the source code has 100 statements, an Sd of 20% means that we pick and switch two random statements in the source code twenty times.

3) From the generated distribution model, we can calculate the sample mean µ and standard deviation σ. If x is our original alignment score, then a z-score can be obtained as follows:

$$z = \frac{x - \mu}{\sigma} \tag{1}$$

The calculated z-score should reflect how significant the observed alignment score is. The more significant it is, the more likely the two source codes are similar due to plagiarism.

token groups                          alignment score per match
(do-)while loop with for loop         0.8
if statement with switch statement    0.8
int, short, and byte                  0.8
float and double                      0.8
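The three steps of the significance procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not our implementation: documents are lists of statements, `score_fn` is any alignment scoring function supplied by the caller, the shuffled copies are made from the second document and aligned against the first, and a zero standard deviation is mapped to a z-score of 0.

```python
import random
import statistics


def shuffle_statements(statements, sd, rng):
    """Return a copy in which two random statements are swapped sd*len times."""
    shuffled = list(statements)
    if len(shuffled) < 2:
        return shuffled
    swaps = round(sd * len(shuffled))  # e.g. 100 statements, Sd=20% -> 20 swaps
    for _ in range(swaps):
        i, j = rng.sample(range(len(shuffled)), 2)
        shuffled[i], shuffled[j] = shuffled[j], shuffled[i]
    return shuffled


def z_score(doc_a, doc_b, score_fn, sn=100, sd=0.2, seed=0):
    """z-score of the observed alignment score against shuffled documents.

    sn: number of shuffled documents (Sn, at least 2);
    sd: degree of shuffling as a fraction (Sd).
    """
    rng = random.Random(seed)
    observed = score_fn(doc_a, doc_b)
    background = [
        score_fn(doc_a, shuffle_statements(doc_b, sd, rng))
        for _ in range(sn)
    ]
    mu = statistics.mean(background)
    sigma = statistics.stdev(background)  # sample standard deviation
    if sigma == 0:
        return 0.0  # assumption: a degenerate distribution carries no signal
    return (observed - mu) / sigma
```

A caller would pass an alignment scorer as `score_fn`; any function that takes two documents and returns a number fits this sketch.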

III. EXPERIMENTAL SETUP

In evaluating the performance of our proposed algorithm, we tested it on a set of artificial data. The artificial data originates from a set of unique source codes. We duplicated each source code by applying common plagiarism techniques, ranging from trivial alterations (e.g., producing exact duplicates, changing variable names, etc.) to non-trivial ones.

We purposely created artificial data to delineate the types of common plagiarism that can or cannot be detected by our algorithm. The types of plagiarism techniques applied to produce our set of artificial data are listed below:

1) changing class names

2) adding dummy classes

3) changing the order of the classes

4) changing the order of methods in a class

5) adding dummy methods

6) changing names of methods

7) breaking up a method into two methods

8) changing variable names

9) adding dummy variables

10) changing the order of how the variables are used or declared

11) adding bogus lines that have no meaning (e.g., duplicating previous assignments)

We produced 31 pairs of source codes by applying one or more of the alterations listed above. Among the 31 pairs, six pairs are unique and the rest are plagiarized pairs.

A. Validation

The validation of our proposed algorithm is carried out by creating a receiver operating characteristic (ROC) curve that plots the true positive rate against the false positive rate. In a set of similar source codes with known true labels (plagiarized work or not), the performance of our proposed algorithm is measured by counting how many labels are correctly predicted.

A prediction is counted as a true positive if both the predicted and true labels agree that the source code is a plagiarized work. Similarly, if both labels agree that the work is an individual work, then the prediction is counted as a true negative. In the case of a mismatch between the true label and the predicted label, the outcome can be either a false positive (if the true label is an individual work) or a false negative (if the true label of the source code is a plagiarized work). The above information can be summarized in Table II.

Fig. 2. ROC curve depicting the performance of our proposed algorithm on artificial data

Note that in this report, we have not yet incorporated the z-values in determining our prediction label, which will be included in subsequent reports of our research. Instead, we opted to use the alignment score to determine the threshold for labeling our results. We define the true positive rate (TPR) and false positive rate (FPR) as:

$$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN} \tag{2}$$

A single measure for the quality of an ROC curve is the area under the curve (AUC), which runs from 0.5 for random predictions to 1 for perfect ones.
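As an illustration of Equation (2) and of the AUC, the following Python sketch builds the ROC points by sweeping a threshold over the alignment scores and integrates them with the trapezoidal rule. It assumes that a pair is predicted as plagiarized when its score is at or above the threshold, and that the data contain at least one plagiarized and one non-plagiarized pair; the function names are chosen for illustration.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points, one per distinct score used as threshold.

    scores: alignment scores; labels: True for plagiarized pairs.
    """
    positives = sum(1 for label in labels if label)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and not l)
        # TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area
```

The lowest threshold labels every pair as plagiarized, so the curve always ends at the point (1, 1).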

To assess the usefulness of our random shuffling method, we plot the z-score trend obtained when applying a particular Sn and Sd. The random shuffling would be useful if the resulting z-scores are near zero for non-plagiarized pairs and extreme (i.e., far from zero) for plagiarized pairs.

IV. RESULTS AND DISCUSSIONS

AUC value indicates most of the artificial data are correctly classified

We applied our proposed algorithm on the 31 source code pairs, which consist of 25 plagiarized pairs and