
Effective and Efficient Patch Validation via Differential Fuzzing



Attribution-NonCommercial-NoDerivs 2.0 Korea

Users are free, only under the following conditions, to:

• copy, distribute, transmit, display, perform, and broadcast this work.

The following conditions must be observed:

• When reusing or distributing this work, you must clearly indicate the license terms applied to it.

• These conditions do not apply if separate permission is obtained from the copyright holder.

Your rights under copyright law are not affected by the above. This is an easy-to-understand summary of the Legal Code of the license.

Disclaimer

Attribution. You must credit the original author.

NonCommercial. You may not use this work for commercial purposes.

NoDerivatives. You may not alter, transform, or build upon this work.


Master’s Thesis

Effective and Efficient Patch Validation via Differential Fuzzing

Eui Bin Bae

Department of Computer Science and Engineering

Ulsan National Institute of Science and Technology

2022


Effective and Efficient Patch Validation via Differential Fuzzing

Eui Bin Bae

Department of Computer Science and Engineering

Ulsan National Institute of Science and Technology


Abstract

Developments in automatic program repair (APR) technology have made it possible to automate the creation of patches. However, due to its loose precision (the current technology only recognizes a patch as correct if it passes the tests), it is still left up to the developer to determine whether a generated patch is correct.

Accordingly, techniques for determining whether a generated patch is correct are actively being studied. Existing methods mainly score how likely a patch is to be correct; the computed score is compared with a predetermined threshold to decide whether the patch is correct. However, it is difficult to improve both recall and precision no matter how the threshold is set. For example, ODS filters out many bad patches, but it also filters out many correct patches. In this study, we try to solve this problem through an evidence-based method. We take differing results before and after the patch as the minimum condition for judging a patch to be incorrect. Under this condition, if the input that produces the differing results represents a passing test, the differing test results serve as evidence that the patch is incorrect. We use differential fuzzing to find program inputs that produce different results before and after patching. To determine whether a found input represents a passing test, we use the TEST-SIM+ heuristic, an improved version of the existing TEST-SIM heuristic. Finally, we develop a purging technique specialized for TEST-SIM+.


Contents

I   Introduction
II  Background and Related Work
    2.1  APR (Automatic Program Repair) tools
    2.2  Fuzzing
    2.3  Test Classification
III Our approach
    3.1  Differential Fuzzing
    3.2  Test classification
IV  Assessment
    4.1  Dataset
    4.2  Result
    4.3  Threshold based on Distance
V   Discussion
VI  Conclusion
    6.1  Future work
References
Acknowledgements

(9)

List of Figures

1  Three phases in our workflow
2  Motivating example
3  Average of distance in each version


I Introduction

The history of efforts to make debugging easier and faster is arguably as long as the history of programming itself. Building on such efforts, technologies such as fault localization and automatic test generation have contributed to solving the problem. Of course, these partial contributions alone could not completely solve it. Ultimately, what we expect from these technologies is automatic program repair (APR). APR technology has also made significant progress building on previous research results and has proven its effectiveness on bugs in the Defects4j benchmark. Such bug-fixing tools usually go through three steps. A patch candidate Pc generated in the first step is validated in the second step, typically with a user-given test suite. If Pc passes all tests, a developer should assess Pc in the last step, since passing all tests does not guarantee the correctness of Pc. Furthermore, the first-found patch is often incorrect in multiple recent APR systems [1–4]. These systems generate a list of plausible (test-suite-passing) patches so that a correct patch existing in the patch space is not missed. Among the list of plausible patches, the developer must manually identify a correct patch. The last assessment step has not been studied much, since APR systems were initially designed to generate only the first patch that passes all available tests. Because a test suite is not a strict enough correctness criterion, hundreds and sometimes even thousands of plausible patches are often generated [1–5], and the manual assessment step can take a long time. To avoid this, patches can be reviewed in the order of their rankings [1, 4], but ranking algorithms are often imprecise, and a correct patch is not necessarily ranked high.

Therefore, we need a way to make such judgments more easily. The currently most widely adopted approach to this problem is the score-based method: the results obtained from the patch are scored, and the score is compared with a specific threshold. This is convenient because it establishes an intuitive and simple standard. However, we have to consider two important criteria for judging a patch: recall and precision. Since it is difficult to satisfy both criteria with a simple threshold comparison, we need another basis for judgment. We therefore approach this problem through an evidence-based method, which presents evidence that a patch is wrong and rejects the patch based on it. The minimum condition for such evidence is that the patched program must produce a different result from the original program. However, this condition alone is not strict enough and, on the contrary, risks lowering the overall precision: changes in results can also come from the correct modifications made by a patch, in which case the patch should be considered correct. If the change is a correct fix, it will only show up in failing tests. Thus, if we can tell whether a differing result appears in a passing test or not, we have an additional condition for rejecting the patch. In this work, we propose a tool that determines whether a test is passing or failing by using a fuzzing tool. For this approach to be successful, the following should be satisfied. First, incorrect patches should be filtered out as much as possible; that is, we need high recall. Second and more importantly, correct patches, considering their scarcity [5] and importance, should not be discarded, meaning that we also need perfect (or near-perfect) precision. Lastly, the proposed specification methodology should be easy to use.


We propose a novel specification methodology consisting of the following three phases.

Figure 1: Three phases in our workflow

Phase 1: Identifying a failing test. Instead of using a white-box approach, we advocate a black-box approach for ease of use. Specifically, we suggest generalizing a failing test, which is a mandatory input to a test-driven APR system.

Phase 2: Parameterizing a failing test. To generalize the input of the existing test, we parameterize the original test (constant values appearing in the test code are turned into parameters) and mutate the parameters using a fuzzer. The obtained parameterized test can be viewed as a parameterized unit test (PUT) [6] or a property-based test (PBT) [7, 8].
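To make Phase 2 concrete, the following is a minimal, hypothetical sketch (the Calculator class, its constants, and the test body are invented for illustration and are not taken from the thesis) of turning a constant-only JUnit test into a junit-quickcheck property whose parameters a fuzzer can mutate:

```java
import com.pholser.junit.quickcheck.Property;
import com.pholser.junit.quickcheck.runner.JUnitQuickcheck;
import org.junit.runner.RunWith;

// Hypothetical production method standing in for the buggy code under repair.
class Calculator {
    static double evaluate(double a, double b, double c) {
        return a * b + c;
    }
}

@RunWith(JUnitQuickcheck.class)
public class ParameterizedFailingTest {

    // Original failing test (for reference):
    //   assertEquals(7.0, Calculator.evaluate(2.0, 3.0, 1.0), 1e-9);
    // The three constants 2.0, 3.0, and 1.0 become the parameters d1, d2, d3.

    @Property
    public void generalizedTest(double d1, double d2, double d3) {
        double result = Calculator.evaluate(d1, d2, d3);
        // No fixed assertion: in our setting, pass/fail is decided later
        // (Phase 3) by comparing this run's output and trace against the
        // pre-patch version and the known passing/failing tests.
        System.out.println(result);
    }
}
```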

Phase 3: Classifying each test using the program behavior observed in each test run from fuzzing. There are two main considerations: whether a different output occurred, and the trace record from the test. By comparing the distance between this trace record and the trace records obtained from the passing and failing tests of the written test suite, we finally make an Accept/Reject decision.
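Putting the three phases together, the rejection condition we rely on can be restated informally in our own notation (this is a paraphrase, not a formula from the thesis): let $\mathit{out}_{P}(i)$ and $\mathit{out}_{\mathit{orig}}(i)$ denote the observable outputs of the patched and original versions on a fuzzer-generated input $i$, and let $t_i$ be the corresponding test run classified in Phase 3. Then

\[
\text{Reject}(P) \iff \exists\, i :\; \mathit{out}_{P}(i) \neq \mathit{out}_{\mathit{orig}}(i) \;\wedge\; \text{classification}(t_i) = \text{passing}.
\]

If no such input is found within the fuzzing budget, the patch is kept for manual review.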

Overall, we make the following contributions in this work:

• A specification methodology for patch validation: We show for the first time how an existing test can be generalized to filter out incorrect patches. Our lightweight specification method allows developers to keep using their familiar specification interface, i.e., test code. We apply the proposed specification methodology to the failing tests of 17 buggy versions in our dataset.

• Empirical findings: One concern with our black-box approach is that, while performing differential fuzzing, it can be difficult to observe a state difference between the versions, since internal state differences that occur in the production code are not necessarily propagated to the test code. To assess this concern, we run a differential fuzzer over the 100 patches in our dataset. We found that, in fact, many incorrect patches are in the vicinity of the original failing test; many incorrect patches are detected quickly (in about 30 seconds).

• Reduction in manual patch-review effort: Our work is motivated by the high cost of assessing a large number of plausible patches. By filtering out incorrect patches, the user only needs to review the remaining ones.

• Research on improving test classification and laying the groundwork for it: We present the shortcomings and limitations of the existing test classification and establish a baseline to evaluate it, in order to suggest improvements to fuzzing and the test classification formula in the future.


II Background and Related Work

2.1 APR (Automatic Program Repair) tools

An APR tool is, as the name suggests, a tool that automatically fixes bugs in a program. Such tools take a buggy program and its test suite as input and then suggest modifications that make the modified program pass the tests. However, the patches generated in this way merely pass the tests and cannot be assumed to have correct behavior. One of the major challenges in APR is the overfitting problem [9], which occurs because many incorrect patches exist in the patch space; they are often not filtered out by a given test suite [5, 9].

Those patches are called plausible patches. To address this problem, many APR systems use one of the following three techniques:

(1) patch ranking algorithms
(2) score-based patch classification
(3) evidence-based patch classification

Patch ranking

The idea of patch ranking is to identify patches that are more likely to be correct and place them at higher rankings than the others. For example, Prophet [10] ranks patch candidates based on a probabilistic model learned from existing patches. Other APR tools, such as ACS [11], CapGen [12], SimFix [13], and JAID [1] similarly perform patch ranking. However, the ranking algorithms are often imprecise, and it is common that a correct patch is not ranked first. As the patch space increases, this problem tends to be exacerbated, and correct patches are often missed out [5].

Score-based patch classification

Thus, many recent studies propose improved patch classification (PC) techniques [2, 14–19]. The goal of a PC technique is to filter out incorrect patches while keeping correct patches. Score-based PC techniques perform patch classification by computing scores for the patches. In anti-patterns (common patterns of incorrect patches) [14], a simple binary scoring scheme is used: patches belonging to anti-patterns receive a low score and are rejected; otherwise, patches are accepted. However, as mentioned for patch ranking, such a threshold is often not precise enough, and it cannot be guaranteed that the correct patch is always ranked first, so it should be considered impossible to set a threshold that selects only correct patches. To address this problem, more advanced score-based classification methods have been proposed.


ML-based classification

Various ML-based classification techniques have also been developed, using hand-crafted features [19] or embedding techniques [15]. These techniques compare the computed scores with a threshold to perform classification, and the threshold is typically chosen empirically (e.g., using training data). Nevertheless, it is still difficult to choose a threshold that makes recall high while keeping precision close to 100% [20]. Also, these approaches do not provide a semantic explanation for the classification decision.

Evidence-based approaches [2, 18] do not have the limitations of the score-based approaches, since a patch is rejected only with concrete evidence of an error. Opad [18] uses a fuzzer to detect crashing patches. Fix2Fit [2] similarly uses a fuzzer to avoid generating crashing patches. However, detecting non-crashing incorrect patches remains an open problem.

PATCH-SIM

PATCH-SIM [17] computes the path similarity between the execution paths before and after the patch. It tests a given patch using the written tests and test-generation tools, and records its execution path. Then, the distance between this execution path and the execution path before applying the patch is measured, and the patch is classified based on it. The classification formula is as follows.

\[
\text{classification} =
\begin{cases}
\text{incorrect} & A_p \ge \text{threshold} \\
\text{incorrect} & A_p \ge A_f \\
\text{correct} & \text{otherwise}
\end{cases}
\tag{1}
\]

where

\[
A_p = \max\left(\, \mathit{distance}_p(t) \mid \text{classification}(t) = \text{passing} \,\right) \tag{2}
\]

However, this classifier is also not precise enough. Even if the obtained result is closer to the case of the passing test than to the case of the failing test, the closeness cannot be considered sufficient to guarantee correctness.

Evidence-based patch classification

Thus, we can apply evidence-based patch classification to solve this problem more fundamentally. The following example shows its effectiveness.


Motivating example

Figure 2: Motivating example

Consider a scenario where an APR tool returns a list of plausible patches, and the user finds a correct patch among them. Suppose that the list contains many incorrect patches, including the one shown in Figure 2(a) and a correct patch shown in Figure 2(b), all of which pass all available tests. Note that the size of the list is often large.

For example, JAID [1] generates 1263 patches for the example buggy version (Math95 in the Defects4J benchmark [21]). To expedite the patch review process, the user may want to first filter out incorrect patches using a patch classification (PC) technique before reviewing the remaining patches. If she uses PATCH-SIM [17], one of the state-of-the-art PC tools, the example incorrect patch is not filtered out. Being disappointed, she may try out a recent ML-based PC tool, ODS [19]. Unfortunately, she only finds that ODS is even more disappointing, since it filters out the correct patch! This example illustrates the challenge of patch classification. Users want to filter out incorrect patches as much as possible, but the last thing they want is to discard correct patches, which are only scarcely available. Both PATCH-SIM [17] and ODS [19] use score-based approaches. They compute a score for a given patch and make a classification decision by comparing the obtained score with a chosen threshold. If a threshold is chosen conservatively, as in PATCH-SIM, many incorrect patches are not filtered out. Meanwhile, if a threshold is determined more aggressively, as in ODS, many correct patches are also filtered out. In this work, we use an evidence-based approach instead.

We reject a patch only when concrete evidence for rejection is found via fuzzing, thus guaranteeing high precision. Fuzzing is a testing technique that repeatedly executes a test with randomly generated input values; it is explained in more detail in Section 2.2. Note that existing evidence-based approaches [2, 18] rely only on program crashes as concrete evidence. However, using only implicit oracles is not enough, and the current evidence-based approaches suffer from low recall [16]. To provide more help to developers, we generalize the evidence-based approach by looking for any kind of output discrepancy between the patched and pre-patched versions.


2.2 Fuzzing

To find discrepancies using fuzzing, we start by generalizing a given failing test. Figure 2(c) shows the failing test for our example buggy version, and Figure 2(d) shows how we generalize the three constants of the existing test into three parameters d1, d2, and d3. Then, using a QuickCheck framework [7] such as junit-quickcheck [8], we can obtain various random values for these parameters to perform differential fuzzing. Supplying purely random inputs can be inefficient, so recent fuzzing tools adopt a greybox approach that uses coverage feedback. In addition, the energy-allocation techniques proposed for directed greybox fuzzing (DGF) [20] make it easier to steer fuzzing toward the situations we are most interested in. One remaining problem is that not all output discrepancies are evidence of the patch's incorrectness, since certain output changes are expected from the patch itself.
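Since Figure 2 is not reproduced in this text, the following is only a rough sketch of what the generalized test of Figure 2(d) might look like. It assumes (an assumption on our part, not taken from the figure) that the Math95 failing test constructs an F-distribution and round-trips a probability through cumulativeProbability and inverseCumulativeProbability:

```java
import com.pholser.junit.quickcheck.Property;
import com.pholser.junit.quickcheck.generator.InRange;
import com.pholser.junit.quickcheck.runner.JUnitQuickcheck;
import org.apache.commons.math.distribution.FDistributionImpl;
import org.junit.runner.RunWith;

@RunWith(JUnitQuickcheck.class)
public class Math95GeneralizedTest {

    // The three constants of the original failing test become the parameters
    // d1, d2, and d3; @InRange keeps the fuzzer near plausible values.
    @Property
    public void generalizedFailingTest(
            @InRange(minDouble = 0.1, maxDouble = 10.0) double d1,
            @InRange(minDouble = 0.1, maxDouble = 10.0) double d2,
            @InRange(minDouble = 0.0, maxDouble = 1.0) double d3) throws Exception {
        FDistributionImpl dist = new FDistributionImpl(d1, d2);
        double p = dist.cumulativeProbability(d3);
        // Differential fuzzing compares this call's result (and any thrown
        // exception) between the pre-patch and patched versions.
        dist.inverseCumulativeProbability(p);
    }
}
```

junit-quickcheck feeds random values for d1, d2, and d3; under a coverage-guided harness such as JQF (Section 3.1), the same property can be fuzzed with coverage feedback and run against both versions for the differential comparison.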

2.3 Test Classification

To solve this problem, it is necessary to classify in which kind of case the change occurred, and the comparison of distances between execution paths used in PATCH-SIM can be used for this classification. PATCH-SIM uses the TEST-SIM classifier, which works as follows:

\[
\text{classification} =
\begin{cases}
\text{passing} & A_p < A_f \\
\text{failing} & A_p > A_f \\
\text{discarded} & A_p = A_f
\end{cases}
\tag{3}
\]

where

\[
A_p = \min\left(\, \mathit{distance}(t, t') \mid \text{classification}(t') = \text{passing} \,\right), \qquad
A_f = \min\left(\, \mathit{distance}(t, t') \mid \text{classification}(t') = \text{failing} \,\right)
\tag{4}
\]
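As a reading aid, a direct rendering of equations (3) and (4) in Java might look as follows; the names TestVerdict, classify, and minDistance are ours, not part of TEST-SIM:

```java
import java.util.List;

/** Sketch of the TEST-SIM decision in equations (3) and (4). */
enum TestVerdict { PASSING, FAILING, DISCARDED }

class TestSimClassifier {

    /**
     * Classifies a generated test from its minimum trace distance to the
     * known passing tests (ap) and to the known failing tests (af).
     */
    static TestVerdict classify(double ap, double af) {
        if (ap < af) return TestVerdict.PASSING;
        if (ap > af) return TestVerdict.FAILING;
        return TestVerdict.DISCARDED;              // ap == af: no decision
    }

    /** A_p or A_f: the minimum distance from the generated test to a set of known tests. */
    static double minDistance(List<Double> distancesToKnownTests) {
        return distancesToKnownTests.stream()
                .mapToDouble(Double::doubleValue)
                .min()
                .orElse(Double.POSITIVE_INFINITY); // no known test reaches the patched method
    }
}
```

Here classify would be invoked with A_p and A_f computed by minDistance over the trace distances of equation (5) in Section 3.2.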


III Our approach

Our proposed test classification tool consists of two main parts. First, we feed parameterized tests into a differential fuzzing tool to generate numerous test runs. Second, we classify each test based on the program behavior obtained from each run.

3.1 Differential Fuzzing

Our fuzzer takes as input a generalized test and a pair of pre-patched and patched versions. Then, it searches for an input that violates the given preservation invariant. We customize JQF [22], a coverage-guided fuzzer for Java, so that our two specification APIs (logOutIf and ignoreOutOfOrg) are supported and differential fuzzing, which is not supported in the original JQF, can be performed. When generating random input, our fuzzer chooses a random value in the range [c − δ, c + δ], where c is the original constant replaced by a parameter and δ is chosen adaptively. In general, the fuzzing space grows as a larger range is used, decreasing fuzzing efficiency. Our fuzzer therefore gradually widens the range until a random input in the range violates the preservation invariant.
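The adaptive range selection could be sketched roughly as follows; the doubling schedule and the class and method names are our assumptions, since the text only states that δ is chosen adaptively and that the range is widened gradually:

```java
import java.util.Random;

/**
 * Sketch of sampling a parameter value near its original constant c from the
 * range [c - delta, c + delta], widening the range over time. The doubling
 * schedule is an assumption; the thesis does not specify how delta grows.
 */
class AdaptiveRangeSampler {
    private final double originalConstant;  // c: the constant replaced by a parameter
    private double delta;                    // current half-width of the sampling range
    private final Random random = new Random();

    AdaptiveRangeSampler(double originalConstant, double initialDelta) {
        this.originalConstant = originalConstant;
        this.delta = initialDelta;
    }

    /** Draws a value uniformly from [c - delta, c + delta]. */
    double next() {
        return originalConstant - delta + 2 * delta * random.nextDouble();
    }

    /** Called when no violating input has been found yet; enlarges the search space. */
    void widen() {
        delta *= 2;
    }
}
```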

3.2 Test classification

We classify tests by applying the ideas of TEST-SIM presented in the PATCH-SIM tool. TEST-SIM uses Randoop to generate additional tests beyond the existing written tests. To find out whether the generated tests are passing tests or failing tests, it uses the execution path obtained from each test. The TEST-SIM idea is to compare the execution path of a newly created test with those of the known passing and failing tests. If the execution path of the generated test is more similar to that of a passing test, it is regarded as a passing test; otherwise, it is regarded as a failing test. To evaluate this similarity, the LCS (Longest Common Subsequence) is used. The distance is obtained through the following formula.

\[
\mathit{distance}(a, b) = 1 - \frac{|\mathit{LCS}(a, b)|}{\max(|a|, |b|)} \tag{5}
\]

In addition, since the generated tests have different structures, the results may vary greatly when looking at the entire execution path. Therefore, in TEST-SIM, only the execution path within the patched method is observed. Also, for the case where the patched method is not reached by any passing test, the TEST-SIM tool sets a threshold. Since such cases also occur in this study, it is necessary to set the threshold separately. In TEST-SIM, an appropriate threshold is set experimentally through parameter tuning. In this study, in order to prevent correct patches from being rejected as much as possible, the largest value among the observed passing-test distances was adopted.
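For concreteness, equation (5) can be computed over traces represented, for example, as lists of executed statement identifiers; the trace representation and the class name below are our assumptions:

```java
import java.util.List;

/** Sketch of the LCS-based trace distance of equation (5). */
class TraceDistance {

    /** distance(a, b) = 1 - |LCS(a, b)| / max(|a|, |b|). */
    static double distance(List<String> a, List<String> b) {
        int maxLen = Math.max(a.size(), b.size());
        if (maxLen == 0) return 0.0;          // two empty traces are identical
        return 1.0 - (double) lcsLength(a, b) / maxLen;
    }

    /** Standard dynamic-programming computation of the LCS length. */
    static int lcsLength(List<String> a, List<String> b) {
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++) {
            for (int j = 1; j <= b.size(); j++) {
                if (a.get(i - 1).equals(b.get(j - 1))) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                } else {
                    dp[i][j] = Math.max(dp[i - 1][j], dp[i][j - 1]);
                }
            }
        }
        return dp[a.size()][b.size()];
    }
}
```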


IV Assessment

We evaluate this Test Classification tool using the following metrics:

\[
\text{precision} = \frac{\text{number of rejected incorrect patches}}{\text{total number of rejected patches}} \tag{6}
\]

\[
\text{recall} = \frac{\text{number of rejected incorrect patches}}{\text{total number of incorrect patches}} \tag{7}
\]

To evaluate this, an experiment was performed on a total of 101 patches, and differential fuzzing was performed for 10 minutes for each patch. All our experiments were performed on a machine with an Intel Xeon Gold CPU and 128 GB of memory.

4.1 Dataset

We used the Math project of the Defects4j benchmark as our subject.

Version             33  53  69  80  81  82  84  85  87  93  95  105  Total
Patches              8   6   3  19  27  16  12   2   4   3   6    4    110
Incorrect Patches    5   3   2  18  26   9  12   2   3   2   6    4     92

Table 1: Dataset from the Defects4j Math project

4.2 Result

Rejected    Precision         Recall
56          47/56 (83.9%)     47/92 (51.1%)

Table 2: Result from our tool
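Reading Table 2 with the metrics of equations (6) and (7):

\[
\text{precision} = \frac{47}{56} \approx 83.9\%, \qquad \text{recall} = \frac{47}{92} \approx 51.1\%,
\]

that is, 47 of the 56 rejected patches are indeed incorrect, and 47 of the 92 incorrect patches in the dataset are rejected.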

The performance reported in the original PATCH-SIM paper is as follows.

Project   Incorrect   Correct   Incorrect Excluded   Correct Excluded
Chart          23         3         14 (60.9%)              0
Lang           11         4          6 (54.5%)              0
Math           63        20         33 (52.4%)              0
Time           13         2          9 (69.2%)              0
Total         110        29         62 (56.3%)              0

Table 3: Result in PATCH-SIM


Overall, our tool shows poorer performance than PATCH-SIM. PATCH-SIM achieved a recall of 56% overall, and a recall of 52% even when only the Math project is considered. In addition, it excluded no correct patch, so it obtained 100% precision. The reasons for this will be analyzed in the next chapter.

4.3 Threshold based on Distance

If none of the passing tests reach the patched method, we can use a specific threshold as the comparison target instead.

Figure 3: Average of distance in each version

Looking at Figure 3, it can be seen that, on average, both the passing tests and the failing tests show a longer distance for incorrect patches. Based on this, we can introduce a threshold; in this study, the farthest among the measured failing-test distances was adopted to preserve precision as much as possible. Using this threshold, we were able to reject additional incorrect patches. The result of applying it is as follows.

Rejected    Precision         Recall
91          82/91 (90.1%)     82/92 (89.1%)

Table 4: Result from our tool with threshold

There is no evaluation of the untreated patches or of non-Math subjects, but in terms of recall, our new tool showed better results than PATCH-SIM. However, PATCH-SIM excluded no correct patch at all, whereas our tool's precision was less than 100%. Even though the recall is better, keeping correct patches is relatively more important, so future studies will need to find a way to increase the precision.


V Discussion

In fact, the test suite provided by Defects4j is not sufficient. To compensate for this shortage, PATCH-SIM classifies patches using additional information: the generated tests that TEST-SIM classifies as passing are also utilized. In this study, this part still needs to be established, since we only classify the tests generated through fuzzing. For example, this study mainly considers the case where different outputs occur, but the case where the same output occurs could also be regarded as a kind of passing test.


VI Conclusion

In this study, we proposed a new criterion for patch validation using fuzzing, and we evaluated the results of applying patch validation with this criterion. Through comparison with the results of existing studies, we analyzed the problems found in the results and suggested areas to be improved. In future research, we expect that improvements can be proposed and evaluated building on these results.

6.1 Future work

Test classifier

Since the LCS-based distance used in TEST-SIM looks only at the execution order, it may miss aspects of the overall program behavior. In the future, it is necessary to apply another test classification method that compensates for this.

Guidance for fuzzing

The current TEST-SIM idea simply compares the magnitudes of the distances to the passing and failing test cases. However, if we want to guarantee the quality of a test, we need to show that it is really close enough to a passing test. Fuzzing can be guided in whatever way the user wants, so we can deliberately create situations that are close enough, or not. Using this, a more rigorous test classifier could be implemented.

Distance

Since LCS is a rather basic distance calculation method, it is poorly suited to a structured sequence such as a program trace. Therefore, we plan to find and apply a distance calculation method more suitable for program analysis.

Also, many versions are not currently covered due to limitations of the fuzzer or of the distance calculation method, and an additional approach is needed to cover them. For example, looking not only at the patched method, as in TEST-SIM, but also at how the behavior of the method's callers or of subsequent methods changes, may be an important key in judging the behavior of a patch.


References

[1] L. Chen, Y. Pei, and C. A. Furia, “Contract-based program repair without the contracts: An extended study,” IEEE Transactions on Software Engineering, 2020.

[2] X. Gao, S. Mechtaev, and A. Roychoudhury, “Crash-avoiding program repair,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 8–18.

[3] A. Ghanbari, S. Benton, and L. Zhang, “Practical program repair via bytecode mutation,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 19–30.

[4] C.-P. Wong, P. Santiesteban, C. Kästner, and C. Le Goues, “Varfix: balancing edit expressiveness and search effectiveness in automated program repair,” in Proceedings of the 29th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021, pp. 354–366.

[5] F. Long and M. Rinard, “An analysis of the search spaces for generate and validate patch generation systems,” in 2016 IEEE/ACM 38th International Conference on Software Engineering (ICSE). IEEE, 2016, pp. 702–713.

[6] N. Tillmann and W. Schulte, “Parameterized unit tests,” ACM SIGSOFT Software Engineering Notes, vol. 30, no. 5, pp. 253–262, 2005.

[7] K. Claessen and J. Hughes, “Quickcheck: a lightweight tool for random testing of haskell programs,” in Proceedings of the fifth ACM SIGPLAN International Conference on Functional Programming, 2000, pp. 268–279.

[8] P. Holser, “junit-quickcheck: Property-based testing, junit-style,” 2019.

[9] E. K. Smith, E. T. Barr, C. Le Goues, and Y. Brun, “Is the cure worse than the disease? Overfitting in automated program repair,” in Proceedings of the 2015 10th Joint Meeting on Foundations of Software Engineering, 2015, pp. 532–543.

[10] F. Long and M. Rinard, “Automatic patch generation by learning correct code,” in Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016, pp. 298–312.

[11] Y. Xiong, J. Wang, R. Yan, J. Zhang, S. Han, G. Huang, and L. Zhang, “Precise condition synthesis for program repair,” in 2017 IEEE/ACM 39th International Conference on Software Engineering (ICSE). IEEE, 2017, pp. 416–426.

[12] M. Wen, J. Chen, R. Wu, D. Hao, and S.-C. Cheung, “Context-aware patch generation for better automated program repair,” in 2018 IEEE/ACM 40th International Conference on Software Engineering (ICSE). IEEE, 2018, pp. 1–11.

[13] J. Jiang, Y. Xiong, H. Zhang, Q. Gao, and X. Chen, “Shaping program repair space with existing patches and similar code,” in Proceedings of the 27th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2018, pp. 298–309.

[14] Y. Wu, “Anti-patterns for java automated program repair tools,” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 1367–1369.

[15] H. Tian, K. Liu, A. K. Kaboré, A. Koyuncu, L. Li, J. Klein, and T. F. Bissyandé, “Evaluating representation learning of code changes for predicting patch correctness in program repair,” in 2020 35th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2020, pp. 981–992.

[16] S. Wang, M. Wen, B. Lin, H. Wu, Y. Qin, D. Zou, X. Mao, and H. Jin, “Automated patch correctness assessment: How far are we?” in Proceedings of the 35th IEEE/ACM International Conference on Automated Software Engineering, 2020, pp. 968–980.

[17] Y. Xiong, X. Liu, M. Zeng, L. Zhang, and G. Huang, “Identifying patch correctness in test-based program repair,” in Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 789–799.

[18] J. Yang, A. Zhikhartsev, Y. Liu, and L. Tan, “Better test cases for better automated program repair,” in Proceedings of the 2017 11th Joint Meeting on Foundations of Software Engineering, 2017, pp. 831–841.

[19] H. Ye, J. Gu, M. Martinez, T. Durieux, and M. Monperrus, “Automated classification of overfitting patches with statically extracted code features,” IEEE Transactions on Software Engineering, 2021.

[20] M. Böhme, V.-T. Pham, M.-D. Nguyen, and A. Roychoudhury, “Directed greybox fuzzing,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, 2017, pp. 2329–2344.

[21] R. Just, D. Jalali, and M. D. Ernst, “Defects4j: A database of existing faults to enable controlled testing studies for java programs,” in Proceedings of the 2014 International Symposium on Software Testing and Analysis, 2014, pp. 437–440.

[22] R. Padhye, C. Lemieux, and K. Sen, “Jqf: coverage-guided property-based testing in java,” in Proceedings of the 28th ACM SIGSOFT International Symposium on Software Testing and Analysis, 2019, pp. 398–401.


Acknowledgements

Some may say that this is only a master's thesis, but since I am a person with many shortcomings, I was able to achieve even this only with the help of many people. The person I would like to thank the most is Professor Jooyong Yi, my advisor. When I first joined his lab, Professor Yi gave me the opportunity to work with him, despite knowing that I had many shortcomings. Thanks to his generosity, I have been able to grow over the past year and a half. I am still inexperienced and make many mistakes, but it is thanks to Professor Yi that I have gained more confidence in myself and in the way forward.

I would also like to thank Prof. Mijung Kim and Prof. Hyungon Moon for reviewing this inexperienced thesis. The comments the professors made during the defense, and the further improvements they suggested, will be remembered not only in future research but also in any work I do after graduation.

Finally, I would like to thank my family and friends who have supported me. The excessive stress I suffered whenever research didn’t go well could not have been overcome without them. Just as I overcame difficult moments with their help, I also hope that I can grow into a person who can be a strength to them when they are struggling.

