Scaling classical clone detection tools for ultra large datasets
Scaling Classical Clone Detection Tools for Ultra-Large Datasets
Jeffrey Svajlenko, Iman Keivanloo, Chanchal Roy IWSC 2013
Inter-Project Clone Detection
• Active research topic in the community.
• Goal: Construct inter-project clone corpus.
• Applications
  • Study Global Developer Behavior
  • Discover Potential APIs and Libraries
  • Internet-Scale Clone Search
  • API Recommendation
  • API Usage Support
  • …
Problem: Inter-Project Detection
• Many state-of-the-art tools (classical tools) do not scale to large datasets.
  • Memory Requirements
  • Computational Complexity
  • Execution Time
  • Underlying limitations in their algorithms or data structures.
• Instead, novel scalable techniques are used.
  • Challenging to develop.
• We wish to use tools from a variety of domains when building an inter-project clone corpus.
Goal and Motivation
GOAL: To scale classical clone detection tools to ultra-large datasets.
MOTIVATION: To allow classical clone detection tools to contribute to inter-project clone corpora.
Shuffling Framework
• Scales classical tools to ultra-large datasets.
  • Using standard hardware.
  • Without modifying the original tool.
• Incurs a loss of recall.
• Method: Non-Deterministic Dataset Partitioning
Shuffling Framework - Procedure
1. The source files of the dataset are randomly partitioned into n equally sized subsets.
[Figure: an ultra-large dataset divided into subsets numbered 1 to 16.]
Subset size is dictated by the clone detection tool’s scalability limits.
Shuffling Framework - Procedure
2. Each subset is searched independently by the clone detection tool.
[Figure: subsets 1, 2, ..., 16, each passed to its own run of the clone detection tool.]
Shuffling Framework - Procedure
3. The detected clone pairs are added to a clone repository.
[Figure: subsets 1 to 16 feeding a "Detected Clones" repository.]
Shuffling Framework - Procedure
4. Steps (1) through (3) are repeated for r rounds.
[Figure: loop from the dataset to the clone repository, repeated for r rounds.]
In total: n*r detection experiments.
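The four steps above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `detect` is a hypothetical callable standing in for one run of an unmodified clone detection tool, clone pairs are assumed to be pairs of string IDs, and `toy_detect` is a made-up detector used only to make the sketch runnable.

```python
import random
from typing import Callable, Iterable, List, Set, Tuple

ClonePair = Tuple[str, str]  # assumption: a clone pair is two fragment IDs


def partition(files: List[str], n: int, rng: random.Random) -> List[List[str]]:
    """Randomly split files into n subsets whose sizes differ by at most one."""
    shuffled = files[:]
    rng.shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def shuffle_detect(
    files: List[str],
    n: int,
    r: int,
    detect: Callable[[List[str]], Iterable[ClonePair]],
    seed: int = 0,
) -> Set[ClonePair]:
    """Run r rounds of partition-and-detect, merging results into one repository."""
    rng = random.Random(seed)
    repository: Set[ClonePair] = set()
    for _ in range(r):                           # step 4: repeat for r rounds
        for subset in partition(files, n, rng):  # step 1: random partitioning
            for a, b in detect(subset):          # step 2: tool sees one subset only
                # step 3: store each pair once, regardless of reported order
                repository.add((a, b) if a <= b else (b, a))
    return repository


def toy_detect(subset: List[str]) -> List[ClonePair]:
    """Hypothetical stand-in for a real tool: same first letter means clone."""
    ordered = sorted(subset)
    return [
        (a, b)
        for i, a in enumerate(ordered)
        for b in ordered[i + 1:]
        if a[0] == b[0]
    ]
```

With n = 1 and r = 1 the sketch degenerates to a native run of the tool over the whole dataset. For larger n, a pair can only be reported in a round where both of its files land in the same subset, which is where the loss of recall comes from.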
Shuffling Framework - Evaluation
Gold Standard
• Clone detection report of a tool executed natively (without shuffling).
Total Recall
• % of gold standard found after r shuffling rounds of n partitions.
• Measured for unique clone pairs or unique cloned fragments.
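As a rough illustration of the two recall measures, the sketch below computes total recall over clone pairs and over cloned fragments. The representation is an assumption: clone pairs are taken to be pairs of fragment IDs, and the function names are hypothetical rather than taken from the slides.

```python
from typing import Iterable, Set, Tuple

Pair = Tuple[str, str]


def _normalize(pairs: Iterable[Pair]) -> Set[Pair]:
    """Treat (a, b) and (b, a) as the same clone pair."""
    return {(a, b) if a <= b else (b, a) for a, b in pairs}


def pair_recall(found: Iterable[Pair], gold: Iterable[Pair]) -> float:
    """Fraction of gold-standard clone pairs present in the repository."""
    gold_pairs = _normalize(gold)
    if not gold_pairs:
        raise ValueError("gold standard is empty")
    return len(_normalize(found) & gold_pairs) / len(gold_pairs)


def fragment_recall(found: Iterable[Pair], gold: Iterable[Pair]) -> float:
    """Fraction of gold-standard cloned fragments that appear in any found pair."""
    gold_fragments = {f for pair in gold for f in pair}
    if not gold_fragments:
        raise ValueError("gold standard is empty")
    found_fragments = {f for pair in found for f in pair}
    return len(found_fragments & gold_fragments) / len(gold_fragments)


# Three mutually cloned fragments; the shuffled runs found only one pair.
gold = [("A", "B"), ("B", "C"), ("A", "C")]
found = [("B", "A")]
print(pair_recall(found, gold))      # one of three pairs
print(fragment_recall(found, gold))  # two of three fragments
```

In the small example, a single detected pair already covers two of the three fragments, so fragment recall runs ahead of pair recall.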
Preliminary Study
• Test with “regular size” systems:
  • JHotDraw (20 KLOC, 285 files)
  • ArgoUML (190 KLOC, 1845 files)
  • JDK1.7 (900 KLOC, 6916 files)
• Tools:
  • CCFinder, Deckard, iClones, NiCad, SimCad, Simian
• Shuffling: 15 subsets, 30 shuffling rounds
• Measured: total recall after each round
Preliminary Study – JDK1.7
[Figure: recall (y-axis, 0 to 1) per round (x-axis, rounds 1 to 30), one series per tool. Legend entries: Deckard (1834042), iClones (49716), NiCad (8105), SimCad (549923), Simian (217409). n = 15 subsets, r = 30 rounds.]
Preliminary Study
• ~60-90% total recall achievable
• Shuffling performance varies by detection tool.
• Generally, a larger gold standard requires more rounds to get the same total recall.
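One way to build intuition for these recall levels is a back-of-the-envelope model that is not part of the slides. It assumes each round is an independent, uniformly random partition into n equal-size subsets, and it considers a clone pair whose two fragments sit in two different files; such a pair can only be detected in a round where both files share a subset. The function names are hypothetical, and the model ignores any tool-specific behavior.

```python
def colocation_probability(total_files: int, n: int) -> float:
    """Chance that two given files share a subset in a single round.

    With equal-size subsets of m = total_files / n files, the second file
    occupies one of the other (total_files - 1) slots, of which (m - 1)
    are in the first file's subset.
    """
    subset_size = total_files / n
    return (subset_size - 1) / (total_files - 1)


def expected_pair_recall(total_files: int, n: int, r: int) -> float:
    """Chance that the two files share a subset in at least one of r rounds."""
    p = colocation_probability(total_files, n)
    return 1 - (1 - p) ** r


# JDK1.7 configuration from the preliminary study: 6916 files, n = 15, r = 30.
print(round(expected_pair_recall(6916, 15, 30), 2))
```

For the JDK1.7 configuration this simple model lands at roughly 0.87, near the upper end of the reported range. It cannot explain the differences between tools, which depend on factors the model leaves out.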
Main Experiment: Dataset
IJaDataset 2.0: An Inter-Project Java Corpus
• Keivanloo et al., 2012 (Proc. MSR)
• Crawled 25,000 Open-Source Java Projects
• 3 million Java source files, 356 MLOC
• Outliers (>2000 lines)
  • 6238 removed
Experiment - Hardware
Clone detection (shuffling):
• Workstation-Class Hardware
  • Quad Core CPU
  • 12-16GB of RAM
  • Above Average Disk IO
• ~$1000 PC
• Allocated on shared cloud resources.
  • Western Canada Research Grid (Bugaboo Cluster)
  • Amazon EC2 Instances
Experiment - Tools
• Simian
• NiCad
• Deckard
• CCFinderX
  • Terminated without explanation.
• SimCad
  • Execution aborts on troublesome file.
• iClones
  • Compatibility issue.
Simian
• IJaDataset2
  • Scalability limit: RAM
  • 50,000 file subsets (58 partitions), 30 rounds
  • 8-12hr to partition, 4-10hr for detection (per round)
• Setting
  • Minimum Clone Size: 6 lines
  • No source normalization (execution time)
• Gold Standard
  • Amazon EC2 instance with 68GB of RAM
  • 300 billion clone pairs, 11 million cloned fragments
Simian: Cloned Fragment Recall
[Figure: clone fragment recall (y-axis, 0 to 1) per round (x-axis, rounds 1 to 30). Data labels shown on the curve: 0.166903883, 0.476927684, 0.626533533, 0.715431474.]
Considering only clone classes with <= 100 fragments.
Simian: Clone Recall (Trim)
[Figure: total recall (y-axis, 0 to 1) over rounds 1 to 30 for two series, Clone Pairs and Cloned Fragments. Data labels: 0.24792718, 0.619514665. Linear trend line (Clone Pairs): y = 0.0067x + 0.0533, R² = 0.99585. Logarithmic trend line (Cloned Fragments): y = 0.1364ln(x) + 0.1199, R² = 0.95064.]
NiCad
• IJaDataset2
  • Scalability: Limited data-structure size.
  • 10,000 file subsets, 289 partitions, 20 rounds
  • 7-15hr partitioning, 23-31hr detection (per round)
• Settings:
  • Clone Size: 10-2500 lines.
  • Minimum clone similarity: 70%
• Gold Standard
  • Not possible.
NiCad – Detection vs. Rounds
[Figure: two series, Unique Clones Found and Unique Clone Fragments Found, per round (x-axis, rounds 1 to 20). Two y-axes, labeled Unique Cloned Fragments Found and Unique Clones Found, with scales 0 to 1.00E+06 and 0 to 6.00E+06. Linear trend line: y = 245387x + 767852, R² = 0.99993.]
Deckard
• IJaDataset
  • Scalability Limit: Execution time.
  • 10,000 file subsets, 289 partitions, 20 rounds
  • 7-15hr partitioning, 5-7 days detection (per round)
• Settings:
  • Minimum Fragment Size: 50 tokens
  • Sliding Window: 5 tokens
  • Minimum Clone Similarity: 90% (tree)
• Gold Standard
  • Execution time too long.
Deckard: Detection vs. Rounds
[Figure: unique reported clone fragments (y-axis, 1.00E+07 to 1.90E+07) per round (x-axis, rounds 1 to 10).]
Deckard – Detection vs. Rounds (Trim)
Considering only clone classes with <= 10 fragments.
[Figure: two series, Clones and Fragments, per round (x-axis, rounds 1 to 10). Two y-axes with scales 0 to 1.80E+07 and 0 to 1.20E+08; axis label: Unique Clone Fragments Found.]
Main Experiment Conclusions
• Shuffling framework finds cloned fragments faster than the clone pair relationships between them.
• A large number of rounds may be needed to detect a sizable number of the clone pairs.
• Appropriate when loss of recall is acceptable.
  • Ex: contributing towards a multi-tool clone corpus.
• Processing the clones found in an inter-project clone corpus can itself become a scalability issue.
Clone Recovery
How can we improve clone pair discovery?
• Without a significant increase in rounds?
IDEA: Leverage Cloned Fragment Detection Ability
• Apply Transitive Property on Clone Repository.
  • If (A,B) and (B,C) then (A,C)
• Perform clone search amongst cloned fragments.
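The transitive rule can be sketched with a union-find structure that groups fragments into clone classes and then lists every pair inside each class. This is one possible reading of the idea, not the authors' recovery procedure: the function name is hypothetical, and fragments are assumed to be string IDs.

```python
from itertools import combinations
from typing import Dict, Iterable, List, Set, Tuple

Pair = Tuple[str, str]


def transitive_closure(pairs: Iterable[Pair]) -> Set[Pair]:
    """Return every pair implied by 'if (A,B) and (B,C) then (A,C)'."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    # Merge the classes of the two fragments in every detected pair.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect the members of each clone class.
    classes: Dict[str, List[str]] = {}
    for fragment in list(parent):
        classes.setdefault(find(fragment), []).append(fragment)

    # Every two fragments in the same class form a (candidate) clone pair.
    recovered: Set[Pair] = set()
    for members in classes.values():
        recovered.update(combinations(sorted(members), 2))
    return recovered


print(sorted(transitive_closure([("A", "B"), ("B", "C"), ("D", "E")])))
# [('A', 'B'), ('A', 'C'), ('B', 'C'), ('D', 'E')]
```

Two caveats apply. Similarity between near-miss clones is not strictly transitive, so recovered pairs are best treated as candidates to be checked. A class of k fragments also expands to k(k-1)/2 pairs, so very large classes make the closure itself expensive.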
Transitive Clone Recovery Test
[Figure (NiCad, JDK1.7): recall (y-axis, 0 to 1) per round (x-axis, rounds 1 to 30). Series: Clone Recall, Heuristic Recall, Recovered Recall.]
Transitive Clone Recovery Test
[Figure (Simian, JDK1.7): recall (y-axis, 0 to 1) per round (x-axis, rounds 1 to 30). Series: Clone Recall, Heuristic Recall, Recovered Recall.]
Future Work
1. Investigate additional tools.
2. Investigate efficient clone recovery methods.
3. Directly compare with deterministic approach.
4. Use the shuffling framework to contribute towards an inter-project clone corpus (IJaDataset 2.0).
Thank You!