Embedded Lab. Park Yeongseong. Introduction State of the art Core values Design Experiment ...

Value-Based Program Characterization and Its Application to Software Plagiarism Detection Embedded Lab. Park Yeongseong ICSE 2011 Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu Penn State University Xiaoqi Jia State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences


Page 1: Embedded Lab. Park Yeongseong.  Introduction  State of the art  Core values  Design  Experiment  Discussion  Conclusion  Q&A.

Value-Based Program Characterization and Its Application to Software Plagiarism Detection

Embedded Lab. Park Yeongseong

ICSE 2011

Yoon-Chan Jhi, Xinran Wang, Sencun Zhu, Peng Liu, Dinghao Wu
Penn State University

Xiaoqi Jia
State Key Laboratory of Information Security, Institute of Software, Chinese Academy of Sciences

Page 2:

Contents

Introduction
State of the art
Core values
Design
Experiment
Discussion
Conclusion
Q&A

Page 3:
Page 4:

Introduction

Identifying the same or similar code is very important.

Previous works
◦ Static source code comparison – C1
◦ Static executable code comparison – C2
◦ Dynamic control flow based methods – C3
◦ Dynamic API based methods – C4

Page 5:

Introduction

Three highly desired requirements
◦ R1 – Resiliency
◦ R2 – Ability to work directly on binary executables
◦ R3 – Platform independence

However, no previous approach satisfies all three. Requirements not satisfied:
◦ Static source code comparison (C1) – R1, R2
◦ Static executable code comparison (C2) – R1
◦ Dynamic control flow based methods (C3) – R1, R3
◦ Dynamic API based methods (C4) – R3

Page 6:

Introduction

Introduce a new approach
◦ Core-values

5 optimization options (-O0 ~ -O3, -Os)
3 compilers (GCC, TCC, WCC)
KlassMaster, Thicket, Loco/Diablo obfuscators

Page 7:

State of the art

Code Obfuscation Techniques
◦ data obfuscation, control obfuscation, layout obfuscation and preventive transformations
◦ indirect branches, control-flow flattening, function-pointer aliasing

Static Analysis Based Plagiarism Detection
◦ String-based
◦ AST-based
◦ Token-based
◦ PDG-based
◦ Birthmark-based

Page 8:

State of the art

Dynamic Analysis Based Plagiarism Detection
◦ Whole program path based (WPP)
◦ Sequence of API function calls birthmark (EXESEQ)
◦ Frequency of API function calls birthmark (EXEFREQ)
◦ System call based birthmark

Page 9:

Core values

Runtime values
◦ The output operands of the machine instructions executed

Core values
◦ Constructed from runtime values

Eliminate non-core values
◦ If […] is not derived from […], […] is not a core-value of […]
◦ If […] is not in the set of runtime values of […], […] is not a core-value of […]

Page 10:

Core values

Page 11:

Design – Value Sequence Extraction

Not all values associated with the execution of a program are core-values
◦ Value-updating instruction
◦ Related to the program’s semantics
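A minimal sketch of how a value sequence might be pulled from an execution trace, assuming each trace entry records the instruction mnemonic and the value written to its output operand. The `TraceEntry` type, the `VALUE_UPDATING` set, and the sample trace are hypothetical illustrations; the slides do not say which instructions count as value-updating.

```python
from typing import List, NamedTuple


class TraceEntry(NamedTuple):
    """One executed instruction (hypothetical trace format)."""
    opcode: str   # instruction mnemonic
    output: int   # value written to the output operand


# Assumption: arithmetic and logic instructions update values, while
# plain data movement and control transfer do not. Illustrative only.
VALUE_UPDATING = {"add", "sub", "mul", "div", "and", "or", "xor", "shl", "shr"}


def extract_value_sequence(trace: List[TraceEntry]) -> List[int]:
    """Keep, in execution order, the outputs of value-updating instructions."""
    return [entry.output for entry in trace if entry.opcode in VALUE_UPDATING]


trace = [
    TraceEntry("mov", 7),    # data movement: skipped
    TraceEntry("add", 12),   # value-updating: kept
    TraceEntry("mov", 12),   # data movement: skipped
    TraceEntry("xor", 5),    # value-updating: kept
    TraceEntry("jmp", 0),    # control transfer: skipped
]

print(extract_value_sequence(trace))  # [12, 5]
```

The order of the surviving values is preserved, since the later comparison works on sequences rather than sets.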

Page 12:

Design – Value Sequence Refinement and Similarity Metric

To refine value sequences
◦ Sequential refinement – reduction rate 16%~34%
◦ Optimization-based refinement – 5 optimization options
◦ Address removal – exclude pointer values
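The address-removal step and a similarity score can be sketched as follows. This is an illustration under stated assumptions: the slides do not give the metric's formula, so the sketch assumes a longest-common-subsequence (LCS) score normalized by the shorter sequence, which yields values between 0 and 1. The `looks_like_address` heuristic, its threshold, and the sample sequences are hypothetical.

```python
from typing import Callable, List


def remove_addresses(values: List[int],
                     is_address: Callable[[int], bool]) -> List[int]:
    """Drop values that the supplied predicate classifies as pointers."""
    return [v for v in values if not is_address(v)]


def lcs_length(a: List[int], b: List[int]) -> int:
    """Length of the longest common subsequence (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def similarity(a: List[int], b: List[int]) -> float:
    """Assumed metric: |LCS(a, b)| / min(|a|, |b|), in [0, 1]."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


def looks_like_address(v: int) -> bool:
    # Hypothetical heuristic: treat large values as pointers.
    return v >= 0x08000000


original = remove_addresses([12, 0x0804A010, 5, 40, 9], looks_like_address)
suspect = remove_addresses([3, 12, 5, 0xBFFFF000, 40, 9, 1], looks_like_address)
unrelated = [7, 8, 2]

print(similarity(original, suspect))    # 1.0
print(similarity(original, unrelated))  # 0.0
```

Normalizing by the shorter sequence means extra values inserted into the suspect's sequence do not lower the score, as long as the original values still appear in order.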

Page 13:

Design – Overview

Page 14:

Experiment

Intel Quad-Core 2.00 GHz CPU, 4GB RAM, Linux machine
QEMU 0.9.1

Questions
1. resilient
2. false accusation
3. credible

Page 15:

Experiment – Obfuscation tool (resiliency)

Obfuscation techniques
◦ SandMark, KlassMaster: Java bytecode obfuscators

Test application: JLex
◦ Lexical analyzer

Page 16:

Experiment – Similar Programs (false accusation)

Test application
◦ 5 individual XML parsers: expat, libxml2, Parsifal, rxp, xercesc

Page 17:

Experiment – Different Programs (credible)

Test application
◦ bzip2, gzip, oggenc, 9 of 11 programs

Result
◦ Similarity scores between 0 and 0.27
◦ zip and gzip similarity scores are 1.0
  Same compression algorithm: deflate
◦ zip and bzip2 similarity scores are 0.01 to 0.03
  Different compression algorithm: block sorting

Page 18:

Conclusion

Introduces a novel approach to dynamic characterization of executable programs.

The value-based method successfully discriminates 34 plagiarisms by SandMark, KlassMaster, and Thicket.

Page 19:

Q&A