Copyright 2011 Trend Micro Inc.
Bytewise Approximate Matching, Searching and Clustering
Liwei Ren, Ph.D. and Ray Cheng, Ph.D.
Trend Micro Inc.
DFRWS USA 2015, August 2015, Philadelphia, PA
Agenda
• Background
• Six Matching Problems and Bytewise Relevance
• Current Work: A Framework of Theory, Algorithms, and Technologies
• Future Work
Classification 8/17/2015
Background
• Similarity digesting schemes:
  – Problem: Given two binary strings s1 and s2, measure their similarity.
  • Compute a hash that preserves the similarity of the strings.
  • Measure similarity by comparing the two hash values.
  – Examples: TLSH, ssdeep, sdhash
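As an illustration only (not the actual TLSH, ssdeep, or sdhash algorithms), a minimal similarity-preserving digest can be sketched with MinHash over byte n-grams; the names `digest` and `sim` are ours:

```python
import hashlib

def digest(data: bytes, ngram: int = 4, k: int = 16) -> list:
    """Toy similarity-preserving hash H: MinHash sketch over byte n-grams."""
    grams = {data[i:i + ngram] for i in range(len(data) - ngram + 1)}
    sketch = []
    for seed in range(k):
        # Minimum of a seeded 64-bit hash over all n-grams; similar inputs
        # share n-grams, so they tend to share minima.
        sketch.append(min(
            int.from_bytes(hashlib.blake2b(bytes([seed]) + g,
                                           digest_size=8).digest(), "big")
            for g in grams))
    return sketch

def sim(d1: list, d2: list) -> float:
    """Toy SIM: fraction of matching sketch slots (estimates Jaccard similarity)."""
    return sum(a == b for a, b in zip(d1, d2)) / len(d1)
```

For example, two texts differing by one word score far higher than unrelated byte strings, which is the property the real similarity digesting tools provide at production quality.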
Background
• The NIST specification document NIST.SP.800-168 introduces the concept of bytewise approximate matching:
  – The document describes this concept with four cases:
• Object similarity detection: identify related artifacts, e.g. different versions of a document.
• Cross Correlation: identify artifacts sharing a common object.
• Embedded Object Detection: identify a given object inside an artifact.
• Fragment Detection: identify the presence of traces/fragments of a known artifact.
• Dr. Liwei Ren's talk at DFRWS EU 2015: A Theoretic Framework for Evaluating Similarity Digesting Tools
  – Uses a mathematical model to describe binary similarity.
Six Matching Problems and Bytewise Relevance
• The NIST document does not cover all bytewise approximate matching cases.
• We generalized the NIST cases to six cases:
  – Identicalness EM1 (exact match)
  – Containment EM2 (exact match)
  – Cross-sharing EM3 (exact match)
  – Similarity AM1 (approximate match)
  – Approximate containment AM2
  – Approximate cross-sharing AM3
Classification of NIST approximate matching cases
• Similarity Detection: identify related artifacts.
  – AM1 (approximate match)
• Cross Correlation: identify artifacts sharing a common object.
– EM3 (exact match cross-sharing)
• Embedded Object Detection: identify a given object inside an artifact.
– EM2 (exact match containment)
• Fragment Detection: identify the presence of traces/fragments of a known artifact.
– EM2 (one or more exact match containment)
Six Matching Problems and Bytewise Relevance
• Definition 1: Given two strings R[1,…,n] and T[1,…,m], if one of the six cases is true, we say R and T are bytewise relevant.
  – We denote this as BR(R,T) = 1, otherwise BR(R,T) = 0.
A Framework of Theory, Algorithms and Technologies
• Define three fundamental problems using bytewise relevance:
  – Matching: Given O1, O2 ∊ S, determine whether BR(O1, O2) = 1.
  – Searching: B ⊆ S is a bag of objects. Given o ∊ S, find b ∊ B such that BR(o, b) = 1.
  – Clustering: Given a bag B of objects, partition B into groups {G1, G2,…,Gm} based on BR.
• Here:
  – S = an object space,
  – O = an object in the object space S,
  – BR = the bytewise relevance relationship for objects in S.
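The three problems can be put in schematic form as follows (a sketch of ours, where `br` stands in for any concrete BR implementation, and `cluster` uses a simple greedy single-link grouping rather than the authors' method):

```python
from typing import Callable, Iterable, List

BR = Callable[[bytes, bytes], int]  # bytewise relevance predicate: 1 or 0

def match(br: BR, o1: bytes, o2: bytes) -> bool:
    """Matching: determine whether BR(O1, O2) = 1."""
    return br(o1, o2) == 1

def search(br: BR, o: bytes, bag: Iterable[bytes]) -> List[bytes]:
    """Searching: find all b in B with BR(o, b) = 1 (brute force)."""
    return [b for b in bag if br(o, b) == 1]

def cluster(br: BR, bag: List[bytes]) -> List[List[bytes]]:
    """Clustering: greedy single-link partition of B based on BR."""
    groups: List[List[bytes]] = []
    for o in bag:
        for g in groups:
            if any(br(o, m) == 1 for m in g):  # relevant to this group
                g.append(o)
                break
        else:
            groups.append([o])  # no relevant group found: start a new one
    return groups
```

Any BR implementation (exact, approximate, or similarity-digest based) can be plugged into the same three interfaces.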
A Framework of Theory, Algorithms and Technologies
• Our bytewise relevance framework:
Matching
• The six matching problems EM1 – AM3:
  – Identicalness EM1: the solution is trivial.
  – Containment EM2: the solution is the Rabin-Karp algorithm.
  – Cross-sharing EM3:
    • We established a theory on this interesting problem: how to measure cross-sharing.
    • We developed an algorithmic solution with theoretical analysis.
  – Similarity AM1:
    • TLSH, ssdeep and sdhash
    • Dr. Ren's talk at DFRWS EU 2015 presented eight approaches to solving this problem.
    • We designed a novel similarity digesting scheme, TSFP.
  – Approximate containment AM2: two heuristic algorithms.
  – Approximate cross-sharing AM3: one heuristic algorithm.
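The Rabin-Karp solution to the EM2 containment case can be sketched as follows (a minimal version; production code would tune the base and modulus and may use multiple hashes):

```python
def rk_contains(pattern: bytes, text: bytes) -> bool:
    """EM2 containment check: is `pattern` an exact substring of `text`?
    Rabin-Karp: compare rolling hashes, verify bytes on a hash match."""
    base, mod = 256, 1_000_000_007
    m, n = len(pattern), len(text)
    if m == 0:
        return True
    if m > n:
        return False
    high = pow(base, m - 1, mod)          # weight of the outgoing byte
    ph = th = 0
    for i in range(m):                    # hashes of pattern and first window
        ph = (ph * base + pattern[i]) % mod
        th = (th * base + text[i]) % mod
    for i in range(n - m + 1):
        if ph == th and text[i:i + m] == pattern:  # verify: rule out collisions
            return True
        if i < n - m:                     # roll the window one byte right
            th = ((th - text[i] * high) * base + text[i + m]) % mod
    return False
```

The rolling update makes each window hash O(1), so the scan is O(n) expected time rather than O(n·m) for naive comparison.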
Searching
• For the relationship BR, the searching problem:
  – B is a bag of strings. Given a string T, find s ∊ B such that BR(T, s) = 1.
Searching
• How do we solve the searching problem?
  – Brute-force approach: for every s ∊ B, evaluate BR(T, s). Can this scale to millions or billions of strings?
  – Candidate-selection approach: two steps
    • STEP 1: quickly select a few candidates {s1, s2,…,sm}.
    • STEP 2: evaluate each BR(T, sk).
  – How do we select good candidates?
    • String fingerprinting: generate fingerprints from each string in B.
    • Indexing process: index the fingerprints along with the string IDs to create an index database, FP-DB.
    • Searching process: given T, generate its fingerprints {FP1, FP2,…,FPq} and use them to search FP-DB for possible candidates.
  – NOTE:
    • This is similar to a keyword-based search engine where the fingerprints play the role of keywords.
    • The fingerprinting procedure is in effect a special tokenization method.
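The indexing and searching processes above can be sketched as follows. This is an illustrative pipeline of ours, using 0-mod-p sampling of n-gram hashes as the fingerprinting step, not the authors' actual scheme; names such as `build_fp_db` and parameters `ngram`/`p` are assumptions:

```python
import zlib
from collections import defaultdict
from typing import Dict, List, Set

def fingerprints(s: bytes, ngram: int = 8, p: int = 8) -> Set[int]:
    """Select n-grams whose CRC32 is 0 mod p; both sides sample the same
    grams, so shared content yields shared fingerprints."""
    fps = set()
    for i in range(len(s) - ngram + 1):
        h = zlib.crc32(s[i:i + ngram])
        if h % p == 0:
            fps.add(h)
    return fps

def build_fp_db(bag: Dict[str, bytes]) -> Dict[int, List[str]]:
    """Indexing process: map each fingerprint to the IDs that contain it."""
    fp_db: Dict[int, List[str]] = defaultdict(list)
    for sid, s in bag.items():
        for fp in fingerprints(s):
            fp_db[fp].append(sid)
    return fp_db

def candidates(fp_db: Dict[int, List[str]], t: bytes, m: int = 3) -> List[str]:
    """Searching process: vote by shared fingerprints, keep the top m IDs."""
    votes: Dict[str, int] = defaultdict(int)
    for fp in fingerprints(t):
        for sid in fp_db.get(fp, ()):
            votes[sid] += 1
    return sorted(votes, key=votes.get, reverse=True)[:m]
```

Only the few returned candidates then need the full BR evaluation of STEP 2, which is what makes the approach scalable.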
Future Work: Clustering Problem
• For the relationship BR, one has a clustering problem:
  – B is a bag of strings; partition B into groups of strings based on BR.
Future Work: Library and tools
• Analyze the algorithms and measure performance.
  – Verify that they can scale.
• For bytewise approximate matching, searching and clustering, provide:
  – a library of functions
  – an API
  – tools
Application examples of Approximate Matching, Searching, Clustering
• E-Discovery
– Comparing near duplicate documents
– Grouping near duplicate documents
• Digital forensic analysis
– Identifying similar objects or files
• Malware analysis
– Identifying similar malware or mutated malware
• Anti-plagiarism
– Detection of copyright violations
• Source code governance
• Spam filtering
• Data Loss Prevention
Q&A
• Thank you.
• Any questions?
• Email:
  – liwei_ren@trendmicro.com
  – ray_cheng@trendmicro.com
Application Example
• A search problem in a DLP (Data Loss Prevention) system:
  – Problem: S = {d1, d2,…, dn} is a collection of confidential documents. Given any document T and 0 < δ ≤ 1, find a document d ∊ S such that RLV(d, T) ≥ δ.
• RLV is a function that measures the relevance of two documents.
• Challenges: how do we construct RLV and choose δ? How do we make the search scalable?
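One possible RLV can be sketched as below; this is a toy construction of ours (the actual construction of RLV is the open challenge noted above). It scores the fraction of d's n-grams that appear in T, so a confidential excerpt embedded in a larger outgoing document still scores high:

```python
from typing import Dict, Optional

def rlv(d: bytes, t: bytes, ngram: int = 8) -> float:
    """Toy RLV: share of d's n-grams that also occur in T
    (containment-style relevance)."""
    d_grams = {d[i:i + ngram] for i in range(len(d) - ngram + 1)}
    if not d_grams:
        return 0.0
    t_grams = {t[i:i + ngram] for i in range(len(t) - ngram + 1)}
    return len(d_grams & t_grams) / len(d_grams)

def find_leak(confidential: Dict[str, bytes], t: bytes,
              delta: float = 0.5) -> Optional[str]:
    """Return the ID of some d in S with RLV(d, T) >= delta, else None."""
    for doc_id, d in confidential.items():
        if rlv(d, t) >= delta:
            return doc_id
    return None
```

This linear scan restates the brute-force search; a scalable DLP system would pair RLV with the fingerprint-index candidate selection described earlier.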
Application Example
• A clustering problem in e-Discovery:
  – Data are identified as potentially relevant by attorneys.
  – De-duplication technology is applied.
  – Problem: partition S into groups based on textual relevance.
Background
• Similarity digesting schemes:
  – A family of similarity-preserving hashing techniques and tools.
  – Problem: Given two binary strings s1 and s2, measure their similarity by s = SIM(H(s1), H(s2)).
    • H is a hash function that preserves string similarity.
    • SIM is another function that measures the similarity of two hash values.
  – Examples: TLSH, ssdeep, sdhash
  – Challenge: how do we evaluate the pros & cons among these schemes?
Six Matching Problems and Bytewise Relevance
• Definition 2: Let X, Y ∊ {EM1, EM2, EM3, AM1, AM2, AM3}. If problem X is a special case of problem Y, we denote this as X ↪ Y.
• We have the following relationships:
EM1 ↪ EM2 ↪ EM3
AM1 ↪ AM2 ↪ AM3
EM1 ↪ AM1, EM2 ↪ AM2, EM3 ↪ AM3