Slides: keywords4bytecodes.org/RALI-OLST2012.pdf

Keywords for Bytecodes
Dr. Duboue
- Introduction
  - The Speaker
  - Bytecodes as Semantics
  - Reverse Engineering
- Details
  - Corpus Assembly
  - Main Pipeline
  - Results
  - Applications
- Other Topics
  - GRIUM/RALI
  - Other Academic
  - Focus on Technology
- Summary

Predicting English Keywords from Java Bytecodes
Pablo Ariel Duboue, PhD
Les Laboratoires Foulab, Montreal, Quebec
Séminaires RALI-OLST, Université de Montréal

Outline
- Introduction
  - About the Speaker
  - Bytecodes as Weak Semantics
  - Reverse Engineering
- Details
  - Corpus Assembly
  - Main Pipeline
  - Results
  - Applications
- Other Topics
  - GRIUM/RALI
  - Other Academic
  - Focus on Technology

Before Montreal
- Columbia University
  - WSD in biology texts (GENIES)
  - Natural Language Generation in medical and intelligence domains (MAGIC, AQUAINT)
  - Thesis: “Indirect Supervised Learning of Strategic Generation Logic”, defended Jan. 2005.
    - Advisor: Kathy McKeown
    - Committee: Hirschberg/Jurafsky/Rambow/Jebara
- IBM Research Watson
  - AQUAINT: Question Answering (PIQuAnT)
  - Enterprise Search - Expert Search (TREC)
  - Connections between events (GALE)
  - Deep QA - Watson

In Montreal
I am passionate about improving society through language technology and split my time between teaching, doing research and contributing to free software projects
- Working with Prof. Nie at GRIUM
- Taught a graduate class in NLG in Argentina
- Contributed to Free Software projects, including some of my own
- Doing some consulting focusing on startups

Semantics, Java Bytecodes, Javadocs
- Motivation: Machine Learning for Natural Language Generation
- Finding good semantic representations “in the wild” is very rare
- Level of detail of semantic representations vs. natural language
- Similarities with binary code and code comments
- Reverse Engineering practitioners could tolerate noisy text
- As discussed in the INLG panel last summer

Java Bytecodes
- JVM is a stack machine
- The set of opcodes (~200) is small to simplify porting to new architectures.
- The opcodes fall into seven categories:
  - Load/store (e.g. aaload, bastore)
  - Arithmetic/logic (e.g. iadd, fcmpg)
  - Type conversion (e.g. i2b, f2d)
  - Object construction and manipulation (new, putfield)
  - Operand stack manipulation (e.g. swap, dup2_x1)
  - Control flow (e.g. if_icmpgt, goto)
  - Method invocation and return (e.g. invokedynamic, lreturn)

LDC and CALL
- While bytecodes represent a reduced vocabulary, they can incorporate names of classes or methods and string constants
  - ldc: pushes a constant onto the operand stack (number or string)
  - getfield: instance and field name
  - getstatic: classname and field name
  - invokedynamic: invokes a dynamic method

Javadocs
- Javadocs are standardized Java comments
  - Include special mark-up in the form of ’@’ constructions
    - @param, @throws, @return among others
- In my work, I focus on the comments associated with each method
- Example:
  - Creates a CacheRandom instance with a given cache capacity. @param capacity The capacity of the cache.
  - Adjusts the relative offset where the match begins to an absolute value. Only used by AwkMatcher to adjust the offset for stream matches. @return The length of the match.

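The two example comments above mix a free-text description with ’@’ block tags. A minimal Python sketch of separating the two is given here; the function name `split_javadoc` and the whitespace-based tag detection are assumptions for illustration and are not taken from the talk's pipeline.

```python
import re


def split_javadoc(comment):
    """Split flattened Javadoc text into (description, [(tag, text), ...]).

    Assumption: a block tag is '@' plus a word, found at the start of the
    text or right after whitespace. Inline tags such as {@link ...} are
    preceded by '{' and are therefore left inside the surrounding text.
    """
    pieces = re.split(r"(?:^|\s)@(\w+)\s*", comment.strip())
    description = pieces[0].strip()
    # re.split with one capture group yields: text, tag, text, tag, text, ...
    tags = [(pieces[i], pieces[i + 1].strip()) for i in range(1, len(pieces), 2)]
    return description, tags


example = ("Creates a CacheRandom instance with a given cache capacity. "
           "@param capacity The capacity of the cache.")
description, tags = split_javadoc(example)
```

For the first example comment, `description` holds the sentence before `@param` and `tags` holds a single `("param", ...)` pair.
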
What is Reverse Engineering
- From Wikipedia: “Reverse engineering is the process of discovering the technological principles of a device, object, or system through analysis of its structure, function, and operation. (...) The same techniques are subsequently being researched for application to legacy software systems (...) to replace incorrect, incomplete, or otherwise unavailable documentation.”
- REcon: the premier reverse engineering conference, held yearly in Montreal

Reverse Engineering Example

    private final int c(int) {
     0 aload_0
     1 getfield org.jpc.emulator.f.v
     4 invokeinterface org.jpc.support.j.e()
     9 aload_0
    10 getfield org.jpc.emulator.f.i
    13 invokevirtual org.jpc.emulator.motherboard.q.e()
    16 aload_0
    17 getfield org.jpc.emulator.f.j
    20 invokevirtual org.jpc.emulator.motherboard.q.e()
    23 iconst_0
    24 istore_2
    25 iload_1
    26 ifle 128
    29 aload_0
    30 getfield org.jpc.emulator.f.b
    33 invokevirtual org.jpc.emulator.processor.t.w()

Debian
- Using the Debian archive
  - apt-file search --package-only .jar
    - 1,400+ packages
  - dpkg-query -p package name
    - Look for Source field
  - dpkg-source -x source.dsc
    - Search for Java source files.
  - dpkg -x binary.deb
    - Search for jars, disassemble the methods.
- Assembling the Bytecodes / Javadoc Corpus
  - Disassemble using jclassinfo --disasm
  - Dump Javadoc comments using qdox.
    - A lightweight Java source parsing library.
  - Heuristically match source methods to compiled methods.
    - Normalize source code signatures to binary signatures.

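The slides do not say what form the normalized signatures take. One plausible target is the JVM method descriptor, in which `int c(int)` becomes `(I)I`. The sketch below illustrates that mapping in Python; the helper names are hypothetical, and it assumes type names are already fully qualified and free of generics, varargs and nested classes.

```python
# JVM descriptor codes for primitive types and void.
PRIMITIVES = {
    "void": "V", "boolean": "Z", "byte": "B", "char": "C", "short": "S",
    "int": "I", "long": "J", "float": "F", "double": "D",
}


def type_descriptor(java_type):
    """Map a source-level type such as 'java.lang.String[]' to its descriptor."""
    java_type = java_type.strip()
    dims = java_type.count("[]")              # one '[' per array dimension
    base = java_type.replace("[]", "").strip()
    if base in PRIMITIVES:
        code = PRIMITIVES[base]
    else:
        # Assumption: 'base' is a fully qualified class name.
        code = "L" + base.replace(".", "/") + ";"
    return "[" * dims + code


def method_descriptor(return_type, param_types):
    """Build a descriptor of the form '(params)return'."""
    params = "".join(type_descriptor(p) for p in param_types)
    return "(" + params + ")" + type_descriptor(return_type)


print(method_descriptor("int", ["int"]))  # (I)I
```

A real matcher would also need to resolve imports to obtain the fully qualified names, which is presumably where the heuristic part of the matching comes in.
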
Numbers
- Final corpus:
  - 1M methods.
  - 35M words.
  - 24M JVM instructions.
- This corpus is 3x bigger than the one discussed in the REcon talk

Pipeline
1. HTML detagging
2. PTB tokenizer
3. Morfessor
4. cclparser
5. Naive Bayes

Morfessor
- Unsupervised morpheme detection
  - http://www.cis.hut.fi/projects/morpho/
- CacheRandom → Cache + Random
- GenericCache.DEFAULT_CAPACITY → Generic + Cache. + DEFAULT_CAPACITY
- someFileName → some + FileName
- PatternStreamInput → Pattern + Stream + Input

CCL Parser
- CCL Parser is an unsupervised parser that does not require POS tags
  - Unsupervised POS induction, incremental (can deal with long sentences)
  - Yoav Seginer (2007), Fast Unsupervised Incremental Parsing. ACL.
  - http://www.seggu.net/ccl/
  - GPLv2, but current codebase does not save trained models
  - (((((((((((creates a) cache random) instance (with a)) given) cache) capacity. (@param)) capacity the) capacity) of the) cache))
- As chunker
  - [creates a] [cache random] [instance] [with a] [given] [cache] [capacity.] [@ param] [capacity the] [capacity] [of the] [cache] [same] [as cache random] [generic] [cache.] [default_capacity]
  - [default] [constructor] [given] [default] [access] [to prevent instantiation] [outside the] [package]

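The slide does not spell out how the bracketing is turned into chunks. One reading that is consistent with the example is to treat every parenthesis as a chunk boundary and keep the word runs in between. The Python sketch below follows that reading; `chunks_from_bracketing` is a hypothetical name, not part of the CCL parser.

```python
import re


def chunks_from_bracketing(bracketing):
    """Flatten a parenthesized parse into chunks.

    Assumption: every '(' or ')' ends the current chunk, so a chunk is a
    maximal run of words with no bracket between them.
    """
    parts = re.split(r"[()]", bracketing)
    # Normalize inner whitespace and drop the empty pieces between brackets.
    return [" ".join(part.split()) for part in parts if part.strip()]


parse = ("(((((((((((creates a) cache random) instance (with a)) given) cache) "
         "capacity. (@param)) capacity the) capacity) of the) cache))")
for chunk in chunks_from_bracketing(parse):
    print("[" + chunk + "]")
```

On the bracketing shown above this yields the first twelve chunks of the slide's chunker output, from [creates a] to [cache], except that `@param` stays a single token here.
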
Naive Bayes
- P(term|bytecodes)
- In case of complex opcodes (e.g., ldc “This is a very long string”), the count for the opcode is split between:
  - 0.5 for the full opcode, as a whole
  - 0.5 / #parts for each subpart ({ldc, This, is, a, very, long, string})

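The count-splitting rule above can be sketched in a few lines of Python. The tokenization (quotes dropped, split on whitespace) and the name `fractional_counts` are assumptions; the slide only fixes the 0.5 and 0.5 / #parts weights.

```python
import re
from collections import defaultdict


def fractional_counts(instruction):
    """Return feature -> weight for one disassembled instruction.

    Simple opcodes (a single token) count as 1. Complex opcodes give 0.5 to
    the full instruction and share the other 0.5 equally among the parts.
    """
    instruction = instruction.strip()
    # Assumption: parts are whitespace-separated tokens with quotes removed.
    tokens = re.sub(r'["“”]', " ", instruction).split()
    counts = defaultdict(float)
    if len(tokens) <= 1:
        counts[instruction] += 1.0
        return dict(counts)
    counts[instruction] += 0.5
    share = 0.5 / len(tokens)
    for token in tokens:
        counts[token] += share
    return dict(counts)


weights = fractional_counts('ldc "This is a very long string"')
```

For the slide's example there are seven parts, so each part receives 0.5 / 7 and the weights for the instruction sum to one, the same total as a simple opcode.
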
Overall Results
- Top scoring terms, using one count per opcode

| Term | P | R | F |
|---|---|---|---|
| @ param | 0.73 | 0.64 | 0.685 |
| | 0.73 | 0.63 | 0.679 |
| object | 0.97 | 0.06 | 0.114 |
| @ throws | 0.72 | 0.05 | 0.099 |
| text | 0.64 | 0.02 | 0.038 |
| property | 0.69 | 0.01 | 0.031 |
| description | 0.72 | 0.01 | 0.029 |
| @ return the | 0.78 | 0.01 | 0.028 |
| | 0.80 | 0.01 | 0.026 |

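The F column is presumably the balanced F-measure, the harmonic mean of precision (P) and recall (R); the slides do not state this, so treat it as an assumption. A short Python sketch of that formula:

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # avoid dividing by zero when both scores are zero
    return 2 * precision * recall / (precision + recall)


score = f_measure(0.73, 0.64)
```

Recomputing from the two-decimal P and R values gives numbers close to, but not always identical to, the F column, which would be expected if F was computed before rounding.
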
Without Per-opcode Normalization

| Term | P | R | F |
|---|---|---|---|
| @ generated | 0.76 | 0.80 | 0.783 |
| replaced | 0.93 | 0.60 | 0.734 |
| @ param | 0.64 | 0.74 | 0.690 |
| icu | 0.75 | 0.49 | 0.600 |
| o the | 0.47 | 0.75 | 0.582 |
| @ stable | 0.72 | 0.45 | 0.561 |
| @ inheritdoc | 0.42 | 0.60 | 0.495 |
| @ return the | 0.41 | 0.52 | 0.463 |
| receiver | 0.72 | 0.31 | 0.440 |

Where to go from here
- The meaning in the bytecodes is not in the presence of individual opcodes but in their sequencing
  - MOTIF analysis in bioinformatics
- Comparable SMT
  - Most systems (e.g., Munteanu and Marcu (2006)) use either an aligned corpus or a bilingual dictionary
  - I can try to obtain that by asking developers to write descriptions for segments of the code
  - Alternatively, I can try to adapt TextTiling to bytecodes
    - Suggested by another Foulaber (Danukeru)

Applications in Reverse Engineering
- Hinting Subroutines
  - The motivating example at the beginning.
  - “Beacon identification” in Software Engineering.
- Custom (malware) VMs
  - Identifying which methods correspond to different VM operations (addition, jump, etc).
- Dalvik Word Clouds.
  - Use dex2jar, obtain word clouds for the whole executable.
  - Maybe the user can tell if anything looks fishy there?
- Flagging Suspicious Methods.
  - Finding methods that can be described with keywords very different from the rest of the existing methods.
  - Can be done with dynamically generated bytecodes.

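As a rough illustration of the last application, the Python sketch below scores each method by how rarely its predicted keywords occur in the other methods. The scoring rule, the function name `outlier_scores` and the toy keyword sets are all assumptions made for illustration; the talk does not describe a specific measure.

```python
from collections import Counter


def outlier_scores(keywords_by_method):
    """Score methods in [0, 1]; higher means keywords unlike the rest.

    For each keyword of a method we take the fraction of *other* methods
    that share it, average over the method's keywords, and subtract from 1.
    """
    n = len(keywords_by_method)
    # Number of methods in which each keyword appears.
    df = Counter(k for kws in keywords_by_method.values() for k in set(kws))
    scores = {}
    for name, kws in keywords_by_method.items():
        kws = set(kws)
        if not kws or n < 2:
            scores[name] = 0.0  # nothing to compare against
            continue
        overlap = sum((df[k] - 1) / (n - 1) for k in kws) / len(kws)
        scores[name] = 1.0 - overlap
    return scores


# Toy input: predicted keywords per method (made-up data).
methods = {
    "a": {"cache", "get"},
    "b": {"cache", "put"},
    "c": {"cache", "size"},
    "d": {"socket", "encrypt"},
}
scores = outlier_scores(methods)
```

In the toy input, method `d` shares no keyword with the others and receives the highest score.
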
Applications Outside Reverse Engineering
- Semantic Search
  - Searching for methods related to certain English terms
  - Query expansion using bytecodes
- Software Engineer Documentation
  - Generating documentation from bytecodes
  - Long term goal

Snippets and Sentence Compression
- Improving Information Retrieval user experience and engine performance by having better snippets
  - Working closely with Dr. Jing He.
  - Summarization snippets seem better than regular snippets but are much longer ⇒ sentence compression
- Query: wine rome
  - Page: http://penelope.uchicago.edu/%7Egrout/encyclopaedia_romana/wine/wine.html
  - Bing snippet: Return to Notae. Wine and Rome. Now nearly extinct in the wild, grapes (vitis vinifera) grew throughout the ancient Mediterranean, the juice readily fermenting as the enzymes ...
  - Summarization: Wine almost always was mixed with water for drinking; undiluted wine merum was considered the habit of provincials and barbarians. The earliest work on wine and agriculture was written in Punic. Indeed, by 154 BC, says Pliny, wine production in Italy was unsurpassed.

Taught Graduate Class in Argentina
- My alma mater
  - Universidad Nacional de Cordoba
- Natural Language Generation
  - http://wiki.duboue.net/index.php/2011_FaMAF_Intro_to_NLG
  - Touched NLG from DBs, Summarization and decoding in SMT
  - 12 students, about a fourth of the total PhD students in the dept
- Large NLP Group
  - http://pln.famaf.unc.edu.ar/
  - Possibilities for visiting people from Montreal

Student Projects
- Natural Language Generation for Software Patches
  - http://nlg4patch.com.ar/
- Natural Language Generation for UML diagrams
  - ongoing
- Referring Expression Evaluation using DBpedia
  - HLT-NAACL 2012 Short Paper “On The Feasibility of Open Domain Referring Expression Generation Using Large Scale Folksonomies”
- Surface Realization of Spanish using the SemPar Corpus
Free Software
- Debian science
  - apertium, transfer-based machine translation for related language-pairs
  - NLTK
- Personal Projects
  - Farmer text support
  - php-nlgen
  - NLG in Puredata
- http://www.ohloh.net/accounts/DrDub
Tech Scene Montreal
- Foulab
  - Montreal's oldest and most prestigious hackerspace
  - http://foulab.org
  - Hackerspaces are community-operated physical places where people can meet and work on their projects.
  - http://hackerspaces.org for the full list
  - Open House every Tuesday night, everybody is welcome
- Hack-a-thons
  - Upcoming: http://quebecouvert.org/events/hackonslacorruption/
- Notman house
  - The “House of the Web” in Montreal
  - http://notman.org/
Consulting
- R&D for start-ups
  - Focusing on companies with positive contributions
  - Quick turnaround from ideas to users
  - http://honeypot.matchfwd.com
- Own ventures
  - 4opiniones.com
Summary
- I have presented a work in progress targeting automated documentation generation from compiled code
- Most recent progress is in unsupervised terminology identification
- Currently working on improved ML
Acknowledgements
- GRIUM
  - Prof. Nie and Dr. Jing He
- Foulab
  - Danukeru
- REcon organizers
  - Subgraph.
- Annie Ying
Contacting the Speaker
- Email: [email protected]
- Website: http://duboue.net
- Twitter: @pabloduboue
- LinkedIn: http://linkedin.com/in/pabloduboue
- IRC: DrDub
- http://keywords4bytecodes.org
- Always looking for new collaboration opportunities
- Very interested in teaching a class either in Montreal or on-line