Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
Transcript of Turning XML to XLS on the JVM, without loosing your Sanity, with Groovy
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Turning XML into XLS with GroovyNick Burch
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Turning XML to XLS, on the JVM, without loosing your sanity, with Groovy!
2
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
What is Groovy?
3
Groovy, now Apache Groovy, is a JVM based language Optionally Typed, Dynamic Language Many features inspired by Python, Ruby, SmallTalk Java Friendly – Can use Java classes & libraries, but also
can donate classes back to be used in Java Seemless integration with Java, similar syntax A lot less boilerplate than Java! But Java is learning...
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Fun with PDF XML
4
You can attach comments in a PDF These have text, along with colour, and optionally some
standard Dublin Coreesque metadata Best done with Acrobat or open source tools Surprisingly popular with many business sectors,
Quanticate’s included, mostly for good reason! Good news – Acrobat can export as XML as XFDF
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
5
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Groovy and XML
6
http://groovylang.org/processingxml.html Very easy and lightweight way to start processing XML Can initially treat XML as a big map, and access elements
just with dots Multiple children of the same type treated as a list Can access attributes as properties with the @ prefix For simple XML, very short, succinct code
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
And in code....
7
XmlSlurper and XmlParser main 2 entry points
def response = new XmlSlurper().parseText(books)
def firstAuthor = response. value.books.book[0].author
assert firstAuthor.text() == 'Manuel De Cervantes'
assert firstAuthor.@id == 1
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Groovy and advanced XML
8
http://groovylang.org/processingxml.html However... Still have a full XML DOM on hand Can run arbitrary DOM and XPath querys Feels a bit like JQuery etc – friendly, friendly, advanced
selector, back to friendly again! Call selector method, then find or findAll with a closure
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Groovy and lists
9
http://docs.groovylang.org/next /html/documentation/workingwithcollections.html
Define simply inline, eg def stuff = [“1”,”b”,”Test”] Run something for each, eg stuff.each { l println l }→ Filter and transform with grep and collect methods Can run any method on all entries with * syntax, eg
stuff*.trim().sort().unique()
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Groovy and Strings
10
http://docs.groovylang.org/latest/html/documentation/#allstrings
Can be regular Java strings, or GStrings with extra methods and functionality, Groovy sorts this for you
Use tripple single quotes to force Java string Use tripple quotes for multiline strings In GStrings, can interpolate with ${...} eg
“Hello ${world}” or “1+3 is ${1+3}” or “List is {l.size()} big”
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
PoC to process the XML
11
File xmlFile = new File(args[0])
def xml = new XmlSlurper().parse(xmlFile)
// Process all the Annotations
// Find everything under /xfdf/annots/freetext
// If that has no Subject, is the Domain
// If that does, grab text from /contentsrichtext/body/p
def freetext = xml.annots.freetext
println "There are ${freetext.size()} annotations"
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
PoC to process the XML
12
def lastDomain = "N/A"freetext.each { ft > def body = ft."contentsrichtext".body def content = body.text() // Get all text of entries of P etc under the body
if (ft['@page']) { If (! ft['@subject'].isEmpty()) {println "P ${ft['@page']} D ${lastDomain} S ${ft['@subject']} V ${content}" } else { lastDomain = content }
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Apache POI
13
https://poi.apache.org/ Pure Java library for reading and writing most Microsoft
Office file formats Especially strong on SpreadSheets (XLS and XLSX) More closely alligned to the file formats than the
applications, which can sometimes cause surprises (eg where Excel doesn’t store what you thought it did...)
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Simple XLS Exporter
14
Workbook wb = new HSSFWorkbook()Sheet s = wb.createSheet(“Variables”)dvars.eachWithIndex { v, idx > Row r = s.createRow(idx) r.createCell(0).setCellValue(dname) r.createCell(1).setCellValue(v.variable) r.createCell(2).setCellValue(v.pages.join(" ")) r.createCell(3).setCellValue(v.comments.join(" : "))}(0..3).each { col s.autoSizeColumn(col) }→(new File(“output.xls”)).withOutputStream { out > wb.write(out) }
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Fancy filterable headers
15
public static Sheet headerSheet(Workbook wb, String name, List<String> headers) { CellStyle csHeader = makeHeaderStyle(wb, headerHeight) Sheet s = wb.createSheet(name) Row r = s.createRow(0) r.setHeightInPoints(headerHeight+1) headers.eachWithIndex { col, idx > Cell c = r.createCell(idx) c.setCellValue(col) c.setCellStyle(csHeader) } s.setAutoFilter(new CellRangeAddress(0, 0, 0, headers.size()1)) return s}
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
The Real Thing
16
Final Requirements were a bit more involved, and there were more edge cases than initially expected...
XML Processing code: ~150 lines XLS Export code: ~150 lines I’m sure a Groovy expert could get that shorter without
affecting readability or maintainability! Uses Gradle to fetch Apache POI, do single Shadow Jar
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
17
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
18
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
19
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
20
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Learning Groovy
21
Groovy In Action book http://groovylang.org/learn.html Getting Started and Module Guides
http://groovylang.org/documentation.html StackOverflow questions Groovy Docs, esp. for Java enhanced eg
http://docs.groovylang.org/docs/groovy2.4.13/html/groovyjdk/java/lang/String.html
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
22
© Quanticate 2018Our Services: Biostatistics • Clinical Programming • Clinical Data Management • Medical Writing • Pharmacovigilance
Our Values: Relationships • Excellence • Accountability • Customer Focus • Happiness
Any Questions?
23