Scalable OCR with NiFi and Tesseract
Uploaded by: hadoop-summit | Category: Technology
Transcript of Scalable OCR with NiFi and Tesseract
![Page 1: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/1.jpg)
Scalable OCR With NiFi & Tesseract
Casey Stella & Michael Miklavcic
![Page 2: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/2.jpg)
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Introduction
About the Speakers
- Casey Stella
  - Currently a data scientist on Apache Metron
  - Previously Architect in Hortonworks Professional Services
- Michael Miklavcic
  - Currently an engineer on Apache Metron
  - Previously Architect in Hortonworks Professional Services
![Page 3: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/3.jpg)
OCR At Scale: The Challenge
- Unstructured data is growing aggressively
- Much of this data is in the form of PDF images of text
  - This appears to be the case inside of organizations much more than on the internet
- There is much we can do to extract meaning from this
  - NLP is one of our most mature and rich branches of machine learning
  - Simple textual analysis would be sufficient to have rich insights
- OCR enables us to extract textual information from images in an intelligent way
![Page 4: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/4.jpg)
OCR At Scale: Use-cases in Medicine
- The Problem
  - Radiologists make notes about patients
  - Doctors interpret these notes and make diagnoses based on the radiologist findings
  - Sometimes, the radiologists find things that are serendipitous or are not definitive
- The Value Proposition
  - Building a data pipeline at scale to analyze radiologist reports and look for indications of missed diagnoses
  - This is the correct place for advanced analytics: in the loop with humans
![Page 5: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/5.jpg)
OCR At Scale: Use-cases in Journalism
- The Problem
  - Journalists are now asked to analyze large volumes of data
  - The Panama Papers alone were 2.6TB of data, much of it in scanned images of pages
  - FOIA requests can quickly outstrip the reading capability of a single person or team
- The Value Proposition
  - Building a scalable data pipeline to extract the text from the data journalists are asked to mine enables more advanced analytics and better reporting
  - This is a tool to enable better journalism
![Page 6: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/6.jpg)
Methodology : OCR
- Conversion
  - Take PDFs and turn them into TIFF files, page-wise
  - GhostScript via Ghost4j
- Preprocessing
  - Prepare images by enhancing text and cleaning up artifacts
  - Enable cleaner text extraction
  - A preprocessing pipeline using ImageMagick under the hood
- Extraction
  - OCR phase using Tesseract
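The three phases above can be sketched as three command-line invocations chained together. This is a minimal Python illustration, not the presenters' implementation (which is Java, using Ghost4j and a ported preprocessing library). It assumes the `gs`, `convert` (ImageMagick 6 style; ImageMagick 7 names it `magick`), and `tesseract` executables are on the PATH. The function names, the 300 dpi default, the file naming pattern, and the specific ImageMagick operations are illustrative choices, not values from the talk.

```python
import subprocess
from pathlib import Path


def conversion_cmd(pdf: Path, out_dir: Path, dpi: int = 300) -> list[str]:
    """Ghostscript: render every PDF page to its own grayscale TIFF."""
    # %03d is expanded by Ghostscript to the page number (001, 002, ...).
    pattern = out_dir / f"{pdf.stem}-%03d.tif"
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=tiffgray", f"-r{dpi}",
        f"-sOutputFile={pattern}",
        str(pdf),
    ]


def preprocess_cmd(page: Path, cleaned: Path) -> list[str]:
    """ImageMagick: convert to grayscale, straighten, and stretch contrast."""
    return [
        "convert", str(page),
        "-colorspace", "Gray",
        "-deskew", "40%",
        "-normalize",
        str(cleaned),
    ]


def extraction_cmd(cleaned: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Tesseract: write the recognized text to <out_base>.txt."""
    return ["tesseract", str(cleaned), str(out_base), "-l", lang]


def ocr_pdf(pdf: Path, work_dir: Path) -> list[Path]:
    """Run all three phases for one PDF and return the per-page text files."""
    clean_dir = work_dir / "clean"
    clean_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(conversion_cmd(pdf, work_dir), check=True)

    text_files = []
    for page in sorted(work_dir.glob(f"{pdf.stem}-*.tif")):
        cleaned = clean_dir / page.name
        subprocess.run(preprocess_cmd(page, cleaned), check=True)
        out_base = work_dir / page.stem
        subprocess.run(extraction_cmd(cleaned, out_base), check=True)
        text_files.append(Path(f"{out_base}.txt"))
    return text_files
```

Calling `ocr_pdf(Path("report.pdf"), Path("work"))` would leave one TIFF and one text file per page under `work/`. Keeping each phase as a separate command builder mirrors the talk's split into conversion, preprocessing, and extraction, so a phase can be swapped or tuned without touching the others.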
![Page 7: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/7.jpg)
Image Preprocessing
- ImageMagick is a standard open source library and tool to do rich and robust image processing
- ImageMagick is great :)
  - There is a large and mature community of users
  - It has been around for years and has all the primitives that you could ask for
- ImageMagick is confusing :|
  - Image preprocessing can be a daunting task for the user
  - ImageMagick can be arcane at times
![Page 8: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/8.jpg)
Image Preprocessing
- Community + ImageMagick = Magical
  - People have started making layers on top of ImageMagick to do common tasks aimed at a certain domain
  - Fred Weinhaus did this for text cleaning!
- What we did is port this interface over to Java and expose it as a library
- It currently supports
  - Unrotation (i.e. straightening images)
  - Greyscale
  - Enhance brightness
  - Text Smoothing
  - More!
![Page 9: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/9.jpg)
Preprocessing - Before and After
-g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20
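To make the flag string on this slide easier to read, the sketch below parses it into named settings with Python's `argparse`. The flag meanings in the `help` strings are assumptions based on the documented options of Fred Weinhaus's `textcleaner` script, which the talk says the Java library was ported from; the `dest` names are invented for illustration, and the Java port may interpret the flags differently.

```python
import argparse
import shlex


def build_parser() -> argparse.ArgumentParser:
    """Parser for the subset of flags shown on the slide (meanings assumed)."""
    p = argparse.ArgumentParser(prog="textcleaner-flags")
    p.add_argument("-g", dest="grayscale", action="store_true",
                   help="convert to grayscale before enhancing")
    p.add_argument("-e", dest="enhance", default="none",
                   choices=["none", "stretch", "normalize"],
                   help="brightness/contrast enhancement method")
    p.add_argument("-f", dest="filter_size", type=int,
                   help="size of the filter used to clean the background")
    p.add_argument("-o", dest="offset", type=int,
                   help="filter offset in percent, used to reduce noise")
    p.add_argument("-t", dest="threshold", type=int,
                   help="text smoothing threshold")
    p.add_argument("-u", dest="unrotate", action="store_true",
                   help="straighten a slightly rotated page")
    p.add_argument("-s", dest="sharpen", type=float,
                   help="sharpening amount")
    p.add_argument("-T", dest="trim", action="store_true",
                   help="trim the background around the page")
    p.add_argument("-p", dest="pad", type=int,
                   help="border padding added after trimming")
    return p


# The exact flag string from the slide.
SLIDE_FLAGS = "-g -e stretch -f 25 -o 20 -t 30 -u -s 1 -T -p 20"

settings = vars(build_parser().parse_args(shlex.split(SLIDE_FLAGS)))
for name, value in settings.items():
    print(f"{name:12} {value}")
```

Read this way, the slide's settings map onto the capabilities listed earlier in the talk: greyscale (`-g`), brightness enhancement (`-e stretch`), unrotation (`-u`), and text smoothing (`-t 30`).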
![Page 10: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/10.jpg)
Methodology : Scale
- Apache NiFi is an easy-to-use, highly customizable data processing system firmly integrated with the Hadoop ecosystem
  - Configurable prioritization, throughput/latency tradeoffs
  - Full data provenance across the pipeline
  - Easy to use interface for customizing the pipeline
- Each of the phases in the pipeline becomes a NiFi processor
  - This allows for a highly customizable tool
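The idea that each phase becomes its own processor can be illustrated with a toy model. The Python sketch below is not the NiFi API (real NiFi processors are Java classes); `FlowFile`, `run_flow`, and the stage functions are hypothetical names. It shows only the shape of the design: every stage is an independent unit, a unit of work is routed to success or failure, and each hop leaves a provenance record.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FlowFile:
    """A unit of work: payload plus metadata, loosely modeled on NiFi's concept."""
    content: bytes
    attributes: dict[str, str] = field(default_factory=dict)
    provenance: list[str] = field(default_factory=list)


# A stage transforms the payload and raises on error.
Stage = Callable[[bytes], bytes]


def run_flow(flow_file: FlowFile, stages: list[tuple[str, Stage]]) -> str:
    """Push one flow file through the stages; return 'success' or 'failure'."""
    for name, stage in stages:
        try:
            flow_file.content = stage(flow_file.content)
        except Exception as exc:
            # Route to failure and keep the reason with the flow file.
            flow_file.attributes["failure.reason"] = f"{name}: {exc}"
            flow_file.provenance.append(f"{name}:failure")
            return "failure"
        flow_file.provenance.append(f"{name}:success")
    return "success"


# Placeholder stages standing in for conversion, preprocessing, extraction.
def convert(data: bytes) -> bytes:
    if not data.startswith(b"%PDF"):
        raise ValueError("not a PDF")
    return b"TIFF:" + data


def preprocess(data: bytes) -> bytes:
    return data.replace(b"TIFF:", b"CLEAN:")


def extract(data: bytes) -> bytes:
    return b"TEXT:" + data


PIPELINE = [("convert", convert), ("preprocess", preprocess), ("extract", extract)]

ok = FlowFile(b"%PDF-1.4 ...")
print(run_flow(ok, PIPELINE), ok.provenance)

bad = FlowFile(b"not a pdf")
print(run_flow(bad, PIPELINE), bad.attributes)
```

Because the stages share nothing but the flow file, one can be reordered, replaced, or given more resources without touching the others, which is the customizability the slide attributes to building the pipeline out of processors.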
![Page 11: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/11.jpg)
NiFi + Hadoop
![Page 12: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/12.jpg)
Pipeline Architecture
![Page 13: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/13.jpg)
Demo
![Page 14: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/14.jpg)
OCR is necessary, but not sufficient
- Providing this kind of utility is a necessary step, but there are missing pieces
- Does not handle human handwriting as of yet
  - Deep learning advances are closing the gap on this
- Even with very good image preprocessing, errors can creep into documents
  - Kerning errors: rn -> m
  - Unresolvable blemishes leading to random noise
- Good error correction can require advanced NLP and can be domain specific
  - See patent #20160019430: “Targeted optical character recognition for medical terminology”
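The kerning error mentioned above (`rn` read as `m`) lends itself to a small dictionary-based repair. The Python sketch below is a simple illustration of domain-specific correction, not the method from the cited patent and not part of the presenters' code: it generates every spelling obtained by swapping `m` and `rn`, and accepts a correction only when exactly one variant appears in a domain vocabulary. The function names and the tiny vocabulary are hypothetical.

```python
from itertools import product


def kerning_variants(token: str) -> set[str]:
    """All spellings obtained by reading each 'm' as 'rn' or each 'rn' as 'm'."""
    choices = []
    i = 0
    while i < len(token):
        if token.startswith("rn", i):
            choices.append(("rn", "m"))
            i += 2
        elif token[i] == "m":
            choices.append(("m", "rn"))
            i += 1
        else:
            choices.append((token[i],))
            i += 1
    return {"".join(parts) for parts in product(*choices)}


def correct(token: str, vocabulary: set[str]) -> str:
    """Fix an rn/m confusion only when the repair is unambiguous."""
    if token in vocabulary:
        return token
    candidates = [v for v in kerning_variants(token) if v in vocabulary]
    return candidates[0] if len(candidates) == 1 else token


# Hypothetical domain vocabulary; a real one would come from a medical lexicon.
VOCAB = {"sternum", "burn", "hernia", "tumor"}

for word in ["stemum", "bum", "hemia", "tumor", "xyz"]:
    print(word, "->", correct(word, VOCAB))
```

The vocabulary is what makes this domain specific: `bum` is a valid English word, so a general dictionary would leave it alone, while a medical vocabulary that lacks it resolves the token to `burn`.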
![Page 15: Scalable OCR with NiFi and Tesseract](https://reader034.fdocument.pub/reader034/viewer/2022042513/586e8cf11a28aba0038b86a1/html5/thumbnails/15.jpg)
Questions?
All of the software shown in this presentation is open source and located at https://github.com/mmiklavc/scalable-ocr
Find us on Twitter
@casey_stella
@MikeMiklavcic