Getting Started with Unstructured Data
Transcript of Getting Started with Unstructured Data
November 17, 2011
Getting Started with Unstructured Data
Christine Connors & Kevin Lynch
TriviumRLG LLC
Thursday, November 17, 2011
Meta
✤ Presenter: Christine Connors
✤ @cjmconnors
✤ Presenter: Kevin Lynch
✤ @kevinjohnlynch
✤ Principals at www.triviumrlg.com
✤ Partnering with Dataversity
Agenda
✤ What is unstructured data?
✤ Where do we find it?
✤ How important is it?
✤ How do we visualize it?
✤ Machine processing for actionable data
✤ Tools
What is unstructured data?
✤ Data that
✤ Is not in a database
✤ Does not adhere to a formal data model
✤ Content
Isn’t that a misnomer?
✤ Problematic term
✤ The presence of object metadata or aesthetic markup does not alone give ‘structure’ in this sense of the word
✤ Object metadata = machine or applied properties
✤ Aesthetic markup = stylesheets; rendering information
✤ Semi-structured data is typically treated as unstructured for the purposes of machine processing and analysis
Types of ‘un’structured data
✤ Text-based documents
✤ Word processing, presentations, email, blogs, wikis, tweets, web pages, web components (read/write web)
✤ Audio/video files
Where do we find it?
✤ Office productivity suites
✤ Content management systems
✤ Digital asset management systems
✤ Web content management systems
✤ Wikis, blogs, comment & discussion threads
✤ Social networking tools
✤ Twitter, Yammer, instant messengers
Is it really that important?
[Pie chart: Unstructured 85%, Structured 15%]
What’s in that 80-85%?
✤ Progress reports - created in a word processor
What’s in that 80-85%?
✤ Dashboards - created in presentation software
What’s in that 80-85%?
✤ Progress reports - color coded text in a spreadsheet
What’s in that 80-85%?
✤ Brainstorming - in messaging systems
✤ Decision making - in email
What’s in that 80-85%?
✤ Business intelligence - on the web and more
How can we make the data more actionable?
✤ Identify it
✤ Convert to a format you can work with
✤ Add structure, meaning:
✤ information extraction
✤ annotation
✤ content analytics
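The "convert to a format you can work with" step can be illustrated with a minimal sketch, assuming the source is a web page and the working format is plain text. It uses only Python's standard `html.parser`; the names `TextExtractor` and `html_to_text` are hypothetical, and real documents (office files, PDFs, email) would need their own converters.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips script/style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        # Entering a tag whose content is rendering logic, not content.
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are outside script/style elements.
        if self._skip_depth == 0:
            self._chunks.append(data)

    def text(self):
        # Collapse the whitespace left behind by the removed markup.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html):
    """Return the visible text of an HTML string as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

The aesthetic markup is discarded here on purpose: it carries rendering information, not the kind of structure that later extraction steps need.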
What about enterprise search?
✤ First line of defense
✤ Points you at the most relevant data, ranked via pattern matching and statistical analysis
✤ Does not assist in other visualizations or transformations without further machine processing
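As an illustration of relevance ranking by pattern matching plus statistics, here is a minimal TF-IDF sketch. It is an assumption about how such ranking can work in principle, not a description of any particular enterprise search product; the function names and the smoothed IDF formula are choices made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, documents):
    """Return (doc_id, score) pairs sorted by descending TF-IDF score.

    `documents` maps a document id to its plain text.
    """
    doc_tokens = {doc_id: tokenize(text) for doc_id, text in documents.items()}
    n_docs = len(doc_tokens)

    # Document frequency: in how many documents does each term occur?
    df = Counter()
    for tokens in doc_tokens.values():
        df.update(set(tokens))

    scores = {}
    for doc_id, tokens in doc_tokens.items():
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Smoothed inverse document frequency: rarer terms weigh more.
                idf = math.log((1 + n_docs) / (1 + df[term])) + 1.0
                score += (tf[term] / len(tokens)) * idf
        scores[doc_id] = score
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)
```

The result is only an ordered list of documents. Nothing in it identifies people, organizations, or relationships, which is why further machine processing is needed for other views of the data.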
Information Extraction
✤ Token identification - “tokenization”
✤ Part-of-speech tagging - “POS” tagging (noun, verb, adverb, adjective, etc.)
✤ Phrase identification - noun phrase
✤ Entity extraction - people, places, events, dates, organizations
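Two of the steps above, tokenization and entity extraction, can be sketched with regular expressions alone. This is a deliberately naive illustration: the patterns are assumptions made for the example, part-of-speech tagging and phrase identification are not covered because they need a trained model or a lexicon, and tools such as GATE or UIMA do far more.

```python
import re

# A token is a word (optionally with internal apostrophes or hyphens)
# or a single punctuation character.
TOKEN_RE = re.compile(r"\w+(?:['’-]\w+)*|[^\w\s]")

# Dates written like "November 17, 2011".
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b"
)

# Naive proper-name pattern: two or more consecutive capitalized words.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)


def extract_entities(text):
    """Return the dates and candidate names found in the text."""
    dates = DATE_RE.findall(text)
    # Blank out dates first so month names are not mistaken for names.
    remainder = DATE_RE.sub(" ", text)
    names = NAME_RE.findall(remainder)
    return {"dates": dates, "names": names}
```

The name pattern will misfire on capitalized sentence openers and miss single-word names, which is the kind of gap that gazetteers and statistical models are meant to close.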
Information Extraction
✤ Cluster analysis - group related information, where the relationship may not be known in advance
✤ Classification - mapping to specific categories
✤ Dependency identification / Rule generation
✤ Relationship detection - e.g. “Joe” “is CEO” at “IBM”
✤ Summarization - key concepts or key sentences
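Relationship detection and key-sentence summarization can be sketched as follows. Both functions are assumptions made for illustration: the single hand-written pattern covers only the "is CEO of/at" phrasing from the example above, and the summarizer simply scores sentences by average word frequency with a small hypothetical stopword list.

```python
import re
from collections import Counter

# One hand-written pattern for one relationship: "<Person> is (the) CEO of/at <Org>".
CEO_RE = re.compile(
    r"\b(?P<person>[A-Z][a-z]+(?:\s[A-Z][a-z]+)*)\s+is\s+(?:the\s+)?CEO\s+"
    r"(?:of|at)\s+(?P<org>[A-Z][A-Za-z]*(?:\s[A-Z][A-Za-z]*)*)"
)

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it",
             "for", "on", "at", "was", "with"}


def detect_ceo_relations(text):
    """Return (relation, person, organization) triples found in the text."""
    return [("CeoOf", m.group("person"), m.group("org"))
            for m in CEO_RE.finditer(text)]


def content_words(text):
    """Lowercased words with stopwords removed."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def key_sentences(text, n=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(content_words(text))

    def score(sentence):
        words = content_words(sentence)
        # Average frequency, so long sentences are not favored automatically.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    chosen = set(sorted(sentences, key=score, reverse=True)[:n])
    return [s for s in sentences if s in chosen]
```

Hand-written patterns like `CEO_RE` are precise but brittle; the rule generation bullet above is about learning such patterns from data instead of writing each one by hand.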
Open Tools
✤ GATE – General Architecture for Text Engineering, from the University of Sheffield, with many users and excellent documentation.
✤ GATE has customizable document and corpus processing pipelines. GATE is an architecture, a framework, and a development environment, with a clean separation of algorithms, data, and visualization.
Open Tools
✤ UIMA – Unstructured Information Management Architecture (IBM’s Watson uses this), originated at IBM, now an Apache project.
✤ Component software architecture with a document processing pipeline similar to GATE. Focus on performance and scalability, with distributed processing (web services).
UIMA
[Diagram: Common Analysis Structure (CAS) for the example sentence "Fred Center is the CEO of Center Micros"]
✤ Artifact (e.g., Document): the example sentence
✤ Analysis Results (i.e., Artifact Metadata):
✤ Parser: NP, VP, PP
✤ Named Entity: Person ("Fred Center"), Organization ("Center Micros")
✤ Relationship: CeoOf (Arg1: Person, Arg2: Org)
✤ UIMA CAS representation is now aligned with the XMI standard
✤ UIMA's basic building blocks are annotators. They iterate over an artifact to discover new types based on existing ones and update the Common Analysis Structure (CAS) for upstream processing.
Chart by IBM
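The annotator idea can be sketched in a few lines of Python. This is a toy model, not the UIMA API (UIMA itself is a Java/C++ framework): the `Cas` and `Annotation` classes and the three annotator functions are hypothetical stand-ins, and the "gazetteer" is just the two names from the chart.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Annotation:
    type: str
    begin: int
    end: int
    features: dict = field(default_factory=dict)


@dataclass
class Cas:
    """Toy stand-in for a CAS: the artifact plus its analysis results."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, type_name):
        return [a for a in self.annotations if a.type == type_name]

    def covered_text(self, annotation):
        return self.text[annotation.begin:annotation.end]


def person_annotator(cas):
    # Hypothetical one-entry gazetteer of known people.
    for m in re.finditer(r"Fred Center", cas.text):
        cas.annotations.append(Annotation("Person", m.start(), m.end()))


def organization_annotator(cas):
    # Hypothetical one-entry gazetteer of known organizations.
    for m in re.finditer(r"Center Micros", cas.text):
        cas.annotations.append(Annotation("Organization", m.start(), m.end()))


def ceo_of_annotator(cas):
    """Derives a CeoOf relation from existing Person and Organization types."""
    for person in cas.select("Person"):
        for org in cas.select("Organization"):
            between = cas.text[person.end:org.begin]
            if person.end <= org.begin and re.fullmatch(
                r"\s+is\s+the\s+CEO\s+of\s+", between
            ):
                cas.annotations.append(
                    Annotation("CeoOf", person.begin, org.end,
                               {"arg1": person, "arg2": org})
                )


def run_pipeline(text, annotators):
    """Run each annotator in order over one shared CAS."""
    cas = Cas(text)
    for annotate in annotators:
        annotate(cas)
    return cas
```

Order matters in this sketch: `ceo_of_annotator` finds nothing unless the person and organization annotators have already written their results into the shared structure, which mirrors the "new types based on existing ones" description.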
UIMA
Image by IBM
Commercial Tools
✤ Oracle Data Mining (Text Mining)
✤ IBM SPSS
✤ SAS Text Miner
✤ Smartlogic
✤ Lots of acquisitions going on in the “big data” space
✤ HP acquired Autonomy
✤ Oracle acquired Endeca
A Note on Tools
✤ UIMA and GATE – comprehensive suites of capabilities, each with a learning curve.
✤ Commercial tools range from unstructured-data capabilities inside DBMSs such as Oracle to business intelligence tools from Business Objects (which acquired Inxight, originally from Xerox PARC).
✤ Your mileage will vary. The biggest differentiator is your knowledge of your data.
What can unstructured data look like post-processing?
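Machine Processing
[Diagram: Unstructured Data feeding a Machine Processing Platform, shown alongside the services listed below]
✤ Input: Unstructured Data
✤ Machine Processing Platform: Natural Language Processing, Statistical Analysis, Rules-based Classification, Semantic Analysis
✤ Other components shown: Index, API, Visualizations, Federated Search, Data Stores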
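Rules-based classification, one of the platform components named in the diagram, could be sketched like this. The categories and keyword rules are hypothetical, loosely based on the kinds of content mentioned earlier (progress reports, decisions, brainstorming); a real rule set would come from knowledge of your own data.

```python
import re

# Hypothetical rule set: category name -> patterns that suggest it.
RULES = {
    "progress_report": [r"\bprogress\b", r"\bmilestone\b", r"\bon track\b"],
    "decision": [r"\bapproved?\b", r"\bdecided\b", r"\bdecision\b"],
    "brainstorming": [r"\bidea\b", r"\bwhat if\b", r"\bbrainstorm"],
}


def classify(text, rules=RULES, min_hits=1):
    """Return matching categories, strongest first (ties broken by name)."""
    lowered = text.lower()
    hits = {}
    for category, patterns in rules.items():
        # Count every occurrence of every pattern for this category.
        count = sum(len(re.findall(pattern, lowered)) for pattern in patterns)
        if count >= min_hits:
            hits[category] = count
    return sorted(hits, key=lambda category: (-hits[category], category))
```

A document can land in several categories or in none, and the resulting labels are the kind of added structure that an index or a visualization can then use.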
Questions?
Thank you
Christine Connors
Kevin Lynch
www.triviumrlg.com