WHAT IS HYDRA? Findability Day 2012
Hydra is technology
Hydra brings structure
What is unstructured data?
• A linguistic excuse?
News articles
Plain text that contains invaluable metadata for search, such as:
• Title
• Author byline
• Lead paragraph
Hydra is about your data
• Enrich your documents with metadata, to power your search
• Language detection
• Sentiment analysis
• Headline extraction
• Regular expression matching and extraction
• Filter out unwanted documents
• Collect statistics
• Export to Staging environments
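The enrichment steps listed above can be pictured as a chain of small stages, each taking a document and returning it with extra metadata, or dropping it. The following is only a minimal sketch under stated assumptions: the function names, the dict-based document, and the "By ..." byline pattern are hypothetical illustrations and are not Hydra's actual API.

```python
import re

# Hypothetical byline pattern: a line of the form "By <name>".
BYLINE = re.compile(r"^By[ \t]+(?P<author>.+)$", re.MULTILINE)


def extract_headline(doc):
    """Use the first non-empty line of the text as the title."""
    for line in doc["text"].splitlines():
        if line.strip():
            doc["title"] = line.strip()
            break
    return doc


def extract_byline(doc):
    """Regular expression matching and extraction of the author."""
    match = BYLINE.search(doc["text"])
    if match:
        doc["author"] = match.group("author").strip()
    return doc


def drop_short(doc, min_chars=20):
    """Filter stage: return None to drop unwanted (too short) documents."""
    return doc if len(doc["text"]) >= min_chars else None


def run_pipeline(doc, stages):
    """Pass one document through the stages in order; stop if it is dropped."""
    for stage in stages:
        doc = stage(doc)
        if doc is None:
            return None
    return doc


STAGES = [extract_headline, extract_byline, drop_short]
article = {"text": "Hydra released\nBy Jane Doe\nThe lead paragraph goes here."}
enriched = run_pipeline(dict(article), STAGES)
dropped = run_pipeline({"text": "too short"}, STAGES)
```

Here `enriched` carries the new `title` and `author` fields next to the original text, while `dropped` is `None` because the filter stage rejected the document.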
Before Hydra
Hydra scales
Hydra Design Objectives
Scalability
• Possible to connect any number of processing machines
Fault tolerance
• Failure of a stage affects only a single document
• Failures can be detected automatically
Robustness
• Stages and nodes are completely independent (no domino effect)
Development ease
• Allow test-driven pipeline development
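The fault tolerance objective (a failing stage affects only a single document, and the failure can be detected) can be illustrated with a small sketch. It assumes a hypothetical `process_all` helper and a hypothetical `lowercase_title` stage. It does not show how Hydra itself is implemented, only the idea of isolating failures per document.

```python
def lowercase_title(doc):
    """Hypothetical stage; raises KeyError if the document has no title."""
    return {**doc, "title": doc["title"].lower()}


def process_all(docs, stages):
    """Run every document through all stages, isolating failures per document."""
    done, failed = [], []
    for doc in docs:
        try:
            for stage in stages:
                doc = stage(doc)
            done.append(doc)
        except Exception as exc:
            # Record the failure so it can be detected, then keep going.
            failed.append((doc, repr(exc)))
    return done, failed


docs = [{"title": "A"}, {}, {"title": "B"}]
done, failed = process_all(docs, [lowercase_title])
```

Because each stage is a plain function of one document, it can be unit tested on its own before it is wired into a pipeline, which is one way to read the test-driven objective above.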
What about Hadoop and Big Data?
Use cases for document enrichment
• PageRank
• Analytics
Hadoop & Map/Reduce advantages
• Huge scalability
• Ability to work on the entire document set at once
Hadoop & Map/Reduce drawbacks
• Batch processing
• Time-to-index
Hydra integrated with Hadoop
Blue – First round of indexing only
Red – Second round of indexing
Purple – All documents
Hydra in summary
Hydra
• can chew through almost anything
• has many heads
• regenerates
• scales
Hydra is Open Source
• Other committers
• The role of Findwise
For more information:
• http://www.findwise.com/hydra
• http://findwise.github.com/Hydra
• Email: [email protected]
Joel Westberg [email protected]
@joelwes