Using Graph theory to understand Intent & Concepts - Neo4j User Group (January 2013)


tumra.com  

Using Graph Theory to understand Intent & Concepts – January 2013  

UNDERSTANDING INTENT & CONCEPTS  


•  Use case:
   -  Enhancing Social TV user experience
   -  Matching users to content that interests them

•  Topics we’ll cover:
   -  Natural Language Processing
   -  Graph Theory
   -  Machine Learning

USE CASE: ENHANCED SOCIAL TV

•  Objectives:
   -  Increase engagement with content
   -  Enhance multi-channel user experience

•  We built a prototype solution:
   -  Mines unstructured data in real-time
   -  Understands:
      -  What interests individual users
      -  Entities & Concepts (People, Places, Events)

Help users to “follow the story” regardless of the news outlet, integrated with web / second-screen

THE CHALLENGE  

Photo Credit: byrion on Flickr (cc)

Magic?!?!

THE PROBLEM  


Unstructured Data

Awesomeness!

THE PROBLEM  


•  Little useful data to work with…
   -  Streams of continuous live TV
   -  Have to create metadata

•  Where did we start?
   -  Ingest several live news channels
   -  Extract whatever data was available:
      -  In-video text using OCR
      -  Subtitles / Closed Captions

We used a simple N-Gram model for exact matches; then Apache Lucene for everything else…  

STEP 1: NAMED ENTITY RECOGNITION

EXAMPLE: N.E.R.

“David Cameron and the German Chancellor Angela Merkel meet to discuss the debt crisis and signal their approval for greater eurozone integration.”
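The exact-match stage (the "simple N-Gram model" mentioned in the notes) could look roughly like the sketch below. It is an illustration only, not the talk's implementation: the `KB` dictionary, the function names and the `max_n` parameter are assumptions, and the entity types are borrowed from the recap slide's examples. Anything the exact match misses would go to a fuzzier search such as Apache Lucene, which is not shown here.

```python
import re

# Hypothetical mini knowledge base; the types follow the recap slide's examples.
KB = {
    "david cameron": "Person",
    "angela merkel": "Person",
    "german chancellor": "Concept",
    "debt": "Concept",
    "eurozone": "Place",
}

def ngrams(tokens, max_n=3):
    """Yield every n-gram of 1..max_n tokens, longest first."""
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

def exact_matches(text, kb=KB, max_n=3):
    """Return (surface form, type) pairs for n-grams found verbatim in kb."""
    tokens = re.findall(r"\w+", text.lower())
    return [(gram, kb[gram]) for gram in ngrams(tokens, max_n) if gram in kb]

sentence = ("David Cameron and the German Chancellor Angela Merkel meet to "
            "discuss the debt crisis and signal their approval for greater "
            "eurozone integration.")
matches = exact_matches(sentence)
```

Note that this only finds surface forms; it says nothing yet about which "David Cameron" is meant.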


INITIAL SOLUTION  

[Diagram components: Unstructured Data, NER, NoSQL, Awesomeness!]

OH NO!!! *facepalm*  

Photo Credit: cesarastudillo on Flickr (cc)

DISAMBIGUATION  


•  Which “David Cameron”?
   -  We have many in our Knowledgebase
   -  Sportsmen, actors, painters & characters…

•  Our initial simplistic approach was naïve
   -  Works great with unambiguous matches
   -  Best-case returns top-scoring entity

•  We needed a smarter approach

RECAP  


•  We have an effectively ‘flat’ KB of Entities
   -  “David Cameron” -> Politician (Person)
   -  “Angela Merkel” -> Politician (Person)
   -  “German Chancellor” -> Political office (Concept)
   -  “Debt” -> Economic concept (Concept)
   -  “Eurozone” -> Economic area (Place)

•  We needed a way to find relationships between Entities

THE BIG IDEA  

Graphs allow us to store relationships between entities, and graph algorithms allow us to interrogate those connections…  

GRAPH DATABASES  


•  Apache Giraph
•  Neo4j
•  GraphLab
•  GoldenOrb

… of course there are many more open-source & proprietary ones  

… it had to be fast, scalable, and under active development

SO, WHICH ONE?  


???

We had 250 million Nodes, and 4 billion Edges… great initial results but horrendously inefficient!

Example: “David Cameron” & “Angela Merkel”  

STEP 2: BUILDING RELATIONSHIPS

INITIAL IMPROVEMENTS  


•  We didn’t need everything… just:
   -  People: “David Cameron”, “Angela Merkel”
   -  Places: “London”, “Downing Street”, “Eurozone”
   -  Concepts: “Debt”, “President”, “Eurozone”
   -  Things: Companies, Products etc.

•  Pruned the graph using Map/Reduce

•  This reduced the number of Entities…
   -  … but we still had billions of connections
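As a rough illustration of the pruning step, the sketch below keeps only nodes of the four wanted types and drops every edge that touches a discarded node. It is a single-process stand-in, not the Map/Reduce job used in the talk; the type names, `node_types` mapping and sample data are hypothetical.

```python
# Hypothetical type labels for the four kinds of entity the slide keeps.
KEEP_TYPES = {"Person", "Place", "Concept", "Thing"}

def prune(node_types, edges, keep=KEEP_TYPES):
    """Keep nodes whose type is in `keep`, and edges between two kept nodes."""
    kept_nodes = {node for node, kind in node_types.items() if kind in keep}
    kept_edges = [(a, b) for a, b in edges if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges

# Illustrative data only; "Other" stands for any metadata node we do not need.
node_types = {
    "David Cameron": "Person",
    "London": "Place",
    "Debt": "Concept",
    "some-metadata-node": "Other",
}
edges = [
    ("David Cameron", "London"),
    ("David Cameron", "Debt"),
    ("David Cameron", "some-metadata-node"),
]
kept_nodes, kept_edges = prune(node_types, edges)
```

In a real Map/Reduce job the node filter and the edge filter would typically be separate passes over the node and edge files; the logic per record stays the same.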

EXAMPLE: PEOPLE, PLACES, CONCEPTS

“David Cameron and the German Chancellor Angela Merkel meet to discuss the debt crisis and signal their approval for greater eurozone integration.”


Legend: People, Concepts, Places

DISAMBIGUATION  

[Diagram: candidate nodes David Cameron (footballer), David Cameron (actor), David Cameron (politician) and David Cameron (painter); Angela Merkel; shared attribute nodes Politician, Head of State and Living Person]

Possibilities: shortest path, number of common connections etc.  

Sure, all that extra metadata was tasty, but we didn’t need it all to solve the use case…

So we used Map/Reduce to count the common connections
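Counting common connections boils down to intersecting neighbour sets. The sketch below does this in memory for the example on the slide; it is not the Map/Reduce job from the talk. The adjacency data is hypothetical: the attribute names come from the diagram, but which candidate links to which attribute is an assumption made for illustration.

```python
# Hypothetical adjacency sets; the edge assignments are assumed, not from the talk.
NEIGHBOURS = {
    "David Cameron (politician)": {"Politician", "Head of State", "Living Person"},
    "David Cameron (footballer)": {"Living Person"},
    "David Cameron (actor)": {"Living Person"},
    "David Cameron (painter)": set(),
    "Angela Merkel": {"Politician", "Head of State", "Living Person"},
}

def common_connections(graph, a, b):
    """Return the set of nodes adjacent to both a and b."""
    return graph.get(a, set()) & graph.get(b, set())

def rank_candidates(graph, candidates, context_entity):
    """Sort candidates by how many connections they share with context_entity."""
    counts = {c: len(common_connections(graph, c, context_entity)) for c in candidates}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

candidates = [name for name in NEIGHBOURS if name.startswith("David Cameron")]
ranked = rank_candidates(NEIGHBOURS, candidates, "Angela Merkel")
```

With this data the politician shares three connections with Angela Merkel and ranks first.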

STEP 3: SIMPLIFYING THE GRAPH

SIMPLIFIED  

[Diagram: the four David Cameron candidates (footballer, actor, politician, painter) linked to Angela Merkel by edges labelled with the number of common connections: 3, 1, 1]

Woah … that looks a lot like a Least Cost Routing problem

LEAST COST PATH  

[Diagram: the same candidates linked to Angela Merkel, with each edge relabelled as a cost: 1/3, 1/1, 1/1]

cost = 1 / number of common connections
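A minimal sketch of the least-cost idea, using Dijkstra's algorithm from the Python standard library rather than Neo4j. The slide does not say which candidate carries which weight, so the assignment below (politician = 3, footballer = 1, actor = 1, painter left unconnected) is an assumption, as is the virtual "mention" start node.

```python
import heapq

def least_cost_path(graph, source, target):
    """Dijkstra's algorithm; returns (total cost, path) or (inf, []) if unreachable."""
    queue = [(0.0, source, [source])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == target:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, weight in graph.get(node, {}).items():
            if neighbour not in visited:
                heapq.heappush(queue, (cost + weight, neighbour, path + [neighbour]))
    return float("inf"), []

# Assumed common-connection counts per candidate (hypothetical assignment).
COMMON = {
    "David Cameron (politician)": 3,
    "David Cameron (footballer)": 1,
    "David Cameron (actor)": 1,
}

# A zero-cost virtual node fans out to every candidate for the ambiguous mention.
graph = {"mention: David Cameron": {candidate: 0.0 for candidate in COMMON}}
for candidate, count in COMMON.items():
    graph[candidate] = {"Angela Merkel": 1.0 / count}  # cost = 1 / common connections

cost, path = least_cost_path(graph, "mention: David Cameron", "Angela Merkel")
best_candidate = path[1] if len(path) > 1 else None
```

Inverting the count turns "most connections" into "cheapest route", so any off-the-shelf shortest-path routine picks the most strongly connected candidate.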

UPDATED SOLUTION  

[Diagram components: Unstructured Data, NER, NoSQL, Neo4j, Disambiguation, Awesomeness!]

RECAP  


•  Graphs allow us to interrogate relationships
   -  Disambiguate when faced with multiple possibilities
   -  Infer more about the context of what’s happening

•  Went through iterations of improvements
   -  Kept our Entity data in NoSQL = TBs
   -  Used the Graph as an index of sorts = GBs

•  Neo4j was a great fit for our needs

Some queries were taking ‘seconds’ and we needed to go a lot faster because TV won’t wait for us …

Do we really need to check the Graph every time?

STEP 4: MAKING IT WORK REAL-TIME

ENTER MACHINE LEARNING  


•  We can use simple predictors to estimate the likelihood of Entities occurring
   -  e.g. every time we’ve looked for “David Cameron” in the past the best match was the Politician

•  Keeping a ‘probabilistic context’ of recent Entities allows us to detect shifts in topics
   -  Works especially well on News channels
   -  Reduces the demand on Graph lookups
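One way to picture such a simple predictor is a frequency table of past resolutions: if one entity has won often enough for a surface form, answer from the table and skip the graph. The class name, the `threshold` value and the sample counts below are all hypothetical; this is a sketch of the idea, not the production model.

```python
from collections import Counter, defaultdict

class MentionPredictor:
    """Remembers which entity each surface form resolved to in the past."""

    def __init__(self):
        self.counts = defaultdict(Counter)  # surface form -> Counter of entities

    def observe(self, surface, entity):
        """Record one confirmed resolution (for example from a graph lookup)."""
        self.counts[surface][entity] += 1

    def predict(self, surface, threshold=0.9):
        """Return (entity, probability) if confident enough, else None."""
        seen = self.counts.get(surface)
        if not seen:
            return None
        entity, hits = seen.most_common(1)[0]
        probability = hits / sum(seen.values())
        return (entity, probability) if probability >= threshold else None

predictor = MentionPredictor()
for _ in range(19):
    predictor.observe("David Cameron", "David Cameron (politician)")
predictor.observe("David Cameron", "David Cameron (footballer)")

confident = predictor.predict("David Cameron")   # answered without the graph
unknown = predictor.predict("Angela Merkel")     # None: fall back to the graph
```

A `None` result is the signal to fall back to the slower graph-based disambiguation and then feed the answer back in with `observe`.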

Looks complicated, but it’s basically just counting & division

BAYES’ THEOREM

Photo Credit: mattbuck007 on Flickr (cc)
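Bayes' theorem states that P(entity | context) = P(context | entity) × P(entity) / P(context). The sketch below shows how that reduces to counting and division for a 'probabilistic context' of recent entities. It assumes the context entities are independent given the candidate (a naive Bayes simplification) and uses add-one smoothing; both choices, and all the counts, are illustrative assumptions rather than details from the talk.

```python
from collections import Counter

def posterior(prior_counts, cooccur_counts, context, vocab_size):
    """Estimate P(entity | context) from raw counts via Bayes' theorem."""
    total = sum(prior_counts.values())
    scores = {}
    for entity, count in prior_counts.items():
        score = count / total                      # prior P(entity)
        seen = cooccur_counts.get(entity, Counter())
        denominator = sum(seen.values()) + vocab_size
        for item in context:
            score *= (seen[item] + 1) / denominator  # smoothed P(item | entity)
        scores[entity] = score
    normaliser = sum(scores.values())              # plays the role of P(context)
    return {entity: score / normaliser for entity, score in scores.items()}

# Hypothetical counts: how often each candidate won, and what it co-occurred with.
prior_counts = {"politician": 90, "footballer": 10}
cooccur_counts = {
    "politician": Counter({"Angela Merkel": 40, "Eurozone": 30}),
    "footballer": Counter({"Football": 8}),
}

news = posterior(prior_counts, cooccur_counts, ["Angela Merkel"], vocab_size=3)
sport = posterior(prior_counts, cooccur_counts, ["Football", "Football"], vocab_size=3)
```

With these made-up counts a political context favours the politician, while two recent sports entities are enough to flip the estimate towards the footballer, which is the kind of topic shift the slide describes.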

We solved the problem for English, but what about other languages?  

STEP 5: MAKING IT WORK WORLDWIDE

LANGUAGE  


•  Our core Entities of ‘People’, ‘Places’, & ‘Concepts’ are language agnostic…

•  We needed a way to ditch ‘language’ and jump straight to entities…
   -  The colour ‘Red’ means the same thing whether you call it ‘Rot’, ‘Rouge’ or ‘赤’

•  Again, Graphs could solve the problem

LANGUAGE INDEPENDENT  

[Diagram: the entity “Color: Red” linked to its labels in several languages: Red, أحمر, Rouge, Rot, Röd, Rojo, 紅]
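The language-independent lookup can be pictured as label nodes that all point at one entity node. The sketch below flattens those edges into a dictionary; the `color:red` identifier and the function names are hypothetical, and the labels are the ones shown on the slides.

```python
# Each pair is one label -> entity edge; "color:red" is a made-up entity id.
LABEL_EDGES = [
    ("Red", "color:red"),
    ("أحمر", "color:red"),
    ("Rouge", "color:red"),
    ("Rot", "color:red"),
    ("Röd", "color:red"),
    ("Rojo", "color:red"),
    ("紅", "color:red"),
    ("赤", "color:red"),
]

def build_index(edges):
    """Map each case-folded label to the entity it points at."""
    return {label.casefold(): entity for label, entity in edges}

def resolve(index, label):
    """Return the entity id for a label in any known language, or None."""
    return index.get(label.casefold())

index = build_index(LABEL_EDGES)
french = resolve(index, "ROUGE")
japanese = resolve(index, "赤")
missing = resolve(index, "Blau")
```

Once a mention resolves to the entity id, everything downstream (disambiguation, context, matching) can ignore the language it arrived in.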

Typical response time ~30ms … relevancy improves over time and learns new entities ‘online’  

PROBLEM SOLVED  


FINAL SOLUTION  

[Diagram components: Unstructured Data, NER, NoSQL, Neo4j, Disambiguation, Language Model, Machine Learning, Awesomeness!]

“TUMRA” is a transliteration of the Sanskrit word for “BIG”; we thought it was a great name … (and the .com was available)

ABOUT US  


•  We’ve built a product…
   -  Our ‘Digital Marketing Optimization’ platform improves conversion rates & customer satisfaction for eCommerce & Marketing campaigns
   -  Launches Q1 2013

•  What else do we do?
   -  ‘Big Data’ & ‘Data Science’ professional services
   -  Bespoke prototype & solution development


THANKS FOR LISTENING  

TUMRA You?

We’re hiring! Data Scientists & Developers

[email protected]


THANKS FOR LISTENING

Questions?  tumra.com

[email protected]  

twitter.com/tumra