Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Bigger Data. Better Results.™

Semi-supervised Learning for Micro-blog Classification

Techniques for Analyzing Text Data

Nick Pendar, Ph.D. Skytree Inc.

October 8, 2015

Overview

•  Introduction
•  Uses of text data
•  Sources of text data
•  Case Study
•  Challenge of creating training sets
•  Experiments
•  Conclusions

Sources of Text Data

•  Internal
   –  Email, chat and other communication
   –  Whitepapers, patents & other IP
   –  Business documents (e.g., purchase orders, CRM notes)
   –  System logs, service records & other technical data

•  External
   –  Web, social media, news
   –  Public records
   –  Electronic health records
   –  Scientific/professional publications
   –  Libraries & other specialized repositories

ML on text data?

•  Recommendation/Search
   –  Show me documents relevant to me.
      •  In response to a search query (IR)
      •  Similar to another document (Nearest Neighbor Search, Classification, Clustering)
      •  Related to my interests and people like me (Recommendation)

•  Categorization / eDiscovery
   –  Show me documents about topic X. (Classification, IR, NN, Clustering)
   –  Show me documents relevant to this litigation. (Classification, IR, NN, Clustering)
   –  Show me documents I can delete. (Classification, IR, NN, Clustering)

•  Analytics
   –  What is X thinking about Y? (Classification, Sentiment Analysis, NLP)
   –  Why did the customer make a certain decision? (Classification, NLP, Information Extraction)
   –  What is the function of X in organization Y? (Classification, NLP, Information Extraction)

A Classifier at Heart

•  Many tasks can be considered classification tasks.

•  Classification (supervised ML) needs training data.

•  Clean, sufficient training data leads to a good classifier.

•  Challenge: Clean, sufficient training data is often hard to find.

•  Often most of a data scientist’s time is spent collecting and cleaning data.

Case Study: Micro-blog Classification

•  Client interested in classifying tweets for better search/access.

•  500M tweets per day.

•  150M English

•  Challenges:
   –  Data size and velocity
   –  Tweets are short and contain very little signal.
   –  Dynamic nature of data
      •  Changing topics: Need to update existing classifiers, build new ones.
      •  Changing vocabulary
      •  Lack of training data

Possible Sources of Training Data

•  Human annotation
   –  Pro: Accurate; Con: Expensive, doesn’t scale

•  Hashtags
   –  Pro: Abundant; Con: Extremely noisy

•  Unambiguous Keywords
   –  Pro: Accurate; Con: Hard to curate, Low recall

•  Comprehensive Keyword Set
   –  Pro: Large coverage; Con: Noise
   –  ML can reduce noise by leveraging big data

Collecting Initial Keyword Set

•  Specific Handles/HashTags or Handle/HashTag Patterns
   –  @apple, #apple, @samsung*, #samsung*
   –  Very low yield (e.g., 0.01% for Samsung)

•  Knowledge Graph Nodes
   –  Collect keywords/titles under a category/node (e.g., Sport, NBA)
   –  More keywords: ~10,000 for NBA
   –  Higher yield (e.g., ~1% for NBA)
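
The handle/hashtag patterns and the notion of "yield" above can be illustrated with a minimal sketch. It assumes that yield means the fraction of tweets matched by at least one pattern, and that a trailing `*` is a simple prefix wildcard; the function names and the sample tweets are made up for illustration and are not from the talk.

```python
from fnmatch import fnmatchcase

# Patterns in the style shown on the slide: exact handles/hashtags
# plus prefix patterns with a trailing wildcard.
PATTERNS = ["@apple", "#apple", "@samsung*", "#samsung*"]


def matches_any(tweet, patterns=PATTERNS):
    """Return True if any whitespace-separated token matches a pattern.

    Tokens are lowercased first; punctuation attached to a token is NOT
    stripped, which a real pipeline would likely need to handle.
    """
    tokens = tweet.lower().split()
    return any(fnmatchcase(tok, pat) for tok in tokens for pat in patterns)


def keyword_yield(tweets, patterns=PATTERNS):
    """Fraction of tweets matched by at least one pattern (assumed definition)."""
    if not tweets:
        return 0.0
    hits = sum(1 for t in tweets if matches_any(t, patterns))
    return hits / len(tweets)


# Toy, made-up tweets for illustration only.
sample = [
    "Loving my new phone from @SamsungMobile",
    "Traffic is terrible today",
    "#apple event starts in an hour",
    "I ate an apple for lunch",
]
print(keyword_yield(sample))
```

Note that the plain word "apple" is deliberately not matched, which is what keeps such patterns precise and also why their yield is so low on real data.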

Example Keywords

•  NBA: kevin garnett, gary harris, george gervin, chicago bulls, indiana pacers, phoenix suns, steven adams, new orleans pelicans, dirk nowitzki, robert horry, steve nash, amir johnson, brooklyn nets

•  Sports: a bracciuta, a brazzos, atv enduro, atv off-road racing, atv racing, aba guresi, abseiling, adi murai, adventure racing, aerobatics, aggressive inline skating, aid climbing, aikido, air hockey

Building the Training Set

•  Positive examples: A set of tweets containing target keywords.

•  Need negative examples
   –  Somehow select examples least likely to be in positive set.
      •  Trained a kernel density estimation model on the positive set.
      •  Selected bottom 5 percentile as negative data.
      •  Result: Slightly better on cross-validation, worse on unseen data.
   –  Uniformly sampled negatives from the rest of the data.
      •  Result: Slightly worse on cross-validation, better on unseen data.
   –  Use all of the rest of the data as negative, i.e., unbalanced
      •  Let Skytree software handle the unbalanced dataset.
   –  Decided to use the second approach.
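
As a rough illustration of the uniform-sampling option (which appears to be the "second approach" the slide settles on), the sketch below labels keyword-matching tweets as positive and draws negatives uniformly at random from the remaining tweets. The function names, the `n_negative` parameter, and the whole-word matching rule are assumptions made for this example, not details from the talk.

```python
import random


def contains_keyword(tweet, keywords):
    """True if any (lowercase) keyword phrase occurs as whole words in the tweet.

    Punctuation is not stripped here; that is a simplification.
    """
    padded = " " + " ".join(tweet.lower().split()) + " "
    return any(" " + kw + " " in padded for kw in keywords)


def build_training_set(tweets, keywords, n_negative, seed=0):
    """Return (tweet, label) pairs: 1 for keyword matches, 0 for sampled negatives."""
    positives = [t for t in tweets if contains_keyword(t, keywords)]
    rest = [t for t in tweets if not contains_keyword(t, keywords)]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    negatives = rng.sample(rest, min(n_negative, len(rest)))
    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]


# Toy, made-up data for illustration only.
keywords = ["kevin garnett", "chicago bulls"]
tweets = [
    "Kevin Garnett was great tonight",
    "chicago bulls win again",
    "Making pasta for dinner",
    "Stuck in traffic",
    "New phone day",
]
training = build_training_set(tweets, keywords, n_negative=2)
for text, label in training:
    print(label, text)
```

The negatives produced this way are noisy by construction: some unmatched tweets are really about the target topic. The slide's observation is that this choice nevertheless did better on unseen data than picking the "least likely" examples.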

Workflow Summary

•  Initial Concept (e.g., NBA)

•  Identify Initial Keywords from an External Knowledge Source (e.g., Wikipedia)
   –  Output: Initial Keywords (kevin garnett, gary harris, george gervin, chicago bulls, …)

•  Search, Score, Split Data over Unlabeled Documents (i.e., tweets)
   –  Output: positive (+) and negative (-) sets

•  Identify ML Features (i.e., indicative words/phrases)

•  Vector Space Represent
   –  Output: ML-Ready Dataset
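
The last two workflow steps (identifying features and building a vector space representation) might look roughly like the following sketch, which treats single words and two-word phrases as the candidate features and turns each document into a count vector. This is a plain bag-of-n-grams illustration under that assumption; the talk does not specify how features were scored or selected, and all function names here are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def featurize(text, max_n=2):
    """Count word n-grams for n = 1..max_n (lowercased, whitespace tokens)."""
    tokens = text.lower().split()
    feats = Counter()
    for n in range(1, max_n + 1):
        feats.update(ngrams(tokens, n))
    return feats


def build_vocabulary(docs, max_n=2):
    """Map each feature seen in docs to a column index."""
    vocab = {}
    for doc in docs:
        for feat in sorted(featurize(doc, max_n)):
            vocab.setdefault(feat, len(vocab))
    return vocab


def vectorize(doc, vocab, max_n=2):
    """Dense count vector for doc; features outside the vocabulary are ignored."""
    vec = [0] * len(vocab)
    for feat, count in featurize(doc, max_n).items():
        if feat in vocab:
            vec[vocab[feat]] = count
    return vec


# Toy, made-up documents for illustration only.
docs = ["chicago bulls win", "bulls win again"]
vocab = build_vocabulary(docs)
print(vectorize("bulls bulls win", vocab))
```

At the scale mentioned in the case study, a sparse representation would be needed instead of dense lists; the dense form is used here only to keep the example short.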

Experiments

•  Goal: Create an initial classifier with high precision, and improve it over time.

•  Built classifiers with the following featurization techniques:
   –  1-gram (single word) features
   –  1-gram and 2-gram (2-word sequence) features in combination

•  AutoModel™ finds the best-performing model
   –  “smart-search” finds best parameter settings.
   –  Optimized models for best precision at top 25%.
   –  System also suggests best parameters for optimum Gini, F-score, Accuracy
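
To make the optimization target concrete, here is a small sketch of two of the metrics named above. It assumes "precision at top 25%" means the fraction of true positives among the 25% of examples with the highest model scores; the exact definition used by the Skytree software is not given in the slides, so treat this as one plausible reading.

```python
def precision_at_top(scores, labels, fraction=0.25):
    """Precision among the top `fraction` of examples ranked by score.

    labels are 1 (positive) or 0 (negative). Ties are broken by input order.
    """
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    k = max(1, int(len(scores) * fraction))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    return sum(label for _, label in ranked[:k]) / k


def f_score(predicted, labels):
    """F1: harmonic mean of precision and recall for binary 0/1 labels."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy, made-up scores and labels for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_top(scores, labels))        # top 2 of 8 examples
print(precision_at_top(scores, labels, 0.5))   # top 4 of 8 examples
print(f_score([1, 1, 1, 1, 0, 0, 0, 0], labels))
```

Optimizing for precision at the top of the ranking fits the stated goal: the initial classifier only needs to be right about the tweets it is most confident in, and recall can be improved over time.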

Results

•  Results: ~98-99% F-score typical

But does it actually work?

•  Model picks up patterns beyond the keywords

NBA
–  Time will reveal whether McAdoo's decision is right
–  "@livelaughPRELLA: I'm still stuck on James Michael-McAdoo's decision to enter the draft. Like, no...” hell no!
–  is Jeff any good ? I never seen him play but heard al Jefferson P right? lol but if your talking about him ...
–  @Espngreeny ' team had year guys. ' & had Delk, McCarty, Anderson, Sheppard, Padgett, Epps, Turner, Pope, Evans, Edwards, Mills

Machine Learning
–  Latent Semantic Indexing: How Does the LSI Algorithm Work?
–  Hidden Markov Models: The Baum-Welch Algorithm #compsci #datascience #math #stem #mathchat

But does it actually work?

•  Removed one of the initial keywords to see if the model picks it up.

Sports without “baseball”
–  Boomer Esiason: For Baseball Season Opener, Mets player's wife should have gotten a C-section: Daniel Murphy ... http://t.co/RtCoDMcDTP
–  RT @rob_giorgi: Glad baseball season is back. I wanna go to a red sox game soon

NBA without “NBA”
–  #NBA Thursday Night Live Commentary: Dallas #Mavericks vs Los Angeles #Clippers via @VAVEL_USA
–  New Adidas x Undefeated Forum Mid NBA All Star Game Black/White Undftd Men Shoes
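
The held-out keyword check on this slide can be expressed as a simple measurement: among tweets that mention the removed keyword, what share does the model still label positive? The sketch below assumes the trained model is available as a callable returning True or False; `toy_classifier` is a hypothetical stand-in used only so the example runs, not the model from the talk.

```python
def held_out_recall(tweets, held_out_keyword, classify):
    """Share of tweets mentioning the held-out keyword that `classify` accepts.

    Returns None when no tweet mentions the keyword.
    """
    mentioning = [t for t in tweets if held_out_keyword in t.lower()]
    if not mentioning:
        return None
    hits = sum(1 for t in mentioning if classify(t))
    return hits / len(mentioning)


def toy_classifier(tweet):
    """Hypothetical stand-in for a trained model: fires on a few context words."""
    cues = {"season", "game", "sox", "mets"}
    return any(word in cues for word in tweet.lower().split())


# Toy, made-up tweets for illustration only.
tweets = [
    "Glad baseball season is back",
    "I love baseball",
    "Stuck in traffic",
]
print(held_out_recall(tweets, "baseball", toy_classifier))
```

A high value suggests the model has learned context around the removed keyword rather than memorizing the keyword list, which is the point the example tweets above are making.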

Conclusions

•  Created training sets automatically, leveraging Wikipedia as an external knowledge source.

•  Trained high-precision (and high-recall) models. This is only possible using big data.

•  The resulting models go beyond the initial keywords.

•  The resulting models are adaptable to changing topics.
