Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Bigger Data. Better Results.™ Semi-supervised Learning for Micro-blog Classification: Techniques for Analyzing Text Data. Nick Pendar, Ph.D., Skytree Inc. October 8, 2015

Transcript of Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Page 1: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Bigger Data. Better Results.™

Semi-supervised Learning for Micro-blog Classification

Techniques for Analyzing Text Data

Nick Pendar, Ph.D.
Skytree Inc.

October 8, 2015

Page 2: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Overview

• Introduction
• Uses of text data
• Sources of text data
• Case Study
• Challenge of creating training sets
• Experiments
• Conclusions

Page 3: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Sources of Text Data

• Internal
  • Email, chat and other communication
  • Whitepapers, patents & other IP
  • Business documents (e.g., purchase orders, CRM notes)
  • System logs, service records & other technical data

• External
  • Web, social media, news
  • Public records
  • Electronic health records
  • Scientific/professional publications
  • Libraries & other specialized repositories

Page 4: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

ML on text data?

• Recommendation/Search
  – Show me documents relevant to me.
    • In response to a search query (IR)
    • Similar to another document (Nearest Neighbor Search, Classification, Clustering)
    • Related to my interests and people like me (Recommendation)

• Categorization / eDiscovery
  – Show me documents about topic X. (Classification, IR, NN, Clustering)
  – Show me documents relevant to this litigation. (Classification, IR, NN, Clustering)
  – Show me documents I can delete. (Classification, IR, NN, Clustering)

• Analytics
  – What is X thinking about Y? (Classification, Sentiment Analysis, NLP)
  – Why did the customer make a certain decision? (Classification, NLP, Information Extraction)
  – What is the function of X in organization Y? (Classification, NLP, Information Extraction)

Page 5: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

A Classifier at Heart

• Many tasks can be considered classification tasks.

• Classification (supervised ML) needs training data.

• Enough clean training data leads to a good classifier.

• Challenge: Enough clean training data is often hard to find.

• Often most of a data scientist’s time is spent collecting and cleaning data.

Page 6: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Case Study: Micro-blog Classification

• Client interested in classifying tweets for better search/access.

• 500M tweets per day.

• 150M English

• Challenges:
  – Data size and velocity
  – Tweets are short, containing very little signal.
  – Dynamic nature of data
    • Changing topics: Need to update existing classifiers, build new ones.
    • Changing vocabulary
    • Lack of training data

Page 7: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Possible Sources of Training Data

• Human annotation
  • Pro: Accurate; Con: Expensive, doesn’t scale

• Hashtags
  • Pro: Abundant; Con: Extremely noisy

• Unambiguous Keywords
  • Pro: Accurate; Con: Hard to curate, low recall

• Comprehensive Keyword Set
  • Pro: Large coverage; Con: Noise
  • ML can reduce noise by leveraging big data

Page 8: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Collecting Initial Keyword Set

• Specific Handles/HashTags or Handle/HashTag Patterns
  • @apple, #apple, @samsung*, #samsung*
  • Very low yield (e.g., 0.01% for Samsung)

• Knowledge Graph Nodes
  • Collect keywords/titles under a category/node (e.g., Sport, NBA)
  • More keywords: ~10,000 for NBA
  • Higher yield (e.g., ~1% for NBA)
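The handle/hashtag patterns and the notion of "yield" above can be illustrated with a minimal sketch. It assumes that a trailing `*` means "any further word characters" and that yield is the fraction of tweets matching at least one pattern; the slides state neither definition explicitly, and the function names `compile_patterns` and `yield_rate` are hypothetical.

```python
import re

def compile_patterns(patterns):
    """Combine handle/hashtag patterns such as '@samsung*' into one regex.

    Assumption: a trailing '*' matches any further word characters;
    patterns without '*' must end at a word boundary.
    """
    parts = []
    for p in patterns:
        if p.endswith("*"):
            parts.append(re.escape(p[:-1]) + r"\w*")
        else:
            parts.append(re.escape(p) + r"\b")
    # The lookbehind keeps '@apple' from matching inside 'joe@apple.com'.
    return re.compile(r"(?<!\w)(?:" + "|".join(parts) + ")", re.IGNORECASE)

def yield_rate(tweets, pattern):
    """Fraction of tweets that match at least one pattern."""
    if not tweets:
        return 0.0
    matched = sum(1 for t in tweets if pattern.search(t))
    return matched / len(tweets)

if __name__ == "__main__":
    pattern = compile_patterns(["@apple", "#apple", "@samsung*", "#samsung*"])
    sample = [
        "Love my new phone from @SamsungMobile",
        "I ate an apple",
        "#apple event today",
    ]
    print(yield_rate(sample, pattern))
```

On a real tweet stream the same ratio would be computed over a large sample, which is where figures such as 0.01% or ~1% would come from.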

Page 9: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Example Keywords

• NBA: kevin garnett, gary harris, george gervin, chicago bulls, indiana pacers, phoenix suns, steven adams, new orleans pelicans, dirk nowitzki, robert horry, steve nash, amir johnson, brooklyn nets

• Sports: a bracciuta, a brazzos, atv enduro, atv off-road racing, atv racing, aba guresi, abseiling, adi murai, adventure racing, aerobatics, aggressive inline skating, aid climbing, aikido, air hockey

Page 10: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Building the Training Set

• Positive examples: A set of tweets containing target keywords.

• Need negative examples
  • Somehow select examples least likely to be in positive set.
    • Trained a kernel density estimation model on the positive set.
    • Selected bottom 5th percentile as negative data.
    • Result: Slightly better on cross-validation, worse on unseen data.
  • Uniformly selected rest of data as negative.
    • Result: Slightly worse on cross-validation, better on unseen data.

• Use all of the rest of the data as negative, i.e., unbalanced
  • Let Skytree software handle the unbalanced dataset.

– Decided to use second approach.
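As a rough illustration of the keyword-based split and the uniform selection of negatives described above, the following sketch labels tweets containing any target keyword as positive and draws negatives uniformly at random from the remainder. The plain substring matching, the fixed seed, and the names `split_by_keywords` and `sample_negatives` are assumptions made for illustration; the kernel density estimation variant is not shown.

```python
import random

def split_by_keywords(tweets, keywords):
    """Split tweets into (positives, rest) by case-insensitive keyword match.

    Assumption: simple substring matching; a production system would
    likely tokenize first to avoid partial-word matches.
    """
    lowered = [k.lower() for k in keywords]
    positives, rest = [], []
    for tweet in tweets:
        text = tweet.lower()
        if any(k in text for k in lowered):
            positives.append(tweet)
        else:
            rest.append(tweet)
    return positives, rest

def sample_negatives(rest, n, seed=0):
    """Uniformly sample up to n tweets from the non-matching remainder."""
    rng = random.Random(seed)
    return rng.sample(rest, min(n, len(rest)))

if __name__ == "__main__":
    tweets = [
        "Chicago Bulls win again",
        "Dirk Nowitzki retires",
        "I love pizza",
        "Traffic is bad",
        "Rainy day",
    ]
    positives, rest = split_by_keywords(tweets, ["chicago bulls", "dirk nowitzki"])
    negatives = sample_negatives(rest, n=len(positives))
    print(len(positives), len(negatives))
```

Using all of `rest` as negatives instead of a sample corresponds to the unbalanced option listed on the slide.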

Page 11: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Workflow Summary

Diagram elements: Initial Concept (e.g., NBA)

Page 12: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Workflow Summary

Diagram elements: Initial Concept; External Knowledge Source (e.g., Wikipedia); Identify Initial Keywords; Initial Keywords (kevin garnett, gary harris, george gervin, chicago bulls, …)

Page 13: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Workflow Summary

Diagram elements: Initial Concept; External Knowledge Source; Identify Initial Keywords; Initial Keywords; Unlabeled Documents (i.e., tweets); Search, Score, Split Data (+ / -)

Page 14: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Workflow Summary

Diagram elements: Initial Concept; External Knowledge Source; Identify Initial Keywords; Initial Keywords; Unlabeled Documents; Search, Score, Split Data (+ / -); Identify ML Features (i.e., indicative words/phrases); Vector Space Represent; ML-Ready Dataset
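The last stages of the workflow (identifying features and representing documents in a vector space) can be sketched as follows. This is a generic bag-of-words illustration, not the Skytree implementation: the document-frequency cutoff `min_count`, the sparse dictionary format, and the function names are assumptions.

```python
from collections import Counter

def build_vocabulary(token_lists, min_count=1):
    """Map each feature kept to a column index.

    Assumption: a feature is kept if it occurs in at least
    `min_count` documents (document frequency).
    """
    doc_freq = Counter(tok for toks in token_lists for tok in set(toks))
    kept = sorted(tok for tok, count in doc_freq.items() if count >= min_count)
    return {tok: index for index, tok in enumerate(kept)}

def vectorize(tokens, vocab):
    """Represent one document as a sparse count vector {column: count}."""
    return dict(Counter(vocab[tok] for tok in tokens if tok in vocab))

if __name__ == "__main__":
    docs = [["lebron", "dunks"], ["lebron", "scores", "scores"]]
    vocab = build_vocabulary(docs)
    for doc in docs:
        print(vectorize(doc, vocab))
```

Tokens outside the vocabulary are dropped at vectorization time, which is one simple way to handle words not seen when the features were identified.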

Page 15: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Experiments

• Goal: Create an initial classifier with high precision, and improve it over time.

• Built classifiers with the following featurization techniques:
  • 1-gram (single word) features
  • 1-gram and 2-gram (2-word sequence) features in combination

• AutoModel™ finds the best-performing model
  • “smart-search” finds best parameter settings.
  • Optimized models for best precision at top 25%.
  • System also suggests best parameters for optimum Gini, F-score, Accuracy
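The two featurization options and the "precision at top 25%" criterion can be illustrated with a short sketch. It assumes that precision at top 25% means the fraction of true positives among the 25% highest-scored examples; the slides do not spell out the definition, and `ngrams` and `precision_at_top` are hypothetical helper names, not AutoModel functions.

```python
def ngrams(tokens, n_values=(1, 2)):
    """Return word n-gram features, by default 1-grams followed by 2-grams."""
    features = []
    for n in n_values:
        for i in range(len(tokens) - n + 1):
            features.append(" ".join(tokens[i:i + n]))
    return features

def precision_at_top(scores, labels, fraction=0.25):
    """Precision among the top `fraction` of examples ranked by score.

    Assumption: labels are 1 (positive) or 0 (negative), and at least
    one example is always included in the top slice.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top = ranked[:k]
    return sum(label for _, label in top) / k

if __name__ == "__main__":
    print(ngrams(["chicago", "bulls", "win"]))
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
    labels = [1, 0, 1, 1, 0, 0, 0, 0]
    print(precision_at_top(scores, labels, fraction=0.25))
```

Passing `n_values=(1,)` gives the 1-gram-only configuration; the default gives the combined 1-gram and 2-gram configuration.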

Page 16: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Results

• Results: ~98-99% F-score typical
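For reference, the F-score is conventionally the harmonic mean of precision and recall. The slides do not say which variant was reported, so the balanced F1 form below is an assumption; P and R denote precision and recall, and TP, FP, FN denote true positives, false positives, and false negatives.

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R},
\qquad P = \frac{TP}{TP + FP},
\qquad R = \frac{TP}{TP + FN}
```

For instance, precision and recall both equal to 0.98 give F1 = 0.98.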

Page 17: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

But does it actually work?

• Model picks up patterns beyond the keywords

NBA
– Time will reveal whether McAdoo's decision is right
– "@livelaughPRELLA: I'm still stuck on James Michael-McAdoo's decision to enter the draft. Like, no...” hell no!
– is Jeff any good ? I never seen him play but heard al Jefferson P right? lol but if your talking about him ...
– @Espngreeny ' team had year guys. ' & had Delk, McCarty, Anderson, Sheppard, Padgett, Epps, Turner, Pope, Evans, Edwards, Mills

Machine Learning
– Latent Semantic Indexing: How Does the LSI Algorithm Work?
– Hidden Markov Models: The Baum-Welch Algorithm #compsci #datascience #math #stem #mathchat

Page 18: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

But does it actually work?

• Removed one of the initial keywords to see if model picks it up.

Sports without “baseball”
– Boomer Esiason: For Baseball Season Opener, Mets player's wife should have gotten a C-section: Daniel Murphy ... http://t.co/RtCoDMcDTP
– RT @rob_giorgi: Glad baseball season is back. I wanna go to a red sox game soon

NBA without “NBA”
– #NBA Thursday Night Live Commentary: Dallas #Mavericks vs Los Angeles #Clippers via @VAVEL_USA
– New Adidas x Undefeated Forum Mid NBA All Star Game Black/White Undftd Men Shoes
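The held-out keyword check above can be expressed as a small measurement: drop one keyword before training, then see how many tweets containing that keyword the model still labels as positive. In this sketch `predict` stands in for whatever trained classifier is available (a hypothetical callable returning True or False), and `held_out_recall` is an illustrative name; neither comes from the slides.

```python
def held_out_recall(tweets, held_out_keyword, predict):
    """Fraction of tweets containing the held-out keyword that the
    classifier still labels positive; None if no tweet contains it.

    Assumption: `predict` is any callable mapping a tweet to True/False,
    trained on a keyword set that excluded `held_out_keyword`.
    """
    keyword = held_out_keyword.lower()
    hits = [t for t in tweets if keyword in t.lower()]
    if not hits:
        return None
    return sum(1 for t in hits if predict(t)) / len(hits)

if __name__ == "__main__":
    # Toy stand-in for a trained Sports model, for demonstration only.
    def toy_predict(tweet):
        text = tweet.lower()
        return "season" in text or "game" in text

    tweets = [
        "Glad baseball season is back",
        "baseball cards for sale",
        "I like pizza",
    ]
    print(held_out_recall(tweets, "baseball", toy_predict))
```

A high value suggests the model has learned surrounding context rather than memorizing the keyword list.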

Page 19: Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data

Conclusions

• Created training sets automatically, leveraging Wikipedia as an external knowledge source.

• Trained high-precision (and high-recall) models. This is only possible using big data.

• The resulting models go beyond the initial keywords.

• The resulting models are adaptable to changing topics.