Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data
Transcript of Smart Data Webinar: Machine Learning Techniques for Analyzing Unstructured Business Data
Bigger Data. Better Results.™
Semi-supervised Learning for Micro-blog Classification
Techniques for Analyzing Text Data
Nick Pendar, Ph.D. Skytree Inc.
October 8, 2015
Overview
• Introduction
• Uses of text data
• Sources of text data
• Case Study
• Challenge of creating training sets
• Experiments
• Conclusions
Sources of Text Data
• Internal
  • Email, chat and other communication
  • Whitepapers, patents & other IP
  • Business documents (e.g., purchase orders, CRM notes)
  • System logs, service records & other technical data
• External
  • Web, social media, news
  • Public records
  • Electronic health records
  • Scientific/professional publications
  • Libraries & other specialized repositories
ML on text data?
• Recommendation/Search
  – Show me documents relevant to me.
    • In response to a search query (IR)
    • Similar to another document (Nearest Neighbor Search, Classification, Clustering)
    • Related to my interests and people like me (Recommendation)
• Categorization / eDiscovery
  – Show me documents about topic X. (Classification, IR, NN, Clustering)
  – Show me documents relevant to this litigation. (Classification, IR, NN, Clustering)
  – Show me documents I can delete. (Classification, IR, NN, Clustering)
• Analytics
  – What is X thinking about Y? (Classification, Sentiment Analysis, NLP)
  – Why did the customer make a certain decision? (Classification, NLP, Information Extraction)
  – What is the function of X in organization Y? (Classification, NLP, Information Extraction)
A Classifier at Heart
• Many tasks can be considered classification tasks.
• Classification (supervised ML) needs training data.
• Clean, sufficient training data leads to a good classifier.
• Challenge: Clean, sufficient training data is often hard to find.
• Often most of a data scientist's time is spent collecting and cleaning data.
Case Study: Micro-blog Classification
• Client interested in classifying tweets for better search/access.
• 500M tweets per day.
• 150M in English
• Challenges:
  – Data size and velocity
  – Tweets are short, containing very little signal.
  – Dynamic nature of data
    • Changing topics: Need to update existing classifiers, build new ones.
    • Changing vocabulary
    • Lack of training data
Possible Sources of Training Data
• Human annotation
  • Pro: Accurate; Con: Expensive, doesn't scale
• Hashtags
  • Pro: Abundant; Con: Extremely noisy
• Unambiguous Keywords
  • Pro: Accurate; Con: Hard to curate, low recall
• Comprehensive Keyword Set
  • Pro: Large coverage; Con: Noise
  • ML can reduce noise by leveraging big data
Collecting Initial Keyword Set
• Specific Handles/HashTags or Handle/HashTag Patterns
  • @apple, #apple, @samsung*, #samsung*
  • Very low yield (e.g., 0.01% for Samsung)
• Knowledge Graph Nodes
  • Collect keywords/titles under a category/node (e.g., Sport, NBA)
  • More keywords: ~10,000 for NBA
  • Higher yield (e.g., ~1% for NBA)
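A minimal sketch of how handle/hashtag patterns like those above could be matched against tweets and their yield measured. It assumes that a trailing `*` means "any continuation of the handle or hashtag"; the function names `compile_patterns`, `matches_any`, and `keyword_yield` are hypothetical and are not taken from the talk.

```python
import re

def compile_patterns(patterns):
    """Compile handle/hashtag patterns such as '#samsung*' into regexes.

    Assumption: a trailing '*' is a prefix wildcard; anything else is
    matched literally and case-insensitively.
    """
    compiled = []
    for p in patterns:
        is_prefix = p.endswith("*")
        body = re.escape(p.rstrip("*"))
        # Wildcard: allow more word characters. Otherwise: require a word end.
        tail = r"\w*" if is_prefix else r"(?!\w)"
        compiled.append(re.compile(r"(?<!\w)" + body + tail, re.IGNORECASE))
    return compiled

def matches_any(text, compiled):
    """True if at least one compiled pattern occurs in the text."""
    return any(rx.search(text) for rx in compiled)

def keyword_yield(tweets, compiled):
    """Fraction of tweets matched by at least one pattern."""
    if not tweets:
        return 0.0
    return sum(1 for t in tweets if matches_any(t, compiled)) / len(tweets)

if __name__ == "__main__":
    rx = compile_patterns(["@apple", "#apple", "@samsung*", "#samsung*"])
    sample = ["Love my new phone @Apple", "#SamsungGalaxy launch today",
              "eating an apple"]
    print(keyword_yield(sample, rx))
```

On a real stream the yield would be computed over a large sample of tweets; the three-tweet list here is only a toy input.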
Example Keywords
• NBA kevin garnett gary harris george gervin chicago bulls indiana pacers phoenix suns steven adams new orleans pelicans dirk nowitzki robert horry steve nash amir johnson brooklyn nets
• Sports a bracciuta a brazzos atv enduro atv off-road racing atv racing aba guresi abseiling adi murai adventure racing aerobatics aggressive inline skating aid climbing aikido air hockey
Building the Training Set
• Positive examples: A set of tweets containing target keywords.
• Need negative examples
  • Somehow select examples least likely to be in positive set.
    • Trained a kernel density estimation model on the positive set.
    • Selected bottom 5th percentile as negative data.
    • Result: Slightly better on cross-validation, worse on unseen data.
  • Uniformly selected rest of data as negative.
    • Result: Slightly worse on cross-validation, better on unseen data.
  • Use all of the rest of the data as negative, i.e., unbalanced
    • Let Skytree software handle the unbalanced dataset.
  – Decided to use second approach.
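The keyword-based labeling described above can be sketched as follows. This is an illustration under assumptions, not the speaker's pipeline: keyword matching uses naive whitespace tokenization, negatives are drawn uniformly at random from the non-matching tweets (or all of them are used when `n_negative` is omitted), and the names `contains_keyword` and `build_training_set` are hypothetical.

```python
import random

def contains_keyword(text, keywords):
    """True if any keyword occurs as a whole-word phrase (case-insensitive).

    Naive whitespace tokenization: punctuation attached to a word
    (e.g. 'bulls!') prevents a match.
    """
    padded = " " + " ".join(text.lower().split()) + " "
    return any(" " + kw.lower() + " " in padded for kw in keywords)

def build_training_set(tweets, keywords, n_negative=None, seed=0):
    """Label keyword-matching tweets 1 and a sample of the rest 0.

    n_negative=None keeps every non-matching tweet (unbalanced set);
    otherwise n_negative tweets are sampled uniformly without replacement.
    """
    positives = [t for t in tweets if contains_keyword(t, keywords)]
    rest = [t for t in tweets if not contains_keyword(t, keywords)]
    if n_negative is None or n_negative >= len(rest):
        negatives = rest
    else:
        negatives = random.Random(seed).sample(rest, n_negative)
    return [(t, 1) for t in positives] + [(t, 0) for t in negatives]

if __name__ == "__main__":
    tweets = ["chicago bulls win again", "steve nash was great",
              "traffic is awful today", "new phone arrived",
              "coffee time", "rainy monday"]
    for text, label in build_training_set(tweets, ["chicago bulls", "steve nash"],
                                          n_negative=2):
        print(label, text)
```

Because the positive labels come from keywords rather than human annotation, they are noisy by construction; the classifier is expected to smooth over that noise.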
Workflow Summary
[Diagram, built up over four slides]
• Initial Concept (e.g., NBA)
• External Knowledge Source (e.g., Wikipedia) → Identify Initial Keywords → Initial Keywords (kevin garnett, gary harris, george gervin, chicago bulls, …)
• Unlabeled Documents (i.e., tweets) + Initial Keywords → Search, Score, Split Data (+ / -)
• Identify ML Features (i.e., indicative words/phrases) → Vector Space Represent → ML-Ready Dataset
Experiments
• Goal: Create an initial classifier with high precision, and improve it over time.
• Built classifiers with the following featurization techniques:
  • 1-gram (single word) features
  • 1-gram and 2-gram (2-word sequence) features in combination
• AutoModel™ finds best-performing model
  • "smart-search" finds best parameter settings.
  • Optimized models for best precision at top 25%.
  • System also suggests best parameters for optimum Gini, F-score, Accuracy
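To make the two featurization options concrete, here is a small sketch of 1-gram and combined 1-gram + 2-gram count features. Lowercasing and whitespace tokenization are assumptions, and `featurize` is a hypothetical helper; the talk does not say how the text was tokenized or weighted.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def ngrams(tokens, n):
    """All contiguous n-token sequences, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def featurize(text, max_n=2):
    """Count features for every n-gram with 1 <= n <= max_n."""
    tokens = tokenize(text)
    counts = Counter()
    for n in range(1, max_n + 1):
        counts.update(ngrams(tokens, n))
    return counts

if __name__ == "__main__":
    print(featurize("Chicago Bulls beat Indiana Pacers", max_n=1))
    print(featurize("Chicago Bulls beat Indiana Pacers", max_n=2))
```

With `max_n=2` a phrase such as "chicago bulls" becomes a feature of its own, which is more specific than either word alone.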
Results
• Results: ~98-99% F-score typical
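For reference, the two metrics mentioned in the experiments, precision at the top 25% of ranked predictions and the F-score, can be computed as in the sketch below. These are the standard definitions; the function names and toy numbers are illustrative and say nothing about how the Skytree software computes them internally.

```python
def precision_at_top(scores, labels, fraction=0.25):
    """Precision among the highest-scoring `fraction` of examples."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    k = max(1, int(len(ranked) * fraction))
    top = ranked[:k]
    return sum(label for _, label in top) / k

def f_score(predicted, labels):
    """F1: harmonic mean of precision and recall for the positive class."""
    tp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if p == 0 and y == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

if __name__ == "__main__":
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    labels = [1, 1, 0, 1, 0, 0, 1, 0]
    predicted = [1 if s >= 0.5 else 0 for s in scores]
    print(precision_at_top(scores, labels))
    print(f_score(predicted, labels))
```

Optimizing for precision at the top of the ranking fits the stated goal of starting with a high-precision classifier and improving it over time.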
But does it actually work?
• Model picks up patterns beyond the keywords
NBA
– Time will reveal whether McAdoo's decision is right
– "@livelaughPRELLA: I'm still stuck on James Michael-McAdoo's decision to enter the draft. Like, no...” hell no!
– is Jeff any good ? I never seen him play but heard al Jefferson P right? lol but if your talking about him ...
– @Espngreeny ' team had year guys. ' & had Delk, McCarty, Anderson, Sheppard, Padgett, Epps, Turner, Pope, Evans, Edwards, Mills
Machine Learning
– Latent Semantic Indexing: How Does the LSI Algorithm Work?
– Hidden Markov Models: The Baum-Welch Algorithm #compsci #datascience #math #stem #mathchat
But does it actually work?
• Removed one of the initial keywords to see if model picks it up.
Sports without “baseball”
– Boomer Esiason: For Baseball Season Opener, Mets player's wife should have gotten a C-section: Daniel Murphy ... http://t.co/RtCoDMcDTP
– RT @rob_giorgi: Glad baseball season is back. I wanna go to a red sox game soon
NBA without “NBA”
– #NBA Thursday Night Live Commentary: Dallas #Mavericks vs Los Angeles #Clippers via @VAVEL_USA
– New Adidas x Undefeated Forum Mid NBA All Star Game Black/White Und_d Men Shoes
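One way to quantify this held-out-keyword check is to measure how many tweets mentioning the removed keyword the model still labels positive. The sketch below assumes a `predict` callable that returns 1 or 0 for a tweet; both it and `held_out_recall` are hypothetical stand-ins, and the toy classifier is not the model from the talk.

```python
def held_out_recall(tweets, held_out_keyword, predict):
    """Share of tweets mentioning the held-out keyword that `predict` labels 1.

    Returns None when no tweet mentions the keyword.
    """
    kw = held_out_keyword.lower()
    mentioning = [t for t in tweets if kw in t.lower()]
    if not mentioning:
        return None
    return sum(1 for t in mentioning if predict(t) == 1) / len(mentioning)

if __name__ == "__main__":
    # Toy stand-in for a trained classifier: fires on other sports cues.
    def toy_predict(text):
        cues = ("season", "game", "mets", "red sox")
        return 1 if any(c in text.lower() for c in cues) else 0

    tweets = ["Glad baseball season is back", "baseball is boring",
              "new phone today"]
    print(held_out_recall(tweets, "baseball", toy_predict))
```

A high value suggests the model has learned context around the topic rather than memorizing the keyword list.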
Conclusions
• Created training sets automatically, leveraging Wikipedia as an external knowledge source.
• Trained high-precision (and high-recall) models. This is only possible using big data.
• The resulting models go beyond the initial keywords.
• The resulting models are adaptable to changing topics.