Quantifying Fan Engagement using Social Media
SOCIAL MEDIA ANALYTICS TO QUANTIFY FAN ENGAGEMENT
DR. ROBERT BAKER
TED KWARTLER
Get a more complete profile of your fans to inform business decisions and improve ROI calculations.
AGENDA
Basics
Where are the fans?
Who are the fans?
What are fans talking about?
How do the fans feel towards the team?
What is the point of all this?
If only there had been social media, the Yankees could have profiled my experience.
A FAN’S EXPERIENCE
BASICS: WHAT IS TEXT MINING?
Before text mining. After text mining.
SOCIAL MEDIA ANALYTICS REQUIRES TEXT MINING
Text mining lets you “drink from a fire hose” of information and distill useful meaning.
WHAT IS TEXT MINING?
1. Unstructured natural language texts
2. Apply standard and domain-specific rules
3. Organized into Document Term Matrix (DTM) / Term Document Matrix (TDM)
4. Insight & Recommendation
Text mining is an emerging technology that can be used to augment existing data by making unstructured text available for analysis and decision making.
Natural language texts: surveys, tweets, articles, emails, blogs, reviews
Many sources, including emails, forum posts, tweets, books, PDFs, reviews, transcripts, etc.
EXAMPLE UNSTRUCTURED TEXT SOURCES
Unstructured natural language texts
杜兰特和詹姆斯谁才是当今联盟的头牌?这是最近很火热的话题。一方面杜兰特高居得分榜首位,在MVP权力榜上也雄踞第一;另一方面詹姆斯带领热火一切为了三连冠,比赛沉稳 ...
Had my first experience at TD Garden when my Bulls came to play the Celtics. Being someone with an out of state license living in Boston, I usually carry my passport anyway, but I had a friend in town and wanted to clear up this ID controversy I read so much about in the rules.
EXAMPLE PRE-PROCESSING STEPS
(or other software e.g. Python NLTK)
1. Make all text lower case
2. For Twitter, remove “RT” for retweet
3. Remove symbols like “@”
4. Remove punctuation
5. Remove numbers
6. Remove URLs e.g. http://www.espn.com
7. Remove extra whitespace
8. Remove “stopwords”
9. Others as needed depending on objective (e.g. stemming)
In a “bag of words” text mining methodology, the corpus must be cleaned. Cleaning often means making items lower case and removing punctuation, numbers and extra whitespace. In some instances, domain-specific rules are applied (e.g. removing “RT” for retweet).
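The cleaning steps above can be sketched in a few lines of Python. This is an illustration only, not the authors' script: the function name clean_text and the tiny STOPWORDS set are assumptions, and a real stopword list would hold hundreds of words.

```python
import re
import string

# Tiny illustrative stopword list (an assumption; real lists hold hundreds of words).
STOPWORDS = {"a", "an", "and", "he", "i", "in", "is", "of", "the", "to"}

# Map every ASCII punctuation mark (including "@" and "#") to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_text(text, stopwords=STOPWORDS):
    """Apply the bag-of-words cleaning steps to one document."""
    text = text.lower()                        # make all text lower case
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs before punctuation is stripped
    text = re.sub(r"\brt\b", " ", text)        # Twitter-specific rule: drop the "RT" marker
    text = text.translate(PUNCT_TO_SPACE)      # remove symbols and punctuation
    text = re.sub(r"\d+", " ", text)           # remove numbers
    tokens = text.split()                      # splitting also collapses extra whitespace
    return " ".join(t for t in tokens if t not in stopwords)  # remove stopwords


# Hypothetical tweet, for illustration only.
print(clean_text("RT @Yankees: No doubt, Derek Jeter is #2 in the lineup! http://www.espn.com"))
# prints: yankees no doubt derek jeter lineup
```

URLs are removed before punctuation here because stripping punctuation first would break a URL into ordinary-looking word fragments.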
Apply standard and domain specific rules
Cleaned Version: no doubt derek jeter makes my top all time with babe lou yankee clipper mick
Original: 杜兰特和詹姆斯谁才是当今联盟的头牌?这是最近很火热的话题。一方面杜兰特高居得分榜首位,在MVP权力榜上也雄踞第一;另一方面詹姆斯带领热火一切为了三连冠,比赛沉稳 ...
Translated Version: Durant and James, who is the league's first card today? This is a very hot topic recently. On the one hand Durant highest scoring top position in the standings MVP authority also ranked first; on the other hand, James led the Heat everything for three consecutive years, the race calm ...
Cleaned Version: durant james who league first card today very hot topic recently on one hand durant highest scoring top position standings mvp authority ranked first other hand james led heat everything three consecutive years race calm ...
Once cleaned, the documents and terms are organized into large matrices.
These matrices are often very sparse and may contain tens of thousands of data points.
Attributes may be single words or tokens of two or more words.
Organized into Document Term Matrix / Term Document Matrix
DATA ORGANIZATION
Tweet_1: no doubt derek jeter makes my top all time with babe lou yankee clipper mick
Sina_1: durant james who league first card today very hot topic recently on one hand durant highest scoring top position standings mvp authority ranked first other hand james led heat everything three consecutive years race calm ...

Document Term Matrix
Document  no  doubt  derek  jeter  top  durant  james  termN
Tweet_1   1   1      1      1      1    0       0      0
Sina_1    0   0      0      0      1    2       2      1
docN      …   …      …      …      …    …       …      …

Term Document Matrix
Term   Tweet_1  Sina_1  docN
no     1        0       …
doubt  1        0       …
jeter  1        0       …
top    1        1       …
termN  0        1       …
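As a rough sketch of how the two matrices relate, the Python below counts terms per document and then transposes the result. The helper names build_dtm and transpose are assumptions made for illustration; the two documents are the cleaned examples from the slides (with the trailing "..." dropped).

```python
from collections import Counter


def build_dtm(docs):
    """Return (vocab, dtm): dtm[doc_id] is a list of term counts aligned with vocab."""
    counts = {doc_id: Counter(text.split()) for doc_id, text in docs.items()}
    vocab = sorted(set().union(*counts.values()))
    dtm = {doc_id: [c[term] for term in vocab] for doc_id, c in counts.items()}
    return vocab, dtm


def transpose(vocab, dtm):
    """Turn a document term matrix into a term document matrix."""
    doc_ids = list(dtm)
    tdm = {term: [dtm[d][i] for d in doc_ids] for i, term in enumerate(vocab)}
    return doc_ids, tdm


docs = {
    "Tweet_1": "no doubt derek jeter makes my top all time with babe lou yankee clipper mick",
    "Sina_1": "durant james who league first card today very hot topic recently on one hand "
              "durant highest scoring top position standings mvp authority ranked first other "
              "hand james led heat everything three consecutive years race calm",
}

vocab, dtm = build_dtm(docs)
doc_ids, tdm = transpose(vocab, dtm)
print(doc_ids)        # ['Tweet_1', 'Sina_1']
print(tdm["durant"])  # [0, 2]
print(tdm["top"])     # [1, 1]
```

Most cells are zero even with two documents, which is why real matrices of this kind are usually stored in a sparse format.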
WHERE ARE THE FANS? LOCATION BASED ATTRIBUTES
DODGERS TWITTER FOLLOWERS - 10K SAMPLE
INDIANS TWITTER FOLLOWERS - 10K SAMPLE
NYY TWITTER FOLLOWERS - 10K SAMPLE
Team     Total Followers  Sample     Bing API Geo-Located  Median Distance to Stadium
Dodgers  ~540K            First 10K  2,854                 1,372 miles
Indians  ~225K            First 10K  3,774                 319 miles
Yankees  ~1.18M           First 10K  1,335                 713 miles
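Once followers have been geo-located, a median distance to the stadium could be computed along the following lines. This is a sketch under assumptions: the slides do not say how distance was measured, so great-circle (haversine) distance is used here, and the stadium and follower coordinates are approximate, hypothetical values. The geocoding step itself (done with the Bing API in the slides) is not shown.

```python
import math
from statistics import median

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two (latitude, longitude) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(min(1.0, a)))


def median_distance(followers, stadium):
    """Median distance in miles from each follower location to the stadium."""
    return median(haversine_miles(lat, lon, stadium[0], stadium[1]) for lat, lon in followers)


# Approximate, hypothetical coordinates for illustration only.
stadium = (34.07, -118.24)
followers = [(34.06, -117.75), (35.37, -119.02), (40.71, -74.01)]
print(round(median_distance(followers, stadium)), "miles")
```

The median is less sensitive than the mean to a handful of followers on another continent, which may be why the slides report it.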
WHO ARE THE FANS? COMMON DEMOGRAPHIC EXTRACTION
Sample of 3262 of 10k Followers Geo-located IDs
Zip    City                 Population  Avg house value  Income below poverty  Total businesses  Total households
91766  Pomona, CA           71,599      $142,800         15.4%                 803
93301  Bakersfield, CA      12,248      $109,600         20.4%                 1,438
91606  North Hollywood, CA  44,958      $170,100         15.4%                 622               14,903
From Twitter locations to zip code then demographic data.
WE CAN GET MORE GRANULAR.
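The step from zip code to demographic data amounts to a lookup join. A minimal sketch in Python, assuming follower locations have already been resolved to zip codes; the function name profile_followers and the follower list are hypothetical, while the two demographic rows reuse figures from the sample table above.

```python
from collections import Counter

# Demographic rows keyed by zip code; figures copied from the sample table above.
DEMOGRAPHICS = {
    "91766": {"city": "Pomona, CA", "population": 71599, "avg_house_value": 142800},
    "91606": {"city": "North Hollywood, CA", "population": 44958, "avg_house_value": 170100},
}


def profile_followers(follower_zips, demographics):
    """Count followers per zip code and attach the demographic row when one exists."""
    rows = []
    for zip_code, n in Counter(follower_zips).most_common():
        row = {"zip": zip_code, "followers": n}
        row.update(demographics.get(zip_code, {}))
        rows.append(row)
    return rows


# Hypothetical geo-located follower zip codes.
sample = ["91766", "91606", "91766", "90012"]
for row in profile_followers(sample, DEMOGRAPHICS):
    print(row)
```

Zip codes with no demographic row are kept rather than dropped, so gaps in the reference data stay visible.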
Sample of 3775 of 10k Followers Geo-located IDs
Zip    City          Population  Avg house value  Income below poverty  Total businesses  Total households
44107  Lakewood, OH  52,244      $117,900         16.4%                 945               25,333
44139  Solon, OH     24,356      $215,700         16.4%                 1,155             8,693
44304  Akron, OH     5,916       $56,300          13.0%                 172               1,637
Sample of 1335 of 10k Followers Geo-located IDs
Zip    City         Population  Avg house value  Income below poverty  Total businesses  Total households
10462  Bronx, NY    75,784      $192,600         27.9%                 1,002             29,855
14223  Buffalo, NY  22,665      $85,700          13.9%                 328               9,832
75060  Irving, TX   45,980      $83,300          17.2%                 503
FURTHER INSIGHTS OF ZIP 91766, POMONA CA
At the zip code and metropolitan area level there are countless dimensions that may aid in fan segmentation and marketing.
• Ranked #1 Drought Riskiest Cities
• Ranked #15 Riskiest for Identity Theft
• Ranked #5 Most Irritation Prone City
• Ranked #8 Healthiest
• Ranked #13 Best City for Teleworking
• Ranked #6 Most Single City
Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304
[Charts]
Population: White, Black, Hispanic, Asian, Hawaiian, Indian, Other
Gender: male, female
Households: total households, households with children
Immigration: Mexico, El Salvador, Philippines, Guatemala, Korea, China, Vietnam, Iran
FURTHER INSIGHTS OF ZIP 44304, AKRON OH
• Ranked #1 Best City for Thanksgiving
• Ranked #4 Best Cities for Teleworking
• Ranked #25 America’s Best Cities for Dating
• Ranked #64 Most Popular City for the Holidays
• Ranked #73 America’s Most Stressful Cities
• Ranked #140 2005 Best Places to Live
Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304
[Charts]
Population: White, Black, Asian, Hawaiian, Indian, Other
Gender: male, female
Households: total households, households with children
Immigration: India, Germany, Yugoslavia, UK, Italy, Canada, China, other
FURTHER INSIGHTS OF ZIP 10462, BRONX NY
• Ranked #2 Least Crime for Large Metro Area
• Ranked #2 Sleepless Cities 2011
• Ranked #3 Most Single Cities
• Ranked #9 Most Irritation Prone Cities
• Ranked #14 Healthiest Cities
• Ranked #28 Most Playful Cities
Sources:
http://www.census.gov
http://emergency.cdc.gov/snaps/data/39/39153.htm
http://www.bestplaces.net/rankings/zip-code/ohio/akron/44304
[Charts]
Population: White, Black, Hispanic, Asian, Hawaiian, Indian, Other
Gender: male, female
Households: total households, households with children
Immigration: Dominican, Jamaica, Mexico, Guyana, Ecuador, Caribbean, Honduras, Ghana
WHAT ARE THE FANS TALKING ABOUT? INTERESTING TOPICS AND NAMED ENTITY RECOGNITION
1.1K Tweets
• Free Twitter API
• Tweets mentioning “Indians”
• 7/31 & 8/1
• “Tokenize” the text into two-word groups (bigrams)
• Trade mentions
  • Masterson to Cardinals for Ramsey
  • Cabrera to Nationals for Walters
• Throwback jerseys for KC Royals game
• Mariners game attendees 7/31
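Tokenizing into two-word groups can be sketched as follows. The helper names and the three cleaned tweets are hypothetical, chosen only to echo the topics listed on the slide.

```python
from collections import Counter


def bigrams(text):
    """Split cleaned text on whitespace and join each pair of adjacent words."""
    tokens = text.split()
    return [" ".join(pair) for pair in zip(tokens, tokens[1:])]


def top_bigrams(documents, n=10):
    """Count bigrams across all documents and return the n most frequent."""
    counts = Counter()
    for doc in documents:
        counts.update(bigrams(doc))
    return counts.most_common(n)


# Hypothetical cleaned tweets, for illustration only.
tweets = [
    "indians trade masterson cardinals",
    "masterson cardinals trade official",
    "throwback jerseys royals game",
]
print(top_bigrams(tweets, n=1))  # [('masterson cardinals', 2)]
```

Bigrams keep names and phrases such as "red sox" together, which single-word counts would split apart.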
DIFFERENCES OF WORD CLOUDS: SIMPLE WORD CLOUD, COMPARISON CLOUD, COMMONALITY CLOUD AND POLARIZED CLOUD
[Diagram: three panels built from text1 and text2, labeled Simple Word Cloud, Commonality & Polarized Cloud, and Comparison Cloud]
12K Tweets
• Includes a mix of free API access and full fire hose paid API over 48 distinct hours
• Sampling occurred August 1 and August 13
• Tweets mentioning “Dodgers” most often discussed:
  • Clayton Kershaw’s appearance on Jimmy Kimmel Live
  • FCC Chairman’s letter to Time Warner CEO about the Dodgers’ TV channel
2K Spanish Tweets
• Free Twitter API Spanish language search over 48 distinct hours
• Sampling occurred July 29 and August 12
• Spanish-language tweets mentioning “Dodgers” most often discussed:
  • The AP story of Dan Haren beating the Braves
  • Vin Scully retiring was a smaller topic, although present
Example: Dodgers vencen a Bravos con 2 jonrones de Kemp http://t.co/9U7xiIPOdo #noticias
Translation: Dodgers beat Braves with 2 homers by Kemp http://t.co/9U7xiIPOdo #news
235 Blogs Treemap
• July 29-July 31
• Groups come from correlated topic modeling
• Color is sentiment
• Area is blog length
• Takeaways:
  • Babe Ruth’s birthday is shared with Laurence Fishburne, born in Augusta, Georgia; this picked up blogs mentioning “birthdays on this date”
  • Eli Manning wants to remember advice of Derek Jeter
  • Pending trade deadline
  • ESPNNewYork writer Wallace Matthews
  • Game recaps
Dissimilar Words
• Full FB Firehose of public posts
• Sampling occurred:
  • Dodgers: July 29 – July 31
  • Yankees: July 28 – July 31
• FB mentions of Dodgers and Yankees tagged as English
• Marketing posts about Spike Lee requesting a red New York Yankees World Series edition fitted cap
Words in Common
• Full FB Firehose of public posts
• Sampling occurred:
  • Dodgers: July 29 – July 31
  • Yankees: July 28 – July 31
• FB mentions of Dodgers and Yankees tagged as English
• As expected, trades to improve the season as the deadline approached were mentioned for both teams
COMPARATIVE ANALYSIS – BIGRAMS IN COMMON
• Full FB Firehose of public posts
• Sampling occurred:
  • Dodgers: Jul 29 – Jul 31
  • Yankees: Jul 28 – Jul 31
• FB mentions of Dodgers and Yankees tagged as English
[Chart labels: “red sox”; “Equal Mentions”]
FEELINGS TOWARDS THE TEAM: SIMPLE SENTIMENT ANALYSIS
Natural language contains many words, but everyday usage declines steeply beyond the most common ones.
Word frequency follows a predictable distribution: Zipf’s Law.
[Chart: x-axis word rank from 1 to 97 in steps of 4; y-axis word count from 0 to 900,000]
EXAMPLE POLARITY SCORING IN TWITTER
The top two words in spoken English are “the” and “be”. The top two words on Twitter are “RT” and “I”. However, the power-law distribution is similar and follows Zipf’s law.
Top 100 Word Usage from 3M Tweets
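Zipf’s law says that a word’s frequency is roughly inversely proportional to its rank. A rank-frequency table like the one behind this chart could be built as sketched below; the function names are assumptions, the token list is a toy example, and the exponent s = 1 is the idealized textbook case rather than a value fitted to the 3M tweets.

```python
from collections import Counter


def rank_frequency(tokens, top_n=100):
    """Return [(rank, word, count)] for the top_n most frequent words."""
    ranked = Counter(tokens).most_common(top_n)
    return [(rank, word, count) for rank, (word, count) in enumerate(ranked, start=1)]


def zipf_expected(top_count, rank, s=1.0):
    """Count predicted at a given rank by an idealized Zipf curve: top_count / rank**s."""
    return top_count / rank ** s


# Toy token list, for illustration only.
tokens = "rt i rt the rt i".split()
for rank, word, count in rank_frequency(tokens, top_n=3):
    print(rank, word, count, zipf_expected(3, rank))
```

Comparing observed counts with zipf_expected at each rank is a quick way to see how closely a corpus follows the curve.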
Surprise is a sentiment. Hit by a bus! – Negative polarity but surprising.
Won the lottery! – Positive polarity but still surprising.
Use the University of Pittsburgh’s MPQA Lexicon & Illocution Inc’s 10K top Twitter words.
Keyword Scanning for polarity
SENTIMENT POLARITY ANALYSIS
An R script scans for 3,546 positive words and 5,701 negative words. It adds one for each positive word and subtracts one for each negative word. The final score represents the polarity of the social interaction.
• I loathe the Tigers. -1
• I love Lou Whitaker. He was the best. +2
• I like the Tigers but dislike going to the stadium. 0
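The keyword-scanning approach can be sketched in Python as below. The two word sets are tiny stand-ins chosen so the three example sentences score as on the slide; they are assumptions, not the MPQA lexicon, and the function name polarity is hypothetical.

```python
import re

# Tiny stand-in lexicons (assumptions); the slides use 3,546 positive and 5,701 negative words.
POSITIVE = {"love", "best", "like"}
NEGATIVE = {"loathe", "dislike"}


def polarity(text, positive=POSITIVE, negative=NEGATIVE):
    """Add 1 per positive word and subtract 1 per negative word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(t in positive for t in tokens) - sum(t in negative for t in tokens)


examples = [
    "I loathe the Tigers.",
    "I love Lou Whitaker. He was the best.",
    "I like the Tigers but dislike going to the stadium.",
]
for sentence in examples:
    print(polarity(sentence), sentence)
# prints scores -1, 2 and 0
```

Matching whole tokens rather than substrings matters here: "dislike" must count once as negative, not also as a positive "like".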
DODGER SENTIMENT ON TWITTER 9/5
Median: -1, Mean: -0.47
INDIANS SENTIMENT ON TWITTER 9/5
Median: 0, Mean: -0.1198
YANKEE SENTIMENT ON TWITTER 9/5
Median: 0, Mean: -0.118
IN COMPARISON…
dodgers rhp josh beckett won't return this season
hey..yankees....can ya score some runs?!
indians activate murphy from disabled list http://t.co/bqliintwsf
Team     Tweets >= 1  Tweets <= -1  Total w/o 0  % positive
Yankees  280          406           686          41%
Indians  290          456           746          39%
Dodgers  448          1,226         1,674        27%
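The “% positive” column is the share of positive tweets among tweets with a non-zero score. A short check in Python, using the counts from the table (the function name percent_positive is an assumption):

```python
def percent_positive(positive, negative):
    """Share of non-neutral tweets that are positive, as a whole-number percentage."""
    total = positive + negative  # tweets scored 0 are excluded
    return total, round(100 * positive / total)


# (tweets >= 1, tweets <= -1) per team, from the table above.
counts = {"Yankees": (280, 406), "Indians": (290, 456), "Dodgers": (448, 1226)}
for team, (pos, neg) in counts.items():
    total, pct = percent_positive(pos, neg)
    print(f"{team}: {pos} of {total} non-neutral tweets positive ({pct}%)")
```

Excluding neutral tweets keeps the comparison from being dominated by purely informational posts, such as the roster updates quoted above.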
WHAT IS THE POINT OF ALL THIS? TARGETED MARKETING EFFORTS, EVANGELISTS, REFINED SEGMENTATION, MEDIA MIX MODELING LEADING TO ROI
EXAMPLE IDENTIFY EVANGELISTS, INFLUENCERS & DETRACTORS
• When engaging on social media, it is important to note the clout of followers in terms of their status updates and their own followers
• Running sentiment analysis on updates/posts adds context to the voice of the customer
• Appending other data allows for additional segmentation, and differentiated customer experiences e.g. my Yankee story
10K Indians Followers less 138 outliers
MEDIA MIX MODELING FOR SOCIAL MEDIA ROI
• In lieu of actual merchandise sales data and marketing spend, tracked Amazon Sales Rank hourly from 4/1 to 8/31
• Relative measure of sales against other “Sports and Outdoors” category items
• Lower number is better
DODGER CAP AVERAGE HOURLY SALES RANK PER DAY
[Chart: x-axis dates from 1-Apr to 29-Aug in three-day steps; y-axis sales rank from 0 to 4,500]
Amazon sales rank, viewed as a time series, is not stationary. Overall, the Dodgers cap’s sales rank has an increasing trend despite the team being successful on the field, and it shows some periodicity based on day of week.
Time Series Decomposition
• Time series decomposition (TSD), an econometric forecasting technique, was used in an attempt to isolate social media impact and understand sales rank patterns
• Trend is likely the impact of baseball season excitement, which then wanes as attention moves to other sports
• Seasonal may be the impact of retail day-of-the-week cycles
• This leaves random as the dependent variable in the media mix GLM
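A classical additive decomposition into trend, seasonal and random parts can be sketched in plain Python. The slides do not state which settings were used, so the additive model, the centered 7-day moving average and the function name decompose_additive are all assumptions; the series is synthetic.

```python
def decompose_additive(series, period=7):
    """Classical additive decomposition: series = trend + seasonal + random.

    Uses a centered moving average, so period must be odd (7 suits day-of-week
    cycles). The first and last period // 2 points get None for trend and random.
    """
    if period % 2 == 0:
        raise ValueError("this sketch only supports an odd period")
    n, half = len(series), period // 2
    if n < 2 * period:
        raise ValueError("need at least two full periods of data")

    trend = [None] * n
    for t in range(half, n - half):
        trend[t] = sum(series[t - half:t + half + 1]) / period

    # Average the detrended values at each position in the cycle.
    buckets = [[] for _ in range(period)]
    for t in range(n):
        if trend[t] is not None:
            buckets[t % period].append(series[t] - trend[t])
    means = [sum(b) / len(b) for b in buckets]
    center = sum(means) / period
    seasonal = [means[t % period] - center for t in range(n)]

    random_part = [None if trend[t] is None else series[t] - trend[t] - seasonal[t]
                   for t in range(n)]
    return trend, seasonal, random_part


# Synthetic daily sales rank: rising trend plus a weekly pattern (illustration only).
weekly = [30, -10, -20, -5, 0, 15, -10]
series = [1000 + 2 * t + weekly[t % 7] for t in range(35)]
trend, seasonal, random_part = decompose_additive(series)
print(trend[3], seasonal[:7], random_part[3])
```

On this synthetic series the random part is zero by construction; on real sales rank data it is whatever trend and day-of-week effects fail to explain.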
Tweets to Decomposed Amazon Sales Rank
• Correlation is only -0.08
• Given that the tweets are examined against the ‘random’ or unexplained component, the relationship may still be relevant
• As this is proxy data for sales of a single item, results are not conclusive
[Scatter plot: x-axis from 0 to 100; y-axis from -1,000 to 1,000]
*removed dates with missing data
Tweets to Average Daily Amazon Sales Rank
• Stronger correlation: -0.24
• This suggests that the more a team tweets, the lower (better) the sales rank
• As this is proxy data for sales of a single item, results are not conclusive
*removed dates with missing data
[Scatter plot: x-axis from 0 to 100; y-axis from 0 to 4,500]
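The correlations quoted on these two slides can be reproduced in form, though not in value, with a Pearson correlation after dropping dates with missing data. The slides do not name the correlation measure, so Pearson is an assumption, as are the function names and the daily figures below.

```python
import math


def drop_missing(x, y):
    """Keep only the positions where both series have a value."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return [a for a, _ in pairs], [b for _, b in pairs]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two points")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical daily tweet counts and average sales ranks; None marks a missing day.
tweets_per_day = [12, 30, None, 45, 20, 60]
avg_sales_rank = [2100, 1800, 1900, None, 2300, 1500]
x, y = drop_missing(tweets_per_day, avg_sales_rank)
print(round(pearson(x, y), 2))
```

A negative value means that days with more tweets tend to have a lower, and therefore better, sales rank; it says nothing on its own about cause.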
Media mix modeling
• Given the likely relationship:
  • Set up a GLM using marketing efforts and media spend as inputs, with the dependent variable being revenue, ticket sales, merchandise sales, etc.
  • The coefficients of the inputs illustrate the impact of each channel’s marketing spend, leading you to ROI
Example:
$$f(\text{sales}) = \beta_0 + \beta_1(\text{social.media.spend}) + \beta_2(\text{traditional.mktg.spend}) + \beta_3(\text{team.performance}) + \dots + \beta_n(\dots) + \epsilon$$
The goal is to increase model lift and accuracy by incorporating social media spend. The coefficient of the variable demonstrates the impact. This will allow you to calculate an ROI of social spend.
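With an identity link and normally distributed errors, a GLM of this form reduces to ordinary least squares, which can be sketched without any libraries. Everything below is illustrative: the function name fit_ols is an assumption, and the spend, performance and sales figures are synthetic, generated from known coefficients so the fit can be checked.

```python
def fit_ols(rows, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    rows: list of feature lists (without the intercept); y: list of outcomes.
    Returns [b0, b1, ..., bk], where b0 is the intercept.
    """
    X = [[1.0] + [float(v) for v in r] for r in rows]
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(x[i] * x[j] for x in X) for j in range(k)]
         + [sum(x[i] * t for x, t in zip(X, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot solve")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(k):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][k] / A[i][i] for i in range(k)]


# Synthetic rows: [social.media.spend, traditional.mktg.spend, team.performance].
rows = [[1, 2, 0], [2, 1, 1], [3, 5, 0], [4, 3, 1], [5, 8, 1], [6, 4, 0]]
sales = [10 + 2 * s + 0.5 * t + 3 * p for s, t, p in rows]

b0, b_social, b_traditional, b_performance = fit_ols(rows, sales)
print(round(b_social, 3))  # estimated change in sales per extra unit of social spend
```

Each coefficient is read as the estimated change in sales for one more unit of that input with the others held fixed, which is the quantity an ROI calculation for social spend would start from.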
FURTHER INFO
Want example R scripts for the visuals? www.sportsanalytics.org starting 9/15