AINL 2016: Khudobakhshov
Entertainment social network Odnoklassniki (Одноклассники)
2016
How to Estimate User’s Actual Age and Gender
Vitaly Khudobakhshov
Agenda
• Problem statement
• Social graph analysis
• NLP methods
• Behavior and user’s interests analysis
• Statistical approach
Vitaly Khudobakhshov, 2016
1 Problem statement
This is not about the situation where a user consciously hides his or her gender or age and behaves consistently.
2 Problem statement
• Suppose we have users who never set their birth date or gender (the default value problem)
• or who set wrong values for some reason (e.g. by mistake)
3 Problem decomposition
[Diagram: Age Estimation, Social Graph Analysis, Gender Estimation, NLP, Interests, Statistics]
4 Social Graph Analysis
Social Graph
• The social graph is represented as an adjacency list
• user -> [(user0, label0), (user1, label1), …]
• The social graph is an undirected graph with labeled edges
• An edge may have multiple labels (classmates, parents, etc.)
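A minimal sketch of the adjacency-list representation described above. The toy edge list and the helper names `build_social_graph` and `labels_between` are illustrative assumptions, not part of the talk.

```python
from collections import defaultdict

# Hypothetical toy data: (user_a, user_b, label) triples.
EDGES = [
    ("john", "aaron", "classmates"),
    ("john", "mother", "parents"),
    ("john", "aaron", "friends"),  # a second label on the same edge
]

def build_social_graph(edges):
    """Return user -> [(neighbor, label), ...] for an undirected graph."""
    graph = defaultdict(list)
    for a, b, label in edges:
        # Undirected: store the labeled edge in both directions.
        graph[a].append((b, label))
        graph[b].append((a, label))
    return dict(graph)

def labels_between(graph, a, b):
    """Collect every label attached to the edge between a and b."""
    return {label for neighbor, label in graph.get(a, []) if neighbor == b}

social_graph = build_social_graph(EDGES)
```

Storing one list entry per label keeps the `user -> [(user, label), …]` shape from the slide while still allowing several labels on one edge.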
5 User’s Graph
What is a User’s Graph?
• A user’s graph is the subgraph induced by the star-shaped tree of the user and his or her friends
• user -> [(user0, label0), (user1, label1),…]
[Figure: John’s graph with nodes John, John’s Mother, John’s Father, John’s Girlfriend, Aaron, David, Sara]
6 Social Graph Analysis
Local Properties of User’s Graph
• Number of friends
• Connected components
• Number of triangles
• and so on
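The local properties listed above could be computed roughly as follows. This is a sketch under assumptions: labels are dropped, the adjacency structure is a plain symmetric dict of sets, and connected components are taken among the friends after removing the user (the slides do not say which convention is used). The name `ego_properties` is hypothetical.

```python
def ego_properties(adjacency, user):
    """Local properties of a user's graph (the user plus direct friends)."""
    friends = set(adjacency.get(user, ()))
    # Keep only the links that run between two friends of the user.
    friend_links = {f: set(adjacency.get(f, ())) & friends for f in friends}
    # Each friend-friend edge closes one triangle through the user;
    # every edge is seen from both ends, hence the division by two.
    triangles = sum(len(links) for links in friend_links.values()) // 2
    # Connected components among friends, found by depth-first search.
    components = []
    unseen = set(friends)
    while unseen:
        stack = [unseen.pop()]
        component = set(stack)
        while stack:
            node = stack.pop()
            for nxt in friend_links[node] - component:
                component.add(nxt)
                stack.append(nxt)
        unseen -= component
        components.append(component)
    return {"friends": len(friends), "triangles": triangles,
            "components": components}

# Hypothetical toy graph: parents know each other, Aaron knows David.
toy = {
    "john": {"mother", "father", "aaron", "david", "sara"},
    "mother": {"john", "father"},
    "father": {"john", "mother"},
    "aaron": {"john", "david"},
    "david": {"john", "aaron"},
    "sara": {"john"},
}
props = ego_properties(toy, "john")
```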
7 Age Estimation by Local Properties
Motivation
[Figure: John’s graph annotated with birth years (1995, 1970, 1992, 1992, 1968, and one unknown marked “?”) and edge labels Classmates, Parents, Relationship]
8 Age Estimation by Local Properties
Data Sources
• The classmate label (school, college) should be a strong feature.
• The colleague label is definitely not as good.
• How about a group of friends who are the same age?
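One way to turn a labeled group into an age estimate is to take the most common birth year among friends carrying that label. The slides do not state which statistic is used, so the mode is an assumption here, and `estimate_birth_year` and the sample data are hypothetical.

```python
from collections import Counter

def estimate_birth_year(friends, label):
    """Return (most common birth year, group size) for one edge label.

    friends: list of (birth_year or None, label) pairs.
    """
    years = [year for year, lbl in friends
             if lbl == label and year is not None]
    if not years:
        return None, 0
    year, _count = Counter(years).most_common(1)[0]
    return year, len(years)

# Hypothetical friends of one user; None marks a missing birth date.
sample = [
    (1992, "classmates"),
    (1992, "classmates"),
    (1995, "classmates"),
    (1970, "parents"),
    (None, "classmates"),
]
```

The group size is returned alongside the estimate because a later step needs some measure of how much evidence stands behind each source.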
9 Some obstacles
Quality of the Model
• No ground truth.
• How to check?
Quality of the Data
• Labeling is incomplete.
11 Confidence
Which source of the estimate is better?
The first attempt is something like this:
C = 1 – 1 / #friends
Does it work?
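The first-attempt formula can be written down directly. Returning 0.0 for an empty group is an assumption, since the formula is undefined for zero friends.

```python
def confidence(n_friends):
    """C = 1 - 1 / #friends; 0.0 for an empty group (assumed convention)."""
    if n_friends <= 0:
        return 0.0
    return 1.0 - 1.0 / n_friends

# The value grows with group size and flattens out near 1.
values = {n: confidence(n) for n in (1, 2, 10, 100)}
```

Note that two sources backed by the same number of friends always get exactly the same confidence, so this formula alone cannot rank them.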
12 Age Estimation: Step 2
1 – classmates (school)
2 – classmates (college)
3 – max component
Not so good
13 Confidence
Common sense formula
Here is an easy way to solve the problem:
C_school = 1 – 1 / #friends + 0.002
C_college = 1 – 1 / #friends + 0.001
C_max = 1 – 1 / #friends
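A sketch of how the three formulas above might be used to pick one estimate. The small bonuses act as tie-breakers when the group sizes are equal. The selection rule (take the source with the highest confidence) and the names `source_confidence` and `pick_estimate` are assumptions, not taken from the slides.

```python
# Bonuses from the slide; they only matter when group sizes are equal.
BONUS = {"school": 0.002, "college": 0.001, "max_component": 0.0}

def source_confidence(source, n_friends):
    """Confidence of one source; 0.0 for an empty group (assumed)."""
    if n_friends <= 0:
        return 0.0
    return 1.0 - 1.0 / n_friends + BONUS[source]

def pick_estimate(estimates):
    """estimates: source -> (estimated_birth_year, n_friends).

    Returns (best_source, its_estimated_birth_year).
    """
    best = max(estimates,
               key=lambda s: source_confidence(s, estimates[s][1]))
    return best, estimates[best][0]

# Hypothetical inputs: equal group sizes, then a much larger component.
tied = {"school": (1992, 10), "college": (1991, 10),
        "max_component": (1990, 10)}
uneven = {"school": (1992, 5), "max_component": (1990, 50)}
```

With equal group sizes the school estimate wins. A sufficiently larger group still outweighs the bonus, because the bonuses are tiny compared with the differences the `1 / #friends` term produces for small groups.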
14 So you want to write a fugue?
Model quality
• No ground truth.
• There are special cases (e.g. E_school = E_college = E_max).
• We can try to maximize accuracy with respect to model parameters.
15 NLP and Gender Estimation
Advantages
• Simple models are easy to understand: I/YOU + a gender-marked ADJ/VERB
Disadvantages
• Very difficult in a multilingual environment
• Coverage is not very good
• Privacy concerns
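A toy illustration of the "I + gender-marked verb" idea, assuming Russian text, where many past-tense verbs end in "л" for a male speaker and "ла" for a female speaker. The rule, the function name `gender_votes`, and the sample sentences are illustrative assumptions. A real model would need morphological analysis, since this suffix check misses irregular forms and can misfire on non-verbs.

```python
import re

# "я" (I) as a separate word, followed by the next word.
FIRST_PERSON = re.compile(r"\bя\s+(\w+)")

def gender_votes(text):
    """Count masculine/feminine past-tense endings after "я"."""
    votes = {"M": 0, "F": 0}
    for word in FIRST_PERSON.findall(text.lower()):
        if word.endswith("ла"):
            votes["F"] += 1   # e.g. "я сделала"
        elif word.endswith("л"):
            votes["M"] += 1   # e.g. "я сделал"
    return votes
```

Texts without any matching pattern produce no votes at all, which is one reason coverage is limited.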
15 Communities and Interests
How it works
• Men tend to prefer cars and extreme sports.
• Women tend to prefer something else.
Conclusion
• There are gender-specific communities and gender-neutral communities.
• Divide and rule
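A sketch of the community-based idea: first label each community by the gender mix of its members with known gender, then let a user's gender-specific communities vote. The 0.8 threshold, the minimum of three known members, the majority vote, and all names are assumptions chosen for illustration.

```python
def classify_community(member_genders, threshold=0.8, min_known=3):
    """Label a community "male", "female" or "neutral"."""
    known = [g for g in member_genders if g in ("M", "F")]
    if len(known) < min_known:
        return "neutral"  # too small to say anything
    males = known.count("M")
    females = len(known) - males
    if males / len(known) >= threshold:
        return "male"
    if females / len(known) >= threshold:
        return "female"
    return "neutral"

def estimate_gender(user_communities, community_types):
    """Majority vote over the user's gender-specific communities."""
    votes = [community_types.get(c) for c in user_communities]
    male, female = votes.count("male"), votes.count("female")
    if male == female:
        return None  # no evidence, or a tie
    return "M" if male > female else "F"

# Hypothetical community labels.
types = {"cars": "male", "extreme_sports": "male", "news": "neutral"}
```

Users who belong only to neutral or very small communities get no estimate from this step.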
17 Interests and Gender Estimation
Advantages
• Language independent
• Good coverage
Disadvantages
• Threshold selection
• Small and gender-neutral communities
17 Statistics
Advantages
• Language independent
• Not very sensitive to special characters (or they can be preprocessed)
• Close to the maximum possible coverage
18 Conclusion
• The models are complementary to each other.
• Simple methods may produce very good results because of the sheer volume of data.
• We can get better results without violating privacy.