AINL 2016: Khudobakhshov

Entertainment social network Odnoklassniki

2016

How to Estimate User’s Actual Age and Gender

Vitaly Khudobakhshov

Agenda

• Problem statement

• Social graph analysis

• NLP methods

• Behavior and user’s interests analysis

• Statistical approach

Vitaly Khudobakhshov, 2016

Problem statement

This is not about the situation where a user deliberately hides his or her gender or age and behaves consistently with the values given.

Problem statement

• Suppose we have users who did not set their birth date or gender (the default value problem)

• or who set wrong values for some reason (e.g. by mistake)

Problem decomposition

• Age estimation: social graph analysis

• Gender estimation: NLP, interests, statistics

Social Graph Analysis

Social Graph

• Is represented as an adjacency list

• user -> [(user0, label0), (user1, label1),…]

• Social graph is an undirected graph with labeled edges

• An edge may have multiple labels (classmates, parents, etc.)
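The representation above can be sketched in a few lines. This is only an illustration: the function names, the user names, and the "friends" label are made up for the example, and the real storage format is not described on the slides.

```python
from collections import defaultdict

def build_social_graph(edges):
    """Build an adjacency list from (user_a, user_b, label) triples.

    The graph is undirected, so each edge is stored in both directions.
    An edge may carry several labels, so labels are kept in a set.
    """
    graph = defaultdict(lambda: defaultdict(set))
    for a, b, label in edges:
        graph[a][b].add(label)
        graph[b][a].add(label)
    return graph

def adjacency_list(graph, user):
    """Return the form used on the slide: [(friend, label), ...]."""
    return sorted(
        (friend, label)
        for friend, labels in graph[user].items()
        for label in labels
    )

# Hypothetical data: the john-aaron edge carries two labels.
edges = [
    ("john", "aaron", "classmates"),
    ("john", "aaron", "friends"),
    ("john", "sara", "classmates"),
    ("john", "mother", "parents"),
]
graph = build_social_graph(edges)
print(adjacency_list(graph, "john"))
```

Keeping the labels in a set per neighbor makes "an edge may have multiple labels" explicit, while `adjacency_list` flattens it back to the `user -> [(user0, label0), …]` form.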

User’s Graph

What is a User’s Graph?

• A user’s graph is the subgraph induced by the vertices of the user’s star-shaped tree (the user and his or her direct friends)

• user -> [(user0, label0), (user1, label1),…]

[Figure: John’s graph, with nodes John, John’s mother, John’s father, John’s girlfriend, Aaron, David, and Sara.]

Social Graph Analysis

Local Properties of User’s Graph

• Number of friends

• Connected components

• Number of triangles

• and so on
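A minimal sketch of how these local properties could be computed for one user. Two assumptions are made here that the slides do not state: connected components are computed among the friends only (with the user removed, otherwise there is always a single component), and only triangles that contain the user are counted. The edges between John's friends are invented for the example.

```python
from itertools import combinations

def user_graph(adj, user):
    """Subgraph induced by the user and the user's direct friends."""
    nodes = {user} | adj[user]
    return {v: adj[v] & nodes for v in nodes}

def local_properties(adj, user):
    friends = adj[user]
    # Edges among friends only; each one closes a triangle with the user.
    among = {f: adj[f] & friends for f in friends}
    triangles = sum(
        1 for a, b in combinations(sorted(friends), 2) if b in among[a]
    )
    # Connected components among friends, found by depth-first search.
    seen, components = set(), []
    for start in sorted(friends):
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(among[v] - comp)
        seen |= comp
        components.append(comp)
    return {
        "friends": len(friends),
        "triangles": triangles,
        "components": components,
    }

# Hypothetical edges for the John example.
adj = {
    "john": {"mother", "father", "girlfriend", "aaron", "david", "sara"},
    "mother": {"john", "father"},
    "father": {"john", "mother"},
    "girlfriend": {"john"},
    "aaron": {"john", "david", "sara"},
    "david": {"john", "aaron", "sara"},
    "sara": {"john", "aaron", "david"},
}
props = local_properties(adj, "john")
print(props["friends"], props["triangles"], len(props["components"]))
```

With these invented edges the friends split into three groups (parents, girlfriend, and the Aaron/David/Sara group), which is the kind of structure the later slides use as a source of age estimates.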

Age Estimation by Local Properties

Motivation

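[Figure: John’s graph annotated with birth years. One birth year is unknown (?); the others are 1995, 1970, 1992, 1992, and 1968. Edge labels: classmates, parents, relationship.]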

Age Estimation by Local Properties

Data Sources

• Classmate label should be a strong feature (school, college).

• The colleague label is definitely not as good.

• How about a group of friends who are the same age?
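One way to turn a labeled group into an estimate is to aggregate the birth years of the friends carrying that label. The sketch below uses the median, which is an assumption: the slides do not say which aggregate is used. The pairing of years with labels is illustrative only.

```python
from statistics import median_low

def estimate_birth_year(friends, label):
    """Median birth year of the friends connected by `label`.

    `friends` is a list of (birth_year, labels) pairs; a birth year of
    None means the friend did not set one. Returns None when no friend
    with that label has a known birth year.
    """
    years = [
        year for year, labels in friends
        if label in labels and year is not None
    ]
    # median_low always returns one of the observed years.
    return median_low(years) if years else None

# Illustrative data only.
friends = [
    (1992, {"classmates"}),
    (1992, {"classmates"}),
    (None, {"classmates"}),
    (1995, {"relationship"}),
    (1970, {"parents"}),
    (1968, {"parents"}),
]
print(estimate_birth_year(friends, "classmates"))
```

A parents group gives a very different answer from a classmates group, which is why the label matters as much as the birth years themselves.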

Some obstacles

Quality of the Model

• No ground truth.

• How to check?

Quality of the Data

• Labeling is incomplete.

Age Estimation: Step 1

Confidence

Which source of the estimation is better?

The first attempt is something like this:

C = 1 – 1 / #friends

Does it work?
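To see how this first attempt behaves, the formula can be evaluated for a few group sizes. The sketch assumes that #friends means the number of friends in the group that produced the estimate, and it returns 0 for an empty group; neither detail is spelled out on the slide.

```python
def confidence(num_friends):
    """C = 1 - 1 / #friends, with 0 for an empty group (an assumption)."""
    if num_friends <= 0:
        return 0.0
    return 1.0 - 1.0 / num_friends

for n in (1, 2, 10, 100):
    print(n, confidence(n))
```

A single friend gives a confidence of 0, two friends give 0.5, and the value approaches 1 as the group grows. The formula depends only on group size, so it cannot prefer one kind of group over another when the sizes are equal.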

Age Estimation: Step 2

1 – classmates (school)

2 – classmates (college)

3 – max component

Not so good

Confidence

Common sense formula

Here is an easy way to solve the problem:

C_school = 1 – 1 / #friends + 0.002

C_college = 1 – 1 / #friends + 0.001

C_max = 1 – 1 / #friends
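The three formulas can be sketched as a single function with a per-source bonus, followed by picking the source with the highest confidence. The source names, the `best_estimate` helper, and the input format are assumptions made for this example; only the constants 0.002 and 0.001 come from the slide.

```python
# Bonus per source, taken from the slide; max component gets none.
BONUS = {"school": 0.002, "college": 0.001, "max_component": 0.0}

def source_confidence(source, num_friends):
    """C_source = 1 - 1 / #friends + bonus (0 for an empty group)."""
    if num_friends <= 0:
        return 0.0
    return 1.0 - 1.0 / num_friends + BONUS[source]

def best_estimate(candidates):
    """Pick the estimate with the highest confidence.

    `candidates` maps a source name to (estimated_birth_year, num_friends).
    Returns (source, birth_year, confidence), or None if nothing is usable.
    """
    scored = [
        (source_confidence(source, n), source, year)
        for source, (year, n) in candidates.items()
        if year is not None and n > 0
    ]
    if not scored:
        return None
    conf, source, year = max(scored, key=lambda item: item[0])
    return source, year, conf

# Equal group sizes: the bonus decides, so the school estimate wins.
print(best_estimate({
    "school": (1992, 10),
    "college": (1993, 10),
    "max_component": (1991, 10),
}))
```

Because the bonuses are small, they mostly act as tie-breakers: with equal or nearly equal group sizes school beats college, which beats the max component, while a much larger group still wins regardless of its source.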

So you want to write a fugue?

Model quality

• No ground truth.

• There are special cases (e.g. E_school = E_college = E_max).

• We can try to maximize accuracy with respect to model parameters.

NLP and Gender Estimation

Advantages

• Simple models are easy to understand: I/YOU + a gender-marked ADJ/VERB

Disadvantages

• Very difficult in a multilingual environment

• Coverage is not very good

• Privacy concerns
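The "I + gender-marked verb" idea can be illustrated with a deliberately naive sketch for Russian, where past-tense verbs after "я" (I) typically end in "-ла" for a female speaker and "-л" for a male speaker. The choice of Russian, the regular expression, and the voting rule are all assumptions for this example; a real model would need proper morphological analysis.

```python
import re

# "я" (I) as a separate word, followed by the next word.
FIRST_PERSON = re.compile(r"\bя\s+(\w+)")

def gender_votes(text):
    """Count gender-marked words that directly follow "я"."""
    votes = {"F": 0, "M": 0}
    for word in FIRST_PERSON.findall(text.lower()):
        if word.endswith("ла"):      # e.g. "я сделала"
            votes["F"] += 1
        elif word.endswith("л"):     # e.g. "я сделал"
            votes["M"] += 1
    return votes

def guess_gender(texts):
    """Majority vote over all texts; None when there is no evidence or a tie."""
    total = {"F": 0, "M": 0}
    for text in texts:
        for gender, count in gender_votes(text).items():
            total[gender] += count
    if total["F"] > total["M"]:
        return "F"
    if total["M"] > total["F"]:
        return "M"
    return None

print(guess_gender(["я пришел", "я устал", "я сказал"]))
```

The sketch also shows the disadvantages listed above: it only covers users who write such phrases, it works for one language, and it requires reading user texts.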

Communities and Interests

How it works

• Men prefer cars and extreme sports.

• Women prefer something else.

Conclusion

• There are gender-specific communities and gender-neutral communities.

• Divide and rule
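A rough sketch of how community membership could be turned into a gender estimate: label each community by the share of men among its members with known gender, ignore small and gender-neutral communities, and let the remaining communities vote. The thresholds (`min_size=3`, `threshold=0.8`), the function names, and the data are hypothetical.

```python
from collections import Counter

def label_communities(communities, known_gender, min_size=3, threshold=0.8):
    """Mark communities as "M" or "F"; skip small and neutral ones."""
    labels = {}
    for name, members in communities.items():
        genders = [known_gender[u] for u in members if u in known_gender]
        if len(genders) < min_size:
            continue  # too few members with known gender
        male_share = sum(g == "M" for g in genders) / len(genders)
        if male_share >= threshold:
            labels[name] = "M"
        elif male_share <= 1 - threshold:
            labels[name] = "F"
        # otherwise the community is treated as gender-neutral
    return labels

def estimate_gender(user, communities, labels):
    """Majority vote of the gender-specific communities the user is in."""
    votes = Counter(
        labels[name]
        for name, members in communities.items()
        if user in members and name in labels
    )
    if votes["M"] > votes["F"]:
        return "M"
    if votes["F"] > votes["M"]:
        return "F"
    return None

# Hypothetical data: users x, y, z have no gender set.
known_gender = {"u1": "M", "u2": "M", "u3": "M", "u4": "M",
                "u5": "F", "u6": "F", "u7": "F"}
communities = {
    "cars": {"u1", "u2", "u3", "u4", "x"},
    "extreme_sports": {"u1", "u2", "u3", "x"},
    "community_f": {"u5", "u6", "u7", "y"},
    "news": {"u1", "u2", "u5", "u6", "x", "y"},
    "tiny": {"u1", "x"},
}
labels = label_communities(communities, known_gender)
print(labels, estimate_gender("x", communities, labels))
```

The method needs no text at all, so it is language independent, but the result depends directly on how `min_size` and `threshold` are chosen.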

Interests and Gender Estimation

Advantages

• Language independent

• Good coverage

Disadvantages

• Threshold selection

• Small and gender-neutral communities

Statistics

Advantages

• Language independent

• Not very sensitive to special characters (which can also be removed by preprocessing)

• Close to the maximum possible coverage

Conclusion

• The models are complementary to each other.

• Simple methods may produce very good results because of the large amount of data.

• We can get better results without violating privacy.