You are what you say: Privacy risks of public mentions


Transcript of You are what you say: Privacy risks of public mentions

Page 1: You are what you say:  Privacy risks of public mentions

Natural Language Processing Lab, National Taiwan University

You are what you say: Privacy risks of public mentions

Dan Frankowski et al., University of Minnesota

SIGIR 2006
Presenter: Chun-Yuan Teng

Page 2: You are what you say:  Privacy risks of public mentions


Motivation

• “Public data” + “Private data” + “IR Algorithm” = Privacy risk

Page 3: You are what you say:  Privacy risks of public mentions


Example of privacy risk

• Privacy risk: Link datasets with overlapping users

• “blog” + “purchase history” = “someone”

• E.g., 吳若權 (a Taiwanese author) or 紫微斗數 (Zi Wei Dou Shu, a form of Chinese astrology)

Page 4: You are what you say:  Privacy risks of public mentions


Examples of privacy encroachment

• People are judged by their preferences

• E.g., ratings + mentions of porn in a forum?

Page 5: You are what you say:  Privacy risks of public mentions


Research questions

• Risks of dataset release: What are the risks to user privacy when releasing a dataset?

• Altering the dataset: How can dataset owners alter the dataset they release to preserve user privacy?

• Self defense: How can users protect their own privacy?

Page 6: You are what you say:  Privacy risks of public mentions


Experimental setup

• Ratings (large)
– 140K users: max 6K ratings, average 90, median 33
– 9K movies: max 49K ratings, average 1,403, median 207
– 12.6M ratings in total

• Forum mentions (small)
– 133 forum posters
– 1,685 different movies
– 3,828 movie mentions

Page 7: You are what you say:  Privacy risks of public mentions


RQ1: Risks of dataset release

• How do we evaluate the risks?
• Which algorithms are risky?

Page 8: You are what you say:  Privacy risks of public mentions


K-anonymity & K-identification

• K-anonymity (from the privacy literature)
– Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in the data cannot be distinguished from at least k-1 other individuals in the data”

• K-identification
– K-identification is a measure of how well an algorithm can narrow each user in a dataset down to one of k users in another dataset

Page 9: You are what you say:  Privacy risks of public mentions


K-identification (cont.)

• We know the target user t is also in the ratings data

• t is k-identified if it appears at position k or higher on the likely list

• In the paper, k = 1, 5, 10, 100. We will talk about 1-identification, because it is the scariest.

• Likely list: (u1, s1), (u2, s2), (u3, s3) = t, (u4, s4), …

• Above, t is 3-identified (also 4-identified, 5-identified, etc.), but NOT 2-identified
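To make the definition concrete, here is a minimal Python sketch (the helper name and data layout are illustrative assumptions, not the paper's code) of reading off the smallest k for which a target is k-identified from a scored likely list:

# Hypothetical helper: given a likely list of (user, score) pairs sorted by
# descending score, return the target's 1-based rank, i.e. the smallest k
# such that the target is k-identified (None if the target is absent).
def k_identified_rank(likely_list, target):
    for rank, (user, _score) in enumerate(likely_list, start=1):
        if user == target:
            return rank
    return None

# The slide's example: t sits at position 3, so t is 3-identified
# (and 4-, 5-identified, ...) but NOT 2-identified.
likely = [("u1", 0.9), ("u2", 0.7), ("t", 0.6), ("u4", 0.4)]
print(k_identified_rank(likely, "t"))  # -> 3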

Page 10: You are what you say:  Privacy risks of public mentions


An observation about the data

[Chart: number of ratings of an item by item percentile; y-axis: number of ratings (0–60,000), x-axis: item percentile (0%–100%)]

• Rarely rated items may be good identifiers

Page 11: You are what you say:  Privacy risks of public mentions


Algorithms to identify users

• Set Intersection algorithm
• TF-IDF algorithm
• Scoring algorithm
• Scoring algorithm with ratings

Page 12: You are what you say:  Privacy risks of public mentions


Set Intersection algorithm

Page 13: You are what you say:  Privacy risks of public mentions


Set Intersection algorithm

• Find users who rated EVERY movie the target user mentioned
– They all get the same likeliness score

• Ignores rating values entirely

• RESULT: 1-identification rate of 7%

• MEANING: 7% of the time there was exactly one user at the top of the likely list, and it was the target user

• Room for improvement
– For a target user with many mentions, no candidate user may remain
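As a rough illustration, a minimal Python sketch of the set-intersection idea (data layout and names are assumptions, not the authors' code) could look like:

def set_intersection_candidates(ratings, mentioned):
    # ratings: dict mapping user -> set of movies that user rated
    # mentioned: set of movies the target user mentioned in the forum
    # Candidates are users who rated EVERY mentioned movie; they all get
    # the same likeliness score, and rating values are ignored.
    return {user for user, rated in ratings.items() if mentioned <= rated}

ratings = {"u1": {"A", "B", "C"}, "u2": {"A", "B"}, "u3": {"B", "C"}}
print(set_intersection_candidates(ratings, {"A", "B"}))  # -> {'u1', 'u2'} (order may vary)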

Page 14: You are what you say:  Privacy risks of public mentions


TF-IDF algorithm

• Score each user by similarity to the target user. Score more highly if
– The user has rated more of the target's mentions
– The user has rated mentions of rarely rated movies

• For us: a “word” is a movie, a “document” (bag of words) is a user

• The score is the cosine similarity to the target user

• RESULT: 1-identification rate of 20% (compared to 7% for Set Intersection)

• Room for improvement
– Over-weights any mention for a ratings user who rated few movies
– High-scoring users may have only 4 ratings and 1 mention
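A minimal sketch of this movie-as-word mapping in Python, using binary term frequencies and a standard IDF weight (the paper's exact weighting may differ; function names are illustrative):

import math

def idf_weights(ratings, n_users):
    # ratings: dict user -> set of rated movies; rarely rated movies get a high IDF
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    return {m: math.log(n_users / c) for m, c in counts.items()}

def tfidf_score(user_movies, target_mentions, idf):
    # Cosine similarity between a user's rated movies and the target's mentions,
    # both represented as binary vectors weighted by IDF.
    overlap = sum(idf.get(m, 0.0) ** 2 for m in user_movies & target_mentions)
    norm_u = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in user_movies))
    norm_t = math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in target_mentions))
    return overlap / (norm_u * norm_t) if norm_u and norm_t else 0.0

Note the weakness listed above: a user with only a handful of ratings has a small norm_u, so a single overlapping rarely rated movie can dominate the score.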

Page 15: You are what you say:  Privacy risks of public mentions


Scoring algorithm

• Emphasizes mentions of rarely-rated movies; de-emphasizes the number of ratings a user has

• A user who has rated a mention is 10-20 times more likely to be the target user than one who has not

Page 16: You are what you say:  Privacy risks of public mentions


Examples

• Example
– Target user t mentioned A, B, C, which were rated 20, 50, and 1,000 times respectively (out of 10,000 users)
– User u1 rated A; user u2 rated B and C

• u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
• u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
• u2 is more likely to be the target t

• Rating a mention is good; rating a rarely-rated mention is even better
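Reproducing the slide's arithmetic in Python (the per-movie factors 0.9981 / 0.9501 / 0.9001 for rated mentions and 0.05 for unrated mentions are the values given on the slide; only the multiplication is shown here):

factor_if_rated = {"A": 0.9981, "B": 0.9501, "C": 0.9001}  # rarer movie -> larger factor
FACTOR_IF_NOT_RATED = 0.05

def score(user_rated, mentions=("A", "B", "C")):
    # Multiply one factor per mentioned movie, depending on whether the user rated it.
    s = 1.0
    for m in mentions:
        s *= factor_if_rated[m] if m in user_rated else FACTOR_IF_NOT_RATED
    return s

print(round(score({"A"}), 4))       # u1 -> 0.0025
print(round(score({"B", "C"}), 3))  # u2 -> 0.043, so u2 is more likely to be t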

Page 17: You are what you say:  Privacy risks of public mentions


Scoring algorithm with ratings

• The same as the above algorithm
• Adds a threshold in order to use the rating values as an additional feature

Page 18: You are what you say:  Privacy risks of public mentions


Percent of k-identified

Page 19: You are what you say:  Privacy risks of public mentions


RQ2: Altering the dataset

• Perturbation: change rating values
– Does not help, since rating values are not needed for identification

• Generalization: group items
– The dataset becomes less useful

• Suppression: hide data
– Used in the following experiments

Page 20: You are what you say:  Privacy risks of public mentions


RQ2: Altering the dataset

• We won't modify the forum data (users wouldn't like it); focus on the ratings data

• Rarely-rated items are identifying

• IDEA: release a ratings dataset that suppresses all “rarely-rated” items
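A minimal sketch of that suppression step, assuming a simple (user, movie, rating) list and an illustrative rarity threshold not taken from the paper:

from collections import Counter

def suppress_rarely_rated(ratings, min_raters=50):
    # ratings: list of (user, movie, value) tuples.
    # Drop every rating of a movie that fewer than min_raters users rated.
    raters = Counter(movie for _user, movie, _value in ratings)
    return [(u, m, v) for (u, m, v) in ratings if raters[m] >= min_raters]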

Page 21: You are what you say:  Privacy risks of public mentions


RQ2: Altering the dataset

Page 22: You are what you say:  Privacy risks of public mentions


RQ3: Self Defense

• The question is how users can protect their own privacy

• Suppression: suppress rarely-rated movies
– May not be accepted by users

• Misdirection

Page 23: You are what you say:  Privacy risks of public mentions


Suppression

• Not significant if more than 20% is suppressed

Page 24: You are what you say:  Privacy risks of public mentions


Misdirection

• Mentioning popular items is more effective

• When a popular item is mentioned, more users increase their score

Page 25: You are what you say:  Privacy risks of public mentions


Conclusion

• A new problem in IR
– Interesting and hard

• Hard to preserve privacy
– You need to suppress a large amount of data