You are what you say: Privacy risks of public mentions

Natural Language Processing Lab, National Taiwan University

Dan Frankowski et al., University of Minnesota

SIGIR 2006
Presenter: Chun-Yuan Teng


Motivation

• “Public data” + “Private data” + “IR Algorithm” = Privacy risk


Example of privacy risk

• Privacy risk: Link datasets with overlapping users

• “blog” + “purchase history” = “someone”

• Ex: Wu Ruoquan (吳若權) or Zi Wei Dou Shu (紫微斗數) astrology


Examples of privacy encroachment

• People are judged by their preferences

• Ratings + mentions of porn in a forum?


Research questions

• Risks of dataset release
– What are the risks to user privacy when releasing a dataset?

• Altering the dataset
– How can dataset owners alter the dataset they release to preserve user privacy?

• Self defense
– How can users protect their own privacy?


Experimental setup

• Ratings
– Large
– 140K users: max 6K ratings, average 90, median 33
– 9K movies: max 49K ratings, average 1,403, median 207
– 12.6M ratings

• Forum mentions
– Small
– 133 forum posters
– 1,685 different movies
– 3,828 movie mentions


RQ1: Risks of dataset release

• How to evaluate the risks?
• Which algorithms are risky?


K-anonymity & K-identification

• K-anonymity (in cryptography)
– Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in data cannot be distinguished from k-1 individuals in the data”

• K-identification
– K-identification is a measure of how well an algorithm can narrow each user in a dataset to one of k users in another dataset
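The k-anonymity definition quoted above can be checked mechanically. The following is a minimal sketch, assuming records are dictionaries and that the caller chooses which fields count as quasi-identifiers; the function name, field names, and sample records are all hypothetical and do not come from the paper.

```python
from collections import Counter

def anonymity_level(records, quasi_identifiers):
    """Return the largest k for which the records are k-anonymous.

    Records sharing the same values on every quasi-identifier are
    indistinguishable from one another; k is the size of the smallest
    such group (0 for an empty dataset).
    """
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0

# Hypothetical sample data, for illustration only.
sample = [
    {"zip": "554", "age": "20-29"},
    {"zip": "554", "age": "20-29"},
    {"zip": "551", "age": "30-39"},
]
level_all = anonymity_level(sample, ["zip", "age"])
level_pair = anonymity_level(sample[:2], ["zip", "age"])
```

Here the third record is unique on (zip, age), so the full sample is only 1-anonymous, while the first two records alone are 2-anonymous.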


K-identification (cont.)

• We know target user t in ratings data, too

• t is k-identified if t is at position k or higher on the likely list.

• In the paper, k = 1, 5, 10, 100. We’ll talk about 1-identification, because it’s the scariest.

• Likely list (candidate users u with scores s, most likely first)
– u1, s1
– u2, s2
– u3, s3 (t)
– u4, s4
– …

• Above, t is 3-identified, also 4-identified, 5-identified, etc., but NOT 2-identified
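The k-identification test above can be sketched in a few lines. This is an illustrative sketch only: the function names are hypothetical, and ties are broken by input order, which is an assumption (the slides do not say how ties are handled).

```python
def rank_of_target(likely_list, target):
    """Return the 1-based position of `target` in a list of (user, score)
    pairs ordered by descending score, or None if the target is absent."""
    ordered = sorted(likely_list, key=lambda pair: pair[1], reverse=True)
    for position, (user, _score) in enumerate(ordered, start=1):
        if user == target:
            return position
    return None

def is_k_identified(likely_list, target, k):
    """True if the target sits at position k or higher on the likely list."""
    position = rank_of_target(likely_list, target)
    return position is not None and position <= k

# Mirrors the likely list on the slide; the scores are made up.
likely = [("u1", 0.9), ("u2", 0.7), ("t", 0.4), ("u4", 0.1)]
three = is_k_identified(likely, "t", 3)
two = is_k_identified(likely, "t", 2)
```

With this list, `three` is true and `two` is false, matching the slide's statement that t is 3-identified but not 2-identified.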


An observation of data

[Figure: Number of ratings of an item by percentile. x-axis: item percentile (0% to 100%); y-axis: number of ratings (0 to 60,000).]

• Rarely rated items may be good indicators of who a user is


Algorithms to identify users

• Set Intersection algorithm
• TF-IDF algorithm
• Scoring algorithm
• Scoring algorithm with ratings


Set Intersection algorithm

• Find users who rated EVERY movie the target user mentioned
– They all have the same likeliness score

• Ignore rating value entirely

• RESULT: 1-identification rate: 7%

• MEANING: 7% of the time there was one user at the top of the likely list, and it was the target user

• Room for improvement
– For a target user with many mentions, possibly no user has rated every one of them
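The Set Intersection idea above can be sketched as follows. This is a minimal sketch, assuming the ratings data is available as a mapping from user to the set of movies that user rated; the function names and toy data are hypothetical.

```python
def set_intersection_candidates(ratings, mentions):
    """Return (sorted) users who rated every mentioned movie.

    ratings:  dict mapping user -> set of rated movies (rating values ignored)
    mentions: iterable of movies the target user mentioned
    """
    mentioned = set(mentions)
    return sorted(user for user, rated in ratings.items() if mentioned <= rated)

def is_1_identified(ratings, mentions, target):
    """True if the target is the only candidate left."""
    return set_intersection_candidates(ratings, mentions) == [target]

# Toy data, for illustration only.
ratings = {
    "t": {"A", "B", "C"},
    "u1": {"A", "B"},
    "u2": {"A", "C", "D"},
}
both = set_intersection_candidates(ratings, {"A", "B"})
alone = is_1_identified(ratings, {"B", "C"}, "t")
nobody = set_intersection_candidates(ratings, {"A", "Z"})
```

In the toy data, mentioning A and B leaves two equally likely candidates, mentioning B and C singles out t, and a mention nobody rated leaves no candidate at all, which is the weakness noted above.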


TF-IDF algorithm

• Score each user by similarity to the target user. Score more highly if
– User has rated more mentions of target
– User has rated mentions of rarely rated movies

• For us: “word” is a movie, “document” (bag of words) is a user

• Score is cosine similarity to the target user

• RESULTS: 1-identification rate of 20% (compared to 7% from Set Intersection)

• Room for improvement
– Over-weights any mention for a ratings user who rated few movies
– High-scoring users have 4 ratings and 1 mention
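The TF-IDF approach above can be illustrated with a small sketch. The weighting here is an assumption, not taken from the paper: term frequency is binary (a movie is either rated or not) and the inverse document frequency is log(N / number of raters). All names and the toy data are hypothetical.

```python
import math

def tfidf_vector(movies, doc_freq, n_users):
    """Binary-tf TF-IDF vector: weight = log(N / number of users who rated it)."""
    return {m: math.log(n_users / doc_freq[m]) for m in movies if doc_freq.get(m)}

def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * b[m] for m, weight in a.items() if m in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_users(ratings, mentions):
    """Return (user, score) pairs, most similar to the target's mentions first."""
    n_users = len(ratings)
    doc_freq = {}
    for rated in ratings.values():
        for movie in rated:
            doc_freq[movie] = doc_freq.get(movie, 0) + 1
    target_vec = tfidf_vector(mentions, doc_freq, n_users)
    scores = {
        user: cosine(target_vec, tfidf_vector(rated, doc_freq, n_users))
        for user, rated in ratings.items()
    }
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Toy data: "R" is rarely rated, "A" is popular.
ratings = {
    "u1": {"A", "B", "C", "D"},
    "u2": {"A", "R"},
    "u3": {"A", "B"},
    "u4": {"C", "D"},
}
ranked = rank_users(ratings, {"A", "R"})
```

In this toy data u2 comes first because it rated the rare movie R. Note also that u3, with two ratings, outscores u1, with four, although both match only on the popular movie A; this is the over-weighting of users with few ratings mentioned above.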


Scoring algorithm

• Emphasizes mentions of rarely-rated movies, de-emphasizes number of ratings a user has

• A user who has rated a mention is 10-20 times more likely to be the target user than one who has not


Examples

• Example
– Target user t mentioned A, B, C, rated 20, 50, 1000 times (from 10,000 users)
– User u1 rated A, user u2 rated B, C

• u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
• u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
• u2 more likely to be target t

• Rating a mention is good; rating a rare one is even better
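The example can be reproduced with a short sketch. The sub-score formula is an assumption inferred from the numbers in the example: a user who rated a mentioned movie gets 1 - (times rated - 1) / number of users (which yields 0.9981 for 20 ratings and 0.9001 for 1000 ratings out of 10,000 users), and a user who did not gets 0.05. Under this assumed formula a movie rated 50 times yields 0.9951, whereas the example lists 0.9501 (the value for 500 ratings), so the u2 score computed here is about 0.045 rather than 0.043. Function and parameter names are hypothetical.

```python
def subscore(user_rated, movie, rating_counts, n_users, miss=0.05):
    """Sub-score of one mentioned movie for one candidate user.

    Assumed formula: close to 1 for a rated, rarely rated movie;
    `miss` (0.05) when the user has not rated the movie.
    """
    if movie in user_rated:
        return 1 - (rating_counts[movie] - 1) / n_users
    return miss

def score(user_rated, mentions, rating_counts, n_users):
    """Product of the sub-scores over all mentioned movies."""
    result = 1.0
    for movie in mentions:
        result *= subscore(user_rated, movie, rating_counts, n_users)
    return result

# Numbers from the example: A, B, C rated 20, 50, 1000 times by 10,000 users.
rating_counts = {"A": 20, "B": 50, "C": 1000}
n_users = 10_000
mentions = ["A", "B", "C"]

u1_score = score({"A"}, mentions, rating_counts, n_users)
u2_score = score({"B", "C"}, mentions, rating_counts, n_users)
```

Either way, u2 scores well above u1, so the conclusion of the example is unchanged.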


Scoring algorithm with rating

• Same as the Scoring algorithm above
• Adds the rating value as a feature, using a threshold


Percent of k-identified


RQ2: Altering the dataset

• Perturbation: change rating values
– Does not help, because the rating values are not needed for identification

• Generalization: group items
– Dataset becomes less useful

• Suppression: hide data
– Used in the following experiments


RQ2: Altering the dataset

• We won’t modify forum data, since users wouldn’t like it. Focus on ratings data

• Rarely-rated items are identifying

• IDEA: Release a ratings dataset suppressing all “rarely-rated” items
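The suppression idea above can be sketched as a simple filter. This is a minimal sketch: the cut-off `min_count` that defines "rarely-rated" is a hypothetical parameter, and the slides do not state the threshold used in the experiments.

```python
from collections import Counter

def suppress_rare_items(ratings, min_count):
    """Drop every movie rated by fewer than `min_count` users.

    ratings: dict mapping user -> set of rated movies
    Returns a new dict; the input is left unchanged.
    """
    counts = Counter(movie for rated in ratings.values() for movie in rated)
    return {
        user: {movie for movie in rated if counts[movie] >= min_count}
        for user, rated in ratings.items()
    }

# Toy data: "C" is rated by a single user and gets suppressed.
ratings = {
    "u1": {"A", "B"},
    "u2": {"A", "C"},
    "u3": {"A", "B"},
}
released = suppress_rare_items(ratings, min_count=2)
```

After suppression, u2's only remaining rating is the popular movie A, so the rare movie C can no longer be used to single u2 out.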


RQ2: Altering the dataset


RQ3: Self Defense

• The question is how users can protect their own privacy

• Suppression: suppress rarely-rated movies
– May not be accepted by users

• Misdirection


Suppression

• Not significant if more than 20%


Misdirection

• Mentioning popular items is more effective

• When a popular item is mentioned, more users see their score increase
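The effect of misdirection can be illustrated with a toy experiment. Everything here is an assumption made for illustration: the scoring formula (1 - (times rated - 1) / number of users for a rated mention, 0.05 otherwise), the synthetic data, and the function names. The target t adds a mention of a movie t never rated and we check how far t drops on the likely list.

```python
def score(user_rated, mentions, rating_counts, n_users, miss=0.05):
    """Product of per-mention sub-scores (assumed scoring formula)."""
    result = 1.0
    for movie in mentions:
        if movie in user_rated:
            result *= 1 - (rating_counts[movie] - 1) / n_users
        else:
            result *= miss
    return result

def target_position(ratings, mentions, target):
    """1 + number of users scoring strictly higher than the target."""
    n_users = len(ratings)
    counts = {}
    for rated in ratings.values():
        for movie in rated:
            counts[movie] = counts.get(movie, 0) + 1
    scores = {
        user: score(rated, mentions, counts, n_users)
        for user, rated in ratings.items()
    }
    return 1 + sum(1 for s in scores.values() if s > scores[target])

# Synthetic data: 100 users. M is rated by t and 10 others,
# P is popular (50 raters), Q is rare (2 raters). t rated only M.
ratings = {f"u{i}": set() for i in range(99)}
ratings["t"] = {"M"}
for i in range(10):
    ratings[f"u{i}"].add("M")
for i in range(50):
    ratings[f"u{i}"].add("P")
for i in range(2):
    ratings[f"u{i}"].add("Q")

honest = target_position(ratings, ["M"], "t")
with_popular = target_position(ratings, ["M", "P"], "t")
with_rare = target_position(ratings, ["M", "Q"], "t")
```

In this synthetic data, a false mention of the popular movie P pushes t from position 1 to position 11, while a false mention of the rare movie Q only pushes t to position 3, because far fewer users benefit from it.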


Conclusion

• A new problem in IR
– Interesting and hard

• Hard to preserve privacy
– You need to suppress a large amount of data
