You are what you say: Privacy risks of public mentions
Natural Language Processing Lab, National Taiwan University
You are what you say: Privacy risks of public mentions
Dan Frankowski et al., University of Minnesota
SIGIR 2006
Presenter: Chun-Yuan Teng
Motivation
• “Public data” + “Private data” + “IR Algorithm” = Privacy risk
Example of privacy risk
• Privacy risk: linking datasets with overlapping users
• “blog” + “purchase history” = “someone”
• Ex: 吳若權 (Wu Ruoquan, a Taiwanese author) or 紫微斗數 (Zi Wei Dou Shu astrology)
Examples of privacy encroachment
• People are judged by their preferences
• Ratings + a mention of porn in a forum?
Research questions
• Risks of dataset release
– What are the risks to user privacy when releasing a dataset?
• Altering the dataset
– How can dataset owners alter the dataset they release to preserve user privacy?
• Self defense
– How can users protect their own privacy?
Experimental setup
• Ratings (large)
– 140K users: max 6K ratings, average 90, median 33
– 9K movies: max 49K ratings, average 1,403, median 207
– 12.6M ratings
• Forum mentions (small)
– 133 forum posters
– 1,685 different movies
– 3,828 movie mentions
RQ1: Risks of dataset release
• How do we evaluate the risks?
• Which algorithms are risky?
K-anonymity & K-identification
• K-anonymity (from data privacy)
– Sweeney: “A dataset release provides k-anonymity protection if the information for each person contained in data cannot be distinguished from k-1 individuals in the data”
• K-identification
– K-identification is a measure of how well an algorithm can narrow each user in a dataset to one of k users in another dataset
K-identification (cont.)
• Assume the target user t also appears in the ratings data
• t is k-identified if t appears within the top k positions of the likely list
• The paper uses k = 1, 5, 10, 100. We’ll talk about 1-identification, because it’s the scariest.
• Likely list (user, score)
– u1, s1
– u2, s2
– u3, s3 (t)
– u4, s4
– …
• Above, t is 3-identified, also 4-identified, 5-identified, etc., but NOT 2-identified
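The k-identification check on the likely list can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code: the function name `is_k_identified` and the scores are made up, and ties between equal scores are ignored for simplicity.

```python
def is_k_identified(likely_list, target, k):
    """Return True if `target` is within the top k entries of `likely_list`.

    `likely_list` is a list of (user, score) pairs, already sorted from
    most to least likely. Ties are not handled (simplifying assumption).
    """
    top_users = [user for user, _score in likely_list[:k]]
    return target in top_users


# The likely list from the slide, with made-up scores; the target t is u3.
likely = [("u1", 0.9), ("u2", 0.7), ("u3", 0.4), ("u4", 0.1)]
print([k for k in range(1, 6) if is_k_identified(likely, "u3", k)])  # [3, 4, 5]
```

As on the slide, the target is 3-, 4- and 5-identified but not 1- or 2-identified.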
An observation of data
[Figure: Number of ratings of an item by percentile. x-axis: item percentile (0% to 100%); y-axis: number of ratings (0 to 60,000)]
• Rarely rated items may be good indicators of who a user is
Algorithms to identify users
• Set Intersection algorithm
• TF-IDF algorithm
• Scoring algorithm
• Scoring algorithm with ratings
Set Intersection algorithm
• Find users who rated EVERY movie the target user mentioned
– They all have the same likelihood score
• Ignore rating values entirely
• RESULT: 1-identification rate of 7%
• MEANING: 7% of the time there was one user at the top of the likely list, and it was the target user
• Room for improvement
– For a target user with many mentions, there may be no user who rated every mentioned movie
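A minimal sketch of the set intersection idea, assuming the ratings are available as a mapping from user to the set of movies that user rated. The function name and the toy data are invented for illustration.

```python
def set_intersection_candidates(mentions, ratings):
    """Return users who rated every mentioned movie (rating values ignored).

    mentions: set of movie ids mentioned by the target user.
    ratings:  dict mapping user id -> set of movie ids that user rated.
    """
    return sorted(user for user, rated in ratings.items() if mentions <= rated)


# Toy data, invented for illustration only.
ratings = {
    "u1": {"A", "B", "C"},
    "u2": {"A", "B"},
    "u3": {"B", "C", "D"},
}
print(set_intersection_candidates({"A", "B"}, ratings))       # ['u1', 'u2']
print(set_intersection_candidates({"A", "B", "C"}, ratings))  # ['u1']
print(set_intersection_candidates({"A", "D"}, ratings))       # []
```

The target is 1-identified only when exactly one candidate remains and it is the target. The last call shows the weakness noted above: the candidate list can end up empty.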
TF-IDF algorithm
• Score each user by similarity to the target user. Score more highly if
– User has rated more mentions of target
– User has rated mentions of rarely rated movies
• For us: “word” is a movie, “document” (bag of words) is a user
• Score is cosine similarity to the target user
• RESULTS: 1-identification rate of 20% (compared to 7% from Set Intersection)
• Room for improvement
– Over-weights any mention for a ratings user who rated few movies
– High-scoring users have 4 ratings and 1 mention
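The movie-as-word, user-as-document view can be sketched as follows. The slides do not give the exact weighting, so this sketch assumes a binary term frequency and idf = log(N / number of users who rated the movie). The function name and toy data are hypothetical.

```python
import math


def tfidf_scores(mentions, ratings):
    """Score each ratings user by cosine similarity to the target's mentions.

    Assumed weighting: binary term frequency, idf = log(N / df).
    """
    n_users = len(ratings)
    # Document frequency: how many users rated each movie.
    df = {}
    for rated in ratings.values():
        for movie in rated:
            df[movie] = df.get(movie, 0) + 1
    idf = {movie: math.log(n_users / count) for movie, count in df.items()}

    def norm(movies):
        return math.sqrt(sum(idf.get(m, 0.0) ** 2 for m in movies))

    target_norm = norm(mentions)
    scores = {}
    for user, rated in ratings.items():
        dot = sum(idf.get(m, 0.0) ** 2 for m in mentions & rated)
        denom = target_norm * norm(rated)
        scores[user] = dot / denom if denom else 0.0
    return scores


# Toy data: "hit" is rated by everyone, "rare" by a single user.
ratings = {
    "u1": {"rare", "hit"},
    "u2": {"hit"},
    "u3": {"hit", "other"},
    "u4": {"hit", "other"},
}
scores = tfidf_scores({"rare", "hit"}, ratings)
print(max(scores, key=scores.get))  # u1
```

A movie that everyone has rated gets an idf of zero, so only the rarely rated mention separates u1 from the other users.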
Scoring algorithm
• Emphasizes mentions of rarely-rated movies, de-emphasizes number of ratings a user has
• A user who has rated a mention is 10-20 times more likely to be the target user than one who has not
Examples
• Example
– Target user t mentioned A, B, C, rated 20, 50, 1000 times (from 10,000 users)
– User u1 rated A, user u2 rated B, C
• u1 score: 0.9981 * 0.05 * 0.05 = 0.0025
• u2 score: 0.05 * 0.9501 * 0.9001 = 0.043
• u2 more likely to be target t
• Rating a mention is good, rare even better
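The products in the example can be reproduced with a short sketch. The slides do not state the sub-score formula, so the one below is an assumption: 1 - (times rated - 1) / N when the user rated the mentioned movie, and a constant 0.05 otherwise. It matches the slide's factors for A (0.9981) and C (0.9001); for B it yields 0.9951 rather than the 0.9501 printed on the slide, so u2's score comes out near 0.045 instead of 0.043. The ordering of u1 and u2 is unchanged.

```python
MISS_SCORE = 0.05  # sub-score when the user has not rated a mentioned movie


def sub_score(user_rated, times_rated, n_users):
    """Assumed sub-score: close to 1 for rarely rated movies, MISS_SCORE otherwise."""
    if user_rated:
        return 1 - (times_rated - 1) / n_users
    return MISS_SCORE


def score(user_movies, mention_counts, n_users):
    """Multiply sub-scores over every movie the target mentioned."""
    result = 1.0
    for movie, times_rated in mention_counts.items():
        result *= sub_score(movie in user_movies, times_rated, n_users)
    return result


mention_counts = {"A": 20, "B": 50, "C": 1000}  # times rated, from the slide
u1 = score({"A"}, mention_counts, 10_000)
u2 = score({"B", "C"}, mention_counts, 10_000)
print(round(u1, 4), u2 > u1)  # 0.0025 True
```

Because each missed mention multiplies the score by 0.05, a user who rated two of the three mentions outranks a user who rated only the rarest one.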
Scoring algorithm with rating
• Same as the scoring algorithm above
• Adds a threshold on rating values so that ratings become a feature
Percent of k-identified
RQ2: altering the dataset
• Perturbation: change rating values
– Not effective, since rating values are not needed for identification
• Generalization: group items
– Dataset becomes less useful
• Suppression: hide data
– Used in the following experiments
RQ2: Altering the dataset
• We won’t modify forum data, since users wouldn’t like it. Focus on ratings data
• Rarely-rated items are identifying
• IDEA: Release a ratings dataset suppressing all “rarely-rated” items
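Suppressing rarely-rated items before release could look like the sketch below. The cutoff parameter `min_raters` and the toy data are hypothetical; the slides do not say how "rarely-rated" is defined in the experiments.

```python
from collections import Counter


def suppress_rare_items(ratings, min_raters):
    """Drop every movie rated by fewer than `min_raters` users.

    ratings: dict mapping user id -> set of movie ids that user rated.
    Returns a new dict; the input is left unchanged.
    """
    counts = Counter(movie for rated in ratings.values() for movie in rated)
    keep = {movie for movie, count in counts.items() if count >= min_raters}
    return {user: rated & keep for user, rated in ratings.items()}


# Toy data, invented for illustration only.
ratings = {
    "u1": {"hit", "rare"},
    "u2": {"hit", "cult"},
    "u3": {"hit", "cult"},
}
released = suppress_rare_items(ratings, min_raters=2)
print(sorted(released["u1"]))  # ['hit']
```

After suppression, u1's only distinguishing item is gone from the released data, which is the intended effect, at the cost of a less complete dataset.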
RQ3: Self Defense
• The question is how users can protect their own privacy
• Suppression: suppress mentions of rarely rated movies
– May not be accepted by users
• Misdirection
Suppression
• Not significant if more than 20%
Misdirection
• Mentioning popular items is more effective
• When a popular item is mentioned, more users see their score increase
Conclusion
• A new problem in IR
– Interesting and hard
• Hard to preserve privacy
– A large amount of data must be suppressed