Date posted: 21-Dec-2015
1
Privacy Protection with Genetic Algorithms
Presenter: 林惠珍
Using Genetic Algorithms for Privacy Protection
2
Outline
Introduction
Basics of Micro-Aggregation and Methods
Privacy Protection Through Genetic-Algorithm-Based Micro-aggregation
A Hybrid Approach
Experimental Results
Conclusions and Future Challenges
4
Privacy!!
Privacy vs. data utility
[Diagram labels: data collection, statistics, data aggregation, releasing, respondent, safe]
5
Contribution
Micro-aggregation distorts the data in order to guarantee respondents' privacy.
Optimal micro-aggregation is NP-hard, so the authors use a GA with some modifications to solve the problem.
A hybrid method is also proposed for solving the above problem.
7
SDC (Statistical Disclosure Control), also known as Statistical Disclosure Limitation (SDL)
[Diagram labels: data, transform, public, data utility, statistical confidentiality, respondent]
Goal: enough protection and minimal information loss
Method: micro-aggregation of micro-data (personal data)
Micro-aggregation is a clustering problem in which the cluster size matters.
8
Two goals for micro-aggregation
1. Preserving data utility.
2. Protecting the privacy of the respondents.
9
Preserving data utility
Introduce as little noise as possible into the data.
So we should aggregate similar elements rather than dissimilar ones.
10
Protecting the privacy of the respondents
Data have to be sufficiently modified to make re-identification difficult.
Increasing the number of aggregated elements can increase data privacy.
11
Whether two elements are similar
Similarity function, e.g. Euclidean distance
[Formulas not preserved in the transcript; their labels follow.]
Univariate data set D_uni: the number of elements in D_uni; the i-th element in D_uni; the average element.
Multivariate data set D_multi: the number of dimensions of each element; the j-th component of the i-th element in D_multi; the j-th component of the average element.
Multiple subsets: the number of subsets; the number of elements in the i-th subset; the j-th element in the i-th subset; the average element of the i-th subset.
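The slide's formulas did not survive extraction. The labels match the usual sum-of-squared-errors (SSE) definitions, so a plausible reconstruction is sketched here. The symbols n, d, s, n_i, x and x-bar are assumptions chosen to fit the labels (s and n_i also appear on the fitness-function slide); they are not copied from the slide.

```latex
% Univariate data set D_uni: n elements x_i, average element \bar{x}
SSE = \sum_{i=1}^{n} (x_i - \bar{x})^2

% Multivariate data set D_multi: n elements with d dimensions each,
% x_{ij} = j-th component of the i-th element, \bar{x}_j = j-th component of the average
SSE = \sum_{i=1}^{n} \sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2

% Multiple subsets: s subsets, n_i elements in the i-th subset,
% x_{ij} = j-th element of the i-th subset, \bar{x}_i = average element of the i-th subset
SSE = \sum_{i=1}^{s} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^{T} (x_{ij} - \bar{x}_i)
```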
12
Micro-aggregation problem (k-micro-aggregation problem)
Definition: given a data set D with n elements, obtain a k-partition of D in which the homogeneity of the groups is maximized, that is, the SSE is as small as possible.
k: a security parameter that determines the minimum cardinality of the subsets.
A k-partition of D is a partition whose parts each have at least k elements of D.
Example (k = 3): the elements 3, 5 and 4 have average element 4, so each is released as 4.
The problem is NP-hard for multivariate data sets, so heuristic methods are used.
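A minimal sketch of the two quantities in the definition: the k-partition check and the SSE. It reads the slide's example as the values 3, 5 and 4 with k = 3; the function names are hypothetical.

```python
def sse(groups):
    """Sum of squared distances from each record to its group's average element.

    Each group is a list of records; each record is a tuple of numbers.
    """
    total = 0.0
    for group in groups:
        dims = len(group[0])
        average = [sum(rec[d] for rec in group) / len(group) for d in range(dims)]
        for rec in group:
            total += sum((rec[d] - average[d]) ** 2 for d in range(dims))
    return total


def is_k_partition(groups, k):
    """True when every part has at least k elements."""
    return all(len(group) >= k for group in groups)


groups = [[(3,), (5,), (4,)]]        # one group, average element 4
print(is_k_partition(groups, 3))     # True
print(sse(groups))                   # 2.0  (1 + 1 + 0)
```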
13
Multivariate Micro-Aggregation Methods
Minimum Spanning Tree Partitioning (MSTP)
Maximum Distance Method (MD)
Maximum Distance to Average Vector Method (MDAV)
Variable-MDAV
14
Minimum Spanning Tree Partitioning (MSTP)
Steps:
1. MST construction
2. Edge cutting
3. Cluster generation
Limitation: because the method is founded on the MST, it fails to adapt properly to scattered data points.
15
Maximum Distance Method (MD)
The main advantage of MD is its simplicity, and it achieves very good results on most data sets.
Find the two most distant elements, r and s (by Euclidean distance).
Form one group with r and its closest k-1 elements, and another with s and its closest k-1 elements.
Check the number of remaining elements (num):
1. num >= 2k: repeat
2. k <= num <= 2k-1: form a new group
3. num <= k-1: assign each element to the closest group
Micro-aggregated data: replace each record by the centroid of the group to which it belongs.
Shortcoming: computational complexity
16
Maximum Distance to Average Vector Method (MDAV)
MDAV improves on MD in terms of computational complexity while maintaining the performance in terms of SSE.
MDAV is the most popular method used for micro-aggregating data sets.
17
MDAV Algorithm
Build two groups at each iteration.
When the number of remaining records RR drops to RR <= 2k-1:
1. RR < k: assign each element to the closest group
2. RR >= k: form a new group
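The loop described on this slide can be sketched as follows. This follows the slide's simplified stopping rule rather than any published MDAV pseudocode, measures "closest group" by distance to the group centroid (an assumption), and uses hypothetical helper names.

```python
import math


def centroid(records):
    dims = len(records[0])
    return tuple(sum(r[d] for r in records) / len(records) for d in range(dims))


def farthest(records, point):
    return max(records, key=lambda r: math.dist(r, point))


def take_group(remaining, seed, k):
    """Remove and return the k records of `remaining` closest to `seed`."""
    remaining.sort(key=lambda r: math.dist(r, seed))
    group = remaining[:k]
    del remaining[:k]
    return group


def mdav(records, k):
    if len(records) < k:
        raise ValueError("need at least k records")
    remaining = list(records)
    groups = []
    while len(remaining) >= 2 * k:            # build two groups per iteration
        c = centroid(remaining)
        r = farthest(remaining, c)            # most distant from the centroid
        s = farthest(remaining, r)            # most distant from r
        groups.append(take_group(remaining, r, k))
        groups.append(take_group(remaining, s, k))
    if len(remaining) >= k:                   # k <= RR <= 2k-1: a new group
        groups.append(remaining)
    else:                                     # RR < k: join the closest group
        for rec in remaining:
            closest = min(groups, key=lambda g: math.dist(rec, centroid(g)))
            closest.append(rec)
    return groups


data = [(1,), (2,), (3,), (10,), (11,), (12,), (13,)]
for group in mdav(data, 3):
    print(sorted(group), "->", centroid(group))
```

Replacing every record by the printed centroid of its group gives the micro-aggregated data set.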
18
MDAV Process
[Diagram: using the distance matrix, find the centroid c, the record r most distant from c, and the record s most distant from r]
Micro-aggregated data: replace each record by the centroid of the group to which it belongs.
Shortcoming: lack of flexibility. It only generates subsets of fixed cardinality k.
19
Variable-MDAV
V-MDAV intends to overcome the limitation by computing a variable-size k-partition with a computational cost similar to the MDAV cost.
20
V-MDAV Process
[Diagram labels: distance matrix, centroid c, most distant element e]
Check the number of remaining records RR:
1. RR >= k: form groups
2. RR <= k-1: assign each element to the closest group
Extend the group (by up to k-1 elements): let e_min be the unassigned element closest to the current group, at distance d_in, and let d_out be the distance from e_min to its closest unassigned element.
If d_in < γ*d_out, assign e_min to the current group.
MDAV is the most popular method, so the authors use it as the reference for comparison.
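The extension rule can be sketched as a small function. The reading of d_in and d_out follows the slide's labels; stopping when only one unassigned record is left (so d_out is undefined) is an assumption of this sketch, and the name `extend_group` is hypothetical.

```python
import math


def extend_group(group, unassigned, k, gamma):
    """Grow `group` in place with up to k-1 nearby unassigned records."""
    added = 0
    while added < k - 1 and len(unassigned) > 1:
        # e_min: the unassigned record closest to any member of the group
        d_in, e_min = min(
            (min(math.dist(u, g) for g in group), u) for u in unassigned
        )
        # d_out: distance from e_min to its closest other unassigned record
        d_out = min(math.dist(e_min, u) for u in unassigned if u is not e_min)
        if d_in >= gamma * d_out:
            break                      # e_min is better left for another group
        group.append(e_min)
        unassigned.remove(e_min)
        added += 1
    return group


group = [(0,), (1,), (2,)]
unassigned = [(2.5,), (10,), (11,)]
extend_group(group, unassigned, k=3, gamma=0.2)
print(group)        # [(0,), (1,), (2,), (2.5,)]
print(unassigned)   # [(10,), (11,)]
```

In the example, (2.5,) is absorbed because it is much closer to the group (0.5) than to any other unassigned record (7.5), while (10,) is left out because its neighbour (11,) is far closer than the group.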
22
Coding sequence
Initializing the population
The fitness function
Selection scheme and genetic operators (crossover & mutation)
23
Coding Sequence
Binary codings: 0 1 1 0 1 0 0 1 1 0 ….
N-ary codings: B A D F E A C C B F ….
Real-valued codings: 2 3 2.3 1.9 53.4 4.5 2.7 2 3.1 ….
24
Univariate vs. Multivariate
Univariate micro-aggregation: binary codings
Data set: 3 25 1 6 9 8 4 5 10 11 20 17
Sorted data set: 1 3 4 5 6 8 9 10 11 17 20 25
Binary coding may be: 0 0 0 0 1 1 0 0 1 0 0 0
But there is no way of sorting multivariate records without giving a higher priority to one of the attributes.
25
Univariate vs. Multivariate (cont.)
Multivariate micro-aggregation: N-ary codings
Maximum number of groups: G = ⌊n/k⌋
Each symbol represents one group of the k-partition.
Chromosome length: the number of records in the data set.
The i-th gene value → the group of the k-partition to which the i-th record in the data set belongs.
26
Example
n = 11, k = 3, G = ⌊11/3⌋ = 3
3-character alphabet: A, B, C
Chromosome length: 11
Chromosome: A A A B B B C C C A A
3-partition: group A = {1,2,3,10,11}, group B = {4,5,6}, group C = {7,8,9}
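Decoding an N-ary chromosome into its groups is a one-pass lookup. A small sketch, using the chromosome implied by the 3-partition above (the function name is hypothetical):

```python
def decode(chromosome):
    """Map each group symbol to the 1-based indices of the records it contains."""
    groups = {}
    for index, symbol in enumerate(chromosome, start=1):
        groups.setdefault(symbol, []).append(index)
    return groups


print(decode("AAABBBCCCAA"))
# {'A': [1, 2, 3, 10, 11], 'B': [4, 5, 6], 'C': [7, 8, 9]}
```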
27
Initializing the Population
Populations are generally initialized at random.
With n records and G different alphabet symbols there are G^n possible chromosomes.
But only a small fraction of them meets the cardinality constraints.
“In an optimal k-partition, each group has between k and 2k-1 records.” (Domingo & Mateo)
[Formula: minimum number of groups]
28
Initializing the Population (cont.)
Random initialization is not suitable for obtaining candidate optimal k-partitions.
So the cardinality constraints must be embedded in the initialization procedure. → Algorithm 2
This guarantees that each group (part) has at most 2k-1 elements.
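Algorithm 2 itself is not reproduced in the transcript. The following is only an illustrative sketch of one way to embed the upper bound in the initialization: each gene is drawn at random from the groups that still have fewer than 2k-1 records. It assumes n >= k and uses integers 0..G-1 as the alphabet.

```python
import random


def init_chromosome(n, k, rng):
    """Random chromosome of length n in which no group exceeds 2k-1 records."""
    G = n // k                          # maximum number of groups
    counts = [0] * G
    genes = []
    for _ in range(n):
        open_groups = [g for g in range(G) if counts[g] < 2 * k - 1]
        g = rng.choice(open_groups)     # only groups with room are eligible
        counts[g] += 1
        genes.append(g)
    return genes


rng = random.Random(0)
population = [init_chromosome(11, 3, rng) for _ in range(5)]
for chromosome in population:
    print(chromosome)
```

This enforces only the upper bound, which is the guarantee the slide states; groups with fewer than k records can still appear and are left to the fitness penalty.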
29
The Fitness Function
Obtain a measure of the homogeneity of the groups in the k-partition represented by a given chromosome through the SSE.
The goal is to minimize the SSE. Thus, the fitness value of a chromosome is [formula not preserved in the transcript]
s: the total number of groups; n_i: the number of records in the i-th group
Chromosomes that encode a non-optimal k-partition are penalized.
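Since the slide's fitness formula is missing, the sketch below uses a stand-in: 1 / (1 + SSE), multiplied by a penalty factor for every group whose size falls outside [k, 2k-1]. Both the mapping and the penalty value are assumptions made for illustration, not the authors' definition.

```python
def group_sse(records):
    dims = len(records[0])
    avg = [sum(r[d] for r in records) / len(records) for d in range(dims)]
    return sum((r[d] - avg[d]) ** 2 for r in records for d in range(dims))


def fitness(chromosome, data, k, penalty=0.5):
    """Higher is better. ASSUMED form: 1 / (1 + SSE), with a size penalty."""
    groups = {}
    for gene, record in zip(chromosome, data):
        groups.setdefault(gene, []).append(record)
    total_sse = sum(group_sse(g) for g in groups.values())
    value = 1.0 / (1.0 + total_sse)
    for g in groups.values():
        if not (k <= len(g) <= 2 * k - 1):
            value *= penalty            # penalize groups that break the constraint
    return value


data = [(1,), (2,), (3,), (10,), (11,), (12,)]
print(fitness("AAABBB", data, k=3))     # compact groups, SSE = 4
print(fitness("ABABAB", data, k=3))     # mixed groups, much larger SSE
print(fitness("AAAAAB", data, k=3))     # group B has 1 record: penalized
```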
30
Selection Scheme and Genetic Operators
Selection scheme: roulette-wheel selection
Genetic operators: one-point crossover and mutation
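The three operators named here are standard; a minimal sketch on list-valued chromosomes follows. The per-gene mutation (each gene is redrawn from the alphabet with probability `rate`) is one common variant and an assumption here, since the slide does not say how mutation is applied.

```python
import random


def roulette_select(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]


def one_point_crossover(a, b, rng):
    """Swap the tails of two equal-length parents at a random cut point."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, alphabet, rate, rng):
    """Redraw each gene from the alphabet with probability `rate`."""
    return [rng.choice(alphabet) if rng.random() < rate else gene
            for gene in chromosome]


rng = random.Random(1)
parents = [list("AAABBBCCCAA"), list("ABCABCABCAB")]
mother = roulette_select(parents, [0.7, 0.3], rng)
father = roulette_select(parents, [0.7, 0.3], rng)
child1, child2 = one_point_crossover(mother, father, rng)
print("".join(mutate(child1, "ABC", 0.1, rng)))
print("".join(mutate(child2, "ABC", 0.1, rng)))
```

Crossover and mutation can produce groups that violate the cardinality constraints, which is where the fitness penalty comes in.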
32
A Hybrid Approach
GA: good SSE, but low performance on very large data sets
MDAV: adapts to very large data sets, but worse than the GA in terms of SSE
Hybrid approach:
1. Good SSE
2. Adapts to very large data sets
Name: Two-step partitioning
33
Two-step partitioning
k → a small value
K → larger than k, with K % k = 0; small enough to be suitable for the GA
Ex: k=3; K=21
Use MDAV to build a 3-partition.
Use MDAV to build macro-groups (sets of average vectors) of size K/k (21/3=7).
Replace the average vectors by their k original records to obtain a K-partition.
Finally, apply the GA to each macro-group in the K-partition in order to generate an optimal or near-optimal k-partition of the macro-group.
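The control flow of the two steps can be sketched as below. To stay short, `mdav_like` is a simplified greedy stand-in for MDAV (leftovers simply join the last group), and the GA is represented by a `refine` callback; both are placeholders for illustration, not the authors' implementations.

```python
import math


def centroid(records):
    dims = len(records[0])
    return tuple(sum(r[d] for r in records) / len(records) for d in range(dims))


def mdav_like(items, size, key=lambda item: item):
    """Greedy stand-in for MDAV: group the item farthest from the centroid
    with its size-1 nearest neighbours; leftovers join the last group."""
    remaining = list(items)
    groups = []
    while len(remaining) >= size:
        c = centroid([key(i) for i in remaining])
        seed = key(max(remaining, key=lambda i: math.dist(key(i), c)))
        remaining.sort(key=lambda i: math.dist(key(i), seed))
        groups.append(remaining[:size])
        remaining = remaining[size:]
    if remaining and groups:
        groups[-1].extend(remaining)
    elif remaining:
        groups.append(remaining)
    return groups


def two_step(records, k, K, refine):
    if K % k != 0:
        raise ValueError("K must be a multiple of k")
    small_groups = mdav_like(records, k)                       # k-partition
    macro_groups = mdav_like(small_groups, K // k, key=centroid)  # group the average vectors
    result = []
    for macro in macro_groups:
        originals = [rec for group in macro for rec in group]  # back to original records
        result.extend(refine(originals, k))                    # the GA would run here
    return result


records = [(float(i),) for i in range(12)]
partition = two_step(records, k=2, K=4, refine=mdav_like)
print(partition)
```

Because each macro-group holds only about K records, the `refine` step works on a problem small enough for a GA even when the full data set is large.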
34
One-step MDAV vs. Two-step MDAV
[Figure: comparison of the two variants, with one marked "Better"]
36
Experiment
Approaches:
GA-based micro-aggregation
Hybrid micro-aggregation
Comparison with MDAV and ES (exhaustive search). ES is only possible with tiny data sets of up to 11 elements.
Data sets:
1. The example data set (Table 1)
2. Small data sets
3. Real and large data sets
Each experiment consists of 12,100 runs of the GA.
Mutation rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 → 11 values
Crossover rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 → 11 values
Population size: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 → 10 values
The GA was run 10 times for each parameter setting.
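The 12,100 figure is the size of the parameter grid times the number of repetitions. A quick check (variable names are illustrative):

```python
from itertools import product

rates = [i / 10 for i in range(11)]          # 0, 0.1, ..., 1  (11 values)
population_sizes = list(range(10, 101, 10))  # 10, 20, ..., 100 (10 values)
repeats = 10

# every (mutation rate, crossover rate, population size) combination
settings = list(product(rates, rates, population_sizes))
print(len(settings))             # 1210 parameter settings
print(len(settings) * repeats)   # 12100 GA runs per experiment
```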
37
Results for the Running Example
The GA running time depends on the number of generations.
Most of the tests converge in fewer than 5,000 iterations.
Although MDAV is faster, the SSE obtained with the GA is better. (90% → 14.82)
38
Results in Small Data Sets
The mutation rate should be low, e.g. 0.1.
The GA-based approach cannot deal with large data sets.
[Figure annotation: "Same!!"]
39
Results in Real and Large Data Sets
Use the hybrid technique.
Data set sizes (records x attributes): 1000 x 2, 1000 x 2, 1080 x 13, 4092 x 11
[Results table not preserved; one entry is marked "Better"]
41
Conclusions and Future Challenges
The reported experimental results demonstrate the usefulness of the proposed methods and open the door to a promising line of research.
Many questions remain open:
Look for better codings.
Test the efficiency of other selection algorithms.
Evaluate the importance of genetic operators such as multiple-point crossover or inversion.