Privacy Protection with Genetic Algorithms

Presenter: 林惠珍

Date posted: 21-Dec-2015

Page 1:

Privacy Protection with Genetic Algorithms

Presenter: 林惠珍

Subtitle: Using genetic algorithms for privacy protection

Page 2:

Outline

Introduction
Basics of Micro-Aggregation and Methods
Privacy Protection Through Genetic-Algorithm-Based Micro-aggregation
A Hybrid Approach
Experimental Results
Conclusions and Future Challenges

Page 4:

Privacy!!

Privacy vs. data utility

Diagram labels: data collection, statistics, data aggregation, releasing, respondent, safe.

Page 5:

Contribution

Micro-aggregation distorts the data to guarantee respondents' privacy.

Optimal micro-aggregation is NP-hard, so the authors use a GA with some modifications to solve the problem.

A hybrid method is also proposed for the same problem.

Page 7:

SDC (Statistical Disclosure Control), also called Statistical Disclosure Limitation (SDL)

Data are transformed before they are made public, balancing data utility against the statistical confidentiality of the respondents.

Goal: enough protection and minimal information loss.

Method: micro-aggregation of micro-data (personal data).

Micro-aggregation is a clustering problem in which the cluster size matters.

Page 8:

Two goals for micro-aggregation

1. Preserving data utility.
2. Protecting the privacy of the respondents.

Page 9:

Preserving data utility

Introduce as little noise as possible into the data.

So we should aggregate similar elements rather than dissimilar ones.

Page 10:

Protecting the privacy of the respondents

Data have to be sufficiently modified to make re-identification difficult.

Increasing the number of aggregated elements can increase data privacy.

Page 11:

Whether two elements are similar

Similarity function, e.g. Euclidean distance.

Univariate data set D_uni. Symbols annotated on the slide: the number of elements in D_uni; the i-th element in D_uni; the average element.

Multivariate data set D_multi. Symbols annotated on the slide: the number of dimensions of each element; the j-th component of the average element; the j-th component of the i-th element in D_multi.

Multiple subsets. Symbols annotated on the slide: the number of subsets; the number of elements in the i-th subset; the j-th element in the i-th subset; the average element of the i-th subset.
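The symbol annotations on this slide match the usual sum-of-squared-errors (SSE) measure of homogeneity. The formulas themselves did not survive in the transcript, so the following is only a hedged reconstruction; the symbol names n, d, s, n_i, x and \bar{x} are assumptions chosen to match the annotations.

```latex
% Univariate data set D_uni: n elements x_i, average element \bar{x}
SSE_{uni} = \sum_{i=1}^{n} (x_i - \bar{x})^2

% Multivariate data set D_multi: n elements with d components each
SSE_{multi} = \sum_{i=1}^{n} \sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2

% Multiple subsets: s subsets, the i-th one holding n_i elements x_{ij}
% with average element \bar{x}_i
SSE = \sum_{i=1}^{s} \sum_{j=1}^{n_i} (x_{ij} - \bar{x}_i)^{T} (x_{ij} - \bar{x}_i)
```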

Page 12:

Micro-aggregation problem (k-micro-aggregation problem)

Definition: given a data set D with n elements, obtain a k-partition of D whose homogeneity is maximized, i.e. whose SSE is minimized (the SSE value should be small).

k: a security parameter. It determines the minimum cardinality of the subsets.

A k-partition of D is a partition whose parts each have at least k elements of D.

Example (k=3): the elements 3, 5, 4 form one group with average element 4, so they are released as 4, 4, 4.

The problem is NP-hard for multivariate data sets, so heuristic methods are used.
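The k=3 example on this slide can be checked with a few lines of Python. This is a minimal sketch for univariate values only; the function names `sse`, `is_k_partition` and `micro_aggregate` are illustrative and do not come from the slides.

```python
from statistics import mean

def sse(groups):
    """Sum of squared errors of a partition of univariate values.

    `groups` is a list of lists of numbers; each inner list is one group.
    """
    total = 0.0
    for group in groups:
        centre = mean(group)  # average element of the group
        total += sum((x - centre) ** 2 for x in group)
    return total

def is_k_partition(groups, k):
    """True if every group has at least k elements."""
    return all(len(group) >= k for group in groups)

def micro_aggregate(groups):
    """Replace every value by the average element of its group."""
    return [[mean(group)] * len(group) for group in groups]

example = [[3, 5, 4]]
print(is_k_partition(example, 3))  # True
print(micro_aggregate(example))    # [[4, 4, 4]]
print(sse(example))                # 2.0
```

A lower SSE means the released averages stay closer to the original values, which is the data-utility side of the trade-off.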

Page 13:

Multivariate Micro-Aggregation Methods

Minimum Spanning Tree Partitioning (MSTP)
Maximum Distance Method (MD)
Maximum Distance to Average Vector Method (MDAV)
Variable-MDAV

Page 14:

Minimum Spanning Tree Partitioning (MSTP)

Steps:

1. MST construction

2. Edge cutting

3. Cluster generation

Limitation: because it is founded on the MST, it fails to adapt properly to scattered data points.

Page 15:

Maximum Distance Method (MD)

The main advantage is its simplicity, and it achieves very good results on most data sets.

1. Find the two most distant records r and s (by Euclidean distance).

2. Form one group with r and its closest k-1 elements, and another with s and its closest k-1 elements.

3. Check the number of remaining elements (num):
   num >= 2k: repeat from step 1.
   k <= num <= 2k-1: form a new group.
   num <= k-1: assign each element to its closest group.

Micro-aggregated data: replace each record by the centroid of the group to which it belongs.

Shortcoming: computational complexity.

Page 16:

Maximum Distance to Average Vector Method (MDAV)

MDAV improves on MD in terms of computational complexity while maintaining its performance in terms of SSE.

MDAV is the most popular method used for micro-aggregating data sets.

Page 17:

MDAV Algorithm

Build two groups at each iteration.

When the number of remaining records RR <= 2k-1:

1. RR < k: assign each element to its closest group.

2. RR >= k: form a new group.

Page 18:

MDAV Process

Diagram: from the distance matrix, compute the centroid c; r is the record most distant from c; s is the record most distant from r.

Micro-aggregated data: replace each record by the centroid of the group to which it belongs.

Shortcoming: lack of flexibility. It only generates subsets of fixed cardinality k.
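As a rough illustration of the MDAV steps on these two slides, the sketch below builds two groups per iteration around r and s and then handles the leftover records as described. It is an assumption-laden sketch, not the authors' implementation: the helper names are made up, distances are recomputed with `math.dist` instead of a precomputed distance matrix, and ties are broken arbitrarily.

```python
import math

def centroid(points):
    """Component-wise average of a non-empty list of equal-length tuples."""
    return tuple(sum(c) / len(points) for c in zip(*points))

def farthest(ref, candidates, data):
    """Index in `candidates` whose record is most distant from point `ref`."""
    return max(candidates, key=lambda i: math.dist(data[i], ref))

def take_group(seed, remaining, data, k):
    """Remove and return `seed` plus its k-1 closest records from `remaining`."""
    others = sorted((i for i in remaining if i != seed),
                    key=lambda i: math.dist(data[i], data[seed]))
    group = [seed] + others[:k - 1]
    for i in group:
        remaining.remove(i)
    return group

def mdav(data, k):
    """Return a k-partition of `data` as lists of record indices."""
    if len(data) < k:
        raise ValueError("need at least k records")
    remaining = list(range(len(data)))
    groups = []
    while len(remaining) >= 2 * k:
        c = centroid([data[i] for i in remaining])
        r = farthest(c, remaining, data)        # most distant from the centroid
        groups.append(take_group(r, remaining, data, k))
        s = farthest(data[r], remaining, data)  # most distant from r
        groups.append(take_group(s, remaining, data, k))
    if len(remaining) >= k:
        groups.append(remaining)                # k <= RR <= 2k-1: one new group
    else:
        for i in remaining:                     # RR < k: join the closest group
            best = min(groups, key=lambda g: math.dist(
                data[i], centroid([data[j] for j in g])))
            best.append(i)
    return groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (5, 5)]
print(mdav(points, 3))
```

In this sketch "closest group" is measured against the group centroid; the slides do not say which distance is used for leftover records, so that choice is an assumption.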

Page 19:

Variable-MDAV

V-MDAV aims to overcome this limitation by computing a variable-size k-partition at a computational cost similar to that of MDAV.

Page 20:

V-MDAV Process

1. Compute the distance matrix and the centroid c.

2. Check the number of remaining records RR:
   RR >= k: form a group around e, the record most distant from c.
   RR <= k-1: assign each element to its closest group.

3. Extend the group (by up to k-1 records): e_min is the closest unassigned record to the group, at distance d_in; d_out is the distance from e_min to its own closest unassigned record. If d_in < γ*d_out, assign e_min to the current group.

MDAV is the most popular method, so the authors use it as the reference for comparison.

Page 22:

Coding sequence
Initializing the population
The fitness function
Selection scheme and genetic operators (crossover & mutation)

Page 23:

Coding Sequence

Binary codings: 0 1 1 0 1 0 0 1 1 0 ….

N-ary codings: B A D F E A C C B F ….

Real-valued codings: 2 3 2.3 1.9 5 3.4 4.5 2.7 2 3.1 ….

Page 24:

Univariate vs. Multivariate

Univariate micro-aggregation: binary codings

Data set: 3 25 1 6 9 8 4 5 10 11 20 17

Sorted data set: 1 3 4 5 6 8 9 10 11 17 20 25

Binary codings may be: 0 0 0 0 1 1 0 0 1 0 0 0

However, there is no way of sorting multivariate records without giving a higher priority to one of the attributes.

Page 25:

Univariate vs. Multivariate (cont.)

Multivariate micro-aggregation: N-ary codings

Maximum number of groups: G = ⌊n/k⌋

Each symbol represents one group of the k-partition.

Chromosome length: the number of records in the data set.

The i-th gene value → the group of the k-partition to which the i-th record in the data set belongs.

Page 26:

Example

n = 11, k = 3, G = ⌊11/3⌋ = 3

3-character alphabet: A, B, C

Chromosome length: 11

Chromosome: A A A B B B C C C A A

3-partition: group A = {1,2,3,10,11}, group B = {4,5,6}, group C = {7,8,9}
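The N-ary coding can be decoded mechanically: the i-th gene names the group of the i-th record. The sketch below assumes the chromosome A A A B B B C C C A A, which is the one implied by the 3-partition listed on this slide; the function names are illustrative.

```python
from collections import defaultdict

def max_groups(n, k):
    """Maximum number of groups when every group needs at least k records."""
    return n // k

def decode(chromosome):
    """Map each group symbol to the 1-based indices of its records."""
    groups = defaultdict(list)
    for index, symbol in enumerate(chromosome, start=1):
        groups[symbol].append(index)
    return dict(groups)

chromosome = "AAABBBCCCAA"
print(max_groups(11, 3))   # 3
print(decode(chromosome))  # {'A': [1, 2, 3, 10, 11], 'B': [4, 5, 6], 'C': [7, 8, 9]}
```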

Page 27:

Initializing the Population

The population is generally initialized at random. With n records and G different alphabet symbols there are G^n possible chromosomes.

But only a small fraction of them meets the cardinality constraints.

“In an optimal k-partition, each group has between k and 2k-1 records.” (Domingo & Mateo)

Minimum number of groups: since each group has at most 2k-1 records, at least ⌈n/(2k-1)⌉ groups are needed.

Page 28:

Initializing the Population (cont.)

Random initialization is not suitable for obtaining candidate optimal k-partitions.

So the cardinality constraints must be embedded in the initialization procedure. → Algorithm 2

Algorithm 2 guarantees that each group (part) has at most 2k-1 elements.
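Algorithm 2 itself is not reproduced in the transcript. As a hypothetical sketch of the idea, the function below draws genes at random but stops assigning records to a group once it holds 2k-1 of them. The name `init_chromosome`, the use of integer group labels and the choice of ⌊n/k⌋ groups are assumptions; like the slide, the sketch only enforces the upper bound.

```python
import random
from collections import Counter

def init_chromosome(n, k, rng=random):
    """Random N-ary chromosome whose groups never exceed 2k-1 records."""
    num_groups = n // k  # assumed: the maximum number of groups
    if num_groups == 0:
        raise ValueError("need at least k records")
    capacity = {g: 2 * k - 1 for g in range(num_groups)}
    chromosome = []
    for _ in range(n):
        # Only groups that still have room are candidates for this record.
        open_groups = [g for g, room in capacity.items() if room > 0]
        g = rng.choice(open_groups)
        capacity[g] -= 1
        chromosome.append(g)
    return chromosome

sample = init_chromosome(11, 3)
print(sample, Counter(sample))
```

Because ⌊n/k⌋ groups of capacity 2k-1 can always hold n records when n >= k, the list of open groups is never empty.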

Page 29:

The Fitness Function

SSE gives a measure of the homogeneity of the groups in the k-partition represented by a given chromosome.

The goal is to minimize SSE. Thus, the fitness value of a chromosome is computed from its SSE, where s is the total number of groups and n_i is the number of records in the i-th group.

Chromosomes that encode a non-optimal k-partition are penalized.
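The exact fitness formula is missing from the transcript, so the following is only a sketch under stated assumptions: fitness is taken as 1/(1+SSE) so that a lower SSE gives a higher fitness, and each group whose size falls outside [k, 2k-1] multiplies the fitness by a hypothetical `penalty` factor. Neither choice is confirmed by the slides.

```python
def sse_of(records, chromosome):
    """Return (SSE, groups) for records labelled by an N-ary chromosome."""
    groups = {}
    for record, gene in zip(records, chromosome):
        groups.setdefault(gene, []).append(record)
    total = 0.0
    for members in groups.values():
        centre = [sum(col) / len(members) for col in zip(*members)]
        total += sum((x - c) ** 2
                     for rec in members for x, c in zip(rec, centre))
    return total, groups

def fitness(records, chromosome, k, penalty=0.5):
    """Higher is better. ASSUMPTION: 1/(1+SSE) with a multiplicative penalty."""
    total, groups = sse_of(records, chromosome)
    value = 1.0 / (1.0 + total)
    for members in groups.values():
        if not k <= len(members) <= 2 * k - 1:
            value *= penalty  # group violates the cardinality constraints
    return value

records = [(3,), (5,), (4,)]
print(fitness(records, "AAA", k=3))  # one valid group, SSE = 2
print(fitness(records, "AAB", k=3))  # two undersized groups, penalized twice
```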

Page 30:

Selection Scheme and Genetic Operators

Selection scheme: roulette-wheel selection

Genetic operators: one-point crossover and mutation
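The three operators named on this slide can be sketched generically as follows. These are textbook versions, not necessarily the exact variants used by the authors; in particular, the mutation shown here (replace a gene by a random symbol) is an assumption.

```python
import random

def roulette_select(population, fitnesses, rng=random):
    """Pick one chromosome with probability proportional to its fitness."""
    return rng.choices(population, weights=fitnesses, k=1)[0]

def one_point_crossover(parent_a, parent_b, point):
    """Swap the tails of two equal-length chromosomes after `point`."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

def mutate(chromosome, alphabet, rate, rng=random):
    """Replace each gene by a random symbol with probability `rate`."""
    return [rng.choice(alphabet) if rng.random() < rate else gene
            for gene in chromosome]

a, b = list("AAABBB"), list("CCCAAA")
child_1, child_2 = one_point_crossover(a, b, 2)
print("".join(child_1), "".join(child_2))  # AACAAA CCABBB
```

Note that crossover and mutation can produce chromosomes that violate the cardinality constraints, which is where a penalty in the fitness function becomes relevant.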

Page 32:

A Hybrid Approach

GA: good SSE, but low performance on very large data sets.

MDAV: adapts to very large data sets, but is worse than GA in terms of SSE.

Hybrid approach:

1. Good SSE
2. Adapts to very large data sets

Name: Two-step partitioning

Page 33:

Two-step partitioning

k → small value

K → larger than k, with K % k = 0; small enough to be suitable for the GA

Ex: k=3; K=21

1. Use MDAV to build a 3-partition.

2. Use MDAV to build macro-groups (sets of average vectors) of size K/k (21/3=7).

3. Replace the average vectors by the k original records each one stands for, which gives a K-partition.

4. Finally, apply the GA to each macro-group in the K-partition in order to generate an optimal or near-optimal k-partition of the macro-group.
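The bookkeeping in the k=3, K=21 example can be sketched as below: seven average vectors per macro-group, each later replaced by the three original records it stands for. The function names are hypothetical, and the MDAV and GA steps themselves are left out.

```python
def macro_group_size(k, K):
    """Number of k-groups merged into one macro-group."""
    if K <= k or K % k != 0:
        raise ValueError("K must be larger than k and divisible by k")
    return K // k

def expand_macro_groups(k_groups, macro_groups):
    """Replace each average vector (given by the index of its k-group)
    with the original record indices of that group."""
    return [[rec for g in macro for rec in k_groups[g]]
            for macro in macro_groups]

k, K = 3, 21
# Seven groups of three record indices each, as a stand-in for the 3-partition.
k_groups = [[3 * g, 3 * g + 1, 3 * g + 2] for g in range(7)]
macro_groups = [list(range(macro_group_size(k, K)))]  # one macro-group of 7 vectors
expanded = expand_macro_groups(k_groups, macro_groups)
print(macro_group_size(k, K), len(expanded[0]))  # 7 21
```

Each expanded macro-group then holds K records, which is the size the GA is asked to partition.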

Page 34:

One-step MDAV vs. two-step MDAV

Better

Page 36:

Experiment

Approaches: GA-based micro-aggregation; hybrid micro-aggregation

Comparison with MDAV and ES (exhaustive search). ES is only possible with tiny data sets of up to 11 elements.

Data sets:
1. The example data set (Table 1)
2. Small data sets
3. Real and large data sets

Each experiment consists of 12,100 runs of the GA.

Mutation rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 → 11 values

Crossover rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 → 11 values

Population size: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 → 10 values

The GA was run 10 times for each parameter setting.
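The figure of 12,100 runs follows from the parameter grid: 11 x 11 x 10 settings, each run 10 times. A short check, with illustrative variable names:

```python
from itertools import product

rates = [i / 10 for i in range(11)]          # 0, 0.1, ..., 1
population_sizes = list(range(10, 101, 10))  # 10, 20, ..., 100
repetitions = 10

# One setting = (mutation rate, crossover rate, population size)
settings = list(product(rates, rates, population_sizes))
total_runs = len(settings) * repetitions
print(len(settings), total_runs)  # 1210 12100
```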

Page 37:

Results for the Running Example

GA running time depends on the number of generations.

Most of the tests converge in less than 5,000 iterations.

Although MDAV is faster, the SSE obtained with the GA is better. (90% →14.82)

Page 38:

Results in Small Data Sets

The mutation rate should be low, e.g. 0.1.

The GA-based approach cannot deal with large data sets.

Same!!

Page 39:

Results in Real and Large Data Sets

Use the hybrid technique.

Data set sizes (records x attributes): 1000 x 2, 1000 x 2, 1080 x 13, 4092 x 11

Better

Page 41:

Conclusions and Future Challenges

The reported experimental results demonstrate the usefulness of the proposed methods and open the door to an invigorating research line.

Lots of questions remain open:

Look for better codings.
Test the efficiency of other selection algorithms.
Evaluate the importance of genetic operators such as multiple-point crossover or inversion.