
1

Privacy Protection with Genetic Algorithms

Presenter: 林惠珍

(Applying genetic algorithms to privacy protection)

2

Outline

Introduction

Basics of Micro-Aggregation and Methods

Privacy Protection Through Genetic-Algorithm-Based Micro-aggregation

A Hybrid Approach

Experimental Results

Conclusions and Future Challenges

4

Privacy!!

Privacy V.S. Data utility

Data collection

Statistics

Data aggregation

Releasing

Respondent

Safe

5

Contribution

Micro-aggregation distorts the data in order to guarantee the respondents' privacy.

Optimal micro-aggregation is NP-hard, so the authors use a GA with some modifications to solve the problem.

A hybrid method is proposed for solving the above problem on large data sets.

7

SDC (Statistical Disclosure Control)

(also called Statistical Disclosure Limitation, SDL)

Data Transform

Public

Data utility

Statistical confidentiality

Respondent

Enough protection & minimal information loss

Method

Micro-aggregation

Micro-data (personal data)

Clustering problem

Cluster size!

8

Two goals for micro-aggregation

Preserving data utility.

Protecting the privacy of the respondents.

9

Preserving data utility

Introduce as little noise as possible into the data.

So we should aggregate similar elements rather than dissimilar ones.

10

Protecting the privacy of the respondents

Data have to be sufficiently modified to make re-identification difficult.

Increasing the number of aggregated elements can increase data privacy.

11

Whether two elements are similar

Similarity function

e.g., Euclidean distance

Univariate data set

Number of elements in D_uni

The i-th element in D_uni

Average element

Multivariate data set

Number of dimensions of each element

The j-th component of the average element

The j-th component of the i-th element in D_multi

Multiple subsets

Number of subsets

Number of elements in the i-th subset

The j-th element in the i-th subset

The average element of the i-th subset
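The formulas on this slide did not survive; only their annotations remain. A plausible reconstruction, assuming the slide showed the usual sum of squared errors (SSE) around the average element, is given below. The symbol names (x_i, \bar{x}, n, d, s, n_i) are assumptions chosen to match the annotations and are not taken from the original slide.

```latex
% Univariate data set D_uni with n elements
\[ SSE = \sum_{i=1}^{n} (x_i - \bar{x})^2 \]

% Multivariate data set D_multi with n elements of d dimensions each
\[ SSE = \sum_{i=1}^{n} \sum_{j=1}^{d} (x_{ij} - \bar{x}_j)^2 \]

% Multiple subsets: s subsets, the i-th subset holding n_i elements
\[ SSE = \sum_{i=1}^{s} \sum_{j=1}^{n_i}
         (x_{ij} - \bar{x}_i)^{T} (x_{ij} - \bar{x}_i) \]
```

In the third form x_{ij} is the j-th element of the i-th subset and \bar{x}_i is that subset's average element, so a smaller SSE means more homogeneous subsets.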

12

Micro-aggregation problem (k-micro-aggregation problem)

Definition

Given a data set D (n elements) and a parameter k, obtain a k-partition of D whose homogeneity is maximized, i.e., whose SSE is as small as possible.

k: a security parameter. It determines the minimum cardinality of the subsets.

A k-partition of D is a partition whose parts each have at least k elements of D.

Ex (k = 3): the group {3, 5, 4} has average element 4, so its records are released as 4, 4, 4.

The problem is NP-hard for multivariate data sets.

Use heuristic methods!!
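As a small illustration of the definition, the sketch below computes the SSE of a partition and checks the k-partition condition on the k = 3 example. The function names (`centroid`, `sse`, `is_k_partition`) are hypothetical and chosen for this example only.

```python
def centroid(points):
    """Component-wise average of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))


def sse(groups):
    """Sum of squared distances from every record to its group average."""
    total = 0.0
    for group in groups:
        c = centroid(group)
        for p in group:
            total += sum((x - cj) ** 2 for x, cj in zip(p, c))
    return total


def is_k_partition(groups, k):
    """True when every part holds at least k records."""
    return all(len(group) >= k for group in groups)


group = [(3.0,), (5.0,), (4.0,)]    # the k = 3 example: average element 4
print(centroid(group))              # (4.0,)
print(sse([group]))                 # 2.0
print(is_k_partition([group], 3))   # True
```

Releasing 4, 4, 4 instead of 3, 5, 4 costs an SSE of 2; grouping less similar values would cost more.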

13

Multivariate Micro-Aggregation Methods

Minimum Spanning Tree Partitioning (MSTP)

Maximum Distance Method (MD)

Maximum Distance to Average Vector Method (MDAV)

Variable-MDAV

14

Minimum Spanning Tree Partitioning (MSTP)

Steps:

1. MST construction

2. Edge cutting

3. Cluster generation

Limitation: the weakness lies in its foundation, the MST, which fails to adapt properly to scattered data points.

15

Maximum Distance Method (MD)

The main advantage of MD is its simplicity, and it achieves very good results on most data sets.

Find the two most distant records r and s (by Euclidean distance).

Form one group with r and its closest k-1 elements, and another with s and its closest k-1 elements.

Check the number of remaining elements (num):

1. num >= 2k: repeat

2. k <= num <= 2k-1: form a new group

3. num <= k-1: assign each element to the closest group

Micro-aggregated data: replace each record by the centroid of the group to which it belongs.

Shortcoming: computational complexity
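The MD steps can be sketched as follows. This is a minimal illustration, not the reference implementation: the name `md_partition` is hypothetical, "closest group" is assumed to mean the group with the nearest centroid, and the all-pairs search for r and s is what makes the method expensive.

```python
import math
from itertools import combinations


def centroid(points):
    """Component-wise average of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))


def md_partition(records, k):
    """Group record indices following the MD steps; returns a list of groups."""
    if len(records) < k:
        raise ValueError("need at least k records")
    remaining = list(range(len(records)))
    groups = []

    def take_group(seed, reserved=None):
        # seed plus its k-1 closest remaining records (never the reserved one)
        others = sorted(
            (i for i in remaining if i not in (seed, reserved)),
            key=lambda i: math.dist(records[i], records[seed]),
        )
        group = [seed] + others[:k - 1]
        for i in group:
            remaining.remove(i)
        return group

    while len(remaining) >= 2 * k:          # case 1: repeat
        r, s = max(
            combinations(remaining, 2),     # all pairs: the costly part
            key=lambda pair: math.dist(records[pair[0]], records[pair[1]]),
        )
        groups.append(take_group(r, reserved=s))
        groups.append(take_group(s))

    if len(remaining) >= k:                 # case 2: one new group
        groups.append(remaining)
    else:                                   # case 3: attach the leftovers
        centers = [centroid([records[i] for i in g]) for g in groups]
        for i in remaining:
            nearest = min(range(len(groups)),
                          key=lambda j: math.dist(records[i], centers[j]))
            groups[nearest].append(i)
    return groups


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
print(md_partition(points, 3))
```

With these seven points and k = 3, the two clusters are recovered and the single leftover record joins the nearer group.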

16

Maximum Distance to Average Vector Method (MDAV)

MDAV improves on MD in terms of computational complexity while maintaining the performance in terms of SSE.

MDAV is the most popular method used for micro-aggregating data sets.

17

MDAV Algorithm

Build two groups at each iteration.

When the number of remaining records RR satisfies RR <= 2k-1:

1. RR < k: assign each element to the closest group

2. RR >= k: form a new group

18

MDAV Process

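Compute the centroid c of the remaining records.

Using the distance matrix, find r, the record most distant from c.

Using the distance matrix, find s, the record most distant from r.

Form one group around r and another around s, each with its closest k-1 records.

Micro-aggregated data: replace each record by the centroid of the group to which it belongs.

Shortcoming: lack of flexibility. It only generates subsets of fixed cardinality k.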
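A minimal sketch of MDAV as described on these slides is given below. The name `mdav` is hypothetical, "closest group" is assumed to mean the group with the nearest centroid, and distances are computed on demand rather than from a precomputed distance matrix.

```python
import math


def centroid(points):
    """Component-wise average of a list of equal-length tuples."""
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))


def mdav(records, k):
    """Return groups of record indices built around r and s at each iteration."""
    if len(records) < k:
        raise ValueError("need at least k records")
    remaining = list(range(len(records)))
    groups = []

    def take_group(seed):
        # seed plus its k-1 closest remaining records
        others = sorted(
            (i for i in remaining if i != seed),
            key=lambda i: math.dist(records[i], records[seed]),
        )
        group = [seed] + others[:k - 1]
        for i in group:
            remaining.remove(i)
        return group

    while len(remaining) >= 2 * k:           # build two groups per iteration
        c = centroid([records[i] for i in remaining])
        r = max(remaining, key=lambda i: math.dist(records[i], c))
        groups.append(take_group(r))
        # s: the record most distant from r among those still unassigned
        s = max(remaining, key=lambda i: math.dist(records[i], records[r]))
        groups.append(take_group(s))

    if len(remaining) >= k:                  # RR >= k: a new group
        groups.append(remaining)
    else:                                    # RR < k: join the closest group
        centers = [centroid([records[i] for i in g]) for g in groups]
        for i in remaining:
            nearest = min(range(len(groups)),
                          key=lambda j: math.dist(records[i], centers[j]))
            groups[nearest].append(i)
    return groups


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10)]
print(mdav(points, 3))
```

Compared with MD, each iteration needs only distances to one reference point (c, then r) instead of a search over all pairs, which is where the lower computational cost comes from.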

19

Variable-MDAV

V-MDAV intends to overcome the limitation by computing a variable-size k-partition with a computational cost similar to the MDAV cost.

20

V-MDAV Process

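Compute the distance matrix and the centroid c.

Check the number of remaining records (RR):

1. RR >= k: form groups

2. RR <= k-1: assign each element to the closest group

To form a group, take e, the record most distant from c, together with its closest k-1 records.

Extend the group (by up to k-1 records):

Let e_min be the unassigned record closest to the current group, at distance d_in.

Let d_out be the distance from e_min to its closest unassigned record.

If (d_in < γ*d_out), assign e_min to the current group.

MDAV is the most popular method, so the authors use it as the reference for comparison.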
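The group-extension test can be sketched as below. This is an illustrative reading of the slide, with assumptions: `extend_group` is a hypothetical name, d_in is taken as the distance from e_min to its nearest record inside the group, and d_out as the distance from e_min to its nearest other unassigned record.

```python
import math


def extend_group(records, group, unassigned, k, gamma):
    """Try to add up to k-1 nearby unassigned records to an existing group."""
    group = list(group)
    unassigned = list(unassigned)
    for _ in range(k - 1):
        if len(unassigned) < 2:      # d_out needs another unassigned record
            break
        # e_min: unassigned record closest to any member of the group
        e_min, d_in = min(
            ((u, min(math.dist(records[u], records[g]) for g in group))
             for u in unassigned),
            key=lambda pair: pair[1],
        )
        # d_out: distance from e_min to its closest unassigned neighbour
        d_out = min(math.dist(records[e_min], records[u])
                    for u in unassigned if u != e_min)
        if d_in < gamma * d_out:
            group.append(e_min)
            unassigned.remove(e_min)
        else:
            break
    return group, unassigned


data = [(0,), (1,), (2,), (2.5,), (10,), (11,), (12,)]
print(extend_group(data, [0, 1, 2], [3, 4, 5, 6], k=3, gamma=0.2))
# ([0, 1, 2, 3], [4, 5, 6])
```

Record 3 sits 0.5 from the group but 7.5 from any other unassigned record, so it is absorbed; record 4 is closer to its unassigned neighbours than to the group, so the extension stops. With γ = 0 no record is ever added and the groups keep size k.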

21

Outline

Introduction

Basics of Micro-Aggregation and Methods

Privacy Protection Through Genetic-Algorithm-Based Micro-aggregation

A Hybrid Approach

Experimental Results

Conclusions and Future Challenges

22

Coding sequence

Initializing the population

The fitness function

Selection scheme and genetic operators (crossover & mutation)

23

Coding Sequence

Binary codings: 0 1 1 0 1 0 0 1 1 0 ....

N-ary codings: B A D F E A C C B F ....

Real-valued codings: 2 3 2.3 1.9 5 3.4 4.5 2.7 2 3.1 ....

24

Univariate V.S. Multivariate

Univariate micro-aggregation: binary codings

Data set: 3 25 1 6 9 8 4 5 10 11 20 17

Sorted data set: 1 3 4 5 6 8 9 10 11 17 20 25

Binary codings may be: 0 0 0 0 1 1 0 0 1 0 0 0

But there is no way of sorting multivariate records without giving a higher priority to one of the attributes.

25

Univariate V.S. Multivariate (cont.)

Multivariate micro-aggregation: N-ary codings

Maximum number of groups: G = ⌊n/k⌋

Each symbol represents one group of the k-partition.

Chromosome length: the number of records in the data set

The i-th gene value → the group of the k-partition to which the i-th record in the data set belongs

26

Example

n = 11, k = 3, G = ⌊11/3⌋ = 3

3-character alphabet: A, B, C

Chromosome length: 11

A A A B B B C C C A A

3-partition: group A = {1,2,3,10,11}, group B = {4,5,6}, group C = {7,8,9}
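Decoding an N-ary chromosome into its groups can be sketched as below, using the chromosome that corresponds to the 3-partition listed above. The helper names `decode` and `satisfies_cardinality` are hypothetical.

```python
def decode(chromosome):
    """Map each gene symbol to the (1-based) record numbers that carry it."""
    groups = {}
    for record, symbol in enumerate(chromosome, start=1):
        groups.setdefault(symbol, []).append(record)
    return groups


def satisfies_cardinality(groups, k):
    """True when every group has between k and 2k-1 records."""
    return all(k <= len(members) <= 2 * k - 1 for members in groups.values())


groups = decode("AAABBBCCCAA")
print(groups)
# {'A': [1, 2, 3, 10, 11], 'B': [4, 5, 6], 'C': [7, 8, 9]}
print(satisfies_cardinality(groups, 3))   # True
```

Group A holds five records, which is the largest size (2k-1 = 5) still allowed for k = 3.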

27

Initializing the Population

The population is generally initialized at random.

With n records and G different alphabet symbols there are G^n possible chromosomes.

But only a small fraction meets the cardinality constraints.

“In an optimal k-partition, each group has between k and 2k-1 records.” (Domingo & Mateo)

Since no group exceeds 2k-1 records, the minimum number of groups is ⌈n/(2k-1)⌉.

28

Initializing the Population (cont.)

Random initialization is not suitable for obtaining candidate optimal k-partitions.

So the cardinality constraints must be embedded in the initialization procedure. → Algorithm 2

This guarantees that each group (part) has at most 2k-1 elements.
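Algorithm 2 itself is not reproduced in these slides. One way to embed the upper cardinality bound in the initialization, offered only as a sketch under that caveat, is to draw each gene at random from the groups that still have room. The name `init_chromosome` and the use of integer group labels are assumptions.

```python
import random


def init_chromosome(n, k, rng):
    """Random chromosome of length n in which no group exceeds 2k-1 records."""
    n_groups = n // k                  # G, the maximum number of groups
    capacity = 2 * k - 1               # upper bound on the group size
    if n_groups < 1:
        raise ValueError("need at least k records")
    counts = [0] * n_groups
    chromosome = []
    for _ in range(n):
        # only groups that are not yet full may receive this record
        open_groups = [g for g in range(n_groups) if counts[g] < capacity]
        g = rng.choice(open_groups)
        counts[g] += 1
        chromosome.append(g)
    return chromosome


rng = random.Random(42)                # fixed seed for a repeatable example
print(init_chromosome(11, 3, rng))
```

This only enforces the "at most 2k-1" side; groups smaller than k can still appear and are left to the penalty in the fitness function.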

29

The Fitness Function

Obtain a measure of the homogeneity of the groups in the k-partition represented by a given chromosome through SSE.

The goal is to minimize SSE, so the fitness value of a chromosome is computed from the SSE of its k-partition.

s: total number of groups; n_i: number of records in the i-th group

Chromosomes that encode a non-optimal k-partition are penalized.

30

Selection Scheme and Genetic Operators

Selection scheme: roulette-wheel selection

Genetic operators: one-point crossover and mutation
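The three named operators can be sketched generically as below. These are textbook versions acting on list chromosomes with integer group labels. The function names and the per-gene mutation scheme are assumptions, since the slides name the operators but do not show their exact form.

```python
import random


def roulette_wheel(population, fitnesses, rng):
    """Pick one chromosome with probability proportional to its fitness."""
    spin = rng.uniform(0, sum(fitnesses))
    accumulated = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        accumulated += fitness
        if spin < accumulated:
            return chromosome
    return population[-1]              # guard against rounding at the edge


def one_point_crossover(a, b, rng):
    """Swap the tails of two parents after a random cut point."""
    point = rng.randint(1, len(a) - 1)
    return a[:point] + b[point:], b[:point] + a[point:]


def mutate(chromosome, n_groups, rate, rng):
    """Replace each gene by a random group label with probability `rate`."""
    return [rng.randrange(n_groups) if rng.random() < rate else gene
            for gene in chromosome]


rng = random.Random(0)
parent_a = [0, 0, 0, 1, 1, 1]
parent_b = [2, 2, 2, 0, 0, 0]
child_a, child_b = one_point_crossover(parent_a, parent_b, rng)
print(child_a, child_b)
print(mutate(child_a, n_groups=3, rate=0.1, rng=rng))
```

Crossover and mutation can both break the cardinality constraints, which is one reason the fitness function penalizes such chromosomes.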

32

A Hybrid Approach

GA: good SSE, but low performance on very large data sets

MDAV: adapts to very large data sets, but worse than GA in terms of SSE

Hybrid approach

1. Good SSE

2. Adapts to very large data sets

Name: Two-step partitioning

33

Two-step partitioning

k → a small value

K → larger than k, with K % k = 0; small enough to be suitable for the GA

Ex: k = 3; K = 21

1. Use MDAV to build a 3-partition and take the average vector of each group.

2. Use MDAV to build macro-groups (sets of average vectors) of size K/k (21/3 = 7).

3. Replace the average vectors by the k original records they represent, which gives a K-partition.

4. Finally, apply the GA to each macro-group in the K-partition in order to generate an optimal or near-optimal k-partition of the macro-group.
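The first three steps of two-step partitioning can be sketched as below; the GA step is left out. This is an assumption-laden illustration: `mdav` is a compact version of the method as described earlier in these slides, `two_step_macro_groups` is a hypothetical name, and "closest group" is taken to mean nearest centroid.

```python
import math


def centroid(points):
    dim = len(points[0])
    return tuple(sum(p[j] for p in points) / len(points) for j in range(dim))


def mdav(points, k):
    """Compact MDAV sketch: returns groups of indices into `points`."""
    remaining, groups = list(range(len(points))), []

    def take_group(seed):
        others = sorted((i for i in remaining if i != seed),
                        key=lambda i: math.dist(points[i], points[seed]))
        group = [seed] + others[:k - 1]
        for i in group:
            remaining.remove(i)
        return group

    while len(remaining) >= 2 * k:
        c = centroid([points[i] for i in remaining])
        r = max(remaining, key=lambda i: math.dist(points[i], c))
        groups.append(take_group(r))
        s = max(remaining, key=lambda i: math.dist(points[i], points[r]))
        groups.append(take_group(s))
    if len(remaining) >= k:
        groups.append(remaining)
    else:
        centers = [centroid([points[i] for i in g]) for g in groups]
        for i in remaining:
            j = min(range(len(groups)),
                    key=lambda j: math.dist(points[i], centers[j]))
            groups[j].append(i)
    return groups


def two_step_macro_groups(records, k, K):
    """Steps 1-3: k-partition, group the averages, map back to records."""
    if K % k != 0 or K <= k:
        raise ValueError("K must be a multiple of k and larger than k")
    small = mdav(records, k)                                  # step 1
    averages = [centroid([records[i] for i in g]) for g in small]
    macro = mdav(averages, K // k)                            # step 2
    # step 3: replace each average vector by its original records
    return [[i for g in m for i in small[g]] for m in macro]


data = [(0.0,), (1.0,), (2.0,), (3.0,), (100.0,), (101.0,), (102.0,), (103.0,)]
print(two_step_macro_groups(data, k=2, K=4))
```

Each returned macro-group would then be handed to the GA separately, so the GA only ever sees about K records at a time.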

34

One-step MDAV V.S. Two-step MDAV

Better

36

Experiment

Approaches:

GA-based micro-aggregation

Hybrid micro-aggregation

Comparison with MDAV and ES (exhaustive search). ES is only possible with tiny data sets of up to 11 elements.

Data sets:

1. The example data set (Table 1)

2. Small data sets

3. Real and large data sets

Each experiment consists of 12,100 runs of the GA.

Mutation rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 → 11 values

Crossover rate: 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1 → 11 values

Population size: 10, 20, 30, 40, 50, 60, 70, 80, 90, 100 → 10 values

The GA was run 10 times for each parameter setting.
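The figure of 12,100 runs follows directly from the parameter grid, as this short check shows (the variable names are illustrative only):

```python
from itertools import product

mutation_rates = [i / 10 for i in range(11)]      # 0, 0.1, ..., 1
crossover_rates = [i / 10 for i in range(11)]     # 0, 0.1, ..., 1
population_sizes = list(range(10, 101, 10))       # 10, 20, ..., 100
runs_per_setting = 10

settings = list(product(mutation_rates, crossover_rates, population_sizes))
total_runs = len(settings) * runs_per_setting

print(len(settings))   # 1210
print(total_runs)      # 12100
```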

37

Results for the Running Example

GA running time depends on the number of generations.

Most of the tests converge in fewer than 5,000 iterations.

Although MDAV is faster, the SSE obtained with the GA is better. (90% →14.82)

38

Results in Small Data Sets

Mutation rate should be low. Ex： 0.1

The GA-based approach cannot deal with large data sets.

Same!!

39

Results in Real and Large Data Sets

Use the hybrid technique.

Data set sizes (records x attributes): 1000 x 2, 1000 x 2, 1080 x 13, 4092 x 11

Better

41

Conclusions and Future Challenges

The reported experimental results demonstrate the usefulness of the proposed methods and open the door to an invigorating research line.

Lots of questions remain open:

Look for better codings.

Test the efficiency of other selection algorithms.

Evaluate the importance of genetic operators such as multiple-point crossover or inversion.
