Post on 04-Jan-2016

A Study on Measuring Distance between Two Trees

Advisor: Prof. 阮夙姿

Presenter: 林陳輝

CSIE, National Chi Nan University 2

Outline

• Introduction
• Problem definition
• Related work
• The metric and algorithms
  - Mixture distance (basic algorithm, modified algorithm)
  - Mixture-matching distance
• Conclusions and future work

Introduction

Evolutionary tree

Comparing trees

Comparing trees is not easy

- "Phylogenetic tree," Wikipedia

Mixture tree

[Figure: a mixture tree. Axis labels: taxa, time.]

S.-C. Chen and B. G. Lindsay, “Building Mixture Trees from Binary Sequence Data,” Biometrika, 2006.

Problem definition

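[Figure: a mixture tree with leaves A to H and internal nodes v1 to v7. The root has time parameter 11, the two nodes below it have 9 and 8, and the four lowest internal nodes have 1, 3, 5, 7.]

• The leaves are associated with taxa.

• Every internal node carries a time parameter.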
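The two properties above suggest a simple data structure. The following is a minimal sketch, not the presenter's implementation: the class `Node` and the helpers `leaf`, `internal`, and `leaves` are hypothetical names, and the placement of the time parameters 9, 8 and 1, 3, 5, 7 on particular nodes is assumed from the reading order of the figure.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """A mixture-tree node: a leaf stores a taxon, an internal node stores a time."""
    taxon: Optional[str] = None
    time: Optional[float] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None

    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaf(taxon: str) -> Node:
    return Node(taxon=taxon)


def internal(time: float, left: Node, right: Node) -> Node:
    return Node(time=time, left=left, right=right)


def leaves(node: Node) -> List[str]:
    """Taxa below a node, listed left to right."""
    if node.is_leaf():
        return [node.taxon]
    return leaves(node.left) + leaves(node.right)


# Assumed layout of the example: root 11, then 9 and 8, then 1, 3, 5, 7.
tree = internal(
    11,
    internal(9, internal(1, leaf("A"), leaf("B")), internal(3, leaf("C"), leaf("D"))),
    internal(8, internal(5, leaf("E"), leaf("F")), internal(7, leaf("G"), leaf("H"))),
)
print(leaves(tree))
```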


Related work

Path difference metric

$d_p(T_1, T_2) = \lVert d(T_1) - d(T_2) \rVert_2$

$d(T_i)$ is the vector that contains the distances between all pairs of leaves of $T_i$.

M. A. Steel and D. Penny, “Distributions of Tree Comparison Metrics – Some New Results,” Syst. Biol. 42(2):126-141, 1993.
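As an illustration of the formula, here is a small sketch that builds the pairwise leaf-distance vectors and takes the Euclidean norm of their difference. It assumes rooted binary trees written as nested tuples with string leaves, and it measures the distance between two leaves as the number of edges on the path through their lowest common ancestor. The function names and the two sample trees are hypothetical.

```python
import math
from itertools import combinations


def leaf_paths(tree, path=()):
    """Map each leaf name to its sequence of branch choices from the root."""
    if isinstance(tree, str):
        return {tree: path}
    paths = {}
    for index, child in enumerate(tree):
        paths.update(leaf_paths(child, path + (index,)))
    return paths


def pair_distances(tree):
    """Number of edges between every pair of leaves, keyed by the sorted pair."""
    paths = leaf_paths(tree)
    distances = {}
    for x, y in combinations(sorted(paths), 2):
        px, py = paths[x], paths[y]
        common = 0
        while common < min(len(px), len(py)) and px[common] == py[common]:
            common += 1
        distances[(x, y)] = (len(px) - common) + (len(py) - common)
    return distances


def path_difference(t1, t2):
    """Euclidean norm of the difference of the two leaf-distance vectors."""
    d1, d2 = pair_distances(t1), pair_distances(t2)
    return math.sqrt(sum((d1[pair] - d2[pair]) ** 2 for pair in d1))


balanced = (("A", "B"), ("C", "D"))
caterpillar = ("A", ("B", ("C", "D")))
print(path_difference(balanced, caterpillar))
```

For these two sample trees the distance vectors differ by 1 on the pairs AB, BC, and BD, so the result is the square root of 3.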

Related work

Nodal metric

$\text{Distance} = \sum_{x, y} \lvert D_{T_1}(x, y) - D_{T_2}(x, y) \rvert$, for $x, y$ leaves.

In full binary trees, the complexity is $O(n^3)$.

In complete binary trees, the complexity is $O(n^2 \log n)$.

John Bluis and Dong-Guk Shin, “Nodal Distance Algorithm: Calculating a Phylogenetic Tree Comparison Metric,” Proc. of the 3rd IEEE Symposium on BioInformatics and BioEngineering, 87-94, 2003.

Related work

Matching distance: P. W. Diaconis and S. P. Holmes, “Matchings and Phylogenetic Trees,” Proc. Natl. Acad. Sci. USA, Vol. 95, No. 25, pp. 14600-14602, 1998.

The algorithm for matching distance: G. Valiente, “A Fast Algorithmic Technique for Comparing Large Phylogenetic Trees,” SPIRE, pp. 370-375, 2005.

Matching Representation

[Figure: a binary tree with leaves labeled 1 to 6 and internal nodes labeled 7 to 11. Each pair of sibling nodes forms one pair of the matching.]

{1,2} {5,6} {3,7} {4,8} {9,10}

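Matching distance

T1: {1,2} {5,6} {3,7} {4,8} {9,10}

T2: {1,3} {4,6} {2,7} {5,8} {9,10}

The distance is 2: swapping labels 2 and 3, then labels 4 and 5, turns the first matching into the second.

[Figure: the trees T1 and T2 that correspond to the two matchings, each with leaves 1 to 6 and internal nodes 7 to 11.]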
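One way to compute the number of transpositions is sketched below. It is not the algorithm from the cited papers. It relies on a standard counting argument: the union of two perfect matchings on the same labels splits into alternating cycles, and with n pairs and c cycles the minimum number of transpositions is n - c. The function name `matching_distance` is hypothetical.

```python
def matching_distance(m1, m2):
    """Minimum number of label transpositions that turn matching m1 into m2."""
    partner1, partner2 = {}, {}
    for a, b in m1:
        partner1[a], partner1[b] = b, a
    for a, b in m2:
        partner2[a], partner2[b] = b, a
    if partner1.keys() != partner2.keys():
        raise ValueError("both matchings must cover the same labels")

    seen = set()
    cycles = 0
    for start in partner1:
        if start in seen:
            continue
        cycles += 1
        node = start
        # Walk the cycle, alternating one edge of m1 with one edge of m2.
        while node not in seen:
            seen.add(node)
            mate = partner1[node]
            seen.add(mate)
            node = partner2[mate]
    return len(m1) - cycles


t1 = [(1, 2), (5, 6), (3, 7), (4, 8), (9, 10)]
t2 = [(1, 3), (4, 6), (2, 7), (5, 8), (9, 10)]
print(matching_distance(t1, t2))  # 2
```

On the two matchings from the slide there are five pairs and three cycles, which gives the distance 2 stated above.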


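Mixture distance and algorithms

Definition:

$\text{Distance} = \sum_{x, y} \lvert p_{T_1}(x, y) - p_{T_2}(x, y) \rvert$, for $x, y$ leaves.

$p_{T_i}(x, y)$ is the time parameter of the LCA (lowest common ancestor) of leaves $x, y$ in $T_i$.

[Figure: two trees on leaves A, B, C, D, each with internal nodes v1, v2, v3. Both have time parameter 9 at the top; the two lower internal nodes have time parameters 1 and 3 in the first tree and 2 and 3 in the second.]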

Distance conditions

The distance from an object to itself is zero.

The distance from A to B is the same as the distance from B to A.

The triangle inequality holds.

- J. Felsenstein, Inferring phylogenies. Sunderland, MA: Sinauer Associates, 2004.

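Algorithm

A direct computation compares all $C(n, 2)$ pairs of leaves.

Algorithmic idea: grouping

Full binary tree

[Figure: two trees on leaves A, B, C, D. In one tree the LCA of A and B has time parameter 1, the LCA of C and D has time parameter 3, and the root has time parameter 9. In the other tree the root (time parameter 8) separates A from the rest, the next node (time parameter 4) separates B from C and D, and the LCA of C and D has time parameter 1.]

AB: |8 – 1| = 7

AC: |8 – 9| = 1

AD: |8 – 9| = 1

BC: |4 – 9| = 5

BD: |4 – 9| = 5

CD: |1 – 3| = 2

Distance = 21

$\text{Distance} = \sum_{x, y} \lvert p_{T_1}(x, y) - p_{T_2}(x, y) \rvert$, for $x, y$ leaves.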
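The pairwise computation on the four-leaf example can be reproduced with a short sketch. It assumes a tree is written as a nested tuple `(time, left, right)` with leaf names as strings. The names `lca_times` and `mixture_distance_naive` are hypothetical, and the two trees are transcribed from the time parameters used in the six lines above.

```python
def lca_times(tree):
    """Return (leaves, table), where table maps each leaf pair to its LCA time."""
    if isinstance(tree, str):
        return [tree], {}
    time, left, right = tree
    left_leaves, table = lca_times(left)
    right_leaves, right_table = lca_times(right)
    table.update(right_table)
    # Every pair split between the two children has this node as its LCA.
    for x in left_leaves:
        for y in right_leaves:
            table[frozenset((x, y))] = time
    return left_leaves + right_leaves, table


def mixture_distance_naive(t1, t2):
    """Sum of |p_T1(x, y) - p_T2(x, y)| over all leaf pairs."""
    leaves1, p1 = lca_times(t1)
    leaves2, p2 = lca_times(t2)
    if sorted(leaves1) != sorted(leaves2):
        raise ValueError("both trees must have the same leaves")
    return sum(abs(p1[pair] - p2[pair]) for pair in p1)


balanced = (9, (1, "A", "B"), (3, "C", "D"))
caterpillar = (8, "A", (4, "B", (1, "C", "D")))
print(mixture_distance_naive(caterpillar, balanced))  # 21
```

This version touches every pair of leaves, so it serves only as a reference against which faster algorithms can be checked.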

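Algorithm

[Figure: the example trees T1 and T2, each with leaves A to H and internal nodes v1 to v7. Time parameters in T1: v1 = 9; v2 = 7, v3 = 8; v4 to v7 = 2, 3, 4, 5. Time parameters in T2: v1 = 9; v2 = 6, v3 = 8; v4 to v7 = 1, 3, 4, 5. The leaves of T2 appear in a different left-to-right order than in T1.]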

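[Figure: the coloring step for node v1 of T1. The leaves under one child of v1 are colored red and the leaves under the other child green. Every node of T2 is annotated with the number of red and green leaves below it, for example Red:1 Green:1 at v6 and v7 and Red:2 Green:2 at v3.]

For each node of T2, the second factor counts the red-green leaf pairs whose LCA in T2 is that node.

$\lvert p_{T_1}(v_1) - p_{T_2}(v_6) \rvert \times (1 \times 1 + 0 \times 0) = \lvert 9 - 4 \rvert \times (1 \times 1 + 0 \times 0) = 5$

$\lvert p_{T_1}(v_1) - p_{T_2}(v_7) \rvert \times (0 \times 0 + 1 \times 1) = \lvert 9 - 5 \rvert \times (0 \times 0 + 1 \times 1) = 4$

$\lvert p_{T_1}(v_1) - p_{T_2}(v_3) \rvert \times (1 \times 1 + 1 \times 1) = \lvert 9 - 8 \rvert \times (1 \times 1 + 1 \times 1) = 2$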

[Figure: the coloring step for node v2 of T1. Every node of T2 is annotated with its red and green counts; several nodes have Red:0 Green:0.]

$\lvert p_{T_1}(v_2) - p_{T_2}(v_2) \rvert \times (2 \times 0 + 0 \times 0) = \lvert 7 - 6 \rvert \times (2 \times 0 + 0 \times 0) = 0$

$\lvert p_{T_1}(v_2) - p_{T_2}(v_3) \rvert \times (0 \times 1 + 0 \times 1) = \lvert 7 - 8 \rvert \times (0 \times 1 + 0 \times 1) = 0$

$\lvert p_{T_1}(v_2) - p_{T_2}(v_1) \rvert \times (2 \times 2 + 0 \times 0) = \lvert 7 - 9 \rvert \times (2 \times 2 + 0 \times 0) = 8$

Complexity analysis

For every internal node of T1, coloring all leaves needs $O(n)$.

Counting the distance in T2 needs $O(n)$.

T1 has $O(n)$ internal nodes, so the time complexity is $O(n^2)$.
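The coloring idea analyzed above can be sketched as follows. This is an illustrative version under stated assumptions, not the presenter's code: trees are nested tuples `(time, left, right)` with string leaves, all function names are hypothetical, and the two sample trees are four-leaf trees with time parameters 9, 1, 3 and 8, 4, 1. For each internal node of the first tree, the leaves of its two children are colored red and green, and one bottom-up pass over the second tree counts the red-green pairs that meet at each node.

```python
def leaves_of(tree):
    if isinstance(tree, str):
        return [tree]
    _, left, right = tree
    return leaves_of(left) + leaves_of(right)


def internal_nodes(tree):
    """Yield every internal node (time, left, right) of the tree."""
    if isinstance(tree, str):
        return
    yield tree
    yield from internal_nodes(tree[1])
    yield from internal_nodes(tree[2])


def colored_cost(t2, red, green, time1):
    """Return (red count, green count, accumulated cost) for the subtree t2."""
    if isinstance(t2, str):
        return int(t2 in red), int(t2 in green), 0
    time2, left, right = t2
    rl, gl, cost_l = colored_cost(left, red, green, time1)
    rr, gr, cost_r = colored_cost(right, red, green, time1)
    # Red-green pairs whose LCA in the second tree is exactly this node.
    pairs = rl * gr + gl * rr
    return rl + rr, gl + gr, cost_l + cost_r + abs(time1 - time2) * pairs


def mixture_distance_coloring(t1, t2):
    total = 0
    for time1, left, right in internal_nodes(t1):
        red, green = set(leaves_of(left)), set(leaves_of(right))
        total += colored_cost(t2, red, green, time1)[2]
    return total


balanced = (9, (1, "A", "B"), (3, "C", "D"))
caterpillar = (8, "A", (4, "B", (1, "C", "D")))
print(mixture_distance_coloring(balanced, caterpillar))  # 21
```

Each internal node of the first tree triggers one linear pass over the second tree, which matches the quadratic bound.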

The modified algorithm

Speed up the basic algorithm

The basic algorithm processes too much empty color information

[Figure: the coloring step for node v2 of T1 again, this time highlighting "empty color information": the nodes of T2 annotated Red:0 Green:0, which contribute nothing to the distance.]

[Figure: T2 and the subtree of T2 reconstructed for the leaves A, B, C, D. The reconstructed subtree keeps only the internal nodes v1, v3, v4, with time parameters 9, 8, 1.]

The modified algorithm

Finding the LCA in constant time with $O(n)$ preprocessing: M. A. Bender and M. Farach-Colton, “The LCA Problem Revisited,” Proc. LATIN, 2000.

2-way merge problem: R. C. T. Lee, S. S. Tseng, R. C. Chang and Y. T. Tsai, Introduction to the Design and Analysis of Algorithms. McGraw-Hill Education, 2005.
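To make the constant-time LCA query concrete, here is a sketch of the simpler Euler tour plus sparse table variant. It answers each query in constant time but needs O(n log n) preprocessing, so it is a stand-in for, not an implementation of, the O(n) preprocessing scheme in the cited paper. Trees are assumed to be nested tuples `(time, left, right)` with string leaves; `build_lca` and the sample tree are hypothetical.

```python
def build_lca(tree):
    """Preprocess the tree; return query(x, y) giving the time parameter of LCA(x, y)."""
    euler = []   # (depth, time) for each visit on the Euler tour; leaves store time None
    first = {}   # leaf name -> index of its visit in euler

    def tour(node, depth):
        if isinstance(node, str):
            first[node] = len(euler)
            euler.append((depth, None))
            return
        time, left, right = node
        euler.append((depth, time))
        tour(left, depth + 1)
        euler.append((depth, time))
        tour(right, depth + 1)
        euler.append((depth, time))

    tour(tree, 0)

    def shallower(a, b):
        return a if a[0] <= b[0] else b

    # table[j][i] is the shallowest entry in euler[i : i + 2**j].
    table = [euler]
    j = 1
    while (1 << j) <= len(euler):
        prev, half = table[-1], 1 << (j - 1)
        table.append([shallower(prev[i], prev[i + half])
                      for i in range(len(euler) - (1 << j) + 1)])
        j += 1

    def query(x, y):
        lo, hi = sorted((first[x], first[y]))
        k = (hi - lo + 1).bit_length() - 1
        return shallower(table[k][lo], table[k][hi - (1 << k) + 1])[1]

    return query


caterpillar = (8, "A", (4, "B", (1, "C", "D")))
lca_time = build_lca(caterpillar)
print(lca_time("C", "D"), lca_time("B", "C"), lca_time("A", "D"))  # 1 4 8
```

Querying a leaf against itself returns `None` in this sketch, because a leaf carries no time parameter.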

[Figure: the example trees T1 and T2 with leaves A to H and internal nodes v1 to v7. The nodes of one tree are numbered 1 to 15, consistent with a postorder traversal, so that its leaves receive the numbers 1, 2, 4, 5, 8, 9, 11, 12.]

[Figure: first step of the modified algorithm on the example trees. Each of the four lowest internal nodes holds the sorted node numbers of its two leaves.]

{1, 2} {11, 12} {5, 8} {4, 9}

v4: |1 – 2| × (1 × 1 + 0 × 0) = 1

[Figure: second step of the modified algorithm. The sorted lists are combined by 2-way merging on the way up the tree.]

{1, 2} {11, 12} {5, 8} {4, 9}

{1, 2, 11, 12} {4, 5, 8, 9}

{1, 2, 4, 5, 8, 9, 11, 12}

[Figure: the subtree reconstructed for the list {1, 2, 11, 12}, with leaves A, B, G, H and internal nodes v1, v4, v7.]

|9 – 7| × (2 × 2 + 0 × 0) = 8
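The list merging shown on this slide is the classic 2-way merge of two sorted lists. A minimal sketch, with the hypothetical name `two_way_merge`, applied to the node-number lists from the slide:

```python
def two_way_merge(a, b):
    """Merge two sorted lists into one sorted list in O(len(a) + len(b)) time."""
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    # At most one of the two lists still has elements left.
    merged.extend(a[i:])
    merged.extend(b[j:])
    return merged


left = two_way_merge([1, 2], [11, 12])
right = two_way_merge([5, 8], [4, 9])
print(left, right, two_way_merge(left, right))
```

Keeping every list sorted by node number means that a parent's list is obtained from its children's lists in time linear in their combined length.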

Complexity analysis

Reconstructing the subtree of T1 takes linear time.

Counting the distance in a reconstructed subtree of size m needs $O(m)$.

The height of a complete binary tree is $O(\log n)$.

The total complexity is $O(n \log n)$ in a complete binary tree.
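The bound can be checked with a short level-by-level sum. This is a sketch under three assumptions that the slides do not state explicitly: n is a power of two, depth k of the complete binary tree holds 2^k internal nodes with n/2^k leaves each, and handling one internal node whose subtree has m leaves costs at most cm for some constant c (merging, subtree reconstruction, and counting combined).

```latex
\begin{align*}
\text{cost at depth } k &= 2^{k} \cdot c \cdot \frac{n}{2^{k}} = c\,n, \\
\text{total cost} &= \sum_{k=0}^{\log_2 n - 1} c\,n = c\,n \log_2 n = O(n \log n).
\end{align*}
```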


Mixture-matching distance

$\text{Distance} = 1 - P_{T_n} / P_{T_m} + i$, for $P_{T_n} \le P_{T_m}$ and $n, m \in \{1, 2\}$, $n \ne m$.

$i$ is the matching distance between T1 and T2.

$P_{T_m}$ denotes the product of all time parameters in $T_m$.

[Figure: the example trees T1 and T2 with leaves A to H, with every node labeled 1 to 15 for the matching representation.]

T1: {1, 2} {3, 4} {5, 6} {7, 8} {9, 10} {11, 12} {13, 14}

T2: {1, 2} {3, 6} {4, 5} {7, 8} {9, 12} {10, 11} {13, 14}

$P_{T_1} = 60480$, $P_{T_2} = 25920$

Distance = 1 - (25920 / 60480) + 2 ≈ 2.571
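The arithmetic of this example can be reproduced with a few lines. The sketch assumes the time parameters of the two example trees (9, 7, 8, 2, 3, 4, 5 and 9, 6, 8, 1, 3, 4, 5) and takes the matching distance 2 as a given input. The function name `mixture_matching_distance` is hypothetical.

```python
import math


def mixture_matching_distance(times1, times2, matching_dist):
    """1 - (smaller product / larger product) + matching distance."""
    p1, p2 = math.prod(times1), math.prod(times2)
    ratio = min(p1, p2) / max(p1, p2)
    return 1 - ratio + matching_dist


t1_times = [9, 7, 8, 2, 3, 4, 5]   # product 60480
t2_times = [9, 6, 8, 1, 3, 4, 5]   # product 25920
distance = mixture_matching_distance(t1_times, t2_times, 2)
print(round(distance, 3))  # 2.571
```

Dividing the smaller product by the larger keeps the fractional part below 1, so the integer part of the result is the matching distance.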

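[Figure: the distance axis. A distance between 0 and 1 means the two trees have the same matching (no different leaves); a distance between i and i + 1 means the matchings differ by i transpositions.]

$\text{Distance} = 1 - P_{T_n} / P_{T_m} + i$, for $P_{T_n} \le P_{T_m}$ and $n, m \in \{1, 2\}$, $n \ne m$.

Distance = 1 - (25920 / 60480) + 2 ≈ 2.571

The time complexity is $O(n)$.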


Conclusions

Metric                      Considers                      Time complexity
                                                           Full binary tree    Complete binary tree
Path difference metric      Structure                      N/A
Nodal distance              Structure                      O(n^3)              O(n^2 log n)
Mixture distance            Structure and time parameter   O(n^2)              O(n log n)
Matching distance           Structure                      O(n)
Mixture-matching distance   Structure and time parameter   O(n)

Future work

Improve the time complexity

Extend to k-ary trees

Add mutation points

Thank you for listening.