TWIX Trees WIth eXtra splits - Homepage of Martin Theus · Martin Theus, Sergej Potapov –...

Martin Theus, Sergej Potapov – Lehrstuhl für Rechnerorientierte Statistik und Datenanalyse, Universität AugsburgSimon Urbanek – AT&T Research Labs, Florham Park, NJ

TWIX – Trees WIth eXtra splits

Sergej PotapovMartin Theus

Simon Urbanek

TWIX

Trees WIth eXtra splits


TWIX – Trees WIth eXtra splits 2

Motivation

• Where the classical CART algoritm fails– Greedy algorithms never go for a (locally) second best solution,

which would result in a better overall (global) solution.

Example: XOR-data



Motivation



Example: XOR-data

-3 -2 -1 0 1 2 3 4 5 6 7 8

-3

-2

-1

0

1

2

3

4

5

6

7

8

9

10



Motivation



Example: XOR-data

-3 -2 -1 0 1 2 3 4 5 6 7 8

0

-3 -2 -1 0 1 2 3 4 5 6 7 8

-3

-2

-1

0

1

2

3

4

5

6

7

8

9

10



Motivation



Example: XOR-data

-3 -2 -1 0 1 2 3 4 5 6 7 8

0

>=3.532669

1

< 3.532669

2

1.437037

>=5.679445

y

< 3.151441

1

>=3.151441

2

1.290780

< 3.248369

x

>=2.703216

1

< 2.703216

2

1.806452

>=3.248369

x

1.532075

< 5.679445

y

1.500000

x

-3 -2 -1 0 1 2 3 4 5 6 7 8

-3

-2

-1

0

1

2

3

4

5

6

7

8

9

10



Trees: More Problems

• Non-orthogonal splitting directions …

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3





-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

< -0.05426408

1

>=0.09766749

1

< 0.09766749

2

1

>=-0.05426408

V1

1

>=-0.5117182

V2

>=-1.571775

1

< -1.758858

1

>=-1.758858

2

1

< -1.571775

V2

1

< -0.8793875

V1

>=-0.8793875

2

1

< -0.5117182

V2

1

< 0.2145846

V1

< 1.029791

1

>=1.029791

2

2

>=0.8353015

V2

< 0.8353015

2

2

>=0.2145846

V1

1

V2

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3





-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

< -0.05426408

1

>=0.09766749

1

< 0.09766749

2

1

>=-0.05426408

V1

1

>=-0.5117182

V2

>=-1.571775

1

< -1.758858

1

>=-1.758858

2

1

< -1.571775

V2

1

< -0.8793875

V1

>=-0.8793875

2

1

< -0.5117182

V2

1

< 0.2145846

V1

< 1.029791

1

>=1.029791

2

2

>=0.8353015

V2

< 0.8353015

2

2

>=0.2145846

V1

1

V2

-3 -2 -1 0 1 2 3

-3

-2

-1

0

1

2

3

… can not be handled by single trees, no matter how we split



Bagging Revisited

Bagging = Boostrap Aggegation, tries to simulate an infinite sam-ple by bootstrapping, i.e. sampling from the original sample withreplacement.

Repeat N times:

1. Generate a bootstrap sample Di of size n.

2. Fit model f̂Di.

Depending on the problem the N results are aggregated:

• Classification: g(x) = argmaxc!C

N!

i=1

I(fDi(x) = c)

• Regression: g(x) =1

N

N!

i=1

fDi(x)



Ensembles

• General IdeaUse many “different” classifier and combine them to get more accurate results.

• Bagging: Instability of trees yields different models

• Random Forests: Restrict input space randomly to get wider range of models

• Boosting: Iterate to up-weight “bad” points



Ensembles

• General IdeaUse many “different” classifier and combine them to get more accurate results.

• Bagging: Instability of trees yields different models

• Random Forests: Restrict input space randomly to get wider range of models

• Boosting: Iterate to up-weight “bad” points

Question:Why use randomly generated (sub-optimal) models?



Tree Mechanics

• CART is a recursive partitioning algorithm

• Each node is split according to the maximum gain in the loss function

• Mountain plots shows the loss function for a variable for all possible split points

Loss Functions Mountain Plot

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.1

0.2

0.3

0.4

0.5

p

Entropy

Gini

Missclassification

6400 6600 6800 7000 7200 7400 7600 7800 8000 8200 8400

0

100

200

300



Tree Mechanics

• CART is a recursive partitioning algorithm

• Each node is split according to the maximum gain in the loss function

• Mountain plots shows the loss function for a variable for all possible split points

Loss Functions Mountain Plot

0.0 0.2 0.4 0.6 0.8 1.0

0.0

0.1

0.2

0.3

0.4

0.5

p

Entropy

Gini

Missclassification

6400 6600 6800 7000 7200 7400 7600 7800 8000 8200 8400

0

100

200

300

6400 6600 6800 7000 7200 7400 7600 7800 8000 8200 8400

1

3

5

7

9



Idea behind TWIX

• Since the greedy CART algorithm not necessarily finds the “optimal” tree, try second best splits.

• Use these forests for aggregation

• Expect better results for both single trees and aggregations

• How to find “good” candidates for second best splits?

• Number of inner nodes grows exponentially with the number of levels in the tree

⇒ so does the number of alternative trees

Problems



Second Best Splits: South African Heart Data

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesityDeviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance



Second Best Splits: Global vs. Local

• When searching for a “best” split point, we can either look for– all top n greatest deviance gains, or– only look for local maxima

• ExampleTop 6 splits

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance



Second Best Splits: Forcing Variables

• Often a single variable dominates the potential deviance gain, and shadows all other variables⇒ Many probably good split points are lost.

• Solution:Force a minimum number of split points for each variable.

• Example: top 6 vs. top 3

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobaccoDeviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance



Second Best Splits: Forcing Variables

• Often a single variable dominates the potential deviance gain, and shadows all other variables⇒ Many probably good split points are lost.

• Solution:Force a minimum number of split points for each variable.

• Example: top 6 vs. top 3

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobaccoDeviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance

56



Second Best Splits: Grid Search

• In some situations good split points might not even be associated with some (local) maximum in deviance gain.

(Remember the XOR Example)

• Grid searches are most exhaustive, but also most expensive.

100 120 140 160 180 200

05

10

15

20

sbp

Deviance

0 5 10 15 20 25 30

05

10

15

20

tobacco

Deviance

2 4 6 8 10 12 14

05

10

15

20

ldl

Deviance

10 15 20 25 30 35 40

05

10

15

20

adiposity

Deviance

20 30 40 50 60 70

05

10

15

20

typea

Deviance

20 25 30 35 40 45

05

10

15

20

obesity

Deviance

0 20 40 60 80 100 120

05

10

15

20

alcohol

Deviance

20 30 40 50 60

05

10

15

20

age

Deviance



Implementation: The Grid

• If we allow sj splits per node on level j of the tree, we get a max-imum of

S =k!

i=1

s2i!1

i

trees for a tree with no more than k levels. Example:s = (7, 4, 2) ! S = 72

0

· 421

· 222

= 7 · 16 · 16 = 1792

!Work on a grid of computers



Implementation: The Grid

• If we allow sj splits per node on level j of the tree, we get a max-imum of

S =k!

i=1

s2i!1

i

trees for a tree with no more than k levels. Example:s = (7, 4, 2) ! S = 72

0

· 421

· 222

= 7 · 16 · 16 = 1792

!Work on a grid of computers

PVMController



Using the R Package

• The most important tuning parameters are– method

Which split points will be used? This can be "deviance" (default), "grid" or "local". If the method is set to: local the program uses the local maxima of the split function(entropy), deviance all values of the entropy, grid grid points.

– topn.methodone of "complete"(default) or "single". A specification of the consideration of the split points. If set to "complete" it uses split points from all variables, else it uses split points per variable.

– topNinteger vector. How many splits will be selected and at which level? If length 1, the same size of splits will be selected at each level. If length > 1, for example topN=c(3,2), 3 splits will be chosen at first level, 2 splits at second level and for all next levels 1 split.

– levelmaximum depth of the trees. If level set to 1, trees consist of root node.

– Stopping Rules:■ minsplit

the minimum number of observations that must exist in a node.■ minbucket

the minimum number of observations in any terminal <leaf> node.■ Devmin

the minimum improvement on entropy by splitting.



South African Heart Desease Data cont.

Training Validation Test




• To get a “fair”, i.e. generalizable and not too overfitted classifier, we usually split the data into 3 chunks:






– TrainingAll models are trained using the training data







– ValidationThe “best” model is selected using the validation data(The chosen model is then estimated with training+validation)








– TestThe performance is then assessed with the test data








– TestThe performance is then assessed with the test data


What about trees?Model Structure = Model parameters



The Dataset



The Dataset

• 10 Variables, 462 Observations



The Dataset


• Target: Coronary Heart Disease (chd), 34,63% = 160 cases



The Dataset



• Inputs:continuous– sbp systolic blood pressure– tobacco cumulative tobacco (kg)– ldl low density lipoprotein cholesterol– adiposity– typea type-A behavior– obesity– alcohol current alcohol consumption– age age at onset



The Dataset



• Inputs:continuous– sbp systolic blood pressure– tobacco cumulative tobacco (kg)– ldl low density lipoprotein cholesterol– adiposity– typea type-A behavior– obesity– alcohol current alcohol consumption– age age at onset

discrete– famhist family history of heart disease (Present, Absent)

101 218



The Dataset: Univariate

sbp

101 218101 218




sbp

101 218101 218101 218

0.0

1.0




sbp




101 218

0.0

1.0

0 31.2

0.0

1.0

0.98 15.33

0.0

1.0

6.7 42.5

0.0

1.0

13 78

0.0

1.0

14.7 46.6

0.0

1.0

0 147.2

0.0

1.0

15 64

0.0

1.0

sbp tobacco ldl adiposity

typea obesity alcohol age



The Dataset:Bivariate sbp

0 5 10 15 20 25 30 10 20 30 40 15 25 35 45 20 30 40 50 60

100

140

180

220

05

1015202530

tobacco

ldl

24

68

10

14

10

20

30

40

adiposity

typea

20

40

60

80

15

25

35

45

obesity

alcohol

050

100

150

100 140 180 220

20

30

40

50

60

2 4 6 8 10 14 20 40 60 80 0 50 100 150

age



The Dataset: Multivariate

sbp

tobacco

ldl

adiposity

typea

obesity

alcohol

age



The Dataset: Multivariate

sbp

tobacco

ldl

adiposity

typea

obesity

alcohol

age

sbp

tobacco

ldl

adiposity

typea

obesity

alcohol

age

!

! !

!

! ! !!

! ! ! !

!!!!!! ! !

! !!!! !!!

! !! ! ! !

! !!! !

! ! !!! !! !!! ! !! ! !!!

! ! !

! !

! !!!!

!!! !!!

!

0.60 0.65 0.70 0.75 0.80

0.6

00

.65

0.7

00

.75

0.8

0

CCR of 130 trees

CCR of training data

CC

R o

f te

tst

da

ta

+

!

!!

!

!

!!!

!!!

!!!

!!

!

!

!

!

!!

!! !

!

!

!!

!

!!!

!!!!!!

!

!

!

!

!!!

!

!

!!!!

!

!!!!

!!

!

!

!

!

!

! !

!

!!!

!

!

! !

!!!

!

!!

!!!

!!!

!!

!!

!!

!

!

!

!!!!!!!

!!!!!!

!!

!

!!!!

!

!!

!

!

!!!

!!!

!!!

!

50 100 150

51

01

5

Training Deviance

Te

st

De

via

nce

!

Deviances of trees (n= 130 )

Training Deviance

Fre

qu

en

cy

50 100 150

05

10

15

20

25



TWIX: Diagnostics

• For a given “Multitree” we can compare deviance and classification rate on training and test/validation data.

Example:+ rpart Tree



TWIX: Tree Selection




• The CCR (Correct Classification Rate) of the top TWIX trees are better than those of greedy trees and many other classif. methods

Quest:How to find the “best” trees from the validation data?




• The CCR (Correct Classification Rate) of the top TWIX trees are better than those of greedy trees and many other classif. methods

Quest:How to find the “best” trees from the validation data?

• Several approaches:

(a) Sort trees according to: ■ training deviance, ■ validation deviance, ■ validation CCR,

and pick the best!

(b) Avoid extreme trees, i.e. forget trees having the worst deviances or CCRs and repeat (a).

(c) Look for structural properties like balance of tree, purity and size of leaves

(d) Identify clusters among the trees and avoid selecting trees from a “bad” cluster

(e) Use a mixture from (a) – (d) …



Looking at Tree Clusters

• Metric: Jaccard Coefficient

• Create groups via MDS and hierarchical clustering

22

1 2 3 4 5 6 7 8

020

40

60

80

100

120

Dimensionen

stress

!!!!!!

0.60

0.65

0.70

0.75

0.80

!!

0.60

0.65

0.70

0.75

0.80

0.60

0.65

0.70

0.75

0.80

0.60

0.65

0.70

0.75

0.80

1 2 3 4

010

20

30

40

50

16 KAPITEL 2. METRIKEN AUS DER LITERATUR

Somit unterscheiden sich die Bäume in genau drei Variablen, nämlich sbp, ldl undtypea. Für den Matching Koeffizienten erhält man dann:

dmc(B1, B2) =19

·9!

i=1

| !1,i ! !2,i |=

=19

· (0 + 1 + 1 + 0 + 0 + 0 + 0 + 1 + 0) =

=19

· 3 =13" 0.33

Aufgund diese Ergbnisses beträgt die Ähnlichkeit der beiden Bäume 1! 13 " 67%.

2.2.2 Der Jaccard Koeffizient

Der Jaccard Koeffizient basiert auf der selben Überlegung wie der Matching Koeffizient,mit dem Unterschied, dass die Normierung beim Jaccard Koeffizient nur mit der tatsäch-lich verwendeten Anzahl von Splitvariablen stattfindet.Somit ergibt sich für den Jaccard Koeffizienten folgende Definition:

djacc(Bi, Bj) :=1

| Vi # Vj | ·n!

k=1

| !i,k ! !j,k |

Der Normierungsfaktor | Vi # Vj | entspricht der Anzahl derjenigen Variablen, die tat-sächlich in den Bäumen Bi und Bj als Splitvariablen verwendet worden sind, dabei stelltVj die Menge der im Baum j verwendeten Splitvariablen dar. Analog zum Matching Ko-effizienten entspricht ! der boolsche Vektor der verwendeten Splitvariablen.

Beweis: djacc ist eine MetrikDer Jaccard Koeffizient kann durch den Matching Koeffizienten dargestellt werden:

djacc(Bi, Bj) :=1

| Vi # Vj | ·n!

k=1

| !i,k ! !j,k |= n

| Vi # Vj | · dmc(Bi, Bj)

Somit erfolgt der Beweis, dass djacc eine Metrik ist analog zu dmc. !

Beispiel 2: Es werden wie im Beispiel 1 die Bäume aus der Abbildung 2.1 verwendet. Daaus dem Beispiel 1 der Matching Koeffizienten für diese zwei Bäume bereits bekannt ist,muss man nur noch | V1 # V2 | ermitteln.V1 = {age,sbp,ldl,famhist}V2 = {age,typea,famhist}$ | V1 # V2 |=| {age,sbp,ldl,typea,famhist} |= 5Folglich ergibt sich für den Jaccard Koffizienten folgendes Ergebnis:

dj(B1, B2) =9

| V1 # V2 | · dmc(B1, B2) =95

· 13

=35

Damit beträgt die Ähnlichkeit der beiden Bäume approximativ 40%.



Looking at Forests

• Traceplots show a tree ensemble in a single framework

Cluster 1

23

tobacco age

age ldltypea tobacco

famhisttypea ldltobacco

ldltobacco agetypea famhist

typea famhisttobacco

ldltobacco

Level

1

2

3

4

5

6

7

8



Looking at Forests

• Traceplots show a tree ensemble in a single framework

Cluster 1

23

tobacco age

age ldltypea tobacco

famhisttypea ldltobacco

ldltobacco agetypea famhist

typea famhisttobacco

ldltobacco

tobacco age

tobacco age famhisttypea ldl

famhisttypea tobacco ldlalcohol age adiposityobesity sbp

ldl adiposity sbpagetypeaobesity tobacco famhist

typeaobesity ldlage sbpfamhisttobaccoalcohol

ldlagetypeaobesity tobacco sbpfamhistalcohol adiposity

adiposity sbpldlalcoholobesity tobacco

alcohol typeaobesity age

Cluster 3Level

1

2

3

4

5

6

7

8



The Competitors: On 100 random samples

• Logistic Regression

glm(response~., data=dataTrain, family="binomial")

• Traditional CART (from rpart)

rpart(response ~ ., data=dataTrain, parms=list(split='information'))

• Bagging (from ipred)

bagging(response~., data=dataTrain)

• SVM (from e1070)

svm(response~.,data=dataTrain)

!! None of the methods has been fine-tuned !!

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results

• 100 runs of a (20,3) TWIX tree, local maxima

rpart (training)

svm (training)

TWIX (training)

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results


rpart (training)

svm (training)

TWIX (training)

logistic (test)

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results


rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results


rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results


rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results


rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

Bagging (test)

!

!

! ! ! !

! !!

! !!!!!

rpart.trsvm.tr

log..reg

TWIX.Agg

bagging

0.15 0.20 0.25 0.30 0.35 0.40

!

!

!

!

!

!

!

!

!



TWIX: Results


rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

Bagging (test)

rpart (test)



TWIX Results: Parallel Coordinates

26

rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

Bagging (test)

rpart (test)

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te




26

rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

Bagging (test)

rpart (test)

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te




26

rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

Bagging (test)

rpart (test)

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te




26

rpart (training)

svm (training)

TWIX (training)

logistic (test)

svm (test)

TWIX agg (test)

TWIX (test)

Bagging (test)

rpart (test)

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te

0.12 0.435

rpart tr

svm tr

TWIX Tr

log. reg

svm te

TWIX Agg

TWIX Te

bagging

rpart te



How does TWIX compare?

• Are the resulting ensembles similar to other tree-based ensemble methods?

27

TWIX





27

rForests

Force the useof splits in all

variables

TWIX





27

Bagging

many splitsclose to

greedy split

rForests

Force the useof splits in all

variables

TWIX



Conclusion



Conclusion

• For the South-African-Heart Data– Single TWIX-trees out-perform traditional trees and usually bagged trees – Aggregated TWIX beats bagged trees and reaches top performance– TWIX gives good single alternative tree models



Conclusion


• Still room for performance improvement– Better (more stable) tree selection– Improved selection of “second best” splits– Improved aggregation of the trees (weights, boosting, …)

– More tests on more datasets (mlbench, “report78”)– Better understanding of the tree-families



Conclusion




• Computational effort can be high, but parallel computing helps



Conclusion





• Understanding of tree ensembles still poor, classical metrics fail



Conclusion





• Understanding of tree ensembles still poor, classical metrics fail

• Complex methods are hard to implement and hard to test, importance of “reproducible research” cannot be underestimated!

TWIX Trees WIth eXtra splits - Homepage of Martin Theus · Martin Theus, Sergej Potapov –...

Documents

Transcript of TWIX Trees WIth eXtra splits - Homepage of Martin Theus · Martin Theus, Sergej Potapov –...