TWIX Trees WIth eXtra splits · 2005. 4. 7. · 2 Martin Theus, Sergej Potapov – Lehrstuhl für...

Martin Theus, Sergej Potapov – Lehrstuhl für Rechnerorientierte Statistik und Datenanalyse, Universität AugsburgSimon Urbanek – AT&T Research Labs, Florham Park, NJ

TWIX – Trees WIth eXtra splits

Sergej PotapovMartin Theus

Simon Urbanek

Trees WIth eXtra splits

Motivation

• Where the classical CART algoritm fails– Greedy algorithms never go for a (locally) second best solution,

which would result in a better overall (global) solution.

Example: XOR-data

-3 -2 -1 0 1 2 3 4 5 6 7 8

>=3.532669

< 3.532669

1.437037

>=5.679445

< 3.151441

>=3.151441

1.290780

< 3.248369

>=2.703216

< 2.703216

1.806452

>=3.248369

1.532075

< 5.679445

1.500000

-3 -2 -1 0 1 2 3 4 5 6 7 8

Trees: More Problems

• Non-orthogonal splitting directions …

-3 -2 -1 0 1 2 3

< -0.05426408

>=0.09766749

< 0.09766749

>=-0.05426408

>=-0.5117182

>=-1.571775

< -1.758858

>=-1.758858

< -1.571775

< -0.8793875

>=-0.8793875

< -0.5117182

< 0.2145846

< 1.029791

>=1.029791

>=0.8353015

< 0.8353015

>=0.2145846

-3 -2 -1 0 1 2 3

… can not be handled by single trees, no matter how we split

Bagging Revisited

Bagging = Boostrap Aggegation, tries to simulate an infinite sam-

ple by bootstrapping, i.e. sampling from the original sample with

replacement.

Repeat N times:

1. Generate a bootstrap sample Di of size n.

2. Fit model f̂Di.

Depending on the problem the N results are aggregated:

• Classification: g(x) = argmaxc∈C

I(fDi(x) = c)

• Regression: g(x) =1

fDi(x)

Ensembles

• General IdeaUse many “different” classifier and combine them to get more accurate results.

• Bagging: Instability of trees yields different models

• Random Forests: Restrict input space randomly to get wider range of models

• Boosting: Iterate to down-weight “bad” points

Question:Why use randomly generated (sub-optimal) models?

Tree Mechanics

• CART is a recursive partitioning algorithm

• Each node is split according to the maximum gain in the loss function

• Mountain plots shows the loss function for a variable for all possible split points

Loss Functions Mountain Plot

0.0 0.2 0.4 0.6 0.8 1.0

Entropy

Missclassification

6400 6600 6800 7000 7200 7400 7600 7800 8000 8200 8400

Idea behind TWIX

• Since the greedy CART algorithm not necessarily finds the “optimal” tree, try second best splits.

• Use these forests for bagging

• Expect better results for both single trees and aggregations

• How to find “good” candidates for second best splits?

• Number of inner nodes grows exponentially with the number of levels in the tree

⇒ so does the number of alternative trees

Problems

Second Best Splits: South African Heart Data

100 120 140 160 180 200

Deviance

0 5 10 15 20 25 30

tobacco

Deviance

2 4 6 8 10 12 14

Deviance

10 15 20 25 30 35 40

adiposity

Deviance

20 30 40 50 60 70

Deviance

20 25 30 35 40 45

obesityDeviance

0 20 40 60 80 100 120

alcohol

Deviance

20 30 40 50 60

Deviance

Second Best Splits: Global vs. Local

• When searching for a “best” split point, we can either look for– all top n greatest deviance gains, or– only look for local maxima

• ExampleTop 6 splits

100 120 140 160 180 200

Deviance

0 5 10 15 20 25 30

tobacco

Deviance

2 4 6 8 10 12 14

Deviance

10 15 20 25 30 35 40

adiposity

Deviance

20 30 40 50 60 70

Deviance

20 25 30 35 40 45

obesityDeviance

0 20 40 60 80 100 120

alcohol

Deviance

20 30 40 50 60

Deviance

Second Best Splits: Forcing Variables

• Often a single variable dominates the potential deviance gain, and shadows all other variables⇒ Many probably good split points are lost.

• Solution:Force a minimum number of split points for each variable.

• Example: top 6 vs. top 3

100 120 140 160 180 200

Deviance

0 5 10 15 20 25 30

tobacco

Deviance

2 4 6 8 10 12 14

Deviance

10 15 20 25 30 35 40

adiposity

Deviance

20 30 40 50 60 70

Deviance

20 25 30 35 40 45

obesity

Deviance

0 20 40 60 80 100 120

alcohol

Deviance

20 30 40 50 60

Deviance

100 120 140 160 180 200

Deviance

0 5 10 15 20 25 30

tobaccoDeviance

2 4 6 8 10 12 14

Deviance

10 15 20 25 30 35 40

adiposity

Deviance

20 30 40 50 60 70

Deviance

20 25 30 35 40 45

obesity

Deviance

0 20 40 60 80 100 120

alcohol

Deviance

20 30 40 50 60

Deviance

Second Best Splits: Grid Search

• In some situations good split points might not even associated with some (local) maximum in deviance gain.

(Remember the XOR Example)

• Grid searches are most exhaustive, but also most expensive.

100 120 140 160 180 200

Deviance

0 5 10 15 20 25 30

tobacco

Deviance

2 4 6 8 10 12 14

Deviance

10 15 20 25 30 35 40

adiposity

Deviance

20 30 40 50 60 70

Deviance

20 25 30 35 40 45

obesity

Deviance

0 20 40 60 80 100 120

alcohol

Deviance

20 30 40 50 60

Deviance

Implementation: The Grid

• If we allow sj splits per node on level j of the tree, we get a max-imum of

S =k∏

s2i−1

trees for a tree with no more than k levels. Example:

s = (7, 4, 2) ⇒ S = 720

· 421

· 222

= 7 · 16 · 16 = 1792

⇒Work on a grid of computers

PVMController

Implementation: The R-Package

TWIX {TWIX} R Documentation

Top Classification Trees

Description

Top Classification Trees

TWIX(formula, data = NULL, test.data = NULL, subset = NULL,

method = "deviance", topn.method = "complete", cluster = NULL,

minsplit = 20, minbucket = round(minsplit/3), Devmin = 0.01,

topN = 1, level = 6, st = 1, cl.level = 2, tol = 0.01, ...)

Arguments

formula formula of the form y ~ x1 + x2 + ..., where y must be a factor and x1,x2,... arenumeric.

data an optional data frame containing the variables in the model(training data).

test.data a data frame containing new data.

subset an optional vector specifying a subset of observations to be used.

method Which split points will be used? This can be "deviance" (default), "grid" or "local".If the method is set to: local the program uses the local maxima of the splitfunction(entropy), deviance all values of the entropy, grid grid points.

topn.method one of "complete"(default) or "single". A specification of the consideration of the splitpoints. If set to "complete" it uses split points from all variables, else it uses split pointsper variable.

cluster name of the cluster, if parallel computing will be used.

minsplit the minimum number of observations that must exist in a node.

minbucket the minimum number of observations in any terminal <leaf> node.

Devmin the minimum improvement on entropy by splitting.

topN integer vector. How many splits will be selected and at which level? If length 1, the samesize of splits will be selected at each level. If length > 1, for example topN=c(3,2), 3splits will be chosen at first level, 2 splits at second level and for all next levels 1 split.

level maximum depth of the trees. If level set to 1, trees consist of root node.

st step parameter for method "grid".

cl.level parameter for parallel computing.

tol parameter, which will be used, if topn.method is set to "single".

... further arguments to be passed to or from methods.

a list with the following components :

call the call generating the object.

trees a list of all constructed trees, which include ID, Dev ... for each tree.

greedy.tree greedy tree

“Driving the Beast”

• The most important tuning parameters are– method

Which split points will be used? This can be "deviance" (default), "grid" or "local". If the method is set to: local the program uses the local maxima of the split function(entropy), deviance all values of the entropy, grid grid points.

– topn.methodone of "complete"(default) or "single". A specification of the consideration of the split points. If set to "complete" it uses split points from all variables, else it uses split points per variable.

– topNinteger vector. How many splits will be selected and at which level? If length 1, the same size of splits will be selected at each level. If length > 1, for example topN=c(3,2), 3 splits will be chosen at first level, 2 splits at second level and for all next levels 1 split.

– levelmaximum depth of the trees. If level set to 1, trees consist of root node.

– Stopping Rules:■ minsplit

the minimum number of observations that must exist in a node.■ minbucket

the minimum number of observations in any terminal <leaf> node.■ Devmin

the minimum improvement on entropy by splitting.

South African Heart Desease Data cont.

• To get a “fair”, i.e. generalizable and not too overfitted classifier, we usually split the data into 3 chunks:

– TrainingAll models are trained using the training data

– ValidationThe “best” model is selected using the validation data(The chosen model is then estimated with training+validation)

– TestThe performance is then assessed with the test data

Training Validation Test

What about trees?Model Structure = Model parameters

The Dataset

• 10 Variables, 462 Observations

• Target: Coronary Heart Disease (chd), 34,63% = 160 cases

• Inputs:continuous– sbp systolic blood pressure– tobacco cumulative tobacco (kg)– ldl low density lipoprotein cholesterol– adiposity– typea type-A behavior– obesity– alcohol current alcohol consumption– age age at onset

discrete– famhist family history of heart disease (Present, Absent)omi

101 218101 218101 218

The Dataset: Univariate

101 218

0 31.2

0.98 15.33

6.7 42.5

14.7 46.6

0 147.2

sbp tobacco ldl adiposity

typea obesity alcohol age

The Dataset:Bivariate

0 5 10 15 20 25 30 10 20 30 40 15 25 35 45 20 30 40 50 60

1015202530

tobacco

adiposity

obesity

alcohol

100 140 180 220

2 4 6 8 10 14 20 40 60 80 0 50 100 150

The Dataset: Multivariate

tobacco

adiposity

obesity

alcohol

tobacco

adiposity

obesity

alcohol

The Competitors: On 100 random samples

• Logistic Regression

glm(response~., data=dataTrain, family="binomial")

• Traditional CART

rpart(response ~ ., data=dataTrain, parms=list(split='information'))

• Bagging

bagging(response~., data=dataTrain, nbagg=100)

• SVM

svm(response~.,data=dataTrain)

!! None of the methods has been further tuned !!

Competitors Results: Error Rates

• Trad. Trees

• Bagging

• Support Vektor

Machines

• Log. Regression ! !

log..reg

svm.te

bagging

rpart.te

0.25 0.30 0.35 0.40 0.45

TWIX: Diagnostics

• For a given “Multitree” we can compare deviance and classification rate on training and test/validation data.

Example:

! !!! !

!!!!!!

!!!!!!!!

!!!!!!

50 100 150 200

Training Deviance

Deviances of trees (n= 870 )

50 100 150 200

! !! ! !! !!

! ! ! ! ! !

! ! !!! ! !! !! !! !

!! !! !!! ! ! ! !!!!

!! !!!! !!!! !! ! !

! ! !!!!! ! !!!!!!!!!!!!!! ! ! !

!!!!!!!!!!!! !!! !

!!! !!!!!!!!!!!!!!!! !! !!!!!!!!!!

!! !! ! ! !!!!!!!!!!!!!!!!

!! !! !!!!!!!!!!!!!!!! !

!!!!!!!!!!!!!! !!! !

!! !!!!!!!!!!!!! !

! !!!!!!!!!!!!!!!!!!!!! !!

!!!!!!!!!! !!!! !!! !!!

!!! !!!!!!!!!!!!!!!! !

! !! ! ! !! !!!!! !

! ! !! !!!!!!! !!!!!!! !!! ! ! !! ! !! ! !

!! !! ! !! ! !

! ! ! !

0.65 0.70 0.75 0.80 0.85

CCR of 870 trees

CCR of training data

+rpart Tree

TWIX: Tree Selection

• The CCR (Correct Classification Rate) of the top TWIX trees are far better than those of greedy trees and most other classification methods

Quest:How to find the “best” trees from the validation data?

• Currently:

Sort trees according to– training deviance– test deviance– training CCR– test CCR

and pick the best!

TWIX: Results

• Local Maxima across all variables (50 samples)

– 1,1

– 2,1

– 4,2

– 6,3

– 8,4

– 10,5 !! ! !

!! !!! !

!!! !! ! !

0.25 0.30 0.35 0.40 0.45

TWIX: Results

• Local Maxima, within all variables (50 samples)

– 1,1

– 2,1

– 4,2

– 6,3

– 8,4

– 10,5 !

0.25 0.30 0.35 0.40 0.45

TWIX: Results

• 100 runs of a (10,5,2) tree

! !! !

rpart.tr

TWIX.Tr

log..reg

TWIX.Bag

bagging

0.10 0.15 0.20 0.25 0.30 0.35 0.40 0.45

rpart (training)

svm (training)

TWIX (training)

TWIX (validation)

logistic (test)

svm (test)

TWIX bag (test)

TWIX (test)

Bagging (test)

rpart (test)

Bagged TWIX

• What bagging should look like:

#trees

Bagged TWIX: Example

• Bagging TWIX trees does often not improve the CCR.

0 50 100 150 200 250 300

error of single trees

0 10 20 30 40 50

bagseq

error of bagged trees

Conclusion

• For the South-African-Heart Data– Single TWIX-trees out-perform traditional tree and usually bagged trees

– Bagged TWIX beats bagged trees and reaches top performance

– TWIX gives good single alternative tree models

• Still much room for performance improvement– Stopping rules and pruning– Better tree selection– Improved selection of second best splits

– More tests on more datasets– Better understanding of the tree-families

• Computational effort is high, but parallel computing speeds up dramatically

• Complex methods are hard to implement and hard to test

TWIX Trees WIth eXtra splits · 2005. 4. 7. · 2 Martin Theus, Sergej Potapov – Lehrstuhl für...

Documents

Transcript of TWIX Trees WIth eXtra splits · 2005. 4. 7. · 2 Martin Theus, Sergej Potapov – Lehrstuhl für...

MINISTERUL EDUCAȚIEI NAȚIONALE DIRECȚIA · PDF fileBirjarul Iona Potapov e alb din cap până-n picioare, ca o nălucă. Stă pe capră încovoiat, atât cât se poate încovoia

Cuento para niños de Alta Capacidad … y para sus …€¦ · Gema Theus - Contacto: mamaoca1@yahoo.es. Cuento para niños de Alta Capacidad … y para sus padres. y profesores.

„Burnout ist out!“ - Universität Hamburg · Querab SS 2011 Christina Urbanek „Burnout ist out!“ Stress und Überforderung vorbeugen Herzlich Willkommen! Querab SS 2012 Dipl.

NATIONAL BUREAU OF ECONOMIC RESEARCH · 2020. 3. 20. · Robin Burgess, Matthew Hansen, Benjamin A. Olken, Peter Potapov, and Stefanie Sieber NBER Working Paper No. 17417 September

VOLOOLLTA REDONDA EM DESTAQUEQQUEUEcomo vara, anzol, bóia e outros, Zoológico também tem lazer para a Terceira idade Ocadeirante Ma theus Pires de Oliveira, 15 anos, foi escolhido

Par Patrick LANDOLT-THEUS

Potapov Algebra 11 metod reco - Шевкин.Ru · УДК 372.8:[512+517] ББК 74.262.21 П64 Серия«МГУ—школе»основанав1999году ПотаповМ.К.

Nigel KeNNedy Piotr wylezol - konzerthaus- · PDF filezweistimmige inventionen (um 1723) fassung für Violine und Violoncello beata urbanek-Kalinowska solo-Violoncello duKe elliNgtoN

naukairozwoj.uksw.edu.plnaukairozwoj.uksw.edu.pl/sites/default/files/mgr ciuła -urbanek... · Antonina Kloskowska, ... kultura ludowa/kultura masowa (niska)" ... a popularna i masowa

Bericht Alkolenker-Studie 2010 Alkolenker-Studie 2010.pdfAlkolenker-Studie, Wien, Nov. 2010 1 Österreichische Alkolenker-Studie 2010 Wien, November 2010 Gregor BARTL Katharina URBANEK

WordPress.com · 2017. 9. 23. · mundial, se enfrentará con el inmaculado ruso Nikolai Potapov (17-0-1, 8 KO), número dos de la clasificación, donde el ganador se adjudicará

NEV · Web viewNEV EVSZAM1 DIJ1 EVSZAM2 DIJ2 EVSZAM3 DIJ3 EVSZAM4 DIJ4 Ajkay Félix 1937 Zipernowsky Almási Kristóf 2002 Csáki Almási Sándor 1996 Urbanek 2005 Kandó Ángyánné

forma5.com · Presentation nunilv f€"atures Dimensions 13 . Theus ... ergonomía y elegancia, un proyecto consagrado a combinar comodidad y estética. En este afán, el departamento

Fortpflanzungsbiologie bei Rothirsch und Reh - bkpjv.ch · Fortpflanzungsbiologie bei Rothirsch und Reh Aus- und Weiterbilldungstag der KoAWI in Cazis 28. April 2012 Toni Theus prakt.

Europan 12 · PDF fileRodolphe Luscher, Präsident Europan Schweiz Karin Sandeck, Präsidentin Europan Deutschland ... M. Theus informe les personnes présentes que la prio

Prezentacja programu PowerPoint - miasto RybnikKarolina Imiolczyk Aleksandra Tymrakiewicz Julia Hozer Marcin Urbanek Jakub Han Anna Kapol 10 20 30 50 60 Nagroda publÎCihðscl glosowanle

Classification and regression trees · Презентация на конгрессе IUFRO 22.09.2017 Peter Potapov, Alexey Yaroshenko, Svetlana Turubanova, Ilona Zhuravleva, Elena

On Extensions of Partial n-Quasigroups of Order 4math.nsc.ru/~potapov/12.pdf · By an n-quasigroup of order k we mean an algebra with the universe Σ={0,1, ... 136 POTAPOV Proposition

Potapov m a Salickii a i Shahmatov a v Ekonomika Sovremennoi

Haftung von Emittenten 14Okt2010 - dorda.at · Sigrid Urbanek Urbanek & Rudolph Stefan Weber Weber Maxl Werner Wiesner BMF Martin Winner Übernahmekommission, WU Wien Andreas Zahradnik