Laboratoire d'InfoRmatique en Image et Systèmes d'information
LIRIS UMR 5205 CNRS/INSA de Lyon/Université Claude Bernard Lyon 1/Université Lumière Lyon 2/Ecole Centrale de Lyon
Université Claude Bernard Lyon 1, bâtiment Nautibus
43, boulevard du 11 novembre 1918, F-69622 Villeurbanne cedex
http://liris.cnrs.fr
DDDM'08, Pisa - 15/12/2008
Parameter Tuning for Differential Mining of String Patterns
J. Besson, C. Rigotti, I. Mitasiunaite and J.-F. Boulicaut
Tuning extraction parameters
Local pattern mining: itemsets, closed itemsets, episodes, sequential patterns, substrings
… under constraints (monotonic, anti-monotonic, or neither; pattern shapes, occurrence properties, measures …)
Constraints let us select and focus … but where to look in the parameter space?
Often easy with a single threshold … but with multiple constraints and multiple thresholds?
Two different kinds of tuning
1) exploratory stage: find promising areas in the parameter space
2) fine-grained tuning: a kind of greedy strategy based on small local explorations of the parameter space
Tools?
What is the best tool ever used in the exploratory stage to find promising parameter settings in local pattern mining? …
Tools
GREP + Word Count
method: a manual mix of the following
- count the extracted patterns
- choose points in the parameter space
- random walk
- try a local greedy strategy
having in mind the known properties of the constraints (when applicable) and domain knowledge
Tools
… when there are several parameters and several thresholds, e.g., a minimal support on one dataset and a maximal support on another dataset …
perform a more exhaustive exploration of the pattern space
draw curves depicting the extraction landscape
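A minimal sketch of such an exhaustive exploration, assuming patterns are substrings of a fixed size and that "support" counts the number of strings containing the pattern. The function names `support` and `landscape` are hypothetical and are not taken from the authors' tools; the returned dictionary is the raw data from which a landscape curve could be drawn.

```python
from collections import Counter
from itertools import product


def support(strings, size):
    """Map each substring of length `size` to the number of strings containing it."""
    counts = Counter()
    for s in strings:
        # Use a set so that a substring is counted at most once per string.
        counts.update({s[i:i + size] for i in range(len(s) - size + 1)})
    return counts


def landscape(r1, r2, size, min_values, max_values):
    """Count the extracted patterns for every (min, max) threshold pair.

    A pattern is kept when it occurs in at least `min` strings of r1 and in
    at most `max` strings of r2. Thresholds in `min_values` are assumed >= 1.
    """
    sup1, sup2 = support(r1, size), support(r2, size)
    return {
        (lo, hi): sum(1 for p, c in sup1.items() if c >= lo and sup2[p] <= hi)
        for lo, hi in product(min_values, max_values)
    }


if __name__ == "__main__":
    grid = landscape(["abab", "abba"], ["bbbb"], 2, [1, 2], [0, 1])
    for (lo, hi), n in sorted(grid.items()):
        print(f"min={lo} max={hi} -> {n} patterns")
```

This brute-force version runs a real extraction for every dataset and pattern size, which is exactly the cost that becomes prohibitive on large parameter spaces.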
Tools / landscape examples
[Figure: examples of extraction landscapes; image not recovered]
Obtaining extraction landscapes
use a script
- can need a lot of resources to execute
- too much time needed to explore a large parameter space (several parameters)
use a global model of the presence of the local patterns to estimate the number of patterns
- reuse/adapt a model: not many exist
- develop a new global model: each kind of pattern and each conjunction of constraints can be a research problem in itself
- incorporate domain knowledge? A global analytical model is then even more complex to exhibit …
What about sampling the pattern space?
sounds too naive, or in need of complicated frameworks
how to sample?
size of the sample?
number of patterns in the sample that satisfy the constraints?
using domain knowledge?
how to estimate the value for the whole pattern space?
What about simple choices?
sampling with replacement among the patterns that satisfy the syntactic constraints (of the conjunction of constraints)
number of patterns in the sample that satisfy the constraints:
- compute, for each pattern in the sample, the probability that it satisfies the constraints (incorporating domain knowledge)
- approximate the number of patterns that satisfy the constraints (in the sample)
sample size: grow the sample until the percentage of patterns satisfying the constraints converges
estimate the number of patterns in the pattern space that satisfy the constraints: this percentage times the number of patterns that satisfy the syntactic constraints
Whole process
1) build an initial sample of Psynt (the patterns that satisfy the syntactic constraints)
2) compute an estimate of E(N), the expected number of patterns satisfying the constraints, from the sample
3) add more patterns to the sample
4) compute a new estimate of E(N) from the sample
5) if the estimate changed a lot, go to 3)
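The five steps above can be sketched as a short loop. This is an illustration under stated assumptions, not the authors' implementation: `sample_pattern` (draws one pattern uniformly from Psynt), `prob_satisfies` (returns the probability that a pattern satisfies the constraints) and all default sizes are hypothetical, and the 5% relative-change stopping rule is only one possible reading of "changes a lot".

```python
import random


def estimate_expected_count(sample_pattern, prob_satisfies, space_size,
                            initial_size=1000, batch_size=1000,
                            tolerance=0.05, max_size=1_000_000, rng=None):
    """Estimate E(N) by growing a sample of Psynt until the estimate stabilises.

    Returns (estimate, sample size). Sampling is with replacement because
    each call to `sample_pattern` is an independent draw.
    """
    rng = rng or random.Random()
    total = 0.0   # running sum of E(Xp) over the sample
    n = 0         # current sample size

    def grow(k):
        nonlocal total, n
        for _ in range(k):
            total += prob_satisfies(sample_pattern(rng))
        n += k

    grow(initial_size)                          # step 1
    estimate = space_size * total / n           # step 2
    while n < max_size:
        grow(batch_size)                        # step 3
        new_estimate = space_size * total / n   # step 4
        # Step 5: stop when the relative change is within the tolerance.
        if abs(new_estimate - estimate) <= tolerance * abs(estimate):
            return new_estimate, n
        estimate = new_estimate
    return estimate, n
```

For strings of size Z over a given alphabet, `sample_pattern` could be `lambda rng: "".join(rng.choices(alphabet, k=Z))` with `space_size = len(alphabet) ** Z`.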
Using it in frequent substring mining
Two datasets: R1 and R2 (two sets of strings)
Constraints:
- having size Z
- appearing at least min times in R1
- appearing no more than max times in R2
Consider exact and approximate matching
Pattern space and domain knowledge
strings over an alphabet of 4 or 8 symbols
domain knowledge as three models of symbol distribution:
- Me: independent symbols with equal frequencies
- Md: independent symbols with different frequencies
- Mm: first-order Markov model
for a given p, and Me or Md or Mm, we have the probability that there exists at least one occurrence of p in a string
from the binomial distribution we have the probability that p satisfies the min and max support constraints
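A hedged sketch of this two-stage computation for exact matching under the Md model (Me is the special case of equal frequencies). Two simplifications are assumptions of this sketch and not necessarily what the authors used: the start positions of a pattern within a string are treated as independent, which ignores overlaps, and R1 and R2 are treated as independent datasets. All function names are hypothetical.

```python
from math import comb


def prob_in_string(pattern, symbol_probs, string_length):
    """Approximate P(pattern occurs at least once in one random string)."""
    q = 1.0
    for symbol in pattern:
        q *= symbol_probs[symbol]     # P(pattern starts at a given position)
    positions = string_length - len(pattern) + 1
    if positions <= 0:
        return 0.0
    # Assumption: occurrences at different positions are independent.
    return 1.0 - (1.0 - q) ** positions


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    if k < 0:
        return 0.0
    return sum(comb(n, i) * p**i * (1 - p)**(n - i)
               for i in range(min(k, n) + 1))


def prob_satisfies(pattern, symbol_probs, n1, n2, string_length,
                   min_sup, max_sup):
    """P(support in R1 >= min_sup and support in R2 <= max_sup).

    n1 and n2 are the numbers of strings in R1 and R2; the support of a
    pattern is the number of strings that contain it.
    """
    pi = prob_in_string(pattern, symbol_probs, string_length)
    at_least_min = 1.0 - binom_cdf(min_sup - 1, n1, pi)
    at_most_max = binom_cdf(max_sup, n2, pi)
    return at_least_min * at_most_max   # assumption: R1 and R2 independent
```

The returned value plays the role of E(Xp) for a single pattern p.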
Example / random data
4 symbols, Md (0.4, 0.1, 0.2, 0.3), 100 strings of length 1000 in R1 and in R2, exact match
Example / random data
4 symbols, Mm, 100 strings of length 1000 in R1 and in R2, exact and approximate match
Example / gene promoter sequences
4 symbols (A, C, G, T), Md, strings of 4000 symbols, 29 in R1 and 21 in R2, approximate match
Conclusion
Drawing extraction landscapes for parameter tuning in local pattern extraction, using pattern space sampling …
… seems possible …
… at least in some cases
… using a simple framework
… incorporating domain knowledge (to some extent; there are many works on the probability that a given pattern satisfies constraints)
simpler than building a global analytical model
faster than running real extractions
… sufficient in the exploratory stage?
… companion software?
Example / random data
8 symbols, Me, 100 strings of length 30000 in R1 and in R2, approximate match
Problems: sampling / estimates
kind of sampling (with replacement?)
specific sampling (a kind of stratified sampling) for some constraints?
kinds of patterns?
quality of the estimates … occurrences of different patterns are not independent
Problems: other parameters added
size of the starting set
convergence criterion? 5%?
size of the additional subsets
… not so hard to tune?
Number of patterns
C: a conjunction of constraints
PS: the pattern space
for each pattern p, let the variable Xp = 1 if p satisfies C, and Xp = 0 otherwise
N = number of patterns that satisfy C = sum of Xp over PS
E(N) = sum of E(Xp) over PS
E(Xp) = probability that p satisfies C
Psynt = patterns in PS that satisfy the syntactic constraints in C
E(N) = sum of E(Xp) over Psynt (since E(Xp) = 0 for patterns outside Psynt)
Number of patterns
compute NS = sum of E(Xp) over a sample of Psynt
compute the ratio NR = NS / sample size
use NR * size of Psynt as an estimate of E(N)
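For reference, the quantities above can be written compactly as follows. The symbols S (the sample) and |S| (its size) are notation introduced here for illustration and do not appear on the slides.

```latex
\begin{align*}
E(N) &= \sum_{p \in PS} E(X_p) = \sum_{p \in P_{synt}} E(X_p), \\
N_S &= \sum_{p \in S} E(X_p), \qquad N_R = \frac{N_S}{|S|}, \\
\widehat{E(N)} &= N_R \cdot |P_{synt}| .
\end{align*}
```

If S is drawn uniformly with replacement from Psynt, each draw has expected value equal to the mean of E(Xp) over Psynt, so the expected value of NR * |Psynt| is E(N).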