Post on 05-May-2018
S. (Muthu) Muthukrishnan
Google
adorisms, mysliceofpizza

On the Infrastructure for Network Data Mining: Concepts and Experiences

Talk Overview
- Data Analysis in Different Communities
  - Algorithms, Databases and Networking
- Infrastructure View of Data Analysis
  - Example 1: Cellphone Call Traffic
  - Example 2: IP Packet Traffic Streams
  - Example 3: Web Traffic
- Perspectives

Earliest Known Data Analysis
- Natural and political observations made upon the Bills of Mortality, by John Graunt, 1662.
  - Use figures of births and deaths in London collected by parishes.
  - Few beggars starve to death, polygamy is irrational, Head is too big for the body, ...
  - How many M/F? How many married/single? What years fruitful/mortal and in what intervals? ...
  - Knowledge of these is necessary to ease government, balance Parties and factions both in church and state. But is it necessary for others besides the king and his ministers?
- First life insurance tables, by Edmund Halley, 1693.

Data Analysis in Different Communities
- Networking:
  - Mining anomalies using traffic feature distributions. A. Lakhina, M. Crovella, C. Diot. SIGCOMM 05.
- Algorithms:
  - Streaming and sublinear approximation of entropy and information distances. S. Guha, A. McGregor, S. Venkatasubramanian. SODA 2006.
- Databases:
  - Holistic UDAFs at streaming speeds. G. Cormode, T. Johnson, F. Korn, S. Muthukrishnan, O. Spatscheck, D. Srivastava. SIGMOD 2004.
Slide callouts: "entropy"; "User defined aggregate function (UDAF): Entropy."

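The callouts on this slide suggest that entropy of a traffic distribution is the quantity shared by the three communities. A minimal sketch of that quantity, assuming source IP address is the traffic feature; the function name and the sample addresses are made up for illustration and do not come from the cited papers:

```python
import math
from collections import Counter

def empirical_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical source IPs seen in two measurement windows.
normal_window = ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]
skewed_window = ["10.0.0.1", "10.0.0.1", "10.0.0.1", "10.0.0.2"]

h_normal = empirical_entropy(normal_window)  # 2.0 bits: four equally likely sources
h_skewed = empirical_entropy(skewed_window)  # lower: one source dominates
```

A sharp change in entropy between windows is the kind of signal an anomaly detector could watch; computing it exactly needs the full histogram, which is why streaming approximations matter at line rate.
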
Detailed view of CDRs
[Figure: network elements on a call path (IP, GSM MSC, VoIP, TDMA MSC), with originating and terminating sides labeled]
- Each network element writes records (Call Detail Records) of what it sees in different formats.
- CDRs are several hundred bytes long.
- It is hard to even join the records.

Analyzing CDRs: Data
- Data:
  - TDMA: Ericsson, Lucent, Nortel etc. MSCs; GSM and UMTS: MSCs; VoIP: Gateways; GPRS: SGSNs, GGSNs, and MMSCs; SMS logs.
  - 10's of different data formats.
  - Side tables: LERG. Handset info. Trunk info.
  - About 1 Tbyte/month for large carriers.
[Figure labels: switch; data collection point]

Analyzing CDRs: Analyses
- Analyses:
  - 100's of reports a month.
- Example Analyses:
  - Dropped calls per handset type
    - 2A or 2B connections.
  - Fraudulent transit calls
  - Cell adjacency graph
[Figure: Distant Tower Problem, with labels D1, D2, D3]
(Partial) Solution: Find a dropped call using cell tower C immediately preceding a successful call using cell tower D significantly far away from C.

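The (partial) solution to the Distant Tower Problem can be sketched as a scan over each handset's calls in time order. This is only an illustration under stated assumptions: the `Call` record, the tower coordinates, and the 30 km cutoff for "significantly far away" are hypothetical, and real CDRs would first have to be joined and normalized.

```python
import math
from collections import namedtuple

# Hypothetical, simplified call record; real CDRs carry many more fields.
Call = namedtuple("Call", "handset start tower dropped")

def distance_km(a, b):
    """Great-circle (haversine) distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def distant_tower_pairs(calls, tower_pos, min_km=30.0):
    """Return (C, D) pairs where a dropped call on tower C is immediately
    followed, for the same handset, by a successful call on a distant tower D."""
    by_handset = {}
    for call in sorted(calls, key=lambda c: (c.handset, c.start)):
        by_handset.setdefault(call.handset, []).append(call)
    pairs = []
    for seq in by_handset.values():
        for prev, nxt in zip(seq, seq[1:]):
            if prev.dropped and not nxt.dropped and prev.tower != nxt.tower:
                if distance_km(tower_pos[prev.tower], tower_pos[nxt.tower]) >= min_km:
                    pairs.append((prev.tower, nxt.tower))
    return pairs

# Made-up towers: D is about 55 km from C, E is about 1 km from C.
tower_pos = {"C": (40.0, -74.0), "D": (40.5, -74.0), "E": (40.01, -74.0)}
calls = [
    Call("h1", 100, "C", True), Call("h1", 160, "D", False),
    Call("h2", 100, "C", True), Call("h2", 150, "E", False),
]
suspects = distant_tower_pairs(calls, tower_pos)
```

Only handset h1 yields a pair here, because tower E is close to C. A production version would probably also bound the time gap between the two calls; the slide does not say how.
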
Analyzing CDRs: Infrastructure
- Challenge is not the size of the data.
  - understanding the data, translating a business problem down to CDR analysis.
  - Turnaround time: Days or weeks.
  - Small team of analysts responsible.
- Infrastructure:
  - Large disks.
  - Multiple CPU machines.
  - Scripting languages, standard file system.

Analyzing IP Traffic (ISP View): Data
- SNMP, IP flows, packet header logs, packet contents.
- Routing tables, BGP updates, Fault alarms.
- OC48, 192, 768: x Tbytes/hour.
- 6 Mpkts/sec to 96 Mpkts/sec.
Packet traffic seen as streams, other logs may be stored in databases.

Analyzing IP Traffic: Analyses
- Real time, router speed analysis.
- Example:
  - Reporting, SLA mediation.
  - Anomaly/Attack detection.
  - Lawful intercept
  - Monitoring failures.
  - Traffic classification.

Latin American Carrier: Real-Time Traffic Insight (NARUS)
[Figure: graph and table showing protocols (and associated session counts and bytes) running on customer's network]

Gigascope Architecture
- Gigascope is an SQL-based operational IP traffic analysis tool at AT&T.
- Has two level arch.
  - Low-level queries perform initial fast selection and aggregation on high speed stream.
  - Complex aggregation on high level, at monitor server
  - Depending on the capabilities of the NIC, can push operators and low-level queries into it.
[Figure labels: NIC, Ring Buffer, Low, Low, Low, High, High, App, NIC]

GSQL Query Splitting

Query:
    Select tb, SrcIP, count(*)
    From UDP
    Group By time/60 as tb, SrcIP

High level:
    Select tb, SrcIP, sum(Cnt)
    From Subq
    Group By tb, SrcIP

Low level (Subq):
    Select tb, SrcIP, count(*) as Cnt
    From UDP
    Group By time/60 as tb, SrcIP

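The split works because a count can be computed in pieces and the pieces summed. A minimal sketch of that idea, assuming packets arrive as `(time, SrcIP)` pairs and that partial counts for the same group may come from more than one low-level pass; the function names and data are illustrative, not Gigascope's API:

```python
from collections import Counter

def low_level(packets):
    """Subq: count packets per (tb, SrcIP), with tb = time // 60."""
    partial = Counter()
    for time, src_ip in packets:
        partial[(time // 60, src_ip)] += 1
    return partial

def high_level(partials):
    """Sum the Cnt values emitted by Subq, grouped by (tb, SrcIP)."""
    total = Counter()
    for partial in partials:
        for key, cnt in partial.items():
            total[key] += cnt
    return total

# Made-up UDP packets as (time in seconds, source IP).
packets = [(0, "a"), (30, "a"), (59, "b"), (60, "a"), (61, "a"), (119, "b")]

# Two low-level passes, each seeing part of the stream.
partials = [low_level(packets[0::2]), low_level(packets[1::2])]
merged = high_level(partials)

# The split plan gives the same answer as running the query in one piece.
direct = low_level(packets)
```

The same decomposition would not work for an aggregate such as a median, which is one reason the choice of what to push to the low level matters.
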
Gigascope, Status
Currently supports:
- GSQL, UDAFs.
  - stream aggregate queries.
- Sampling.
  - Operator can be specialized to most stream sampling methods.
  - Most complex queries can be executed with sampling to provide semantically correct output.
- Sketches.
  - Heavy hitters, quantiles.
- Regex matcher for flows.
  - Match contents across packets in presence of duplicates, out-of-order or overlapping packets.
- Heartbeats.
- Prelim distributed implementation.
- Deployed
Ted Johnson, Irina Rozenbaum, Vlad Shkapenyuk, Oliver Spatscheck.
Data Streams: Algorithms and Applications, S. Muthukrishnan, Foundations of Theoretical Computer Science, 2005.

Sampling Operator
- Many sampling algorithms known for IP traffic streams.
  - Uniform random sampling
  - Priority sampling
  - Value sampling
  - Distinct, inverse, min-wise sampling.
- Observation:
  - Most sampling algorithms have an overall common execution structure.
- Our approach:
  - Define and optimize a single sampling operator.

Stream Sampling Operator
- Can be specialized for wide variety of stream sampling algorithms.
- Encourages experimentation and development of new sampling algorithms.

    Select <select expression list>
    From <stream>
    Where <predicate>
    Group by <group-by variables definition list>
    Cleaning when <predicate>
    Cleaning by <predicate>
    [Having <predicate>]

- Cleaning when: condition for triggering a cleaning phase.
- Cleaning by: condition for sample reduction.
T. Johnson, S. Muthukrishnan and I. Rozenbaum, SIGMOD 2002.

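One way to picture the common execution structure is a loop with three hooks: admit an item, decide when to clean, and shrink the sample. The sketch below specializes that loop to min-wise style sampling (keep the k distinct items with the smallest hash values). It is an illustration only; the class, its method names, and the "clean when the sample exceeds 2k" rule are assumptions and not the operator's actual implementation.

```python
import hashlib

def item_hash(item):
    """Deterministic 64-bit hash of an item."""
    digest = hashlib.sha256(str(item).encode()).digest()
    return int.from_bytes(digest[:8], "big")

class BottomKSampler:
    """Keeps the k distinct items with the smallest hash values."""

    def __init__(self, k):
        self.k = k
        self.threshold = float("inf")  # admit everything until the first cleaning
        self.sample = []

    def where(self, item):
        # Admission test: only items that could still be among the k smallest.
        return item_hash(item) < self.threshold

    def cleaning_when(self):
        # Trigger a cleaning phase once the sample has grown past 2k entries.
        return len(self.sample) > 2 * self.k

    def cleaning_by(self):
        # Sample reduction: drop duplicates, keep the k smallest hashes,
        # and tighten the admission threshold accordingly.
        kept = sorted(set(self.sample), key=item_hash)[: self.k]
        if len(kept) == self.k:
            self.threshold = item_hash(kept[-1]) + 1
        self.sample = kept

    def run(self, stream):
        for item in stream:
            if self.where(item):
                self.sample.append(item)
                if self.cleaning_when():
                    self.cleaning_by()
        self.cleaning_by()  # final pass, playing the role of a last filter
        return self.sample

# Made-up stream with repeats: 500 distinct values, each seen four times.
stream = [i % 500 for i in range(2000)]
sample = BottomKSampler(10).run(stream)
```

Swapping the three hooks for other rules gives other samplers, which is presumably what makes a single operator attractive for experimentation.
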
Subset-Sum Sampling Query
What are the sizes (in bytes) of all flows seen over a time interval of 60 seconds?

    Select tb, srcIP, dstIP, UMAX(sum(len), ss_threshold())
    From SOURCE
    Where ssample(len, 1000) = TRUE
    Group By time/60 as tb, srcIP, dstIP
    Cleaning When ss_need_to_clean() = TRUE
    Cleaning By ss_do_clean(sum(len))
    Having ss_final_clean(sum(len))

- This query returns a sample of 1000 elements for every 60 second interval

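The `UMAX(sum(len), ss_threshold())` term resembles the size adjustment used in basic threshold sampling: a flow of size x is kept with probability min(1, x/z) and reported as max(x, z), which keeps subset sums unbiased. The sketch below shows that basic, fixed-threshold version only. It is not the query's dynamic scheme, which adjusts the threshold to hit a target sample size, and the flow ids, sizes, and the value of `z` are made up.

```python
import random

def threshold_sample(flows, z, rng):
    """Keep a flow of size x with probability min(1, x / z);
    a kept flow is reported with adjusted size max(x, z)."""
    sample = {}
    for flow_id, size in flows.items():
        if rng.random() < min(1.0, size / z):
            sample[flow_id] = max(size, z)
    return sample

# Hypothetical flow sizes in bytes.
flows = {"f1": 5000, "f2": 1200, "f3": 40, "f4": 10}
sample = threshold_sample(flows, z=1000, rng=random.Random(0))

# Flows at least as large as z are always kept with their true size;
# smaller flows are kept rarely and, when kept, count as z bytes.
estimated_total = sum(sample.values())
```

Averaged over many runs, `estimated_total` approaches the true total of 6250 bytes, even though small flows are mostly skipped.
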
Heavy Hitters Query
List the number of bytes and the number of packets for destination IP addresses which account for at least 1% of the total traffic.

    Select tb, dstIP, sum(len), count(*)
    From SOURCE
    Group By time/60 as tb, dstIP
    Cleaning When local_count(100) = TRUE
    Cleaning By count(*) < current_bucket() - first(current_bucket())

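The bucket-based cleaning in this query looks like the lossy counting algorithm of Manku and Motwani, with a bucket width of 100 packets matching an error of 1%. That reading is an assumption. The sketch below is the textbook algorithm counting packets only, not a translation of the GSQL semantics; names and data are illustrative.

```python
import math

def lossy_counting(stream, epsilon):
    """Textbook lossy counting. Returns ({item: (count, delta)}, n), where
    count undercounts the true frequency by at most delta <= epsilon * n."""
    width = math.ceil(1 / epsilon)         # bucket width
    entries = {}
    n = 0
    for item in stream:
        n += 1
        bucket = math.ceil(n / width)      # id of the current bucket
        if item in entries:
            count, delta = entries[item]
            entries[item] = (count + 1, delta)
        else:
            entries[item] = (1, bucket - 1)
        if n % width == 0:                 # bucket boundary: prune rare items
            entries = {k: (c, d) for k, (c, d) in entries.items() if c + d > bucket}
    return entries, n

def heavy_hitters(stream, support, epsilon):
    """Items whose stored count is at least (support - epsilon) * n."""
    entries, n = lossy_counting(stream, epsilon)
    return {k: c for k, (c, _) in entries.items() if c >= (support - epsilon) * n}

# Made-up destinations: "big" receives every fourth packet, the rest are one-offs.
stream = ["big" if i % 4 == 0 else f"ip{i}" for i in range(1000)]
hitters = heavy_hitters(stream, support=0.1, epsilon=0.01)
```

Every item with true frequency at least `support * n` is reported, and memory stays small because one-off destinations are pruned at each bucket boundary.
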
Sampling Operator
War story:
- During SYN flooding and DDOS attacks, Cisco Netflow generator is overwhelmed and produces useless output.
- Packet sampling does not provide accurate flow samples.
- By combining flow sampling and flow generation logic using the sampling operator, Gigascope produces meaningful, valuable flow samples even at peak rates of flows such as in attacks.

Example Application
- Heavy hitter q-gram in packet contents.
  - Design sampling+sketching method to skip over vast number of packets.
  - Orders of magnitude improvement over prior work in networking, skipping fraction of packets!
S. Bhattacharyya, A. Madeira, S. Muthukrishnan and T. Ye. Sprint ATL Technical Report, 2006.

IP Traffic Analysis: Infrastructure
- Challenge:
  - Size, rate of data. Analyses: Simple.
  - Turnaround time: Minutes, days.
  - Moderate sized team of analysts.
- Special infrastructure:
  - Optical splitters, NIC
  - Multiple CPU machines
  - Data stream management systems (DSMSs)

Google
- Search: Web, Image, Video, News, Usenet Groups, Blogs
- Calculator Co.: Convert units, Calculate.
- Advertising: AdWords, AdSense, Partner sites
- Earth, Map, Finance, Trends, Writely, Personalize, Froogle, ...

Example: Sponsored Search
- Advertisers want to place ads in response to user queries.
  - Have to figure out what queries are interesting, how much to bid on each query, what is the budget, ...
- Problem: Given a set of queries and a potential bid, output the distribution of
  - Number of clicks expected
  - Expected position on the ad list
  - Expected price.
- Input: queries, ads shown, bids, price, etc. Terabytes of data on 1000's of commodity machines.

MapReduce [Dean, Ghemawat OSDI04]
- Parallel programming infrastructure at Google.
- Users specify map and reduce functions.
  - Input: set of records.
  - Each record is mapped to a set of (key, value) pairs.
  - All pairs with same key are considered together and a reduce function is applied to the values.
- System automatically takes care of
  - Parallelizing on 100's++ commodity machines.
  - Fault tolerance
  - Scheduling, load balance, inter-machine communication, etc.

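The programming model can be illustrated in a single process: map each record to (key, value) pairs, group by key, then reduce each group. The sketch below leaves out everything the real system automates (parallelism, fault tolerance, scheduling); the `map_reduce` helper and the "protocol bytes" log format are made up for illustration.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Single-process illustration of the map / group-by-key / reduce flow."""
    groups = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            groups[key].append(value)
    return {key: reduce_fn(key, values) for key, values in groups.items()}

def map_fn(record):
    # Hypothetical log line of the form "protocol bytes".
    protocol, nbytes = record.split()
    yield protocol, int(nbytes)

def reduce_fn(key, values):
    # Total bytes for one protocol.
    return sum(values)

logs = ["http 500", "dns 60", "http 1500", "smtp 900", "dns 40"]
bytes_per_protocol = map_reduce(logs, map_fn, reduce_fn)
```

Because each key is reduced independently, the groups could in principle be spread across many machines without changing the user's two functions.
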
Traffic Estimation Using MapReduce (made-up exercise)
- Logs consist of (q, b_1, p_1, b_2, p_2, .., c).
  - q is the query.
  - b_i is the bid of advertiser in ith place and p_i the price.
  - c is the ad clicked on.
- Map to (q, b_i, p_i, i, 1 if c=i) for all i; q is the key.
- Reduce will have all records with same q. Calculate
  - number of clicks,
  - average position,
  - average cost per click, etc.
- Run this periodically and index for each q. Lookup when advertiser queries.

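The exercise can be sketched end to end on a toy log. Several details here are assumptions rather than part of the slide: the record layout `(q, [(b_1, p_1), ...], c)`, `None` for "no click", and reading "average position" as the average position of clicked ads. All data is made up.

```python
from collections import defaultdict

def map_log(record):
    """Emit one (q, (bid, price, position, clicked)) pair per ad slot.
    Positions are 1-based; c is the clicked position or None."""
    q, slots, c = record
    for i, (bid, price) in enumerate(slots, start=1):
        yield q, (bid, price, i, 1 if c == i else 0)

def reduce_query(values):
    """Summarize all slot records for one query."""
    clicked = [(price, pos) for _, price, pos, hit in values if hit]
    clicks = len(clicked)
    return {
        "clicks": clicks,
        "avg_position": sum(pos for _, pos in clicked) / clicks if clicks else None,
        "avg_cost_per_click": sum(p for p, _ in clicked) / clicks if clicks else None,
    }

def estimate(logs):
    """Group mapped records by query, then reduce each group."""
    groups = defaultdict(list)
    for record in logs:
        for q, value in map_log(record):
            groups[q].append(value)
    return {q: reduce_query(values) for q, values in groups.items()}

logs = [
    ("shoes", [(1.0, 0.75), (0.5, 0.5)], 1),
    ("shoes", [(1.25, 1.0), (0.5, 0.25)], 2),
    ("shoes", [(1.0, 0.75)], None),
    ("pizza", [(0.5, 0.25)], 1),
]
index = estimate(logs)
```

The resulting `index` plays the role of the per-query lookup table that an advertiser's request would consult; a fuller version would presumably also bucket by bid to estimate outcomes for a potential bid.
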
Web Traffic Analysis: Infrastructure
- Terabytes of data on 1000's of commodity machines.
- 100's of engineers running many analyses simultaneously any day.
  - Enormously successful at Google for machine learning, graph computing to index generation.
- Other infrastructure: BigTable, Stubby, ...
MapReduce was used for 29k jobs, dealt with 3k TB, 300+ programs, 79k machine days, in Aug 04, [OSDI04]

Summary

Web Traffic (Search Engine)
- 1000's of m/c's, GFS, MapReduce, Bigtable, ...
- Mainly systems.
- Large number of engineers/analysts
- PB/month; hours/days; Nearly all services.

IP Traffic (ISP)
- Optical splitters, NICs, stream mgmt engines.
- Alg/DB since 96. Mainly publ.
- Small/Moderate # of researchers
- TB/hour; min/hours/days; Detect attacks, appl.

Cellphone traffic (cellco)
- File system, script language, parallel CPUs.
- No publications
- Small team of analysts.
- TB/month; weekly/monthly; Reports.

Challenges
- Data cleaning
  - Build general infrastructure for data cleaning.
  - Ex: General system for SNMP data cleaning.
- Making IP stream analyses system distributed.
  - Short term: Distributed Gigascope/CMON.
  - Long term: Planet MapReduce for IP traffic analysis.
- Privacy in web data analysis.
  - Story: Where is my spouse?
  - Theory of approximate, private computing in theory and databases research.

Acknowledgements
- Thanks to
  - Anja Feldmann for guiding me through this talk.
  - Jon Feldman for help with MapReduce for sponsored search.
  - Nathan Hamilton for 5+ years of collaboration on cellular data analysis.
  - Ted, Oliver, and Divesh at AT&T for several years of joint work on Gigascope.
  - Supratik Bhattacharyya and Tao Ye at Sprint for joint work on CMON.
- Thanks to students and colleagues at Rutgers MassDAL.
- Thanks to colleagues at Google, Sprint, AT&T, Narus.