Post on 04-Jul-2015
@leekgroup
@simplystats
why you should care about statistics
Jeff Leek Johns Hopkins Bloomberg Biostatistics
jtleek@gmail.com
credits
• slides shamelessly borrowed from:
  – Ingo Ruczinski (Google: “ingo’s pond”)
  – Josh Akey (UW Genomics)
  – Karl Broman (Google: “the stupidest thing broman”)
why this stuff matters
seems like an exciting result!
http://www.nature.com/nm/journal/v12/n11/full/nm1491.html
stunning problems
how it went down
http://www.nature.com/news/2011/110111/full/469139a/box/1.html
still going on
worth a watch
http://www.birs.ca/events/2013/5-day-workshops/13w5083/videos/watch/201308141121-Baggerly.mp4
worth a read
http://www.iom.edu/Reports/2012/Evolution-of-Translational-Omics.aspx
what were the problems?
• irreproducibility
• lack of cooperation
• silly prediction rules
• study design/batch effects
• procedures not locked down
Expertise
Transparency
tip #1: know the analysis
http://bit.ly/OgW3xv
tip #2: care about the analysis
Drinkel et al. Organometallics 2013
tip #3: have a data/analysis sharing plan
http://www.nature.com/nature/journal/v467/n7314/full/467401b.html
tip #4: know where to get help
http://www.biostat.jhsph.edu/consult/
tip #5: no substitute for the real thing
“central dogma” of statistics
Adapted from Josh Akey
sample size
some experiment
example calculations
better technology ≠ no variability
http://www.nature.com/nbt/journal/v29/n7/full/nbt.1910.html
power
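A rough sketch of how sample size and power trade off when comparing two group means, using a normal approximation. The effect size, standard deviation, and significance level are illustrative assumptions, not numbers from the slides, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def power_two_groups(n_per_group, delta, sigma, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test with equal group sizes.

    delta is the assumed true difference in means, sigma the assumed
    common standard deviation (both are made-up inputs here).
    """
    z = NormalDist()
    se = sigma * math.sqrt(2.0 / n_per_group)  # SE of the difference in means
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = abs(delta) / se
    # Probability that the test statistic falls beyond either critical value.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


def n_for_power(delta, sigma, power=0.8, alpha=0.05):
    """Smallest per-group n reaching the target power (usual textbook approximation)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sigma / delta) ** 2)


if __name__ == "__main__":
    for n in (10, 20, 40, 80):
        print(n, round(power_two_groups(n, delta=0.5, sigma=1.0), 3))
    print("n per group for 80% power:", n_for_power(delta=0.5, sigma=1.0))
```

With no true difference (delta = 0) the "power" collapses to the significance level, which is a handy sanity check on any such calculation.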
bad study design
78% of genes differentially expressed
group and date “confounded”
uh-oh!
confounding:
association between shoe size and literacy in kids
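A minimal simulation of the shoe size and literacy example, assuming that age is the confounder driving both. All numbers (age range, slopes, noise levels) are invented for illustration. Shoe size and literacy never influence each other in the simulation, yet they correlate strongly until age is adjusted for.

```python
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def residuals(ys, xs):
    """Residuals of ys after a least-squares straight-line fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]


def simulate_kids(n=2000, seed=1):
    """Return (raw correlation, correlation after removing the age trend)."""
    rng = random.Random(seed)
    age = [rng.uniform(5, 12) for _ in range(n)]
    # Both outcomes depend on age only; there is no direct link between them.
    shoe = [20 + 1.5 * a + rng.gauss(0, 1) for a in age]
    literacy = [10 * a + rng.gauss(0, 10) for a in age]
    raw = pearson(shoe, literacy)
    adjusted = pearson(residuals(shoe, age), residuals(literacy, age))
    return raw, adjusted


if __name__ == "__main__":
    raw, adjusted = simulate_kids()
    print(f"raw correlation: {raw:.2f}, after adjusting for age: {adjusted:.2f}")
```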
proteomics
proteomics
gene expression
gene expression
gwas
gwas
confounding is a big deal
http://www.nature.com/nrg/journal/v11/n10/full/nrg2825.html
confounding and study design
tip #6: randomization
an example study
a bad design
stratified design
more good study characteristics
• Balanced
• Replicated
• Has Controls
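One way to get a randomized, balanced, stratified design is to shuffle the units within each stratum and then alternate the arms. The sketch below assumes a hypothetical study with two arms and one stratifying variable; the function name, sample IDs, and strata are made up, and the example study from the slides is not reproduced here.

```python
import random
from collections import Counter, defaultdict


def stratified_assignment(units, stratum_of, arms=("treatment", "control"), seed=0):
    """Randomly assign units to arms, balancing the arms within every stratum.

    units      -- hashable unit identifiers
    stratum_of -- function mapping a unit to its stratum label
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for unit in units:
        by_stratum[stratum_of(unit)].append(unit)

    assignment = {}
    for stratum in sorted(by_stratum):
        members = by_stratum[stratum]
        rng.shuffle(members)  # the randomization step
        for i, unit in enumerate(members):
            assignment[unit] = arms[i % len(arms)]  # alternate arms for balance
    return assignment


if __name__ == "__main__":
    # Hypothetical samples: (sample id, stratum label)
    samples = [(f"s{i:02d}", "male" if i < 10 else "female") for i in range(20)]
    result = stratified_assignment(samples, stratum_of=lambda s: s[1])
    print(Counter((sample[1], arm) for sample, arm in result.items()))
```

The same idea applies to processing dates or batches: spread each group evenly across them instead of running one group per day.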
tip #7: look at the data
http://en.wikipedia.org/wiki/Anscombe's_quartet
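A short sketch of why looking at the data matters: the four Anscombe data sets share nearly identical summary statistics although their scatterplots look nothing alike. The data values are the standard published quartet; the helper names are hypothetical.

```python
from statistics import mean, variance

# Anscombe's quartet: (x values, y values) for each of the four data sets.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
ANSCOMBE = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def summarize(xs, ys):
    """Summary statistics that hide the differences between the data sets."""
    return {
        "mean_x": mean(xs),
        "mean_y": mean(ys),
        "var_x": variance(xs),
        "var_y": variance(ys),
        "corr": pearson(xs, ys),
    }


if __name__ == "__main__":
    for name, (xs, ys) in ANSCOMBE.items():
        print(name, {k: round(v, 2) for k, v in summarize(xs, ys).items()})
```

Plotting each pair (for example with a scatterplot) is what reveals the curve, the outlier, and the single high-leverage point.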
summarizing data
http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/
replicates
watch the scale!
log transform is common/useful
bland-altman plots
http://en.wikipedia.org/wiki/Bland%E2%80%93Altman_plot
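A minimal sketch of the numbers behind a Bland-Altman plot for two replicate measurements of the same samples: the per-sample mean goes on the x-axis, the per-sample difference on the y-axis, with the mean difference (bias) and approximate 95% limits of agreement drawn as horizontal lines. The function name and the example measurements are hypothetical.

```python
from statistics import mean, stdev


def bland_altman(a, b):
    """Return (means, diffs, bias, (lower, upper)) for paired measurements a and b.

    The limits of agreement are bias +/- 1.96 * SD of the differences.
    """
    if len(a) != len(b):
        raise ValueError("a and b must be paired measurements of equal length")
    means = [(x + y) / 2 for x, y in zip(a, b)]  # x-axis of the plot
    diffs = [x - y for x, y in zip(a, b)]        # y-axis of the plot
    bias = mean(diffs)
    half_width = 1.96 * stdev(diffs)
    return means, diffs, bias, (bias - half_width, bias + half_width)


if __name__ == "__main__":
    # Made-up replicate measurements of four samples.
    rep1 = [10.0, 12.0, 14.0, 16.0]
    rep2 = [9.0, 13.0, 13.0, 17.0]
    means, diffs, bias, limits = bland_altman(rep1, rep2)
    print("bias:", bias, "limits of agreement:", limits)
```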
beware ridiculograms!
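ack! math!
Observations:
\[ X_1, \ldots, X_M \qquad Y_1, \ldots, Y_N \]
Averages:
\[ \bar{X} = \frac{1}{M} \sum_{i=1}^{M} X_i \qquad \bar{Y} = \frac{1}{N} \sum_{i=1}^{N} Y_i \]
SD² or variances:
\[ s_X^2 = \frac{1}{M-1} \sum_{i=1}^{M} (X_i - \bar{X})^2 \qquad s_Y^2 = \frac{1}{N-1} \sum_{i=1}^{N} (Y_i - \bar{Y})^2 \]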
an important issue
t-statistic: you’ll see this a lot*
\[ \frac{\bar{Y} - \bar{X}}{\sqrt{\frac{s_Y^2}{N} + \frac{s_X^2}{M}}} \]
* Invented to improve beer: http://en.wikipedia.org/wiki/Student's_t-test
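The statistic above can be computed directly from two lists of observations. This is a minimal sketch using only the standard library; the function name and the example values are made up.

```python
import math
from statistics import mean, variance


def t_statistic(x, y):
    """Two-sample t-statistic: (mean(y) - mean(x)) / sqrt(s_Y^2 / N + s_X^2 / M).

    statistics.variance is the sample variance (divides by n - 1),
    matching the s^2 definition with M - 1 and N - 1 in the denominator.
    """
    m, n = len(x), len(y)
    return (mean(y) - mean(x)) / math.sqrt(variance(y) / n + variance(x) / m)


if __name__ == "__main__":
    x = [1.0, 2.0, 3.0]  # made-up observations X_1..X_M
    y = [2.0, 3.0, 4.0]  # made-up observations Y_1..Y_N
    print(round(t_statistic(x, y), 4))
```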
p-values
Original Statistic
how to calculate
\[ \text{P-value} = \frac{\#\{\, |S_{\text{perm}}| \ge |S_{\text{obs}}| \,\}}{\#\text{ of Permutations}} \]
Observed Statistic = 2
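A sketch of the permutation calculation: shuffle the group labels many times, recompute the statistic each time, and report the fraction of permuted statistics at least as large in absolute value as the observed one. The difference in means is used as the statistic purely as an assumption for illustration (the slides do not say which statistic was permuted), and the function names and data are hypothetical.

```python
import random
from statistics import mean


def diff_in_means(x, y):
    """Illustrative test statistic S: difference in group means."""
    return mean(y) - mean(x)


def permutation_p_value(x, y, statistic=diff_in_means, n_perm=2000, seed=0):
    """P-value = #{|S_perm| >= |S_obs|} / # of permutations."""
    rng = random.Random(seed)
    s_obs = statistic(x, y)
    pooled = list(x) + list(y)
    m = len(x)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # randomly reassign the group labels
        s_perm = statistic(pooled[:m], pooled[m:])
        if abs(s_perm) >= abs(s_obs):
            hits += 1
    return hits / n_perm


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5, 6, 7, 8]          # made-up group 1
    y = [11, 12, 13, 14, 15, 16, 17, 18]  # made-up group 2
    print(permutation_p_value(x, y))
```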
tip #8: know what a p-value is(n’t)
The probability of observing a statistic that extreme if the null hypothesis is true.
The p-value is not
• Probability the null is true
• Probability the alternative is true
• A measure of statistical evidence
an easy mistake to make
a problem
a problem
a problem
multiple comparison error rates
• Family wise error rate: \( \Pr(\#\text{False Positives} \ge 1) \)
• False discovery rate: \( E\left[ \frac{\#\text{False Positives}}{\#\text{ of Discoveries}} \right] \)
• EFP (e-values): \( E[\#\text{False Positives}] \)
difference in interpretation
Suppose 550 out of 10,000 genes are significant at 0.05 level
• P-value < 0.05: Expect 0.05*10,000 = 500 false positives
• False Discovery Rate < 0.05: Expect 0.05*550 = 27.5 false positives
• Family Wise Error Rate < 0.05: The probability of at least 1 false positive ≤ 0.05
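The arithmetic above, plus a small simulation, can be sketched as follows. The simulation assumes the extreme case where every null hypothesis is true, so the p-values are uniform on (0, 1); the function names are hypothetical.

```python
import random


def expected_false_positives(n_tests, n_significant, level=0.05):
    """Expected false positives under the two interpretations of the same cutoff."""
    return {
        # Uncorrected p-value cutoff: level * (number of tests)
        "p_value_cutoff": level * n_tests,
        # FDR cutoff: level * (number of discoveries)
        "fdr_cutoff": level * n_significant,
    }


def count_null_hits(n_tests=10_000, level=0.05, seed=0):
    """Count p-values below `level` when all nulls are true (p ~ Uniform(0, 1))."""
    rng = random.Random(seed)
    return sum(rng.random() < level for _ in range(n_tests))


if __name__ == "__main__":
    print(expected_false_positives(n_tests=10_000, n_significant=550))
    print("hits with no real signal at all:", count_null_hits())
```

Even with no real signal, roughly 5% of the 10,000 tests cross the 0.05 cutoff, which is why 550 "significant" genes is not impressive on its own.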
read this
http://www.pnas.org/content/100/16/9440.long
the inevitable
http://simplystatistics.org/2013/08/26/statistics-meme-sad-p-value-bear/
why I’m sympathetic
beware of “hacking” statistics
be nice to the poor statistician
tip #9: correlation and causation
http://xkcd.com/552/
most common mistake
Fit regression models (correlations) followed by:
“In summary, our results support a causal relationship of breastfeeding in infancy with receptive language at age 3 and with verbal and nonverbal IQ at school age. These findings support national and international recommendations to promote exclusive breastfeeding through age 6 months and continuation of breastfeeding through at least age 1 year.”
prediction and association
diagnostics
tip #10: know these quantities
key quantities as fractions
important to keep in mind
general population
general population
at risk subpopulation
at risk subpopulation
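The slide figures are not recoverable here, so the following is only a sketch under assumptions: it takes sensitivity, specificity, and positive predictive value as the diagnostic quantities in question, and uses invented prevalence and accuracy numbers to show why the same test behaves differently in a general population and in an at-risk subpopulation.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), via Bayes' rule written as a fraction.

    true positives  = sensitivity * prevalence
    false positives = (1 - specificity) * (1 - prevalence)
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


if __name__ == "__main__":
    # Made-up test: 99% sensitive and 99% specific.
    for label, prevalence in (("general population", 0.001),
                              ("at risk subpopulation", 0.10)):
        ppv = positive_predictive_value(0.99, 0.99, prevalence)
        print(f"{label}: prevalence={prevalence}, PPV={ppv:.3f}")
```

With these assumed numbers, most positives in the low-prevalence group are false positives, while most positives in the high-prevalence group are real.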
summary of tips
1. know the analysis
2. care about the analysis
3. have a data sharing plan
4. know where/when to get help
5. this isn’t a substitute for learning statistics
6. randomize in your study design
7. look at your data
8. know what p-values are(n’t)
9. beware causality creep
10. know the key diagnostic quantities