TechnicalIssuesinAggregatingandAnalyzingDatafrom
HeterogeneousEHRSystemsJosh Denny, MD, MS
[email protected] University, Nashville, Tennessee, USA
2/12/2015
EHRdataaredense
196,693 individuals in an EHR DNA Biobank (BioVU)• Mean follow‐up – 5.7 yrs• Distinct ICD9 codes – 19 million• Labs – 121 million
– Distinct labs – 5948– Avg labs/patient – 662
• Drugs – 122 million• Notes – 26 million (average 132 notes/individual)• Radiology tests – 2 million
Identify phenotype of interest
Case & control algorithm
development and refinement
Manual review; assess
precision
Deploy at site 1 Genetic
association tests;
replicate
PPV≥95%
PPV<95%
ApproachtoEHRphenotyping
Validate at other sites
Extant Genotypes
Clinical Notes(NLP - natural language
processing)Billing codesICD9 & CPT
MedicationsePrescribing
& NLPLabs & test results
NLP
Whatwe’velearned‐ FindingphenotypesintheEMR
True cases
Findingcases:RheumatoidArthritis
255 507 1184
Definite Cases(algorithm-defined)
Possible Cases(require manual review)
Controls(algorithm-defined)
Excluded(algorithm-defined)
7121
Analysis
Optional Manual Review
ReplicatingknownstudiesintheEHR
0.5 5.01.0Odds Ratio
rs2200733 Chr. 4q25rs10033464 Chr. 4q25rs11805303 IL23Rrs17234657 Chr. 5rs1000113 Chr. 5rs17221417 NOD2rs2542151 PTPN22rs3135388 DRB1*1501rs2104286 IL2RArs6897932 IL7RArs6457617 Chr. 6rs6679677 RSBN1rs2476601 PTPN22rs4506565 TCF7L2rs12255372 TCF7L2rs12243326 TCF7L2rs10811661 CDKN2Brs8050136 FTOrs5219 KCNJ11rs5215 KCNJ11rs4402960 IGF2BP2
Atrial fibrillation
Crohn's disease
Multiple sclerosis
Rheumatoid arthritis
Type 2 diabetes
disease gene / regionmarker
2.0
Am J Hum Genet. 2010;86:560‐72.
observedpublished
DiscoveryscienceineMERGE
Am J Hum Genet. 2011;89:529-42
Algorithms can be deployed
across multiple EMRs
Analyses can be performed using extant
data
CompletedeMERGEGWASDiseases• Dementia• Cataracts• Autoimmune Hypothyroidism• Diverticulosis/diverticulitis• Type 2 Diabetes• Diabetic retinopathy • Herpes zoster• PheWAS• Peripheral Arterial Disease• Venous Thromboembolism • Glaucoma• Ocular hypertension• Abdominal Aortic Aneurysm • Colon polyps
Endophenotypes• PR Duration• QRS Duration• HDL/LDL• height • white blood cell counts• red blood cell counts• Cardiorespiratory Fitness• ESR levels• Platelet levels
Pharmacogenomic phenotypes• ACE inhibitor cough• Heparin induced thrombocytopenia• Resistant hypertension• Drug Induced Liver Injury• C. difficile colitisbold=GWAS completed with
significant results
Selected consortia contributions• Height• QTc• Rheum. Arthritis• Myocardial Infarction
Genetics Consortium• Intl. Mult Sclerosis Genet.
Consort.• Genomic Investigation of
Statin Therapy
85 phenotypes from eMERGE, PGRN, PCORnet47 have validation data118 total implementations
Hypothyroidismalgorithm
Performanceof88PhenotypeAlgorithmsinPheKB
0%
20%
40%
60%
80%
100%
Primary site Secondary sites
Positiv
e Pred
ictiv
e Va
lue
Positive Predict Value
Site Implementations
MedianDrug-induced
liver injury
The phenome‐wide association study
Target phenotype
chromosomal locationasso
ciat
ion
P va
lue
Target genotype
diagnosis code
asso
ciat
ion
P va
lue
PheWAS requirement: A large cohort of patients with genotype data and many diagnoses
The genome‐wide association studyExample new PheWAS associations for IRF4
Known: hair, skin, eye color
Phenotype Cases ControlsClopidogrel in CV disease 225 468Warfarin stable dose 1,167 N/AEarly Repolarization 544 2,609Vancomycin stable dose 1,067 N/AC. difficile colitis 941 1,710Anthracycline cardiomyopathy 528 N/AGuillain-Barre Syndrome 97 6,536Heart Transplant 181 N/AKidney transplant 1,078 N/AClopidogrel in strokes/TIAs 6 123Statin-related myopathy 11 4,342Heparin-induced thrombocytopenia 73 2,300CV events with COX2 therapy 85 395Serious bleeding during warfarin 259 276Amiodarone toxicity (lung, thyroid) 97 343Chronic inflammatory polyneuropathy 12 14,000*
Rheumatic Heart Disease 108 3,464ACEi cough 1,174 978Fluoroquinolones and tenopathy 87 537Warfarin stable dose in children 92 N/AMetformin efficacy 80 N/AMetformin and cancer 619 421Bisphosphonates and Atypical Fracture/Jaw Osteonecrosis 16 1,454
Wolff-Parkinson-White 197 5,551Steroid-induced Osteonecrosis 83 352Shellfish Anaphylaxis 157 14,000*Aspirin Anaphylaxis 101 4,334Bell's Palsy# 577 14,000*
Studyingdrugresponseswith
GWAS
Bowton et al., Sci Trans Med. 2014
“Only” about 120,000 samples at time of study – underpowered for many rare outcomes
90% participated in >1 study
Strengths
• Rich, longitudinal data stores • Ability to go back to the chart to find out more• Research‐quality phenotypes available via algorithms
• Potential for closed‐loop discovery and implementation
• Expensive testing available “for free”• Ability to explore rare, detailed, drug‐response, and mortal phenotypes
• Samples easily reused for many studies
Challenges
• Developing algorithms takes time and people, and then implementation requires local expertise
• EHR data can be inaccurate, heterogeneous, unavailable, lack organization, have different storage structures
• Fragmentation between healthcare systems• Mining of EHR data is not trivial (though improving): text data, duration and temporality
Howdoyousharegeneticdata?
Site 5
Site 1
Site 2
Site 3Site 4
Edges (unique DUAs): n(n‐1)/2 = 10
Site 5
Site 1
Site 2
Site 3Site 4
Edges: n = 5
Coordinating Center
10 sites = 45 vs. 1020 sites = 190 vs. 2030 sites = 435 vs. 30
Coordinating Center
: pediatric sites
Kaiser Permanente
Network DNA samples GWAS
eMERGE 361k 51k (100k)
Million Veterans Program 350k 200k
Kaiser Permanente 300k 100k
Total >1 million >351k
Top Related