頁尾文字 2015/9/18 頁尾文字 1 Data Mining for Healthcare Documents 陳啟煌...

Post on 28-Dec-2015


112/04/19頁尾文字1

Data Mining for Healthcare Documents

陳啟煌, Programming Division, Computer and Information Networking Center, National Taiwan University

2011.10.27

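About Me

陳啟煌

Education
– Computer Science, National Chiao Tung University; Computer Science, National Taiwan University; Electrical Engineering, National Taiwan University

Experience
– 興匯財務顧問公司 (a financial consulting firm); Computer and Information Networking Center, National Taiwan University

Email: vinchen@ntu.edu.tw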


Outline

Introduction
Biomedical Semantic Similarity Measure
Semantic-driven Keyword Matching Extractor
Web-based Discharge Summary System
Healthcare Mining Project with Mongolia
Conclusions and Future Works


Clinical Mining

Clinical Database

Clinical Pathways

Introduction

In the IOM 2000 report, 44,000 to 98,000 unnecessary deaths per year
– Death rate equivalent to three jumbo jets crashing every two days
– Motor vehicle accidents: 43,458
– Breast cancer: 42,297
– AIDS: 16,516

Suggested Solutions

Development of IT infrastructures
– Computerized Physician Order Entry (CPOE)
• Order sets: make it easier to do the right thing.
• Alerts / reminders
• Clinical guidelines
Restriction on working hours
Greater staff-to-patient ratios


Motivation

Clinical Pathway
– A way of treating a patient with a standardized procedure in order to
• Enhance efficiency,
• Increase quality,
• Lower costs,
• Shorten the length of stay in hospital.

Usually represented as a script book and/or a flow chart diagram

Order Sets System Evolution

Paper Order Sets
– Predefined orders written on paper.

Electronic Order Sets
– Just a UI to create and look up order sets

Knowledge-based Order Sets
– Machine learning
– Interactive UI for the user.

How to Create Order Sets

Committee
– Traditional method, time-consuming

Feedback system
– Interaction with users, suggestions

Data mining
– Find patterns in existing clinical data

Raw Data


Introduction


Motivation

Free-Text Reports
– Discharge summaries
– Radiology reports
– Pathology reports
– The treatments they contain can be extracted and learned from to gain knowledge


Motivation

Semantically similar biomedical terms exist in medical reports.
– "congestive heart failure", "cardiac decompensation", and "volume overload"


Approaches

Biomedical Semantic Similarity Measure
– Calculate semantic similarity between terms

A Powerful Extractor
– To view, verify, and extract data items from reports

Structuralized
– Providing a highly interactive editor
• Auto-complete
• Model essay
• User phrases


Biomedical Semantic Similarity Measure


Introduction(1/4)

Ontology-based techniques
– Ontology tree
• Single ontology
• Cross ontology
– Path length, edge counting

Corpus-based techniques
– Context vector measure, latent semantic analysis (LSA)


Introduction(2/4)

The Web Corpus
– The Web provides unprecedented access to information and interacts with people's daily lives.
– The idea of using the Web as a corpus for NLP research is becoming popular.


Introduction(3/4)

How can each document on the Web be analyzed directly?


Introduction(4/4)

Web search engines
– Efficient interface
– Numerous documents & high growth rate
– Google: page count


Background and Related Work

Ontology-based techniques
– Single ontology
• Edge counting
• Information content
• Feature based
• Hybrid
– Cross ontology
• Hliaoutakis et al.
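To make the edge-counting family concrete, here is a minimal sketch of two single-ontology measures that the later comparison tables name (path length and Wu & Palmer). The toy taxonomy and all function names are illustrative assumptions, not the ontology or code used in the talk.

```python
# Hypothetical toy is-a taxonomy (child -> parent); not taken from the slides.
PARENT = {
    "disease": None,
    "heart disease": "disease",
    "lung disease": "disease",
    "myocardial infarction": "heart disease",
    "myocardial ischemia": "heart disease",
    "pneumonia": "lung disease",
}

def path_to_root(term):
    """Return the nodes from term up to the root, inclusive."""
    path = []
    while term is not None:
        path.append(term)
        term = PARENT[term]
    return path

def depth(term):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(term))

def lowest_common_subsumer(a, b):
    """Deepest node that is an ancestor of (or equal to) both terms."""
    ancestors_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is the deepest
        if node in ancestors_a:
            return node
    raise ValueError("terms share no ancestor")

def edge_count(a, b):
    """Number of edges on the shortest path between a and b in the tree."""
    lcs = lowest_common_subsumer(a, b)
    return (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))

def path_similarity(a, b):
    """Path length measure: inverse of the number of nodes on the shortest path."""
    return 1.0 / (edge_count(a, b) + 1)

def wu_palmer(a, b):
    """Wu & Palmer: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))
```

Both measures depend only on the shape of the tree, which is why they are sensitive to how finely a given ontology is subdivided.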


Methodologies

Sample Construction
Feature Definitions
Feature Selection Strategy
Machine Learning Model
– Support Vector Machine Model


Sample Construction(1/3)


Sample Construction(2/3)


Sample Construction(3/3)

In our study, we collect
– 1500 synonymous term pairs
– 1500 non-synonymous term pairs


Feature Definitions(1/4)

Features
– Co-occurrence
• [three page-count formulas, not preserved in this transcript]


Feature Definitions(2/4)

Features
– Co-occurrence
• [formula not preserved in this transcript]
– Semantic distance
• [formula not preserved in this transcript]
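The feature ranking later in the slides names four co-occurrence features (WebJaccard, WebOverlap, WebDice, WebPMI) and one semantic distance (NGD). The sketch below assumes the standard page-count definitions of these measures from the web-similarity literature, where H(P), H(Q) and H(P ∩ Q) are the page counts for P, Q and both together, and N is the number of indexed pages. The talk's exact variants (for example, any noise threshold on H(P ∩ Q)) are not recoverable from the transcript, and the counts in the example are made up.

```python
import math

def web_jaccard(h_p, h_q, h_pq):
    """H(P∩Q) / (H(P) + H(Q) - H(P∩Q))."""
    return 0.0 if h_pq == 0 else h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p, h_q, h_pq):
    """H(P∩Q) / min(H(P), H(Q))."""
    return 0.0 if h_pq == 0 else h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq):
    """2 * H(P∩Q) / (H(P) + H(Q))."""
    return 0.0 if h_pq == 0 else 2.0 * h_pq / (h_p + h_q)

def web_pmi(h_p, h_q, h_pq, n):
    """log2 of the joint page probability over the product of the marginals."""
    if h_pq == 0:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))

def ngd(h_p, h_q, h_pq, n):
    """Normalized Google Distance; smaller means more closely related."""
    if h_pq == 0:
        return float("inf")
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))

# Made-up page counts for illustration only.
H_P, H_Q, H_PQ, N = 1000, 500, 250, 10**10
scores = {
    "WebJaccard": web_jaccard(H_P, H_Q, H_PQ),
    "WebOverlap": web_overlap(H_P, H_Q, H_PQ),
    "WebDice": web_dice(H_P, H_Q, H_PQ),
    "WebPMI": web_pmi(H_P, H_Q, H_PQ, N),
    "NGD": ngd(H_P, H_Q, H_PQ, N),
}
```

All five values need only three search-engine queries per term pair, which is what makes page counts attractive as a corpus substitute.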


Feature Definitions(3/4)

"Apoptosis known as programmed cell death"

The phrase "known as" indicates a synonymous relationship between apoptosis and programmed cell death.

"Apoptosis known as programmed cell death"
– Google page count: 141

"Isoflavone known as Cyclooxygenase"
– Google page count: 0


Feature Definitions(4/4)

Features
– Lexico-syntactic pattern
• P known as Q: H( P known as Q )/H( P ∩ Q )
• of P (Q)
• P (Q)
• and P (Q
• , P (Q
• against P (Q
• prevalence of P Q
• patients with P Q
• P/Q
• P, Q
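A minimal sketch of how one pattern feature per template might be computed. It assumes every pattern is normalized the same way as the one formula shown, H(pattern) / H(P ∩ Q), which the slide confirms only for "P known as Q". The `page_count` stub, the query quoting, and the co-occurrence count are hypothetical; 141 is the page count quoted on the previous slide.

```python
# Pattern templates copied from the slide; {p} and {q} stand for the two terms.
PATTERNS = [
    "{p} known as {q}", "of {p} ({q})", "{p} ({q})", "and {p} ({q}",
    ", {p} ({q}", "against {p} ({q}", "prevalence of {p} {q}",
    "patients with {p} {q}", "{p}/{q}", "{p}, {q}",
]

# Stand-in for a search-engine lookup. 141 is the count quoted on the slide;
# the co-occurrence count below is a made-up number for illustration.
FAKE_COUNTS = {
    '"apoptosis known as programmed cell death"': 141,
    '"apoptosis" "programmed cell death"': 100000,
}

def page_count(query):
    """Hypothetical page-count lookup; unknown queries count as zero hits."""
    return FAKE_COUNTS.get(query, 0)

def pattern_feature(template, p, q, count=page_count):
    """H(template filled with P and Q) / H(P ∩ Q), or 0 if P and Q never co-occur."""
    cooccur = count(f'"{p}" "{q}"')
    if cooccur == 0:
        return 0.0
    phrase = template.format(p=p, q=q)
    return count(f'"{phrase}"') / cooccur

features = [pattern_feature(t, "apoptosis", "programmed cell death") for t in PATTERNS]
```

Dividing by H(P ∩ Q) keeps frequent term pairs from dominating simply because they appear on more pages.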


Feature Selection Strategy

Rank the features according to their ability to express synonymy by F-score:
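The formula itself did not survive in the transcript. The sketch below assumes the F-score commonly used for feature selection alongside LIBSVM: for each feature, the squared distances of the positive-class and negative-class means from the overall mean, divided by the sum of the two within-class sample variances. Function names and the toy data are illustrative.

```python
from statistics import mean, variance

def f_score(pos, neg):
    """F-score of one feature from its values on positive and negative samples."""
    overall = mean(pos + neg)
    between = (mean(pos) - overall) ** 2 + (mean(neg) - overall) ** 2
    within = variance(pos) + variance(neg)  # sample variances (n - 1 denominator)
    return between / within if within else 0.0

def rank_features(samples, labels):
    """samples: list of {feature: value}; labels: +1 synonymous, -1 non-synonymous."""
    scores = {}
    for name in samples[0]:
        pos = [s[name] for s, y in zip(samples, labels) if y == 1]
        neg = [s[name] for s, y in zip(samples, labels) if y == -1]
        scores[name] = f_score(pos, neg)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up values: feature "a" separates the classes, feature "b" does not.
samples = [
    {"a": 0.9, "b": 0.5}, {"a": 0.8, "b": 0.4}, {"a": 1.0, "b": 0.6},
    {"a": 0.1, "b": 0.5}, {"a": 0.2, "b": 0.6}, {"a": 0.0, "b": 0.4},
]
labels = [1, 1, 1, -1, -1, -1]
ranking = rank_features(samples, labels)
```

A larger score means the feature discriminates better between the two classes on its own; it does not capture interactions between features.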


Support Vector Machine Model(1/2)


Support Vector Machine Model(2/2)

LIBSVM 2.89
– C-SVC
• Linear
• Polynomial degree=2
• Polynomial degree=3
• RBF
– nu-SVC
• Linear
• Polynomial degree=2
• Polynomial degree=3
• RBF
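As a sketch, the eight configurations above map onto LIBSVM's `svm-train` options as follows: `-s` selects the SVM type (0 = C-SVC, 1 = nu-SVC), `-t` the kernel (0 = linear, 1 = polynomial, 2 = RBF), and `-d` the polynomial degree. The file names are hypothetical, and the remaining hyperparameters (C, nu, gamma) are left at their defaults because the slides do not state them.

```python
SVM_TYPES = {"C-SVC": 0, "nu-SVC": 1}  # values for svm-train -s
KERNELS = {                            # values for svm-train -t (and -d)
    "Linear": ["-t", "0"],
    "Poly=2": ["-t", "1", "-d", "2"],
    "Poly=3": ["-t", "1", "-d", "3"],
    "RBF": ["-t", "2"],
}

def train_commands(train_file="train.scaled", model_dir="models"):
    """Build one svm-train argument list per (SVM type, kernel) combination."""
    commands = {}
    for svm_name, svm_code in SVM_TYPES.items():
        for kernel_name, kernel_opts in KERNELS.items():
            # Hypothetical model file naming; "=" is dropped to keep names simple.
            model = f"{model_dir}/{svm_name}_{kernel_name.replace('=', '')}.model"
            commands[f"{svm_name}({kernel_name})"] = [
                "svm-train", "-s", str(svm_code), *kernel_opts, train_file, model,
            ]
    return commands

commands = train_commands()
```

Each argument list can be passed to `subprocess.run` once LIBSVM is installed and the training file exists.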


Datasets(1/5)

Concept 1 Concept 2 Human

Anemia Appendicitis 0.031

Dementia Atopic Dermatitis 0.062

Bacterial Pneumonia Malaria 0.156

Osteoporosis Patent Ductus Arteriosus 0.156

Amino Acid Sequence Anti-Bacterial Agents 0.156

Acquired Immunodeficiency Syndrome Congenital Heart Defects 0.062

Otitis Media Infantile Colic 0.156

Meningitis Tricuspid Atresia 0.031

Sinusitis Mental Retardation 0.031

Hypertension Kidney Failure 0.5

Hyperlipidemia Hyperkalemia 0.156

Hypothyroidism Hyperthyroidism 0.406

Sarcoidosis Tuberculosis 0.406

Vaccines Immunity 0.593

Asthma Pneumonia 0.375

Table 1: Dataset 1 of 36 medical term pairs


Datasets(2/5)

Concept 1 Concept 2 Human

Diabetic Nephropathy Diabetes Mellitus 0.5

Lactose Intolerance Irritable Bowel Syndrome 0.468

Urinary Tract Infection Pyelonephritis 0.656

Neonatal Jaundice Sepsis 0.187

Sickle Cell Anemia Iron Deficiency Anemia 0.437

Psychology Cognitive Science 0.593

Adenovirus Rotavirus 0.437

Migraine Headache 0.718

Myocardial Ischemia Myocardial Infarction 0.75

Hepatitis B Hepatitis C 0.562

Carcinoma Neoplasm 0.75

Pulmonary Valve Stenosis Aortic Valve Stenosis 0.531

Failure To Thrive Malnutrition 0.625

Breast Feeding Lactation 0.843

Antibiotics Antibacterial Agents 0.937

Table 1: Dataset 1 of 36 medical term pairs


Datasets(3/5)

  Concept 1 Concept 2 Human

Seizures Convulsions 0.843

Pain Ache 0.875

Malnutrition Nutritional Deficiency 0.875

Measles Rubeola 0.906

Chicken Pox Varicella 0.968

Down Syndrome Trisomy 21 0.875

Table 1: Dataset 1 of 36 medical term pairs


Datasets(4/5)

Concept 1 Concept 2 Physician Expert

Renal Failure Kidney Failure 4 4

Heart Myocardium 3.3 3

Stroke Infarct 3 2.8

Abortion Miscarriage 3 3.3

Delusion Schizophrenia 3 2.2

Congestive Heart Failure Pulmonary Edema 3 1.4

Metastasis Adenocarcinoma 2.7 1.8

Calcification Stenosis 2.7 2

Diarrhea Stomach Cramps 2.3 1.3

Mitral Stenosis Atrial Fibrillation 2.3 1.3

Chronic Obstructive Pulmonary Disease Lung Infiltrates 2.3 1.9

Rheumatoid Arthritis Lupus 2 1.1

Brain Tumor Intracranial Hemorrhage 2 1.3

Carpal Tunnel Syndrome Osteoarthritis 2 1.1

Diabetes mellitus Hypertension 2 1

Table 2: Dataset 2 of 30 medical term pairs


Datasets(5/5)

Concept 1 Concept 2 Physician Expert

Acne Syringe 1.7 1.2

Antibiotic Allergy 1.7 1

Cortisone Total Knee Replacement 1.7 1.2

Pulmonary Embolus Myocardial Infarction 1.7 1.4

Pulmonary Fibrosis Lung Cancer 1.3 1

Cholangiocarcinoma Colonoscopy 1.3 1

Lymphoid Hyperplasia Laryngeal Cancer 1 1

Multiple sclerosis Psychosis 1 1

Appendicitis Osteoporosis 1 1

Rectal Polyp Aorta 1 1

Xerostomia Alcoholic Cirrhosis 1 1

Peptic Ulcer Disease Myopia 1 1

Depression Cellulitis 1 1

Varicose Vein Entire Knee Meniscus 1 1

Hyperlipidemia Metastasis 1 1

Table 2: Dataset 2 of 30 medical term pairs


Experiment Results

Rank Feature F(i)

1 NGD 0.2751

2 WebPMI 0.237

3 , X (Y 0.1648

4 X/Y 0.1632

5 X(Y) 0.1606

6 X, Y 0.1585

7 WebOverlap 0.1173

8 WebDice 0.0555

9 WebJaccard 0.0347

10 of X (Y) 0.0185

11 and X (Y 0.0093

12 against X (Y 0.0027

13 patients with X Y 0.0017

14 X known as Y 0.0014

15 prevalence of X Y 0.0011


Experiment Results

Figure 3.4(a): Correlation vs. No of features and training samples using C-SVC with linear kernel


Experiment Results

Figure 3.4(b): Correlation vs. No of features and training samples using C-SVC with polynomial degree=2 kernel


Experiment Results

Figure 3.4(c): Correlation vs. No of features and training samples using C-SVC with polynomial degree=3 kernel


Experiment Results

Figure 3.4(d): Correlation vs. No of features and training samples using C-SVC with RBF kernel


Experiment Results

Figure 3.5(a): Correlation vs. No of features and training samples using nu-SVC with linear kernel


Experiment Results

Figure 3.5(b): Correlation vs. No of features and training samples using nu-SVC with polynomial degree=2 kernel


Experiment Results

Figure 3.5(c): Correlation vs. No of features and training samples using nu-SVC with polynomial degree=3 kernel


Experiment Results

Figure 3.5(d): Correlation vs. No of features and training samples using nu-SVC with RBF kernel


Experiment Results

Model  Maximum correlation  Number of samples  Number of features

C-SVC(Linear) 0.758 1500 9

C-SVC(Poly=2) 0.776 1200 7

C-SVC(Poly=3) 0.759 300 13

C-SVC(RBF) 0.612 1100 10

nu-SVC(Linear) 0.798 900 7

nu-SVC(Poly=2) 0.766 300 11

nu-SVC(Poly=3) 0.736 300 12

nu-SVC(RBF) 0.743 100 11


Experiment Results

Table 5: Correlation with Dataset 1 and with Dataset 2 (physician scores and expert scores) for different models

Model  Dataset 1  Dataset 2(Phy)  Dataset 2(Exp)

C-SVC(Linear) 0.758 0.689 0.482

C-SVC(Poly=2) 0.776 0.698 0.479

C-SVC(Poly=3) 0.759 0.649 0.395

C-SVC(RBF) 0.612 0.388 0.171

nu-SVC(Linear) 0.798 0.705 0.496

nu-SVC(Poly=2) 0.766 0.671 0.424

nu-SVC(Poly=3) 0.736 0.641 0.384

nu-SVC(RBF) 0.743 0.632 0.373
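Each cell in these tables is a correlation between a model's similarity scores and the human ratings for the same term pairs. The slides do not say which coefficient was used; the sketch below assumes Pearson's r. The human ratings are five values from Dataset 1, while the model scores are made-up numbers for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

# Human ratings copied from Dataset 1 in the slides.
human = {
    ("Anemia", "Appendicitis"): 0.031,
    ("Hypertension", "Kidney Failure"): 0.5,
    ("Migraine", "Headache"): 0.718,
    ("Measles", "Rubeola"): 0.906,
    ("Chicken Pox", "Varicella"): 0.968,
}
# Hypothetical model outputs for the same pairs (not results from the talk).
model = {
    ("Anemia", "Appendicitis"): 0.10,
    ("Hypertension", "Kidney Failure"): 0.40,
    ("Migraine", "Headache"): 0.65,
    ("Measles", "Rubeola"): 0.95,
    ("Chicken Pox", "Varicella"): 0.90,
}
pairs = sorted(human)
r = pearson([human[p] for p in pairs], [model[p] for p in pairs])
```

Because Pearson's r is invariant to linear rescaling, the 0 to 1 ratings of Dataset 1 and the 1 to 4 ratings of Dataset 2 can be compared against the same model output without normalization.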


Result comparison

Table 3.4: Result comparison for Dataset 1

Measure Dataset 1

SemDist 0.726(2)

Path length 0.422(5)

Leacock & Chodorow 0.600 (3)

Wu & Palmer 0.498(4)

Proposed 0.798 (1)


Result comparison

Table 3.5: Results comparison for Dataset 2

Measure  Dataset 2 (Physician)  Dataset 2 (Expert)

Path length  0.512(4)  0.731(2)

Leacock & Chodorow  0.358(7)  0.497(5)

Lin  0.522(3)  0.565(4)

Resnik  0.534(2)  0.61(3)

Jiang & Conrath  0.506(5)  0.741(1)

Vector(All sect, 1M notes)  0.436(6)  0.497(5)

Proposed  0.705(1)  0.496(6)


Semantic-driven Keyword Matching Extractor


Introduction

For Structuralized Clinical Data
– Data can be directly exported for further analysis and mining

For Non-structuralized Clinical Data
– Data need to be further processed to extract the relevant information


Background and Related works

Marking concepts and related semantics
– Cancer Text Information Extraction System (caTIES)

Extracting data items and filling the outcomes into a predefined template
– IBM Watson Research Center & Mayo Clinic

Providing a verification user interface
– Commercial natural language processing (NLP) engines


Architecture

Apply match pattern on textual reports

Send matching profile

Review and verify matched information

Clinical data warehouse

Textual clinical reports

Matching metadata

Retrieve keyword list

Select keyword

Retrieve matching profile

Store structuralized data

Case-oriented template schema

Keyword selection interface

Information matching modules

Textual documents viewer

Extraction verification editor


Methodology

The default common keyword lists for each type of textual document

The personal keyword lists
– Matching the keyword and the keywords with related semantics
– Mapping the corresponding matching rules using the retrieved matching pattern and applying the matching rules to the textual reports
– Date: 2009/01/01, 12/01
– Size: "4.9 x 1 x 1.8" (length x width x height)
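The two example rules above (date and size) can be sketched as regular expressions. This is only an illustration of what a matching rule might look like; the extractor's actual rule format is not shown in the slides, and the sample report text is made up.

```python
import re

# Dates such as 2009/01/01 (year/month/day) or 12/01 (month/day).
DATE_RE = re.compile(r"\b(?:(\d{4})/)?(\d{1,2})/(\d{1,2})\b")

# Sizes such as "4.9 x 1 x 1.8", read as length x width x height.
NUM = r"\d+(?:\.\d+)?"
SIZE_RE = re.compile(rf"({NUM})\s*x\s*({NUM})\s*x\s*({NUM})", re.IGNORECASE)

def extract(text):
    """Return the date strings and (length, width, height) tuples found in text."""
    dates = [m.group(0) for m in DATE_RE.finditer(text)]
    sizes = [tuple(float(g) for g in m.groups()) for m in SIZE_RE.finditer(text)]
    return {"dates": dates, "sizes": sizes}

# Hypothetical report sentence for illustration.
report = "Specimen received 2009/01/01, follow-up on 12/01. Tumor size: 4.9 x 1 x 1.8 cm."
found = extract(report)
```

In a keyword-driven extractor, a rule like this would be applied only near the matched keyword (for example "size"), which limits false positives from unrelated numbers.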


Result


Result


Discharge Summary System


Background

Old discharge summary system (Dis32)
– Client/server architecture
– Install/upgrade client applications

Web discharge summary system
– Service-oriented architecture
– Online since 2009.10


Motivation

Discharge summary user interface
– Chief Complaint, Brief History
– Free-text field
– How to generate a list of suggested phrases


Motivation

Auto-Complete


Language Modeling

We want to compute P(w1,w2,w3,w4,w5…wn), the probability of a sequence

Alternatively we want to compute P(w5|w1,w2,w3,w4): the probability of a word given some previous words

The model that computes P(W) or P(wn|w1,w2…wn-1) is called the language model.
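As a worked illustration, the chain rule factors P(w1…wn) into a product of conditional probabilities, and a bigram model approximates each factor by P(wi | wi-1). The sketch below estimates these by maximum likelihood with no smoothing; the three training sentences are made up and are not from the hospital corpus.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, with <s> and </s> marking sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        contexts.update(tokens[:-1])              # every token that has a successor
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def cond_prob(word, prev, contexts, bigrams):
    """Maximum-likelihood estimate of P(word | prev)."""
    if contexts[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / contexts[prev]

def sentence_prob(sentence, contexts, bigrams):
    """P(w1..wn) under the bigram approximation of the chain rule."""
    tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= cond_prob(word, prev, contexts, bigrams)
    return prob

# Made-up chief-complaint phrases for illustration.
corpus = [
    "chest pain for two days",
    "chest pain since yesterday",
    "abdominal pain for two days",
]
contexts, bigrams = train_bigram(corpus)
p = sentence_prob("chest pain for two days", contexts, bigrams)
```

Here p = (2/3) · 1 · (2/3) · 1 · 1 · 1 = 4/9. A practical model would add smoothing so that unseen word pairs do not force the whole product to zero.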


SRILM

SRILM
– The SRI Language Modeling Toolkit
– SRILM is a toolkit for building and applying statistical language models (LMs)
– http://www.speech.sri.com/projects/srilm/


SRILM

Three Main Functionalities
– Generate the n-gram count file from the corpus
– Train the language model from the n-gram count file
– Calculate the test data perplexity using the trained language model
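The three steps correspond to SRILM's `ngram-count` and `ngram` command-line tools. The sketch below only assembles the argument lists; the file names and the n-gram order of 3 are assumptions, since the slides do not state them, and no smoothing options are shown.

```python
def srilm_commands(train="train.txt", test="test.txt", order=3):
    """Argument lists for the count, train, and perplexity steps."""
    counts, lm = "train.count", "train.lm"  # hypothetical intermediate file names
    return [
        # 1. Generate the n-gram count file from the corpus.
        ["ngram-count", "-text", train, "-order", str(order), "-write", counts],
        # 2. Train the language model from the n-gram count file.
        ["ngram-count", "-read", counts, "-order", str(order), "-lm", lm],
        # 3. Compute the perplexity of the test data under the trained model.
        ["ngram", "-ppl", test, "-order", str(order), "-lm", lm],
    ]

steps = srilm_commands()
```

Each list can be handed to `subprocess.run(step, check=True)` on a machine where SRILM is installed and on the PATH.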


Implementation

N-gram Count File
– Chief Complaint, Brief History

Static
– Phrase lists

Dynamic
– AJAX + AutoComplete toolkit
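A minimal sketch of the server-side lookup that a dynamic auto-complete request might trigger: given the previous word and the characters typed so far, return the most frequent continuations from bigram counts. The data structure, the function names, and the sample notes are assumptions for illustration; the deployed system's design is not described in the slides.

```python
from collections import Counter, defaultdict

def build_index(notes):
    """Map each word to a Counter of the words that follow it."""
    following = defaultdict(Counter)
    for note in notes:
        tokens = note.lower().split()
        for prev, nxt in zip(tokens, tokens[1:]):
            following[prev][nxt] += 1
    return following

def suggest(following, prev_word, typed_prefix="", k=3):
    """Top-k next words after prev_word that start with the typed prefix."""
    candidates = following.get(prev_word.lower(), Counter())
    # Most frequent first; ties are broken alphabetically for stable output.
    ranked = sorted(candidates.items(), key=lambda kv: (-kv[1], kv[0]))
    return [w for w, _ in ranked if w.startswith(typed_prefix.lower())][:k]

# Made-up chief-complaint phrases for illustration.
notes = [
    "chest pain for two days",
    "chest pain since yesterday",
    "chest tightness for one week",
    "abdominal pain for two days",
]
index = build_index(notes)
```

For example, `suggest(index, "chest")` ranks "pain" ahead of "tightness", and `suggest(index, "pain", "s")` narrows the list to "since".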


Discharge notes


Results

System Name  Time Spent

Client-server system  652 seconds (00:10:52)

Web-based system  372 seconds (00:06:12)

Average time spent, in seconds (hh:mm:ss)

7 intern participants


Healthcare Mining Project with Mongolia


Background

Taiwan and Mongolia
– National Science Council
– Mongolian Ministry of Education, Culture and Sciences

NTU and MUST
– Mongolian University of Science and Technology

3-Year Project
– 2009/8/1 – 2012/7/31


Motivation

Reduce cost
– Length of stay in hospital
– Early detection of disease

Improve quality and patient safety
– SOP, Clinical Pathways


Motivation

Clinical Pathway
– A way of treating a patient with a standardized procedure in order to
• Enhance efficiency,
• Increase quality,
• Lower costs,
• Shorten the length of stay in hospital.

Usually represented as a script book and/or a flow chart diagram


Project Goal

Build a data mining framework for
– Early detection of disease
• Find the sequential patterns between different diseases
– Standardized therapeutic procedure
• Discover clinical pathways and clinical guide


Mining Clinical Pathway

Clinical Database

Clinical Pathways


Clinical Data

The clinical data include
– Patient information,
– Diagnosis
– Sequences of physicians' orders taken at different time moments.


Clinical Sequence Mining system diagram

Data Preparation

Data Pre-Processing

Mining Model

Historical Diagnosis Database

Orders Sequence Knowledge Base

Alert and Reminding System

Clinical Pathway Creation System


Data Preparation

Inpatient Department raw data
– From 2007/1/1 to 2007/5/26

Discharge notes
– with admission/discharge diagnosis, chief complaint. 22,000 records

Diagnosis records in IPD
– with ICD9 code

Related orders in IPD


Data Preparation

Chief complaint
– For scheduled chemotherapy
– Total
• 791 cases
• 33,771 physician orders


Data Pre-processing

Select relevant data according to the order type attribute
– Drop some non-meaningful orders, such as nursing care and administrative routine orders.
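A minimal sketch of this filtering step, plus a per-type tally of the kind shown in the order type statistics. The set of dropped type codes is a hypothetical choice, because the slides do not say which codes denote nursing care or administrative orders. Reading `cnt` as the number of order rows and `ordercnt` as the number of distinct order codes is also an assumption, and the sample rows are made up.

```python
from collections import defaultdict

# Hypothetical set of order-type codes to drop. The slides do not say which
# codes stand for nursing care or administrative routine orders.
DROPPED_TYPES = {"N", "A"}

def select_relevant(orders, dropped=DROPPED_TYPES):
    """Keep only orders whose type code is not in the dropped set."""
    return [o for o in orders if o["ordertypecode"] not in dropped]

def type_statistics(orders):
    """Per type code: (number of order rows, number of distinct order codes)."""
    rows = defaultdict(int)
    distinct = defaultdict(set)
    for o in orders:
        rows[o["ordertypecode"]] += 1
        distinct[o["ordertypecode"]].add(o["ordercode"])
    return {t: (rows[t], len(distinct[t])) for t in rows}

# Made-up rows; the type codes assigned to these order codes are illustrative.
orders = [
    {"case": 1, "ordertypecode": "L", "ordercode": "08011CZP"},
    {"case": 1, "ordertypecode": "L", "ordercode": "09015CZP"},
    {"case": 2, "ordertypecode": "L", "ordercode": "08011CZP"},
    {"case": 2, "ordertypecode": "N", "ordercode": "N0001"},
]
kept = select_relevant(orders)
stats = type_statistics(kept)
```

Filtering before mining keeps routine orders that appear in nearly every case from dominating the discovered sequences.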


Order Type Statistics

ordertypecode cnt ordercnt

R 10309 368

T 6180 135

A 6063 20

L 5569 175

M 4026 84

D 814 25

X 360 41

B 168 5

O 106 58

J 47 6

E 40 14

P 12 4

I 11 3

N 6 2


Mining Model

Sequence Clustering Algorithm

Mining Tool
– Microsoft SQL Server 2005
– Sequence Clustering Model
– Visualize Data Analysis

Parameter
– Support
– Confidence


Sequence Clustering Mining

Sequence Clustering algorithm finds clusters of cases that contain similar paths in a sequence.


Sequence Clustering Sample

CustomID (Sequence Data)

1 (30) (60 90)

2 (10 20) (30) (40 60 70)

3 (30 50 70)

4 (30) (40 70) (90)

5 (90)

Sequential Pattern:

(30) (90), (30) (40 70)
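The two patterns can be checked against the five sample sequences by counting support, that is, the fraction of customers whose sequence contains the pattern in order. This sketch only verifies the sample by plain support counting. It is not the Microsoft Sequence Clustering algorithm named in the slides, which is a different technique.

```python
# The five sample sequences from the slide; each inner set is one transaction.
SEQUENCES = {
    1: [{30}, {60, 90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}

def contains(sequence, pattern):
    """True if the pattern's itemsets are subsets of distinct itemsets of the
    sequence, in the same order (greedy left-to-right matching)."""
    i = 0
    for itemset in sequence:
        if i < len(pattern) and pattern[i] <= itemset:
            i += 1
    return i == len(pattern)

def support(pattern, sequences=SEQUENCES):
    """Fraction of sequences that contain the pattern."""
    return sum(contains(s, pattern) for s in sequences.values()) / len(sequences)

support_30_90 = support([{30}, {90}])
support_30_4070 = support([{30}, {40, 70}])
```

Both patterns are supported by two of the five customers: customers 1 and 4 for (30) (90), and customers 2 and 4 for (30) (40 70).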


Mapping

Customer -> Patient
Item -> Order
Shopping Cart -> Concurrent Orders
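Under this mapping, each patient's order history becomes a sequence of itemsets, where orders issued at the same time form one itemset. A minimal sketch of the conversion follows. The patient IDs and timestamps are made up, the order codes are taken from the sequence sample later in the slides, and grouping by identical timestamp is an assumption about what "concurrent" means.

```python
from collections import defaultdict

def to_sequences(order_rows):
    """order_rows: iterable of (patient_id, timestamp, order_code).
    Returns {patient_id: [set of concurrent order codes, ...]} in time order."""
    grouped = defaultdict(lambda: defaultdict(set))
    for patient, timestamp, code in order_rows:
        grouped[patient][timestamp].add(code)
    return {
        patient: [by_time[t] for t in sorted(by_time)]
        for patient, by_time in grouped.items()
    }

# Hypothetical rows; timestamps are strings that sort chronologically.
rows = [
    ("P1", "2007-01-03 08:30", "09015CZP"),
    ("P1", "2007-01-02 09:00", "08011CZP"),
    ("P1", "2007-01-02 09:00", "08013CZP"),
    ("P2", "2007-01-05 10:00", "09025CZP"),
]
sequences = to_sequences(rows)
```

The result has the same shape as the customer sequences in the sample, so the same sequence-mining tools apply unchanged.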

Result


Sequence Sample

09029CZP Bilirubin, total

08011CZP CBC & platelet

08013CZP WBC differential count

09015CZP (Blood)Creatinine

09002CZP (Blood)UN

09025CZP AST(GOT)

09026CZP ALT(GPT)

09038CZP Albumin(Blood)

09021CZP Sodium, Na

09022CZP Potassium, K

Platelets

White blood cells

Creatinine

Bilirubin

Liver function index

Liver function index

Albumin


The SAGE Guideline Model

Standards-Based Sharable Active Guideline Environment
– Developed by
• Stanford Medical Informatics, IDX Systems Corporation, Apelon Inc., Intermountain Health Care, Mayo Clinic and University of Nebraska Medical Center

The Protégé


Activity Graphs

Aspirin Therapy for diabetic patients


Cooperation Architecture

Hospital in Mongolia

VM-DB VM-Web VM-DB VM-Web

Hospital in Taiwan

VM Images

Model Feedback


Cloud Architecture

Health Mining Server

Hospital in Taiwan

Hospital in Mongolia

Hospital in Canada


Conclusions

A measure that uses page counts to calculate the semantic similarity between two given concepts.

A semantic-driven keyword matching extractor helps extract data items from reports.


Conclusions

A highly interactive free-text editor with an auto-complete feature speeds up the composition of discharge summaries.

A Data mining framework is proposed.


Future Works

Find out why corpus-based methods correlate more closely with physicians' scores than with experts' scores

Structuralize the healthcare documents

Prove the data mining models' robustness
– Variation analysis across hospitals/regions
– Taiwan and Mongolia
– Canada, Taiwan and Mongolia


Q&A