112/04/19頁尾文字1
Data Mining for Healthcare Documents
陳啟煌, Programming Division, Computer and Information Networking Center, National Taiwan University
2011.10.27
Outlines
Introduction
Biomedical Semantic Similarity Measure
Semantic-driven Keyword Matching Extractor
Web-based Discharge Summary System
Healthcare Mining Project with Mongolia
Conclusions and Future Works
Clinical Mining
Clinical Database
Clinical Pathways
Introduction
In the IOM 2000 report, 44,000 to 98,000 unnecessary deaths per year
– Death rate equivalent to three jumbo jets crashing every two days
– Motor vehicle accidents: 43,458
– Breast cancer: 42,297
– AIDS: 16,516
Suggested Solutions
Development of IT infrastructures
– Computerized Physician Order Entry (CPOE)
• Order sets: make it easier to do the right thing
• Alerts / reminders
• Clinical guidelines
Restriction on working hours
Greater staff-to-patient ratios
Motivation
Clinical Pathway
– A way of treating a patient with a standardized procedure in order to
• Enhance the efficiency
• Increase the quality
• Lower the costs
• Shorten the length of stay in hospital
Usually represented in a script book and/or flow chart diagram
Order Sets System Evolution
Paper Order Sets
– Predefined orders written on paper
Electronic Order Sets
– Just a UI to create and look up order sets
Knowledge-based Order Sets
– Machine learning
– Interactive UI for the user
How to Create Order Sets
Committee
– Traditional method, time-consuming
Feedback system
– Interaction with users, suggestions
Data mining
– Find patterns in existing clinical data
Raw Data
Introduction
Motivation
Free-Text Reports
– Discharge summaries
– Radiology reports
– Pathology reports
– The treatments they contain can be extracted and learned from to gain knowledge
Motivation
Semantically similar biomedical terms exist in medical reports.
– "congestive heart failure", "cardiac decompensation", and "volume overload"
Approaches
Biomedical Semantic Similarity Measure
– Calculate semantic similarity between terms
A Powerful Extractor
– To view, verify, and extract data items from reports
Structuralized
– Providing a highly interactive editor
• Auto-complete
• Model essay
• User phrases
Biomedical Semantic Similarity Measure
Introduction(1/4)
Ontology-based techniques
– Ontology tree
• Single ontology
• Cross ontology
– Path length, edge counting
Corpus-based techniques
– Context vector measure, latent semantic analysis (LSA)
Introduction(2/4)
The Web Corpus
– The Web provides unprecedented access to information and interacts with people's daily lives.
– The idea of using the Web as a corpus for NLP research is becoming popular.
Introduction(3/4)
How can each document on the Web be analyzed directly?
Introduction(4/4)
Web search engines
– Efficient interface
– Numerous documents and a high growth rate
– Google: page count
Background and Related Work
Ontology-based techniques
– Single ontology
• Edge counting
• Information content
• Feature based
• Hybrid
– Cross ontology
• Hliaoutakis et al.
Methodologies
Sample Construction
Feature Definitions
Feature Selection Strategy
Machine Learning Model
– Support Vector Machine Model
Sample Construction(1/3)
Sample Construction(2/3)
Sample Construction(3/3)
In our study, we collected
– 1500 synonymous term pairs
– 1500 non-synonymous term pairs
Feature Definitions(1/4)
Features
– Co-occurrence
[formulas not preserved in this transcript]
Feature Definitions(2/4)
Features
– Co-occurrence
– Semantic distance
[formulas not preserved in this transcript]
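The co-occurrence and semantic-distance formulas did not survive extraction, but the feature ranking later in the deck names five page-count measures: WebJaccard, WebOverlap, WebDice, WebPMI, and NGD. The sketch below uses the commonly published definitions of those measures, which is an assumption about what the slides showed. `h_p`, `h_q`, and `h_pq` stand for the page counts H(P), H(Q), and H(P ∩ Q), and `n` is an assumed estimate of the number of indexed pages.

```python
import math

def web_jaccard(h_p, h_q, h_pq):
    denom = h_p + h_q - h_pq
    return h_pq / denom if denom > 0 else 0.0

def web_overlap(h_p, h_q, h_pq):
    smaller = min(h_p, h_q)
    return h_pq / smaller if smaller > 0 else 0.0

def web_dice(h_p, h_q, h_pq):
    total = h_p + h_q
    return 2 * h_pq / total if total > 0 else 0.0

def web_pmi(h_p, h_q, h_pq, n):
    # Pointwise mutual information estimated from page counts.
    if min(h_p, h_q, h_pq) <= 0:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))

def ngd(h_p, h_q, h_pq, n):
    # Normalized Google Distance: smaller means more closely related.
    if min(h_p, h_q, h_pq) <= 0:
        return float("inf")
    log_p, log_q, log_pq = math.log(h_p), math.log(h_q), math.log(h_pq)
    return (max(log_p, log_q) - log_pq) / (math.log(n) - min(log_p, log_q))
```

The page counts themselves would come from a search engine; no particular search API is assumed here.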
Feature Definitions(3/4)
"Apoptosis known as programmed cell death"
The phrase "known as" indicates a synonymous relationship between "apoptosis" and "programmed cell death".
– "Apoptosis known as programmed cell death": Google page count = 141
– "Isoflavone known as Cyclooxygenase": Google page count = 0
Feature Definitions(4/4)
Features
– Lexico-syntactic pattern
• P known as Q: H(P known as Q) / H(P ∩ Q)
• of P (Q)
• P (Q)
• and P (Q
• , P (Q
• against P (Q
• prevalence of P Q
• patients with P Q
• P/Q
• P, Q
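As a rough illustration of the pattern features, the sketch below turns each lexico-syntactic pattern into a quoted phrase query and divides its page count by H(P ∩ Q). The slide shows this ratio only for "P known as Q"; applying the same normalization to every pattern is an assumption. `page_count` is a hypothetical callable standing in for a search-engine lookup, and the counts in `fake_counts` are made up except for 141, which comes from the slide.

```python
PATTERNS = [
    "{p} known as {q}", "of {p} ({q})", "{p} ({q})", "and {p} ({q}",
    ", {p} ({q}", "against {p} ({q}", "prevalence of {p} {q}",
    "patients with {p} {q}", "{p}/{q}", "{p}, {q}",
]

def pattern_features(p, q, page_count):
    """Return {pattern: H(pattern phrase) / H(P ∩ Q)} for one term pair."""
    both = page_count(f"{p} {q}")  # stands in for H(P ∩ Q)
    features = {}
    for pattern in PATTERNS:
        phrase = '"' + pattern.format(p=p, q=q) + '"'
        features[pattern] = page_count(phrase) / both if both > 0 else 0.0
    return features

# Hypothetical lookup table in place of a real search engine.
fake_counts = {
    "apoptosis programmed cell death": 1000,          # invented value
    '"apoptosis known as programmed cell death"': 141,  # value from the slide
}
feats = pattern_features(
    "apoptosis", "programmed cell death", lambda query: fake_counts.get(query, 0)
)
```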
Feature Selection Strategy
Rank the features by F-score according to their ability to express synonymy.
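The F-score formula itself is missing from the transcript. The sketch below assumes the F-score commonly used for feature selection with LIBSVM: the squared distances of the two class means from the overall mean, divided by the sum of the within-class sample variances. Whether the slides used exactly this form cannot be confirmed from the text. Function and variable names are illustrative.

```python
def f_score(pos, neg):
    """F-score of one feature from its values in the positive and negative class."""
    n_pos, n_neg = len(pos), len(neg)
    mean_pos = sum(pos) / n_pos
    mean_neg = sum(neg) / n_neg
    mean_all = (sum(pos) + sum(neg)) / (n_pos + n_neg)
    between = (mean_pos - mean_all) ** 2 + (mean_neg - mean_all) ** 2
    var_pos = sum((x - mean_pos) ** 2 for x in pos) / (n_pos - 1)
    var_neg = sum((x - mean_neg) ** 2 for x in neg) / (n_neg - 1)
    within = var_pos + var_neg
    return between / within if within > 0 else float("inf")

def rank_features(samples_pos, samples_neg, names):
    """Sort feature names by descending F-score; each sample is a tuple of feature values."""
    scores = {
        name: f_score([s[i] for s in samples_pos], [s[i] for s in samples_neg])
        for i, name in enumerate(names)
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

A feature whose values separate synonymous from non-synonymous pairs gets a high score; a feature with identical class means scores zero.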
Support Vector Machine Model(1/2)
Support Vector Machine Model(2/2)
LIBSVM 2.89
– C-SVC
• Linear
• Polynomial degree=2
• Polynomial degree=3
• RBF
– nu-SVC
• Linear
• Polynomial degree=2
• Polynomial degree=3
• RBF
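The eight model variants can be enumerated as LIBSVM `svm-train` option strings. The sketch below uses the standard `-s` (SVM type), `-t` (kernel type), and `-d` (polynomial degree) options; other settings such as cost, nu, or gamma are not given in the slides and are left at their defaults here, which is an assumption.

```python
SVM_TYPES = {"C-SVC": 0, "nu-SVC": 1}  # values for svm-train -s
KERNELS = [                             # (label, -t value, -d value or None)
    ("Linear", 0, None),
    ("Polynomial degree=2", 1, 2),
    ("Polynomial degree=3", 1, 3),
    ("RBF", 2, None),
]

def libsvm_options():
    """Map each model label to its svm-train option string."""
    configs = {}
    for svm_name, svm_type in SVM_TYPES.items():
        for kernel_name, kernel_type, degree in KERNELS:
            options = f"-s {svm_type} -t {kernel_type}"
            if degree is not None:
                options += f" -d {degree}"
            configs[f"{svm_name} ({kernel_name})"] = options
    return configs

configs = libsvm_options()
```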
Datasets(1/5)
Concept 1 | Concept 2 | Human
Anemia | Appendicitis | 0.031
Dementia | Atopic Dermatitis | 0.062
Bacterial Pneumonia | Malaria | 0.156
Osteoporosis | Patent Ductus Arteriosus | 0.156
Amino Acid Sequence | Anti-Bacterial Agents | 0.156
Acquired Immunodeficiency Syndrome | Congenital Heart Defects | 0.062
Otitis Media | Infantile Colic | 0.156
Meningitis | Tricuspid Atresia | 0.031
Sinusitis | Mental Retardation | 0.031
Hypertension | Kidney Failure | 0.5
Hyperlipidemia | Hyperkalemia | 0.156
Hypothyroidism | Hyperthyroidism | 0.406
Sarcoidosis | Tuberculosis | 0.406
Vaccines | Immunity | 0.593
Asthma | Pneumonia | 0.375
Table 1: Dataset 1 of 36 medical term pairs
Datasets(2/5)
Concept 1 | Concept 2 | Human
Diabetic Nephropathy | Diabetes Mellitus | 0.5
Lactose Intolerance | Irritable Bowel Syndrome | 0.468
Urinary Tract Infection | Pyelonephritis | 0.656
Neonatal Jaundice | Sepsis | 0.187
Sickle Cell Anemia | Iron Deficiency Anemia | 0.437
Psychology | Cognitive Science | 0.593
Adenovirus | Rotavirus | 0.437
Migraine | Headache | 0.718
Myocardial Ischemia | Myocardial Infarction | 0.75
Hepatitis B | Hepatitis C | 0.562
Carcinoma | Neoplasm | 0.75
Pulmonary Valve Stenosis | Aortic Valve Stenosis | 0.531
Failure To Thrive | Malnutrition | 0.625
Breast Feeding | Lactation | 0.843
Antibiotics | Antibacterial Agents | 0.937
Table 1: Dataset 1 of 36 medical term pairs
Datasets(3/5)
Concept 1 | Concept 2 | Human
Seizures | Convulsions | 0.843
Pain | Ache | 0.875
Malnutrition | Nutritional Deficiency | 0.875
Measles | Rubeola | 0.906
Chicken Pox | Varicella | 0.968
Down Syndrome | Trisomy 21 | 0.875
Table 1: Dataset 1 of 36 medical term pairs
Datasets(4/5)
Concept 1 | Concept 2 | Physician | Expert
Renal Failure | Kidney Failure | 4 | 4
Heart | Myocardium | 3.3 | 3
Stroke | Infarct | 3 | 2.8
Abortion | Miscarriage | 3 | 3.3
Delusion | Schizophrenia | 3 | 2.2
Congestive Heart Failure | Pulmonary Edema | 3 | 1.4
Metastasis | Adenocarcinoma | 2.7 | 1.8
Calcification | Stenosis | 2.7 | 2
Diarrhea | Stomach Cramps | 2.3 | 1.3
Mitral Stenosis | Atrial Fibrillation | 2.3 | 1.3
Chronic Obstructive Pulmonary Disease | Lung Infiltrates | 2.3 | 1.9
Rheumatoid Arthritis | Lupus | 2 | 1.1
Brain Tumor | Intracranial Hemorrhage | 2 | 1.3
Carpal Tunnel Syndrome | Osteoarthritis | 2 | 1.1
Diabetes Mellitus | Hypertension | 2 | 1
Table 2: Dataset 2 of 30 medical term pairs
Datasets(5/5)
Concept 1 | Concept 2 | Physician | Expert
Acne | Syringe | 1.7 | 1.2
Antibiotic | Allergy | 1.7 | 1
Cortisone | Total Knee Replacement | 1.7 | 1.2
Pulmonary Embolus | Myocardial Infarction | 1.7 | 1.4
Pulmonary Fibrosis | Lung Cancer | 1.3 | 1
Cholangiocarcinoma | Colonoscopy | 1.3 | 1
Lymphoid Hyperplasia | Laryngeal Cancer | 1 | 1
Multiple Sclerosis | Psychosis | 1 | 1
Appendicitis | Osteoporosis | 1 | 1
Rectal Polyp | Aorta | 1 | 1
Xerostomia | Alcoholic Cirrhosis | 1 | 1
Peptic Ulcer Disease | Myopia | 1 | 1
Depression | Cellulitis | 1 | 1
Varicose Vein | Entire Knee Meniscus | 1 | 1
Hyperlipidemia | Metastasis | 1 | 1
Table 2: Dataset 2 of 30 medical term pairs
Experiment Results
Rank | Feature | F(i)
1 | NGD | 0.2751
2 | WebPMI | 0.237
3 | , X (Y | 0.1648
4 | X/Y | 0.1632
5 | X(Y) | 0.1606
6 | X, Y | 0.1585
7 | WebOverlap | 0.1173
8 | WebDice | 0.0555
9 | WebJaccard | 0.0347
10 | of X (Y) | 0.0185
11 | and X (Y | 0.0093
12 | against X (Y | 0.0027
13 | patients with X Y | 0.0017
14 | X known as Y | 0.0014
15 | prevalence of X Y | 0.0011
Experiment Results
Figure 3.4(a): Correlation vs. No of features and training samples using C-SVC with linear kernel
Experiment Results
Figure 3.4(b): Correlation vs. No of features and training samples using C-SVC with polynomial degree=2 kernel
Experiment Results
Figure 3.4(c): Correlation vs. No of features and training samples using C-SVC with polynomial degree=3 kernel
Experiment Results
Figure 3.4(d): Correlation vs. No of features and training samples using C-SVC with RBF kernel
Experiment Results
Figure 3.5(a): Correlation vs. No of features and training samples using nu-SVC with linear kernel
Experiment Results
Figure 3.5(b): Correlation vs. No of features and training samples using nu-SVC with polynomial degree=2 kernel
Experiment Results
Figure 3.5(c): Correlation vs. No of features and training samples using nu-SVC with polynomial degree=3 kernel
Experiment Results
Figure 3.5(d): Correlation vs. No of features and training samples using nu-SVC with RBF kernel
Experiment Results
Model | Maximum correlation | Number of samples | Number of features
C-SVC(Linear) | 0.758 | 1500 | 9
C-SVC(Poly=2) | 0.776 | 1200 | 7
C-SVC(Poly=3) | 0.759 | 300 | 13
C-SVC(RBF) | 0.612 | 1100 | 10
nu-SVC(Linear) | 0.798 | 900 | 7
nu-SVC(Poly=2) | 0.766 | 300 | 11
nu-SVC(Poly=3) | 0.736 | 300 | 12
nu-SVC(RBF) | 0.743 | 100 | 11
Experiment Results
Table 5: Correlation of the different models with Dataset 1 and with the physician scores and expert scores of Dataset 2
Model | Dataset 1 | Dataset 2 (Phy) | Dataset 2 (Exp)
C-SVC(Linear) | 0.758 | 0.689 | 0.482
C-SVC(Poly=2) | 0.776 | 0.698 | 0.479
C-SVC(Poly=3) | 0.759 | 0.649 | 0.395
C-SVC(RBF) | 0.612 | 0.388 | 0.171
nu-SVC(Linear) | 0.798 | 0.705 | 0.496
nu-SVC(Poly=2) | 0.766 | 0.671 | 0.424
nu-SVC(Poly=3) | 0.736 | 0.641 | 0.384
nu-SVC(RBF) | 0.743 | 0.632 | 0.373
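The slides report correlation between model output and human ratings without saying which coefficient was used. As a minimal sketch, assuming the Pearson correlation coefficient, the evaluation could look like this. The four human scores are taken from Dataset 1; the model scores are invented purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)

# Human scores: Anemia/Appendicitis, Hypertension/Kidney Failure,
# Carcinoma/Neoplasm, Chicken Pox/Varicella (from Dataset 1).
human_scores = [0.031, 0.5, 0.75, 0.968]
model_scores = [0.10, 0.40, 0.80, 0.90]  # hypothetical model output
correlation = pearson(human_scores, model_scores)
```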
Result comparison
Table 3.4: Result comparison for Dataset 1
Measure | Dataset 1 (rank)
SemDist | 0.726 (2)
Path length | 0.422 (5)
Leacock & Chodorow | 0.600 (3)
Wu & Palmer | 0.498 (4)
Proposed | 0.798 (1)
Result comparison
Table 3.5: Result comparison for Dataset 2
Measure | Dataset 2 (Physician) (rank) | Dataset 2 (Expert) (rank)
Path length | 0.512 (4) | 0.731 (2)
Leacock & Chodorow | 0.358 (7) | 0.497 (5)
Lin | 0.522 (3) | 0.565 (4)
Resnik | 0.534 (2) | 0.61 (3)
Jiang & Conrath | 0.506 (5) | 0.741 (1)
Vector (All sect, 1M notes) | 0.436 (6) | 0.497 (5)
Proposed | 0.705 (1) | 0.496 (6)
Semantic-driven Keyword Matching Extractor
Introduction
For Structuralized Clinical Data
– Data can be directly exported for further analysis and mining
For Non-structuralized Clinical Data
– Data need to be further processed to extract the relevant information
Background and Related Works
Marking concepts and related semantics
– Cancer Text Information Extraction System (caTIES)
Extracting data items and filling the outcomes into the predefined template
– IBM Watson Research Center & Mayo Clinic
Providing the verification user interface
– Commercial natural language processing (NLP) engines
Architecture
[Figure: system architecture.
Components: keyword selection interface; information matching modules; textual documents viewer; extraction verification editor; case-oriented template schema; matching metadata; textual clinical reports; clinical data warehouse.
Step labels: retrieve keyword list; select keyword; retrieve matching profile; send matching profile; apply match pattern on textual reports; review and verify matched information; store structuralized data.]
Methodology
Default common keyword lists for each type of textual document
Personal keyword lists
– Match the keyword and the keywords with related semantics
– Map the corresponding matching rules using the retrieved matching pattern, then apply the matching rules to the textual reports
– Date: 2009/01/01, 12/01
– Size: "4.9 x 1 x 1.8" (length x width x height)
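The date and size examples above suggest pattern-based matching rules. The sketch below shows one way such rules could be written as regular expressions; the actual rule format used by the extractor is not described in the slides, so the patterns and function names here are assumptions.

```python
import re

# Dates such as 2009/01/01 (year optional, as in 12/01).
DATE_RE = re.compile(r"\b(?:\d{4}/)?\d{1,2}/\d{1,2}\b")

# Sizes such as "4.9 x 1 x 1.8", read as length x width x height.
NUMBER = r"(\d+(?:\.\d+)?)"
SIZE_RE = re.compile(NUMBER + r"\s*x\s*" + NUMBER + r"\s*x\s*" + NUMBER, re.IGNORECASE)

def extract_dates(text):
    """Return every date-like string in the report text."""
    return [match.group(0) for match in DATE_RE.finditer(text)]

def extract_sizes(text):
    """Return (length, width, height) tuples found in the report text."""
    return [tuple(float(g) for g in match.groups()) for match in SIZE_RE.finditer(text)]
```

In a verification interface, matches like these would be shown to the user for review before being stored as structured data.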
Result
Discharge Summary System
Background
Old discharge summary system (Dis32)
– Client/server architecture
– Install/upgrade client applications
Web discharge summary system
– Service-oriented architecture
– Online since 2009.10
Motivation
Discharge summary user interface
– Chief Complaint, Brief History
– Free-text field
– How to generate a list of suggested phrases
Motivation
Auto-Complete
Language Modeling
We want to compute P(w1,w2,w3,w4,w5…wn), the probability of a sequence
Alternatively we want to compute P(w5|w1,w2,w3,w4): the probability of a word given some previous words
The model that computes P(W) or P(wn|w1,w2…wn-1) is called the language model.
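A minimal sketch of such a model, assuming a bigram approximation in which each word depends only on the previous word, with maximum-likelihood estimates and no smoothing. The three training sentences are invented examples, not data from the discharge summary system.

```python
from collections import Counter

def train_bigram(sentences):
    """Count contexts and bigrams, with <s> and </s> as sentence boundaries."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    return contexts, bigrams

def cond_prob(word, prev, contexts, bigrams):
    """P(word | prev) = count(prev, word) / count(prev)."""
    return bigrams[(prev, word)] / contexts[prev] if contexts[prev] else 0.0

def sentence_prob(sentence, contexts, bigrams):
    """P(w1..wn) as a product of bigram conditional probabilities."""
    tokens = ["<s>"] + sentence.split() + ["</s>"]
    prob = 1.0
    for prev, word in zip(tokens, tokens[1:]):
        prob *= cond_prob(word, prev, contexts, bigrams)
    return prob

corpus = ["chest pain for two days", "chest pain and fever", "fever for two days"]
contexts, bigrams = train_bigram(corpus)
```

A toolkit such as SRILM adds smoothing and higher-order n-grams on top of this basic counting scheme.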
SRILM
SRILM
– The SRI Language Modeling Toolkit
– SRILM is a toolkit for building and applying statistical language models (LMs)
– http://www.speech.sri.com/projects/srilm/
SRILM
Three Main Functionalities
– Generate the n-gram count file from the corpus
– Train the language model from the n-gram count file
– Calculate the test data perplexity using the trained language model
Implementation
N-gram Count File
– Chief Complaint, Brief History
Static
– Phrase lists
Dynamic
– AJAX + AutoComplete toolkit
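The sketch below illustrates how n-gram counts could drive phrase suggestions: it counts word n-grams in past notes and returns the most frequent phrases that start with what the user has typed. It covers only a possible server-side ranking step; the AJAX layer and the actual AutoComplete toolkit are not modeled, and the notes and function names are invented for illustration.

```python
from collections import Counter

def build_phrase_counts(notes, max_n=3):
    """Count every word n-gram (1..max_n words) across the notes."""
    counts = Counter()
    for note in notes:
        tokens = note.lower().split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts

def suggest(prefix, counts, limit=5):
    """Return up to `limit` phrases starting with `prefix`, most frequent first."""
    prefix = prefix.lower()
    matches = [(phrase, count) for phrase, count in counts.items()
               if phrase.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [phrase for phrase, _ in matches[:limit]]

notes = ["chest pain for two days", "chest pain and fever", "fever for two days"]
phrase_counts = build_phrase_counts(notes)
suggestions = suggest("chest", phrase_counts, limit=3)
```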
Discharge notes
Results
Average time spent, in seconds (hh:mm:ss), by 7 intern participants
System Name | Time Spent
Client-server system | 652 seconds (00:10:52)
Web-based system | 372 seconds (00:06:12)
Healthcare Mining Project with Mongolia
Background
Taiwan and Mongolia
– National Science Council
– Mongolian Ministry of Education, Culture and Sciences
NTU and MUST
– Mongolian University of Science and Technology
3-Year Project
– 2009/8/1 to 2012/7/31
Motivation
Reduce cost
– Length of stay in hospital
– Early detection of disease
Improve quality and patient safety
– SOP, Clinical Pathways
Motivation
Clinical Pathway
– A way of treating a patient with a standardized procedure in order to
• Enhance the efficiency
• Increase the quality
• Lower the costs
• Shorten the length of stay in hospital
Usually represented in a script book and/or flow chart diagram
Project Goal
Build a data mining framework for
– Early detection of disease
• Find the sequential patterns between different diseases
– Standardized therapeutic procedure
• Discover clinical pathways and clinical guidelines
Mining Clinical Pathway
Clinical Database
Clinical Pathways
Clinical Data
The clinical data include
– Patient information
– Diagnosis
– Sequences of physician orders taken at different moments in time
Clinical Sequence Mining system diagram
[Figure labels: Historical Diagnosis Database; Data Preparation; Data Pre-Processing; Mining Model; Orders Sequence Knowledge Base; Alert and Reminding System; Clinical Pathway Creation System]
Data Preparation
Inpatient Department (IPD) raw data
– From 2007/1/1 to 2007/5/26
Discharge notes
– With admission/discharge diagnosis and chief complaint; 22,000 records
Diagnosis records in IPD
– With ICD9 code
Related orders in IPD
Data Preparation
Chief complaint
– For scheduled chemotherapy
– Total
• 791 cases
• 33,771 physician orders
Data Pre-processing
Select relevant data according to the order type attribute
– Drop some non-meaningful orders, such as nursing care and routine administration orders
Order Type Statistics
ordertypecode | cnt | ordercnt
R | 10309 | 368
T | 6180 | 135
A | 6063 | 20
L | 5569 | 175
M | 4026 | 84
D | 814 | 25
X | 360 | 41
B | 168 | 5
O | 106 | 58
J | 47 | 6
E | 40 | 14
P | 12 | 4
I | 11 | 3
N | 6 | 2
Mining Model
Sequence Clustering Algorithm
Mining Tool
– Microsoft SQL Server 2005
– Sequence Clustering Model
– Visualize Data Analysis
Parameter
– Support
– Confidence
Sequence Clustering Mining
The Sequence Clustering algorithm finds clusters of cases that contain similar paths in a sequence.
Sequence Clustering Sample
Customer ID | Sequence Data
1 | (30) (60 90)
2 | (10 20) (30) (40 60 70)
3 | (30 50 70)
4 | (30) (40 70) (90)
5 | (90)
Sequential patterns: (30) (90) and (30) (40 70)
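To make the sample concrete, the sketch below counts how many customers' sequences contain a given pattern, where each element of a sequence is a set of items bought together. Both reported patterns are supported by two of the five customers, which would match a minimum support of two; that threshold is an assumption, since the slide does not state it. This illustrates support counting for sequential patterns only, not the Microsoft Sequence Clustering algorithm itself.

```python
def contains(sequence, pattern):
    """True if each itemset of `pattern` fits, in order, inside a later itemset of `sequence`."""
    position = 0
    for itemset in pattern:
        while position < len(sequence) and not itemset <= sequence[position]:
            position += 1
        if position == len(sequence):
            return False
        position += 1
    return True

def support(database, pattern):
    """Number of customers whose sequence contains the pattern."""
    return sum(contains(sequence, pattern) for sequence in database.values())

# The five customer sequences from the sample slide.
DATABASE = {
    1: [{30}, {60, 90}],
    2: [{10, 20}, {30}, {40, 60, 70}],
    3: [{30, 50, 70}],
    4: [{30}, {40, 70}, {90}],
    5: [{90}],
}
```

Under the mapping used in the project, customers correspond to patients, items to orders, and a shopping cart to a set of concurrent orders.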
Mapping
Customer → Patient
Item → Order
Shopping Cart → Concurrent Orders
Result
Sequence Sample
09029CZP Bilirubin, total
08011CZP CBC & platelet
08013CZP WBC differential count
09015CZP (Blood)Creatinine
09002CZP (Blood)UN
09025CZP AST(GOT)
09026CZP ALT(GPT)
09038CZP Albumin(Blood)
09021CZP Sodium, Na
09022CZP Potassium, K
Annotations (translated from Chinese): platelets; white blood cells; creatinine; bilirubin; liver function index (twice); albumin; sodium; potassium
The SAGE Guideline Model
Standards-Based Sharable Active Guideline Environment
– Developed by
• Stanford Medical Informatics, IDX Systems Corporation, Apelon Inc., Intermountain Health Care, Mayo Clinic, and University of Nebraska Medical Center
The Protégé
Activity Graphs
Aspirin Therapy for diabetic patients
Cooperation Architecture
[Figure labels: Hospital in Mongolia; Hospital in Taiwan; VM-DB and VM-Web (shown twice); VM Images; Model Feedback]
Cloud Architecture
Health Mining Server
Hospital in Taiwan
Hospital in Mongolia
Hospital in Canada
Conclusions
A measure that uses page counts to calculate the semantic similarity between two given concepts.
A semantic-driven keyword matching extractor helps extract data items from reports.
Conclusions
A highly interactive free-text editor with an auto-complete feature speeds up the composition of discharge summaries.
A data mining framework is proposed.
Future Works
Find out why corpus-based methods correlate more closely with physicians' scores than with experts' scores
Structuralize the healthcare documents
Prove the robustness of the data mining models
– Variation analysis across hospitals/regions
– Taiwan and Mongolia
– Canada, Taiwan and Mongolia
Q&A