Big Data Algorithms with Medical Applications
Yixin Chen
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
Small data vs. Big data
General laws
VS
Specific laws
Small data vs. Big data
Causality vs. association
Domain knowledge vs. data knowledge
Small data vs. Big data Models
[Figure: model quality as a function of data size, for big-data and small-data models]
Modeling techniques
Parametric vs. non-parametric
Trade-offs: efficiency, interpretability, accuracy
Efficiency of big data models
High efficiency via:
- Parallelization (constant speedup)
- Algorithmic improvements (e.g. O(N³) vs. O(N²))
Large-scale manifold learning: Maximum Variance Correction (Chen et al., ICML'13)
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
The need for clinical prediction
• The ICU direct costs per day for survivors are between six and seven times those of non-ICU care.
• Unlike patients in ICUs, general hospital ward (GHW) patients are not under extensive electronic monitoring and nurse care.
• Clinical studies have found that 4–17% of patients will undergo cardiopulmonary or respiratory arrest while in the GHW of a hospital.
Goal: Let Data Speak!
Sudden deteriorations (e.g. septic shock, cardiopulmonary or respiratory arrest) of GHW patients can often be severe and life threatening.
Goal: provide early detection and intervention based on data mining to prevent these serious, often life-threatening events, using both clinical data and wireless body sensor data.
An NSF/NIH-funded clinical trial at Washington University / Barnes-Jewish Hospital.
Clinical Data: high-dimensional real-time time-series data
34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, …
[Figure: real-time vital-sign time series; x-axis: time in seconds]
Previous Work
Main problems: most previous work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.
Medical data mining combines medical knowledge and machine learning methods.
Medical knowledge based scores:
- SCAP and PSI
- Acute Physiology Score, Chronic Health Score, and APACHE score, used to predict renal failure
- Modified Early Warning Score (MEWS)
Machine learning methods:
- decision trees, neural networks, SVM
Machine learning task
[Figure: counts of Non-ICU vs. ICU instances (y-axis 0 to 30,000), showing severe class imbalance]
Challenges:
• Classification of high-dimensional time series data
• Irregular data gaps
• Measurement errors
• Class imbalance
Solution based on existing techniques (Mao et al., KDD'12)
• Temporal feature extraction
• Bootstrap aggregating (bagging)
• Exploratory under-sampling
• Feature selection
• Exponential moving average smoothing (see the sketch below)
• Basic classifier
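As a concrete illustration of one pipeline step, here is a minimal sketch of exponential moving average smoothing; the smoothing factor alpha and the toy pulse series are assumptions for illustration, not values from the study.

```python
import numpy as np

def ema_smooth(series, alpha=0.3):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = np.empty(len(series), dtype=float)
    smoothed[0] = series[0]
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# A toy pulse series with a spike (e.g. a sensor measurement error):
pulse = np.array([82.0, 85.0, 180.0, 84.0, 83.0, 86.0])
print(ema_smooth(pulse))  # the spike is damped rather than passed through
```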
Desired Classifier Properties
• Nonlinear classification ability
• Interpretability
• Support for mixed data types
• Efficiency
• Multi-class classification
Linear SVM and logistic regression: interpretable and efficient, but linear.
SVM with RBF kernels: nonlinear, but not interpretable and inefficient.
Desired Classifier Properties

                                     kNN  NB  NN  LR  Linear SVM  Kernel SVM
Nonlinear classification ability      Y   N   Y   N       N           Y
Interpretability                      N   Y   N   Y       Y           N
Direct support for mixed data types   Y   Y   N   N       N           N
Efficiency                            Y   Y   Y   Y       Y           N
Multi-class classification            Y   Y   Y   Y       N           N
Random Kitchen Sinks (RKS)
Random nonlinear feature transformation + parametric, linear classifier:
1. Transform each input x into exp(-i w_k^T x), k = 1, …, K, where w_k ~ a Gaussian distribution p(w).
2. Learn a linear model ∑_k α_k exp(-i w_k^T x).
Theory: based on the Fourier transform, RKS converges to RBF-SVM for large K.
Efficient, but not interpretable.
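A minimal sketch of RKS using the real-valued cosine form of the random Fourier features; the feature count K, the kernel width sigma, and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rks_features(X, W, b):
    # Real-valued counterpart of exp(-i w_k^T x): random cosine features.
    K = W.shape[1]
    return np.sqrt(2.0 / K) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
K, sigma = 500, 1.0                       # number of random features, kernel width

X = rng.normal(size=(1000, 34))           # toy stand-in for 34 vital-sign features
y = (np.linalg.norm(X[:, :2], axis=1) > 1.2).astype(int)  # nonlinear toy labels

W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], K))   # w_k ~ Gaussian p(w)
b = rng.uniform(0.0, 2.0 * np.pi, size=K)                 # random phases

clf = LogisticRegression(max_iter=1000).fit(rks_features(X, W, b), y)
print(clf.score(rks_features(X, W, b), y))                # training accuracy
```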
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
Key Idea: Hybrid Model
Non-parametric, nonlinear feature transformation + parametric, linear classifier
→ Efficiency, interpretability, and nonlinearity
Desired Classifier Properties

                                     kNN  NB  NN  LR  Linear SVM  Kernel SVM  DLR
Nonlinear classification ability      Y   N   Y   N       N           Y        Y
Interpretability                      N   Y   N   Y       Y           N        Y
Direct support for mixed data types   Y   Y   N   N       N           N        Y
Efficiency                            Y   Y   Y   Y       Y           N        Y
Multi-class classification            Y   Y   Y   Y       N           N        Y
DLR: Density-based Logistic Regression (Chen et al., KDD’13)
Logistic Regression

Each instance has D features: x = (x_1, …, x_D).
Training dataset: {(x^(n), y^(n))}, n = 1, …, N, with labels y^(n) ∈ {0, 1}.
Assume: τ(x) = P(y = 1 | x) = 1 / (1 + exp(-(w^T φ(x) + b))), where τ(x) is the predicted probability of the positive class.
Optimization: maximize the overall log likelihood
  L(w, b) = ∑_n [ y^(n) ln τ(x^(n)) + (1 - y^(n)) ln(1 - τ(x^(n))) ].

Problem with linear models
Plain LR sets φ_d(x) = x_d, so the decision boundary is linear in x. If we keep the linear form w^T φ(x) + b, what should φ_d(x) be to capture nonlinearity?

Insights on τ(x) (logistic regression)
From the model: ln( τ(x) / (1 - τ(x)) ) = w^T φ(x) + b.
On the other hand, by definition: ln( P(y=1|x) / P(y=0|x) ) = ln( τ(x) / (1 - τ(x)) ).
Hence LR fits the log odds of y given x with a function linear in φ(x).
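The slides stop at the objective; for completeness, the standard gradient used by LR solvers (a textbook identity, not shown in the slides) is:

$$\nabla_{w}\,\mathcal{L} \;=\; \sum_{n=1}^{N}\Big(y^{(n)} - \tau\big(x^{(n)}\big)\Big)\,\phi\big(x^{(n)}\big),
\qquad
\frac{\partial \mathcal{L}}{\partial b} \;=\; \sum_{n=1}^{N}\Big(y^{(n)} - \tau\big(x^{(n)}\big)\Big).$$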
Factorization in DLR
Assumption: the posterior odds factorize over features,
  P(y=1|x) / P(y=0|x) = e^b · ∏_d [ P(y=1|x_d) / P(y=0|x_d) ]^(w_d),
where w_d weights the contribution of feature d. Taking logarithms recovers the linear form w^T φ(x) + b.
DLR Feature Transformation
φ_d(x) = ln( P(y=1 | x_d) / P(y=0 | x_d) ), which is an increasing function of P(y=1 | x_d).
Conditional Probability Estimation
Numerical x_d: kernel density (Nadaraya-Watson) estimation from the training dataset,
  P̂(y=1 | x_d) = ∑_n 1{y^(n)=1} K( (x_d - x_d^(n)) / h ) / ∑_n K( (x_d - x_d^(n)) / h ),
where K is a kernel function (e.g. Gaussian) and h is the kernel bandwidth.
Categorical x_d: smoothed histogram (smoothed per-category relative frequencies).
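A minimal sketch of this estimate for one numerical feature, using a Gaussian kernel; the bandwidth h and the smoothing constant eps are assumptions for illustration.

```python
import numpy as np

def phi_d(x_query, x_train_d, y_train, h=0.5, eps=1e-6):
    """DLR feature: log P(y=1|x_d) - log P(y=0|x_d), via Nadaraya-Watson KDE."""
    k = np.exp(-((x_query - x_train_d) ** 2) / (2.0 * h ** 2))  # kernel weights
    p1 = (k @ (y_train == 1) + eps) / (k.sum() + 2.0 * eps)     # smoothed P(y=1|x_d)
    return np.log(p1 / (1.0 - p1))                              # log odds

# Example: transform one blood-pressure reading against a toy training column.
x_col = np.array([110.0, 150.0, 90.0, 130.0])
y = np.array([0, 1, 0, 1])
print(phi_d(120.0, x_col, y))
```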
DLR Learning
Maximize the overall log likelihood.
Objective: L(w, h) = ∑_n [ y^(n) ln τ(x^(n)) + (1 - y^(n)) ln(1 - τ(x^(n))) ], a function of both the weights w and the kernel bandwidth h (which enters through the features φ).
Overview of DLR (alternating optimization)
1. Initialize h and w.
2. Calculate the new feature vectors φ(x) with the current h.
3. Fix h and optimize w (using a LR solver).
4. Fix w and optimize h (steepest gradient descent).
5. Converged? If not, go to step 2; repeat until convergence.
[Figure: decision boundary at the initial h and after iterations 1, 2, and 3]
A structural sketch of this loop follows below.
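A structural sketch of the alternating loop, reusing phi_d from the earlier KDE sketch; the crude numeric bandwidth gradient stands in for the paper's steepest gradient descent, and all step sizes are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dlr_features(X, y, h):
    # Transform every column of X with the density-based feature phi_d.
    return np.column_stack(
        [[phi_d(x, X[:, d], y, h) for x in X[:, d]] for d in range(X.shape[1])]
    )

def log_likelihood(X, y, h):
    clf = LogisticRegression(max_iter=1000).fit(dlr_features(X, y, h), y)
    p = np.clip(clf.predict_proba(dlr_features(X, y, h))[:, 1], 1e-9, 1 - 1e-9)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)), clf

def train_dlr(X, y, h=0.5, n_outer=5, dh=1e-2, lr=0.1):
    for _ in range(n_outer):
        ll, clf = log_likelihood(X, y, h)          # fix h, optimize w (LR solver)
        ll_hi, _ = log_likelihood(X, y, h + dh)    # probe the objective at h + dh
        h = max(h + lr * (ll_hi - ll) / dh, 1e-3)  # fix w, gradient step on h
    return clf, h
```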
Interpretability of DLR
For example, suppose y represents a particular disease and x_d represents the blood pressure (BP) of a patient.
On the disease level: ranking the weights w_d can identify the risk factors of this disease.
On the patient level: φ_d(x) indicates the abnormality of the patient's BP, and w_d φ_d(x) indicates the extent to which BP contributes to his disease.
Kernels
Ideal kernel: k(x, x') = 1 if x and x' share the same label, 0 otherwise.
RBF kernel: k(x, x') = exp( -‖x - x'‖² / (2σ²) ) doesn't consider the label information.
DLR kernel: the inner product of the density-based features, k(x, x') = φ(x)^T φ(x'); a large value indicates the same label, a small value indicates different labels.
DLR on example data
[Figure: decision boundaries of original LR vs. density-based LR on example training and test data]
Accuracy on UCI Datasets
[Figure: accuracy on numerical and categorical UCI datasets; higher is better]
Training Time
[Figure: training time on numerical and categorical UCI datasets; lower is better]
Results on clinical data. Accuracy: LR 0.9141, SVM 0.9194, DLR 0.9204.
Early alert even when the patient appears normal to the best doctors in the world
DLR for really large data
KDE-based estimation of P(y | x_d): still too slow for big data; testing time grows as the training set gets larger.
Histogram-based estimation of P(y | x_d): no curse of dimensionality (each estimate is one-dimensional); ultra-fast training and testing.
DLR with Bins
Plain histograms: not smooth, and some bins have not enough data.
Fix: KDE-style smoothing of the histogram estimate
  P̂(y=1 | bin i) = n_i¹ / n_i,
where n_i¹ is the number of label-1 instances in bin i and n_i is the number of instances in bin i.
A sketch of the binned estimate follows below.
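A minimal sketch of the binned estimate with a simple neighbor-smoothing pass; the bin count, eps, and the smoothing weights are assumptions, not the paper's exact scheme.

```python
import numpy as np

def binned_log_odds(x_d, y, n_bins=20, eps=1e-6):
    edges = np.histogram_bin_edges(x_d, bins=n_bins)
    idx = np.clip(np.digitize(x_d, edges) - 1, 0, n_bins - 1)  # bin index per instance
    n_i = np.bincount(idx, minlength=n_bins)                   # instances in bin i
    n_i1 = np.bincount(idx, weights=(y == 1).astype(float), minlength=n_bins)
    p1 = (n_i1 + eps) / (n_i + 2.0 * eps)                      # raw histogram estimate
    p1 = np.convolve(p1, [0.25, 0.5, 0.25], mode="same")       # smooth sparse bins
    p1 = np.clip(p1, eps, 1.0 - eps)
    return np.log(p1 / (1.0 - p1))                             # per-bin log odds
```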
Different Number of Bins
[Figure: estimated conditional probability with 5, 20, and 100 bins]
Results on accuracy (%)

            Splice  Mush   w5a    w8a    Adult  kddcup
            1K      8K     10K    50K    30K    1.26M
Linear SVM  75      100    98.15  98.57  60.03  99.99
LR          77      99.87  97.67  98.24  84.80  99.99
RBF SVM     80      99.23  97.14  97.20  75.29  N/A
DLR-b       88      99.95  98.26  98.55  85.54  99.99
Results on efficiency (training time)

            Splice  Mush   w5a    w8a   Adult  kddcup
            1K      8K     10K    50K   30K    1.26M
Linear SVM  0.12    0.56   1.16   15    2847   81.70
LR          0.15    0.21   0.18   0.7   2.89   55.66
RBF SVM     0.09    1.63   1.60   29    217    N/A
DLR-b       0.22    0.32   2.65   7.6   0.6    17.93
Feature Selection Ability
DLR supports embedded feature selection:
• Standard ℓ1-regularization, loss(w) + c ∑_d |w_d|, requires non-smooth optimization.
• However, in DLR we can simply use c ∑_d w_d along with the constraints w_d ≥ 0, which is a smooth optimization problem (see the sketch below).
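A minimal sketch of the resulting smooth problem solved by projected gradient descent; grad_loss is a placeholder for the gradient of the LR loss, and the step size and regularization constant are assumptions.

```python
import numpy as np

def select_features(grad_loss, D, c=0.1, lr=0.01, n_iter=500):
    """Minimize loss(w) + c * sum(w_d) subject to w_d >= 0."""
    w = np.zeros(D)
    for _ in range(n_iter):
        w = w - lr * (grad_loss(w) + c)   # gradient of the smooth objective
        w = np.maximum(w, 0.0)            # project back onto w_d >= 0
    return w                              # exact zeros mark dropped features

# Toy usage with a quadratic loss 0.5 * ||w - w_star||^2, whose gradient is w - w_star:
w_star = np.array([0.8, -0.3, 0.5])
print(select_features(lambda w: w - w_star, D=3))  # second weight stays exactly zero
```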
Top features selected by DLR
standard deviation of heart rate
ApEn of heart rate
Energy of oxygen saturation
LF of oxygen saturation
LF of heart rate
DFA of oxygen saturation
Mean of heart rate
HF of heart rate
Inertia of heart rate
Homogeneity of heart rate
Energy of heart rate
linear correlation of heart rate and oxygen saturation
Conclusions on DLR
DLR satisfies all of the following:
• Nonlinear classification ability
• Support for mixed data types
• Interpretability
• Efficiency
• Multi-class classification
Try it out! http://www.cse.wustl.edu/~wenlinchen/project/DLR/
• Hybrid!
  - Non-parametric + parametric
  - Association + causality
  - Generative + discriminative
  - Balance accuracy and speed
• For real big data, get rid of heavy machinery
  - Let accuracy grow with data size
• A linear model would suffice, given enough nonlinearity/randomness
Big Data Algorithms
Thank you
Challenges of the big data era:
McKinsey Global Institute report: big data talent is scarce
Talent
                                     kNN  NB  NN  LR  Linear SVM  Kernel SVM  RKS
Nonlinear classification ability      Y   N   Y   N       N           Y        Y
Interpretability                      N   Y   N   Y       Y           N        N
Direct support for mixed data types   Y   Y   N   N       N           N        N
Efficiency                            Y   Y   Y   Y       Y           N        Y
Multi-class classification            Y   Y   Y   Y       N           N        N
RKS: linear model over nonlinear features.
RBF SVM: k(x, x') = exp( -‖x - x'‖² / (2σ²) ).
Gaussian Naive Bayes
Assumption: features are conditionally independent given the label, P(x | y) = ∏_d P(x_d | y).
Gaussian: P(x_d | y) = N(x_d; μ_{d,y}, σ_d²).

LR and GNB
Both GNB and LR express P(y=1 | x) as a linear model.
GNB learns its parameters under the GNB assumption; LR learns its weights using maximum likelihood of the data.
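The equivalence the slide alludes to can be made explicit; assuming per-feature variances σ_d² shared across the two classes (a standard textbook setup, not stated on the slide), the GNB posterior is exactly logistic:

$$\ln\frac{P(y=1\mid x)}{P(y=0\mid x)}
= \ln\frac{P(y=1)}{P(y=0)}
+ \sum_{d=1}^{D}\left[\frac{\mu_{d,1}-\mu_{d,0}}{\sigma_d^{2}}\,x_d
+ \frac{\mu_{d,0}^{2}-\mu_{d,1}^{2}}{2\sigma_d^{2}}\right],$$

so $P(y=1\mid x) = \sigma(w^{\top}x + b)$ with $w_d = (\mu_{d,1}-\mu_{d,0})/\sigma_d^{2}$, a linear model in x.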
Motivation: NB and LR
NB assumption: P(x | y) = ∏_d P(x_d | y), i.e. factorizing the class-conditional likelihood by Naïve Bayes.
GNB assumption: each P(x_d | y) is Gaussian.
Under the NB assumption the posterior odds also factorize by feature, which is the form that DLR generalizes with learned weights w_d.