Big Data Algorithms with Medical Applications
idke.ruc.edu.cn/ccfbigdata/slides/cyx.pdf

Big Data Algorithms with Medical Applications Yixin Chen

Page 1

Big Data Algorithms with Medical Applications

Yixin Chen

Page 2

Outline

Challenges to big data algorithms

Clinical Big Data

Our new algorithms

Page 3

Small data vs. Big data

Page 4

Small data vs. Big data

General patterns

VS

Specific patterns

Page 5

Small data vs. Big data

Causality vs. Association

Domain knowledge vs. Data knowledge

Page 6

Small data vs. Big data Models

[Figure: model quality (y-axis) versus data size (x-axis), with curves labeled "Big Data" and "Small Data".]

Page 7

Modeling techniques

Parametric VS Non-parametric

Efficiency, interpretability, accuracy

Page 8

Efficiency of big data models

High efficiency
- Parallelization (constant speedup)
- Algorithmic improvements (e.g. O(N^3) vs O(N^2))

Large-scale Manifold Learning: Maximum Variance Correction (Chen et al. ICML’13)
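
The contrast between a constant speedup and an algorithmic improvement can be made concrete with a little arithmetic. The sketch below is only an illustration: the data size and the number of machines are hypothetical values chosen for the example, and parallelization is assumed to be ideal.

```python
def parallel_cost(operations: float, workers: int) -> float:
    """Ideal parallelization: work is split evenly, giving a constant-factor speedup."""
    return operations / workers


n = 100_000        # hypothetical number of data points
workers = 1_000    # hypothetical number of machines

cubic_parallel = parallel_cost(n ** 3, workers)   # O(N^3) method spread over all machines
quadratic_serial = float(n ** 2)                  # O(N^2) method on a single machine

# How much more work the parallel cubic method still does
ratio = cubic_parallel / quadratic_serial
print(f"parallel O(N^3) costs {ratio:.0f}x the serial O(N^2) method")
```

Under these assumed numbers the cubic method remains far more expensive even with a thousand machines, because the gap between the two methods grows with N while the parallel speedup stays fixed.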

Page 9

Outline

Challenges to big data algorithms

Clinical Big Data

Our new algorithms

Page 10

The need for clinical prediction

• ICU direct costs per day for survivors are between six and seven times those for non-ICU care.

• Unlike ICU patients, general hospital ward (GHW) patients are not under extensive electronic monitoring and nursing care.

• Clinical studies have found that 4–17% of patients undergo cardiopulmonary or respiratory arrest while in the GHW of a hospital.

Page 11

Goal: Let Data Speak!

Sudden deteriorations (e.g. septic shock, cardiopulmonary or respiratory arrest) of GHW patients can often be severe and life-threatening.

Goal: Provide early detection and intervention based on data mining to prevent these serious, often life-threatening events.

• Using both clinical data and wireless body sensor data
• An NSF/NIH-funded clinical trial at Washington University/Barnes Jewish Hospital

Page 12

Clinical Data: high-dimensional real-time time-series data

34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, …

[Figure: vital-sign time series; x-axis: time (seconds).]

Page 13

Previous Work

Medical data mining

• Medical knowledge
  - SCAP and PSI
  - Acute Physiology Score, Chronic Health Score, and APACHE score are used to predict renal failure
  - Modified Early Warning Score (MEWS)
• Machine learning methods
  - Decision trees
  - Neural networks
  - SVM

Main problem: Most previous work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.

Page 14

Machine learning task

[Figure: bar chart comparing Non-ICU and ICU counts; y-axis from 0 to 30,000.]

Challenges:
• Classification of high-dimensional time series data
• Irregular data gaps
• Measurement errors
• Class imbalance

Page 15

Solution based on existing techniques

• Temporal feature extraction
• Bootstrap aggregating (bagging)
• Exploratory under-sampling
• Feature selection
• Exponential moving average smoothing
• Basic classifier

(Mao et al. KDD’12)
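
Two of the listed steps are easy to sketch in isolation: exponential moving average smoothing and bagging over under-sampled training sets. The code below is a minimal illustration under stated assumptions, not the pipeline of Mao et al.: the function names and the smoothing factor `alpha` are hypothetical, and plain random under-sampling stands in for the exploratory under-sampling named on the slide.

```python
import random
from typing import List, Sequence


def ema_smooth(values: Sequence[float], alpha: float = 0.3) -> List[float]:
    """Exponential moving average: each output mixes the new value with the previous output."""
    smoothed: List[float] = []
    for v in values:
        prev = smoothed[-1] if smoothed else v   # the first output equals the first input
        smoothed.append(alpha * v + (1.0 - alpha) * prev)
    return smoothed


def undersampled_bags(majority: Sequence, minority: Sequence,
                      n_bags: int, rng: random.Random) -> List[list]:
    """Build balanced training bags: all minority examples plus an equal-sized
    random draw (without replacement) from the majority class."""
    k = min(len(minority), len(majority))
    return [list(minority) + rng.sample(list(majority), k) for _ in range(n_bags)]


# Usage with toy data: 20 non-ICU records and 3 ICU records (identifiers only)
rng = random.Random(0)
bags = undersampled_bags(list(range(20)), ["icu_a", "icu_b", "icu_c"], n_bags=5, rng=rng)
print(len(bags), len(bags[0]))
print(ema_smooth([0.0, 10.0], alpha=0.5))
```

One classifier would then be trained per bag and the predictions averaged, which is how bagging addresses class imbalance without discarding the minority class.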

Page 17

Desired Classifier Properties

• Nonlinear classification ability
• Interpretability
• Support for mixed data types
• Efficiency
• Multi-class classification

Linear SVM and Logistic Regression: interpretable and efficient, but linear

SVM with RBF kernels: nonlinear, but not interpretable and inefficient

Page 18

Desired Classifier Properties

Property                             kNN  NB  NN  LR  Linear SVM  Kernel SVM
Nonlinear classification ability     Y    N   Y   N   N           Y
Interpretability                     N    Y   N   Y   Y           N
Direct support for mixed data types  Y    Y   N   N   N           N
Efficiency                           Y    Y   Y   Y   Y           N
Multi-class classification           Y    Y   Y   Y   N           N

Page 19

Random kitchen sinks (RKS)

Random nonlinear feature transformation → Parametric, linear classifier

1. Transform each input x into exp(-i w_k x), k = 1, …, K, with w_k ~ Gaussian distribution p(w)
2. Learn a linear model ∑_k α_k exp(-i w_k x)

Theory: based on the Fourier transform, RKS converges to RBF-SVM for large K.

Efficient, but not interpretable.
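
The two steps above can be sketched with random Fourier features. This is a minimal illustration, not the slide's exact construction: it uses the common real-valued variant (cosine features with a random phase) in place of the complex exponential, and the names `make_rks_map`, `rbf_kernel`, and the parameter `gamma` are assumptions introduced for the example. The target kernel is taken to be exp(-gamma * ||x - y||^2).

```python
import math
import random
from typing import Callable, List, Sequence


def make_rks_map(dim: int, n_features: int, gamma: float,
                 rng: random.Random) -> Callable[[Sequence[float]], List[float]]:
    """Draw K random directions w_k ~ N(0, 2*gamma*I) and phases b_k ~ U[0, 2*pi],
    and return a function mapping x to sqrt(2/K) * cos(w_k . x + b_k)."""
    std = math.sqrt(2.0 * gamma)
    ws = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    bs = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    norm = math.sqrt(2.0 / n_features)

    def transform(x: Sequence[float]) -> List[float]:
        return [norm * math.cos(sum(wi * xi for wi, xi in zip(w, x)) + b)
                for w, b in zip(ws, bs)]

    return transform


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float) -> float:
    """Exact RBF kernel, used here only to check the approximation."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


# Usage: the inner product of transformed points approximates the kernel value
rng = random.Random(0)
transform = make_rks_map(dim=2, n_features=5000, gamma=0.5, rng=rng)
x, y = [0.3, -0.2], [0.1, 0.4]
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
print(approx, rbf_kernel(x, y, 0.5))
```

A linear classifier trained on `transform(x)` then plays the role of step 2. The random features carry no individual meaning, which is consistent with the slide's remark about interpretability.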

Page 20

Outline

Challenges to big data algorithms

Clinical Big Data

Our new algorithms

Page 21

Key Idea: Hybrid Model

Non-parametric, nonlinear feature transformation → Parametric, linear classifier

• Efficiency
• Interpretability
• Nonlinearity

Page 22

Desired Classifier Properties

Property                             kNN  NB  NN  LR  Linear SVM  Kernel SVM  DLR
Nonlinear classification ability     Y    N   Y   N   N           Y           Y
Interpretability                     N    Y   N   Y   Y           N           Y
Direct support for mixed data types  Y    Y   N   N   N           N           Y
Efficiency                           Y    Y   Y   Y   Y           N           Y
Multi-class classification           Y    Y   Y   Y   N           N           Y

DLR: Density-based Logistic Regression (Chen et al., KDD’13)

Page 23

Logistic Regression

Each instance has D features:

Training dataset:

Assume:

where τ(x)

Optimization: maximize the overall log likelihood
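
The equations on this slide did not survive extraction. For reference, the standard binary logistic regression setup that these labels describe can be written as follows. The notation (superscripted instance index, labels in {0, 1}, intercept w_0) is an assumption and may differ from the slide.

```latex
\begin{aligned}
&x = (x_1, \dots, x_D), \qquad
 \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}, \quad y^{(i)} \in \{0, 1\} \\[4pt]
&p(y = 1 \mid x) = \frac{1}{1 + \exp(-\tau(x))}, \qquad
 \tau(x) = w_0 + \sum_{d=1}^{D} w_d \, x_d \\[4pt]
&\max_{w} \; \sum_{i=1}^{N} \Big[ y^{(i)} \ln p(y = 1 \mid x^{(i)})
   + \big(1 - y^{(i)}\big) \ln\big(1 - p(y = 1 \mid x^{(i)})\big) \Big] \\[4pt]
&\tau(x) = \ln \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}
 \quad \text{(solving the second line for } \tau(x) \text{)}
\end{aligned}
```

The last line is the identity that makes τ(x) interpretable as the log-odds of the positive class.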

Page 24

Problem with linear models

If we set τ(x) = w_0 + ∑_d w_d ϕ_d(x), what should ϕ_d(x) be?

Page 25

Insights on τ(x) (Logistic regression)

On the other hand:

Hence: LR:

Page 26

Factorization in DLR

Assumption:

Page 27

, where

DLR Feature Transformation

is an increasing function of
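
The formula for the transformation is not preserved in this transcript. A minimal sketch follows, assuming the log-odds form usually reported for DLR (Chen et al., KDD’13): ϕ_d(x) = ln[p(y=1|x_d) / p(y=0|x_d)] − ((D−1)/D) · ln[p(y=1) / p(y=0)]. Treat this form, the function name `dlr_feature`, and its parameter names as assumptions for illustration; the inputs must lie strictly between 0 and 1.

```python
import math


def dlr_feature(p_pos_given_xd: float, prior_pos: float, n_features: int) -> float:
    """Assumed DLR feature: per-feature log-odds minus a share of the prior log-odds.

    p_pos_given_xd : estimate of p(y=1 | x_d), strictly inside (0, 1)
    prior_pos      : class prior p(y=1), strictly inside (0, 1)
    n_features     : D, the number of features
    """
    log_odds = math.log(p_pos_given_xd / (1.0 - p_pos_given_xd))
    prior_log_odds = math.log(prior_pos / (1.0 - prior_pos))
    return log_odds - (n_features - 1) / n_features * prior_log_odds


# Usage: the feature grows with p(y=1 | x_d), so a larger value signals higher risk
for p in (0.2, 0.5, 0.8):
    print(p, dlr_feature(p, prior_pos=0.5, n_features=3))
```

Under this assumed form the feature is monotonically increasing in the conditional probability, which matches the statement on the slide.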

Page 28

Conditional Probability Estimation

Numerical x_d: kernel density estimation

Categorical x_d: smoothed histogram
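
Both estimators can be sketched in a few lines. The exact formulas on the slide are not preserved, so the code below assumes a Nadaraya-Watson style estimate with a Gaussian kernel for numerical features and add-one (Laplace) smoothing for categorical features. The function names and the `smoothing` parameter are hypothetical.

```python
import math
from typing import Hashable, Sequence


def kernel_estimate(x: float, xs: Sequence[float], ys: Sequence[int], h: float) -> float:
    """Estimate p(y=1 | x_d = x) as a kernel-weighted average of the training labels.

    h is the kernel bandwidth: larger values average over a wider neighbourhood.
    """
    weights = [math.exp(-((x - xi) ** 2) / (2.0 * h * h)) for xi in xs]
    total = sum(weights)
    if total == 0.0:                      # x is far from every training point
        return sum(ys) / len(ys)          # fall back to the class prior
    return sum(w * y for w, y in zip(weights, ys)) / total


def categorical_estimate(value: Hashable, xs: Sequence[Hashable], ys: Sequence[int],
                         smoothing: float = 1.0) -> float:
    """Estimate p(y=1 | x_d = value) from counts, smoothed so rare values avoid 0 or 1."""
    n_match = sum(1 for xi in xs if xi == value)
    n_pos = sum(y for xi, y in zip(xs, ys) if xi == value)
    return (n_pos + smoothing) / (n_match + 2.0 * smoothing)


# Usage on toy data
print(kernel_estimate(0.0, [-1.0, 1.0], [0, 1], h=1.0))
print(categorical_estimate("a", ["a", "a", "b"], [1, 1, 0]))
```

Because each feature is handled separately, numerical and categorical inputs can be mixed freely, which is the property the comparison table calls direct support for mixed data types.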

Page 29

Kernel density estimation

Training dataset:

where h is the kernel bandwidth

Page 30

DLR Learning

Maximize the overall log likelihood

Objective: a function of w and h

Page 31

Overview of DLR

Initialize h and w → Calculate new feature vector → Update w → Update h → Converged? (No: repeat from "Calculate new feature vector")

Page 32

Optimization

Repeat until convergence:
• Fix h and optimize w (using a LR solver)
• Fix w and optimize h (steepest gradient descent)

[Figure: panels labeled Initial h, Iter 1, Iter 2, Iter 3.]
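
The alternating scheme can be sketched end to end on toy data. This is an illustration under several assumptions rather than the authors' implementation: a single shared bandwidth `h`, leave-one-out Gaussian kernel estimates for the features, plain gradient ascent standing in for the LR solver, and a finite-difference gradient step for `h`. All function names and default values are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

EPS = 1e-6


def sigmoid(t: float) -> float:
    t = max(min(t, 30.0), -30.0)          # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-t))


def features(X: Sequence[Sequence[float]], y: Sequence[int], h: float) -> List[List[float]]:
    """Leave-one-out kernel estimate of p(y=1 | x_d), mapped to a log-odds feature."""
    n, dims = len(X), len(X[0])
    prior = min(max(sum(y) / n, EPS), 1.0 - EPS)
    offset = (dims - 1) / dims * math.log(prior / (1.0 - prior))
    phi = []
    for i in range(n):
        row = []
        for d in range(dims):
            num = den = 0.0
            for j in range(n):
                if j != i:
                    k = math.exp(-((X[i][d] - X[j][d]) ** 2) / (2.0 * h * h))
                    num += k * y[j]
                    den += k
            p = num / den if den > 0.0 else prior
            p = min(max(p, EPS), 1.0 - EPS)
            row.append(math.log(p / (1.0 - p)) - offset)
        phi.append(row)
    return phi


def log_likelihood(w: Sequence[float], phi: Sequence[Sequence[float]], y: Sequence[int]) -> float:
    total = 0.0
    for row, label in zip(phi, y):
        p = sigmoid(w[0] + sum(wd * f for wd, f in zip(w[1:], row)))
        p = min(max(p, EPS), 1.0 - EPS)
        total += label * math.log(p) + (1 - label) * math.log(1.0 - p)
    return total


def fit_w(w: Sequence[float], phi, y, steps: int = 200, lr: float = 0.05) -> List[float]:
    """Fix h, optimize w: plain gradient ascent standing in for an LR solver."""
    w = list(w)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row, label in zip(phi, y):
            err = label - sigmoid(w[0] + sum(wd * f for wd, f in zip(w[1:], row)))
            grad[0] += err
            for d, f in enumerate(row):
                grad[d + 1] += err * f
        w = [wd + lr * g / len(y) for wd, g in zip(w, grad)]
    return w


def fit_dlr(X, y, h: float = 1.0, rounds: int = 5,
            h_step: float = 0.1, delta: float = 1e-3) -> Tuple[List[float], float]:
    w = [0.0] * (len(X[0]) + 1)
    for _ in range(rounds):
        phi = features(X, y, h)           # calculate new feature vector
        w = fit_w(w, phi, y)              # fix h, optimize w
        # Fix w, take one gradient step on h (central finite difference)
        up = log_likelihood(w, features(X, y, h + delta), y)
        down = log_likelihood(w, features(X, y, max(h - delta, EPS)), y)
        h = max(h + h_step * (up - down) / (2.0 * delta) / len(y), 1e-2)
    return w, h
```

A fixed number of rounds replaces the convergence test for brevity. In practice the loop would stop once the log likelihood stops improving.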

Page 33

Interpretability

DLR: τ(x) = w_0 + ∑_d w_d ϕ_d(x)

For example, y = 1 represents a particular disease.

On disease level
Ranking w_d can identify the risk factors of this disease.

On patient level
If x_d represents the blood pressure (BP) of a patient:
ϕ_d(x) indicates the abnormality of his BP
w_d ϕ_d(x) indicates the extent of BP resulting in his disease

Page 34

Kernel

Ideal kernel:

RBF kernel: doesn’t consider the label information

Page 35

DLR Kernel

DLR kernel:

indicates same label

indicates different label

Page 36

DLR on example data

[Figure: test data, with panels labeled "Original LR" and "Density-based LR".]

Page 37

Accuracy on UCI Datasets

[Figure: accuracy on UCI datasets, grouped into numerical and categorical; an arrow marks the "better" direction.]

Page 38

Training Time

[Figure: training time on the same datasets, grouped into numerical and categorical; an arrow marks the "better" direction.]

Page 39

Results on clinical data

Accuracy:
LR: 0.9141
SVM: 0.9194
DLR: 0.9204

Early alert when the patient appears normal to the best doctors in the world

Page 40

DLR for real large data

Conditional probability estimation: kernel density smoothing
• Still too slow for big data
• Testing time grows as the training set gets larger

Conditional probability estimation: histogram
• No curse of dimensionality for estimation
• Ultra-fast training and testing

Page 41

DLR with Bins

Page 42

DLR with Bins

• Not smooth
• Not enough data

Page 43

Histogram KDE Smoothing

where, for each bin i, the two counts are the number of instances with the given label in bin i and the number of instances in bin i
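
The smoothing formula itself is missing from the transcript. The sketch below shows one plausible reading, assuming a Gaussian kernel applied across bin indices to both counts before taking their ratio. The function name, the `bandwidth` parameter (measured in bins), and the fallback value for an empty histogram are assumptions.

```python
import math
from typing import List, Sequence


def binned_estimate(xs: Sequence[float], ys: Sequence[int], n_bins: int,
                    lo: float, hi: float, bandwidth: float = 1.0) -> List[float]:
    """Per-bin estimate of p(y=1 | bin), smoothed over neighbouring bins."""
    width = (hi - lo) / n_bins
    pos = [0.0] * n_bins    # number of label-1 instances in each bin
    tot = [0.0] * n_bins    # number of instances in each bin
    for x, y in zip(xs, ys):
        i = max(0, min(int((x - lo) / width), n_bins - 1))   # clip to a valid bin
        pos[i] += y
        tot[i] += 1.0
    estimates = []
    for i in range(n_bins):
        num = den = 0.0
        for j in range(n_bins):
            k = math.exp(-((i - j) ** 2) / (2.0 * bandwidth ** 2))
            num += k * pos[j]
            den += k * tot[j]
        estimates.append(num / den if den > 0.0 else 0.5)
    return estimates


# Usage: two bins over [0, 1]; low values are label 0, high values are label 1
est = binned_estimate([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1], n_bins=2, lo=0.0, hi=1.0,
                      bandwidth=0.5)
print(est)
```

Smoothing across bins addresses both problems noted for plain bins: neighbouring estimates vary gradually, and sparsely populated bins borrow counts from their neighbours. At test time a prediction needs only a bin lookup, so its cost does not depend on the training set size.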

Page 44

Different Number of Bins

[Figure: estimates with 5 bins, 20 bins, and 100 bins.]

Page 45

Results on accuracy

Method     Splice (1K)  Mush (8K)  w5a (10K)  w8a (50K)  Adult (30K)  kddcup (1.26M)
linearSVM  75           100        98.15      98.57      60.03        99.99
LR         77           99.87      97.67      98.24      84.80        99.99
RBF SVM    80           99.23      97.14      97.20      75.29        N/A
DLR-b      88           99.95      98.26      98.55      85.54        99.99

Page 46

Results on efficiency

Method     Splice (1K)  Mush (8K)  w5a (10K)  w8a (50K)  Adult (30K)  kddcup (1.26M)
linearSVM  0.12         0.56       1.16       15         2847         81.70
LR         0.15         0.21       0.18       0.7        2.89         55.66
RBF SVM    0.09         1.63       1.60       29         217          N/A
DLR-b      0.22         0.32       2.65       7.6        0.6          17.93

Page 47

Feature Selection Ability

DLR:

• l1-regularization: loss(w) + c ∑ |w_d| (non-smooth optimization)

• However, in DLR, we can simply use c ∑ w_d along with constraints w_d ≥ 0 (smooth optimization)
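
With the constraint w_d ≥ 0, the penalty c ∑ w_d is linear, so a projected gradient method suffices. The sketch below is one simple way to solve such a problem, not necessarily the solver used by the authors: it minimizes the mean logistic loss plus the penalty and clips the weights at zero after every step. The function name and the default values of `c`, `lr`, and `steps` are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def fit_nonneg_l1(phi: Sequence[Sequence[float]], y: Sequence[int], c: float = 0.1,
                  lr: float = 0.1, steps: int = 500) -> Tuple[float, List[float]]:
    """Minimize mean logistic loss + c * sum(w_d) subject to w_d >= 0.

    The intercept w0 is neither penalized nor constrained.
    """
    n, dims = len(phi), len(phi[0])
    w0, w = 0.0, [0.0] * dims
    for _ in range(steps):
        g0, g = 0.0, [0.0] * dims
        for row, label in zip(phi, y):
            t = w0 + sum(wd * f for wd, f in zip(w, row))
            t = max(min(t, 30.0), -30.0)                 # clamp to avoid overflow
            err = 1.0 / (1.0 + math.exp(-t)) - label     # derivative of the loss w.r.t. t
            g0 += err / n
            for d, f in enumerate(row):
                g[d] += err * f / n
        w0 -= lr * g0
        # The penalty contributes a constant gradient c; max(0, .) is the projection
        w = [max(0.0, wd - lr * (gd + c)) for wd, gd in zip(w, g)]
    return w0, w


# Usage: the first feature tracks the label, the second is uninformative noise
phi = [[-2.0, 0.1], [-1.0, -0.1], [1.0, 0.1], [2.0, -0.1]]
labels = [0, 0, 1, 1]
w0, w = fit_nonneg_l1(phi, labels)
print(w)
```

Weights of uninformative features are pushed to exactly zero by the projection, which gives the feature selection effect of l1-regularization without a non-smooth objective.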

Page 48

Top features selected by DLR

Standard deviation of heart rate

ApEn of heart rate

Energy of oxygen saturation

LF of oxygen saturation

LF of heart rate

DFA of oxygen saturation

Mean of heart rate

HF of heart rate

Inertia of heart rate

Homogeneity of heart rate

Energy of heart rate

Linear correlation of heart rate and oxygen saturation

Page 49

Conclusions on DLR

DLR satisfies all the following:
• Nonlinear classification ability
• Support for mixed data types
• Interpretability
• Efficiency
• Multi-class classification

Try it out! http://www.cse.wustl.edu/~wenlinchen/project/DLR/

Page 50

Big Data Algorithms

• Hybrid!
  - Non-parametric + parametric
  - Association + causality
  - Generative + discriminative
  - Balance accuracy and speed
• For real big data, get rid of heavy machinery
  - Let accuracy grow with data size
• Linear model would suffice with enough nonlinearity/randomness

Page 51

Thank you

Page 52

Challenges of the big data era:

McKinsey Global Institute report: big data talent is scarce

Talent

Page 53

Property                             kNN  NB  NN  LR  Linear SVM  Kernel SVM  Random Kitchen Sinks
Nonlinear classification ability     Y    N   Y   N   N           Y           Y
Interpretability                     N    Y   N   Y   Y           N           N
Direct support for mixed data types  Y    Y   N   N   N           N           N
Efficiency                           Y    Y   Y   Y   Y           N           Y
Multi-class classification           Y    Y   Y   Y   N           N           N

RKS: Linear model over nonlinear features

RBF SVM: k(x,x’) =

Page 54

Gaussian Naive Bayes

Assumption:

Gaussian:

Page 55

LR and GNB

Both GNB and LR express the log-odds of y given x in a linear model

GNB learns the weights under the GNB assumption
LR learns the weights using maximum likelihood of the data

Assumption:
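
The linear form shared by the two models can be derived in a few lines. The derivation below is a standard result and assumes that features are conditionally independent given the class and that each feature has a class-independent variance. The symbols π, μ_{dk}, and σ_d are notation introduced here, since the slide's own formulas are not preserved.

```latex
\begin{aligned}
&\text{Assume } p(x_d \mid y = k) = \mathcal{N}(x_d;\, \mu_{dk},\, \sigma_d^2),
 \quad k \in \{0, 1\}, \qquad \pi = p(y = 1). \\[4pt]
&\ln \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)}
 = \ln \frac{\pi}{1 - \pi}
 + \sum_{d=1}^{D} \ln \frac{p(x_d \mid y = 1)}{p(x_d \mid y = 0)} \\[4pt]
&\ln \frac{p(x_d \mid y = 1)}{p(x_d \mid y = 0)}
 = \frac{(x_d - \mu_{d0})^2 - (x_d - \mu_{d1})^2}{2 \sigma_d^2}
 = \frac{\mu_{d1} - \mu_{d0}}{\sigma_d^2} \, x_d
 + \frac{\mu_{d0}^2 - \mu_{d1}^2}{2 \sigma_d^2} \\[4pt]
&\Rightarrow\;
 \ln \frac{p(y = 1 \mid x)}{p(y = 0 \mid x)} = w_0 + \sum_{d=1}^{D} w_d \, x_d,
 \qquad w_d = \frac{\mu_{d1} - \mu_{d0}}{\sigma_d^2}, \quad
 w_0 = \ln \frac{\pi}{1 - \pi} + \sum_{d=1}^{D} \frac{\mu_{d0}^2 - \mu_{d1}^2}{2 \sigma_d^2}
\end{aligned}
```

GNB obtains these weights from the estimated means and variances, whereas LR fits the same linear form directly by maximizing the likelihood.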

Page 56

Motivation

NB LR

Assumption:

Page 57

Motivation

GNB Assumption:

Factorizing by

Factorizing by Naïve Bayes