Big Data Algorithms with Medical Applications
Yixin Chen
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
Small data vs. Big data
General laws
VS
Specific laws
Small data vs. Big data
Causality vs. association
Domain knowledge vs. data knowledge
Small data vs. Big data Models
[Figure: model quality as a function of data size, for big-data and small-data models]
Modeling techniques
Parametric vs. non-parametric
Trade-offs: efficiency, interpretability, accuracy
Efficiency of big data models
High efficiency via:
- Parallelization (constant speedup)
- Algorithmic improvements (e.g. O(N³) vs. O(N²))
Large-scale manifold learning: Maximum Variance Correction (Chen et al., ICML'13)
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
The need for clinical prediction
• The ICU direct costs per day for survivors are between six and seven times those of non-ICU care.
• Unlike patients in ICUs, general hospital ward (GHW) patients are not under extensive electronic monitoring and nurse care.
• Clinical studies have found that 4–17% of patients will undergo cardiopulmonary or respiratory arrest while in the GHW of a hospital.
Goal: Let Data Speak!
Sudden deteriorations (e.g. septic shock, cardiopulmonary or respiratory arrest) of GHW patients can often be severe and life threatening.
Goal: provide early detection and intervention based on data mining to prevent these serious, often life-threatening events, using both clinical data and wireless body sensor data.
An NSF/NIH-funded clinical trial at Washington University / Barnes-Jewish Hospital.
Clinical Data: high-dimensional real-time time-series data
34 vital signs: pulse, temperature, oxygen saturation, shock index, respirations, blood pressure, …
[Figure: real-time vital-sign time series; x-axis: time in seconds]
Previous Work
Main problems: most previous work uses a snapshot method that takes all the features at a given time as input to a model, discarding the temporal evolution of the data.
Medical data mining combines medical knowledge and machine learning methods.
Medical knowledge based scores:
- SCAP and PSI
- Acute Physiology Score, Chronic Health Score, and APACHE score, used to predict renal failure
- Modified Early Warning Score (MEWS)
Machine learning methods:
- decision trees, neural networks, SVM
Machine learning task
[Figure: counts of Non-ICU vs. ICU instances (y-axis 0 to 30,000), showing severe class imbalance]
Challenges:
• Classification of high-dimensional time series data
• Irregular data gaps
• Measurement errors
• Class imbalance
Solution based on existing techniques (Mao et al., KDD'12)
• Temporal feature extraction
• Bootstrap aggregating (bagging)
• Exploratory under-sampling
• Feature selection
• Exponential moving average smoothing (see the sketch below)
• Basic classifier
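As a concrete illustration of one pipeline step, here is a minimal sketch of exponential moving average smoothing; the smoothing factor alpha and the toy pulse series are assumptions for illustration, not values from the study.

```python
import numpy as np

def ema_smooth(series, alpha=0.3):
    """Exponential moving average: s_t = alpha * x_t + (1 - alpha) * s_{t-1}."""
    smoothed = np.empty(len(series), dtype=float)
    smoothed[0] = series[0]
    for t in range(1, len(series)):
        smoothed[t] = alpha * series[t] + (1 - alpha) * smoothed[t - 1]
    return smoothed

# A toy pulse series with a spike (e.g. a sensor measurement error):
pulse = np.array([82.0, 85.0, 180.0, 84.0, 83.0, 86.0])
print(ema_smooth(pulse))  # the spike is damped rather than passed through
```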
Desired Classifier Properties
• Nonlinear classification ability
• Interpretability
• Support for mixed data types
• Efficiency
• Multi-class classification
Linear SVM and logistic regression: interpretable and efficient, but linear.
SVM with RBF kernels: nonlinear, but not interpretable and inefficient.
Desired Classifier Properties

                                     kNN  NB  NN  LR  Linear SVM  Kernel SVM
Nonlinear classification ability      Y   N   Y   N       N           Y
Interpretability                      N   Y   N   Y       Y           N
Direct support for mixed data types   Y   Y   N   N       N           N
Efficiency                            Y   Y   Y   Y       Y           N
Multi-class classification            Y   Y   Y   Y       N           N
Random Kitchen Sinks (RKS)
Random nonlinear feature transformation + parametric, linear classifier:
1. Transform each input x into exp(-i w_k^T x), k = 1, …, K, where w_k ~ a Gaussian distribution p(w).
2. Learn a linear model ∑_k α_k exp(-i w_k^T x).
Theory: based on the Fourier transform, RKS converges to RBF-SVM for large K.
Efficient, but not interpretable.
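A minimal sketch of RKS using the real-valued cosine form of the random Fourier features; the feature count K, the kernel width sigma, and the toy data are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rks_features(X, W, b):
    # Real-valued counterpart of exp(-i w_k^T x): random cosine features.
    K = W.shape[1]
    return np.sqrt(2.0 / K) * np.cos(X @ W + b)

rng = np.random.default_rng(0)
K, sigma = 500, 1.0                       # number of random features, kernel width

X = rng.normal(size=(1000, 34))           # toy stand-in for 34 vital-sign features
y = (np.linalg.norm(X[:, :2], axis=1) > 1.2).astype(int)  # nonlinear toy labels

W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], K))   # w_k ~ Gaussian p(w)
b = rng.uniform(0.0, 2.0 * np.pi, size=K)                 # random phases

clf = LogisticRegression(max_iter=1000).fit(rks_features(X, W, b), y)
print(clf.score(rks_features(X, W, b), y))                # training accuracy
```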
Outline
Challenges to big data algorithms
Clinical Big Data
Our new algorithms
Key Idea: Hybrid Model
Non-parametric, nonlinear feature transformation + parametric, linear classifier
→ Efficiency, interpretability, and nonlinearity
Desired Classifier Properties

                                     kNN  NB  NN  LR  Linear SVM  Kernel SVM  DLR
Nonlinear classification ability      Y   N   Y   N       N           Y        Y
Interpretability                      N   Y   N   Y       Y           N        Y
Direct support for mixed data types   Y   Y   N   N       N           N        Y
Efficiency                            Y   Y   Y   Y       Y           N        Y
Multi-class classification            Y   Y   Y   Y       N           N        Y
DLR: Density-based Logistic Regression (Chen et al., KDD’13)
Logistic Regression

Each instance has D features: x = (x_1, …, x_D).
Training dataset: {(x^(n), y^(n))}, n = 1, …, N, with labels y^(n) ∈ {0, 1}.
Assume: τ(x) = P(y = 1 | x) = 1 / (1 + exp(-(w^T φ(x) + b))), where τ(x) is the predicted probability of the positive class.
Optimization: maximize the overall log likelihood
  L(w, b) = ∑_n [ y^(n) ln τ(x^(n)) + (1 - y^(n)) ln(1 - τ(x^(n))) ].

Problem with linear models
Plain LR sets φ_d(x) = x_d, so the decision boundary is linear in x. If we keep the linear form w^T φ(x) + b, what should φ_d(x) be to capture nonlinearity?

Insights on τ(x) (logistic regression)
From the model: ln( τ(x) / (1 - τ(x)) ) = w^T φ(x) + b.
On the other hand, by definition: ln( P(y=1|x) / P(y=0|x) ) = ln( τ(x) / (1 - τ(x)) ).
Hence LR fits the log odds of y given x with a function linear in φ(x).
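The slides stop at the objective; for completeness, the standard gradient used by LR solvers (a textbook identity, not shown in the slides) is:

$$\nabla_{w}\,\mathcal{L} \;=\; \sum_{n=1}^{N}\Big(y^{(n)} - \tau\big(x^{(n)}\big)\Big)\,\phi\big(x^{(n)}\big),
\qquad
\frac{\partial \mathcal{L}}{\partial b} \;=\; \sum_{n=1}^{N}\Big(y^{(n)} - \tau\big(x^{(n)}\big)\Big).$$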
Factorization in DLR
Assumption: the posterior odds factorize over features,
  P(y=1|x) / P(y=0|x) = e^b · ∏_d [ P(y=1|x_d) / P(y=0|x_d) ]^(w_d),
where w_d weights the contribution of feature d. Taking logarithms recovers the linear form w^T φ(x) + b.
DLR Feature Transformation
φ_d(x) = ln( P(y=1 | x_d) / P(y=0 | x_d) ), which is an increasing function of P(y=1 | x_d).
Conditional Probability Estimation
Numerical x_d: kernel density (Nadaraya-Watson) estimation from the training dataset,
  P̂(y=1 | x_d) = ∑_n 1{y^(n)=1} K( (x_d - x_d^(n)) / h ) / ∑_n K( (x_d - x_d^(n)) / h ),
where K is a kernel function (e.g. Gaussian) and h is the kernel bandwidth.
Categorical x_d: smoothed histogram (smoothed per-category relative frequencies).
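A minimal sketch of this estimate for one numerical feature, using a Gaussian kernel; the bandwidth h and the smoothing constant eps are assumptions for illustration.

```python
import numpy as np

def phi_d(x_query, x_train_d, y_train, h=0.5, eps=1e-6):
    """DLR feature: log P(y=1|x_d) - log P(y=0|x_d), via Nadaraya-Watson KDE."""
    k = np.exp(-((x_query - x_train_d) ** 2) / (2.0 * h ** 2))  # kernel weights
    p1 = (k @ (y_train == 1) + eps) / (k.sum() + 2.0 * eps)     # smoothed P(y=1|x_d)
    return np.log(p1 / (1.0 - p1))                              # log odds

# Example: transform one blood-pressure reading against a toy training column.
x_col = np.array([110.0, 150.0, 90.0, 130.0])
y = np.array([0, 1, 0, 1])
print(phi_d(120.0, x_col, y))
```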
DLR Learning
Maximize the overall log likelihood.
Objective: L(w, h) = ∑_n [ y^(n) ln τ(x^(n)) + (1 - y^(n)) ln(1 - τ(x^(n))) ], a function of both the weights w and the kernel bandwidth h (which enters through the features φ).
Overview of DLR (alternating optimization)
1. Initialize h and w.
2. Calculate the new feature vectors φ(x) with the current h.
3. Fix h and optimize w (using a LR solver).
4. Fix w and optimize h (steepest gradient descent).
5. Converged? If not, go to step 2; repeat until convergence.
[Figure: decision boundary at the initial h and after iterations 1, 2, and 3]
A structural sketch of this loop follows below.
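A structural sketch of the alternating loop, reusing phi_d from the earlier KDE sketch; the crude numeric bandwidth gradient stands in for the paper's steepest gradient descent, and all step sizes are assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def dlr_features(X, y, h):
    # Transform every column of X with the density-based feature phi_d.
    return np.column_stack(
        [[phi_d(x, X[:, d], y, h) for x in X[:, d]] for d in range(X.shape[1])]
    )

def log_likelihood(X, y, h):
    clf = LogisticRegression(max_iter=1000).fit(dlr_features(X, y, h), y)
    p = np.clip(clf.predict_proba(dlr_features(X, y, h))[:, 1], 1e-9, 1 - 1e-9)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p)), clf

def train_dlr(X, y, h=0.5, n_outer=5, dh=1e-2, lr=0.1):
    for _ in range(n_outer):
        ll, clf = log_likelihood(X, y, h)          # fix h, optimize w (LR solver)
        ll_hi, _ = log_likelihood(X, y, h + dh)    # probe the objective at h + dh
        h = max(h + lr * (ll_hi - ll) / dh, 1e-3)  # fix w, gradient step on h
    return clf, h
```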
Interpretability of DLR
For example, suppose y represents a particular disease and x_d represents the blood pressure (BP) of a patient.
On the disease level: ranking the weights w_d can identify the risk factors of this disease.
On the patient level: φ_d(x) indicates the abnormality of the patient's BP, and w_d φ_d(x) indicates the extent to which BP contributes to his disease.
Kernels
Ideal kernel: k(x, x') = 1 if x and x' share the same label, 0 otherwise.
RBF kernel: k(x, x') = exp( -‖x - x'‖² / (2σ²) ) doesn't consider the label information.
DLR kernel: the inner product of the density-based features, k(x, x') = φ(x)^T φ(x'); a large value indicates the same label, a small value indicates different labels.
DLR on example data
[Figure: decision boundaries of original LR vs. density-based LR on example training and test data]
Accuracy on UCI Datasets
[Figure: accuracy on numerical and categorical UCI datasets; higher is better]
Training Time
[Figure: training time on numerical and categorical UCI datasets; lower is better]
Results on clinical data. Accuracy: LR 0.9141, SVM 0.9194, DLR 0.9204.
Early alert even when the patient appears normal to the best doctors in the world
DLR for really large data
KDE-based estimation of P(y | x_d): still too slow for big data; testing time grows as the training set gets larger.
Histogram-based estimation of P(y | x_d): no curse of dimensionality (each estimate is one-dimensional); ultra-fast training and testing.
DLR with Bins
Plain histograms: not smooth, and some bins have not enough data.
Fix: KDE-style smoothing of the histogram estimate
  P̂(y=1 | bin i) = n_i¹ / n_i,
where n_i¹ is the number of label-1 instances in bin i and n_i is the number of instances in bin i.
A sketch of the binned estimate follows below.
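A minimal sketch of the binned estimate with a simple neighbor-smoothing pass; the bin count, eps, and the smoothing weights are assumptions, not the paper's exact scheme.

```python
import numpy as np

def binned_log_odds(x_d, y, n_bins=20, eps=1e-6):
    edges = np.histogram_bin_edges(x_d, bins=n_bins)
    idx = np.clip(np.digitize(x_d, edges) - 1, 0, n_bins - 1)  # bin index per instance
    n_i = np.bincount(idx, minlength=n_bins)                   # instances in bin i
    n_i1 = np.bincount(idx, weights=(y == 1).astype(float), minlength=n_bins)
    p1 = (n_i1 + eps) / (n_i + 2.0 * eps)                      # raw histogram estimate
    p1 = np.convolve(p1, [0.25, 0.5, 0.25], mode="same")       # smooth sparse bins
    p1 = np.clip(p1, eps, 1.0 - eps)
    return np.log(p1 / (1.0 - p1))                             # per-bin log odds
```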
Different Number of Bins
[Figure: estimated conditional probability with 5, 20, and 100 bins]
Results on accuracy (%)

            Splice  Mush   w5a    w8a    Adult  kddcup
            1K      8K     10K    50K    30K    1.26M
Linear SVM  75      100    98.15  98.57  60.03  99.99
LR          77      99.87  97.67  98.24  84.80  99.99
RBF SVM     80      99.23  97.14  97.20  75.29  N/A
DLR-b       88      99.95  98.26  98.55  85.54  99.99
Results on efficiency (training time)

            Splice  Mush   w5a    w8a   Adult  kddcup
            1K      8K     10K    50K   30K    1.26M
Linear SVM  0.12    0.56   1.16   15    2847   81.70
LR          0.15    0.21   0.18   0.7   2.89   55.66
RBF SVM     0.09    1.63   1.60   29    217    N/A
DLR-b       0.22    0.32   2.65   7.6   0.6    17.93
Feature Selection Ability
DLR supports embedded feature selection:
• Standard ℓ1-regularization, loss(w) + c ∑_d |w_d|, requires non-smooth optimization.
• However, in DLR we can simply use c ∑_d w_d along with the constraints w_d ≥ 0, which is a smooth optimization problem (see the sketch below).
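A minimal sketch of the resulting smooth problem solved by projected gradient descent; grad_loss is a placeholder for the gradient of the LR loss, and the step size and regularization constant are assumptions.

```python
import numpy as np

def select_features(grad_loss, D, c=0.1, lr=0.01, n_iter=500):
    """Minimize loss(w) + c * sum(w_d) subject to w_d >= 0."""
    w = np.zeros(D)
    for _ in range(n_iter):
        w = w - lr * (grad_loss(w) + c)   # gradient of the smooth objective
        w = np.maximum(w, 0.0)            # project back onto w_d >= 0
    return w                              # exact zeros mark dropped features

# Toy usage with a quadratic loss 0.5 * ||w - w_star||^2, whose gradient is w - w_star:
w_star = np.array([0.8, -0.3, 0.5])
print(select_features(lambda w: w - w_star, D=3))  # second weight stays exactly zero
```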
Top features selected by DLR
standard deviation of heart rate
ApEn of heart rate
Energy of oxygen saturation
LF of oxygen saturation
LF of heart rate
DFA of oxygen saturation
Mean of heart rate
HF of heart rate
Inertia of heart rate
Homogeneity of heart rate
Energy of heart rate
linear correlation of heart rate and oxygen saturation
Conclusions on DLR
DLR satisfies all of the following:
• Nonlinear classification ability
• Support for mixed data types
• Interpretability
• Efficiency
• Multi-class classification
Try it out! http://www.cse.wustl.edu/~wenlinchen/project/DLR/
• Hybrid!
  - Non-parametric + parametric
  - Association + causality
  - Generative + discriminative
  - Balance accuracy and speed
• For real big data, get rid of heavy machinery
  - Let accuracy grow with data size
• A linear model would suffice, given enough nonlinearity/randomness
Big Data Algorithms
Thank you
Challenges of the big data era:
McKinsey Global Institute report: big data talent is scarce
Talent
                                     kNN  NB  NN  LR  Linear SVM  Kernel SVM  RKS
Nonlinear classification ability      Y   N   Y   N       N           Y        Y
Interpretability                      N   Y   N   Y       Y           N        N
Direct support for mixed data types   Y   Y   N   N       N           N        N
Efficiency                            Y   Y   Y   Y       Y           N        Y
Multi-class classification            Y   Y   Y   Y       N           N        N
RKS: linear model over nonlinear features.
RBF SVM: k(x, x') = exp( -‖x - x'‖² / (2σ²) ).
Gaussian Naive Bayes
Assumption: features are conditionally independent given the label, P(x | y) = ∏_d P(x_d | y).
Gaussian: P(x_d | y) = N(x_d; μ_{d,y}, σ_d²).

LR and GNB
Both GNB and LR express P(y=1 | x) as a linear model.
GNB learns its parameters under the GNB assumption; LR learns its weights using maximum likelihood of the data.
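The equivalence the slide alludes to can be made explicit; assuming per-feature variances σ_d² shared across the two classes (a standard textbook setup, not stated on the slide), the GNB posterior is exactly logistic:

$$\ln\frac{P(y=1\mid x)}{P(y=0\mid x)}
= \ln\frac{P(y=1)}{P(y=0)}
+ \sum_{d=1}^{D}\left[\frac{\mu_{d,1}-\mu_{d,0}}{\sigma_d^{2}}\,x_d
+ \frac{\mu_{d,0}^{2}-\mu_{d,1}^{2}}{2\sigma_d^{2}}\right],$$

so $P(y=1\mid x) = \sigma(w^{\top}x + b)$ with $w_d = (\mu_{d,1}-\mu_{d,0})/\sigma_d^{2}$, a linear model in x.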
Motivation: NB and LR
NB assumption: P(x | y) = ∏_d P(x_d | y), i.e. factorizing the class-conditional likelihood by Naïve Bayes.
GNB assumption: each P(x_d | y) is Gaussian.
Under the NB assumption the posterior odds also factorize by feature, which is the form that DLR generalizes with learned weights w_d.