Deep learning based multi-omics integration, a survey

Post on 14-Apr-2017

Deep Learning based Multi-omics integration

A survey

Deep Learning in Bioinformatics

Min, Seonwoo, Byunghan Lee, and Sungroh Yoon. "Deep learning in bioinformatics." Briefings in Bioinformatics (2016)

Outline
• Summarize three related works on deep learning based feature extraction / survival prediction on omics data
• Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders
• A deep learning approach for cancer detection and relevant gene identification
• Deep Learning based multi-omics integration robustly predicts survival in liver cancer

Unsupervised feature construction and knowledge extraction from genome-wide assays of breast cancer with denoising autoencoders
Pacific Symposium on Biocomputing, 2015

Denoising Auto-Encoder (DAE)
• Build features that reconstruct initial input data from corrupted data
• Generate robust features
• Unsupervised learning
• Extract features in the non-linear space
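The DAE idea above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used in the paper: the tied weights, masking noise, sigmoid units, cross-entropy loss, and every hyperparameter below are assumptions, and the names `train_dae` and `encode` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(X, n_hidden=8, corruption=0.1, lr=0.5, epochs=200, seed=0):
    """Train a one-layer denoising autoencoder with tied weights.

    X is an (n_samples, n_genes) array scaled to [0, 1].
    Returns (W, b_hidden, b_visible, losses).
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    W = rng.normal(0.0, 0.1, size=(n_genes, n_hidden))
    b_h = np.zeros(n_hidden)
    b_v = np.zeros(n_genes)
    losses = []
    eps = 1e-9
    for _ in range(epochs):
        # Masking noise: zero out a random fraction of the inputs.
        keep = rng.random(X.shape) >= corruption
        X_noisy = X * keep
        H = sigmoid(X_noisy @ W + b_h)   # encode the corrupted input
        R = sigmoid(H @ W.T + b_v)       # decode with the same (tied) weights
        # Cross-entropy is measured against the clean input, not the noisy one.
        loss = -np.mean(np.sum(X * np.log(R + eps)
                               + (1 - X) * np.log(1 - R + eps), axis=1))
        losses.append(float(loss))
        # Backpropagation for sigmoid outputs with cross-entropy loss.
        d_out = (R - X) / n_samples
        d_hid = (d_out @ W) * H * (1.0 - H)
        grad_W = X_noisy.T @ d_hid + d_out.T @ H   # encoder + decoder terms
        W -= lr * grad_W
        b_h -= lr * d_hid.sum(axis=0)
        b_v -= lr * d_out.sum(axis=0)
    return W, b_h, b_v, losses

def encode(X, W, b_h):
    """Hidden-node activities (the constructed features) for clean input X."""
    return sigmoid(X @ W + b_h)
```

Each column of `W` holds the per-gene weights of one hidden node, and `encode` gives the per-sample node activities that the later slides relate to clinical characteristics.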

Data
• Two largest breast cancer datasets
• Train DAEs and identify predictive features with METABRIC dataset
  • 2137 samples, 3000 → 2520 genes
  • Gene expression data from European Genome-phenome Archive
• Evaluate with TCGA dataset independently
  • 547 samples, 2520 genes

Features to clinical characteristics
• Genes are not linked to their neighbors
• Genes are linked by transcription factors, pathway memberships
• Are constructed features linked to clinical and molecular features of the samples?
• Categorize tumor / normal samples
• Categorize ER+/- samples
• Categorize samples into molecular subtypes (Luminal A/B, Basal-like, HER2-enriched, Normal-like)

Features to clinical characteristics
• Classifying tumor from normal samples
• Classifying ER+ from ER- samples

Robust performance across datasets
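One way to check whether a single constructed feature separates two sample groups (tumor vs. normal, ER+ vs. ER-) is to score each hidden node by the AUC of its activity. The sketch below assumes a matrix `H` of node activities and binary labels. AUC is used here only as an illustrative separation measure, since the slides do not state the criterion the paper used, and `node_auc` and `best_node` are hypothetical names.

```python
import numpy as np

def node_auc(activity, labels):
    """AUC of one hidden node's activity as a score for the positive class."""
    activity = np.asarray(activity, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    pos, neg = activity[labels], activity[~labels]
    # Fraction of (positive, negative) pairs ranked correctly; ties count 0.5.
    diff = pos[:, None] - neg[None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def best_node(H, labels):
    """Index and AUC of the node that separates the classes most strongly.

    A node with AUC near 0 separates as well as one near 1 (its activity is
    simply lower in the positive class), so strength is |AUC - 0.5|.
    """
    aucs = np.array([node_auc(H[:, j], labels) for j in range(H.shape[1])])
    j = int(np.argmax(np.abs(aucs - 0.5)))
    return j, float(aucs[j])
```

Selecting the node on one dataset and computing its AUC on the other, without retraining, is one way to probe the cross-dataset robustness mentioned above.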

Features to transcription factor
• Breast cancer related transcription factors are linked to these high-weight features (Node58)
• It contained genes that reflect activity of key ER-associated TFs

Most genes gave zero or low weight to a hidden node

High positive weight / High negative weight
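Because most genes contribute almost nothing to a given node, the high-weight genes can be picked out from the tails of that node's weight distribution. The sketch below assumes a cutoff of two standard deviations from the mean weight; the cutoff and the name `high_weight_genes` are illustrative assumptions.

```python
import numpy as np

def high_weight_genes(w, gene_names, n_sd=2.0):
    """Split genes into high-positive and high-negative sets for one node.

    w holds the node's weight for each gene, in the same order as gene_names.
    """
    w = np.asarray(w, dtype=float)
    mu, sd = w.mean(), w.std()
    positive = [g for g, x in zip(gene_names, w) if x > mu + n_sd * sd]
    negative = [g for g, x in zip(gene_names, w) if x < mu - n_sd * sd]
    return positive, negative
```

The two returned gene lists are the kind of input that transcription factor or pathway enrichment analysis would take.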

Features to patient survival
• Node whose activities best separated two high / low survival groups (Node5)
• Highly predictive of patient survival
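Whether a node separates high and low survival groups can be tested by splitting samples on the node's activity and comparing the two groups with a log-rank test. The sketch below assumes a median split, which the slides do not specify, and implements the standard two-group log-rank statistic; the function names are hypothetical.

```python
import math
import numpy as np

def split_by_median(activity):
    """True for samples whose node activity is above the median."""
    activity = np.asarray(activity, dtype=float)
    return activity > np.median(activity)

def logrank_test(time, event, group):
    """Two-group log-rank test.

    time: follow-up times; event: 1 if death observed, 0 if censored;
    group: boolean group membership. Returns (chi-square statistic, p-value).
    """
    time = np.asarray(time, dtype=float)
    event = np.asarray(event, dtype=bool)
    group = np.asarray(group, dtype=bool)
    obs_minus_exp, var = 0.0, 0.0
    for t in np.unique(time[event]):
        at_risk = time >= t
        n = at_risk.sum()                 # at risk in both groups
        n1 = (at_risk & group).sum()      # at risk in the tested group
        died = (time == t) & event
        d = died.sum()
        d1 = (died & group).sum()
        obs_minus_exp += d1 - d * n1 / n
        if n > 1:
            var += d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
    chi2 = obs_minus_exp ** 2 / var
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return float(chi2), p_value
```

Running this for every node and keeping the one with the smallest p-value is one way to find a node like Node5. The sketch does not handle the degenerate case where the variance is zero, for example when one group is empty.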

Features to Biological pathways
• Pathways significantly associated with genes that consistently gave high weights to a node

PID pathways enriched in Node5 (5th feature)

Summary
• Unsupervised feature construction based on DAEs and interpretation
• Apply to breast cancer gene expression data
• Consistent results across different datasets
• In the future..
  • Multiple layers of stacked DAEs
  • Consistency across datasets will be useful for data integration
  • Limitations for large-scale data integration

A deep learning approach for cancer detection and relevant gene identification
Pacific Symposium on Biocomputing, 2016

Overview

[Figure: workflow diagram. Labels: RNA-seq samples (TCGA; healthy, cancer); 1210 breast cancer samples; train / test; SDAE features; model validation; weights; DCGs; supervised classification (cancer detection); highly interactive genes identification]

Stacked Denoising Auto-Encoder
• Extract functional features from high dimensional, noisy gene expression profiles with reduced loss of information
• Select a layer that has both low dimension and low validation error

Classification result
• Classify cancer samples from healthy control samples
• Feature extraction
  • SDAE
  • Differentially expressed genes (DIFFEXP)
  • PCA
  • KPCA (RBF kernel)
• Classification model
  • SVM
  • SVM (RBF kernel)
  • Single-layer ANN

Deeply connected genes
• Genes with the largest weights in W (the product of the weight matrices for each layer) are the most strongly connected to the extracted and highly predictive features

But lower performance than SDAE features
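The weight product W can be computed directly from the encoder matrices. The sketch below assumes the SDAE weights are available as a list of NumPy arrays ordered from the input layer onward. Ranking genes by their largest absolute entry in W is an illustrative assumption, since the slide only says "largest weights", and `deeply_connected_genes` is a hypothetical name.

```python
import numpy as np

def deeply_connected_genes(weight_matrices, gene_names, top_k=10):
    """Rank genes by their strongest connection to the final SDAE features.

    weight_matrices[0] has shape (n_genes, n_hidden_1), and each later matrix
    maps one hidden layer to the next.
    """
    W = np.asarray(weight_matrices[0], dtype=float)
    for W_layer in weight_matrices[1:]:
        W = W @ np.asarray(W_layer, dtype=float)  # ends as (n_genes, n_features)
    score = np.abs(W).max(axis=1)                 # strongest link per gene
    order = np.argsort(score)[::-1][:top_k]
    return [gene_names[i] for i in order]
```

Note that the plain matrix product ignores the nonlinear activations between layers, so it is only a linear summary of how each gene feeds into the final features.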


Summary
• SDAE to transform high-dimensional, noisy gene expression data to a lower dimensional, meaningful representation
• Classify breast cancer samples from the healthy control samples using new compact features
• Identify a set of highly interactive genes critical for the diagnosis of breast cancer
• In the future..
  • Need to improve the extraction of DCGs
  • Limitation on the requirement for large data sets
  • Identify cross-cancer biomarkers through the analysis of aggregated heterogeneous cancer data

Deep Learning based multi-omics integration robustly predicts survival in liver cancer
preprint, 2017

[Figure: workflow diagram. Labels: 360 tumor samples; 15629 genes, 365 miRNAs, 19883 genes; 100 features; 37 features; high/poor survival]

Why Autoencoders?
• Produce features linked to clinical outcomes
• Analyze high-dimensional gene expression data
• Integrate heterogeneous data
• Interpret the biological functions (aggregate genes sharing similar pathways)
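Integrating heterogeneous data starts with putting the per-sample omics blocks on a common scale and stacking them into one input matrix. A feature matrix can then be split into two candidate survival subgroups. The sketch below assumes per-feature z-scoring, simple concatenation, and a plain 2-means split. None of these details are given in the slides, and `stack_omics` and `two_means` are hypothetical names.

```python
import numpy as np

def stack_omics(blocks):
    """Z-score each omics block per feature, then concatenate along features.

    Each block is an (n_samples, n_features) array with samples in the same order.
    """
    scaled = []
    for X in blocks:
        X = np.asarray(X, dtype=float)
        sd = X.std(axis=0)
        sd[sd == 0] = 1.0                 # leave constant features at zero
        scaled.append((X - X.mean(axis=0)) / sd)
    return np.hstack(scaled)

def two_means(F, n_iter=50):
    """Split samples into two clusters with Lloyd's algorithm (k = 2).

    Initialization is deterministic: the sample farthest from the overall
    mean, and the sample farthest from that one.
    """
    F = np.asarray(F, dtype=float)
    c0 = F[np.argmax(np.linalg.norm(F - F.mean(axis=0), axis=1))]
    c1 = F[np.argmax(np.linalg.norm(F - c0, axis=1))]
    centers = np.stack([c0, c1])
    for _ in range(n_iter):
        dist = np.linalg.norm(F[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Assumes both clusters stay non-empty, which holds for separated data.
        new_centers = np.stack([F[labels == k].mean(axis=0) for k in (0, 1)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels
```

In a pipeline of this kind the clustering would run on autoencoder-derived features rather than on the raw stacked inputs; the stacked matrix from `stack_omics` is what the autoencoder would be trained on.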

Classification result

PCA

Classification result

Single-omics based DL models

Validation in five cohorts
• Robustness of the model at predicting survival outcomes

Adding clinical information
• Age, Stage, Grade, Race, Risk factors (HBV, HCV, Alcohol, …)
• DL-based multi-omics model performs sufficiently well even without clinical features

Functional analysis of the survival-subgroups
• KEGG pathway analysis to pinpoint the pathways enriched in two subtypes
• Two subtypes have different and disjoint active pathways

Enriched pathway-gene analysis for upregulated genes
• S1 aggressive tumor sub-group
• Enriched with cancer related pathways

Enriched pathway-gene analysis for upregulated genes
• S2 less aggressive tumor sub-group
• Activated metabolism related pathways

Summary
• Contributions
  • Identified two subtypes at the molecular level
  • Consistent performance implying the reliability and robustness of the model
  • Sufficient performance without adding clinical features
  • AE is much more efficient at inferring features linked to survival
  • Validated in five additional cohorts
• Challenges
  • The absence of cluster label information in original reports
  • Lack of survival data in some cases

Conclusion
• Feature extraction with SDAE
  • Robust to noisy datasets
  • Extract meaningful features and reflect both linear and non-linear relationships
  • Consistent performance, good for multi-omics integration
• Multi-omics integration
  • More sophisticated strategy to combine multiple features
  • May incorporate pathways, handle overlapping genes

Thank you!
Q & A