Tensor decomposition-based unsupervised feature extraction identifies candidate
genes that induce post-traumatic stress disorder-mediated heart diseases
Y-h. Taguchi, Dept. Phys., Chuo Univ., Tokyo, Japan
Best Paper Awards of BMC Track of
InCoB 2017 (BMC Med. Geno., Silver)
1. Introduction
PTSD (post-traumatic stress disorder) is primarily a mental disease that can mediate physical diseases, e.g., diabetes (Roberts et al., JAMA Psychiatry, 2015).
In this study, we investigate how PTSD mediates heart failure despite the distance between the heart and the brain, where PTSD primarily takes place.
We hypothesize that similar gene expression in heart and brain might underlie PTSD-mediated heart diseases.
2. Methods
We apply tensor decomposition (TD) based unsupervised feature extraction (FE) to a gene expression tensor (treatments × tissues × genes) to identify genes that are coexpressed between brain and heart but differentially expressed between control and treated samples.
[Figure: gene expression tensor with axes genes × tissues × treatments, highlighting genes expressed selectively at specific combinations of tissues and conditions.]
3. Synthetic data
10 treatments × 10 tissues = 100 classes, 1 sample per class
1st 100 genes: expressed in 4 tissues at the 1st treatment
2nd 100 genes: expressed in another 4 tissues at the 2nd treatment
...
10th 100 genes: expressed in another 4 tissues at the 10th treatment
1,000 expressive genes + 29,000 noise genes = 30,000 genes
[Figure: the ten sets of 100 expressive genes (1st to 10th gene set), 1,000 genes in total.]
Task: identify the 1,000 expressive genes among 30,000 genes (1,000 expressive genes and 29,000 noise genes).
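The synthetic setup can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the slides do not give the noise model, the signal amplitude, or which 4 tissues each gene set uses, so unit-variance Gaussian noise, an additive shift of 5.0, and randomly drawn tissues are all hypothetical choices, as is the function name `make_synthetic_tensor`.

```python
import numpy as np

def make_synthetic_tensor(n_sets=10, genes_per_set=100, n_noise=29_000,
                          n_tissues=10, n_treatments=10,
                          tissues_per_set=4, amplitude=5.0, seed=0):
    """Build a genes x tissues x treatments tensor with planted gene sets."""
    rng = np.random.default_rng(seed)
    n_genes = n_sets * genes_per_set + n_noise
    # Assumption: every gene starts as unit-variance Gaussian noise.
    x = rng.normal(size=(n_genes, n_tissues, n_treatments))
    labels = np.zeros(n_genes, dtype=int)  # 0 = noise, k = k-th gene set
    for k in range(n_sets):
        genes = slice(k * genes_per_set, (k + 1) * genes_per_set)
        # Assumption: the 4 tissues of each set are drawn at random.
        tissues = rng.choice(n_tissues, size=tissues_per_set, replace=False)
        # The k-th gene set is expressed only at the k-th treatment.
        x[genes, tissues, k] += amplitude
        labels[genes] = k + 1
    return x, labels

x, labels = make_synthetic_tensor()
print(x.shape, int((labels > 0).sum()))
```

With the default arguments the tensor has the 30,000 × 10 × 10 shape used on the slides, and `labels` records which of the 1,000 planted genes belongs to which set.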
Tensor representation: $x_{i_1 i_2 i_3}$, with $1 \le i_1 \le 30{,}000$ genes, $1 \le i_2 \le 10$ tissues, and $1 \le i_3 \le 10$ treatments.
HOSVD (Higher Order Singular Value Decomposition)
$$x_{i_1 i_2 i_3} = \sum_{l_1, l_2, l_3} G(l_1, l_2, l_3)\, x_{l_1 i_1}\, x_{l_2 i_2}\, x_{l_3 i_3}$$
$1 \le l_1 \le 30{,}000$, $1 \le l_2 \le 10$, $1 \le l_3 \le 10$.
$G(l_1, l_2, l_3)$: core tensor
$x_{l_1 i_1}$, $x_{l_2 i_2}$, $x_{l_3 i_3}$: singular value vectors (orthogonal matrices)
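A minimal NumPy sketch of HOSVD, computed from the SVD of each mode unfolding, may make the decomposition concrete. The function names `hosvd` and `reconstruct` are illustrative and not taken from the slides. A thin SVD is used, so a mode longer than the product of the other modes yields fewer singular value vectors than its length.

```python
import numpy as np

def hosvd(x):
    """Return (core, factors) such that x equals core multiplied by each factor."""
    factors = []
    for mode in range(x.ndim):
        # Unfold: bring `mode` to the front and flatten the remaining axes.
        unfolded = np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)
        # Thin SVD; columns of u are the singular value vectors of this mode.
        u, _, _ = np.linalg.svd(unfolded, full_matrices=False)
        factors.append(u)
    # Core tensor: project every mode onto its singular value vectors.
    core = x
    for mode, u in enumerate(factors):
        core = np.moveaxis(np.tensordot(u.T, core, axes=(1, mode)), 0, mode)
    return core, factors

def reconstruct(core, factors):
    """Multiply the core tensor by each factor matrix along its mode."""
    x = core
    for mode, u in enumerate(factors):
        x = np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)
    return x

rng = np.random.default_rng(0)
toy = rng.normal(size=(30, 4, 5))  # toy stand-in for genes x tissues x treatments
core, factors = hosvd(toy)
print(core.shape, np.allclose(reconstruct(core, factors), toy))
```

In this layout `factors[0][:, l1 - 1]` plays the role of $x_{l_1 i_1}$ over all genes $i_1$, and `core[l1 - 1, l2 - 1, l3 - 1]` plays the role of $G(l_1, l_2, l_3)$.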
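[Figure: schematic of HOSVD, with the tensor $x_{i_1 i_2 i_3}$ written as the core tensor $G$ multiplied by the matrices $x_{i_1 l_1}$, $x_{i_2 l_2}$, and $x_{i_3 l_3}$.]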
[Figure: scatter plots of $x_{l_1 i_1}$, $2 \le l_1 \le 5$. The 1,000 genes are well separated (10 classes, drawn with 5 colors and 5 symbols); the 29,000 noise genes are omitted.]
Selection of 1,000 genes
We assume that $x_{l_1 i_1}$, $2 \le l_1 \le 5$, obey a multivariate Gaussian distribution. In other words, genes that do not obey the Gaussian are taken to be the expressive genes.
$P_{i_1}$ is attributed to each gene $1 \le i_1 \le 30{,}000$ using the $\chi^2$ distribution:
$$P_{i_1} = P_{\chi^2}\left[ > \sum_{2 \le l_1 \le 5} \left( \frac{x_{l_1 i_1}}{\sigma_{l_1}} \right)^2 \right]$$
$P_{\chi^2}[> x]$: cumulative probability that the argument is larger than $x$ under the $\chi^2$ distribution.
$\sigma_{l_1}$: standard deviation.
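The P-value formula can be sketched as below. This rests on two assumptions not stated on the slides: $\sigma_{l_1}$ is taken as the standard deviation of each singular value vector across genes, and the $\chi^2$ degrees of freedom equal the number of vectors summed. To stay dependency-light, the upper tail uses the closed form that holds for even degrees of freedom, $e^{-x/2} \sum_{j<k} (x/2)^j / j!$ for $2k$ degrees of freedom; the function names are illustrative.

```python
import numpy as np

def chi2_sf_even(x, df):
    """Upper-tail probability of a chi-squared variable with even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form used here requires positive even df")
    half = np.asarray(x, dtype=float) / 2.0
    term = np.ones_like(half)
    total = np.ones_like(half)
    for j in range(1, df // 2):
        term = term * half / j  # (x/2)^j / j!
        total = total + term
    return np.exp(-half) * total

def gene_p_values(u, components):
    """u: genes x vectors matrix; components: zero-based column indices."""
    selected = u[:, components]
    # Assumption: sigma is the standard deviation of each vector over genes.
    sigma = selected.std(axis=0)
    stat = ((selected / sigma) ** 2).sum(axis=1)
    return chi2_sf_even(stat, df=len(components))

rng = np.random.default_rng(0)
u = rng.normal(size=(1000, 5))  # toy stand-in for the gene singular value vectors
u[:10, 1:5] += 8.0              # ten planted non-Gaussian genes
p = gene_p_values(u, [1, 2, 3, 4])  # l1 = 2..5 in one-based notation
print(p[:10].max() < p[10:].min())
```

For an odd number of vectors, such as three, a library routine for the $\chi^2$ survival function would be needed instead of this closed form.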
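$P_{i_1}$ are corrected by multiple comparison correction (Benjamini-Hochberg FDR).
[Figure: ROC curve (true positive rate vs. false positive rate) with its AUC; points mark adjusted $P_{i_1} < 0.1$, $< 0.05$, and $< 0.01$.]
Selection threshold: Benjamini-Hochberg FDR, adjusted $P_{i_1} < 0.1$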
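For reference, the Benjamini-Hochberg adjustment can be written in a few lines of plain Python. This is a generic sketch of the standard step-up procedure, not the code used for the slides; `bh_adjust` is an illustrative name.

```python
def bh_adjust(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]
adjusted = bh_adjust(p)
selected = [i for i, q in enumerate(adjusted) if q < 0.1]
print(adjusted, selected)
```

Genes whose adjusted P-value falls below the chosen threshold are the ones kept.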
Comparison between the true classes and two clustering results for the selected genes
[Figure: true classes vs. clustering by Gaussian mixture and by Ward's method (hierarchical clustering).]
Conclusions of synthetic data
1. Singular value vectors given by HOSVD cluster the 10 classes well.
2. Singular value vectors given by HOSVD discriminate the 1,000 expressive genes from the 29,000 noise genes.
3. Unsupervised clustering based on the singular value vectors given by HOSVD coincides with the 10 classes.
4. None of ANOVA, SAM, or limma achieved comparable performance in discriminating the 1,000 genes from the other 29,000 genes (results omitted).
4. Real Data sets
PTSD model mouse (numbers of samples: controls, treated)
AY: amygdala, HC: hippocampus, MPFC: medial prefrontal cortex, SE: septal nucleus, ST: striatum, VS: ventral striatum.
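Tensor representation: $x_{j_1 j_2 j_3 j_4 i}$, with $1 \le j_1 \le 2$ (control vs. treated), $1 \le j_2 \le 10$ tissues, $1 \le j_3 \le 2$ stress periods (5 vs. 10 days), $1 \le j_4 \le 3$ rest periods, and $1 \le i \le 43{,}379$ genes.
HOSVD:
$$x_{j_1 j_2 j_3 j_4 i} = \sum_{l_1, l_2, l_3, l_4, l_5} G(l_1, l_2, l_3, l_4, l_5)\, x_{l_1 j_1}\, x_{l_2 j_2}\, x_{l_3 j_3}\, x_{l_4 j_4}\, x_{l_5 i}$$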
[Figure: $x_{l_1 = 2, j_1}$ (control vs. treated) and $x_{l_2 = 4, j_2}$ (tissues); tissue labels include AY, HC, heart, and hemibrain.]
To identify the $x_{l_3 j_3}$, $x_{l_4 j_4}$, and $x_{l_5 i}$ associated with $l_1 = 2$ and $l_2 = 4$, we rank $G(2, 3, l_3, l_4, l_5)$ by absolute value, since a $G$ with a larger absolute value contributes more.
Stress: $l_3 = 1, 2$. Rest: $l_4 = 1, 2, 3$. Gene: $l_5 = 1, 4, 11$.
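Ranking core tensor entries by absolute value, with the two leading indices held fixed, can be sketched as follows. The helper name `rank_core_entries`, the toy core tensor, and the fixed indices used in the example are all hypothetical; the real core tensor would come from HOSVD of the expression data.

```python
import numpy as np

def rank_core_entries(core, fixed, top=5):
    """Rank entries of core[fixed] by absolute value, largest first.

    `fixed` holds zero-based leading indices; the returned index tuples
    are one-based to match the l3, l4, l5 notation.
    """
    sub = core[fixed]
    order = np.argsort(np.abs(sub), axis=None)[::-1][:top]
    ranked = []
    for flat in order:
        idx = np.unravel_index(flat, sub.shape)
        ranked.append((tuple(int(i) + 1 for i in idx), float(sub[idx])))
    return ranked

rng = np.random.default_rng(0)
toy_core = rng.normal(size=(2, 10, 2, 3, 20))  # toy stand-in for G
for (l3, l4, l5), value in rank_core_entries(toy_core, fixed=(1, 3), top=3):
    print(l3, l4, l5, round(value, 3))
```

The indices that recur among the top-ranked entries indicate which stress, rest, and gene singular value vectors contribute most for the chosen leading pair.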
Gene selection
$$P_i = P_{\chi^2}\left[ > \sum_{l_5 \in \{1, 4, 11\}} \left( \frac{x_{l_5 i}}{\sigma_{l_5}} \right)^2 \right]$$
BH correction, adjusted $P_i < 0.01$ → 801 genes
Differential expression between control and treated samples was checked in the raw data: successful.
Conclusions of real data
1. TD-based unsupervised FE applied to real gene expression data identifies 801 genes.
2. The 801 genes are expressed commonly between heart and brain, but differently between controls and treated samples.
3. ANOVA, SAM, and limma failed to identify a reasonable number of genes (results omitted).
4. Biological validation of the genes is promising (details omitted).