Valerie de Anda at #ICG12: A new multi-genomic approach for the study of biogeochemical cycles at...
Valerie De Anda
Ecology Institute, UNAM, México
Laboratory of Computational Biology, Zaragoza, CSIC, Spain
[email protected]
github.com/valdeanda
@val_deanda
The 12th International Conference on Genomics, October 26 to 29, 2017, Shenzhen, China
Revolution in the microbial ecology field
» Genomic reconstruction: microbial dark matter
» Large amount of data
» Diversity, ecology, evolution and functional makeup of the microbial world
The iceberg illusion of metagenomics
» The ability to evaluate complex metabolic functions in large data sets remains biologically and computationally challenging
MOTIVATION GENERAL IDEA RESULTS CONCLUSIONS PERSPECTIVES THANKS
The iceberg illusion of metagenomics
Microbial ecology-derived ‘omic’ studies
Metagenomic data
» Most abundant
» Marker genes
» Statistically ≠ features
» Really complex to infer and test biological hypotheses in such data
What do we need to improve the efficiency of data processing?
» Biological data interpretation (evaluate, compare and analyze complex data at a large scale)
» Computational efficiency (high performance, accuracy, high speed, data processing, reproducibility)
Gomez Cabrero et al. 2014 BMC SB; Reshetova et al. 2013 BMC SB
Data integration
For a given system, multiple sources (and possibly types) of data are available, and we want to study them integratively to improve knowledge discovery.
» What are the available data that can be used to characterize large-scale metabolic machineries?
» How do we integrate them all to improve our understanding of the system?
Prior knowledge (targeted): to reduce the solution space and/or to focus the analysis on biologically meaningful regions (specific metabolic machineries)
Metabolism
» Taxa involved in that particular metabolism
» Proteins involved in that particular metabolism
» Publicly available genomes?
Mathematical model: relative entropy
$$H' = \sum_i P(i) \log_2 \frac{P(i)}{Q(i)}$$
Informative score (MEBS): H' ≥ 1, informative; H' ≤ 0, non-informative
Gomez Cabrero et al. 2014 BMC SB; Reshetova et al. 2013 BMC SB
Large scale dataset
Does it really work?
» Can it capture an entire metabolic machinery?
» Can we use it to evaluate, compare and analyze complex data at large scale (genomes, metagenomes)?
» Is it computationally efficient? Accurate, fast on large datasets, and reproducible
Data integration: single value
Data integration: case study
An entire biogeochemical cycle: the S-cycle (CHONS-P)
[Figure: atmosphere, solar E°, redox reactions, metabolic guilds, geological processes]
They capture the major processes involved in the mobilization and use of S-compounds through Earth's biosphere
Data integration: case study, S-cycle
https://metacyc.org/META/NEW-IMAGE?object=Sulfur-Metabolism
http://www.genome.jp/kegg-bin/show_pathway?map00920
Manually curated reconstruction of the S-metabolic machinery
Data integration: case study, S-cycle
Taxa: metabolic guilds (Suli, N=161)
i) CLSB: 24 genera
ii) PSB: 25 genera
iii) GSB: 9 genera
iv) SRB: 40 genera
v) SRM: 19 genera
vi) SO: 4 genera
Evidence suggesting their physiological and biochemical involvement in the use of sulfur compounds.
Complete nr sequenced S-genomes (txt):
GCF_000006985.1 Chlorobium tepidum TLS
GCF_000007005.1 Sulfolobus solfataricus P2
GCF_000007305.1 Pyrococcus furiosus DSM 3638
GCF_000008545.1 Thermotoga maritima MSB8
GCF_000008625.1 Aquifex aeolicus VF5
GCF_000008665.1 Archaeoglobus fulgidus DSM 4304
GCF_000009965.1 Thermococcus kodakarensis KOD1
Metabolic machinery (Sucy, N=152)
i) Sulfur compounds
ii) Metabolic pathways
iii) Genes
iv) Proteins
Evidence linking them with the S-cycle (curated DB and primarily literature).
>Protein1
MIKPVGSDELKPLFVYDPEEHHKLSHEAESLPSVVISSQGPRVSSM
MGAGYFSPAGFMNV
>Protein 2
MAYKTIIEDGIDVLVVGAGLGGTGAAFEARYWGQDKKIVIAEKANID
>Protein 3
MPTFVYMTRCDGCGQCVDICPSDIMHIDTTIRRAYNIEPNMCWEC
YSCVKACPHNAIDVR
Data integration: case study, S-cycle
Table 1. Metabolic pathways of global biogeochemical S-cycle
Pathway number | Metabolism^a | Chemical process^b | Sulfur compound | Type^c | Chemical formula | Source^d | Number of Pfam domains^e
P1 | DS | O | Sulfite | I | SO3^2- | E | 9
P2 | DS | O | Thiosulfate | I | S2O3^2- | E | 10
P3 | DS | O | Tetrathionate | I | S4O6^2- | E | 2
P4 | DS | R | Tetrathionate | I | S4O6^2- | E | 17
P5 | DS | R | Sulfate | I | SO4^2- | E | 20
P6 | DS | R | Elemental sulfur | I | S° | E | 20
P7 | DS | D | Thiosulfate | I | S2O3^2- | E | 9
P8 | DS | O | Carbon disulfide | O | CS2 | E | 1
P9 | A | DE | Alkanesulfonate | O | CH3O3SR | S | 5
P10 | A | R | Sulfate | I | SO4^2- | S | 20
P11 | DS | O | Sulfide | I | H2S | E/S | 29
P12 | A | DE | L-cysteate | O | C3H6NO5S | C/E | 1
P13 | A | DE | Dimethyl sulfone | O | C2H6O2S | C/E | 3
P14 | A | DE | Sulfoacetate | O | C2H2O5S | C/E | 2
P15 | A | DE | Sulfolactate | O | C3H4O6S | C/S | 14
P16 | A | DE | Dimethyl sulfide | O | C2H6S | C/S | 16
P17 | A | DE | Dimethylsulfoniopropionate | O | C5H10O2S | C/S/E | 12
P18 | A | DE | Methylthiopropanoate | O | C4H7O2S | C/S | 7
P19 | A | DE | Sulfoacetaldehyde | O | C2H3O4S | C/S | 7
P20 | DS | O | Elemental sulfur | I | S° | C/S/E | 7
P21 | DS | D | Elemental sulfur | I | S° | C/S/E | 1
P22 | A | DE | Methanesulfonate | O | CH3O3S | C/S/E | 7
P23 | A | DE | Taurine | O | C2H7NO3S | C/S/E | 11
P24 | DS | M | Dimethyl sulfide | O | C2H6S | C | 1
P25 | DS | M | Methylthio-propanoate | O | C4H7O2S | C | 1
P26 | DS | M | Methanethiol | O | CH4S | C | 1
P27 | A | DE | Homotaurine | O | C3H9NO3S | N | 1
P28 | A | B | Sulfolipid | O | SQDG | | 4
P29 | Markers | | Markers | | | | 12
Large omic datasets
How many genomes were available at the time of analysis?
» Number of complete prokaryotic genomes: ≈4,000 (NCBI RefSeq), Dec 2016
» Non-redundant: 2,107, Dec 2016
» Gen: 2,107 nr genomes (faa), 1.5 GB
Publicly available and manually curated data
Large omic datasets
Taxa: Suli; Proteins: Sucy
Gen: 2,107 nr genomes (faa), 1.5 GB
Met, GenF: 104 GB, ≈ 500 GB
How many metagenomes were available at the time of analysis? Metagenomes were included if they:
i) were publicly available
ii) contained associated metadata
iii) had been isolated from well-defined environments (e.g., rivers, soil, biofilms)
iv) were not host-associated microbiome sequences (e.g., human, cow, chicken)
MEBS: GENERAL OVERVIEW
Stage 1: Manual curation and omic datasets
» Taxa: Suli (genome list, txt); Proteins: Sucy (FASTA)
» Omic datasets: Gen (2,107 nr genomes, faa), GenF
Stage 2: Domain composition
» 112 HMMs of S-proteins
Stage 3: Relative entropy
» Domains enriched among the microorganisms of interest
$$H' = \sum_i P(i) \log_2 \frac{P(i)}{Q(i)}$$
» P(i) (observed) = frequency of protein domain i in S genomes (161)
» Q(i) (expected) = frequency of protein domain i in Gen (2,107)
» H' ≥ 1: informative; H' ≤ 0: non-informative
Stage 4: Informative score (single value)
» Can it capture the S-metabolic machinery?
» Can we use it to evaluate, compare and analyze complex data at large scale (genomes, metagenomes)?
» Is it computationally efficient? Accurate, fast on large datasets, and reproducible
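The relative-entropy stage of the overview can be sketched in a few lines of Python. This is a minimal illustration, not the MEBS implementation: the function names, the toy domain IDs (`PF_A`, `PF_B`) and all counts are hypothetical. P(i) and Q(i) are taken here to be the fraction of genomes in each set that carry domain i, and the final score is assumed to be the sum of the per-domain terms over the domains detected in a sample, which the slides do not spell out.

```python
import math


def entropy_term(p, q):
    """Return one term P(i) * log2(P(i) / Q(i)) of the relative entropy H'."""
    if p == 0.0:
        return 0.0  # convention: 0 * log2(0) is taken as 0
    if q == 0.0:
        raise ValueError("Q(i) must be > 0 when P(i) > 0")
    return p * math.log2(p / q)


def domain_entropies(obs_counts, n_obs, exp_counts, n_exp):
    """Map each domain to its H' term from occurrence counts (assumed inputs)."""
    return {
        dom: entropy_term(obs_counts[dom] / n_obs, exp_counts[dom] / n_exp)
        for dom in obs_counts
    }


def sample_score(sample_domains, entropies):
    """Sum the H' terms of the domains detected in one sample (assumption)."""
    return sum(entropies[d] for d in sample_domains if d in entropies)


# Toy, made-up counts: genomes carrying each domain in a set of interest
# (160 genomes) and in a background set (2,000 genomes).
toy = domain_entropies({"PF_A": 80, "PF_B": 40}, 160,
                       {"PF_A": 250, "PF_B": 1000}, 2000)
print(toy)  # PF_A is enriched (positive term), PF_B is depleted (negative)
print(sample_score({"PF_A", "PF_B"}, toy))
```

A domain that is more frequent in the genomes of interest than in the background contributes a positive term, while a domain that is rarer there contributes a negative one, so a sample rich in enriched domains ends up with a high single-value score.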
https://github.com/eead-csic-compbio/metagenome_Pfam_score
2,107 genomes (161 Suli) + 935 metagenomes
Distribution of Sulfur Score (SS) in 2,107 nr genomes
Metagenomic reconstructions of hard-to-culture taxa
Sur (N=192)
» Candidatus Desulforudis audaxviator MP104C
» An unnamed endosymbiont of a scaly snail from a black smoker chimney
» The archaeon Geoglobus ahangari, sampled from a hydrothermal vent at 2,000 m depth
ROC curve
• Two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate on the X axis.
• Depicts relative tradeoffs between benefits (true positives) and costs (false positives).
Positive instances: Suli (N=161)
Negative instances: the remaining genomes in Gen (1,946)
[Figure: ROC curve, with the point of perfect classification marked]
Distribution of Sulfur Score (SS) in the metagenomic dataset (935 metagenomes)
Distribution of SS values observed in 935 metagenomes, classified in terms of features (X axis) and colored according to their particular habitats. Features are sorted according to their median SS values. Green lines indicate the lowest and highest 95th percentiles observed across MSL classes.
Geo-localized metagenomes sampled around the globe are colored according to their SS values
[Figure: MEBS and biogeochemical (BG) cycling: S genes, S genomes, informative and non-informative domains, markers, completeness]
Conclusions
» We present MEBS, a new open-source software to evaluate, quantify, compare, and predict the metabolic machinery of interest in large ‘omic’ datasets using a single value.
» To test the applicability of this approach, we evaluated one of the most complex biogeochemical cycles: the sulfur cycle.
» Using data integration and manual curation, we reconstructed the entire sulfur machinery: Suli and Sucy.
» We show that the mathematical framework of relative entropy can be used to capture complex metabolic machineries in large-scale omic samples.
» MEBS is a powerful and broadly applicable approach to predict and classify microorganisms closely involved in the sulfur cycle, even in hard-to-culture microbial lineages.
» Computationally efficient, accurate (AUC = 0.985) and reproducible.
» Not in the presentation: the entropy can be used to detect marker domains, and the completeness of the S-cycle pathways can be benchmarked at large scale.
[Figure: MEBS and biogeochemical (BG) cycling for C, N, O, S, Fe and P; possible applications: bioremediation, antibiotics, extreme environments, agriculture]
Perspectives
• We are currently finishing the analyses to demonstrate the applicability of this approach to other biogeochemical cycles (C, N, O, Fe, P).
• We hope that the MEBS pipeline will facilitate the analysis of biogeochemical cycles or complex metabolic networks carried out by specific prokaryotic guilds, such as bioremediation processes (e.g., degradation of hydrocarbons, toxic aromatic compounds, heavy metals, etc.).
• We look forward to collaborating with and helping other researchers by integrating comprehensive databases that might be helpful to the scientific community.
• We are also working to improve the algorithm so that it requires only a list of sequenced genomes involved in the metabolism of interest, in order to reduce the manual curation effort.
• We are considering using k-mers instead of peptide Hidden Markov Models to increase the speed of the pipeline.
• We anticipate that our platform will stimulate interest and involvement among the scientific community in exploring uncultured genomes derived from large metagenomic sequences: exploring microbial dark matter.
Icoquih Zapata
Valeria Souza
Luis Eguiarte
Bruno Contreras
Cesar Poot-Hernandez
Laboratory of Molecular and Experimental Evolution, Ecology Institute, UNAM, Mexico
Laboratory of Computational Biology
De Anda et al., 2017. MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle. GigaScience, in press.
Thank you for your attention!
supplementary files
[Figure: Completeness, panels A and B; Gen (n=2,107) and Met (n=935). Labelled genomes: D. acidiphilus; Hydrogenobaculum, A. caldus, A. ferrivorans; T. mobilis; D. aromatica, Thauera sp., T. humireducens, A. denitrificans; S. tokodaii, A. hospitalis (among 12 other genomes); P. phaeoclathratiforme, C. chlorochromatii, C. tepidum, T. denitrificans, T. violascens, S. thiotaurini]
Table 1. Metabolic pathways of global biogeochemical S-cycle
Completeness: the number of protein domains present in a given sample divided by the total number of domains in the pre-defined pathway
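The completeness ratio described here can be written as a short Python sketch. The function name and the domain IDs (`PF_A` to `PF_D`, `PF_X`) are hypothetical and only illustrate the arithmetic; this is not taken from the MEBS code.

```python
def pathway_completeness(sample_domains, pathway_domains):
    """Fraction of a pathway's domains that are present in a sample."""
    pathway = set(pathway_domains)
    if not pathway:
        raise ValueError("pathway must list at least one domain")
    # Only domains that belong to the pathway count towards completeness.
    return len(pathway & set(sample_domains)) / len(pathway)


# Hypothetical pathway with four domains; the sample carries two of them
# plus one unrelated domain, which is ignored.
pathway = ["PF_A", "PF_B", "PF_C", "PF_D"]
sample = ["PF_A", "PF_C", "PF_X"]
print(pathway_completeness(sample, pathway))
```

Using sets means repeated hits to the same domain in a sample do not inflate the ratio, which stays between 0 and 1.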
Large scale
35 private metagenomes: microbial mats, sediment and lake water
[Figure: read processing: ORF prediction, gene calling, mean size length (aa residues)]
https://microbiome.wordpress.com/
[Figure: counts of prokaryotic genomes in each NCBI category as of July 2017, non-redundant vs. redundant]
GenF size category | 5th percentile | 95th percentile
Real | -0.091 | 0.101
30 | -0.086 | 0.105
60 | -0.09 | 0.104
100 | -0.088 | 0.1
150 | -0.09 | 0.103
200 | -0.89 | 0.105
250 | -0.09 | 0.106
300 | -0.09 | 0.1
Table 2. Informative Pfam domains with high H' and low std. Novel proposed molecular marker domains in metagenomic data of variable MSL
Pfam ID | Suli occurrences | H' mean | H' std | Description
PF12139 | 58/161 | 1.2 | 0.01 | Adenosine-5'-phosphosulfate reductase beta subunit: Key protein domain for both sulfur oxidation and reduction metabolic pathways. It has been widely studied in the dissimilatory sulfate reduction metabolism. In all recognized sulfate-reducing prokaryotes, the dissimilatory process is mediated by three key enzymes: Sat, Apr and Dsr. Homologous proteins are also present in the anoxygenic photolithotrophic and chemolithotrophic sulfur-oxidizing bacteria (CLSB, PSB, GSB), in different cluster organization [35].
PF00374 | 135/161 | 1.1 | 0.09 | Nickel-dependent hydrogenase: Hydrogenases with S-cluster and selenium-containing Cys-x-x-Cys motifs involved in the binding of nickel. Among the homologues of this hydrogenase domain is the alpha subunit of the sulfhydrogenase I complex of Pyrococcus furiosus, which catalyzes the reduction of polysulfide to hydrogen sulfide with NADPH as the electron donor [55].
PF01747 | 103/161 | 1.03 | 0.06 | ATP-sulfurylase: Key protein domain for both sulfur oxidation and reduction processes. The enzyme catalyzes the transfer of the adenylyl group from ATP to inorganic sulfate, producing adenosine 5′-phosphosulfate (APS) and pyrophosphate, or the reverse reaction [56].
PF02662 | 62/161 | 0.82 | 0.03 | Methyl-viologen-reducing hydrogenase, delta subunit: One of the enzymes involved in methanogenesis and encoded in the mth-flp-mvh-mrt cluster of methane genes in Methanothermobacter thermautotrophicus. No specific functions have been assigned to the delta subunit [48].
PF10418 | 122/161 | 0.78 | 0.06 | Iron-sulfur cluster binding domain of dihydroorotate dehydrogenase B: Among the homologous genes in this family are asrA and asrB from Salmonella enterica subsp. enterica serovar Typhimurium, which encode 1) a dissimilatory sulfite reductase, 2) a gamma subunit of the sulfhydrogenase I complex of Pyrococcus furiosus and 3) a gamma subunit of the sulfhydrogenase II complex of the same organism [12].
PF13247 | 149/161 | 0.66 | 0.06 | 4Fe-4S dicluster domain: Homologues of this family include 1) DsrO, a ferredoxin-like protein related to the electron transfer subunits of respiratory enzymes, 2) dimethylsulfide dehydrogenase β subunit (ddhB), involved in dimethyl sulfide degradation in Rhodovulum sulfidophilum, and 3) sulfur reductase FeS subunit (sreB) of Acidianus ambivalens, involved in sulfur reduction using H2 or organic substrates as electron donors [12].
PF04358 | 73/161 | 0.52 | 0 | DsrC-like protein: DsrC is present in all organisms encoding a dsrAB sulfite reductase (sulfate/sulfite reducers or sulfur oxidizers). Physiological studies suggest that sulfate reduction rates are determined by cellular levels of this protein. The dissimilatory sulfate reduction couples the four-electron reduction of the DsrC trisulfide to energy conservation [57]. DsrC was initially described as a subunit of DsrAB, forming a tight complex; however, it is not a subunit, but rather a protein with which DsrAB interacts. DsrC is involved in sulfur-transfer reactions; there is a disulfide bond between the two DsrC cysteines as a redox-active center in the sulfite reduction pathway. Moreover, DsrC is among the most highly expressed sulfur energy metabolism genes in isolated organisms and metatranscriptomes (Santos et al., 2015).
PF01058 | 158/161 | 0.45 | 0.01 | NADH ubiquinone oxidoreductase, 20 Kd subunit: Homologous genes are found in the delta subunits of both sulfhydrogenase complexes of Pyrococcus furiosus [12].
PF01568 | 156/161 | 0.4 | 0.05 | Molybdopterin dinucleotide binding domain: This domain corresponds to the C-terminal domain IV in dimethyl sulfoxide (DMSO) reductase [48].
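The idea behind Table 2, keeping domains whose H' is high on average and varies little, can be illustrated with a small filter. The sketch below uses five (mean, std) pairs copied from the table; the function name and both cut-offs are hypothetical, chosen only to make the filter visible, and are not the thresholds used to build Table 2.

```python
def select_markers(domains, min_mean, max_std):
    """Return IDs with mean H' >= min_mean and std <= max_std, best first.

    domains maps a Pfam ID to a (mean H', std of H') pair.
    """
    kept = [
        (mean, pfam_id)
        for pfam_id, (mean, std) in domains.items()
        if mean >= min_mean and std <= max_std
    ]
    # Sort by mean H' in descending order.
    return [pfam_id for mean, pfam_id in sorted(kept, reverse=True)]


# (H' mean, H' std) values taken from five rows of Table 2.
table2_subset = {
    "PF12139": (1.2, 0.01),
    "PF00374": (1.1, 0.09),
    "PF01747": (1.03, 0.06),
    "PF04358": (0.52, 0.0),
    "PF01568": (0.4, 0.05),
}

# Hypothetical cut-offs for illustration only.
print(select_markers(table2_subset, min_mean=0.5, max_std=0.05))
```

With these illustrative cut-offs, a domain such as PF00374 is dropped despite its high mean because its H' varies more across the datasets, which is the trade-off the "high H' and low std" criterion expresses.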
https://github.com/eead-csic-compbio/metagenome_Pfam_score
Advanced manual mode
» Biogeochemical cycles (C, N, O, P, Fe)
Species | SS | Genus | Guild
Ammonifex degensii KC4 | 12.508 | Moorella group | SRB/SR
Archaeoglobus profundus DSM 5631 | 12.024 | Archaeoglobus | SRB
Candidatus Desulforudis audaxviator MP104C | 11.972 | Candidatus Desulforudis | Sur
Pelodictyon phaeoclathratiforme BU-1 | 11.836 | Chlorobium/Pelodictyon group | GSB
Chlorobium phaeobacteroides BS1 | 11.649 | Chlorobium/Pelodictyon group | GSB
Chlorobium chlorochromatii CaD3 | 11.625 | Chlorobium/Pelodictyon group | GSB
Thiobacillus denitrificans ATCC 25259 | 11.61 | Thiobacillus | CLSB
Desulfohalobium retbaense DSM 5692 | 11.511 | Desulfohalobium | SRB
Desulfovibrio alaskensis G20 | 11.5 | Desulfovibrio | SRB
Desulfovibrio vulgaris DP4 | 11.442 | Desulfovibrio | SRB
Chlorobium tepidum TLS | 11.354 | Chlorobaculum | GSB
endosymbiont of unidentified scaly snail isolate Monju | 11.205 | 0 | Sur
Desulfovibrio vulgaris str. 'Miyazaki F' | 11.093 | Desulfovibrio | SRB
Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774 | 11.034 | Desulfovibrio | SRB
Sulfur: 112 H'; Nitrogen: 176 H'; Methane: 119 H'; Oxygen: 55 H'; Iron: 112 H'
Biogeochemical cycle | Genes | Pfam domains | Genomes | AUC
Sulfur (S) | 152 | 112 | 161 | 0.9855
Nitrogen (N) | 267 | 176 | 144 | 0.791
Methane (C) | 135 | 119 | 90 | 0.988
Oxygenic photosynthesis (O) | 50 | 55 | 53 | 0.983
Phosphorus (P) | | | |
Iron (Fe) | 36 | 33 | 34 | 0.863
Oxygen markers
ID | Description | H' mean | std
PF00067 | Cytochrome P450 | 0.644 | 0.033785
PF00115 | Cytochrome C and Quinol oxidase polypeptide I | 0.513 | 0.061551
PF01077 | Nitrite and sulphite reductase 4Fe-4S domain | 0.55825 | 0.049936
PF02560 | Cyanate lyase C-terminal domain | 0.93625 | 0.001389
PF03460 | Nitrite/Sulfite reductase ferredoxin-like half domain | 0.5525 | 0.040324
PF04898 | Glutamate synthase central domain | 0.479 | 0.034699
PF13442 | Cytochrome C oxidase, cbb3-type, subunit III | 0.6565 | 0.047093
python3 plot_entropy.py gen_genF_entropies.oxygen.tab -0.156 0.20625
Methane
ID | Description | H' mean | std
PF01913 | Formylmethanofuran-tetrahydromethanopterin formyltransferase | 3.629125 | 0.0227
PF01993 | methylene-5,6,7,8-tetrahydromethanopterin dehydrogenase | 2.876 | 0
PF02240 | Methyl-coenzyme M reductase gamma subunit | 3.168 | 0
PF02241 | Methyl-coenzyme M reductase beta subunit, C-terminal domain | 3.168 | 0
PF02289 | Cyclohydrolase (MCH) | 3.353 | 0
PF02741 | FTR, proximal lobe | 3.63475 | 0.034648
PF02745 | Methyl-coenzyme M reductase alpha subunit, N-terminal domain | 3.168 | 0
PF02783 | Methyl-coenzyme M reductase beta subunit, N-terminal domain | 3.168 | 0
PF04206 | Tetrahydromethanopterin S-methyltransferase, subunit E | 3.032 | 0
PF04207 | Tetrahydromethanopterin S-methyltransferase, subunit D | 3.032 | 0
PF04208 | Tetrahydromethanopterin S-methyltransferase, subunit A | 2.903375 | 0.015203
PF04211 | Tetrahydromethanopterin S-methyltransferase, subunit C | 3.02575 | 0.017678
PF05440 | Tetrahydromethanopterin S-methyltransferase subunit B | 2.980125 | 0.036537
python3 plot_entropy.py gen_genF_entropies.methane.tab -0.121 0.1475
Nitrogen
ID | Description | H' mean | std
PF00067 | Cytochrome P450 | 0.57375 | 0.0056
PF00174 | Oxidoreductase molybdopterin binding domain | 0.528125 | 0.006578
PF00355 | Rieske [2Fe-2S] domain | 0.507 | 0.032076
PF00507 | NADH-ubiquinone/plastoquinone oxidoreductase, chain 3 | 0.36975 | 0.010886
PF00547 | Urease, gamma subunit | 0.464 | 0
PF00699 | Urease beta subunit | 0.475125 | 0.001126
PF01077 | Nitrite and sulphite reductase 4Fe-4S domain | 0.47025 | 0.014568
PF02211 | Nitrile hydratase beta subunit | 0.405625 | 0.005041
PF02633 | Creatinine amidohydrolase | 0.58725 | 0.017466
PF03460 | Nitrite/Sulfite reductase ferredoxin-like half domain | 0.48 | 0.032715
PF05899 | Protein of unknown function (DUF861) | 0.52175 | 0.022914
PF09347 | Domain of unknown function (DUF1989) | 0.398875 | 0.007415
Iron
ID | Description | H' mean | std
PF14522 | Cytochrome c7 and related cytochrome c | 1.010 | 0.104
PF00355 | Rieske [2Fe-2S] domain | 0.51912 | 0.02854
PF00033 | Cytochrome b/b6/petB | 0.55875 | 0.04974
PF00034 | Cytochrome c | 0.5061 | 0.1013
ROC curve
• Two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate on the X axis.
• Depicts relative tradeoffs between benefits (true positives) and costs (false positives).
Positive instances: Suli (N=161). Negative instances: the remaining genomes in Gen (1,946).
Reading the ROC space:
• The point (0, 0) corresponds to never issuing a positive classification; such a classifier commits no false positive errors but also gains no true positives.
• Conservative classifiers make positive classifications only with strong evidence, so they make few false positive errors.
• The point (0, 1) corresponds to perfect classification.
• Random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5; no realistic classifier should have an AUC less than 0.5.
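As a concrete illustration of the AUC discussed here, the sketch below computes it directly from two lists of scores. It relies on the standard equivalence between the area under the ROC curve and the probability that a randomly chosen positive instance scores higher than a randomly chosen negative one (ties counted as one half). The function name and the toy scores are hypothetical; this is not how the AUC values reported in the talk were necessarily computed.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve from scores of positive and negative instances.

    Computed as the fraction of (positive, negative) pairs in which the
    positive instance has the higher score; ties count as half a pair.
    """
    if not pos_scores or not neg_scores:
        raise ValueError("both classes need at least one score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy scores: three of the four (positive, negative) pairs are ranked correctly.
print(roc_auc([2.0, 4.0], [1.0, 3.0]))
```

The pairwise loop is quadratic in the number of instances, which is fine for a few thousand genomes; a rank-based formulation would be preferable for much larger sets.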
[Figure: relative entropy (H') of selected domains: 4Fe-4S dicluster domain; Molybdopterin dinucleotide binding domain; Cytochrome C oxidase, cbb3-type, subunit III; Nitrogenase component 1 type Oxidoreductase]