Valerie de Anda at #ICG12: A new multi-genomic approach for the study of biogeochemical cycles at...

Valerie De Anda
Ecology Institute, UNAM, México

Laboratory of Computational Biology, Zaragoza, CSIC, Spain

[email protected]
github.com/valdeanda

@val_deanda

The 12th International Conference on Genomics, October 26 to 29, 2017, Shenzhen, China

The iceberg illusion of metagenomics

Revolution in the field of microbial ecology:

» Genomic reconstruction: microbial dark matter

» Large amounts of data

» Diversity, ecology, evolution and functional makeup of the microbial world

The ability to evaluate complex metabolic functions in large data sets remains biologically and computationally challenging.

MOTIVATION GENERAL IDEA RESULTS CONCLUSIONS PERSPECTIVES THANKS


» It is really complex to infer and test biological hypotheses in such data

The Iceberg illusion of metagenomics

Microbial ecology-derived ‘omic’ studies

What do we need to improve the efficiency of data processing?

Biological data interpretation (evaluate, compare and analyze complex data on a large scale)

Computational efficiency (high performance, accuracy, high-speed data processing, reproducibility)

Metagenomic data:

» Most abundant features

» Marker genes

» Statistically different features

Gomez-Cabrero et al. 2014, BMC SB; Reshetova et al. 2013, BMC SB



Data integration

For a given system, multiple sources (and possibly types) of data are available, and we want to study them integratively to improve knowledge discovery.

What are the available data that can be used to characterize large-scale metabolic machineries?

How do we integrate them all to improve our understanding of the system?

Gomez-Cabrero et al. 2014, BMC SB; Reshetova et al. 2013, BMC SB

Prior knowledge: to reduce the solution space and/or to focus the analysis on biologically meaningful regions (specific metabolic machineries)

(Targeted)

Metabolism

Taxa involved in that particular metabolism

Proteins involved in that particular metabolism

Publicly available genomes?

Mathematical model: relative entropy

Informative score (MEBS)

$$H' = \sum_{i}^{n} P(i)\,\log_2 \frac{P(i)}{Q(i)}$$

H' ≥ 1: informative

H' ≤ 0: non-informative





Large-scale dataset

Does it really work?

Can it capture an entire metabolic machinery? Can it be used to evaluate, compare and analyze complex data on a large scale (genomes, metagenomes)?

Is it computationally efficient? Accurate, fast on large datasets, and reproducible?

Data integration

Single value



Data integration: case study

Atmosphere

Solar E°

Redox reactions

Metabolic guilds

Geological processes

An entire biogeochemical cycle

S-cycle

CHONS-P




They really capture the major processes involved in the mobilization and use of S-compounds through Earth's biosphere



Data integration: case study, the S-cycle

https://metacyc.org/META/NEW-IMAGE?object=Sulfur-Metabolism

http://www.genome.jp/kegg-bin/show_pathway?map00920

Manually curated reconstruction of the S-metabolic machinery




Taxa: metabolic guilds (Suli, N=161; complete nr sequenced S-genomes)

i) CLSB: 24 genera
ii) PSB: 25 genera
iii) GSB: 9 genera
iv) SRB: 40 genera
v) SRM: 19 genera
vi) SO: 4 genera

Metabolic machinery (Sucy, N=152)

i) Sulfur compounds
ii) Metabolic pathways
iii) Genes
iv) Proteins

txt

GCF_000006985.1 Chlorobium tepidum TLS

GCF_000007005.1 Sulfolobus solfataricus P2

GCF_000007305.1 Pyrococcus furiosus DSM 3638

GCF_000008545.1 Thermotoga maritima MSB8

GCF_000008625.1 Aquifex aeolicus VF5

GCF_000008665.1 Archaeoglobus fulgidus DSM 4304

GCF_000009965.1 Thermococcus kodakarensis KOD1

>Protein1

MIKPVGSDELKPLFVYDPEEHHKLSHEAESLPSVVISSQGPRVSSM

MGAGYFSPAGFMNV

>Protein 2

MAYKTIIEDGIDVLVVGAGLGGTGAAFEARYWGQDKKIVIAEKANID

>Protein 3

MPTFVYMTRCDGCGQCVDICPSDIMHIDTTIRRAYNIEPNMCWEC

YSCVKACPHNAIDVR

Evidence linking them with the S-cycle (Curated DB and primarily literature)

Evidence suggesting their physiological and biochemical involvement in the use of sulfur compounds.








Data integration: case study, the S-cycle

Table 1. Metabolic pathways of the global biogeochemical S-cycle

Pathway number  Metabolism  Chemical process  Sulfur compound             Type  Chemical formula  Source  Number of Pfam domains
P1              DS          O                 Sulfite                     I     SO₃²⁻             E       9
P2              DS          O                 Thiosulfate                 I     S₂O₃²⁻            E       10
P3              DS          O                 Tetrathionate               I     S₄O₆²⁻            E       2
P4              DS          R                 Tetrathionate               I     S₄O₆²⁻            E       17
P5              DS          R                 Sulfate                     I     SO₄²⁻             E       20
P6              DS          R                 Elemental sulfur            I     S⁰                E       20
P7              DS          D                 Thiosulfate                 I     S₂O₃²⁻            E       9
P8              DS          O                 Carbon disulfide            O     CS₂               E       1
P9              A           DE                Alkanesulfonate             O     CH₃O₃SR           S       5
P10             A           R                 Sulfate                     I     SO₄²⁻             S       20
P11             DS          O                 Sulfide                     I     H₂S               E/S     29
P12             A           DE                L-cysteate                  O     C₃H₆NO₅S          C/E     1
P13             A           DE                Dimethyl sulfone            O     C₂H₆O₂S           C/E     3
P14             A           DE                Sulfoacetate                O     C₂H₂O₅S           C/E     2
P15             A           DE                Sulfolactate                O     C₃H₄O₆S           C/S     14
P16             A           DE                Dimethyl sulfide            O     C₂H₆S             C/S     16
P17             A           DE                Dimethylsulfoniopropionate  O     C₅H₁₀O₂S          C/S/E   12
P18             A           DE                Methylthiopropanoate        O     C₄H₇O₂S           C/S     7
P19             A           DE                Sulfoacetaldehyde           O     C₂H₃O₄S           C/S     7
P20             DS          O                 Elemental sulfur            I     S⁰                C/S/E   7
P21             DS          D                 Elemental sulfur            I     S⁰                C/S/E   1
P22             A           DE                Methanesulfonate            O     CH₃O₃S            C/S/E   7
P23             A           DE                Taurine                     O     C₂H₇NO₃S          C/S/E   11
P24             DS          M                 Dimethyl sulfide            O     C₂H₆S             C       1
P25             DS          M                 Methylthio-propanoate       O     C₄H₇O₂S           C       1
P26             DS          M                 Methanethiol                O     CH₄S              C       1
P27             A           DE                Homotaurine                 O     C₃H₉NO₃S          N       1
P28             A           B                 Sulfolipid                  O     SQDG                      4
P29             Markers                       Markers                                                     12





Large omic datasets

What are the available data that can be used to characterize large-scale metabolic pathways?

Gen: 2,107 nr genomes (faa), 1.5 GB

How many genomes were available at the time of analysis?


Number of complete prokaryotic genomes: ≈4,000 (NCBI RefSeq), Dec 2016

Non-redundant: 2,107, Dec 2016

Publicly available and manually curated data

Large omic datasets

What are the available data that can be used to characterize large-scale metabolic machineries?

Taxa: Suli. Proteins: Sucy.

Gen: 2,107 nr genomes (faa), 1.5 GB

Met and GenF: 104 GB and ≈ 500 GB

How many metagenomes were available at the time of analysis? We kept those that:

i) were publicly available
ii) contained associated metadata
iii) had been isolated from well-defined environments (e.g., rivers, soil, biofilms)
iv) were not host-associated microbiome sequences (e.g., human, cow, chicken)



112 HMMs of S-proteins

Gen: 2,107 nr genomes (faa); GenF

MEBS: GENERAL OVERVIEW

Stage 1: Manual curation and omic datasets (Taxa: Suli. Proteins: Sucy.)

Stage 2: Domain composition

Stage 3: Relative entropy. Domains enriched among the microorganisms of interest.

Mathematical model

$$H' = \sum_{i}^{n} P(i)\,\log_2 \frac{P(i)}{Q(i)}$$

P(i) (observed) = frequency of protein domain i in S genomes (161)

Q(i) (expected) = frequency of protein domain i in Gen (2,107)

H' ≥ 1: informative

H' ≤ 0: non-informative

Stage 4: Informative score (single value)

Can it capture the S-metabolic machinery? Can it be used to evaluate, compare and analyze complex data on a large scale (genomes, metagenomes)?

Is it computationally efficient? Accurate, fast on large datasets, and reproducible?



https://github.com/eead-csic-compbio/metagenome_Pfam_score
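The relative entropy stage can be illustrated with a minimal Python sketch. This is not code from the repository above. It assumes that "frequency" means the fraction of genomes containing a domain, and that the single value for a sample is the sum of H'(i) over the domains it contains. The function names and the toy counts are hypothetical.

```python
import math


def relative_entropy(p_obs: float, q_exp: float) -> float:
    """Relative entropy term H'(i) = P(i) * log2(P(i) / Q(i)) for one domain."""
    if p_obs == 0.0:
        return 0.0  # convention: a domain never observed contributes nothing
    if q_exp <= 0.0:
        raise ValueError("expected frequency Q(i) must be positive when P(i) > 0")
    return p_obs * math.log2(p_obs / q_exp)


def domain_entropies(suli_counts, gen_counts, n_suli=161, n_gen=2107):
    """Per-domain H'(i) from the number of genomes containing each domain.

    Assumption: 'frequency' is the fraction of genomes with the domain.
    """
    return {
        domain: relative_entropy(suli_counts.get(domain, 0) / n_suli,
                                 gen_counts[domain] / n_gen)
        for domain in gen_counts
    }


def summed_score(sample_domains, entropies):
    """Single value for a sample: sum of H'(i) over the domains it contains."""
    return sum(entropies[d] for d in set(sample_domains) if d in entropies)


# Toy counts (hypothetical, not taken from the talk).
suli = {"DOMAIN_A": 80, "DOMAIN_B": 161}
gen = {"DOMAIN_A": 100, "DOMAIN_B": 2107}
entropies = domain_entropies(suli, gen)
print(entropies)
print(summed_score(["DOMAIN_A", "DOMAIN_B", "DOMAIN_C"], entropies))
```

In this toy case DOMAIN_B occurs in every genome, so P(i) = Q(i) and its term is zero, while DOMAIN_A is concentrated in the genomes of interest and gets a positive term.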

2,107 genomes (including 161 Suli) + 935 metagenomes

Distribution of Sulfur Score (SS) in 2,107 nr genomes

Genomes highlighted on the plot:

» Candidatus Desulforudis audaxviator MP104C

» an unnamed endosymbiont of a scaly snail from a black smoker chimney

» the archaeon Geoglobus ahangari, sampled from a hydrothermal vent at 2,000 m depth

Metagenomic reconstructions of hard-to-culture taxa

Sur, N=192



ROC curve

• A two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis.

• Depicts the relative trade-offs between benefits (true positives) and costs (false positives).

Positive instances: Suli (N=161)

Negative instances: the remaining genomes in Gen (N=1,946)

Perfect classification
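How a ROC curve and its area can be obtained from per-genome scores is sketched below in plain Python. It is an illustration of the general technique, not the evaluation code used for the talk. The function names, scores and labels are hypothetical; label 1 marks a positive instance and 0 a negative one.

```python
def roc_points(scores, labels):
    """Return (FP rate, TP rate) points obtained by lowering the score threshold."""
    pairs = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for label in labels if label)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative instance")
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(pairs):
        threshold = pairs[i][0]
        # Consume every instance tied at this score before emitting a point.
        while i < len(pairs) and pairs[i][0] == threshold:
            if pairs[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Toy example (hypothetical scores).
toy_scores = [0.9, 0.8, 0.7, 0.6]
toy_labels = [1, 0, 1, 0]
print(auc(roc_points(toy_scores, toy_labels)))
```

A score that ranks every positive above every negative gives an area of 1, and a score that cannot separate the two groups gives 0.5.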


Distribution of Sulfur Score (SS) in the metagenomic dataset (935 metagenomes)


Distribution of SS values observed in 935 metagenomes, classified in terms of features (X axis) and colored according to their particular habitats. Features are sorted according to their median SS values. Green lines indicate the lowest and largest 95th percentiles observed across MSL classes.

Geo-localized metagenomes sampled around the globe are colored according to their SS values


[Figure: MEBS BG cycling. S genes and S genomes; informative and non-informative; Markers, Comp.]

Conclusions

» We present MEBS, a new open-source software to evaluate, quantify, compare, and predict the metabolic machinery of interest in large ‘omic’ datasets using a single value.

» To test the applicability of this approach, we evaluated one of the most complex biogeochemical cycles: the sulfur cycle.

» Using data integration and manual curation, we reconstructed the entire sulfur machinery: Suli and Sucy.

» We show that the mathematical framework of relative entropy can be used to capture complex metabolic machineries in large-scale omic samples.

» MEBS is a powerful and broadly applicable approach to predict and classify microorganisms closely involved in the sulfur cycle, even in hard-to-culture microbial lineages.

» Computationally efficient, accurate (AUC = 0.985) and reproducible.

» Not in the presentation: the entropy can be used to detect marker domains, and the completeness of the S-cycle pathways can be benchmarked on a large scale.


[Figure: MEBS BG cycling across C, N, O, S, Fe and P, with possible applications: bioremediation, antibiotics, extreme environments, agriculture, ?]

Perspectives

• We are currently finishing the analyses to demonstrate the applicability of this approach to other biogeochemical cycles (C, N, O, Fe, P).

• Thereby, we hope that the MEBS pipeline will facilitate the analysis of biogeochemical cycles or complex metabolic networks carried out by specific prokaryotic guilds, such as bioremediation processes (e.g., degradation of hydrocarbons, toxic aromatic compounds, heavy metals, etc.).

• We look forward to collaborating with and helping other researchers by integrating comprehensive databases that might be helpful to the scientific community.

• Furthermore, we are currently working to improve the algorithm by using only a list of sequenced genomes involved in the metabolism of interest, in order to reduce the manual curation effort.

• We are also considering using k-mers instead of peptide Hidden Markov Models to increase the speed of the pipeline.

• We anticipate that our platform will stimulate interest and involvement among the scientific community to explore uncultured genomes derived from large metagenomic sequences: exploring microbial dark matter.

Icoquih Zapata

Valeria Souza

Luis Eguiarte

Bruno Contreras

Cesar Poot-Hernandez

De Anda et al., 2017. MEBS, a software platform to evaluate large (meta)genomic collections according to their metabolic machinery: unraveling the sulfur cycle. GigaScience, in press.

Laboratory of Molecular and Experimental Evolution, Ecology Institute, UNAM, Mexico

Laboratory of Computational Biology

Thank you for your attention!

Supplementary files

[Figure, panels A and B: completeness in Gen (n=2,107) and Met (n=935). Labeled genomes: D. acidiphilus; Hydrogenobaculum; A. caldus; A. ferrivorans; T. mobilis; D. aromatica; Thauera sp.; T. humireducens; A. denitrificans; S. tokodaii; A. hospitalis (among 12 other genomes); P. phaeoclathratiforme; C. chlorochromatii; C. tepidum; T. denitrificans; T. violascens; S. thiotaurini.]


Completeness

The number of protein domains present in a given sample is divided by the total number of domains in the pre-defined pathway.
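The completeness ratio described here can be written as a short Python sketch. The function name and the domain sets in the example are hypothetical; in particular, the Pfam identifiers are used only as placeholders and do not reproduce any actual pathway definition from the talk.

```python
def pathway_completeness(sample_domains, pathway_domains):
    """Fraction of a pathway's domains that are present in a sample."""
    pathway = set(pathway_domains)
    if not pathway:
        raise ValueError("the pathway must define at least one domain")
    present = pathway & set(sample_domains)
    return len(present) / len(pathway)


# Placeholder domain sets (illustrative only).
example_pathway = {"PF12139", "PF01747", "PF04358", "PF00374"}
example_sample = {"PF12139", "PF04358", "PF00005"}
print(pathway_completeness(example_sample, example_pathway))
```

Domains found in the sample but not listed for the pathway do not affect the ratio, so the value always lies between 0 and 1.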

35 private metagenomes: microbial mats, sediment and lake water

Reads; processing; ORF prediction (gene calling); mean size length (aa residues)

https://microbiome.wordpress.com/

Counts of prokaryotic genomes in each NCBI category as of July 2017: non-redundant and redundant

LARGE SCALE


GenF size category  5th percentile  95th percentile
Real                -0.091          0.101
30                  -0.086          0.105
60                  -0.09           0.104
100                 -0.088          0.1
150                 -0.09           0.103
200                 -0.89           0.105
250                 -0.09           0.106
300                 -0.09           0.1


Table 2. Informative Pfam domains with high H’ and low std. Novel proposed molecular marker domains in metagenomic data of variable MSL.

PF12139 (Suli occurrences: 58/161; H’ mean: 1.2; H’ std: 0.01)
Adenosine-5'-phosphosulfate reductase beta subunit: Key protein domain for both sulfur oxidation and reduction metabolic pathways. It has been widely studied in dissimilatory sulfate reduction. In all recognized sulfate-reducing prokaryotes, the dissimilatory process is mediated by three key enzymes: Sat, Apr and Dsr. Homologous proteins are also present in the anoxygenic photolithotrophic and chemolithotrophic sulfur-oxidizing bacteria (CLSB, PSB, GSB), in different cluster organization [35].

PF00374 (Suli occurrences: 135/161; H’ mean: 1.1; H’ std: 0.09)
Nickel-dependent hydrogenase: Hydrogenases with S-cluster and selenium containing Cys-x-x-Cys motifs involved in the binding of nickel. Among the homologues of this hydrogenase domain is the alpha subunit of the sulfhydrogenase I complex of Pyrococcus furiosus, which catalyzes the reduction of polysulfide to hydrogen sulfide with NADPH as the electron donor [55].

PF01747 (Suli occurrences: 103/161; H’ mean: 1.03; H’ std: 0.06)
ATP-sulfurylase: Key protein domain for both sulfur oxidation and reduction processes. The enzyme catalyzes the transfer of the adenylyl group from ATP to inorganic sulfate, producing adenosine 5′-phosphosulfate (APS) and pyrophosphate, or the reverse reaction [56].

PF02662 (Suli occurrences: 62/161; H’ mean: 0.82; H’ std: 0.03)
Methyl-viologen-reducing hydrogenase, delta subunit: One of the enzymes involved in methanogenesis, encoded in the mth-flp-mvh-mrt cluster of methane genes in Methanothermobacter thermautotrophicus. No specific functions have been assigned to the delta subunit [48].

PF10418 (Suli occurrences: 122/161; H’ mean: 0.78; H’ std: 0.06)
Iron-sulfur cluster binding domain of dihydroorotate dehydrogenase B: Among the homologous genes in this family are asrA and asrB from Salmonella enterica subsp. enterica serovar Typhimurium, which encode 1) a dissimilatory sulfite reductase, 2) a gamma subunit of the sulfhydrogenase I complex of Pyrococcus furiosus and 3) a gamma subunit of the sulfhydrogenase II complex of the same organism [12].

PF13247 (Suli occurrences: 149/161; H’ mean: 0.66; H’ std: 0.06)
4Fe-4S dicluster domain: Homologues of this family include 1) DsrO, a ferredoxin-like protein related to the electron transfer subunits of respiratory enzymes, 2) the dimethylsulfide dehydrogenase β subunit (ddhB), involved in dimethyl sulfide degradation in Rhodovulum sulfidophilum, and 3) the sulfur reductase FeS subunit (sreB) of Acidianus ambivalens, involved in sulfur reduction using H2 or organic substrates as electron donors [12].

PF04358 (Suli occurrences: 73/161; H’ mean: 0.52; H’ std: 0)
DsrC-like protein: DsrC is present in all organisms encoding a dsrAB sulfite reductase (sulfate/sulfite reducers or sulfur oxidizers). Physiological studies suggest that sulfate reduction rates are determined by cellular levels of this protein. Dissimilatory sulfate reduction couples the four-electron reduction of the DsrC trisulfide to energy conservation [57]. DsrC was initially described as a subunit of DsrAB, forming a tight complex; however, it is not a subunit, but rather a protein with which DsrAB interacts. DsrC is involved in sulfur-transfer reactions; there is a disulfide bond between the two DsrC cysteines as a redox-active center in the sulfite reduction pathway. Moreover, DsrC is among the most highly expressed sulfur energy metabolism genes in isolated organisms and metatranscriptomes (Santos et al., 2015).

PF01058 (Suli occurrences: 158/161; H’ mean: 0.45; H’ std: 0.01)
NADH ubiquinone oxidoreductase, 20 Kd subunit: Homologous genes are found in the delta subunits of both sulfhydrogenase complexes of Pyrococcus furiosus [12].

PF01568 (Suli occurrences: 156/161; H’ mean: 0.4; H’ std: 0.05)
Molybdopterin dinucleotide binding domain: This domain corresponds to the C-terminal domain IV in dimethyl sulfoxide (DMSO) reductase [48].
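The selection criterion in the table title (high H’ mean, low std) can be sketched in Python as follows. This is an illustration only. It assumes that each domain has one H’ value per dataset variant (for example, per GenF size category) and that std is the population standard deviation. The thresholds, function name and input values are hypothetical and are not the cut-offs used in the talk.

```python
from statistics import mean, pstdev


def candidate_markers(entropy_by_variant, min_mean=0.4, max_std=0.1):
    """Keep domains whose H' is high on average and stable across variants.

    entropy_by_variant maps a domain ID to its list of H' values.
    Thresholds are illustrative assumptions, not published cut-offs.
    """
    selected = []
    for domain, values in entropy_by_variant.items():
        h_mean = mean(values)
        h_std = pstdev(values)  # assumption: population standard deviation
        if h_mean >= min_mean and h_std <= max_std:
            selected.append((domain, h_mean, h_std))
    # Most informative domains first.
    return sorted(selected, key=lambda item: -item[1])


# Hypothetical H' values across three dataset variants.
toy_entropies = {
    "DOMAIN_STABLE_HIGH": [1.2, 1.2, 1.2],
    "DOMAIN_LOW": [0.1, 0.1, 0.1],
    "DOMAIN_UNSTABLE": [0.0, 2.0, 0.0, 2.0],
}
print(candidate_markers(toy_entropies))
```

Only the first toy domain is kept: the second has a low mean, and the third has a high mean but varies too much between variants.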

Advanced manual mode

» Biogeochemical cycles (C, N, O, P, Fe)


Species                                                           SS      Genus                         Guild
Ammonifex degensii KC4                                            12.508  Moorella group                SRB/SR
Archaeoglobus profundus DSM 5631                                  12.024  Archaeoglobus                 SRB
Candidatus Desulforudis audaxviator MP104C                        11.972  Candidatus Desulforudis       Sur
Pelodictyon phaeoclathratiforme BU-1                              11.836  Chlorobium/Pelodictyon group  GSB
Chlorobium phaeobacteroides BS1                                   11.649  Chlorobium/Pelodictyon group  GSB
Chlorobium chlorochromatii CaD3                                   11.625  Chlorobium/Pelodictyon group  GSB
Thiobacillus denitrificans ATCC 25259                             11.61   Thiobacillus                  CLSB
Desulfohalobium retbaense DSM 5692                                11.511  Desulfohalobium               SRB
Desulfovibrio alaskensis G20                                      11.5    Desulfovibrio                 SRB
Desulfovibrio vulgaris DP4                                        11.442  Desulfovibrio                 SRB
Chlorobium tepidum TLS                                            11.354  Chlorobaculum                 GSB
endosymbiont of unidentified scaly snail isolate Monju            11.205  0                             Sur
Desulfovibrio vulgaris str. 'Miyazaki F'                          11.093  Desulfovibrio                 SRB
Desulfovibrio desulfuricans subsp. desulfuricans str. ATCC 27774  11.034  Desulfovibrio                 SRB



Sulfur: 112 H’; Nitrogen: 176 H’; Methane: 119 H’; Oxygen: 55 H’; Iron: 112 H’

Biogeochemical cycle         Genes  Pfam domains  Genomes  AUC
Sulfur (S)                   152    112           161      0.9855
Nitrogen (N)                 267    176           144      0.791
Methane (C)                  135    119           90       0.988
Oxygenic photosynthesis (O)  50     55            53       0.983
Phosphorus (P)
Iron (Fe)                    36     33            34       0.863


Oxygen markers

ID       Description                                            H’ mean  std
PF00067  Cytochrome P450                                        0.644    0.033785
PF00115  Cytochrome C and Quinol oxidase polypeptide I          0.513    0.061551
PF01077  Nitrite and sulphite reductase 4Fe-4S domain           0.55825  0.049936
PF02560  Cyanate lyase C-terminal domain                        0.93625  0.001389
PF03460  Nitrite/Sulfite reductase ferredoxin-like half domain  0.5525   0.040324
PF04898  Glutamate synthase central domain                      0.479    0.034699
PF13442  Cytochrome C oxidase, cbb3-type, subunit III           0.6565   0.047093

python3 plot_entropy.py gen_genF_entropies.oxygen.tab -0.156 0.20625


Methane

ID       Description                                                   H’ mean   std
PF01913  Formylmethanofuran-tetrahydromethanopterin formyltransferase  3.629125  0.0227
PF01993  methylene-5,6,7,8-tetrahydromethanopterin dehydrogenase       2.876     0
PF02240  Methyl-coenzyme M reductase gamma subunit                     3.168     0
PF02241  Methyl-coenzyme M reductase beta subunit, C-terminal domain   3.168     0
PF02289  Cyclohydrolase (MCH)                                          3.353     0
PF02741  FTR, proximal lobe                                            3.63475   0.034648
PF02745  Methyl-coenzyme M reductase alpha subunit, N-terminal domain  3.168     0
PF02783  Methyl-coenzyme M reductase beta subunit, N-terminal domain   3.168     0
PF04206  Tetrahydromethanopterin S-methyltransferase, subunit E        3.032     0
PF04207  Tetrahydromethanopterin S-methyltransferase, subunit D        3.032     0
PF04208  Tetrahydromethanopterin S-methyltransferase, subunit A        2.903375  0.015203
PF04211  Tetrahydromethanopterin S-methyltransferase, subunit C        3.02575   0.017678
PF05440  Tetrahydromethanopterin S-methyltransferase subunit B         2.980125  0.036537

python3 plot_entropy.py gen_genF_entropies.methane.tab -0.121 0.1475

Nitrogen

ID       Description                                             H’ mean   std
PF00067  Cytochrome P450                                         0.57375   0.0056
PF00174  Oxidoreductase molybdopterin binding domain             0.528125  0.006578
PF00355  Rieske [2Fe-2S] domain                                  0.507     0.032076
PF00507  NADH-ubiquinone/plastoquinone oxidoreductase, chain 3   0.36975   0.010886
PF00547  Urease, gamma subunit                                   0.464     0
PF00699  Urease beta subunit                                     0.475125  0.001126
PF01077  Nitrite and sulphite reductase 4Fe-4S domain            0.47025   0.014568
PF02211  Nitrile hydratase beta subunit                          0.405625  0.005041
PF02633  Creatinine amidohydrolase                               0.58725   0.017466
PF03460  Nitrite/Sulfite reductase ferredoxin-like half domain   0.48      0.032715
PF05899  Protein of unknown function (DUF861)                    0.52175   0.022914
PF09347  Domain of unknown function (DUF1989)                    0.398875  0.007415


Iron

ID       Description                             H’ mean  std
PF14522  Cytochrome c7 and related cytochrome c  1.010    0.104
PF00355  Rieske [2Fe-2S] domain                  0.51912  0.02854
PF00033  Cytochrome b/b6/petB                    0.55875  0.04974
PF00034  Cytochrome c                            0.5061   0.1013


ROC curve

• A two-dimensional graph in which the TP rate is plotted on the Y axis and the FP rate is plotted on the X axis.

• Depicts the relative trade-offs between benefits (true positives) and costs (false positives).

Positive instances: Suli (N=161)

Negative instances: the remaining genomes in Gen (N=1,946)

Regions of the plot:

• Never issuing a positive classification: such a classifier commits no false positive errors but also gains no true positives.

• Positive classifications made only with strong evidence: such classifiers make few false positive errors.

• Perfect classification.

• Random guessing produces the diagonal line between (0, 0) and (1, 1), which has an area of 0.5; no realistic classifier should have an AUC less than 0.5.


[Figure: relative entropy H’ of selected domains: 4Fe-4S dicluster domain; Molybdopterin dinucleotide binding domain; Cytochrome C oxidase, cbb3-type, subunit III; Nitrogenase component 1 type oxidoreductase.]

