Http://amp.pharm.mssm.edu/Enrichr Babak Bababasi Libraries.

http://amp.pharm.mssm.edu/Enrichr

Babak Bababasi

Libraries

Gene Ontology

• As sequence information accumulated and millions of genes were named, another problem emerged: the terminology used by different biologists was inconsistent. In addition, because a single gene can have several functions, the same sequence was sometimes stored under different names. Researchers therefore spend a great deal of time reconciling data and terms, and automating much of the data processing becomes difficult. For these reasons, functional descriptions of proteins need to be standardized, which led to the creation of the Gene Ontology (GO) project. The confusion arises because researchers working on different organisms tend to use different terms for the same gene or protein. The Gene Ontology database (browsable through AmiGO) is an encyclopedia of information describing genes and their products that aims to unify naming and terminology.

• A GO description provides three categories of information:
- biological processes
- cellular components
- molecular function

• In effect, every gene product, given its cellular location (cellular component), is involved in a particular biological process and consequently carries out a particular function (molecular function) in the cell. For example, cytochrome c, located in the mitochondrial matrix and the inner mitochondrial membrane, has oxidoreductase activity and is involved in the biological processes of oxidative phosphorylation and cell death.

• By examining Gene Ontology annotations, one can determine whether a gene or a set of genes plays a role in a particular cancer or in any other genetic disease.

• Based on these concepts, the Gene Ontology project was founded in 1998 by genome researchers, forming the Gene Ontology Consortium. AmiGO is a powerful tool for searching and retrieving information from the Gene Ontology database.

• To access this database, use the address http://www.geneontology.org/amigo. Searches can be made by gene name, gene product, the sequence of the gene of interest, and GO term name.

KEGG database

• The KEGG (Kyoto Encyclopedia of Genes and Genomes) database was launched in Japan in 1995, alongside the Human Genome Project.

• The initial goal of this database was to systematize the available information on the relationships between biological macromolecules, particularly in biological processes. At the same time, KEGG supported biochemical pathways, regulatory pathways, and gene catalogs (for all organisms that had been sequenced), and linked the information on each functional unit in the biochemical pathways to the corresponding catalog entry. In addition, KEGG organized information on the chemical compounds used in living cells into classified data, and linked each compound to the pathways in which it is used.

• The ultimate goal of KEGG is to use bioinformatics tools to reconstruct and predict the function of genes and their products in cellular pathways.

• The KEGG database also covers the pathways of various diseases, including cancer.

• www.genome.jp/kegg/

• WikiPathways was established to facilitate the contribution and maintenance of pathway information by the biology community. WikiPathways is an open, collaborative platform dedicated to the curation of biological pathways. WikiPathways thus presents a new model for pathway databases that enhances and complements ongoing efforts, such as KEGG, Reactome and Pathway Commons. Building on the same MediaWiki software that powers Wikipedia, we added a custom graphical pathway editing tool and integrated databases covering major gene, protein, and small-molecule systems. The familiar web-based format of WikiPathways greatly reduces the barrier to participate in pathway curation. More importantly, the open, public approach of WikiPathways allows for broader participation by the entire community, ranging from students to senior experts in each field. This approach also shifts the bulk of peer review, editorial curation, and maintenance to the community.

• REACTOME is an open-source, open access, manually curated and peer-reviewed pathway database. Pathway annotations are authored by expert biologists, in collaboration with Reactome editorial staff and cross-referenced to many bioinformatics databases. These include NCBI Gene, Ensembl and UniProt databases, the UCSC and HapMap Genome Browsers, the KEGG Compound and ChEBI small molecule databases, PubMed, and Gene Ontology.

• The rationale behind Reactome is to convey the rich information in the visual representations of biological pathways familiar from textbooks and articles in a detailed, computationally accessible format. The core unit of the Reactome data model is the reaction. Entities (nucleic acids, proteins, complexes, vaccines, anti-cancer therapeutics and small molecules) participating in reactions form a network of biological interactions and are grouped into pathways. Examples of biological pathways in Reactome include classical intermediary metabolism, signaling, innate and acquired immune function, transcriptional regulation, apoptosis and disease.

http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways

• Pathways on the CGAP web site have been obtained directly from BioCarta and KEGG (Kyoto Encyclopedia of Genes and Genomes). In addition, CGAP has linked each human gene in BioCarta and each human enzyme in KEGG to its CGAP Gene Info page, and each intermediary metabolite in KEGG to a CGAP Compound Info page.

• View Pathways:

• BioCarta Pathways on CGAP provide displays of gene interactions within pathways for human cellular processes, such as apoptosis and signal transduction.

• KEGG Pathways on CGAP display interactions between enzymes and intermediary compounds within metabolic and regulatory pathways, such as glycolysis and cell cycle regulation, and diagrams of molecular assemblies.

• The protein-protein interaction hubs gene-set library is made from an updated version of a human protein-protein interaction network that we continually update and that was originally published as part of the program Expression2Kinases.

• From this network, we extracted the proteins with 120 or more interactions. These proteins are the terms in the library whereas their direct protein interactors are the genes in each gene set.
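
The hub extraction described above can be sketched roughly as follows. This is a minimal illustration, not the Enrichr pipeline: the function name build_hub_library, the edge-list input format, and the toy network are assumptions made for this example, and the threshold is lowered in the demo so the output stays small.

```python
from collections import defaultdict


def build_hub_library(edges, min_interactions=120):
    """Return {hub protein: sorted list of its direct interactors}.

    edges is an iterable of (protein_a, protein_b) pairs (hypothetical format).
    A protein becomes a term only if it has at least min_interactions partners.
    """
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-interactions
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {
        protein: sorted(partners)
        for protein, partners in neighbors.items()
        if len(partners) >= min_interactions
    }


# Toy network (illustrative placeholder names), threshold lowered to 3.
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3")]
hub_library = build_hub_library(toy_edges, min_interactions=3)
print(hub_library)
```

With the default threshold of 120 the toy network yields no hubs at all, which is why the demo passes a smaller value.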

• The Kinase Enrichment Analysis (KEA) gene-set library contains human or mouse kinases and their known substrates collected from literature reports as provided by six kinase-substrate databases:

• HPRD, PhosphoSite, PhosphoPoint, Phospho.ELM, NetworKIN, and MINT.

• The next two gene-set libraries in the pathway category are protein complexes. The first library was created from a recent study that profiled nuclear complexes in human breast cancer cell lines after applying over 3000 immuno-precipitations followed by mass-spectrometry (IP-MS) experiments using over 1000 different antibodies. The second complexes gene-set library was created from the mammalian complexes database, CORUM.

• The SILAC phosphoproteomics gene-set library was created by processing tables from the supporting materials of SILAC phosphoproteomics studies.

• From each supporting table, we extracted lists of up and down proteins without applying any cutoffs.

• Protein IDs were converted to mammalian gene IDs when necessary using online gene symbol conversion tools. A total of 84 gene lists were extracted from such studies.

• Elucidation of endogenous cellular protein-protein interactions and their networks is most desirable for biological studies.

• Here we report our study of endogenous human coregulator protein complex networks obtained from integrative mass spectrometry-based analysis of 3290 affinity purifications. By preserving weak protein interactions during complex isolation and utilizing high levels of reciprocity in the large dataset, we identified many unreported protein associations, such as a transcriptional network formed by ZMYND8, ZNF687, and ZNF592.

• Furthermore, our work revealed a tiered interplay within networks that share common proteins, providing a conceptual organization of a cellular proteome composed of minimal endogenous modules (MEMOs), complex isoforms (uniCOREs), and regulatory complex-complex interaction networks (CCIs). This resource will effectively fill a void in linking correlative genomic studies with an understanding of transcriptional regulatory protein functions within the proteome for formulation and testing of future hypotheses.

Analysis of the Human Endogenous Complexome

• HumanCyc: Encyclopedia of Human Genes and Metabolism

• HumanCyc provides an encyclopedic reference on human metabolic pathways. It provides a zoomable human metabolic map diagram, and it has been used to generate a steady-state quantitative model of human metabolism.

• http://humancyc.org/

• PID data are now available for the research community via the NDEx database, hosted by the Ideker Lab at the UC San Diego School of Medicine.

• NDEx will return all the PID network files.

• PID data are also available for download in BioPAX format on GitHub. The NCI PID data portal will be retired on or around December 31, 2015. If you have any questions, please contact the NCI CBIIT Application Support team.

http://pid.nci.nih.gov/

• The PANTHER (Protein ANalysis THrough Evolutionary Relationships) Classification System was designed to classify proteins (and their genes) in order to facilitate high-throughput analysis. Proteins have been classified according to:

• Family and subfamily: families are groups of evolutionarily related proteins; subfamilies are related proteins that also have the same function

• Molecular function: the function of the protein by itself or with directly interacting proteins at a biochemical level, e.g. a protein kinase

• Biological process: the function of the protein in the context of a larger network of proteins that interact to accomplish a process at the level of the cell or organism, e.g. mitosis.

• Pathway: similar to biological process, but a pathway also explicitly specifies the relationships between the interacting molecules.

• The PANTHER Classifications are the result of human curation as well as sophisticated bioinformatics algorithms. Details of the methods can be found in (Thomas et al., Genome Research 2003; Mi et al. NAR 2005).

http://www.pantherdb.org/

• In order to integrate data from genome-wide ChIP-X studies (ChIP-chip, ChIP-seq, ChIP-PET and DamID) and utilize it for further biological discovery, we collected interactions from such experiments to construct a mammalian ChIP-X database. The database contains 189,933 interactions, manually extracted from 87 publications, describing the binding of 92 transcription factors to 31,932 target genes. We used the database to analyze mRNA expression data by performing gene-list enrichment analysis with the ChIP-X database as the prior biological knowledge gene-list library.

• The system is delivered as a web-based interactive application called ChIP Enrichment Analysis (ChEA).

• With ChEA, users can input lists of mammalian gene symbols for which the program computes over-representation of transcription factor targets from the ChIP-X database.
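
One common way to score over-representation of a gene list against each transcription factor's target set is a one-sided hypergeometric (Fisher-type) test. The sketch below assumes that test; the slides do not state which statistic ChEA computes, and the function names, the toy library, and the background size are all hypothetical.

```python
from math import comb


def overrepresentation_pvalue(overlap, list_size, target_size, background_size):
    """P(X >= overlap) when list_size genes are drawn without replacement
    from background_size genes, of which target_size are targets."""
    upper = min(list_size, target_size)
    total = comb(background_size, list_size)
    tail = sum(
        comb(target_size, k) * comb(background_size - target_size, list_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def rank_factors(input_genes, library, background_size):
    """Return (factor, overlap, p-value) tuples, most significant first.

    Assumes every input gene and every target belongs to the background.
    """
    genes = set(input_genes)
    results = []
    for factor, targets in library.items():
        target_set = set(targets)
        overlap = len(genes & target_set)
        p = overrepresentation_pvalue(
            overlap, len(genes), len(target_set), background_size
        )
        results.append((factor, overlap, p))
    return sorted(results, key=lambda row: row[2])


# Toy library with placeholder factor and gene names.
toy_library = {"TF_A": ["g1", "g2", "g3"], "TF_B": ["g7", "g8", "g9"]}
ranked = rank_factors(["g1", "g2", "g3", "g4"], toy_library, background_size=20)
for factor, overlap, p in ranked:
    print(factor, overlap, round(p, 4))
```

A real analysis would also correct the p-values for testing many factors at once; that step is omitted here to keep the sketch short.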

• The ChEA database allowed us to reconstruct an initial network of transcription factors connected based on shared overlapping targets and binding site proximity.

• To demonstrate the utility of ChEA we present three case studies. We show how by combining the Connectivity Map (CMAP) with ChEA, we can rank pairs of compounds to be used to target specific transcription factor activity in cancer cells.

• AVAILABILITY:

• The ChEA software and ChIP-X database are freely available online at: http://amp.pharm.mssm.edu/lib/chea.jsp.

• PWMs from TRANSFAC and JASPAR were used to scan the promoters of all human genes in the region from −2000 to +500 relative to the transcription start site (TSS). We retained only the 100% matches to the consensus sequences to call an interaction between a factor and target gene. This gene-set library was created for a tool we previously published called Expression2Kinases.
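
Keeping only exact consensus matches can be illustrated with a small scanner. This is a sketch under several assumptions: each PWM has already been reduced to an IUPAC consensus string, each promoter is supplied as a plain sequence covering the −2000 to +500 window, and both strands are searched. The slides specify none of these details, and all function names are hypothetical.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus string into a regular expression."""
    return re.compile("".join(IUPAC[base] for base in consensus.upper()))


def reverse_complement(sequence):
    """Reverse complement of an upper-case DNA string."""
    return sequence.translate(COMPLEMENT)[::-1]


def has_exact_match(consensus, promoter):
    """True if the consensus matches fully on either strand (assumption)."""
    pattern = consensus_to_regex(consensus)
    promoter = promoter.upper()
    return bool(
        pattern.search(promoter) or pattern.search(reverse_complement(promoter))
    )


def build_tf_library(consensus_by_factor, promoter_by_gene):
    """Map each factor to the sorted genes whose promoter holds an exact match."""
    return {
        factor: sorted(
            gene
            for gene, promoter in promoter_by_gene.items()
            if has_exact_match(consensus, promoter)
        )
        for factor, consensus in consensus_by_factor.items()
    }


# Toy promoters (placeholder gene names and short made-up sequences).
toy_promoters = {
    "geneA": "GGTATAAAGG",
    "geneB": "CCTTTATACC",
    "geneC": "GGGCCC",
}
tf_library = build_tf_library({"TF_X": "TATAWA"}, toy_promoters)
print(tf_library)
```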

• Transcription factor target genes inferred from PWMs for the human genome were downloaded from the UCSC Genome Browser FTP site which contains many resources for gene and sequence annotations. We converted this file into a gene set library and included it in Enrichr since it produces different results compared with the other method to identify transcription factor/target interactions from PWMs as described above.

• The ENCODE transcription factor gene-set library is the fourth method to create a transcription factor/target gene-set library. We processed the newly published data from the Encyclopedia of DNA Elements (ENCODE) project. Using the aligned files for all 646 experiments that profiled transcription factors in mammalian cells, we identified the peaks using the MACS software and then identified the genes targeted by the factors using our own custom processing. We sorted the peaks for each experiment by distance to the transcription start site (TSS) and retained the top 2000 target genes for each experiment.
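
The final selection step (sort peaks by distance to the TSS, keep the top 2000 genes) might look like the sketch below. It assumes each peak has already been assigned to a gene with a signed distance to that gene's TSS; the custom processing mentioned above is not described in the slides, so the input format and the function name are hypothetical.

```python
def top_target_genes(peaks, max_genes=2000):
    """Return up to max_genes distinct genes, closest peaks first.

    peaks is an iterable of (gene_symbol, signed_distance_to_tss) pairs.
    """
    ordered = sorted(peaks, key=lambda peak: abs(peak[1]))
    targets = []
    seen = set()
    for gene, _distance in ordered:
        if len(targets) >= max_genes:
            break
        if gene not in seen:
            seen.add(gene)
            targets.append(gene)
    return targets


# Toy peaks for one experiment (placeholder gene names).
toy_peaks = [("geneA", 500), ("geneB", -20), ("geneA", 10), ("geneC", 3000)]
print(top_target_genes(toy_peaks, max_genes=2))
```

A gene with several peaks is counted once, at the position of its closest peak.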

• The histone modification gene-set library was created by processing experiments from the NIH Roadmap Epigenomics project. Such experiments were conducted using various human cell types with antibodies targeting over 30 different histone modification marks. ChIP-seq datasets from the Roadmap Epigenomics project deposited in the GEO database were analyzed and converted to gene sets using the SICER software. Previous studies have indicated that the use of a control sample substantially reduces DNA shearing biases and sequencing artifacts; therefore, for each experiment, an input control sample was matched according to the description in GEO. ChIP-seq experiments without a matched control input were not included. The resulting gene-set library contains 27 types of histone modifications for 64 human cell lines from various tissue origins.

• The microRNA gene-set library was created by processing data from the TargetScan online database and was borrowed from our previous publication, Lists2Networks.

• OMIM is a database hosted at NCBI that catalogs all known mutations and their associated phenotypes. About 2,000 genetic diseases are recorded in this database, which can be searched by disease name. The database also contains summaries of the clinical studies related to each disease.

The OMIM (Online Mendelian Inheritance in Man) database contains review articles on human genes, genetic disorders, and other inherited traits. OMIM articles provide links to associated literature references, sequence records, maps, and related databases.

OMIM was expanded to create the OMIM expanded gene-set library. We entered the disease genes as the seed list and expanded the list by identifying proteins that directly interact with at least two of the disease gene products; in other words, we searched for paths that connect two disease gene products through one intermediate protein, resulting in a sub-network that connects the disease genes with additional proteins/genes. Each sub-network for each disease was converted to a gene set.
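
The seed-list expansion can be sketched as a single pass over an interaction edge list. This is an illustration only: the function name expand_disease_genes, the edge-list format, and the placeholder gene names are assumptions, and the interaction network actually used is not specified here.

```python
from collections import defaultdict


def expand_disease_genes(seed_genes, edges, min_links=2):
    """Return the seeds plus every non-seed protein that directly
    interacts with at least min_links distinct seed gene products."""
    seeds = set(seed_genes)
    linked_seeds = defaultdict(set)  # candidate protein -> seeds it touches
    for a, b in edges:
        if a in seeds and b not in seeds:
            linked_seeds[b].add(a)
        if b in seeds and a not in seeds:
            linked_seeds[a].add(b)
    intermediates = {
        protein
        for protein, touched in linked_seeds.items()
        if len(touched) >= min_links
    }
    return sorted(seeds | intermediates)


# Toy disease with two seed genes D1 and D2 (placeholder names).
toy_edges = [("D1", "P1"), ("P1", "D2"), ("D1", "P2"), ("D1", "D2")]
print(expand_disease_genes(["D1", "D2"], toy_edges))
```

Here P1 joins the gene set because it bridges D1 and D2, while P2 touches only one seed and is left out.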

• The GeneSigDB gene-set library was borrowed from the GeneSigDB database. The database contains gene lists extracted manually from the supporting tables of thousands of publications; most are from cancer-related studies.

• The Connectivity Map (CMAP) database [39] contains over 6,000 Affymetrix microarray gene expression experiments where human cancer cell lines were treated with over 1,300 drugs, many of them FDA approved, and changes in expression were measured after six hours. The drugs were always used as a single treatment but varied in concentration. The CMAP database provides the results in a table where genes are listed in rank order based on their level of differential expression compared to the untreated state.

• From this table, we extracted the top 100 and bottom 100 differentially expressed genes to create two gene-set libraries, one for the up genes and one for the down genes for each condition. Each set is associated with a drug name and the four-digit experiment number from CMAP. This four-digit number can be used to locate the concentration, cell type, and batch.
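
Splitting each ranked list into an up set and a down set is simple enough to show directly. In this sketch the input format (a ranked gene list per drug and experiment number) and the term label drug-experiment are assumptions for illustration, not the naming Enrichr actually uses.

```python
def split_ranked_genes(ranked_genes, n=100):
    """Return (top n, bottom n) of a list ordered from most up-regulated
    to most down-regulated."""
    if n <= 0 or len(ranked_genes) < 2 * n:
        raise ValueError("need n > 0 and at least 2 * n ranked genes")
    return ranked_genes[:n], ranked_genes[-n:]


def cmap_libraries(ranked_by_condition, n=100):
    """Build the up and down libraries from {(drug, experiment_id): ranked list}."""
    up_library, down_library = {}, {}
    for (drug, experiment_id), ranked in ranked_by_condition.items():
        term = f"{drug}-{experiment_id}"  # hypothetical term label
        up_library[term], down_library[term] = split_ranked_genes(ranked, n)
    return up_library, down_library


# Toy condition with ten ranked genes and n lowered to 3.
toy_ranked = [f"G{i}" for i in range(10)]
up, down = cmap_libraries({("drugA", "1234"): toy_ranked}, n=3)
print(up, down)
```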

• The VirusMINT gene-set library was created from the VirusMINT database [42], which is made of literature-extracted protein-protein interactions between viral proteins and human proteins. Each term in the library represents a virus wherein the genes/proteins in each set are the host proteins that are known to directly interact with all the viral proteins for each virus.

• Project Achilles is a systematic effort aimed at identifying and cataloging genetic vulnerabilities across hundreds of genomically characterized cancer cell lines. The project uses genome-wide genetic perturbation reagents (shRNAs or Cas9/sgRNAs) to silence or knock-out individual genes and identify those genes that affect cell survival. Large-scale functional screening of cancer cell lines provides a complementary approach to those studies that aim to characterize the molecular alterations (e.g. mutations, copy number alterations) of primary tumors, such as The Cancer Genome Atlas (TCGA). The overall goal of the project is to identify cancer genetic dependencies and link them to molecular characteristics in order to prioritize targets for therapeutic development and identify the patient population that might benefit from such targets.

• The MSigDB computational and MSigDB oncogenic signature gene-set libraries were borrowed from the MSigDB database from categories C4 and C6. These gene-set libraries contain modules of genes differentially expressed in various cancers.

• The cell type category is made of four gene-set libraries: genes highly expressed in human and mouse tissues extracted from the Mouse and Human Gene Atlases [44] and genes highly expressed in cancer cell lines from the Cancer Cell Line Encyclopedia (CCLE) [45] and NCI-60 [46]. The gene-set libraries in this category were all created similarly.

• The Cancer Cell Line Encyclopedia (CCLE) dataset was derived from the gene-centric RMA-normalized mRNA expression data from the CCLE site.

• The Human Gene Atlas and Mouse Gene Atlas datasets were derived from averaged GCRMA-normalized mRNA expression data from the BioGPS site.

• Finally, the Human NCI60 Cell Lines dataset, while also downloaded from the BioGPS site, was raw and not normalized; hence, it was normalized using quantile normalization. The downloaded datasets were all of a similar format: the raw data were in a table with genes as rows and the expression values in the different cells as columns. For each gene, the average and standard deviation of the expression values across all samples were computed. For each gene/term data point, a z-score was calculated based on the row’s average and standard deviation. Duplicate gene probes were merged by selecting the highest absolute z-score. Only genes with an absolute z-score greater than 3 were selected to be part of the gene set for a particular cell, which represents the term.
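
The z-score procedure above can be sketched as follows. The population standard deviation is used here as an assumption, since the text does not say which estimator was applied; the function name and the input layout (one row of values per probe) are likewise hypothetical.

```python
from statistics import mean, pstdev


def zscore_gene_sets(rows, samples, threshold=3.0):
    """Build {sample: sorted genes with |z| > threshold}.

    rows is a list of (gene_symbol, values) with one value per sample;
    a gene may appear in several rows (duplicate probes).
    """
    best = {}  # (gene, sample) -> z-score with the largest absolute value
    for gene, values in rows:
        average = mean(values)
        spread = pstdev(values)  # population standard deviation (assumption)
        if spread == 0:
            continue  # a flat probe carries no signal
        for sample, value in zip(samples, values):
            z = (value - average) / spread
            key = (gene, sample)
            if key not in best or abs(z) > abs(best[key]):
                best[key] = z
    library = {sample: [] for sample in samples}
    for (gene, sample), z in best.items():
        if abs(z) > threshold:
            library[sample].append(gene)
    return {sample: sorted(genes) for sample, genes in library.items()}


# Toy table: 20 samples, gene G1 measured by two probes, gene G2 by one.
toy_samples = [f"s{i}" for i in range(20)]
toy_rows = [
    ("G1", [100.0] + [0.0] * 19),           # spikes in sample s0
    ("G1", [1.0] * 20),                     # flat duplicate probe, skipped
    ("G2", [float(i) for i in range(20)]),  # smooth gradient, no outlier
]
cell_library = zscore_gene_sets(toy_rows, toy_samples)
print(cell_library["s0"])
```

With the population standard deviation, a single value among n samples can reach a z-score of at most the square root of n − 1, so a threshold of 3 can only be exceeded when there are more than ten samples.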

• http://www.maayanlab.net/ESCAPE/
