Thang
7/27/2019 Thang
http://slidepdf.com/reader/full/thang 1/25
Large Scale Semantic Data Integration and Analytics through Cloud: A Case Study in
Bioinformatics
Tat Thang
Parallel and Distributed Computing Centre,
School of Computer Engineering, NTU, Singapore
Michael Li
Semantic Technology Group,
Institute for Infocomm Research (I2R), A-Star, Singapore
11th Feb 2011
Overview
• Motivation
• Problem Definition
• Objective
• Proposed Architecture
• A case study in Bio-informatics
• Demo
• Future works
• Summary
Motivation
• Deluge of biological data
• Biomedical data is spread across heterogeneous
databases
• Data: structured, semi-structured and unstructured
formats
• Demand for fast, large-scale and cost-effective computing strategies
Problem Definition
• Data
– PubMed contains 20+ million abstracts
– UniProt contains 13.5+ million records
• Case study on antiviral proteins
– Over 70,000 citations in PubMed
– Over 14,000 proteins in UniProt
• Integration and Analysis
Related Works
• Using NLP to link documents to existing ontologies (e.g. GoPubMed, Textpresso)
– No querying & reasoning
– Not scalable
• RDF/OWL based integration tools (e.g. TopBraid Suite)
– No NLP
– Not bio specific. Also not biologist friendly
• Cloud-based bio data mining works (e.g. Kudtarkar P, 2010)
– Still in early stages
– Challenging to perform semantic integration on cloud
Objective
To provide a framework that enables
• Better data infrastructure
– Scalability
– Management of heterogeneity
– Cost-effectiveness
• Better data analytics
– Integrative data mining
– Visual query interface
Proposed Framework
Our Approach
• Data Infrastructure Module
• Data Analytics Module
Our Approach
[Figure: architecture diagram spanning the Data Infrastructure module and the Data Analytics module. Components shown: biomedical sources, web crawler, parser, cloud-based data store, Knowle Population Service, ontology, query & reasoner, user interface.]
Our Approach
• Data Infrastructure Module
– Cloud based: Amazon EC2, Hadoop, Microsoft Azure
– Parallel processing: MapReduce
– Distributed storage: BigTable, HBase, HDFS
• Data Analytics Module
– Non-semantic: database driven
– Semantic: ontology driven (Knowle, AllegroGraph, TopBraid)
Data Infrastructure Module (Hadoop)
• Software framework for data-intensive, distributed applications
• The Hadoop Distributed File System (HDFS) provides a distributed,
scalable, and portable file system that supports large data sets
• Hadoop MapReduce lets programs process large amounts of data
in parallel
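To make the MapReduce bullet concrete, here is a minimal single-process sketch in Python of the map, shuffle, and reduce phases, counting term occurrences across a few abstracts. It only imitates the programming model and does not use Hadoop. The sample abstracts and the function names (`map_phase`, `shuffle`, `reduce_phase`, `run_job`) are illustrative assumptions, not part of the presented system.

```python
from collections import defaultdict

def map_phase(doc_id, text):
    """Emit (term, 1) for every whitespace-separated term in one document."""
    for term in text.lower().split():
        yield term, 1

def shuffle(pairs):
    """Group emitted values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(term, counts):
    """Sum the partial counts for one term."""
    return term, sum(counts)

def run_job(documents):
    """Run map over every document, then shuffle, then reduce each key."""
    emitted = []
    for doc_id, text in documents.items():
        emitted.extend(map_phase(doc_id, text))
    grouped = shuffle(emitted)
    return dict(reduce_phase(term, counts) for term, counts in grouped.items())

# Made-up abstracts, for illustration only.
abstracts = {
    "a1": "interferon inhibits viral replication",
    "a2": "viral protein binds interferon receptor",
}
term_counts = run_job(abstracts)
```

In a real cluster the map calls would run on different nodes and the shuffle would move data over the network; the structure of the job stays the same.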
Cloud Based Data Store
Hadoop Distributed File System
• Name node
– Metadata (in memory)
– Data nodes
– Data blocks
– Node attributes
– Names of files
– Mapping of blocks to nodes
• Secondary name node
• Data nodes (five shown)
– Store file contents
– Each file is chunked into blocks
– Each block is spread across the data nodes
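The division of labour between the name node and the data nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions: round-robin placement, no replication, and in-memory dictionaries standing in for data nodes. Real HDFS replicates blocks and applies its own placement policy. The class `ToyNameNode` and the node names are hypothetical.

```python
from itertools import cycle

def split_into_blocks(data: bytes, block_size: int):
    """Chunk file contents into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

class ToyNameNode:
    """Keeps only metadata in memory: the block-to-node mapping per file."""

    def __init__(self, data_nodes):
        self.data_nodes = list(data_nodes)
        self.block_map = {}  # (file name, block index) -> data node name
        self._next_node = cycle(self.data_nodes)

    def add_file(self, name, data, block_size, storage):
        """Chunk the file and spread its blocks over the data nodes in turn."""
        for index, block in enumerate(split_into_blocks(data, block_size)):
            node = next(self._next_node)
            storage[node][(name, index)] = block  # contents live on the data node
            self.block_map[(name, index)] = node  # the name node records where

    def read_file(self, name, storage):
        """Look up each block's node and reassemble the file in block order."""
        indexes = sorted(i for (n, i) in self.block_map if n == name)
        return b"".join(storage[self.block_map[(name, i)]][(name, i)] for i in indexes)

nodes = ["dn1", "dn2", "dn3"]
storage = {node: {} for node in nodes}  # stand-in for the data nodes' disks
name_node = ToyNameNode(nodes)
name_node.add_file("uniprot.dat", b"ABCDEFGHIJ", block_size=4, storage=storage)
restored = name_node.read_file("uniprot.dat", storage)
```

The name node never holds file contents, only the mapping, which is why the slide lists its metadata as kept in memory.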
Data Analytics Module (Knowle)
• Semantic Technology Toolkit
• Knowle services used in the Data Analytics Module
– Data/Text mining
– Ontology Population
– Ontology Query
– Visual Ontology Query
Developed at the Institute for Infocomm Research, Singapore
Our Approach
[Figure: architecture diagram spanning the Data Infrastructure module and the Data Analytics module. Components shown: biomedical data sources, web crawler, parser, cloud-based data store, Knowle Population Service, ontology, query & reasoner, user interface.]
Web Crawler
[Figure: bio-medical data sources (UniProt, PubMed) are fetched by a UniProt crawler and a PubMed crawler, which write to the cloud-based data store.]
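The slides do not say how the crawlers fetch their data. As one hedged possibility, a PubMed crawler could page through search results using the public NCBI E-utilities `esearch` endpoint. The sketch below only builds the paged request URLs and sends nothing; the search term, the page size, and the function name `esearch_page_urls` are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (an assumption about the data source).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def esearch_page_urls(term, total, page_size):
    """Build one search URL per page of result IDs; no request is sent here."""
    urls = []
    for start in range(0, total, page_size):
        params = {
            "db": "pubmed",       # database to search
            "term": term,         # query string
            "retstart": start,    # offset of the first ID on this page
            "retmax": page_size,  # number of IDs per page
        }
        urls.append(f"{ESEARCH}?{urlencode(params)}")
    return urls

# Hypothetical query and paging values.
urls = esearch_page_urls("antiviral protein", total=2000, page_size=500)
```

Each URL is independent of the others, so the pages could be fetched by separate workers and written to the data store as they arrive.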
Parser
[Figure: crawled UniProt data and crawled PubMed data in the cloud-based data store are read by the UniProt parser and the PubMed parser, which feed the Knowle Ontology Population Service.]
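The parsing step can be illustrated with a small Python sketch. It assumes the crawled citations are stored as MEDLINE-style tagged text (`TAG - value`, with indented continuation lines); the slides do not state the actual format. The sample record is made up, and `parse_tagged_record` is a hypothetical helper.

```python
def parse_tagged_record(text):
    """Parse 'TAG - value' lines; indented lines continue the previous field."""
    record = {}
    last_tag = None
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        if line[:1].isspace() and last_tag is not None:
            # Continuation line: append to the most recent value of that tag.
            record[last_tag][-1] += " " + line.strip()
            continue
        tag, _, value = line.partition("- ")
        last_tag = tag.strip()
        # Tags such as AU may repeat, so every field is stored as a list.
        record.setdefault(last_tag, []).append(value.strip())
    return record

# Made-up record, for illustration only.
sample = """PMID- 0000000
TI  - Example title about an antiviral
      protein.
AU  - Doe J
AU  - Roe R
"""
citation = parse_tagged_record(sample)
```

The resulting dictionary is the kind of structured record a population service could consume.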
Ontology
[Figure: two ontology diagrams, the Protein Ontology and the Protein + Literature Ontology.]
Ontology Populator
• Inputs: parsed UniProt data, parsed PubMed data
• Knowle Ontology Population Service
– Populate concepts
– Assert DatatypeProperties
– Assert ObjectProperties
• Knowle Text Mining Service
– Entity detection
– Relation extraction
• Output: ontology triplestore (Protein + Literature ontology)
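The three population steps named on this slide can be sketched with plain Python tuples standing in for RDF triples. This is not the Knowle API, which the slides do not show. The class `ToyTripleStore`, the property names (`hasName`, `hasTitle`, `mentions`), and the sample identifiers are all hypothetical.

```python
class ToyTripleStore:
    """A set of (subject, predicate, object) tuples standing in for a triplestore."""

    def __init__(self):
        self.triples = set()

    def add(self, subject, predicate, obj):
        self.triples.add((subject, predicate, obj))

def populate(store, protein, citation, mentions):
    # Populate concepts: give each individual a class.
    store.add(protein["id"], "rdf:type", "Protein")
    store.add(citation["pmid"], "rdf:type", "Citation")
    # Assert datatype properties: individual -> literal value.
    store.add(protein["id"], "hasName", protein["name"])
    store.add(citation["pmid"], "hasTitle", citation["title"])
    # Assert object properties: individual -> individual. Here the links are
    # given directly; in the slides they would come from text mining.
    for pmid, protein_id in mentions:
        store.add(pmid, "mentions", protein_id)

# Made-up parsed records, for illustration only.
store = ToyTripleStore()
populate(
    store,
    protein={"id": "prot:X", "name": "Example antiviral protein"},
    citation={"pmid": "pmid:0", "title": "Example title"},
    mentions=[("pmid:0", "prot:X")],
)
```

The `mentions` link is what joins the protein data to the literature data, which is the point of a combined Protein + Literature ontology.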
Query & Reasoner
[Figure: components shown are the ontology triplestore, the OWLIM reasoner, the Sesame SAIL, the Knowle Query Service, and the user interface.]
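What a reasoner adds to a plain query can be shown with a toy example in Python: a subclass inference step followed by a triple-pattern match. This is only a sketch of the idea and does not use the OWLIM or Sesame APIs; the functions `infer_subclass_types` and `match`, and the sample facts, are hypothetical.

```python
def infer_subclass_types(triples):
    """Add rdf:type facts implied by rdfs:subClassOf until nothing new appears."""
    inferred = set(triples)
    changed = True
    while changed:
        changed = False
        subclass = [(s, o) for s, p, o in inferred if p == "rdfs:subClassOf"]
        typed = [(s, o) for s, p, o in inferred if p == "rdf:type"]
        for individual, cls in typed:
            for child, parent in subclass:
                new_fact = (individual, "rdf:type", parent)
                if cls == child and new_fact not in inferred:
                    inferred.add(new_fact)
                    changed = True
    return inferred

def match(triples, subject=None, predicate=None, obj=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (subject is None or t[0] == subject)
        and (predicate is None or t[1] == predicate)
        and (obj is None or t[2] == obj)
    )

# Made-up facts, for illustration only.
facts = {
    ("AntiviralProtein", "rdfs:subClassOf", "Protein"),
    ("prot:X", "rdf:type", "AntiviralProtein"),
    ("pmid:0", "mentions", "prot:X"),
}
without_reasoning = match(facts, predicate="rdf:type", obj="Protein")
closed = infer_subclass_types(facts)
with_reasoning = match(closed, predicate="rdf:type", obj="Protein")
```

Without the inference step the query for proteins finds nothing, because `prot:X` is only declared as an `AntiviralProtein`; with it, the same query returns `prot:X`.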
User Interface
[Figure: user interface diagram. Labels shown: Search, KnowleGator, Ontology Visual Query, Visual Query Translator, web crawler, parser, Knowle Population Service, ontology triplestore, ontology query & reasoner.]
A case study in Bio-informatics
• Integration and cross-querying of PubMed and UniProt
• Data
– 70,054 citations from PubMed
– 14,527 proteins in UniProt
• Infrastructure (virtual computers)
– 4 data nodes (RAM: 1 GB, CPU: Intel Xeon 2.4 GHz)
– 2 master nodes (1 name node, 1 secondary name node) (RAM: 512 MB, CPU: Intel Xeon 2.4 GHz)
– 1 virtual CPU = Intel Xeon 2.4 GHz
Demo
• Data
– UniProt: 853 antiviral protein entries
– PubMed: 2000 citations
Demo Snapshot
Summary
• We proposed a new framework
– Data infrastructure module (cloud-based infrastructure)
– Data analytics module (semantic technologies)
• We tested it on a prototype
– Using our own infrastructure
– With integration and cross-querying of PubMed and UniProt
Future works
• Integrated user interface
• Explore other cloud-based data stores: HBase, BigTable
• Apply the MapReduce model to data analytics and crawling
• Integrate Knowle into the cloud-based environment