Post on 13-Jan-2016
Data Management and Representations in Ecce and CMCS
Theresa L. Windus
Pacific Northwest National Laboratory
Environmental Molecular Sciences Laboratory
Molecular Science Software Group

Outline
Some “definitions”
Data and task representations
  Ecce
  CMCS
Summary
Acknowledgement

Data and metadata (one scientist’s data is another scientist’s metadata)
$\Delta H^{\circ}_{\mathrm{atomiz},0}(\mathrm{CH_3OOH}) = 522.09 \pm 2.02$ kcal/mol
data: value and uncertainty
units: kcal/mol
quantity: enthalpy of atomization
species: methylhydroperoxide, CAS# 3031-73-0
temperature: 0 K
calculated: G3//B3LYP
creator: T. Windus using Ecce
more info: http://avatar.emsl.pnl.gov:8080/Ecce/.../CH3OOH/.../GxEnergy
[calculated, G3//B3LYP, T. Windus, more at http://...]

Metadata Converts Scientific Data into Knowledge
Metadata provides identification and documentation to scientific data.
  Example: Attaching an owner, creation date, abstract, type to data.
  Example: Tracking data to program versions, and possibly bugs for that version.
Metadata documents the context and value of the data.
  Example: The theoretical atomization energy of methylhydroperoxide (and its uncertainty) from Ecce (used as input to ATcT) contains information identifying the species and the quantity, units, the theoretical method used, vibrational frequencies and geometry, reference to source file, creator, etc.
Metadata facilitates cross-scale transfer of data.
  Example: Can show a chain of inputs, including input parameters and configuration files, across scales.
  Example: Can retrieve literature references which describe this data.
Metadata allows users to comment on the data and its quality.
  Example: Can be used for scientific peer review of data.
Metadata is necessary for effective collaboration.
  Example: Scientific data becomes more usable to others when it is documented.
Annotation is another term for metadata. Annotations can be added by either the data owner or a third party.
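As a rough illustration of the data/metadata split described above, the sketch below attaches metadata to a value and its uncertainty. The class and field names are hypothetical and are not Ecce or CMCS identifiers; the numbers are taken from the CH3OOH example earlier in the talk.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotatedValue:
    """A number plus the metadata that gives it meaning (hypothetical structure)."""
    value: float
    uncertainty: float
    metadata: dict = field(default_factory=dict)

    def annotate(self, key: str, text: str) -> None:
        # Annotations may be added by the data owner or by a third party.
        self.metadata[key] = text

    def describe(self) -> str:
        units = self.metadata.get("units", "")
        quantity = self.metadata.get("quantity", "value")
        return f"{quantity}: {self.value} ± {self.uncertainty} {units}".strip()


h_atomization = AnnotatedValue(
    value=522.09,
    uncertainty=2.02,
    metadata={
        "units": "kcal/mol",
        "quantity": "enthalpy of atomization",
        "species": "methylhydroperoxide",
        "cas": "3031-73-0",
        "temperature": "0 K",
        "calculated": "G3//B3LYP",
        "creator": "T. Windus using Ecce",
    },
)
h_atomization.annotate("reviewed", "false")
print(h_atomization.describe())
```

Without the metadata dictionary the object holds only two bare numbers, which is the point the slide makes: the same pair of floats is unusable until the units, quantity, and species are attached.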

Data Pedigree: A Special Kind of Metadata
Data pedigree or data provenance is a relationship which provides a “line of ancestors”.
Pedigree allows for the categorization and tracing of the scientific data, and for the identification of the data’s ultimate origin, possibly across scales.
Pedigree includes the series of steps necessary to reproduce the data.
Data is linked, for example, to projects, references, inputs, and outputs.
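The “line of ancestors” idea can be sketched as a walk over input links. This is a minimal illustration, assuming pedigree is stored as a mapping from each item to the items it was derived from; the item names are made up and the storage layout is not the one Ecce or CMCS uses.

```python
# Hypothetical pedigree links: each item maps to the items it was derived from.
HAS_INPUT = {
    "thermo_table": ["g3_energy"],
    "g3_energy": ["nwchem_output", "gaussian_output"],
    "nwchem_output": ["nwchem_input"],
    "gaussian_output": ["gaussian_input"],
    "nwchem_input": [],
    "gaussian_input": [],
}


def ancestors(item, has_input):
    """Return every item reachable by following input links, nearest first."""
    seen = []
    queue = list(has_input.get(item, []))
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # already visited through another path
        seen.append(current)
        queue.extend(has_input.get(current, []))
    return seen


def origins(item, has_input):
    """Ultimate origins are ancestors that have no inputs of their own."""
    return [a for a in ancestors(item, has_input) if not has_input.get(a)]


print(ancestors("thermo_table", HAS_INPUT))
print(origins("thermo_table", HAS_INPUT))
```

Listing ancestors nearest-first gives the series of steps to reproduce a result when read in reverse, while `origins` answers the “ultimate origin” question directly.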

Knowledge Grid
A set of scalable tools, middleware, and services
For the creation, analysis, dissemination, evaluation, and use
Of data, information, and knowledge
By individuals, groups, and communities
…A digital place for performing ‘all’ aspects of science

Ecce & NWChem
Ecce – Extensible Computational Chemistry Environment
  comprehensive problem solving environment
  common graphical user interfaces
  scientific modeling management
  seamless transfer of information between applications
  persistent data storage through DAV
  integrated scientific data management
  tools for ensuring efficient use of computing resources across a distributed network
  visualization of multi-dimensional data structures
http://ecce.emsl.pnl.gov
NWChem – massively parallel computational chemistry program
Energetics, geometries, frequencies, etc. at various levels of theory
http://www.emsl.pnl.gov/docs/nwchem

Ecce is… (cont.)

Ecce Architecture

Distributed Authoring and Versioning (DAV)
An early web service (XML commands over HTTP)
A widely adopted standard for metadata/data transport
Put/Get data with arbitrary properties (dynamic)
Properties can be discovered and accessed independently
DASL, Versioning, Transactions, …
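To make “XML commands over HTTP” concrete, the sketch below builds the XML body of a WebDAV PROPPATCH request, the method that sets arbitrary properties on a resource. It only constructs the body and sends nothing. The namespace and property names are borrowed from the example metadata later in the talk; how Ecce actually serializes its properties is an assumption here.

```python
import xml.etree.ElementTree as ET

DAV_NS = "DAV:"
ECCE_NS = "http://www.emsl.pnl.gov/ecce"  # assumed namespace, taken from the example metadata

ET.register_namespace("D", DAV_NS)  # serialize DAV elements with the usual D: prefix


def build_proppatch(properties, namespace=ECCE_NS):
    """Build the XML body of a PROPPATCH request that sets the given properties."""
    root = ET.Element(f"{{{DAV_NS}}}propertyupdate")
    set_el = ET.SubElement(root, f"{{{DAV_NS}}}set")
    prop_el = ET.SubElement(set_el, f"{{{DAV_NS}}}prop")
    for name, value in properties.items():
        # Each property is one namespaced element whose text is the value.
        ET.SubElement(prop_el, f"{{{namespace}}}{name}").text = str(value)
    return ET.tostring(root, encoding="unicode")


body = build_proppatch({"theory": "SCF/RHF", "reviewed": "false"})
print(body)
```

A client would send this body with the PROPPATCH method to the resource URL. Reading properties back uses PROPFIND, which is why properties can be discovered independently of the file content.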

What does the WebDAV protocol provide?
[Figure: WebDAV layering. Labels: Applications; WebDAV over HTTP; DAV Server exposing Collections, Resources, and Properties; Data Storage Provider]

Accessing WebDAV Server from Windows 2000

Accessing WebDAV Server Using Browser

Accessing WebDAV Server Using Ecce
[Figure labels: Calculation, Properties, Files, BasisSet, ChemicalSystem]

Ecce Physical Model
[Figure: Ecce physical model. Relationship labels: contains, contains, is composed of. Node labels: Project, Calculation, Setup, Data/Logs, Properties, Files, Chemical System, Basis Set]
Calculations are referred to as a “virtual document” because we distribute the structure across many physical objects.
Physical collections and resources are URI addressable.
Collections are unordered and allow mixed content.

Calculation Setup
[Figure: calculation setup data flow. Labels: Calculation Editor, Builder, Basis Set Tool; .edml File (Theory Details, Runtype Details, Parameters, Geometry, ESP, Basis Set); ai.input; Input Deck; Template File (Python, Perl); Basis Set Reformatting Script (Perl)]

Output Parsing
[Figure: output parsing data flow. Labels: Output; Job Monitor; Parse Descriptor; Text Block 1 ... Text Block N; Parse Script 1 ... Parse Script N (Perl); Ecce DataBase; Calculation Viewer]

Example metadata
On the calculation:
http://www.emsl.pnl.gov/ecce:contenttype=ecceCalculation
http://www.emsl.pnl.gov/ecce:resourcetype=VIRTUAL_DOCUMENT
http://www.emsl.pnl.gov/ecce:createdWith=v3.2
http://www.emsl.pnl.gov/ecce:owner=d39974
http://www.emsl.pnl.gov/ecce:application=NWChem
http://www.emsl.pnl.gov/ecce:theory=SCF/RHF
http://www.emsl.pnl.gov/ecce:spinmultiplicity=Singlet
http://www.emsl.pnl.gov/ecce:currentVersion=v3.2
http://www.emsl.pnl.gov/ecce:creationdate=Mon, 22 Mar 2004 17:24:00 GMT
http://www.emsl.pnl.gov/ecce:reviewed=false
http://www.emsl.pnl.gov/ecce:runtype=ESP
http://www.emsl.pnl.gov/ecce:launch_machine=arunta
http://www.emsl.pnl.gov/ecce:launch_nodes=1
http://www.emsl.pnl.gov/ecce:launch_rundir=/home/d39974/ecceruns
http://www.emsl.pnl.gov/ecce:launch_totalprocs=1
http://www.emsl.pnl.gov/ecce:launch_user=d39974
http://www.emsl.pnl.gov/ecce:launch_maxmemory=0
http://www.emsl.pnl.gov/ecce:launch_remoteShell=ssh
http://www.emsl.pnl.gov/ecce:job_jobid=13858
http://www.emsl.pnl.gov/ecce:job_path=/home/d39974/ecceruns/tracebug/esp
http://www.emsl.pnl.gov/ecce:job_clienthost=arunta
http://www.emsl.pnl.gov/ecce:startdate=Mon, 22 Mar 2004 17:25:11 GMT
http://www.emsl.pnl.gov/ecce:version=Thu May 8 13:16:51 PDT 2003 Version 4.5
http://www.emsl.pnl.gov/ecce:state=Complete
http://www.emsl.pnl.gov/ecce:completiondate=Mon, 22 Mar 2004 17:25:14 GMT
DAV:resourcetype=<D:collection/>
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b2805d-1000-926a8180"
DAV:supportedlock=
DAV:getcontenttype=httpd/unix-directory
On the molecule:
http://www.emsl.pnl.gov/ecce:empiricalFormula=H4C
http://www.emsl.pnl.gov/ecce:charge=0.000000
http://www.emsl.pnl.gov/ecce:useSymmetry=false
http://www.emsl.pnl.gov/ecce:symmetrygroup=C1
DAV:creationdate=2004-03-22T17:24:38Z
DAV:getcontentlength=386
DAV:getlastmodified=Mon, 22 Mar 2004 17:24:38 GMT
DAV:getetag="b28064-182-926a8180"
DAV:executable=F
DAV:supportedlock=
DAV:getcontenttype=chemical/x-ecce-mvm
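Each property in the listing above has the shape namespace:name=value. The sketch below splits such lines and groups them by namespace. It is only a reading aid for the listing, assuming the first "=" separates the qualified name from the value; it is not how Ecce stores or retrieves properties.

```python
def parse_property_line(line):
    """Split 'namespace:name=value' into (namespace, name, value)."""
    # The first '=' ends the qualified name; values may contain ':' (dates, paths).
    qualified_name, _, value = line.partition("=")
    # The last ':' in the qualified name separates namespace from local name.
    namespace, _, name = qualified_name.rpartition(":")
    return namespace, name, value


def group_by_namespace(lines):
    """Collect properties into {namespace: {name: value}}."""
    grouped = {}
    for line in lines:
        namespace, name, value = parse_property_line(line)
        grouped.setdefault(namespace, {})[name] = value
    return grouped


listing = [
    "http://www.emsl.pnl.gov/ecce:application=NWChem",
    "http://www.emsl.pnl.gov/ecce:theory=SCF/RHF",
    "http://www.emsl.pnl.gov/ecce:creationdate=Mon, 22 Mar 2004 17:24:00 GMT",
    "DAV:getcontenttype=httpd/unix-directory",
]
properties = group_by_namespace(listing)
print(properties["http://www.emsl.pnl.gov/ecce"]["theory"])
```

Grouping this way separates the application-defined Ecce properties from the standard DAV ones that the server maintains itself.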

Example MVM file
title: demo
type: molecule
num_atoms: 1065
atom_info: symbol cart
atom_list:
O -2.37400 -3.09100 13.5210
H -1.91600 -2.20200 14.0480
...
pdb_list:
H O5* RC 1 157D A
H H5T RC 1 157D A
…
attr_list:
-0.622300 1 1 0 0
0.429500 1 1 0 0
…
atom_type_list:
OH HO …
num_bonds: 1028
bond_list:
2 1 1.00000
1 3 1.00000
…

XML format for Properties
<?xml version="1.0" encoding="utf-8" ?>
<value name="CPUSEC" units="second">9.60000000000000e-01</value>

<?xml version="1.0" encoding="utf-8" ?>
<vector name="MLKNSHELL" rows="7" units="e" rowLabel="Unknown" rowLabels="1 2 3 4 5 6 7">
1.99199825923126e+00 1.18803456337004e+00 3.08260463820159e+00 9.34340637068915e-01
9.34340635555820e-01 9.34340634042729e-01 9.34340632529639e-01
</vector>

<?xml version="1.0" encoding="utf-8" ?>
<tsvectable name="GEOMTRACE" rows="5" units="Angstrom" columns="3" vectors="1" rowLabel="Atom,Coordinate" rowLabels="0 1 2 3 4" columnLabel="Coordinate" vectorLabel="Coordinate" columnLabels="X Y Z">
<step number="1">
0.000000000000000e+00 0.000000000000000e+00 0.000000000000000e+00 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01 6.755000000000000e-01
6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01 -6.755000000000000e-01
-6.755000000000000e-01 6.755000000000000e-01 -6.755000000000000e-01
</step>
<step number="2">
6.767628142309400e-15 -6.950100046595310e-09 1.390021315920880e-08 -6.239857395114590e-01
-6.239857464615680e-01 6.239857534116811e-01 6.239857568867110e-01 6.239857499366001e-01
6.239857707869190e-01 6.239857742619920e-01 -6.239857812120860e-01 -6.239857603617700e-01
-6.239857916372510e-01 6.239857846871540e-01 -6.239857777370440e-01
</step>
<step number="3">
6.549446678833860e-15 1.124467050187860e-09 -2.248938851918010e-09 -6.252750669032320e-01
-6.252750631744280e-01 6.252750594456050e-01 6.252750588833910e-01 6.252750626121890e-01
6.252750514257610e-01 6.252750508635410e-01 -6.252750471347340e-01 -6.252750583211300e-01
-6.252750428437061e-01 6.252750465725070e-01 -6.252750503012980e-01
</step>
</tsvectable>
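Property documents of this shape can be read with a standard XML parser. The sketch below handles the value and vector forms shown above using Python's standard library. The function name and the consistency check against the rows attribute are illustrative choices, not part of Ecce.

```python
import xml.etree.ElementTree as ET

VALUE_XML = (
    b'<?xml version="1.0" encoding="utf-8" ?>'
    b'<value name="CPUSEC" units="second">9.60000000000000e-01</value>'
)
VECTOR_XML = (
    b'<?xml version="1.0" encoding="utf-8" ?>'
    b'<vector name="MLKNSHELL" rows="7" units="e" rowLabel="Unknown" '
    b'rowLabels="1 2 3 4 5 6 7">'
    b'1.99199825923126e+00 1.18803456337004e+00 3.08260463820159e+00 '
    b'9.34340637068915e-01 9.34340635555820e-01 9.34340634042729e-01 '
    b'9.34340632529639e-01</vector>'
)


def parse_property(xml_bytes):
    """Return (name, units, data) for a <value> or <vector> property document."""
    root = ET.fromstring(xml_bytes)
    numbers = [float(token) for token in root.text.split()]
    if root.tag == "value":
        data = numbers[0]
    elif root.tag == "vector":
        expected = int(root.get("rows"))
        if len(numbers) != expected:
            raise ValueError(f"expected {expected} entries, found {len(numbers)}")
        data = numbers
    else:
        raise ValueError(f"unsupported property element: {root.tag}")
    return root.get("name"), root.get("units"), data


print(parse_property(VALUE_XML))
print(parse_property(VECTOR_XML)[0])
```

The name and units travel with the numbers as attributes, so a viewer can label a property without knowing in advance which calculation produced it.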

Crossing the Molecular to Thermodynamic Scales Data Model
[Figure: pedigree diagram for vinoxy. Node labels: Vinoxy, 6-31G*, B3LYP, Optimization and Frequencies, Input Parameters, NWChem Input File, NWChem Output File, Properties, Vibrational Mode Animated GIF; Vinoxy, 6-31G*, QCISD(T,FC), Energy, Input Parameters, Gaussian Input, Gaussian Output, QCISD, Properties; Vinoxy, G3MP2large, MP2(FC), Energy, Input Parameters, NWChem Input, NWChem Output, MP2, Properties; G3(MP2)B3LYP Hf; Vinoxy NASA File. Legend: NWChem, Gaussian, Ecce, CMCS, Active Tables, Pedigree - hasInput, Pedigree - hasOutput]
Pedigree is imperative to moving data across scales.

Ecce publishing

The Multi-scale Challenge for Chemical Science
Impact of chemical science relies upon flow of information across physical scales
  Data from smaller scales supports models at larger scales
Critical science lies at scale interfaces
  Molecular properties, transport
  Mechanism validation, reduction
  Chemistry – fluid interactions
The pedigree of information matters
  The propagation of data pedigree across scales is difficult
  Validation and data reliability is often a post-publication process
Multi-scale science faces barriers
  Normal publication route is slow
  Numerous sub-disciplines employ different applications, formats, models
  Centers of excellence are geographically distributed

Multi-scale Chemical Science Data
Unique terascale reacting flow simulation databases – collection of files @ N x t, and experimental data
Chemical Mechanisms – k, MB files in various formats containing collections of reaction rates and transport coefficients. Modeled using theory, validated against experiments
Kinetic rates – by measurement and computation. Tables collected, reviewed and annotated. NIST WebBook, publications
Thermo-Chemistry – Tables of ‘constant’ properties of all molecules (of interest w/data) derived from many experiments, computations, extrapolations
Quantum chemistry computations of molecular properties – data from one number to large potential energy surfaces - input to thermo-chemistry and reaction rate computations

CMCS Spans Scales & Geography
Biggest barrier is “language” and informatics

Adaptive Informatics Infrastructure
Infrastructure – a well designed, scalable, reusable, flexible set of tools, middleware, and services
Informatics – the emerging use of semi-automated means to derive new knowledge from the analysis of (large amounts of) heterogeneous data, annotating existing data with its newly discovered meaning
Adaptive – able to dynamically change to incorporate new knowledge and support new activities
Low Barriers
  Many access points
  Storage of data in original formats with dynamic metadata extraction and translation
Powerful
  Arbitrary formats (binary, ASCII, XML)
  Integrated data, metadata, pedigree across internal and external tools
Evolvable
  Schema can be changed/extended as needed
  Metadata, translations, viewers, portal, etc. can be dynamically configured

CMCS Technical Choices Enable Adaptive, Long-lived Infrastructure
CMCS Data/Metadata services
  SAM Translation, Annotation
  WebDAV implementation
  Notification (JMS, NED)
  Search
  Pedigree browsing
  Core XML schema
  Security (JAAS)
Chemical Science Portal
  Jetspeed (CHEF)
  CMCS Explorer
  Application portlets
  Community services
Application Integration
  Webservices
  WebDAV API
  Multi-scale data including NIST access
[Figure: A diagram representing the major conceptual elements of the CMCS Informatics Infrastructure. Labels: Multi-scale Chemical Science Portal (Community Tools, Knowledge Management Tools, Research Support Tools); Chemistry Applications; Quantum Chemistry, Thermo-Chemistry, Kinetics, Chemical Mechanisms, Reacting Flow; Kineticist, Thermochemist; Shared Data Service (XML, Binary, and Text Data Sets, each with Annotation); Scientific Annotation Middleware (Parsers, Translators, Annotators, WebDAV); Local Services/Grid Fabric (Storage, Security, Event Services, Directory Services)]

How Metadata is Populated in CMCS
SAM Metadata Services Layer
  When data is put into WebDAV, SAM causes XSLTs to be executed to extract metadata from XML files, based on MIME type.
  Similarly, Binary File Descriptor (BFD) provides an interface to extract metadata from binary files.
  Other translators can be used as well.
CMCS data management/pedigree API to facilitate insertion and modification of metadata, in the proper XML format.
  Java code which allows software developers and scientists to easily write programs to add/edit metadata.
  Scientists can use these APIs to integrate with existing or new chemical science applications.
  Uses open source DAV and XML libraries.
Any WebDAV client application
  DAVExplorer: Java application
  CMCSExplorer: Integrated in the CMCS portal
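The first route above, extraction triggered by MIME type when data is stored, can be sketched as a registry of extractor functions. This is an illustration only: SAM uses registered XSLTs and Binary File Descriptors, whereas the stand-in functions, the registry, and the in-memory store below are all hypothetical.

```python
import xml.etree.ElementTree as ET

EXTRACTORS = {}  # MIME type -> extractor function (hypothetical registry)


def register(mime_type):
    """Decorator that registers an extractor for one MIME type."""
    def decorator(func):
        EXTRACTORS[mime_type] = func
        return func
    return decorator


@register("text/xml")
def extract_from_xml(content: bytes) -> dict:
    # Stand-in for an XSLT: copy the root element's attributes into metadata.
    root = ET.fromstring(content)
    return {"element": root.tag, **root.attrib}


@register("application/octet-stream")
def extract_from_binary(content: bytes) -> dict:
    # Stand-in for a Binary File Descriptor: record only the size.
    return {"size_bytes": str(len(content))}


def put(store, path, content, mime_type):
    """Store content and attach whatever metadata the registered extractor finds."""
    extractor = EXTRACTORS.get(mime_type)
    metadata = extractor(content) if extractor else {}
    metadata["content-type"] = mime_type
    store[path] = {"content": content, "metadata": metadata}
    return metadata


store = {}
xml_doc = b'<value name="CPUSEC" units="second">9.6e-01</value>'
print(put(store, "/calc/cpusec.xml", xml_doc, "text/xml"))
```

Because extractors are looked up at put time, supporting a new file type means registering one more function. Existing data and clients are untouched, which matches the “current apps will not break” goal stated in the talk.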

CMCS Metadata, Annotations, and Pedigree
Using Dublin Core for some basic pedigree properties of electronic publication: creator, dates, publisher, is-referenced-by, references, etc.
  Digital library standard for metadata
  http://www.dublincore.org
CMCS properties for Chemical Science to enable searching: species name, CAS, chemical properties, and chemical formula.
CMCS properties for defining scientific data: inputs, outputs, and is-part-of-project.
CMCS properties for scientific publication and peer review annotations: is-sanctioned-by.
More than 35 elements are currently defined in the core CMCS pedigree.
Flexible infrastructure for addition of new metadata.
  As new metadata is added to the infrastructure, current apps will not break!
CMCS metadata is strongly encouraged, though not required, for all CMCS data, and CMCS metadata is highly extensible.

Pedigree Browser Shows Input and Output Relationships

Pedigree Browsing
The Browser enables metadata editing.
Data is linked to projects, references, inputs, and outputs

Automatic Translation and Metadata Extraction
Data translations are provided automatically by SAM, using XSLTs previously registered for this file type.

Adaptive Infrastructure Enables Application Integration
[Figure: application integration diagram. Labels: MCS Portal (Portlet API; Browser, e-mail); Shared Data Repository; Grid Fabric; SAM Web service; Notification Web service; Active Table and Fitdat (each with CMCS/DAV API and Notification API); DAV; DAV+SAM; NS; ELN 5.0; Ecce; NWChem/GRID RESOURCES (launch); REACTIONLAB; SAM (MIME-type Assignment, Metadata Extraction, Translation, Pedigree Relationships); NIST Kinetics DB; Federation ML]

Initial “Automatic Reasoning” Capability

Summary
Users just want to have ease of use and flexibility in viewing output – adaptive informatics infrastructure
“Standards” are useful, but it is necessary to be able to translate between diverse “schema” and “ontologies”
Metadata converts scientific data into knowledge

Multi-disciplinary Ecce Development Team
Gary Black -- Project lead
Karen Schuchardt -- Software architect lead
Bruce Palmer -- Chemist architect
Todd Elsethagen -- Data management lead
Erich Vorpagel -- Chemist consultant
Michael Peterson -- Operations support
Mahin Hackler -- Operations support
Sue Havre -- Application development
Brett Didier -- Application development
Carina Lansing -- Application development
Steve Matsumoto -- Online help lead
Colleen Winters -- Online help
Doug Rice -- Online help

Multi-disciplinary CMCS Team
Chemical Science; Computer/Information Science
Larry Rahn*, SNL
Sandra Bittner, ANL
Brett Didier, PNL
Karen Schuchardt, PNL
James D. Myers, PNL
Theresa Windus*, PNL
Renata McCoy, SNL
Michael Lee, SNL
David Leahy, SNL
Carmen Pancerella, SNL
Christine Yang, SNL
Reinhardt Pinzon, ANL
Gregor von Laszewski, ANL
Michael Minkoff, ANL
Branko Ruscic, ANL
Al Wagner*, ANL
Carina Lansing, PNL
Eric Stephan, PNL
David Montoya*, LANL
Lili Xu, LANL
Yen-Ling Ho, LANL
Thomas C. Allison*, NIST
William H. Green, Jr.*, MIT
William Pitz*, LLNL
Baoshan Wang, ANL
Kaizar Amin, ANL
Sandeep Nijsure, ANL
Michael Frenklach*, UCB
SAM
National Collaboratory Program
Wendy Koegler, SNL
John Hewson, SNL
Ed Walsh, SNL
Elena Mendoza, PNL

Acknowledgements
This research was performed in part using the Molecular Science Computing Facility (MSCF) in the William R. Wiley Environmental Molecular Sciences Laboratory at the Pacific Northwest National Laboratory (PNNL). The MSCF is funded by the Office of Biological and Environmental Research in the U.S. Department of Energy (DOE). PNNL is operated by Battelle for the U.S. Department of Energy under contract DE-AC06-76RLO 1830. Funding is also provided by the Mathematics, Information and Computer Science and Basic Energy Sciences Division of DOE.

End