Optimized Data Migration within a System of Linked Medical Research
Databases By Jared
ChristophersonU. of
Connecticut
Problem to AddressMedical research
requires connections to multiple hospitals, institutions, or online databases
Data must be compiled manually
Process is time-consuming
General Project Goal
Project Goals:
Present data as though from a single sourceGive the researcher flexibility with viewing
the dataOptimize data flow with site caching
Real World Issue: Data Formatting
Real World Issue: Data FormattingTABLE NAME: hiv_resistance_1
[gene] [drug_class] [compound] [aa_mutation] [codon_mutation] [cite]
TABLE NAME: db_data[gene_name] [drug] [aa_info] [codon_info] [source]
TABLE NAME: ResearchDataHIV[geneName] [drugClass] [compoundInfo] [aaMutation] [codonMutation]
Solution: Master Template hiv_resistance_1: gene
db_data: gene_name
ResearchDataHIV: geneName
Gene
hiv_resistance_1: cite
db_data: source Source
hiv_resistance_1: codon_mutation
db_data: codon_info
ResearchDataHIV: codonMutation
Codon Mutation
Master Template: [Gene] [Source] [Codon Mutation]
Master Templates and Display Templates
Database Store
ID: 1Stanford Cancer Database
64.434.343.99Admin/password
ID: 2UConn Database
44.254.292.34Admin/password
ID: 3HIV Research Database
23.32.232.19Admin/password
Master Template - Cancer
Master Template – HIV Resistance
[Age] [Cancer Type] [Drug Class]
ID -> Table Name -> Field Name
[Compound] [Codon] [Result]
ID -> Table Name -> Field Name
Display Template 1
[Age] [Cancer Type] [Drug Class]
Display Template 2
[Cancer Type] [Drug Class]
Display Template 1
[Compound] [Result]
Display Template 2
[Compound] [Codon] [Result]
Basic FunctionalityProvides a simple search for usersResearchers have the option of selecting a pre-set
Display Template to only display data relevant to their needs
Queries each database individually according to the Master Template
Returns results and (optionally) compiles them into a single list
Use AJAX to return results for each database
Caching and OptimizationGoal: researchers should have fastest access
possible to the info they seekX-RAY or MRI images could be 2-5Mb in size
eachWhat if researchers in the US consistently
need access to data on a server in Asia?Local access would be fastest
USA
USAJAPAN
Caching Process
Caching Goal
Caching and Optimization: Possible SolutionsMove everything to a central serverMove records around as they are accessedCache everythingCache databases based on usage
Query and Result Set – No Caching
JAPAN
USA
SPAIN
Caching Process
JAPAN
USA
SPAIN
Caching Complete
JAPAN
USA
SPAIN
Region Caching
Database Caching Queue – What to cache?For each region, determines the top external
servers used based on a percentage of queries
Database Caching Queue
Need a method to determine the most heavily requested external databases for each region
Track statistics:Convert IP address -> region whenever a user
performs a searchIncrement result count for the record that
keeps track of the region ID and database ID
DB Queue to CacheUSA
DB1 (Japan) – 345 resultsDB7 (Spain) – 793 resultsDB16 (USA) – 539 resultsDB3 (Japan) – 491 results
JAPAN
DB1 (Japan) – 212 resultsDB7 (Spain) – 343 resultsDB16 (USA) – 112 resultsDB3 (Japan) – 312 results
USA
DB7 (Spain) – 49%DB3 (Japan) – 30%DB1 (Japan) – 21%
JAPAN
DB7 (Spain) – 75%DB16 (USA) – 25%
Caching and Optimization: Where to cache dataReal-world constraints
allow_cachesupersite
bandwidth cache_size
Iterate through regions
Get a list of servers by region
Filter by allow_cache Filter by supersite Sort by bandwidth
Are there more servers?
Iterate through server list
Sort by bandwidth Filter by non-supersite
Compare cache size needed by DB in queue to space available on server
Are there more DBs in the
queue?
Is space available?Cache DB
Iterate to next DB in the queue
Drop DB from the queue
YES
YES
Cache complete
NO
YES
NO
NO
Are there more servers?
Are there more DBs in the
queue?
Cache DB
Drop DB from the queue
YES
Cache complete
NO
Iterate through server list
Compare cache size needed by DB in queue to space available on server
Is space available?
YES
Do any servers in the list have
space?
NO
NO
YES
YES
Caching and Optimization: Script ProcessRuns at frequency set by adminThis process continues for each region with
the program assigning data to servers with progressively lower bandwidth and cache_size scores until all the server space from that region is exhausted
Caching and Optimization: Script ProcessAt the end, each region should have as many
local copies of the most frequently requested databases as possible
Cached copies are read-only
Further Work and ImprovementsAllow different types of databases (DAL)Remove overlapping dataScript to determine when individual caches
need to be updated
Top Related