DxD big data and data quality as a service svcc oct 2015
paul-bertucci
“Yes, you can plug Data Quality as a Service (DQaaS) into Big Data”
October 4th, 2015
Master Data Management for big data
October 4th, 2015
© Data by Design, LLC ⃝ www.dataXdesign.com
Big data is here to stay and expanding rapidly
The 4th “V” of big data
How your data architecture is growing
Big data, and perhaps a big mess!
Data quality as a Service for your data lake
Tools of the trade (Microsoft MDS + Profisee’s Maestro)
Plugging DQaaS into your Big Data lakes
USA (21 years)
and France (14 years)
Database/Data Architecture
– RDBMSs: Oracle, PostgreSQL, MySQL, DB2, ……
– Microsoft SQL Server/Analysis Services
– Master Data Management: MDS, Maestro, Oracle, IBM (Initiate)
– Big Data: Hadoop, ParAccel, NoSQL
Database talent pool:
– Top database and data architects
– Acclaimed Authors
– Speakers at many events and conferences
Database Tools: P&T Tool - highly graphical for Sybase, Oracle and MS SQL Server
Database Education & Training
Partnerships: Microsoft, Profisee, DesignMind
Big Data’s Rapid Expansion
Digital data (created and replicated):
Reached 4 zettabytes at the end of 2013
That’s 50% more than in 2012
And 4 times more than in 2010
Will hit 50 ZB by 2020!
Source: IDC
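As a rough check on the figures above, the implied yearly growth can be worked out directly. This is only a sketch: it assumes "50% more than in 2012" means a 1.5x ratio and "4 times more than in 2010" means a 4x ratio, and the helper name `cagr` is ours, not IDC's.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate between two values."""
    return (end / start) ** (1 / years) - 1


ZB_2013 = 4.0    # zettabytes at the end of 2013 (from the slide)
ZB_2020 = 50.0   # projected zettabytes for 2020 (from the slide)

implied_2012 = ZB_2013 / 1.5   # "50% more than in 2012", read as a 1.5x ratio
implied_2010 = ZB_2013 / 4.0   # "4 times more than in 2010", read as a 4x ratio
growth_2013_2020 = cagr(ZB_2013, ZB_2020, 2020 - 2013)

print(f"Implied 2012 volume: {implied_2012:.2f} ZB")
print(f"Implied 2010 volume: {implied_2010:.2f} ZB")
print(f"Implied annual growth 2013-2020: {growth_2013_2020:.1%}")
```

Under those readings, the 2020 projection corresponds to growth of a bit over 40% per year, sustained for seven years.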
Impact of bad data
$3,100,000,000,000: IBM’s estimate of the annual cost of bad data to the US economy (IBM BDH)
15%: surveyed executives trusting overall data (IDC)
27%: surveyed executives sure of data accuracy (IBM)
You will be (or, are already) dealing with..
Volume
Velocity
Variety
Veracity
High-Volumes of data you need to access
High-Velocity of streaming data pouring in
High-Variety of information assets (structured, semi-structured, unstructured)
AND, you need to get to this data to enable enhanced decision making, insight discovery and process optimization
Oh, and it better be good data (have Veracity) (source: IBM/Diginome)
Are you doing the right thing?
Hadoop (HDFS solutions) lends itself to problems that can be solved through distributed strategies coupled with advanced analytics.
Other problems just need a horizontally scalable solution (via MPP) with current mainstream analytics/database (like ParAccel, Teradata, PDW…)
AND, attack the quality of the data (veracity)!
Understand the problem first; next, apply the proper architecture; and finally, choose the proper tools!
Security/Access Framework
Operations Framework - Monitoring/Alerts/Workflow
Data Integration and Acquisition Framework
Analytics: Put into different perspectives and trends (forecasting)
Operational: What we did
Reactive
Data Mining: What other things might exist or are affecting what we did (why did it happen)
Predictive Modeling, Simulation & Optimization: See what is possible (next) and what is the best way to do it (prescriptive)
Proactive
Front Office
Back Office
Other
Internal
External
Social/Other
Data Services/Tools/Data Visualization
Islands of BI (non-IT)
Data Warehouse/Marts (Aggregated/Dimensional)
Operational Data Store (ODS) (Detail/Transactional/taxonomies)
SAS/SPSS
QlikView Dashboards/Light Analytics
Application Bundled Reporting
Business Objects
OLAP
“Big Data” (structured, semi & un-structured data)
Variety
Governance – Quality – Certification
Data Pipeline
Stages: Data Acquisition, Data Storage, Data Analysis
Volume:
Data Acquisition: HDFS Commands, Flume (logs), Scribe (RT stream logs), Sqoop (as needed), many others
Data Storage: HDFS (Hadoop), HBase (Big Table), Dryad, others
Data Analysis: MapReduce (Hadoop), Pig (data analysis/Pig Latin/data flow), Hive (DW for Hadoop/HiveQL), Cascading (complex MR workflows), Shark (HiveQL on Spark), Spark (in-mem/cluster computing)
Velocity:
Flume, Kafka (producer/consumer model), Kestrel (distributed message queue), few others
Storm (RT computation), Trident (operations on top of Storm), S4 (distributed stream computing), Spark Streaming (RT Spark)
Both:
Emerging: Hybrid Computational Models SummingBird, Lambdoop, others. [eliminates MapReduce, all processing paradigms supported]
Results
Value
Governance – Quality – Certification
Would you drink this?
NO, but it likely could have been prevented (or cleaned up during acquisition or earlier)
Big Data Platform
A disturbing pattern has emerged in big data
Universe of External and Internal Data – 100’s of sources, dozens of formats, no control of content
All new data flows to the big data platform
Unidentified records are just ignored
Zero data governance
Nothing is fixing bad data in the data lakes (perhaps on query?)
How do you identify what is good data versus bad data?
Data Quality as a Service
Big Data Platform
So, let’s add in data quality as a service!
Universe of External Data – 100’s of sources, dozens of formats, no control of content
All new data flows to the big data platform
Unidentified/Unseen records flow to DQaaS
DQaaS: fuzzy matches, users map, workflows occur, knowledge is built
Cleansed data flows back to Big Data Platform
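The flow above can be sketched in a few lines. This is an illustration only, not how Maestro or MDS match records: Python's standard `difflib.SequenceMatcher` stands in for the fuzzy matcher, and the master list, the 0.6 threshold, and the function names are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical master list: canonical customer names keyed by master code.
MASTERS = {
    "Master-6001": "XYZ Corporation",
    "Master-6003": "Profisee",
}

MATCH_THRESHOLD = 0.6  # assumed cutoff; tune against real data


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def dqaas_match(name: str):
    """Return (master_code, score) for the best fuzzy match, or (None, score)."""
    best_code, best_score = None, 0.0
    for code, master_name in MASTERS.items():
        score = similarity(name, master_name)
        if score > best_score:
            best_code, best_score = code, score
    if best_score >= MATCH_THRESHOLD:
        return best_code, best_score
    return None, best_score


def route(records):
    """Split incoming records into cleansed rows and rows for a data steward."""
    cleansed, needs_steward = [], []
    for rec in records:
        code, _score = dqaas_match(rec["name"])
        if code is None:
            needs_steward.append(rec)  # workflow: a user maps it by hand
        else:
            # Cleansed row carries the master code and the canonical name.
            cleansed.append({**rec, "master": code, "name": MASTERS[code]})
    return cleansed, needs_steward


incoming = [{"name": "Xyz Corp"}, {"name": "PROFISEE"}, {"name": "Acme Widgets"}]
cleansed, needs_steward = route(incoming)
print(cleansed)
print(needs_steward)
```

Records that clear the threshold flow back with a master code attached; the rest wait for a steward, and each manual mapping would extend the master list so the same record matches automatically next time.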
A big data effort we finished recently (with DQaaS)
[Architecture diagram; labels recovered from the slide]
Internal Enterprise Data - Master Data (e.g. Customers, Products, SKU’s) - Data Warehouse (Dimensional, Facts)
External Transactional Data (streams)
External 3rd Party Data
Internal Transactional Data (streams)
Internal “Other” Data
BIG DATA Data Staging/Data Acquisition
Data Quality as a Service (BUS): Masters, Matching, Cleansing, Enriching
BIG DATA Data Delivery Platform
Platform tools: Cloudera, Hive, Sqoop
Layer labels: RAW, STG, eMSTR, eSTG, Delivery
Downstream Data Analytics/Surfacing
Ingest, master, and deliver (the new data pipeline)
RAW STAGE
eMaster
DIFF
Not Mastered Yet, or not seen before
Already Mastered
Data Delivery
Mastered (“conformed”)
Not Mastered Yet
Data Quality as a Service (BUS): Masters, Matching, Cleansing, Enriching
Workflow, Data Stewardship
Via Subscription Views
Via Staged Data Tables
Things we can do easily on big data platform
Existing Data (Mastered Already + Mastering Results)
New Data
We saw this before (use the master results)
We haven’t seen this yet (master it): needs to be mastered
DIFF
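The DIFF step amounts to a lookup of new records against what has already been mastered. A minimal sketch in plain Python, assuming each record carries a stable source key; the key format, the function name, and the sample data are made up for illustration. On the big data platform itself this would more likely be expressed as a join in Hive.

```python
def diff_against_masters(new_records, mastered, key="source_key"):
    """Split new records by whether their key already has a mastering result.

    `mastered` maps a source key to its master code (the eMaster lookup).
    """
    seen_before, needs_mastering = [], []
    for rec in new_records:
        master = mastered.get(rec[key])
        if master is not None:
            seen_before.append({**rec, "master": master})  # reuse prior result
        else:
            needs_mastering.append(rec)                    # send to DQaaS
    return seen_before, needs_mastering


# Hypothetical data: keys and master codes are illustrative only.
mastered = {"CRM:6009": "Master-6001", "CRM:6003": "Master-6003"}
new_data = [
    {"source_key": "CRM:6009", "name": "Xyz Corp"},
    {"source_key": "EXT2:7001", "name": "Acme Widgets"},
]
seen, unseen = diff_against_masters(new_data, mastered)
print(seen)
print(unseen)
```

Only the unseen records need the more expensive matching and stewardship work, which is why the DIFF is cheap to run on every load.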
Filtering the Data Lake
Matching Strategies, Survivorship, Dedupe, Harmonization, Golden Records, Taxonomies, Cleansing, Standardization, Defaults
Data Quality as a Service (BUS): Masters, Matching, Cleansing, Enriching
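To make survivorship and defaults concrete, here is one possible rule, sketched under stated assumptions: for each attribute, the first non-empty value from the most trusted source survives, and a default fills anything still missing. The source ranking, the default, and the function name are hypothetical; real tools offer many other survivorship strategies.

```python
# Assumed source ranking (lower number = more trusted); purely illustrative.
SOURCE_PRIORITY = {"ERP": 0, "CRM": 1, "EXT": 2}
DEFAULTS = {"country": "US"}  # hypothetical default for missing values


def golden_record(duplicates, attributes):
    """Build one surviving record from a group of duplicates."""
    ranked = sorted(duplicates, key=lambda r: SOURCE_PRIORITY.get(r["source"], 99))
    golden = {}
    for attr in attributes:
        # The first non-empty value from the most trusted source survives.
        value = next((r[attr] for r in ranked if r.get(attr)), None)
        golden[attr] = value if value is not None else DEFAULTS.get(attr)
    return golden


group = [
    {"source": "EXT", "name": "XYZ c", "address": "329 Main Street S", "country": ""},
    {"source": "CRM", "name": "Xyz Corp", "address": "", "country": ""},
    {"source": "ERP", "name": "XYZ Corporation", "address": "", "country": ""},
]
golden = golden_record(group, ["name", "address", "country"])
print(golden)
```

Here the name survives from ERP, the address from the only source that has one, and the country falls back to the default.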
SQL Server Master Data Services
Master Data Management Platform on SQL Server
Model & Rules – Managed Schema
Security and Access
Bulk data loads & consumption – table access
Hierarchy Management
Deployment, management, versioning
Application-level transaction management
Model – for a specific business need (example “Customer Master”)
- Version
- Version Lock
Entities
Sub Entities
Members of the Model (physical data entries)
Attributes
- Name (mandatory)
- Code (mandatory)
- Free-form Attributes: text type, numeric type, date type, link type
- File Attributes: files, documents, images
Domain Based Attributes – Master List domains (types), like color, ISO Customer Segmentation, States, Provinces, Countries, and so on
Attribute Groups
Extended Attributes, User-defined Metadata
Collections
Hierarchies: Derived Hierarchies, Explicit Hierarchies
Business Rules
Transactions, Annotations
Subscription Views
Relationship labels: 1:N, groups many, may have
SQL Server Master Data Services
MDS Excel Add-in
MDS Web App
MDS Web Service
MDS
Staging Tables
IIS
SQL Server
User Experience – Stewardship, Access, Manage
Workflow – Initiate, Approve, Contribute, Calculate
Golden Record Management – Matching, Survivorship
Data Quality – Verification, Address, Person, Email
Application Integration – MDM, CRM, Federation
MDM Programmability – Web Objects, Web Services
Profisee Maestro (Empowering MDM)
Custom Apps
Workflow Forms
Web Parts
Maestro Desktop
MDS Excel Add-in
MDS Web App
MS Dynamics
Salesforce
Maestro
Maestro Web App
MDS Web Service
Maestro Web Service
MDS
Staging Tables
Maestro SDK/API
Connectors
IIS
SQL Server
Batch Integration
Real-time Integration
MSMQ
Maestro
What we can show you today!
BIG DATA Data Staging/Data Acquisition
Data Quality as a Service (BUS)
External Transactional Data (streams)
BIG DATA Data Delivery Platform
Cloudera
Ingest Enterprise Master Data
Masters, Matching, Cleansing, Enriching
Raw Customer Data
Customer Data that needs to be “mastered”
Cloudera Any Any
Mastered (“conformed”) Customer Data
Enterprise Customer Data
Maestro
MDS
Data Steward
Staging, matching, cleansing, publishing (MDS & Maestro)
MDS Demo (5 mins)
Maestro Demo (10 mins)
Great options, even better opportunities
Understand your processing and data requirements!
Strive for high quality data that is relevant to your most important business drivers/needs!
Work within a consistent framework that provides you the needed performance, access, compliance, and quality your company demands!
Plug in data quality (DQaaS) as early as you can in the big data food chain (starting at acquisition (ingest) time)
Top data experts in the industry
USA and European offices
Acclaimed Authors
Presenters at major conferences
Data Consulting Services
The Case for Master Data Management
No User Interface: We know how to fix the data, but we don’t have a place to fix it.
Cosmic Data Volumes: More data than ever before – but is it consistent? Is the noise growing faster than the signal?
Multiple Stakeholders: Departments, subsidiaries, and regulatory bodies view the world in different ways.
Data Quality is Poor: Duplicates exist even within systems and data values are missing or just plain wrong.
Externally Sourced Data: They’re talking about my things on the web, but they aren’t speaking my language.
Multiple Systems: Even my own systems don’t use the same names and don’t have the same attributes.
How do I trust that my analytics are driving correct decisions?
Query without a data map
SELECT customers who complained on Facebook more than twice
FROM Giant Hot Mess of Data
WHERE product is in this giant list
and flag = current
BUT NOT when starts with XRB0
AND ALSO include these other products from this acquisitions list
when these four conditions match but never when the country of
manufacture is Sweden ... Goes on for 16 pages...are you following this?...
Query with a data map
SELECT customers who complained on Facebook more than twice
FROM Giant Hot Mess of Data JOIN Map on known keys
WHERE product is a current reporting product
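A runnable version of the "with a data map" query might look like the following. It is a sketch using Python's built-in `sqlite3`; the table names, column names, and sample rows are all hypothetical, chosen only to show how a map table replaces pages of special-case conditions with one join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical schema: table and column names are made up for illustration.
    CREATE TABLE complaints (customer TEXT, channel TEXT, product_code TEXT);
    CREATE TABLE product_map (product_code TEXT PRIMARY KEY,
                              master_product TEXT,
                              is_current_reporting INTEGER);

    INSERT INTO complaints VALUES
        ('alice', 'facebook', 'XRB0-17'),
        ('alice', 'facebook', 'P-100'),
        ('alice', 'facebook', 'P-100'),
        ('alice', 'facebook', 'p100-legacy'),
        ('bob',   'facebook', 'P-100'),
        ('bob',   'email',    'P-100');

    INSERT INTO product_map VALUES
        ('P-100',       'Widget', 1),
        ('p100-legacy', 'Widget', 1),
        ('XRB0-17',     'Gadget', 0);
""")

# The map carries the "current reporting product" decision, so the query
# no longer needs product lists, prefix exclusions, or acquisition rules.
rows = conn.execute("""
    SELECT c.customer, COUNT(*) AS complaint_count
    FROM complaints AS c
    JOIN product_map AS m ON m.product_code = c.product_code
    WHERE c.channel = 'facebook'
      AND m.is_current_reporting = 1
    GROUP BY c.customer
    HAVING COUNT(*) > 2
    ORDER BY c.customer
""").fetchall()
print(rows)
```

The business rules live in the map's rows rather than in every analyst's WHERE clause, so a change to what counts as a reporting product is made once.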
Master Customer
Name | Code | Source | Add1.. | Customer # | Master
XYZ Corporation | Master-6001 | | 329 Main St South | C5321 | Master-6001
XYZ c | 6001 | EXT2 | 329 Main Street S | | Master-6001
XYZ Corporation | 6005 | ERP 1 | 3229 Main St | | Master-6001
Xyz Corp | 6009 | CRM | 329 Main Street So | C5321 | Master-6001
Profisee | Master-6003 | | 2520 Northwinds | C5400 | Master-6003
Profisee | 6003 | CRM | 2520 Northwinds | C5400 | Master-6003
Golden Records bind other Candidate Records
New Candidate Records are address-corrected for sure matching
Candidate Records are added to their “Master or Golden” Record Group
Golden Records may have attributes from the candidate records or new attributes altogether
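The address-correction step can be illustrated with the candidate rows from the table above. This is a simplified sketch, not Maestro's matching logic: the abbreviation table and function name are assumptions, and only exact matches on the normalized address are bound to a golden record.

```python
import re
from collections import defaultdict

# Assumed abbreviation table; a real service would use a postal reference set.
ABBREVIATIONS = {"street": "st", "south": "s", "so": "s"}


def normalize_address(address: str) -> str:
    """Lowercase, strip punctuation, and collapse common abbreviations."""
    tokens = re.sub(r"[^\w\s]", "", address.lower()).split()
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


# Candidate rows from the table: (name, code, source, address).
candidates = [
    ("XYZ c", "6001", "EXT2", "329 Main Street S"),
    ("XYZ Corporation", "6005", "ERP 1", "3229 Main St"),
    ("Xyz Corp", "6009", "CRM", "329 Main Street So"),
    ("Profisee", "6003", "CRM", "2520 Northwinds"),
]

# Golden records keyed by their normalized address.
golden_by_address = {
    normalize_address("329 Main St South"): "Master-6001",
    normalize_address("2520 Northwinds"): "Master-6003",
}

groups = defaultdict(list)
unmatched = []
for name, code, source, address in candidates:
    master = golden_by_address.get(normalize_address(address))
    if master:
        groups[master].append(code)
    else:
        unmatched.append(code)  # needs fuzzy matching or a data steward

print(dict(groups))
print(unmatched)
```

Record 6005 ("3229 Main St") does not match on address alone, yet the table binds it to Master-6001, which suggests the real matching also weighs the name or tolerates small address differences.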
Profisee Maestro: Reference Architecture
Company Feed, Industry Feed, Ratings Feed, Reference Data, Flat Files, XML
EMR1, EMR2, Credentialing1, Credentialing2, Labs1, Labs2, Data warehouse, Flat Files, XML
ERP1, ERP2, CRM1, CRM2, SCM1, SCM2, DW, BI, GL, HR