DxD big data and data quality as a service svcc oct 2015

“Yes, you can plug Data Quality as a Service (DQaaS) into Big Data” October 4th, 2015 Master Data Management for big data October 4th, 2015

Transcript of DxD big data and data quality as a service svcc oct 2015

Page 1: DxD big data and data quality as a service svcc oct 2015

“Yes, you can plug Data Quality as a Service (DQaaS) into Big Data”

October 4th, 2015

Master Data Management for big data

October 4th, 2015

Page 2: DxD big data and data quality as a service svcc oct 2015

© Data by Design, LLC ⃝ www.dataXdesign.com

Big data is here to stay and expanding rapidly

The 4th “V” of big data

How your data architecture is growing

Big data, and perhaps a big mess!

Data quality as a Service for your data lake

Tools of the trade (Microsoft MDS + Profisee’s Maestro)

Plugging DQaaS into your Big Data lakes

Page 3: DxD big data and data quality as a service svcc oct 2015


USA (21 years)

and France (14 years)

Database/Data Architecture – RDBMS’s:

Oracle, PostgreSQL, MySQL, DB2, ...

Microsoft SQL Server/Analysis Services

– Master Data Management:

MDS

Maestro

Oracle

IBM (Initiate)

– Big Data:

Hadoop, ParAccel, NoSQL

Database talent pool: – Top database and data architects

– Acclaimed Authors

– Speakers at many events and conferences

Database Tools: P&T Tool - highly graphical for Sybase, Oracle and MS SQL Server

Database Education & Training

Partnerships

Microsoft

Profisee

DesignMind

Page 4: DxD big data and data quality as a service svcc oct 2015


Page 5: DxD big data and data quality as a service svcc oct 2015

Big Data’s Rapid Expansion

Digital data (created and replicated):

Reached 4 zettabytes at the end of 2013

That’s 50% more than in 2012

And 4 times more than in 2010

Will hit 50 ZB by 2020!

Source: IDC
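As a rough check on what these figures imply, the sketch below derives the earlier volumes and the compound annual growth rate from the numbers on the slide. It assumes "4 times more than in 2010" means a 4x ratio and treats the 2020 figure as a point forecast; the function name `cagr` is just an illustrative helper.

```python
# Figures quoted on the slide (attributed to IDC); units are zettabytes.
zb_2013 = 4.0
zb_2020 = 50.0

# Implied earlier volumes, taking the slide's ratios at face value.
zb_2012 = zb_2013 / 1.5  # 2013 was "50% more than in 2012"
zb_2010 = zb_2013 / 4.0  # assumption: "4 times more" read as a 4x ratio


def cagr(start, end, years):
    """Compound annual growth rate between two volumes."""
    return (end / start) ** (1.0 / years) - 1.0


growth_2013_2020 = cagr(zb_2013, zb_2020, 2020 - 2013)
print(f"Implied 2012 volume: {zb_2012:.2f} ZB")
print(f"Implied 2010 volume: {zb_2010:.2f} ZB")
print(f"Implied growth 2013-2020: {growth_2013_2020:.1%} per year")
```

Under these assumptions the forecast corresponds to growth of a little over 40% per year.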

Page 6: DxD big data and data quality as a service svcc oct 2015


Impact of bad data

$3,100,000,000,000

IBM’s Estimate of Annual Cost of Bad Data to US Economy (IBM BDH)

15% of surveyed executives trust their overall data (IDC)

27% of surveyed executives are sure of their data’s accuracy (IBM)

Page 7: DxD big data and data quality as a service svcc oct 2015


You will be (or already are) dealing with...

Volume

Velocity

Variety

Veracity

High-Volumes of data you need to access

High-Velocity of streaming data pouring in

High-Variety of information assets (structured, semi-structured, unstructured)

AND, you need to get to this data to enable enhanced decision making, insight discovery and process optimization

Oh, and it better be good data (have Veracity) (source: IBM/Diginome)

Page 8: DxD big data and data quality as a service svcc oct 2015


Are you doing the right thing?

Hadoop (HDFS solutions) lends itself to problems that can be solved through distributed strategies coupled with advanced analytics.

Other problems just need a horizontally scalable solution (via MPP) with current mainstream analytics/database (like ParAccel, Teradata, PDW…)

AND attack the quality of the data (veracity)!

Understand the problem first; next, apply the proper architecture; and finally, choose the proper tools!

Page 9: DxD big data and data quality as a service svcc oct 2015


Security/Access Framework

Operations Framework - Monitoring/Alerts/Workflow

Data Integration and Acquisition Framework

Analytics: Put into different perspectives and trends (forecasting)

Operational: What we did

Reactive

Data Mining: What other things might exist or are affecting what we did (why did it happen)

Predictive Modeling, Simulation & Optimization: See what is possible (next) and what is the best way to do it (prescriptive)

Proactive

Front Office

Back Office

Other Internal

External

Social/Other

Data Services/Tools/Data Visualization

Islands of BI (non-IT)

Data Warehouse/Marts (Aggregated/Dimensional)

Operational Data Store (ODS) (Detail/Transactional/taxonomies)

SAS/SPSS

QlikView Dashboards/Light Analytics

Application Bundled Reporting

Business Objects

OLAP

“Big Data” (structured, semi & un-structured data)

“Big Data” (structured, semi & un-structured data)

Variety

Governance – Quality - Certification

Page 10: DxD big data and data quality as a service svcc oct 2015


Data Pipeline


Data Acquisition

Data Storage

Data Analysis

HDFS Commands

Flume (logs)

Scribe (RT stream logs)

Sqoop (as needed)

Many others

HDFS (Hadoop)

HBase (Bigtable)

Dryad

Others

MapReduce (Hadoop)

Pig (data analysis/pig latin/data flow)

Hive (DW for Hadoop/HiveQL)

Cascading (complex MR workflows)

Shark (HiveQL on Spark)

Spark (In-mem/cluster computing)

Flume

Few others

Kafka (producer/consumer model)

Kestrel (distributed message queue)

Storm (RT computation)

Trident (Operations on top of Storm)

S4 (distributed stream computing)

Spark Streaming (RT Spark)

Emerging: Hybrid Computational Models SummingBird, Lambdoop, others.

[eliminates MapReduce, all processing paradigms supported]

Volume

Velocity

Both

Results

Value

Governance – Quality – Certification

Page 11: DxD big data and data quality as a service svcc oct 2015

Would you drink this?

NO, but it likely could have been prevented (or cleaned up during acquisition or earlier)

Page 12: DxD big data and data quality as a service svcc oct 2015


Big Data Platform

A disturbing pattern has emerged in big data

Universe of External and Internal Data – 100’s of sources, dozens of formats, no control of content

All new data flows to the big data platform

Unidentified records are just ignored

Zero data governance

Nothing is fixing bad data in the data lakes (perhaps on query?)

How do you identify what is good data versus bad data?

Page 13: DxD big data and data quality as a service svcc oct 2015


Data Quality as a Service

Big Data Platform

So, let’s add in data quality as a service!

Universe of External Data – 100’s of sources, dozens of formats, no control of content

All new data flows to the big data platform

Unidentified/Unseen records flow to DQaaS

DQaaS: fuzzy matches, users map, workflows occur, knowledge is built

Cleansed data flows back to Big Data Platform
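The fuzzy-matching step in this flow can be sketched in a few lines. This is a minimal illustration only, not how MDS or Maestro match records: it uses Python's standard `difflib.SequenceMatcher`, and the master list, the `normalize` helper, and the 0.6 threshold are all assumptions made for the example.

```python
from difflib import SequenceMatcher

# Hypothetical master list: master code -> mastered name.
MASTERS = {
    "Master-6001": "XYZ Corporation",
    "Master-6003": "Profisee",
}


def normalize(name):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def best_match(name, masters=MASTERS, threshold=0.6):
    """Return (master_code, score) for the closest master name, or
    (None, score) when nothing reaches the threshold."""
    best_code, best_score = None, 0.0
    for code, master_name in masters.items():
        score = SequenceMatcher(None, normalize(name), normalize(master_name)).ratio()
        if score > best_score:
            best_code, best_score = code, score
    if best_score >= threshold:
        return best_code, best_score
    return None, best_score


for incoming in ["Xyz Corp", "PROFISEE", "Acme Widgets"]:
    code, score = best_match(incoming)
    # Unmatched names would go to a data steward for mapping.
    print(incoming, "->", code or "needs stewardship", round(score, 2))
```

Records that fall below the threshold are the ones a user would map by hand, and each manual mapping adds to the knowledge the service can reuse.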

Page 14: DxD big data and data quality as a service svcc oct 2015


A big data effort we finished recently (with DQaaS)

Downstream Data Analytics/Surfacing

Internal Enterprise Data - Master Data (e.g. Customers, Products, SKU’s) - Data Warehouse (Dimensional, Facts)

BIG DATA Data Staging/Data Acquisition

Data Quality as a Service (BUS)

External Transactional Data (streams)

External 3rd Party Data

Internal Transactional Data (streams)

Internal “Other” Data

BIG DATA Data Delivery Platform

[Diagram labels: Cloudera, Hives, Sqoop; zones RAW, STG, eMSTR, eSTG, Delivery]

Masters Matching Cleansing Enriching

Page 15: DxD big data and data quality as a service svcc oct 2015


Ingest, master, and deliver (the new data pipeline)

RAW STAGE

eMaster

DIFF

Not Mastered Yet, or not seen before

Already Mastered

Data Delivery

Mastered (“conformed”)

Not Mastered Yet

Data Quality as a Service (BUS) Masters Matching Cleansing Enriching

Workflow Data Stewardship

Via Subscription Views

Via Staged Data Tables

Page 16: DxD big data and data quality as a service svcc oct 2015


Things we can do easily on a big data platform

Existing Data (Mastered Already + Mastering Results)

New Data:

We saw this before (use the master results)

We haven’t seen this yet (master it): needs to be mastered

DIFF
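The DIFF step amounts to splitting new data by whether its key has already been mastered. The sketch below shows the idea in plain Python; on a real platform this would more likely be a join in Hive or a similar engine. The `(source, code)` key and the record fields are hypothetical.

```python
def diff_against_masters(new_records, mastered_keys):
    """Split incoming records by whether their (source, code) key is already mastered."""
    seen_before, needs_mastering = [], []
    for record in new_records:
        key = (record["source"], record["code"])
        if key in mastered_keys:
            seen_before.append(record)  # reuse the existing master results
        else:
            needs_mastering.append(record)  # send to DQaaS
    return seen_before, needs_mastering


# Hypothetical keys that have already been through mastering.
mastered_keys = {("CRM", "6009"), ("ERP", "6005")}

new_records = [
    {"source": "CRM", "code": "6009", "name": "Xyz Corp"},
    {"source": "EXT", "code": "7001", "name": "Unknown Trading Co"},
]

seen_before, needs_mastering = diff_against_masters(new_records, mastered_keys)
print("Already mastered:", [r["code"] for r in seen_before])
print("Needs mastering:", [r["code"] for r in needs_mastering])
```

Only the second list has to leave the big data platform, which keeps the volume sent to the data quality service small.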

Page 17: DxD big data and data quality as a service svcc oct 2015

Filtering the Data Lake

Matching Strategies, Survivorship, Dedupe, Harmonization, Golden Records, Taxonomies, Cleansing, Standardization, Defaults

Data Quality as a Service (BUS) Masters Matching Cleansing Enriching
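Of the techniques listed, survivorship is the easiest to show in code: for each attribute of the golden record, pick the value from the most trusted source that actually has one. The sketch below is one simple rule among many; the source ranking, the field names, and the sample records are invented for illustration.

```python
# Hypothetical trust ranking of source systems, most trusted first.
SOURCE_PRIORITY = ["CRM", "ERP", "EXT"]


def build_golden_record(candidates, attributes, priority=SOURCE_PRIORITY):
    """For each attribute, keep the first non-empty value found when the
    candidate records are ordered by source trust."""
    rank = {source: position for position, source in enumerate(priority)}
    # Sources missing from the ranking sort last.
    ordered = sorted(candidates, key=lambda rec: rank.get(rec["source"], len(rank)))
    golden = {}
    for attribute in attributes:
        golden[attribute] = next(
            (rec[attribute] for rec in ordered if rec.get(attribute)), None
        )
    return golden


candidates = [
    {"source": "EXT", "name": "Acme Inc", "phone": "555-0100", "segment": ""},
    {"source": "ERP", "name": "ACME Incorporated", "phone": "", "segment": "Wholesale"},
    {"source": "CRM", "name": "Acme Incorporated", "phone": "", "segment": ""},
]

golden = build_golden_record(candidates, ["name", "phone", "segment"])
print(golden)
```

Here the name survives from the CRM record, while the phone and segment fall through to less trusted sources because the CRM values are empty.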

Page 18: DxD big data and data quality as a service svcc oct 2015


SQL Server Master Data Services

Master Data Management Platform on SQL Server

Model & Rules – Managed Schema

Security and Access

Bulk data loads & consumption – table access

Hierarchy Management

Deployment, management, versioning

Application-level transaction management

Page 19: DxD big data and data quality as a service svcc oct 2015


Entities

Model

Extended Attributes User-defined Metadata

Sub Entities

Collections

Derived Hierarchies

Explicit Hierarchies

Domain Based Attributes

Attributes

Attribute Groups

Business Rules

Name (mandatory)

Code (mandatory)

Free-form Attributes: text type, numeric type, date type, link type

File Attributes: files, documents, images

Master List domains (types), like color, ISO Customer Segmentation, States, Provinces, Countries, and so on.

Members of the Model (physical data entries)

For a specific business need (example “Customer Master”) - Version

- Version Lock

1:N groups many 1:N may have

Transactions

Annotations

Hierarchies

Subscription Views

Page 20: DxD big data and data quality as a service svcc oct 2015


MDS Excel Add-in

MDS Web App

MDS Web App

MDS Web Service

MDS

Staging Tables

IIS

SQL Server

SQL Server Master Data Services

Page 21: DxD big data and data quality as a service svcc oct 2015


User Experience – Stewardship, Access, Manage

Workflow – Initiate, Approve, Contribute, Calculate

Golden Record Management – Matching, Survivorship

Data Quality – Verification, Address, Person, Email

Application Integration – MDM, CRM, Federation

MDM Programmability – Web Objects, Web Services

Profisee Maestro (Empowering MDM)

Page 22: DxD big data and data quality as a service svcc oct 2015


Custom Apps

Workflow Forms

Web Parts

Maestro Desktop

MDS Excel Add-in

MDS Web App

MS Dynamics

Salesforce

Maestro

Maestro

Maestro Web App MDS Web App

MDS Web Service Maestro Web Service

MDS

Staging Tables

Maestro SDK/API

Connectors

IIS

SQL Server

Batch Integration

Real-time Integration

MSMQ

Page 23: DxD big data and data quality as a service svcc oct 2015


Maestro

What we can show you today!

BIG DATA Data Staging/Data Acquisition

Data Quality as a Service (BUS)

External Transactional Data (streams)

BIG DATA Data Delivery Platform

Cloudera

Ingest Enterprise Master Data

Masters Matching Cleansing Enriching

Raw Customer Data

Customer Data that needs to be “mastered”

Cloudera Any Any

Mastered (“conformed”) Customer Data

Enterprise Customer Data

Maestro

MDS

Data Steward

Page 24: DxD big data and data quality as a service svcc oct 2015


Staging, matching, cleansing, publishing (MDS & Maestro)

MDS Demo (5 mins)

Maestro Demo (10 mins)

Page 25: DxD big data and data quality as a service svcc oct 2015


Great options, even better opportunities

Understand your processing and data requirements!

Strive for high quality data that is relevant to your most important business drivers/needs!

Work within a consistent framework that provides you the needed performance, access, compliance, and quality your company demands!

Plug in data quality (DQaaS) as early as you can in the big data food chain (starting at acquisition (ingest) time)

Page 26: DxD big data and data quality as a service svcc oct 2015


Top data experts in the industry

USA and European offices

Acclaimed Authors

Presenters at major conferences

[email protected]

Data Consulting Services

Page 27: DxD big data and data quality as a service svcc oct 2015


Page 28: DxD big data and data quality as a service svcc oct 2015


No User Interface: We know how to fix the data, but we don’t have a place to fix it.

Cosmic Data Volumes: More data than ever before, but is it consistent? Is the noise growing faster than the signal?

Multiple Stakeholders: Departments, subsidiaries, and regulatory bodies view the world in different ways.

Data Quality is Poor: Duplicates exist even within systems, and data values are missing or just plain wrong.

Externally Sourced Data: They’re talking about my things on the web, but they aren’t speaking my language.

Multiple Systems: Even my own systems don’t use the same names and don’t have the same attributes.

How do I trust that my analytics are driving correct decisions?

The Case for Master Data Management

Page 29: DxD big data and data quality as a service svcc oct 2015


Query without a data map

SELECT customers who complained on Facebook more than twice

FROM Giant Hot Mess of Data

WHERE product is in this giant list

and flag = current

BUT NOT when starts with XRB0

AND ALSO include these other products from this acquisitions list

when these four conditions match but never when the country of

manufacture is Sweden ... Goes on for 16 pages...are you following this?...

Query with a data map

SELECT customers who complained on Facebook more than twice

FROM Giant Hot Mess of Data JOIN Map on known keys

WHERE product is a current reporting product
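To make the "with a data map" version concrete, here is a small runnable sketch using Python's built-in `sqlite3` module. The table names, columns, and sample rows are all hypothetical; the point is only that the long list of product rules collapses into a join against a mastered map with a single flag.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE complaints (customer TEXT, channel TEXT, product_code TEXT);
-- The "map": mastered product codes with a current-reporting flag.
CREATE TABLE product_map (product_code TEXT PRIMARY KEY, is_current_reporting INTEGER);

INSERT INTO complaints VALUES
  ('alice', 'facebook', 'P1'), ('alice', 'facebook', 'P1'), ('alice', 'facebook', 'P2'),
  ('bob',   'facebook', 'P1'), ('bob',   'email',    'P1'), ('bob',   'facebook', 'P9'),
  ('carol', 'facebook', 'P9'), ('carol', 'facebook', 'P9'), ('carol', 'facebook', 'P9');
INSERT INTO product_map VALUES ('P1', 1), ('P2', 1), ('P9', 0);
""")

# Customers who complained on Facebook more than twice about current reporting products.
query = """
SELECT c.customer, COUNT(*) AS complaint_count
FROM complaints AS c
JOIN product_map AS m ON m.product_code = c.product_code
WHERE c.channel = 'facebook' AND m.is_current_reporting = 1
GROUP BY c.customer
HAVING COUNT(*) > 2
ORDER BY c.customer
"""
rows = conn.execute(query).fetchall()
print(rows)
```

The business rules about which products count live in the map table, maintained by data stewards, rather than in every analyst's WHERE clause.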

Page 30: DxD big data and data quality as a service svcc oct 2015


| Name | Code | Source | Add1.. | Customer # | Master |
|---|---|---|---|---|---|
| XYZ Corporation | Master-6001 | | 329 Main St South | C5321 | Master-6001 |
| XYZ c | 6001 | EXT2 | 329 Main Street S | | Master-6001 |
| XYZ Corporation | 6005 | ERP 1 | 3229 Main St | | Master-6001 |
| Xyz Corp | 6009 | CRM | 329 Main Street So | C5321 | Master-6001 |
| Profisee | Master-6003 | | 2520 Northwinds | C5400 | Master-6003 |
| Profisee | 6003 | CRM | 2520 Northwinds | C5400 | Master-6003 |

Master Customer

Golden Records Bind other Candidate Records

New Candidate Records are Address Corrected for Sure Matching

Candidate Records are added to their “Master or Golden” Record Group

Golden Records may have attributes from the candidate records or new attributes altogether
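The address correction step can be illustrated with the XYZ addresses from this slide. The sketch below standardizes a few abbreviations so that variants of the same address compare equal; the abbreviation table is a tiny hypothetical stand-in for the postal reference data a real address verification service would use.

```python
# Hypothetical abbreviation table; real address correction relies on postal reference data.
ABBREVIATIONS = {
    "street": "ST", "st": "ST",
    "south": "S", "so": "S", "s": "S",
}


def standardize_address(address):
    """Uppercase the address and replace known words with a standard abbreviation."""
    tokens = address.replace(".", "").replace(",", " ").split()
    return " ".join(ABBREVIATIONS.get(token.lower(), token.upper()) for token in tokens)


addresses = ["329 Main St South", "329 Main Street S", "329 Main Street So"]
standardized = {standardize_address(a) for a in addresses}
print(standardized)

# A different house number stays different, so it still needs fuzzy matching or a steward.
print(standardize_address("3229 Main St"))
```

After standardization the three variants collapse to one value, which is what makes an exact ("sure") match against the golden record possible.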

Page 31: DxD big data and data quality as a service svcc oct 2015


Profisee Maestro: Reference Architecture

Company Feed Industry Feed Ratings Feed Reference Data Flat Files XML

EMR1 EMR2 Credentialing1 Credentialing2 Labs1 Labs2 Datawarehouse Flat Files XML

ERP1 ERP2 CRM1 CRM2 SCM1 SCM2 DW BI GL HR