Introduction to Data Warehousing - :: 건국대학교...

Introduction to Data Warehousing

Ki-Joon HanDatabase lab.

Konkuk University

2

Outline of Talk

• Data Warehousing and Information Integration

• Brief History of Data Warehousing• OLTP vs. OLAP• What is a Data Warehouse?• Types of Data and Their Uses• Data Warehouse Architectures• Issues in Data Warehousing• Course Objectives

3

A Brief History of Information Technology

• The “dark ages”: paper forms in file cabinets• Computerized systems emerge

– Initially for big projects like Social Security– Same functionality as old paper-based systems

• The “golden age”: databases are everywhere– Most activities tracked electronically– Stored data provides detailed history of activity

• The next step: use data for decision-making– Made possible by omnipresence of IT– Identify inefficiencies in current processes– Quantify likely impact of decisions

4

Databases for Decision Support• 1st phase: Automating existing processes makes

them more efficient.– Automation → Lots of well-organized, easily accessed

data

• 2nd phase: Data analysis allows for better decision-making. – Analyze data → better understanding– Better understanding → better decisions

• “Data Entry” vs. “Thinking”– Data analysts are decision-makers: managers,

executives, etc.

5

Problem: Heterogeneous Information Sources

“Heterogeneities are everywhere”

Different interfacesDifferent data representationsDuplicate and inconsistent information

PersonalDatabases

Digital Libraries

Scientific DatabasesWorldWideWeb

6

Problem: Data Management in Large Enterprises

• Vertical fragmentation of informational systems

• Result of application (user)-driven development of operational systems

Sales Administration Finance Manufacturing ...

Sales PlanningStock Mngmt

...

Suppliers

...Debt Mngmt

Num. Control

...Inventory

7

Goal: Unified Access to Data

Integration System

• Collects and combines information• Provides integrated view, uniform user interface• Supports sharing

WorldWideWeb

Digital Libraries Scientific Databases

PersonalDatabases

8

The Traditional Research Approach

Source SourceSource. . .

Integration System

. . .

Metadata

Clients

Wrapper WrapperWrapper

• Query-driven (lazy, on-demand)

9

Disadvantages of Query-Driven Approach

• Delay in query processing– Slow or unavailable information sources– Complex filtering and integration

• Inefficient and potentially expensive for frequent queries

• Competes with local processing at sources• Hasn’t caught on in industry

10

The Warehousing Approach

DataDataWarehouseWarehouse

Clients

Source SourceSource. . .

Extractor/Monitor

Integration System

. . .

Metadata

Extractor/Monitor

Extractor/Monitor

• Information integrated in advance

• Stored in DWfor direct querying and analysis

11

Advantages of Warehousing Approach• High query performance

– But not necessarily most current information• Doesn’t interfere with local processing at sources

– Complex queries at warehouse– OLTP at information sources

• Information copied at warehouse– Can modify, annotate, summarize, restructure, etc.– Can store historical information– Security, no auditing

• Has caught on in industry

12

Not Either-Or Decision

• Query-driven approach still better for– Rapidly changing information– Rapidly changing information sources– Truly vast amounts of data from large numbers

of sources– Clients with unpredictable needs

13

Federated Databases• An alternative to data warehouses• Data warehouse

– Create a copy of all the data – Execute queries against the copy

• Federated database – Pull data from source systems as needed to answer

queries• “lazy” vs. “eager” data integration

Data WarehouseFederated Database

Query

Answer

QueryExtraction

RewrittenQueries

AnswerSourceSystems

Warehouse Mediator

SourceSystems

14

Warehouses vs. Federation• Advantages of federated databases:

– No redundant copying of data– Queries see “real-time” view of evolving data– More flexible security policy

• Disadvantages of federated databases:– Analysis queries place extra load on transactional systems– Query optimization is hard to do well– Historical data may not be available– Complex “wrappers” needed to mediate between analysis

server and source systems• Data warehouses are much more common in practice

– Better performance– Lower complexity– Slightly out-of-date data is acceptable

15

OLTP vs. OLAP

• OLTP: On-Line Transaction Processing– Many short transactions (queries

+ updates)– Examples:

• Update account balance• Enroll in course• Add book to shopping cart

– Queries touch small amounts of data (one record or a few records)

– Updates are frequent– Concurrency is biggest

performance concern

• OLAP: On-Line Analytical Processing– Long transactions, complex

queries– Examples:

• Report total sales for each department in each month

• Identify top-selling books• Count classes with fewer than

10 students– Queries touch large amounts

of data– Updates are infrequent– Individual queries can require

lots of resources

16

Why OLAP & OLTP don’t mix (1)

• Transaction processing (OLTP):– Fast response time important (< 1 second)– Data must be up-to-date, consistent at all times

• Data analysis (OLAP):– Queries can consume lots of resources– Can saturate CPUs and disk bandwidth– Operating on static “snapshot” of data usually OK

• OLAP can “crowd out” OLTP transactions– Transactions are slow → unhappy users

• Example: – Analysis query asks for sum of all sales– Acquires lock on sales table for consistency– New sales transaction is blocked

Different performance requirements

17


• Transaction processing (OLTP):– Normalized schema for consistency– Complex data models, many tables– Limited number of standardized queries and updates

• Data analysis (OLAP):– Simplicity of data model is important

• Allow semi-technical users to formulate ad hoc queries

– De-normalized schemas are common• Fewer joins → improved query performance• Fewer tables → schema is easier to understand

Different data modeling requirements

18


• An OLTP system targets one specific process– For example: ordering from an online store

• OLAP integrates data from different processes– Combine sales, inventory, and purchasing data– Analyze experiments conducted by different labs

• OLAP often makes use of historical data– Identify long-term patterns– Notice changes in behavior over time

• Terminology, schemas vary across data sources– Integrating data from disparate sources is a major

challenge

Analysis requires data from many sources

19

Data Warehouses

• Doing OLTP and OLAP in the same database system is often impractical– Different performance requirements– Different data modeling requirements– Analysis queries require data from many sources

• Solution: Build a “data warehouse”– Copy data from various OLTP systems– Optimize data organization, system tuning for OLAP– Transactions aren’t slowed by big analysis queries– Periodically refresh the data in the warehouse

20

Data Warehouse EvolutionT

IME

200019951990198519801960 1975

Information-Based Management

DataRevolution

“MiddleAges”

“PrehistoricTimes”

RelationalDatabases

PC’s andSpreadsheets

End-userInterfaces

1st DW Article

DWConfs.

Vendor DWFrameworks

CompanyDWs

“Building theDW”

Inmon (1992)Data Replication

Tools

21

What is a Data Warehouse?A Practitioners Viewpoint

“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”-- Barry Devlin, IBM Consultant

22

What is a Data Warehouse?An Alternative Viewpoint

“A DW is a – subject-oriented,– integrated,– time-varying,– non-volatile

collection of data that is used primarily in organizational decision making.”

-- W.H. Inmon, Building the Data Warehouse, 1992

23

A Data Warehouse is...• Stored collection of diverse data

– A solution to data integration problem– Single repository of information

• Subject-oriented– Organized by subject, not by application– Used for analysis, data mining, etc.

• Optimized differently from transaction-oriented db

• User interface aimed at executive

24

A Data Warehouse is...(Cont’d)• Large volume of data (Gb, Tb)• Non-volatile

– Historical– Time attributes are important

• Updates infrequent• May be append-only• Examples

– All transactions ever at WalMart– Complete client histories at insurance firm– Stockbroker financial information and portfolios

25

Summary

Operational SystemsEnterpriseModeling

BusinessInformation Guide

DataWarehouse

CatalogData Warehouse

Population

DataWarehouse

Business InformationInterface

26

Warehouse is a Specialized DBStandard DB

• Mostly updates• Many small transactions• Mb - Gb of data• Current snapshot• Index/hash on p.k.• Raw data• Thousands of users (e.g.,

clerical users)

Warehouse• Mostly reads• Queries are long and complex• Gb - Tb of data• History• Lots of scans• Summarized, reconciled data• Hundreds of users (e.g.,

decision-makers, analysts)

27

Types of Data• Business Data - represents meaning

– Real-time data (ultimate source of all business data)– Reconciled data– Derived data

• Metadata - describes meaning– Build-time metadata– Control metadata– Usage metadata

• Data as a product* - intrinsic meaning– Produced and stored for its own intrinsic value– e.g., the contents of a text-book

28

Data Warehouse Architectures: Conceptual View

• Single-layer– Every data element is stored once only– Virtual warehouse

• Two-layer– Real-time + derived data– Most commonly used approach in

industry today

“Real-time data”

Operationalsystems

Informationalsystems

Derived Data

Real-time data

Operationalsystems


29

Three-layer Architecture: Conceptual View

• Transformation of real-time data to derived data really requires two steps

Derived Data

Real-time data

Operationalsystems


Reconciled Data Physical Implementationof the Data Warehouse

View level“Particular informational

needs”

30

Data Warehousing: Two Distinct Issues

(1) How to get information into warehouse“Data warehousing”

(2) What to do with data once it’s in warehouse“Warehouse DBMS”

• Both rich research areas• Industry has focused on (2)

31

Issues in Data Warehousing• Warehouse Design• Extraction

– Wrappers, monitors (change detectors)• Integration

– Cleansing & merging• Warehousing specification & Maintenance• Optimizations• Miscellaneous (e.g., evolution)

32

Loading the Data Warehouse

Source Systems Data Staging Area Data Warehouse(OLTP)

Data is periodically extracted

Data is cleansed and transformed

Users query the data warehouse

33

Data Extraction

• Source types– Relational, flat file, WWW, etc.

• How to get data out?– Replication tool– Dump file– Create report– ODBC or third-party “wrappers”

34

WrapperConverts data and queries from one data model to another

Extends query capabilities for sources with limited capabilities

DataModel

B

DataModel

A

Queries

Data

Queries SourceWrapper

35

Data Transformations

• Convert data to uniform format– Byte ordering, string termination– Internal layout

• Remove, add & reorder attributes– Add key– Add data to get history

• Sort tuples

36

Monitors

• Goal: Detect changes of interest and propagate to integrator

• How?– Triggers– Replication server– Log sniffer– Compare query results– Compare snapshots/dumps

37

Data Integration

• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse

• Rule-based• Actions

– Resolve inconsistencies– Eliminate duplicates– Integrate into warehouse (may not be empty)– Summarize data– Fetch more data from sources (DW updates)– etc.

38

Data Integration is Hard

• Data warehouses combine data from multiple sources• Data must be translated into a consistent format• Data integration represents ~80% of effort for a

typical data warehouse project!• Some reasons why it’s hard:

– Metadata is poor or non-existent– Data quality is often bad

• Missing or default values• Multiple spellings of the same thing

(Cal vs. UC Berkeley vs. University of California)– Inconsistent semantics

• What is a tree ?

39

Data Cleansing

• Find (& remove) duplicate tuples– e.g., Jane Doe vs. Jane Q. Doe

• Detect inconsistent, wrong data– Attribute values that don’t match

• Patch missing, unreadable data• Notify sources of errors found

40

Course Objectives

• Gain practical understanding of how data warehouses are built and used

• Gain exposure to data modeling “best practices”• Learn techniques used to process complex queries

over very large data sets• Understand the performance trade-offs that come

from alternative data structures• Learn commonly-used methods for mining and

analysis of large data sets• Become familiar with current research directions

in data warehousing and related areas

41

Research Topics• Logical Database Design

– How should the data be modeled?– Designing the data warehouse schema

• Query Processing– Analysis queries are hard to answer efficiently– What techniques are available to the DBMS?

• Physical Database Design– How should the data be organized on disk?– What data structures should be used?

• Data Mining– What use is all this data?– Which questions should we ask our data warehouse

42

Additional Topics• Data integration• Data cleaning• Approximate query answering• Data lineage• Data visualization• Incremental maintenance of materialized views• Answering queries using views• Indexing special data types (spatial, text,

geographic)• Metadata management

Introduction to Data Warehousing - :: 건국대학교...

Documents

Transcript of Introduction to Data Warehousing - :: 건국대학교...