Introduction to Data Warehousing - :: 건국대학교...
Transcript of Introduction to Data Warehousing - :: 건국대학교...
Introduction to Data Warehousing
Ki-Joon HanDatabase lab.
Konkuk University
2
Outline of Talk
• Data Warehousing and Information Integration
• Brief History of Data Warehousing• OLTP vs. OLAP• What is a Data Warehouse?• Types of Data and Their Uses• Data Warehouse Architectures• Issues in Data Warehousing• Course Objectives
3
A Brief History of Information Technology
• The “dark ages”: paper forms in file cabinets• Computerized systems emerge
– Initially for big projects like Social Security– Same functionality as old paper-based systems
• The “golden age”: databases are everywhere– Most activities tracked electronically– Stored data provides detailed history of activity
• The next step: use data for decision-making– Made possible by omnipresence of IT– Identify inefficiencies in current processes– Quantify likely impact of decisions
4
Databases for Decision Support• 1st phase: Automating existing processes makes
them more efficient.– Automation → Lots of well-organized, easily accessed
data
• 2nd phase: Data analysis allows for better decision-making. – Analyze data → better understanding– Better understanding → better decisions
• “Data Entry” vs. “Thinking”– Data analysts are decision-makers: managers,
executives, etc.
5
Problem: Heterogeneous Information Sources
“Heterogeneities are everywhere”
Different interfacesDifferent data representationsDuplicate and inconsistent information
PersonalDatabases
Digital Libraries
Scientific DatabasesWorldWideWeb
6
Problem: Data Management in Large Enterprises
• Vertical fragmentation of informational systems
• Result of application (user)-driven development of operational systems
Sales Administration Finance Manufacturing ...
Sales PlanningStock Mngmt
...
Suppliers
...Debt Mngmt
Num. Control
...Inventory
7
Goal: Unified Access to Data
Integration System
• Collects and combines information• Provides integrated view, uniform user interface• Supports sharing
WorldWideWeb
Digital Libraries Scientific Databases
PersonalDatabases
8
The Traditional Research Approach
Source SourceSource. . .
Integration System
. . .
Metadata
Clients
Wrapper WrapperWrapper
• Query-driven (lazy, on-demand)
9
Disadvantages of Query-Driven Approach
• Delay in query processing– Slow or unavailable information sources– Complex filtering and integration
• Inefficient and potentially expensive for frequent queries
• Competes with local processing at sources• Hasn’t caught on in industry
10
The Warehousing Approach
DataDataWarehouseWarehouse
Clients
Source SourceSource. . .
Extractor/Monitor
Integration System
. . .
Metadata
Extractor/Monitor
Extractor/Monitor
• Information integrated in advance
• Stored in DWfor direct querying and analysis
11
Advantages of Warehousing Approach• High query performance
– But not necessarily most current information• Doesn’t interfere with local processing at sources
– Complex queries at warehouse– OLTP at information sources
• Information copied at warehouse– Can modify, annotate, summarize, restructure, etc.– Can store historical information– Security, no auditing
• Has caught on in industry
12
Not Either-Or Decision
• Query-driven approach still better for– Rapidly changing information– Rapidly changing information sources– Truly vast amounts of data from large numbers
of sources– Clients with unpredictable needs
13
Federated Databases• An alternative to data warehouses• Data warehouse
– Create a copy of all the data – Execute queries against the copy
• Federated database – Pull data from source systems as needed to answer
queries• “lazy” vs. “eager” data integration
Data WarehouseFederated Database
Query
Answer
QueryExtraction
RewrittenQueries
AnswerSourceSystems
Warehouse Mediator
SourceSystems
14
Warehouses vs. Federation• Advantages of federated databases:
– No redundant copying of data– Queries see “real-time” view of evolving data– More flexible security policy
• Disadvantages of federated databases:– Analysis queries place extra load on transactional systems– Query optimization is hard to do well– Historical data may not be available– Complex “wrappers” needed to mediate between analysis
server and source systems• Data warehouses are much more common in practice
– Better performance– Lower complexity– Slightly out-of-date data is acceptable
15
OLTP vs. OLAP
• OLTP: On-Line Transaction Processing– Many short transactions (queries
+ updates)– Examples:
• Update account balance• Enroll in course• Add book to shopping cart
– Queries touch small amounts of data (one record or a few records)
– Updates are frequent– Concurrency is biggest
performance concern
• OLAP: On-Line Analytical Processing– Long transactions, complex
queries– Examples:
• Report total sales for each department in each month
• Identify top-selling books• Count classes with fewer than
10 students– Queries touch large amounts
of data– Updates are infrequent– Individual queries can require
lots of resources
16
Why OLAP & OLTP don’t mix (1)
• Transaction processing (OLTP):– Fast response time important (< 1 second)– Data must be up-to-date, consistent at all times
• Data analysis (OLAP):– Queries can consume lots of resources– Can saturate CPUs and disk bandwidth– Operating on static “snapshot” of data usually OK
• OLAP can “crowd out” OLTP transactions– Transactions are slow → unhappy users
• Example: – Analysis query asks for sum of all sales– Acquires lock on sales table for consistency– New sales transaction is blocked
Different performance requirements
17
Why OLAP & OLTP don’t mix (2)
• Transaction processing (OLTP):– Normalized schema for consistency– Complex data models, many tables– Limited number of standardized queries and updates
• Data analysis (OLAP):– Simplicity of data model is important
• Allow semi-technical users to formulate ad hoc queries
– De-normalized schemas are common• Fewer joins → improved query performance• Fewer tables → schema is easier to understand
Different data modeling requirements
18
Why OLAP & OLTP don’t mix (3)
• An OLTP system targets one specific process– For example: ordering from an online store
• OLAP integrates data from different processes– Combine sales, inventory, and purchasing data– Analyze experiments conducted by different labs
• OLAP often makes use of historical data– Identify long-term patterns– Notice changes in behavior over time
• Terminology, schemas vary across data sources– Integrating data from disparate sources is a major
challenge
Analysis requires data from many sources
19
Data Warehouses
• Doing OLTP and OLAP in the same database system is often impractical– Different performance requirements– Different data modeling requirements– Analysis queries require data from many sources
• Solution: Build a “data warehouse”– Copy data from various OLTP systems– Optimize data organization, system tuning for OLAP– Transactions aren’t slowed by big analysis queries– Periodically refresh the data in the warehouse
20
Data Warehouse EvolutionT
IME
200019951990198519801960 1975
Information-Based Management
DataRevolution
“MiddleAges”
“PrehistoricTimes”
RelationalDatabases
PC’s andSpreadsheets
End-userInterfaces
1st DW Article
DWConfs.
Vendor DWFrameworks
CompanyDWs
“Building theDW”
Inmon (1992)Data Replication
Tools
21
What is a Data Warehouse?A Practitioners Viewpoint
“A data warehouse is simply a single, complete, and consistent store of data obtained from a variety of sources and made available to end users in a way they can understand and use it in a business context.”-- Barry Devlin, IBM Consultant
22
What is a Data Warehouse?An Alternative Viewpoint
“A DW is a – subject-oriented,– integrated,– time-varying,– non-volatile
collection of data that is used primarily in organizational decision making.”
-- W.H. Inmon, Building the Data Warehouse, 1992
23
A Data Warehouse is...• Stored collection of diverse data
– A solution to data integration problem– Single repository of information
• Subject-oriented– Organized by subject, not by application– Used for analysis, data mining, etc.
• Optimized differently from transaction-oriented db
• User interface aimed at executive
24
A Data Warehouse is...(Cont’d)• Large volume of data (Gb, Tb)• Non-volatile
– Historical– Time attributes are important
• Updates infrequent• May be append-only• Examples
– All transactions ever at WalMart– Complete client histories at insurance firm– Stockbroker financial information and portfolios
25
Summary
Operational SystemsEnterpriseModeling
BusinessInformation Guide
DataWarehouse
CatalogData Warehouse
Population
DataWarehouse
Business InformationInterface
26
Warehouse is a Specialized DBStandard DB
• Mostly updates• Many small transactions• Mb - Gb of data• Current snapshot• Index/hash on p.k.• Raw data• Thousands of users (e.g.,
clerical users)
Warehouse• Mostly reads• Queries are long and complex• Gb - Tb of data• History• Lots of scans• Summarized, reconciled data• Hundreds of users (e.g.,
decision-makers, analysts)
27
Types of Data• Business Data - represents meaning
– Real-time data (ultimate source of all business data)– Reconciled data– Derived data
• Metadata - describes meaning– Build-time metadata– Control metadata– Usage metadata
• Data as a product* - intrinsic meaning– Produced and stored for its own intrinsic value– e.g., the contents of a text-book
28
Data Warehouse Architectures: Conceptual View
• Single-layer– Every data element is stored once only– Virtual warehouse
• Two-layer– Real-time + derived data– Most commonly used approach in
industry today
“Real-time data”
Operationalsystems
Informationalsystems
Derived Data
Real-time data
Operationalsystems
Informationalsystems
29
Three-layer Architecture: Conceptual View
• Transformation of real-time data to derived data really requires two steps
Derived Data
Real-time data
Operationalsystems
Informationalsystems
Reconciled Data Physical Implementationof the Data Warehouse
View level“Particular informational
needs”
30
Data Warehousing: Two Distinct Issues
(1) How to get information into warehouse“Data warehousing”
(2) What to do with data once it’s in warehouse“Warehouse DBMS”
• Both rich research areas• Industry has focused on (2)
31
Issues in Data Warehousing• Warehouse Design• Extraction
– Wrappers, monitors (change detectors)• Integration
– Cleansing & merging• Warehousing specification & Maintenance• Optimizations• Miscellaneous (e.g., evolution)
32
Loading the Data Warehouse
Source Systems Data Staging Area Data Warehouse(OLTP)
Data is periodically extracted
Data is cleansed and transformed
Users query the data warehouse
33
Data Extraction
• Source types– Relational, flat file, WWW, etc.
• How to get data out?– Replication tool– Dump file– Create report– ODBC or third-party “wrappers”
34
WrapperConverts data and queries from one data model to another
Extends query capabilities for sources with limited capabilities
DataModel
B
DataModel
A
Queries
Data
Queries SourceWrapper
35
Data Transformations
• Convert data to uniform format– Byte ordering, string termination– Internal layout
• Remove, add & reorder attributes– Add key– Add data to get history
• Sort tuples
36
Monitors
• Goal: Detect changes of interest and propagate to integrator
• How?– Triggers– Replication server– Log sniffer– Compare query results– Compare snapshots/dumps
37
Data Integration
• Receive data (changes) from multiple wrappers/monitors and integrate into warehouse
• Rule-based• Actions
– Resolve inconsistencies– Eliminate duplicates– Integrate into warehouse (may not be empty)– Summarize data– Fetch more data from sources (DW updates)– etc.
38
Data Integration is Hard
• Data warehouses combine data from multiple sources• Data must be translated into a consistent format• Data integration represents ~80% of effort for a
typical data warehouse project!• Some reasons why it’s hard:
– Metadata is poor or non-existent– Data quality is often bad
• Missing or default values• Multiple spellings of the same thing
(Cal vs. UC Berkeley vs. University of California)– Inconsistent semantics
• What is a tree ?
39
Data Cleansing
• Find (& remove) duplicate tuples– e.g., Jane Doe vs. Jane Q. Doe
• Detect inconsistent, wrong data– Attribute values that don’t match
• Patch missing, unreadable data• Notify sources of errors found
40
Course Objectives
• Gain practical understanding of how data warehouses are built and used
• Gain exposure to data modeling “best practices”• Learn techniques used to process complex queries
over very large data sets• Understand the performance trade-offs that come
from alternative data structures• Learn commonly-used methods for mining and
analysis of large data sets• Become familiar with current research directions
in data warehousing and related areas
41
Research Topics• Logical Database Design
– How should the data be modeled?– Designing the data warehouse schema
• Query Processing– Analysis queries are hard to answer efficiently– What techniques are available to the DBMS?
• Physical Database Design– How should the data be organized on disk?– What data structures should be used?
• Data Mining– What use is all this data?– Which questions should we ask our data warehouse
42
Additional Topics• Data integration• Data cleaning• Approximate query answering• Data lineage• Data visualization• Incremental maintenance of materialized views• Answering queries using views• Indexing special data types (spatial, text,
geographic)• Metadata management