Copyright © 2014, Oracle and/or its affiliates. All rights reserved.
Corey Wei, Technical Consultant, Oracle Corporation
Oracle Big Data Analytics Based on Hadoop and RDBMS
Agenda
Big Data Solution Overview
Big Data Appliance
Oracle NoSQL Database
Big Data SQL
Big Data Connectors
Oracle Advanced Analytics
Case Study
Big Data Solution Overview
Big Data: Techniques and Technologies that Make Handling Data at Extreme Scale Economical.
Brian Hopkins and Boris Evelson, Forrester Research, “Expand Your Digital Horizons with Big Data” (September 2011)
Big Data Definition
Diagram: a pipeline of stages: Stream, Acquire, Organize, Analyze, Decide.
Components shown: Event / Stream Data Capture; Log / File Data Capture; Applications; NoSQL; Hadoop; Bridge Unstructured/Structured; ETL; Data Warehouse; Data Marts / ODS; In-Database Analytics; Predictive Analytics; Dashboards, Reporting & Query; Real-Time Information Discovery.
Oracle Big Data Approach - Functional View
Diagram: the Stream, Acquire, Organize, Analyze, Decide stages, populated with products.
Products shown: Oracle Event Processing; Apache Flume; Applications; Oracle NoSQL Database; Cloudera Hadoop; Oracle R Distribution; Oracle Big Data Connectors; Oracle Data Integrator; Oracle Database; Data Warehouse; In-Database Analytics; Oracle Advanced Analytics; Oracle Industry Data Model(s); Oracle BI Enterprise Edition; Oracle Real-Time Decisions; Endeca Information Discovery.
Oracle Big Data Approach - Product View
Diagram: the Stream, Acquire, Organize, Analyze, Decide stages, populated with products.
Products shown: Oracle Event Processing; Apache Flume; Applications; Oracle NoSQL Database; Cloudera Hadoop; Oracle R Distribution; Oracle Big Data Connectors; Oracle Data Integrator; Oracle Database; Data Warehouse; In-Database Analytics; Oracle Advanced Analytics; Oracle BI Enterprise Edition; Oracle Real-Time Decisions; Endeca Information Discovery.
Oracle Big Data Approach – Engineered Systems
• Complete
• Integrated
• Scalable
Big Data Appliance
Sun Oracle X4-2L servers, each with:
• 2 × 8-core Intel Xeon E5 processors
• 64 GB memory
• 48 TB disk space
Integrated Software:
• Oracle Linux
• Oracle Java VM
• Cloudera Distribution of Apache Hadoop (CDH)
• Cloudera Manager
• All Cloudera Options
• Oracle R Distribution
• Oracle NoSQL Database
All integrated software (except NoSQL DB CE) is supported as part of Premier Support for Systems and Premier Support for Operating Systems
Big Data Appliance X4-2
Starter Rack is fully cabled and configured for growth, with 6 servers
In-Rack Expansion delivers a 6-server modular expansion block
Full Rack delivers an optimal blend of capacity and expansion options
Grow by adding racks: up to 18 racks without additional switches
Big Data Appliance Product Family
Engineered Systems Benefits
Lower TCO than DIY Hadoop Clusters
Faster Time to Value
Higher Performance out-of-box
Lower Management Overhead
Integrated and Comprehensive Security
Tight Integration with your Infrastructure
Big Data Appliance
TCO Data Points:
18 servers (DL380 vs. X4-2L)
864TB Raw Storage
288 Cores
1152GB Total Memory
Cloudera Enterprise Subscription with all options
Subscription vs. Perpetual
Equivalent Installation Cost
Not calculated:
Soft Cost (people and time to value)
Data integration licenses
Chart: List Price Comparisons. Cumulative cost and savings, $0 to $1,400,000, over Year 1 to Year 5. Series: Oracle BDA, HP + Cloudera, Savings.
Engineered Systems Benefits
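The hardware totals in the TCO data points can be checked against the per-server X4-2L figures given on the Big Data Appliance X4-2 slide. The sketch below only redoes that arithmetic; the function and constant names are illustrative and not part of any Oracle tool.

```python
# Per-server figures from the Big Data Appliance X4-2 slide.
SERVERS = 18
CORES_PER_SERVER = 2 * 8        # 2 processors x 8 cores each
MEMORY_GB_PER_SERVER = 64
DISK_TB_PER_SERVER = 48


def rack_totals(servers=SERVERS):
    """Return aggregate cores, memory (GB) and raw storage (TB)."""
    return {
        "cores": servers * CORES_PER_SERVER,
        "memory_gb": servers * MEMORY_GB_PER_SERVER,
        "raw_storage_tb": servers * DISK_TB_PER_SERVER,
    }


totals = rack_totals()
print(totals)
```

The result matches the slide: 288 cores, 1152 GB of memory and 864 TB of raw storage for 18 servers.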
BDA 3.0 vs. DIY CDH 5.0, compared on:
• Management Console
• Single Command Patching and Upgrade
• Full Stack Patching and Upgrading
• Automatic Cluster Re-Configuration
• Security (AAA) out-of-box
• Encryption out-of-box (network and at-rest)
• InfiniBand + Optimizations
• Stack Tuning (OS, Java, Hadoop)
Engineered Systems Benefits
Authentication through Kerberos
Authorization through Apache Sentry
Auditing through Oracle Audit Vault
Encryption for Data-at-Rest
Network Encryption
BDA Security Overview
Management infrastructure combines Oracle Enterprise Manager (EM) and Cloudera Manager (CM)
Quick view of hardware and software status in Oracle Enterprise Manager
Integrated Management Framework
Oracle NoSQL Database
Features
Scalable, Highly Available, Key-Value Database
Diagram: applications connect through the NoSQL DB Driver to storage nodes in Datacenter A and Datacenter B.
• Key-value, JSON & RDF data
• Large Object API
• BASE & ACID Transactions
• Data Center Support
• Online Rolling Upgrade
• Online Cluster Management
• Table data model
• Secondary Indices
• Secondary Zones (Data Centers)
• Security
Oracle NoSQL Database
• Automatic election of new Master
• Rejoining nodes automatically synchronize with the Master
• Isolated nodes can still service reads
• All nodes are symmetric
Automatic Failover
Diagram: replication factor = 5. Five Rep Nodes: one Master and four Replicas; one of the Replicas is marked as the New Master.
Features - Failover
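The failover bullets can be illustrated with a small sketch. This is not the product's election protocol: the majority rule and the "most up-to-date replica wins" tie-break are assumptions chosen for illustration, and `RepNode`, `last_applied` and `elect_master` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class RepNode:
    name: str
    last_applied: int        # hypothetical replication sequence number
    reachable: bool = True


def elect_master(nodes):
    """Pick a new master among reachable nodes, or None without a majority."""
    reachable = [n for n in nodes if n.reachable]
    if len(reachable) <= len(nodes) // 2:
        # An isolated minority may still serve reads but cannot elect a master.
        return None
    # Assumption: prefer the node that has applied the most updates.
    return max(reachable, key=lambda n: (n.last_applied, n.name))


# Replication factor = 5; the old master (rn1) has just failed.
group = [
    RepNode("rn1", 120, reachable=False),
    RepNode("rn2", 118),
    RepNode("rn3", 120),
    RepNode("rn4", 119),
    RepNode("rn5", 117),
]
new_master = elect_master(group)
print(new_master.name)
```

With four of five nodes reachable, the sketch elects `rn3`, the reachable node with the highest sequence number; a rejoining `rn1` would then catch up from it as a replica.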
• Simple data model – key-value pair (major+minor-key paradigm)
• Simple operations – read/insert/update/delete, RMW support
• Major key: hashed to a Shard (partition), Minor key Btree within a Shard
• Raw Key/Value and JSON schema APIs supported
Diagram: Key-Value pairs.
Major key (strings): userid
Minor key (strings): address, subscriptions
Value (byte array): email id, phone #, expiration date, picture (.jpg)
Value Options: Key-Value, JSON, RDF Triples, Tables/Rows
Features – Flexible Data Model
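A minimal sketch of the major/minor key paradigm described above, assuming a fixed partition count and an MD5-based hash. Both are illustrative choices, and `TinyKVStore` is a hypothetical in-memory stand-in, not the Oracle NoSQL Database API.

```python
import hashlib
from collections import defaultdict

N_PARTITIONS = 8  # hypothetical partition count


def partition_for(major_key: str) -> int:
    """Hash the major key to a partition; minor keys never affect placement."""
    digest = hashlib.md5(major_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % N_PARTITIONS


class TinyKVStore:
    def __init__(self):
        # partition id -> major key -> {minor key: value bytes}
        self.partitions = defaultdict(lambda: defaultdict(dict))

    def put(self, major: str, minor: str, value: bytes) -> None:
        self.partitions[partition_for(major)][major][minor] = value

    def get(self, major: str, minor: str):
        return self.partitions[partition_for(major)][major].get(minor)

    def minor_keys(self, major: str):
        # Sorted to mimic ordered, B-tree-like iteration within a shard.
        return sorted(self.partitions[partition_for(major)][major])


store = TinyKVStore()
store.put("user42", "subscriptions", b"2015-01-31")
store.put("user42", "address", b"user42@example.com")
print(store.minor_keys("user42"))
```

Because placement depends only on the major key, every record for `user42` lands in the same partition, which is what makes the single-major-key operations on the following slides possible.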
• Benefits
– Lower barrier to adoption, shorter time to market
– Simplified application modeling
– Uses familiar table concepts
• Features
– Layered on top of distributed key-value model
– Compatible with Release 2.0 JSON schemas
– Supports table evolution, retains flexible client access
• Sets foundation for future capabilities
NoSQL DB Table Model
Features – Flexible Data Model
• Configurable Durability per operation
• Configurable Consistency per operation
• ACID by default
• Transaction scope is single API call
• Records share same major key
• Multiple operations supported
Greater Flexibility
Features – Configurable Transactions
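The transaction scope described above (one API call, multiple operations, all records sharing the same major key) can be sketched as follows. The `execute` function, the tuple layout of an operation, and the dict-based store are assumptions made for illustration, not the product's API.

```python
def execute(store, operations):
    """Apply a batch of operations atomically.

    `store` maps (major, minor) -> value. Each operation is a tuple
    (kind, major, minor, value) with kind "put" or "delete".
    """
    majors = {op[1] for op in operations}
    if len(majors) > 1:
        raise ValueError("all operations must share one major key")

    staged = dict(store)  # work on a copy so a failure leaves `store` intact
    for kind, major, minor, value in operations:
        if kind == "put":
            staged[(major, minor)] = value
        elif kind == "delete":
            del staged[(major, minor)]  # KeyError aborts the whole batch
        else:
            raise ValueError(f"unknown operation: {kind}")

    # Commit: swap the staged state in only after every operation succeeded.
    store.clear()
    store.update(staged)


store = {}
execute(store, [
    ("put", "user42", "email", b"user42@example.com"),
    ("put", "user42", "phone", b"555-0100"),
])
print(sorted(store))
```

A batch that mixes major keys, or that fails part-way, leaves the store untouched.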
• Increase Data Capacity
– Add more storage nodes
– New shards automatically created
• Increase Data Throughput
– More shards = better write throughput
– More replicas/shard = better read throughput
On Demand
Diagram: the application uses the NoSQL DB Driver to reach Shard-1 and Shard-2; each shard has one Master and two Replicas spread across three Storage Nodes.
On-Demand Cluster Expansion
Features – Elasticity
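One way to picture on-demand expansion is a fixed set of partitions spread over a growing number of shards. The sketch below assumes that model and moves only as many partitions as needed to even out the load; the `expand` function and the integer shard ids are hypothetical.

```python
def expand(assignment, new_shard_count):
    """Redistribute partitions over more shards, moving as few as possible.

    `assignment` maps partition id -> shard id (0-based). Shard ids must
    already be below `new_shard_count`, i.e. the cluster only grows.
    """
    shards = {s: [] for s in range(new_shard_count)}
    for part, shard in sorted(assignment.items()):
        shards[shard].append(part)

    base, extra = divmod(len(assignment), new_shard_count)
    target = {s: base + (1 if s < extra else 0) for s in shards}

    # Collect partitions from shards that hold more than their target...
    surplus = []
    for s, parts in shards.items():
        while len(parts) > target[s]:
            surplus.append(parts.pop())
    # ...and hand them to shards that hold fewer.
    for s, parts in shards.items():
        while len(parts) < target[s]:
            parts.append(surplus.pop())

    return {p: s for s, parts in shards.items() for p in parts}


before = {p: p % 2 for p in range(12)}   # 12 partitions on 2 shards
after = expand(before, 3)                # add a third shard
moved = sum(1 for p in before if before[p] != after[p])
print(moved)
```

Going from two shards to three moves 4 of the 12 partitions, all onto the new shard, so each shard ends up with 4.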
• Supports heterogeneous storage topology
• Replicas move from over-utilized to under-utilized storage nodes
• Number of shards and replication factor remain unchanged
Improve Performance
Diagram: partitions distributed across Storage Node 1, Storage Node 2 and Storage Node 3.
Features – Automatic Rebalancing
Chart: Online Rolling Upgrade. Time to Upgrade (min), axis 0 to 17.5, versus Total Nodes (Shards x Rep. Factor): 72 (24x3), 144 (48x3), 216 (72x3).
What's the Big Deal?
Ever tried to upgrade a 200-node system while it's active?
• We did it!
• Admin commands available to describe a safe upgrade order
• Scripts available for a hands-free upgrade experience
• Read/write availability throughout the upgrade process
Features – Online Rolling Upgrades
Query NoSQL data from Oracle Database
Access NoSQL data from Hadoop for DW and analytics
Share data with Coherence for extensible in-memory cache grid
Persist history & event streams for processing with OEP
Store & query RDF data using Oracle RDF for NoSQL
Oracle NoSQL Database: Integrated out of the box
Features – Integration
Chart: Mixed Throughput. Throughput (ops/sec, axis 0 to 1,400,000), Read Latency (ms) and Write Latency (ms) (average latency axis 0 to 4) versus Cluster Size: 6 (2x3), 12 (4x3), 24 (8x3), 30 (10x3).
• 1.25M ops/sec
• 2 billion records
• 2 TB of data
• 95% read, 5% update
• Low latency
• High Scalability
(Yahoo! Cloud Serving Benchmark)
Benchmark Results - YCSB
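The headline numbers can be broken down with simple arithmetic. The sketch below assumes decimal terabytes (1 TB = 10^12 bytes), which the slide does not state; the variable names are illustrative.

```python
TOTAL_OPS_PER_SEC = 1_250_000
READ_PERCENT = 95                 # 95% read, 5% update
RECORDS = 2_000_000_000
DATA_BYTES = 2 * 10**12           # 2 TB, assuming decimal terabytes

reads_per_sec = TOTAL_OPS_PER_SEC * READ_PERCENT // 100
updates_per_sec = TOTAL_OPS_PER_SEC - reads_per_sec
bytes_per_record = DATA_BYTES // RECORDS

print(reads_per_sec, updates_per_sec, bytes_per_record)
```

Under that assumption the workload is about 1,187,500 reads and 62,500 updates per second, over records averaging roughly 1,000 bytes each.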
Big Data SQL
• Hadoop is good at some things
• Databases are good at others
• SQL is very important
Strengths of Both Systems
Exadata + Oracle Database
Big Data Appliance + Hadoop & NoSQL
Unify:
• Development languages
• Security
• Administration
• Support
• Workload management
• Lifecycle management
• Availability
Embrace Innovation and Integrate
Diagram: Big Data Appliance + Hadoop (HDFS Name Node with Hive metadata, HDFS Data Nodes) alongside Exadata + Oracle Database (Oracle Catalog holding the External Table and its Hive metadata).
create table customer_address
( ca_customer_id   number(10,0)
, ca_street_number char(10)
, ca_state         char(2)
, ca_zip           char(10)
)
organization external (
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS
    (com.oracle.bigdata.cluster hadoop_cl_1)
  LOCATION ('hive://customer_address')
)
Publish Hadoop Metadata to Oracle Catalog
Diagram: Oracle Catalog (External Table, Hive metadata), HDFS Name Node (Hive metadata) and four HDFS Data Nodes.
Determine:
• Data locations
• Data structure
• Parallelism
Send to specific data nodes:
• Data request
• Context
Executing Queries on Hadoop
Select c_customer_id
, c_customer_last_name
, ca_county
From customers
, customer_address
where c_customer_id = ca_customer_id
and ca_state = 'CA'
Diagram: Oracle Catalog (External Table, Hive metadata), HDFS Name Node (Hive metadata) and four HDFS Data Nodes holding the "Tables".
Select c_customer_id
, c_customer_last_name
, ca_county
From customers
, customer_address
where c_customer_id = ca_customer_id
and ca_state = 'CA'
Do I/O and Smart Scan:
• Filter rows
• Project columns
Move only relevant data:
• Relevant rows
• Relevant columns
Apply join with database data
Executing Queries on Hadoop
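The three steps on this slide (filter and project on the data nodes, move only relevant data, join in the database) can be sketched in a few lines. The sample rows, the `smart_scan` helper and the in-memory `customers` lookup are all made up for illustration; this is not how Big Data SQL is implemented.

```python
# Rows as they might sit in HDFS behind the customer_address external table
# (made-up sample data).
customer_address = [
    {"ca_customer_id": 1, "ca_state": "CA", "ca_county": "Marin", "ca_zip": "94901"},
    {"ca_customer_id": 2, "ca_state": "NY", "ca_county": "Kings", "ca_zip": "11201"},
    {"ca_customer_id": 3, "ca_state": "CA", "ca_county": "Kern", "ca_zip": "93301"},
]

# Database-side customers table: c_customer_id -> c_customer_last_name
# (made-up sample data).
customers = {1: "Lee", 2: "Diaz", 3: "Ito"}


def smart_scan(rows, predicate, columns):
    """Runs next to the data: filter rows, then project columns."""
    for row in rows:
        if predicate(row):
            yield {c: row[c] for c in columns}


def run_query():
    # Only CA rows, and only the two columns the query needs, leave Hadoop.
    shipped = smart_scan(
        customer_address,
        predicate=lambda r: r["ca_state"] == "CA",
        columns=("ca_customer_id", "ca_county"),
    )
    # The join with database data happens on the database side.
    return [
        (r["ca_customer_id"], customers[r["ca_customer_id"]], r["ca_county"])
        for r in shipped
        if r["ca_customer_id"] in customers
    ]


result = run_query()
print(result)
```

Only two of the three address rows, and two of their four columns, cross from the scan to the join.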
Storage Indexes
• Automatically collect and store the minimum and maximum value within a storage unit
• Before scanning a storage unit, check whether the required data can fall within its min-max range
• If not, skip scanning that unit and reduce scan time
Diagram: HDFS Name Node (Hive metadata) and four HDFS Data Nodes; the "Blocks" on the data nodes are annotated with Min/Max values.
Optimizing Scans on Hadoop
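The min-max idea behind storage indexes can be shown with a short sketch. The block contents and the function names are illustrative assumptions; a real storage index is maintained automatically by the storage layer.

```python
def build_storage_index(blocks):
    """Record the minimum and maximum value held in each storage unit."""
    return [(min(block), max(block)) for block in blocks]


def scan_equal(blocks, index, value):
    """Return matching values and how many blocks actually had to be read."""
    hits, scanned = [], 0
    for block, (lo, hi) in zip(blocks, index):
        if value < lo or value > hi:
            continue                 # value cannot be here: skip the block
        scanned += 1
        hits.extend(v for v in block if v == value)
    return hits, scanned


blocks = [[1, 5, 9], [10, 12, 19], [20, 25, 30]]
index = build_storage_index(blocks)
hits, scanned = scan_equal(blocks, index, 12)
print(hits, scanned)
```

Looking for the value 12 reads one block out of three; the other two are ruled out by their min-max ranges alone.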
One Query Spanning Oracle Database, Hadoop & NoSQL
Query Data in RDBMS, Hadoop & NoSQL
Diagram: Oracle SQL runs across Oracle NoSQL DB and HDFS Data Nodes as well as Oracle Database Storage Servers.
Fast:
• Massive Parallelism
• Storage Indexes
• Filtered Locally
• Minimized Data Movement
Oracle Big Data SQL
Big Data Connectors
Diagram: Oracle Loader for Hadoop as a MapReduce job: MAP tasks feed SHUFFLE/SORT stages, which feed REDUCE tasks.
Offloads data pre-processing from the database server to Hadoop
Works with a range of input data formats
Automatic balancing in case of skew in input data
Online and offline modes
Kerberos authentication
Connect to the database from reducer nodes, load into database partitions in parallel (JDBC or direct path)
Partition, sort, and convert into Oracle data types on Hadoop
Oracle Loader for Hadoop
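The "partition, sort, and convert" step can be pictured with a small sketch of what a reducer-side preparation stage might do before loading. The CSV layout, the month-based partitioning and the `prepare_for_load` name are assumptions for illustration only; they are not Oracle Loader for Hadoop's actual configuration or API.

```python
from collections import defaultdict
from datetime import date
from decimal import Decimal


def prepare_for_load(lines):
    """Convert text records to typed tuples, grouped by target partition.

    Each input line is assumed to look like "id,YYYY-MM-DD,amount".
    Records are grouped by month (a hypothetical partitioning scheme)
    and sorted within each group.
    """
    partitions = defaultdict(list)
    for line in lines:
        raw_id, raw_day, raw_amount = line.strip().split(",")
        record = (int(raw_id), date.fromisoformat(raw_day), Decimal(raw_amount))
        partitions[record[1].strftime("%Y-%m")].append(record)
    return {key: sorted(records) for key, records in sorted(partitions.items())}


batches = prepare_for_load([
    "7,2014-02-03,19.99",
    "2,2014-01-15,5.00",
    "1,2014-01-20,12.50",
])
for key, records in batches.items():
    print(key, [r[0] for r in records])
```

Each resulting batch corresponds to one target partition and could then be loaded in parallel with the others.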
Generate external table in database pointing to HDFS data
Load into database or query data in place on HDFS
Fine-grained control over data type mapping
Parallel load with automatic load balancing
Kerberos authentication
Use Oracle SQL to Access Data on HDFS
Diagram: a SQL Query in Oracle Database reads through an External Table; parallel OSCH processes act as HDFS Clients to read from Hadoop.
Access or load into the database in parallel using external table mechanism
Oracle SQL Connector for HDFS
R Analytics leveraging Hadoop and HDFS
Linearly Scale a Robust Set of R Algorithms
Leverage MapReduce for R Calculations
Compute Intensive Parallelism for Simulations
Diagram: the Oracle R Client submits MAP and REDUCE tasks to Hadoop over data in HDFS.
Oracle R Advanced Analytics for Hadoop
Diagram: Oracle Data Integrator transforms data via MapReduce (Hive) and activates Oracle Loader for Hadoop, which loads the data.
Benefits
• Consistent tooling across BI/DW, SOA, Integration and Big Data
• Reduces the complexity of Hadoop processing through graphical tooling
• Improves productivity when processing Big Data (Structured + Unstructured)
Oracle Database
Improving Productivity and Efficiency for Big Data
Oracle Data Integrator Application Adapters for Hadoop
Diagram: Acquire, Organize, Analyze; Oracle Big Data Connectors shown alongside Oracle Data Integrator and Oracle Loader for Hadoop.
OXH is a transformation engine for Big Data
XQuery language executed on Hadoop
XQuery:
for $ln in text:collection()
let $f := tokenize($ln)
where $f[1] = 'x'
return text:put($f[2])
Diagram: the OXH Engine turns the XQuery into a Map/Reduce Execution Plan (M/R steps) that runs on Map/Reduce Worker Nodes over HDFS.
Oracle XQuery for Hadoop
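For readers unfamiliar with XQuery, the logic of the query on this slide can be restated in Python. This is only a plain-Python reading of the query, assuming whitespace tokenization; it says nothing about how OXH compiles or executes it, and `select_second_field` is a hypothetical name.

```python
def select_second_field(lines, wanted="x"):
    """Yield field 2 of every line whose field 1 equals `wanted`."""
    for ln in lines:
        f = ln.split()                     # tokenize on whitespace
        if len(f) >= 2 and f[0] == wanted:
            yield f[1]


out = list(select_second_field(["x alpha", "y beta", "x gamma"]))
print(out)
```

Each input line is handled independently, which is why a query of this shape can be spread across map tasks.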
Simplify Map Reduce
Diagram: Oracle Data Integrator together with OLH & OSCH.
• Automatically generates MapReduce code
• High performance loads into Data Warehouse leveraging both OLH and OSCH
• Manages the process across platforms
Oracle Data Integrator
Oracle Advanced Analytics
In-Database Data Mining Algorithms
Function | Algorithms | Applicability
Classification | Logistic Regression (GLM); Decision Trees; Naïve Bayes; Support Vector Machines (SVM) | Classical statistical technique; Popular / Rules / transparency; Embedded app; Wide / narrow data / text
Regression | Linear Regression (GLM); Support Vector Machine (SVM) | Classical statistical technique; Wide / narrow data / text
Anomaly Detection | One Class SVM | Unknown fraud cases or anomalies
Attribute Importance | Minimum Description Length (MDL); Principal Components Analysis (PCA) | Attribute reduction, Reduce data noise
Association Rules | Apriori | Market basket analysis / Next Best Offer
Clustering | Hierarchical k-Means; Hierarchical O-Cluster; Expectation-Maximization Clustering (EM) | Product grouping / Text mining; Gene and protein analysis
Feature Extraction | Nonnegative Matrix Factorization (NMF); Singular Value Decomposition (SVD) | Text analysis / Feature reduction
Oracle Advanced Analytics
1. User R Engine on desktop (open source R, Oracle R Enterprise packages, other R packages)
• R-SQL Transparency Framework intercepts R functions for scalable in-database execution
• Function intercept for data transforms, statistical functions and advanced analytics
• Interactive display of graphical results and flow control as in standard R
• Submit entire R scripts for execution by database
2. Database Compute Engine (Oracle Database, SQL, user tables, results)
• Scale to large datasets
• Access tables, views, and external tables, as well as data through DB LINKS
• Leverage database SQL parallelism
• Leverage new and existing in-database statistical and data mining capabilities
3. R Engine(s) spawned by Oracle DB (Oracle R Enterprise packages, other R packages, R results)
• Database can spawn multiple R engines for database-managed parallelism
• Efficient data transfer to spawned R engines
• Emulate map-reduce style algorithms and applications
• Enables "lights-out" execution of R scripts
Oracle R Enterprise Compute Engines
Oracle Advanced Analytics
Unified Analytics API
SQL R MR
Unified Analytics Processing Platform
Hadoop RDBMS
IB
Management Framework and Tools
Unified access model supporting all analysis capabilities: SQL, R & MR
Oracle Enabling Technologies
Case Study
1. Provisioning systems are complex, expensive and inefficient
2. Lack of business agility, with very long time-to-market and time-to-value (6-12 months)
3. Business users are bypassing corporate IT systems
4. Data marts are strongly siloed, with no interoperability
5. Complex operations with very limited backup/recovery and no HA capabilities
6. Unstructured information is not managed
7. Lack of advanced analytics capabilities
The current ("as-is") architecture is based on a 1990s-era design: siloed data marts with complex and expensive provisioning systems, unable to respond with agility to new business requirements
Siloed operational systems with complex, heavy and slow data transformation and flows to data marts
Current (“as-is”) Architecture
Diagram: layered Data Pool architecture, framed by Security and Metadata.
Source Data Layer: SAP RRHH; Diario Electronico; Mainframe; Streaming; Sensors; Social/Text; Others.
Data Integration: Data Factory Engine & ODI + Metadata; GG; MQ; FTE; Interfaces; Transformed data.
Information Management: Staging & Raw Data Layer; Foundation Layer (Level 2 + 3); Access & Performance Layer (Level 4); Data Quality; Embedded Data Marts; Data Marts.
Knowledge Discovery Area: Rapid Development Sandbox; Analytical Discovery Sandbox; Advanced Analysis & Data Science (Discovery).
Information Access: BI Abstraction & Query Federation; BI Server; Alerts, Dashboards, Reporting; Services; Performance Management.
Other labels: High Density; Low Density; Level 0 + 1.
Data Pool Logical Architecture
Diagram: two data centers, DC1 and DC2, each hosting a Data Pool.
Environments: PRD; PRD'; UAT; UAT'.
Interconnects: IB; 10GbE; FC; SAN.
Replication: Oracle DataGuard; ZFS Replication; BDR Replication.
Backup: Backup Snapshot; ZS-3 Backup; Oracle RMAN; TSM; VTL.
Data Pool Hardware Architecture