Data analysis using IBM Netezza Analytics (Analiza danych przy użyciu IBM Netezza Analytics)

Grzegorz Puchawski, Data analysis within IBM Netezza Analytics, 14 June 2011, Warsaw, Sheraton Warsaw Hotel


Transcript of Analiza danych przy użyciu IBM Netezza Analytics

Page 1: Analiza danych przy użyciu IBM Netezza Analytics

Grzegorz Puchawski

Data analysis within IBM Netezza Analytics

14 June 2011, Warsaw, Sheraton Warsaw Hotel

Page 2: Analiza danych przy użyciu IBM Netezza Analytics


In a nutshell, what is IBM Netezza Analytics

Page 3: Analiza danych przy użyciu IBM Netezza Analytics

Big Data Meets Big Math

Analytics Without Constraints

Page 4: Analiza danych przy użyciu IBM Netezza Analytics

Massive Data and Massive Computation

Data Intensity

• Depth of Data

• Width of Data

Computational Intensity

• Computational Complexity

• Model Complexity

Page 5: Analiza danych przy użyciu IBM Netezza Analytics

In-Database Analytics

• nzAnalytics: Data Prep, Data Mining, Predictive Analytics

• Open Source Analytics: R Analytics, Spatial

• Custom: Customer / Partner Analytics

Software Development Kit

• nzAdaptors for C, C++, Java, Python, Fortran

• nzPlug-in for Eclipse

• nzPackage for R GUI

Parallel Analytic Engines

• nzMatrix

• nzEngine for R

• nzEngine for Hadoop

Streaming Accelerator

IBM Netezza AMPP™ Platform

Page 6: Analiza danych przy użyciu IBM Netezza Analytics


Who is the target audience for IBM Netezza Analytics?

Page 7: Analiza danych przy użyciu IBM Netezza Analytics

Who is the target audience for IBM Netezza Analytics?

• Line of Business Owner

– Areas of Interest – Gaining / sustaining competitive advantage, discovering new opportunities to increase revenue or decrease costs, ability to use all collected data

– Benefits – Fast results, add significant business value / big bets, leverage all the data, performance at scale

• Business Intelligence

– Areas of Interest – Analysis beyond SQL, analytics dashboards and reports

– Benefits – Rich set of analytics beyond SQL

• Data Miners

– Areas of Interest – Marketing, life sciences, fraud, network analysis

– Benefits – Ability to explore more data, quick to failure, identify new opportunities, new package of analytic tools, ability to process large data

Page 8: Analiza danych przy użyciu IBM Netezza Analytics

Who is the target audience for IBM Netezza Analytics?

• Modelers

– Areas of Interest – Logistics, yield, forecasting, risk

– Benefits – Simplification of analytic processes, ability to use new and innovative models, quick to failure, model at scale using parallelized analytics, score at scale

• Quants / Statisticians

– Areas of Interest – Risk, forecasting, descriptive statistics, correlation of factors

– Benefits – Simplification of analytic processes, quick to failure, in-database analytics

• Programmers, Developers

– Areas of Interest – Low-level programming tools, multi-language environment, User Defined Functions (UDFs), User Defined Analytic Processes (UDAPs / AEs), Eclipse

– Benefits – Power and simplification of in-database analytics, flexibility of porting analytics/application

Page 9: Analiza danych przy użyciu IBM Netezza Analytics

How is the IBM Netezza Analytics platform used?

Page 10: Analiza danych przy użyciu IBM Netezza Analytics

High Performance on Massive Data

• Exploratory Data Analysis: Data Exploration, Data Cleansing, Data Transformation

• Embed Algorithms: Embarrassingly Parallel Algorithms, Heroic Computations, Model Parallelism

• Build Model: Descriptive Modeling, Predictive Modeling, Optimization Model

• Deploy Model: Scoring, Forecasting, Decision Management

Page 11: Analiza danych przy użyciu IBM Netezza Analytics

Embed Algorithms

– User Interface – Eclipse, R GUI/CLI

– Development Env. – UDF, UDAP, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R

Exploratory Data Analysis

– User Interface – SQL, R GUI/CLI

– Analytics – nzAnalytics, R Analytics, Customer Analytics, Partner Analytics

Build Model

– User Interface – SQL, R GUI/CLI, Eclipse

– Analytics – nzAnalytics, R Analytics, Customer Analytics, Partner Analytics

Deploy / Score Model

– Analytics – nzAnalytics, R Analytics, Customer Analytics, Partner Analytics

– Deploy/Scoring – nzAdaptors, UDF, UDAP, Shared Library, Stored Procedures, nzPackage for R

Page 12: Analiza danych przy użyciu IBM Netezza Analytics

Embedding Algorithms

• What is it?

– The ability to run programs directly on the S-Blade

• What is it used for?

– Bringing complex computation to the Netezza data stream

• What technology does it use?

– User Interface – Eclipse, R GUI/CLI

– Development Environment - UDFs, User Defined Analytic Process, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R Packages (for implementing algorithms run from R GUI)

• What are the benefits?

– Ability to process data as it streams directly on the S-Blade

– Ability to harness total compute power of a TwinFin for parallel processing

Page 13: Analiza danych przy użyciu IBM Netezza Analytics

Exploratory Data Analysis

• What is it?

– The exercise of looking at data for the purpose of coming up with hypotheses

• What is it used for?

– Exploratory data analysis – Data Profiling / Descriptive Statistics, General Diagnostic Measures, Statistics, Sampling, Histograms

– Data cleansing – Feature selection

– Data transformation – Data Prep / Transformations

• What technology does it use?

– User interface – SQL, R GUI/CLI, others

– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics

• What are the benefits?

– Discovery on more data, faster

Page 14: Analiza danych przy użyciu IBM Netezza Analytics

Build Model

• What is it?

– Choosing which method will give the best results

– Finding the best parameters to give the best predictions

• What is it used for?

– Predictive Analytics – Regression, Classification, Bayesian Networks, Model Testing, Sample Size

– Data Mining – Association Rules Mining, Clustering, Feature Selection

• What technology does it use?

– User interface – SQL, R GUI/CLI, Eclipse, ...

– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics

– Development tools – Language Adapters, UDFs, UDAP, Stored Procedures

• What are the benefits?

– Moving the computation processing to the data

– Parallel computational processing on all of the data

Page 15: Analiza danych przy użyciu IBM Netezza Analytics

Deploying / Scoring Model

• What is it?

– Parallelized application of a model using parameters from the build step

• What is it used for?

– Predictive Analytics – Regression, Classification, Bayesian Networks, Model Testing

– Data Mining – Association Rules Mining, Clustering, Feature Extraction

• What technology does it use?

– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics

– Deploying/Scoring – Language Adapters, UDFs, User Defined Analytic Process, Shared Libraries, Stored Procedures, R package

– Development tools – Language Adapters, UDFs, AE, Stored Procedures

• What are the benefits?

– Score and experiment in parallel

– Faster model scoring and therefore time to insight/value
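
The scoring step described above can be sketched roughly as follows. This is a minimal illustration, not the Netezza API: it assumes a linear model whose coefficients came from the build step and a table that is already split into partitions. The names `COEFFICIENTS`, `score_row`, `score_partition`, and `score_in_parallel` are hypothetical, and a thread pool merely stands in for the parallel S-Blades.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical parameters produced by a build step: intercept plus two weights.
COEFFICIENTS = {"intercept": 1.0, "weights": [2.0, -0.5]}


def score_row(row):
    """Apply the linear model to one row of feature values."""
    weights = COEFFICIENTS["weights"]
    return COEFFICIENTS["intercept"] + sum(w * x for w, x in zip(weights, row))


def score_partition(rows):
    """Score every row of one data partition."""
    return [score_row(row) for row in rows]


def score_in_parallel(partitions):
    """Score all partitions concurrently; results keep partition order."""
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(score_partition, partitions))


# Illustrative data: two partitions of a table with two feature columns.
partitions = [
    [[1.0, 2.0], [0.0, 4.0]],
    [[3.0, 0.0]],
]
scores = score_in_parallel(partitions)
```

Because each partition is scored independently with the same fixed parameters, the work needs no coordination between workers, which is what makes scoring easy to parallelize.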

Page 16: Analiza danych przy użyciu IBM Netezza Analytics


What is in IBM Netezza Analytics?

Page 17: Analiza danych przy użyciu IBM Netezza Analytics

[Architecture diagram, repeated from Page 5: In-Database Analytics, Software Development Kit, and Parallel Analytic Engines, on top of the Streaming Accelerator and the IBM Netezza AMPP™ Platform.]

Page 18: Analiza danych przy użyciu IBM Netezza Analytics

Streaming Accelerator

• What is it?

– Our unique differentiator that combines our historical strength in fast data stream processing with powerful in-database analytics processing and new inter-node analytics processing capabilities

• What is it used for?

– Parallelizing data and analytics processing

• What technology does it use?

– FPGA

– UDFs, User Defined Analytic Process

– Message Passing Interface (MPI) for distributed processing

• What are the benefits?

– Accelerates data processing for analytics

– Accelerates parallel matrix operations on big data

– Simplifies parallelization

[Diagram: the Streaming Accelerator layer on the Netezza AMPP™ Platform]

Page 19: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Matrix Engine

• What is it?

– Parallelized linear algebra package

• What is it used for?

– Building block for higher order parallelized analytics

• What technology does it use?

– Scalable Linear Algebra Package (ScaLAPACK)

– Message Passing Interface (MPI) for distributed processing

• What are the benefits?

– Simplifies analytic algorithm and model development

– Accelerates parallel matrix operations on big data

Parallel Analytic Engines

nzMatrix

Page 20: Analiza danych przy użyciu IBM Netezza Analytics

• Supports the following parallel matrix operations

– Basic Linear Algebra Subprograms (e.g. matrix multiplication, matrix dot product)

– Solving a System of Linear Equations

– Solving Linear Least Squares Problems

– Eigenvalues and Eigenvectors

– Singular Value Decomposition (SVD)

– Matrix Factorization

– Matrix Inversion

– Matrix Element Scalar Functions

– Matrix Reduction Functions (e.g. min, max, sum of squares, sum)

– Matrix Inquiry Functions  (e.g. number of rows and columns)

– Matrix Reshaping Functions

• Call Interface

– Accessible from R, Python, Java, etc. via ODBC and Stored Procedures
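
To make two of the listed operations concrete, here is a small single-machine sketch of solving a linear system and a linear least squares problem. It is plain Python written for illustration only; it does not call nzMatrix or ScaLAPACK, and the helper names (`transpose`, `matmul`, `solve`, `least_squares`) are hypothetical. It assumes a non-singular system.

```python
def transpose(m):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(a)
    aug = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        # Move the row with the largest entry in this column to the pivot position.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def least_squares(design, y):
    """Solve the normal equations (X^T X) beta = X^T y."""
    xt = transpose(design)
    xtx = matmul(xt, design)
    xty = [sum(a * b for a, b in zip(row, y)) for row in xt]
    return solve(xtx, xty)


# Fit y = beta0 + beta1 * x to four points lying on y = 1 + 2x.
design = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = least_squares(design, y)
```

The normal equations are used here because they keep the sketch short; numerical libraries usually prefer a QR or SVD based solver for better conditioning.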

Parallel Analytic Engines

nzMatrix

IBM Netezza Matrix Engine

Page 21: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Engine for Hadoop

• What is it?

– Hadoop-compatible implementation of the MapReduce paradigm

• What is it used for?

– Clickstream & social data analysis

– ETL/ELT and analytics processing of key/value pairs

• What technology does it use?

– Java User Defined Analytic Process

• What are the benefits?

– Enables effective parallel processing of data from Netezza database tables

– Bringing Hadoop to database with minimal refactoring of existing Hadoop code

– Only database offering Hadoop interface (all others are home-grown)

Parallel Analytic Engines

Hadoop

Page 22: Analiza danych przy użyciu IBM Netezza Analytics

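Hadoop by Apache vs Hadoop by Netezza

[Diagram comparing the two data flows. Apache Hadoop: mappers and reducers run on cluster nodes, reading input from and writing output to HDFS. Netezza: mappers and reducers run on the SPUs; Mappers 1 to 4 read Slices 1 to 4 of the input table (dataslices), a REDISTRIBUTION step routes the map output to the reducers, and the reducers write to the output table (dataslices).]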

Page 23: Analiza danych przy użyciu IBM Netezza Analytics

Netezza Engine for Hadoop: Example

• Example: Clickstream analysis

– Data:

  • Table containing data about users and visited pages

  • User groups' definitions

– Task: For each group, find all pages that have been visited by all members of this group

Parallel Analytic Engines

Hadoop

Page 24: Analiza danych przy użyciu IBM Netezza Analytics

Netezza Engine for Hadoop: Example

• Sample data: Clickstream analysis

Input table: users and visited pages

USER    URL
A       ibm.com
A       netezza.com
A       sheraton.pl
B       ibm.com
D       netezza.com
D       apache.org

Input table: user groups

GROUP   USER
FIRST   A
FIRST   B
SECOND  A
SECOND  D

Result: pages visited by all members of each group

GROUP   URL
FIRST   ibm.com
SECOND  netezza.com
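
The clickstream task can be sketched as a map, redistribute, and reduce pipeline. The following is a single-process Python illustration using the sample rows from the slide; it is not the Netezza Engine for Hadoop or the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`) are hypothetical. It assumes the small group table is available to every mapper as side data.

```python
from collections import defaultdict

# Sample rows from the slide.
clicks = [
    ("A", "ibm.com"), ("A", "netezza.com"), ("A", "sheraton.pl"),
    ("B", "ibm.com"), ("D", "netezza.com"), ("D", "apache.org"),
]
groups = [("FIRST", "A"), ("FIRST", "B"), ("SECOND", "A"), ("SECOND", "D")]


def map_phase(clicks, groups):
    """Emit ((group, url), user) for every click made by a group member."""
    groups_of_user = defaultdict(list)
    for group, user in groups:
        groups_of_user[user].append(group)
    for user, url in clicks:
        for group in groups_of_user[user]:
            yield (group, url), user


def shuffle(pairs):
    """Collect values by key, standing in for the redistribution step."""
    buckets = defaultdict(set)
    for key, value in pairs:
        buckets[key].add(value)
    return buckets


def reduce_phase(buckets, groups):
    """Keep the (group, url) keys whose visitors cover the whole group."""
    members = defaultdict(set)
    for group, user in groups:
        members[group].add(user)
    return sorted(
        key for key, users in buckets.items() if users == members[key[0]]
    )


result = reduce_phase(shuffle(map_phase(clicks, groups)), groups)
```

Keying the intermediate pairs by (group, url) means each reducer can decide about one page for one group without seeing any other key, which is the property that lets the reducers run in parallel.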

Parallel Analytic Engines

Hadoop

Page 25: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Engine for R

• What is it?

– Native R running pushed down onto the S-Blade for parallel analytics processing

• What is it used for?

– Exploratory data analysis, building models, scoring models, etc

• What technology does it use?

– Open Source R

– User Defined Analytic Process, Data Stream Processing

• What are the benefits?

– Accelerates and scales R to run on big data

– Leverage open-source CRAN repository of algorithms

• Supports the following parallel R operations

– R interpreter running in parallel

– R CRAN Analytics applied in parallel

• Call Interface

– Invoked via SQL (as a User Defined Analytic Process) or from R

Parallel Analytic Engines

R

Page 26: Analiza danych przy użyciu IBM Netezza Analytics

In-database Analytics

• What is it?

– Parallelized in-database analytics for data prep, data mining, prediction, and geospatial

• What is it used for?

– Building and deploying/scoring models

• What technology does it use?

– UDFs, Stored Procedures, User Defined Analytic Process, nzMatrix

• What are the benefits?

– A starter kit of parallelized analytics, designed for a parallel environment and for large-scale data

In-Database Analytics

nzAnalytics

Page 27: Analiza danych przy użyciu IBM Netezza Analytics

In-database Analytics

Data Prep

Data Profiling / Descriptive Statistics

• Probability Density and Inverse Functions: Normal, Fisher, Exponential, Uniform, Weibull, Wilcoxon, Mann-Whitney, Student's t, Chi-Square

General Diagnostic Measures

• Error Calculation: Classification Error, Mean Absolute Error, Mean Squared Error, Relative Absolute Error, Relative Squared Error

Sampling

• Uniform Random Sampling: Uniform Random Sampling Count, Uniform Random Sampling Fraction

Statistics

• Histogram and Frequency Table: Histogram, Bivariate Frequency Table, Univariate Frequency Table

• Quantiles: Quantiles, Median, Outliers, Quartile

• Parametric Statistics: Chi-Square, Student's t

• Non-Parametric Statistics: Spearman's Rank Correlation, Mann-Whitney-Wilcoxon, Wilcoxon

• Moments: Kurtosis, Skewness

Data Prep / Transformations

• Binning and Discretization: Entropy Minimization, Equal Width, Equal Frequency

• Standardization and Normalization
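
The difference between the Equal Width and Equal Frequency binning methods named above can be shown with a short sketch. This is plain Python for illustration, not the nzAnalytics implementation; the function names are hypothetical, and ties in the equal-frequency case are simply split by sort order.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index using bins of identical width."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


def equal_frequency_bins(values, n_bins):
    """Assign bin indices so each bin holds about the same number of values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, index in enumerate(order):
        bins[index] = rank * n_bins // len(values)
    return bins


# Illustrative data with one outlier.
values = [1, 2, 3, 4, 5, 100]
width_bins = equal_width_bins(values, 3)
frequency_bins = equal_frequency_bins(values, 3)
```

With an outlier in the data, equal-width binning puts almost every value into the first bin, while equal-frequency binning spreads the values evenly across bins.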

In-Database Analytics

nzAnalytics

Page 28: Analiza danych przy użyciu IBM Netezza Analytics

In-database Analytics

Data Mining

• Association Rules Mining – Association: FP-Growth

• Clustering – K-Means; Hierarchical Clustering: Divisive Clustering, Agglomerative Clustering

• Feature Extraction – Dimension Reduction: Principal Components Analysis

Predictive Analytics

• Model Testing – Error Calculation: Cross Validation, Percentage Split, Train / Test

• Regression – Linear Regression, Generalized Linear Models

• Sample Size – One-Way ANOVA: Completely Randomized Design, Randomized Block Design

• Classification – Decision Trees: Entropy Decision Tree, Gini Index Decision Tree, Regression Tree; Neighborhood Methods: K Nearest Neighbors

• Bayesian Methods – Classifier: Naïve Bayes; Graphical Model: Bayesian Networks

In-Database Analytics

nzAnalytics
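
As a rough illustration of the Model Testing entries, the sketch below performs a percentage split and computes the error measures listed under General Diagnostic Measures. It is plain Python, not the nzAnalytics procedures. The function names are hypothetical, and the relative errors use the common definition relative to a predict-the-mean baseline, which is an assumption about what the product computes.

```python
def percentage_split(rows, train_fraction):
    """Split rows in order: the first part trains, the rest tests."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def error_measures(actual, predicted):
    """Return MAE, MSE, and relative absolute / squared error."""
    n = len(actual)
    mean_actual = sum(actual) / n
    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    return {
        "mae": abs_err / n,
        "mse": sq_err / n,
        # Relative errors compare the model against always predicting the mean.
        "rae": abs_err / sum(abs(a - mean_actual) for a in actual),
        "rse": sq_err / sum((a - mean_actual) ** 2 for a in actual),
    }


# Illustrative data: eight rows, 75% for training and 25% for testing.
rows = list(range(8))
train, test = percentage_split(rows, 0.75)

# Illustrative actual and predicted target values.
errors = error_measures([2.0, 4.0, 6.0, 8.0], [3.0, 4.0, 5.0, 10.0])
```

A relative error below 1 means the model beats the mean baseline on that measure.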

Page 29: Analiza danych przy użyciu IBM Netezza Analytics

What are these data mining algorithms used for?

Clustering

• Finding naturally occurring groups

– Market segmentation

– Find disease subgroups

– Distinguish normal from non-normal behavior

Association Rules Mining

• Find co-occurring items in a market basket

– Suggest product combinations

– Design better item placement on shelves

Feature Extraction

• Identify most influential attributes for a target attribute

– Factors associated with high costs, responding to an offer, etc.

[Figure: attributes A1 to A8]

Page 30: Analiza danych przy użyciu IBM Netezza Analytics

What are these data mining algorithms used for?

Classification

• Predict customers most likely to:

– Respond to a campaign or offer

– Incur the highest costs

• Target your best customers

• Develop customer profiles

Regression

• Predict a numeric value

– Predict a purchase amount or cost

– Predict the value of a home

Page 31: Analiza danych przy użyciu IBM Netezza Analytics

Association Rules Mining Example

Find co-occurring items in a market basket

Regular database

• Number of transactions: 71M

• Number of items: 250k

• Implementation in SQL

• Offline process

• Computation time: around 5 hours

IBM Netezza Analytics

• In-database Analytics using the FP-Growth algorithm

• Ability to run on-demand analysis

Support            Time   Itemsets
1% (708 208)       1m     87
0.1% (70 828)      16m    4 000
0.01% (7 082)      41m    5 583 391
0.001% (708)       51m    346 749 521
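
The role of the support threshold in the example above can be shown with a toy market basket. The sketch below counts itemsets by brute-force enumeration, which is deliberately not FP-Growth (the algorithm the slide names) and would not scale to millions of transactions. The function name, the `max_size` parameter, and the basket data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=2):
    """Count itemsets up to max_size and keep those meeting min_support."""
    min_count = min_support * len(transactions)
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))
        for size in range(1, max_size + 1):
            counts.update(combinations(items, size))
    return {itemset: n for itemset, n in counts.items() if n >= min_count}


# Illustrative baskets.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "milk"],
    ["bread", "butter"],
]
frequent = frequent_itemsets(transactions, min_support=0.5)
strict = frequent_itemsets(transactions, min_support=0.75)
```

Lowering the support threshold admits more itemsets, which is consistent with the growth in the Itemsets column as support drops.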

Page 32: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Spatial Engine

• What is it?

– Location Intelligence Extension for IBM Netezza TwinFin Appliance

• What is it used for?

– Processing queries about geographical data to perform spatial analysis

• What technology does it use?

– GGL, GEOS libraries

• What are the benefits?

– A set of functions for running GIS analysis on large volumes of data.

– Analyze spatial information all in the database.

– Better and faster analysis using spatial data.

In-Database Analytics

Open Source

Page 33: Analiza danych przy użyciu IBM Netezza Analytics

Spatial Concepts

• Goal: to process queries about geometric features or geographical data in order to perform various types of analysis.

• Examples of geographical data:

– The location of a store, a wireless service tower or other landmark

– A running feature such as street, river or power line

• Examples of spatial analysis:

– Identify the number of wireless calls that occur in a particular area so that you can better plan the addition of new towers to improve wireless service

– Calculate the driving distance from a certain point to the nearest N fire stations in order to price an insurance premium

In-Database Analytics

Open Source

Page 34: Analiza danych przy użyciu IBM Netezza Analytics

Examples of Usage

– Area

– Distance

– Length

– Perimeter

• Because IBM Netezza Spatial functions are implemented as UDFs, they can use the full potential of Netezza's Massively Parallel Processing architecture
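
The four measurements listed above (area, distance, length, perimeter) can be illustrated for planar coordinates as follows. This is a plain Python sketch, not the Netezza Spatial UDFs; the function names are hypothetical, and it assumes flat Cartesian coordinates rather than geodesic calculations on the Earth's surface.

```python
import math


def distance(p, q):
    """Euclidean distance between two points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def length(line):
    """Total length of a polyline given as a list of points."""
    return sum(distance(a, b) for a, b in zip(line, line[1:]))


def perimeter(polygon):
    """Length of the closed ring through the polygon's vertices."""
    return length(polygon + polygon[:1])


def area(polygon):
    """Shoelace formula for a simple (non-self-intersecting) polygon."""
    ring = polygon + polygon[:1]
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(ring, ring[1:]))
    return abs(twice) / 2


# Illustrative shapes: a 4 x 3 rectangle and a two-segment line.
rectangle = [(0, 0), (4, 0), (4, 3), (0, 3)]
road = [(0, 0), (3, 4), (3, 10)]
```

Each function depends only on the geometry passed to it, so applied row by row as a UDF it can run independently on every data slice.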

Page 35: Analiza danych przy użyciu IBM Netezza Analytics

SDK – nzAdaptors

• What is it?

– APIs that allow in-database user defined functions to be written in various languages

• What is it used for?

– Enable any program to run on the S-Blades (with minimal refactoring)

• What technology does it use?

– User Defined Analytic Process

• What are the benefits?

– Flexibility to build and deploy analytics/models in multiple languages

– Eliminates the need to rewrite and revalidate model scoring code

– Analytics can be written in a different language from the calling application

• Supports the following parallel operations

– Parallel execution of the analytic, model, application

• Call Interface

– Language-specific API

– Invoked via SQL

Software Development Kit

nzAdaptors for C, C++, Java, Python, Fortran

Page 36: Analiza danych przy użyciu IBM Netezza Analytics

SDK – nzPackage for R

• What is it?

– R packages that integrate the R GUI/CLI with Netezza

• Provide interfaces to tables, matrices, apply operations, and nzAnalytics

• What is it used for?

– Data frame integration with the data warehouse, pushing analytics processing to the S-Blades, scoring on the S-Blades, installation of R packages, integration with SQL, matrix integration

• What technology does it use?

– R API for creating packages, open-source CRAN packages (e.g., RODBC)

• What are the benefits?

– Ability to use S-Blades for scaling R analytics/models

– Large-scale linear algebra via Matrix

– Access to nzAnalytics from R

Software Development Kit

nzPackage for R

Page 37: Analiza danych przy użyciu IBM Netezza Analytics

SDK – nzPlug-in for Eclipse

• What is it?

– A plug-in for Eclipse that facilitates easier development of UDFs and Stored Procedures

• What is it used for?

– UDFs and Stored Procedure wizards

– Remote SSH terminal, database object explorer, SQL editors, source code control, issue management, system monitoring, documentation builder

• What technology does it use?

– Eclipse

• What are the benefits?

– Faster, more targeted development

– Leverage the many available open-source plug-ins for Eclipse

Software Development Kit

nzPlug-in for Eclipse

Page 38: Analiza danych przy użyciu IBM Netezza Analytics

Netezza plugin for Eclipse

• What's included

– Predefined Project Perspective

– NZ Admin

– NZ Cartridge Manager

– Logs Browser

– Editors with Syntax Highlighting

– Remote Console and SSH Terminals

– Template Wizards (NZ project, UDX, UDTF, Stored Procedures, Makefile, …)

– Synchronization between local and remote projects

– Data Tools – Database Object Explorer, SQL Editors, Data Explorer, … (with support for Netezza database)

Software Development Kit

nzPlug-in for Eclipse

Page 39: Analiza danych przy użyciu IBM Netezza Analytics


What are the key points?

Page 40: Analiza danych przy użyciu IBM Netezza Analytics

Key Points

• Target Audience

– Line of Business, BI, Data Miners, Modelers, Quants, Statisticians, Programmers

• IBM Netezza Analytics Uses

– Embedding algorithms, exploratory data analysis, building model, deploying/scoring model

• 3 Major Components of IBM Netezza Analytics

– Parallel Analytic Engines, SDK, In-Database Analytics

• Streaming Accelerator

– Unique differentiator for combination of data stream processing, in-database analytics processing and inter-node processing

Page 41: Analiza danych przy użyciu IBM Netezza Analytics

Key Differentiators

• Faster and scalable analytics processing

• Parallelized in-database analytics

• Large scale matrix operations

• Rich development environment

Page 42: Analiza danych przy użyciu IBM Netezza Analytics

Key Benefits

• Eliminates inefficient analytics data processing - data remains in place

• Speeds up time to insight, action & business value

• Achieves parallelism without parallel programming

• Enables increased analytics experimentation

• Protects and leverages investment in existing analytics

• Reduces technology barriers for large scale analytics

Page 43: Analiza danych przy użyciu IBM Netezza Analytics

IBM Netezza Analytics

Big Math

Big Data

Page 44: Analiza danych przy użyciu IBM Netezza Analytics

Thank you

Your Data. Your Site. Our Appliance.