Grzegorz Puchawski
Data analysis within IBM Netezza Analytics
14 June 2011, Warsaw, Sheraton Warsaw Hotel
In a nutshell, what is IBM Netezza Analytics?
Big Data Meets Big Math
Analytics Without Constraints
Massive Data and Massive Computation
• Data Intensity
– Depth of Data
– Width of Data
• Computational Intensity
– Computational Complexity
– Model Complexity
• In-Database Analytics
– Open Source Analytics: R Analytics
– nzAnalytics: Data Prep, Data Mining, Predictive Analytics
– Spatial
– Custom: Customer / Partner Analytics
• Software Development Kit
– nzAdaptors for C, C++, Java, Python, Fortran
– nzPlug-in for Eclipse
– nzPackage for R GUI
• Parallel Analytic Engines
– nzMatrix
– nzEngine for R
– nzEngine for Hadoop
• Streaming Accelerator
• IBM Netezza AMPP™ Platform
Who is the target audience for IBM Netezza Analytics?
• Line of Business Owner
– Areas of Interest – Gaining / sustaining competitive advantage, discovering new opportunities to increase revenue or decrease costs, ability to use all collected data
– Benefits – Fast results, add significant business value / big bets, leverage all the data, performance at scale
• Business Intelligence
– Areas of Interest – Analysis beyond SQL, analytics dashboards and reports
– Benefits – Rich set of analytics beyond SQL
• Data Miners
– Areas of Interest – Marketing, life sciences, fraud, network analysis
– Benefits – Ability to explore more data, quick to failure, identify new opportunities, new package of analytic tools, ability to process large data
• Modelers
– Areas of Interest – Logistics, yield, forecasting, risk
– Benefits – Simplification of analytic processes, ability to use new and innovative models, quick to failure, model at scale using parallelized analytics, score at scale
• Quants / Statisticians
– Areas of Interest – Risk, forecasting, descriptive statistics, correlation of factors
– Benefits – Simplification of analytic processes, quick to failure, in-database analytics
• Programmers, Developers
– Areas of Interest – Low level programming tools, multi-language environment, User Defined Functions (UDFs), User Defined Analytic Process (AEs), Eclipse
– Benefits – Power and simplification of in-database analytics, flexibility of porting analytics/application
How is the IBM Netezza Analytics platform used?
High Performance on Massive Data
1. Exploratory Data Analysis
• Data Exploration
• Data Cleansing
• Data Transformation
2. Embed Algorithms
• Embarrassingly Parallel Algorithms
• Heroic Computations
• Model Parallelism
3. Build Model
• Descriptive Modeling
• Predictive Modeling
• Optimization Model
4. Deploy Model
• Scoring
• Forecasting
• Decision Management
Embed Algorithms
– User Interface – Eclipse, R GUI/CLI
– Development Env. – UDF, UDAP, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R
Exploratory Data Analysis
– User Interface – SQL, R GUI/CLI
– Analytics – nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
Build Model
– User Interface – SQL, R GUI/CLI, Eclipse
– Analytics – nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
Deploy / Score Model
– Analytics – nzAnalytics, R Analytics, Customer Analytics, Partner Analytics
– Deploy/Scoring – nzAdaptors, UDF, UDAP, Shared Library, Stored Procedures, nzPackage for R
Embedding Algorithms
• What is it?
– The ability to run programs directly on the S-Blade
• What is it used for?
– Bringing complex computation to the Netezza data stream
• What technology does it use?
– User Interface – Eclipse, R GUI/CLI
– Development Environment - UDFs, User Defined Analytic Process, Stored Procedures, Shared Libraries, nzAdaptors, nzMatrix, R Packages (for implementing algorithms run from R GUI)
• What are the benefits?
– Ability to process data as it streams, directly on the S-Blade
– Ability to harness total compute power of a TwinFin for parallel processing
Exploratory Data Analysis
• What is it?
– The exercise of looking at data for the purpose of coming up with hypotheses
• What is it used for?
– Exploratory data analysis – Data profiling/ Descriptive Statistics, General Diagnostic Measures, Statistics, Sampling, Histograms
– Data cleansing – Feature selection
– Data transformation – Data Prep / Transformations
• What technology does it use?
– User interface – SQL, R GUI/CLI, others
– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics
• What are the benefits?
– Discovery on more data, faster
Build Model
• What is it?
– Choosing which method will give the best results
– Finding the best parameters to give the best predictions
• What is it used for?
– Predictive Analytics – Regression, Classification, Bayesian Networks, Model Testing, Sample Size
– Data Mining – Association Rules Mining, Clustering, Feature Selection
• What technology does it use?
– User interface – SQL, R GUI/CLI, Eclipse, ...
– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics
– Development tools – Language Adapters, UDFs, UDAP, Stored Procedures
• What are the benefits?
– Moving the computation processing to the data
– Parallel computational processing on all of the data
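The idea of moving computation to the data can be sketched in a few lines. The example below is only an illustration in plain Python, not Netezza code: each "data slice" is a hypothetical in-memory list, every slice computes a small partial result locally, and only those partial results are merged. The function names are made up for this sketch.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_stats(rows):
    """Per-slice work: return (count, sum, sum of squares) for one data slice."""
    return len(rows), sum(rows), sum(x * x for x in rows)


def merge_stats(partials):
    """Combine per-slice results into a global mean and population variance."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance


def parallel_mean_variance(slices):
    """Run the per-slice step concurrently, then merge the small partial results."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(partial_stats, slices))
    return merge_stats(partials)


if __name__ == "__main__":
    # Three hypothetical data slices holding parts of one column.
    slices = [[1.0, 2.0], [3.0, 4.0], [5.0]]
    print(parallel_mean_variance(slices))
```

Only three numbers per slice travel between the workers and the merge step, regardless of how many rows each slice holds.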
Deploying / Scoring Model
• What is it?
– Parallelized application of a model using parameters from the build step
• What is it used for?
– Predictive Analytics – Regression, Classification, Bayesian Networks, Model Testing
– Data Mining – Association Rules Mining, Clustering, Feature Extraction
• What technology does it use?
– Analytics – In-database Analytics, R Analytics, Customer/Partner Analytics
– Deploying/Scoring – Language Adapters, UDFs, User Defined Analytic Process, Shared Libraries, Stored Procedures, R package
– Development tools – Language Adapters, UDFs, AE, Stored Procedures
• What are the benefits?
– Score and experiment in parallel
– Faster model scoring and therefore time to insight/value
What is in IBM Netezza Analytics?
Streaming Accelerator
• What is it?
– Our unique differentiator that combines our historical strength in fast data stream processing with powerful in-database analytics processing and new inter-node analytics processing capabilities
• What is it used for?
– Parallelizing data and analytics processing
• What technology does it use?
– FPGA
– UDFs, User Defined Analytic Process
– Message Passing Interface (MPI) for distributed processing
• What are the benefits?
– Accelerates data processing for analytics
– Accelerates parallel matrix operations on big data
– Simplifies parallelization
IBM Netezza Matrix Engine
• What is it?
– Parallelized linear algebra package
• What is it used for?
– Building block for higher order parallelized analytics
• What technology does it use?
– Scalable Linear Algebra Package (ScaLAPACK)
– Message Passing Interface (MPI) for distributed processing
• What are the benefits?
– Simplifies analytic algorithm and model development
– Accelerates parallel matrix operations on big data
• Supports the following parallel matrix operations
– Basic Linear Algebra Subroutines (e.g., Matrix Multiplication, Matrix Dot Function)
– Solving a System of Linear Equations
– Solving Linear Least Squares Problems
– Eigenvalues and Eigenvectors
– Singular Value Decomposition (SVD)
– Matrix Factorization
– Matrix Inversion
– Matrix Element Scalar Functions
– Matrix Reduction Functions (e.g. min, max, sum of squares, sum)
– Matrix Inquiry Functions (e.g. number of rows and columns)
– Matrix Reshaping Functions
• Call Interface
– Accessible from R, Python, Java, etc. via ODBC and Stored Procedures
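As a rough illustration of what "parallel matrix operations" means, the sketch below multiplies two matrices by splitting the left matrix into row blocks and multiplying each block independently. It is plain Python with hypothetical function names, not the nzMatrix call interface, and it does not use ScaLAPACK or MPI; it only shows the row-block decomposition idea.

```python
from concurrent.futures import ThreadPoolExecutor


def multiply_block(args):
    """Multiply one block of rows of A by the full matrix B."""
    a_rows, b = args
    b_cols = list(zip(*b))  # columns of B
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a_rows]


def split_rows(a, parts):
    """Split matrix A into at most `parts` contiguous row blocks."""
    size = max(1, -(-len(a) // parts))  # ceiling division, at least 1
    return [a[i:i + size] for i in range(0, len(a), size)]


def parallel_matmul(a, b, parts=2):
    """Compute A x B by processing row blocks of A concurrently."""
    blocks = split_rows(a, parts)
    with ThreadPoolExecutor(max_workers=parts) as pool:
        results = list(pool.map(multiply_block, [(blk, b) for blk in blocks]))
    # Concatenate the result blocks in their original order.
    return [row for block in results for row in block]


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6]]
    b = [[7, 8], [9, 10]]
    print(parallel_matmul(a, b))
```

Each row block needs all of B but none of the other blocks of A, which is why this operation parallelizes without coordination between workers.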
IBM Netezza Engine for Hadoop
• What is it?
– Hadoop-compatible implementation of the MapReduce paradigm
• What is it used for?
– Clickstream & social data analysis
– ETL/ELT and analytics processing of key/value pairs
• What technology does it use?
– Java User Defined Analytic Process
• What are the benefits?
– Enables effective parallel processing of data from Netezza database tables
– Brings Hadoop to the database with minimal refactoring of existing Hadoop code
– Only database offering a Hadoop interface (all others are home-grown)
Hadoop by Apache vs Hadoop by Netezza
[Diagram: in both implementations, input data flows through mappers (Mapper 1–4), a redistribution step, and reducers. Apache Hadoop reads from and writes to HDFS and runs on cluster nodes; the Netezza engine reads an input table (dataslices, Slice 1–4), runs on SPUs, and writes an output table (dataslices).]
Netezza Engine for Hadoop: Example
• Example: Clickstream analysis
– Data:
• Table containing data about users and visited pages
• User groups’ definitions
– Task:
• For each group, find all pages that have been visited by all members of this group
• Sample data: Clickstream analysis

Visited pages (input):
USER  URL
A     ibm.com
A     netezza.com
A     sheraton.pl
B     ibm.com
D     netezza.com
D     apache.org

Group definitions (input):
GROUP   USER
FIRST   A
FIRST   B
SECOND  A
SECOND  D

Result:
GROUP   URL
FIRST   ibm.com
SECOND  netezza.com
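The clickstream task above can be expressed as a map step, a redistribution (shuffle) step and a reduce step. The sketch below is a single-process Python illustration of that flow using the sample data from the slide. It is not the Netezza Engine for Hadoop API; all function names are hypothetical.

```python
from collections import defaultdict

# Sample data from the slide.
VISITS = [("A", "ibm.com"), ("A", "netezza.com"), ("A", "sheraton.pl"),
          ("B", "ibm.com"), ("D", "netezza.com"), ("D", "apache.org")]
GROUPS = [("FIRST", "A"), ("FIRST", "B"), ("SECOND", "A"), ("SECOND", "D")]


def build_membership(groups):
    """Turn (group, user) rows into {group: set of users}."""
    members = defaultdict(set)
    for group, user in groups:
        members[group].add(user)
    return members


def mapper(visit, members):
    """Emit ((group, url), user) for every group the visiting user belongs to."""
    user, url = visit
    for group, users in members.items():
        if user in users:
            yield (group, url), user


def shuffle(pairs):
    """Redistribution: collect all values that share the same key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reducer(key, users, members):
    """Keep (group, url) only if every member of the group visited the url."""
    group, url = key
    if set(users) == members[group]:
        yield group, url


def pages_visited_by_all(visits, groups):
    members = build_membership(groups)
    mapped = [pair for visit in visits for pair in mapper(visit, members)]
    grouped = shuffle(mapped)
    results = []
    for key in sorted(grouped):
        results.extend(reducer(key, grouped[key], members))
    return results


if __name__ == "__main__":
    for group, url in pages_visited_by_all(VISITS, GROUPS):
        print(group, url)
```

On the sample data this yields FIRST / ibm.com and SECOND / netezza.com, matching the result table on the slide.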
IBM Netezza Engine for R
• What is it?
– Native R running pushed down onto the S-Blade for parallel analytics processing
• What is it used for?
– Exploratory data analysis, building models, scoring models, etc
• What technology does it use?
– Open Source R
– User Defined Analytic Process, Data Stream Processing
• What are the benefits?
– Accelerates and scales R to run on big data
– Leverage open-source CRAN repository of algorithms
• Supports the following parallel R operations
– R interpreter running in parallel
– R CRAN Analytics applied in parallel
• Call Interface
– Invoked via SQL (as a User Defined Analytic Process) or from R
In-database Analytics
• What is it?
– Parallelized in-database analytics for data prep, data mining, prediction, and geospatial
• What is it used for?
– Building and deploying/scoring models
• What technology does it use?
– UDFs, Stored Procedures, User Defined Analytic Process, nzMatrix
• What are the benefits?
– Starter kit of parallelized analytics designed for a parallel environment and for large-scale data
In-database Analytics: Data Prep
Data Profiling / Descriptive Statistics
• Probability Density and Inverse Functions – Normal, Fisher, Exponential, Uniform, Weibull, Wilcoxon, Mann-Whitney, Student's t, Chi-Square
General Diagnostic Measures
• Error Calculation – Classification Error, Mean Absolute Error, Mean Squared Error, Relative Absolute Error, Relative Squared Error
Sampling
• Uniform Random Sampling – Uniform Random Sampling Count, Uniform Random Sampling Fraction
Statistics
• Histogram and Frequency Table – Histogram, Bivariate Frequency Table, Univariate Frequency Table
• Quantiles – Quantiles, Median, Outliers, Quartile
• Parametric Statistics – Chi-Square, Student's t
• Non-Parametric Statistics – Spearman’s Rank Correlation, Mann-Whitney-Wilcoxon, Wilcoxon
• Moments – Kurtosis, Skewness
Data Prep / Transformations
• Binning and Discretization – Entropy Minimization, Equal Width, Equal Frequency
• Standardization and Normalization
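The error measures listed under General Diagnostic Measures can be made concrete with a short sketch. The formulas below are the common textbook definitions (the relative measures compare the model against always predicting the mean of the actual values). Whether the product uses exactly these definitions is an assumption, and the function names are hypothetical.

```python
def classification_error(actual, predicted):
    """Fraction of predictions that differ from the actual class."""
    wrong = sum(1 for a, p in zip(actual, predicted) if a != p)
    return wrong / len(actual)


def mean_absolute_error(actual, predicted):
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def relative_absolute_error(actual, predicted):
    """Total absolute error relative to a predictor that always returns the mean."""
    mean = sum(actual) / len(actual)
    baseline = sum(abs(a - mean) for a in actual)
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / baseline


def relative_squared_error(actual, predicted):
    """Total squared error relative to a predictor that always returns the mean."""
    mean = sum(actual) / len(actual)
    baseline = sum((a - mean) ** 2 for a in actual)
    return sum((p - a) ** 2 for a, p in zip(actual, predicted)) / baseline


if __name__ == "__main__":
    actual = [1.0, 2.0, 3.0, 4.0]
    predicted = [1.0, 2.0, 3.0, 8.0]
    print(mean_absolute_error(actual, predicted))
    print(mean_squared_error(actual, predicted))
    print(relative_absolute_error(actual, predicted))
    print(relative_squared_error(actual, predicted))
```

A relative error below 1 means the model does better than the mean baseline; a value above 1 means it does worse.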
In-database Analytics: Data Mining
Association Rules Mining
• Association – FP-Growth
Clustering
• K-Means
• Hierarchical Clustering – Divisive Clustering, Agglomerative Clustering
Feature Extraction
• Dimension Reduction – Principal Components Analysis

In-database Analytics: Predictive Analytics
Model Testing
• Error Calculation – Cross Validation, Percentage Split, Train / Test
Regression
• Linear Regression – Generalized Linear Models
Sample Size
• One-Way ANOVA – Completely Randomized Design, Randomized Block Design
Classification
• Decision Trees – Entropy Decision Tree, Gini Index Decision Tree, Regression Tree
• Neighborhood Methods – K Nearest Neighbors
Bayesian Methods
• Classifier – Naïve Bayes
• Graphical Model – Bayesian Networks
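Two of the model-testing schemes named above, percentage split and cross validation, differ only in how rows are divided between training and testing. The sketch below shows one simple, deterministic way to form those partitions in plain Python. The function names and the round-robin fold assignment are choices made for this illustration, not a description of the product's behaviour.

```python
def percentage_split(rows, train_percent):
    """Use the first train_percent % of rows for training and the rest for testing."""
    cut = len(rows) * train_percent // 100
    return rows[:cut], rows[cut:]


def k_fold(rows, k):
    """Yield (train, test) pairs; row j is in the test set of fold j % k."""
    for fold in range(k):
        test = rows[fold::k]
        train = [row for j, row in enumerate(rows) if j % k != fold]
        yield train, test


if __name__ == "__main__":
    data = list(range(10))
    train, test = percentage_split(data, 70)
    print(len(train), len(test))
    for train, test in k_fold(data, 5):
        print(test)
```

In practice rows are usually shuffled before splitting; the shuffle is left out here so the output stays reproducible.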
What are these data mining algorithms used for?
Clustering
• Finding naturally occurring groups
– Market segmentation
– Find disease subgroups
– Distinguish normal from non-normal behavior
Association Rules Mining
• Find co-occurring items in a market basket
– Suggest product combinations
– Design better item placement on shelves
Feature Extraction
• Identify most influential attributes for a target attribute
– Factors associated with high costs, responding to an offer, etc.
Classification
• Predict customers most likely to:
– Respond to a campaign or offer
– Incur the highest costs
• Target your best customers
• Develop customer profiles
Regression
• Predict a numeric value
– Predict a purchase amount or cost
– Predict the value of a home
Association Rules Mining Example
Find co-occurring items in a market basket
Regular database
• # transactions = 71M
• # items = 250k
• Implementation in SQL
• Offline process
• Computation time: ~5 hours
IBM Netezza Analytics
• In-database Analytics using the FP-Growth algorithm
• Ability to run on-demand analysis

Support          Time  Itemsets
1% (708 208)     1m    87
0.1% (70 828)    16m   4000
0.01% (7 082)    41m   5 583 391
0.001% (708)     51m   346 749 521
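To make the notion of support concrete: an itemset is "frequent" when it appears in at least a given fraction of all transactions, and lowering that threshold makes the number of frequent itemsets grow quickly, as the table above shows. The sketch below counts frequent itemsets by brute-force enumeration. It is deliberately naive and is not the FP-Growth algorithm used in the example; the function name and the tiny basket data are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_itemsets(transactions, min_support, max_size=3):
    """Return {itemset: count} for itemsets found in at least
    min_support (a fraction, e.g. 0.01 for 1%) of the transactions."""
    counts = Counter()
    for basket in transactions:
        items = sorted(set(basket))  # ignore duplicates within one basket
        for size in range(1, max_size + 1):
            for itemset in combinations(items, size):
                counts[itemset] += 1
    threshold = min_support * len(transactions)
    return {itemset: c for itemset, c in counts.items() if c >= threshold}


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "butter"],
        ["bread", "milk", "butter"],
        ["milk"],
    ]
    for itemset, count in sorted(frequent_itemsets(baskets, 0.5).items()):
        print(itemset, count)
```

Enumerating every combination is only feasible for tiny baskets; algorithms such as FP-Growth avoid generating candidates that cannot reach the threshold.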
IBM Netezza Spatial Engine
• What is it?
– Location Intelligence Extension for IBM Netezza TwinFin Appliance
• What is it used for?
– Processing queries about geographical data to perform spatial analysis
• What technology does it use?
– GGL, GEOS libraries
• What are the benefits?
– A set of functions for running GIS analysis on large data sets.
– Analyze spatial information all in the database.
– Better and faster analysis using spatial data.
Spatial Concepts
• Goal: to process queries about geometric features or geographical data in order to perform various types of analysis.
• Examples of geographical data:
– The location of a store, a wireless service tower or other landmark
– A running feature such as street, river or power line
• Examples of spatial analysis:
– Identify the number of wireless calls that occur in a particular area so that you can better plan the addition of new towers to improve wireless service
– Calculate the driving distance from a certain point to the nearest N fire stations in order to calculate the cost of an insurance premium
Examples of Usage
– Area
– Distance
– Length
– Perimeter
• Because IBM Netezza Spatial functions are implemented as UDFs, they can utilize the full potential of Netezza’s Massively Parallel Processing architecture
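The four measures named above can be illustrated on simple planar geometry. The sketch below assumes flat Cartesian coordinates and uses hypothetical function names; it is not the IBM Netezza Spatial UDF interface, and real GIS functions also have to deal with coordinate systems and the curvature of the Earth.

```python
import math


def distance(p, q):
    """Straight-line distance between two points (x, y)."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def length(line):
    """Length of a polyline given as a list of points."""
    return sum(distance(a, b) for a, b in zip(line, line[1:]))


def perimeter(polygon):
    """Perimeter of a polygon given as a list of vertices (not repeated at the end)."""
    return length(polygon + polygon[:1])


def area(polygon):
    """Area of a simple polygon using the shoelace formula."""
    closed = polygon[1:] + polygon[:1]
    twice = sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in zip(polygon, closed))
    return abs(twice) / 2


if __name__ == "__main__":
    plot = [(0, 0), (4, 0), (4, 3), (0, 3)]  # a 4 x 3 rectangle
    print(area(plot), perimeter(plot))
    print(distance((0, 0), (3, 4)))
    print(length([(0, 0), (3, 4), (3, 10)]))
```

Functions like these take one row's geometry as input and return a scalar, which is the shape of computation that a UDF can evaluate independently on every data slice.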
SDK – nzAdaptors
• What is it?
– APIs that allow in-database user defined functions to be written in various languages
• What is it used for?
– Enable any program to run on the S-Blades (with minimal refactoring)
• What technology does it use?
– User Defined Analytic Process
• What are the benefits?
– Flexibility to build and deploy analytics/models in multiple languages
– Eliminates the need to rewrite and revalidate model scoring code
– Analytics can be written in a different language from that of the calling application
• Supports the following parallel operations
– Parallel execution of the analytic, model, application
• Call Interface
– Language-specific API
– Invoked via SQL
SDK – nzPackage for R
• What is it?
– R packages that integrate the R GUI/CLI with Netezza
• Provide interfaces to tables, matrices, apply operations, and nzAnalytics
• What is it used for?
– Data frame integration with the data warehouse, pushing analytics processing to S-Blades, scoring on S-Blades, installation of R packages, integration with SQL, Matrix integration
• What technology does it use?
– R API for creating packages, open-source CRAN packages (e.g., RODBC)
• What are the benefits?
– Ability to use S-Blades for scaling R analytics/models
– Large-scale linear algebra via Matrix
– Access to nzAnalytics from R
SDK – nzPlug-in for Eclipse
• What is it?
– A plug-in for Eclipse that facilitates easier development of UDFs and Stored Procedures
• What is it used for?
– UDFs and Stored Procedure wizards
– Remote SSH terminal, database object explorer, SQL editors, source code control, issue management, system monitoring, documentation builder
• What technology does it use?
– Eclipse
• What are the benefits?
– Faster, more targeted development
– Leverage the many available open-source plug-ins for Eclipse
Netezza plugin for Eclipse
• What’s included
– Predefined Project Perspective
– NZ Admin
– NZ Cartridge Manager
– Logs Browser
– Editors with Syntax Highlighting
– Remote Console and SSH Terminals
– Template Wizards (NZ project, UDX, UDTF, Stored Procedures, Makefile, …)
– Synchronization between local and remote projects
– Data Tools – Database Object Explorer, SQL Editors, Data Explorer, … (with support for Netezza database)
What are the key points?
Key Points
• Target Audience
– Line of Business, BI, Data Miners, Modelers, Quants, Statisticians, Programmers
• IBM Netezza Analytics Uses
– Embedding algorithms, exploratory data analysis, building model, deploying/scoring model
• 3 Major Components of IBM Netezza Analytics
– Parallel Analytic Engines, SDK, In-Database Analytics
• Streaming Accelerator
– Unique differentiator for combination of data stream processing, in-database analytics processing and inter-node processing
Key Differentiators
• Faster and scalable analytics processing
• Parallelized in-database analytics
• Large scale matrix operations
• Rich development environment
Key Benefits
• Eliminates inefficient analytics data processing - data remains in place
• Speeds up time to insight, action & business value
• Achieves parallelism without parallel programming
• Enables increased analytics experimentation
• Protects and leverages investment in existing analytics
• Reduces technology barriers for large scale analytics
IBM Netezza Analytics
Big Math
Big Data
Thank you
Your Data. Your Site. Our Appliance.