© Supermicro 2012
The Infrastructure of Tomorrow, Today – Integrating Supermicro, Greenplum and SAS
to enable Big Data Analytics
Jeff Tsai 蔡穎碩
Solution Manager
Agenda
Big Data Analytics Platform & Infrastructure
EMC+Supermicro
1,000 Nodes Hadoop Cluster
“Big Data Is Less About Size, And More About Freedom” - TechCrunch
“Findings: ‘Big Data’ Is More Extreme Than Volume” - Gartner
“Big Data! It’s Real, It’s Real-time, and It’s Already Changing Your World” - IDB
“Total data: ‘bigger’ than big data” - 451 Group
THE ERA OF
BIG DATA IS HERE…
Data Sources Are Expanding
THE DIGITAL UNIVERSE WILL GROW 44X IN THE NEXT 10 YEARS
Source: 2011 IDC Digital Universe Study
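To put the 44X figure in yearly terms, the sketch below converts a growth multiple into the constant annual rate that compounds to it. It assumes smooth, constant growth over the ten years, which the study itself does not claim; the function name is illustrative only.

```python
def implied_annual_growth(multiple: float, years: int) -> float:
    """Constant yearly growth rate that compounds to `multiple` over `years`.

    Assumption: growth is spread evenly across the period.
    """
    return multiple ** (1.0 / years) - 1.0


# 44X over 10 years, as quoted on the slide
rate = implied_annual_growth(44, 10)
print(f"Implied annual growth: {rate:.1%}")
```

Under that assumption, 44X in ten years works out to roughly 46% growth per year.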
BIG Data is Just a Bunch of Data to Store…?
[Chart: storage capacity sold, 2009-2014, file-based vs. block-based]
Source: IDC
File Based: 60.7% CAGR Block Based: 21.8% CAGR
By 2012, 80% of all storage capacity sold will be for file-based data
Big Data Sources
OR
To Create Significant Value for Your Business…
HOW?...
Make BIG Data
Accessible
Identify the data source
Store the data
Connect applications and users
Utilize the data in different views
EMC UAP Solutions – Analytics Platform
This is what my analytics environment looks like…
Building The Big Data Analytics “Stack”
Greenplum Chorus: Enterprise Collaboration Platform for Data
Greenplum Database
Enterprise & Community Editions
World’s Most Scalable MPP Database Platform
Analytic Toolsets (Business Analytics, BI, Statistics, etc.)
Greenplum HD
Hadoop Enterprise & Community Editions
Enterprise Analytics Platform for Unstructured Data
Greenplum Data Computing Appliances: Purpose-built for Big Data Analytics
EMC ACQUIRES GREENPLUM IN JULY 2010
“For three years, Gartner has identified Greenplum as the most advanced vendor in the visionary quadrant of its data warehouse DBMS Magic Quadrant….” - Gartner
Greenplum Becomes the Foundation of EMC’s Data Computing Division
SAS at a Glance
Company Highlights:
• Founded 1976; 11,000+ employees in 400+ offices
• 2010 worldwide revenue: $2.43 B
• IDC: SAS is the leader in analytics with a 34.5% market share (Analytics and Reporting)
• 4.5 million users worldwide
• 50,000+ sites in 114 countries
• From Tools to Vertical Solutions Services
[Pie chart: breakdown by industry]
Financial Services: 42%
Retail: 4%
Other: 2%
Manufacturing: 6%
Healthcare & Life Sciences: 8%
Government: 14%
Energy & Utilities: 2%
Education: 3%
Communications: 8%
(unlabeled segment): 11%
Overview
Revenues: FY09 $500M, FY10 $721M, FY11 ~$1B
Global Footprint: >100 Countries
Production: US, EU and Asia Production facilities
Engineering: 70% of workforce in engineering (30% growth through recession)
Market Share: #1 Server Channel (SMCI enables ~10% of global server market)
Brand Equity: Growing public profile since 2007 IPO
Corporate Focus: Energy Efficiency, Earth-friendly, Green Technology Innovation
Founded in 1993; HQ: San Jose, CA; NASDAQ: SMCI (IPO 2007)
SMC Inc., HQ, San Jose, CA
SMC BV, The Netherlands
SMC TW, Taiwan
Product Family
Resource Optimized (WIO/UIO)
Twin Architecture
GPU SuperComputing
Embedded
SuperBlade
Storage Server
Workstation Mainstream Business Solutions
Application Optimized: Multi I/O
Data Center Optimized
Server Building Block Solutions®
>550 Motherboards
>1300 Chassis
>350 Cooling Modules
>140 Power Supplies
Open CPU / Memory
Operating Systems / Applications
In-House Design and Server Building Block Solutions®
Technology Partners
Customer Requirements
OEM Specs
In-House Design
Optimized Data Center
Tri-Lab
(1) As of Q2, 2009
Server Building Block Solutions®
Application Optimized
Big Data Analytics on Hadoop
Internet companies are not built on SQL but are building Analytics on Hadoop/NoSQL
Existing Hadoop Users (Internet)
This is what I think my analytics environment looks like…
Pig
Hadoop System Management & Coordination
Hadoop Storage
MapReduce Layer
ETL Tools
Web Portal, Social Networks
Hive
BI & Reporting
HBase
Web Apps
Hadoop Components (hadoop.apache.org)
• HDFS: Hadoop Distributed File System
• MapReduce: Framework for writing scalable data applications
• Pig: Procedural language that abstracts lower-level MapReduce
• Zookeeper: Highly reliable distributed coordination
• Hive: Data warehouse infrastructure built on top of Hadoop
• HBase: Database for random, real-time read/write access
• Oozie: Workflow/coordination to manage jobs
• Mahout: Scalable machine learning libraries
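The MapReduce programming model in the list above can be illustrated with a small, single-process word count. This is only a sketch of the map, shuffle, and reduce steps in plain Python; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Shuffle: group all emitted values by key (the framework does this on a real cluster)."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Reduce: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle, and reduce in sequence inside one process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data is here", "big data is real"]))
```

On a real cluster the map and reduce functions run in parallel on many nodes against data stored in HDFS; higher-level tools such as Pig and Hive generate this kind of job from a script or query.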
What can Hadoop do for you?
Financial Services
• Knowing customers better
• Risk analysis and management
• Fraud detection and security analytics
Telecommunications
• Customer churn prevention
• Price optimization and marketing
• Network analysis and optimization
• Customer experience management
Healthcare
• Patient care quality
• Drug development
Web & e-Tailing
• Web usage, click stream behavior
• Market & customer segmentation
• Ad customer targeting
• On-line fraud detection
Government
• Fraud detection
• Compliance and regulatory analytics
Retail
• Market and consumer segmentation
• Merchandising and cross-selling
• Promotion and campaign analysis
Data Source: Cloudera
Hadoop Use Cases
LinkedIn – “People You May Know” and other facts
Yahoo! – Hadoop to support ad systems and web search
Visa – Credit card fraud detection and analysis
T-Mobile – Churn analysis, user experience
Amazon, Baidu, AOL, eBay, Facebook, Twitter, …
Data Source: Cloudera
Hadoop Cluster HW selection
What’s the HW configuration for Hadoop clusters?...
It depends; workloads matter.
CPU Intensive
Machine learning
Natural language processing
Complex data mining
Feature extraction
I/O Intensive
Data importing and exporting
Indexing
Searching
Grouping
Decoding/decompressing
Data Storage
Capacity
Number of data mirrors (replicas)
TCO
Rack space
Power consumption
Different workloads
General Configuration
2 Quad Core CPUs
16-96GB Memory
2 x GE (Gigabit Ethernet)
1TB-2TB Disk x n
1U/2U Rack mount
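As a rough illustration of how capacity and data mirroring interact when sizing a cluster from the general configuration, the sketch below estimates usable HDFS capacity. The replication factor of 3 is the HDFS default; the 25% reserve for temporary and non-HDFS data and the 12 disks per node are hypothetical values chosen only for the example, since the slide leaves the disk count as "n".

```python
def usable_capacity_tb(
    nodes: int,
    disks_per_node: int,
    disk_tb: float,
    replication: int = 3,            # HDFS default replication factor
    reserve_fraction: float = 0.25,  # assumed share kept free for temp/non-HDFS data
) -> float:
    """Estimate usable capacity in TB after the reserve and replication are accounted for."""
    raw_tb = nodes * disks_per_node * disk_tb
    return raw_tb * (1.0 - reserve_fraction) / replication


# Hypothetical example: 10 data nodes, 12 x 2 TB disks each
print(usable_capacity_tb(nodes=10, disks_per_node=12, disk_tb=2.0))
```

With these assumed inputs, 240 TB of raw disk yields about 60 TB of usable space, which is why the number of data mirrors belongs in the sizing discussion alongside raw capacity.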
Production-scale testing of Apache Trunk & hosted environment for customer POCs
Proven at Scale with Worldwide Support
Industry’s largest Hadoop support team
Industry’s most accomplished Hadoop talent (from Yahoo!, LinkedIn, Talend, etc.)
Tested at scale on the Greenplum Analytics Workbench
1,000-node, 24-petabyte cluster
Multi-million dollar investment by EMC and partners
Reduced risk for EMC customers
Certification of partner products
Bringing Rapid Innovation to Hadoop
Supermicro Server Functions in the Cluster
Supermicro Data Nodes
Supermicro Infrastructure Nodes
2U Storage Server
2U Twin2 Server
• 1,000+ Physical Supermicro Server Nodes (10k virtual nodes)
• 12,000 Processor Cores
• 24 Petabytes of Storage Capacity (6Gbps SATA)
• 48 Terabytes RAM
• 56 Gbps Infiniband Connectivity
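The cluster totals above can be broken down into rough per-node averages. The sketch below assumes exactly 1,000 physical nodes (the slide says "1,000+"), an even spread of resources across nodes, and decimal units (1 PB = 1,000 TB, 1 TB = 1,000 GB); the function name is illustrative.

```python
from typing import Dict


def per_node_averages(nodes: int, cores: int, storage_pb: float, ram_tb: float) -> Dict[str, float]:
    """Average resources per physical node, assuming an even spread and decimal units."""
    return {
        "cores": cores / nodes,
        "storage_tb": storage_pb * 1000 / nodes,
        "ram_gb": ram_tb * 1000 / nodes,
    }


# Totals quoted on the slide; the node count is rounded down to 1,000
averages = per_node_averages(nodes=1_000, cores=12_000, storage_pb=24, ram_tb=48)
print(averages)
```

Under these assumptions each node averages 12 cores, 24 TB of disk, and 48 GB of RAM, which sits within the 16-96GB memory range of the general configuration.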
Supermicro Multi-Node Server Solutions
Switch Data Center - Las Vegas NV
…Results before fine-tuning.
World record performance results expected to be announced before 2013.
Initial Benchmark Data
[Chart: benchmark results, vertical axis in minutes]
Other testing programs – Supermicro & Intel
CPU Benchmark
Supermicro Advantages
Why Supermicro…
Building Blocks for Different Workloads & Requirements
- Models to meet any Hadoop workload
- I/O, CPU, Disks, Density
- Customized to specific workload requirements
High Efficiency, High Quality
- Green IT
- High-efficiency power
- High quality for the highest system availability and best utilization
Proven Solutions
- EMC Greenplum proven solutions
- 100% Apache Hadoop compatible
- Benchmark and testing programs with partners
TCO
Solutions for Cost-Effective Hadoop Clusters
Best choice of Hadoop Hardware platforms
Shipped Directly From US, NL, TW
Turnkey Hadoop:
Supermicro Complete Rack Solutions
One Stop Shop for Hardware, End to End Total Solutions
Speed Up Deployment With Ready-to-Run Rack Systems
Single Source, Consistent Build Quality and Delivery Time
Multi-Vendor Compatibility Testing, Zero Compatibility Issues
Premium Service With Competitive Pricing
Broad Product Portfolio and Building Blocks
Best platform for your Hadoop cluster
Q&A
Thank You