AWS re:Invent re:Cap - Data Analysis: Amazon EC2 C4 Instance + Amazon EBS - 김일호
December 10, 2014 | Korea
김 일호, Solutions Architect
BDT201 - Big Data and HPC State of the Union
BDT202 - HPC Now Means 'High Personal Computing'
BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial
BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data
BDT205 - Your First Big Data Application on AWS
BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise
BDT207 - Use Streaming Analytics to Exploit Perishable Insights
BDT208 - Finding High Performance in the Cloud for HPC
BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research
BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift
BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift
BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS
BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis
BDT307 - Running NoSQL on Amazon EC2
BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse
BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track
BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time
BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track
BDT310 - Big Data Architectural Patterns and Best Practices on AWS
BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads
BDT312 - Using the Cloud to Scale from a Database to a Data Platform
BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools
BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm
BDT403 - Netflix's Next Generation Big Data Platform
[Architecture overview: Collect (AWS Direct Connect, AWS Import/Export, Amazon Kinesis, Amazon SQS, log aggregation tools) → Store (Amazon S3, Glacier, DynamoDB, any SQL or NoSQL store) → Process & Analyze (Amazon EMR, Amazon EC2, Amazon Redshift) → Visualize (visualization tools, business intelligence tools, GIS tools), with AWS Data Pipeline automating data movement between stages.]
Demo walk-through (BDT205): Log4J producer → EMR-Kinesis Connector → Hive with Amazon S3 → Amazon Redshift parallel COPY from Amazon S3 → Amazon Kinesis processing state (checkpointing).
Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances (substitute your own values for YOUR-AWS-REGION, YOUR-AWS-SSH-KEY, and YOUR-BUCKET-NAME).
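A minimal sketch of the launch command, assuming the 2014-era AWS CLI with EMR AMI versioning (the cluster name and exact flags are reconstructions, not copied from the slide):

# launch a 3-node Hadoop 2.4 cluster with Hive installed
aws emr create-cluster --name "big-data-demo" \
  --ami-version 3.3.1 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --log-uri s3://YOUR-BUCKET-NAME/logs/ \
  --region YOUR-AWS-REGION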
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 2
Create a single-node Amazon Redshift cluster, choosing a master password for CHOOSE-A-REDSHIFT-PASSWORD. Configure the Log4J producer with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY so it can write access-log records into the stream, then connect to the EMR master node:
ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-MASTER-PRIVATE-DNS
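A sketch of the Redshift and producer setup, assuming the 2014-era CLI and the AwsCredentials.properties format used by the kinesis-log4j-appender linked later in this deck (the cluster identifier, database name, and node type are illustrative):

# create a single-node Amazon Redshift cluster
aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dw2.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
  --publicly-accessible

# credentials the Log4J appender uses to put records into the stream
cat > AwsCredentials.properties <<EOF
accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-IAM-SECRET-KEY
EOF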
Start Hive:
hive
Supply the EMR-Kinesis connector with your IAM credentials (YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY) and the stream's region (YOUR-AWS-REGION), then create a Hive table backed by the Kinesis stream; the table definition ends with:
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
-- return the first row in the stream
-- return a count of all items in the stream
-- return a count of all rows for a given host
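Illustrative versions of these three queries against the table sketched above (the host address is a hypothetical example; the slide's actual predicates were not captured):

hive> SELECT * FROM apache_log LIMIT 1;
hive> SELECT COUNT(1) FROM apache_log;
hive> SELECT COUNT(1) FROM apache_log WHERE host = '66.249.67.3';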
Next, write the stream data out to Amazon S3 (under s3://YOUR-S3-BUCKET/emroutput):
-- set up Hive's "dynamic partitioning",
-- which splits output files when writing to Amazon S3
-- compress output files on Amazon S3 using Gzip
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
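A sketch of these statements, assuming standard Hive settings for dynamic partitioning and Gzip output plus a hypothetical external table named apache_log_s3 (table and column names are illustrative):

hive> CREATE EXTERNAL TABLE apache_log_s3 (
        request_time STRING, host STRING, request STRING,
        status STRING, size STRING, referrer STRING, agent STRING)
      PARTITIONED BY (hour INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/emroutput';

-- enable dynamic partitioning so Hive splits output files per hour
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;

-- compress output files on Amazon S3 using Gzip
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- convert the Apache log timestamp to a UNIX timestamp
-- and partition the output by the hour in the log lines
hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
      SELECT from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
             host, request, status, size, referrer, agent,
             hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
      FROM apache_log;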
Verify the output files under s3://YOUR-S3-BUCKET, then connect to the Redshift cluster:
# using the PostgreSQL CLI against YOUR-REDSHIFT-ENDPOINT
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support:
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
Create a table and load it with a parallel COPY from s3://YOUR-S3-BUCKET, authenticating with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY.
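A sketch of the connection and load, assuming the single-node cluster created earlier, the default Redshift port, and the hourly output layout above (table and column names are illustrative):

# using the PostgreSQL CLI
psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo

-- create a target table matching the Hive output columns
CREATE TABLE accesslog (
  request_time TIMESTAMP, host VARCHAR(50), request VARCHAR(1024),
  status INT, size INT, referrer VARCHAR(1024), agent VARCHAR(1024));

-- parallel COPY of the Gzip files from Amazon S3
COPY accesslog
FROM 's3://YOUR-S3-BUCKET/emroutput'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' GZIP;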
-- show all requests from a given IP address
-- count all requests on a given day
-- show all requests referred from other sites
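Illustrative SQL for these three analyses, using the hypothetical accesslog table above (the IP address, date, and predicates are examples, not the slide's originals):

-- show all requests from a given IP address
SELECT * FROM accesslog WHERE host = '66.249.67.3';

-- count all requests on a given day
SELECT COUNT(1) FROM accesslog WHERE TRUNC(request_time) = '2014-11-08';

-- show all requests referred from other sites
SELECT request, referrer FROM accesslog WHERE referrer <> '-' LIMIT 10;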
Bonus: the EMR-Kinesis connector checkpoints its position in the stream, so a repeated query reads only newly arrived records.
-- Create an external table on Amazon S3
-- (under s3://YOUR-S3-BUCKET) to hold query results,
-- partitioned (files split on Amazon S3) by iteration
-- set up a first iteration
-- create the OS-ERROR_COUNT result (404 error codes) under dynamic partition 0
-- set up a second iteration over the data in the Kinesis stream
-- create the OS-ERROR_COUNT result under dynamic partition 1;
-- if the file is empty, the previous iteration read all remaining stream data
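A sketch of the checkpointing setup, assuming the connector's kinesis.checkpoint.* Hive variables backed by a DynamoDB metastore table (variable names follow EMR documentation of that era as best recalled; the table name, result schema, and grouping by raw user-agent in place of a real OS parse are all illustrative):

hive> SET kinesis.checkpoint.enabled=true;
hive> SET kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> SET kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> SET kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> SET kinesis.checkpoint.logical.name=AccessLogJob;

-- external table on Amazon S3 for the results, partitioned by iteration
hive> CREATE EXTERNAL TABLE os_error_count (os STRING, cnt INT)
      PARTITIONED BY (iteration INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/os_error_count/';

-- first iteration: read the stream from the beginning
hive> SET kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE os_error_count PARTITION (iteration=0)
      SELECT agent, COUNT(1) FROM apache_log
      WHERE status = '404' GROUP BY agent;

-- second iteration: read only records that arrived since iteration 0;
-- an empty result file means the previous iteration drained the stream
hive> SET kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE os_error_count PARTITION (iteration=1)
      SELECT agent, COUNT(1) FROM apache_log
      WHERE status = '404' GROUP BY agent;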
Finally, copy one of the Gzip result files (YOUR-PREFIX.gz) from s3://YOUR-S3-BUCKET to the local machine and inspect it.
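A sketch with the aws s3 CLI (the object key and prefix layout are illustrative):

# list result files, download one, and inspect it
aws s3 ls s3://YOUR-S3-BUCKET/os_error_count/ --recursive
aws s3 cp s3://YOUR-S3-BUCKET/os_error_count/iteration=1/YOUR-PREFIX.gz .
zcat YOUR-PREFIX.gz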
DataXu

DataXu Records

Confirmation Record:
tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
…
adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"}
…
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"

Fraud Record:
69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"
[DataXu pipeline diagram: Producers (CDN, real-time bidding, retargeting platform) → Aggregator → Continuous Processing (real-time apps, KCL apps, Archiver on Amazon Kinesis, with event replay) → Storage (Amazon S3) → Analytics and Reporting (Amazon Redshift, Qubole).]

The stages generalize as: Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting.

Sending into the stream can use the Kinesis Log4J appender: https://github.com/awslabs/kinesis-log4j-appender
Amazon Kinesis storage is replicated across Availability Zones: durable, highly consistent storage replicates data across three data centers (Availability Zones). A front end handles authentication and authorization. Millions of sources can produce hundreds of terabytes per hour, and the ordered stream of events supports multiple readers: real-time dashboards and alarms, machine learning algorithms or sliding-window analytics, and aggregate analysis in Hadoop or a data warehouse, with data aggregated and archived to S3. Inexpensive: $0.028 per million puts.
[Chart: Kinesis throughput scales linearly with shard count, from 0 to about 1,200,000 1 KB messages/sec across 0 to 1,100 shards.]

TCO for an average of 1M events/second:
• with 50:1 packing and 10:1 compression: $6,351/month
• raw: $28,610/month
[Diagram: durable archiving from an Amazon Kinesis shard. Events with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23 flow through Shard-i. Host A and Host B coordinate through a lease table (Shard ID, Lock, Seq num) and track progress in an archive table (Shard ID, Last Archived). If the host holding the lock fails, the other host takes over and resumes from the last archived sequence number (here, event 10).]
• Unordered processing
– Randomize the partition key to distribute events over many shards, and use multiple workers.
• Exact-order processing
– Control the partition key to ensure related events are grouped onto the same shard and read by the same worker.
• Need both? Have the producer get a global sequence number, read each event's metadata, and route it to an unordered stream, a campaign-centric stream, and a fraud-inspection stream, as in the sketch after this table.

Id  Event         Stream – partition key
1   confirmation  Campaign-centric stream – UUID
2   fraud         Unordered stream; Fraud-inspection stream – session ID
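As a concrete illustration of the two partition-key strategies, the same put can either randomize the key (spreading load across shards) or derive it from a stable attribute such as a session ID (pinning related events to one shard). A minimal sketch with the AWS CLI; the stream names and payloads are illustrative:

# unordered: a random partition key spreads events across shards
aws kinesis put-record \
  --stream-name UnorderedStream \
  --partition-key "$RANDOM$RANDOM" \
  --data '{"event":"impression"}'

# exact order: keying by session ID sends all of a session's events
# to the same shard, so one worker sees them in sequence
aws kinesis put-record \
  --stream-name FraudInspectionStream \
  --partition-key "session-1234" \
  --data '{"event":"fraud"}'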
Sending: HTTP POST, AWS SDK, Log4J, Flume, Fluentd
Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce

[Diagram: the Archiver persists the stream to Amazon S3, and Amazon EMR plays back the archived events.]
http://bit.ly/aws-bdt205
General Purpose: M1, M3 (and T2)
Compute Optimized: C1, CC2, C3, C4
Memory Optimized: M2, CR1, R3
Storage Optimized: HI1, HS1, I2
GPU: CG1, G2
Micro: T1, T2
[Timeline: EC2 instance types, 2006 through December 2014. Each period adds to the existing lineup:]
• 2006: m1.small
• 2007: adds m1.large, m1.xlarge
• 2008: adds c1.medium, c1.xlarge
• 2009: adds m2.2xlarge, m2.4xlarge
• 2010: adds t1.micro, m2.xlarge, cc1.4xlarge, cg1.4xlarge
• 2011: adds cc2.8xlarge
• 2012-2013: adds m1.medium, hi1.4xlarge, m3.xlarge, m3.2xlarge, hs1.8xlarge, cr1.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge
• December 2014, new alongside the existing lineup: m3.medium, m3.large, g2.2xlarge, i2.large, i2.xlarge, i2.4xlarge, i2.8xlarge, r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, t2.micro, t2.small, t2.medium, and (introducing now) c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge
The next generation of Amazon EC2 compute-optimized instances:
• Based on Intel Xeon E5-2666 v3 (Haswell) processors
• 2.9 GHz, peaking at 3.5 GHz with Turbo Boost
• Ideal for running tier-1 applications, gaming and web servers, transcoding, and high-performance computing workloads
• EBS-optimized by default, at no additional cost

Instance Name  vCPU Count  RAM       Network Performance
c4.large       2           3.75 GiB  Moderate
c4.xlarge      4           7.5 GiB   Moderate
c4.2xlarge     8           15 GiB    High
c4.4xlarge     16          30 GiB    High
c4.8xlarge     36          60 GiB    10 Gbps

Preliminary specifications; may change prior to release.
Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes:

EBS Volume Type                    Capacity              IOPS                    Throughput
Amazon EBS General Purpose (SSD)   16 TB (up from 1 TB)  10,000 (up from 3,000)  160 MBps *
Amazon EBS Provisioned IOPS (SSD)  16 TB (up from 1 TB)  20,000 (up from 4,000)  320 MBps *

* When attached to EBS-optimized instances