AWS re:Invent re:Cap - Data Analysis: Amazon EC2 C4 Instance + Amazon EBS - 김일호
December 10, 2014 | Korea
김 일호, Solutions Architect
BDT201 - Big Data and HPC State of the Union
BDT202 - HPC Now Means 'High Personal Computing'
BDT203 - From Zero to NoSQL Hero: Amazon DynamoDB Tutorial
BDT204 - Rendering a Seamless Satellite Map of the World with AWS and NASA Data
BDT205 - Your First Big Data Application on AWS
BDT206 - See How Amazon Redshift is Powering Business Intelligence in the Enterprise
BDT207 - Use Streaming Analytics to Exploit Perishable Insights
BDT208 - Finding High Performance in the Cloud for HPC
BDT209 - Intel’s Healthcare Cloud Solution Using Wearables for Parkinson’s Disease Research
BDT302 - Big Data Beyond Hadoop: Running Mahout, Giraph, and R on Amazon EMR and Analyzing Results with Amazon Redshift
BDT303 - Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and Amazon Redshift
BDT305 - Lessons Learned and Best Practices for Running Hadoop on AWS
BDT306 - Mission-Critical Stream Processing with Amazon EMR and Amazon Kinesis
BDT307 - Running NoSQL on Amazon EC2
BDT308 - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse
BDT308-JT - Using Amazon Elastic MapReduce as Your Scalable Data Warehouse - Japanese Track
BDT309 - Delivering Results with Amazon Redshift, One Petabyte at a Time
BDT309-JT - Delivering Results with Amazon Redshift, One Petabyte at a Time - Japanese Track
BDT310 - Big Data Architectural Patterns and Best Practices on AWS
BDT311 - MegaRun: Behind the 156,000 Core HPC Run on AWS and Experience of On-demand Clusters for Manufacturing Production Workloads
BDT312 - Using the Cloud to Scale from a Database to a Data Platform
BDT401 - Big Data Orchestra - Harmony within Data Analysis Tools
BDT402 - Performance Profiling in Production: Analyzing Web Requests at Scale Using Amazon Elastic MapReduce and Storm
BDT403 - Netflix's Next Generation Big Data Platform
[Architecture overview: Collect (AWS Direct Connect, AWS Import/Export, Amazon Kinesis, Amazon SQS, log aggregation tools) → Store (Amazon S3, Glacier, DynamoDB, any SQL or NoSQL store) → Process & Analyze (Amazon EMR, Amazon EC2, Amazon Redshift) → Visualize (visualization tools, business intelligence tools, GIS tools), with AWS Data Pipeline automating data movement between stages.]
Demo walk-through (BDT205): Log4J producer → EMR-Kinesis Connector → Hive with Amazon S3 → Amazon Redshift parallel COPY from Amazon S3 → Amazon Kinesis processing state (checkpointing).
Launch a 3-instance Hadoop 2.4 cluster with Hive installed, using m3.xlarge instances (substitute your own values for YOUR-AWS-REGION, YOUR-AWS-SSH-KEY, and YOUR-BUCKET-NAME).
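A minimal sketch of the launch command, assuming the 2014-era AWS CLI with EMR AMI versioning (the cluster name and exact flags are reconstructions, not copied from the slide):

# launch a 3-node Hadoop 2.4 cluster with Hive installed
aws emr create-cluster --name "big-data-demo" \
  --ami-version 3.3.1 \
  --applications Name=Hive \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --log-uri s3://YOUR-BUCKET-NAME/logs/ \
  --region YOUR-AWS-REGION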
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 2
Create a single-node Amazon Redshift cluster, choosing a master password for CHOOSE-A-REDSHIFT-PASSWORD. Configure the Log4J producer with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY so it can write access-log records into the stream, then connect to the EMR master node:
ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-MASTER-PRIVATE-DNS
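A sketch of the Redshift and producer setup, assuming the 2014-era CLI and the AwsCredentials.properties format used by the kinesis-log4j-appender linked later in this deck (the cluster identifier, database name, and node type are illustrative):

# create a single-node Amazon Redshift cluster
aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dw2.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
  --publicly-accessible

# credentials the Log4J appender uses to put records into the stream
cat > AwsCredentials.properties <<EOF
accessKey=YOUR-IAM-ACCESS-KEY
secretKey=YOUR-IAM-SECRET-KEY
EOF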
Start Hive:
hive
Supply the EMR-Kinesis connector with your IAM credentials (YOUR-IAM-ACCESS-KEY, YOUR-IAM-SECRET-KEY) and the stream's region (YOUR-AWS-REGION), then create a Hive table backed by the Kinesis stream; the table definition ends with:
STORED BY 'com.amazon.emr.kinesis.hive.KinesisStorageHandler'
TBLPROPERTIES("kinesis.stream.name"="AccessLogStream");
-- return the first row in the stream
-- return a count of all items in the stream
-- return a count of all rows for a given host
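Illustrative versions of these three queries against the table sketched above (the host address is a hypothetical example; the slide's actual predicates were not captured):

hive> SELECT * FROM apache_log LIMIT 1;
hive> SELECT COUNT(1) FROM apache_log;
hive> SELECT COUNT(1) FROM apache_log WHERE host = '66.249.67.3';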
Next, write the stream data out to Amazon S3 (under s3://YOUR-S3-BUCKET/emroutput):
-- set up Hive's "dynamic partitioning",
-- which splits output files when writing to Amazon S3
-- compress output files on Amazon S3 using Gzip
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
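A sketch of these statements, assuming standard Hive settings for dynamic partitioning and Gzip output plus a hypothetical external table named apache_log_s3 (table and column names are illustrative):

hive> CREATE EXTERNAL TABLE apache_log_s3 (
        request_time STRING, host STRING, request STRING,
        status STRING, size STRING, referrer STRING, agent STRING)
      PARTITIONED BY (hour INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/emroutput';

-- enable dynamic partitioning so Hive splits output files per hour
hive> SET hive.exec.dynamic.partition=true;
hive> SET hive.exec.dynamic.partition.mode=nonstrict;

-- compress output files on Amazon S3 using Gzip
hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- convert the Apache log timestamp to a UNIX timestamp
-- and partition the output by the hour in the log lines
hive> INSERT OVERWRITE TABLE apache_log_s3 PARTITION (hour)
      SELECT from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
             host, request, status, size, referrer, agent,
             hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
      FROM apache_log;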
Verify the output files under s3://YOUR-S3-BUCKET, then connect to the Redshift cluster:
# using the PostgreSQL CLI against YOUR-REDSHIFT-ENDPOINT
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support:
• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
Create a table and load it with a parallel COPY from s3://YOUR-S3-BUCKET, authenticating with YOUR-IAM-ACCESS-KEY and YOUR-IAM-SECRET-KEY.
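A sketch of the connection and load, assuming the single-node cluster created earlier, the default Redshift port, and the hourly output layout above (table and column names are illustrative):

# using the PostgreSQL CLI
psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo

-- create a target table matching the Hive output columns
CREATE TABLE accesslog (
  request_time TIMESTAMP, host VARCHAR(50), request VARCHAR(1024),
  status INT, size INT, referrer VARCHAR(1024), agent VARCHAR(1024));

-- parallel COPY of the Gzip files from Amazon S3
COPY accesslog
FROM 's3://YOUR-S3-BUCKET/emroutput'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS-KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' GZIP;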
-- show all requests from a given IP address
-- count all requests on a given day
-- show all requests referred from other sites
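Illustrative SQL for these three analyses, using the hypothetical accesslog table above (the IP address, date, and predicates are examples, not the slide's originals):

-- show all requests from a given IP address
SELECT * FROM accesslog WHERE host = '66.249.67.3';

-- count all requests on a given day
SELECT COUNT(1) FROM accesslog WHERE TRUNC(request_time) = '2014-11-08';

-- show all requests referred from other sites
SELECT request, referrer FROM accesslog WHERE referrer <> '-' LIMIT 10;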
Bonus: the EMR-Kinesis connector checkpoints its position in the stream, so a repeated query reads only newly arrived records.
-- Create an external table on Amazon S3
-- (under s3://YOUR-S3-BUCKET) to hold query results,
-- partitioned (files split on Amazon S3) by iteration
-- set up a first iteration
-- create the OS-ERROR_COUNT result (404 error codes) under dynamic partition 0
-- set up a second iteration over the data in the Kinesis stream
-- create the OS-ERROR_COUNT result under dynamic partition 1;
-- if the file is empty, the previous iteration read all remaining stream data
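A sketch of the checkpointing setup, assuming the connector's kinesis.checkpoint.* Hive variables backed by a DynamoDB metastore table (variable names follow EMR documentation of that era as best recalled; the table name, result schema, and grouping by raw user-agent in place of a real OS parse are all illustrative):

hive> SET kinesis.checkpoint.enabled=true;
hive> SET kinesis.checkpoint.metastore.table.name=MyEMRKinesisTable;
hive> SET kinesis.checkpoint.metastore.hash.key.name=HashKey;
hive> SET kinesis.checkpoint.metastore.range.key.name=RangeKey;
hive> SET kinesis.checkpoint.logical.name=AccessLogJob;

-- external table on Amazon S3 for the results, partitioned by iteration
hive> CREATE EXTERNAL TABLE os_error_count (os STRING, cnt INT)
      PARTITIONED BY (iteration INT)
      ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
      LOCATION 's3://YOUR-S3-BUCKET/os_error_count/';

-- first iteration: read the stream from the beginning
hive> SET kinesis.checkpoint.iteration.no=0;
hive> INSERT OVERWRITE TABLE os_error_count PARTITION (iteration=0)
      SELECT agent, COUNT(1) FROM apache_log
      WHERE status = '404' GROUP BY agent;

-- second iteration: read only records that arrived since iteration 0;
-- an empty result file means the previous iteration drained the stream
hive> SET kinesis.checkpoint.iteration.no=1;
hive> INSERT OVERWRITE TABLE os_error_count PARTITION (iteration=1)
      SELECT agent, COUNT(1) FROM apache_log
      WHERE status = '404' GROUP BY agent;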
Finally, copy one of the Gzip result files (YOUR-PREFIX.gz) from s3://YOUR-S3-BUCKET to the local machine and inspect it.
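A sketch with the aws s3 CLI (the object key and prefix layout are illustrative):

# list result files, download one, and inspect it
aws s3 ls s3://YOUR-S3-BUCKET/os_error_count/ --recursive
aws s3 cp s3://YOUR-S3-BUCKET/os_error_count/iteration=1/YOUR-PREFIX.gz .
zcat YOUR-PREFIX.gz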
DataXu

DataXu Records

Confirmation Record:
tx_id: "AFTfN0uAWZ"
exchange: "APPNEXUS"
request_id: "bb656107-3bf7-47a7-8548-8229563e9dc9"
…
adslot: {slot_id: "2686449714718898993", uuid: "9d2403f1-fc6c-4d38-b6b1-839fe4b42455", price_micro_cpm: 661385, currency: "USD", seat_id: "12-914", campaign_id: "C0513n7", creative_id: "R53a537"}
…
time_stamp: 1415393474434
serviced_by_host: "cr02.us-east-01"

Fraud Record:
69.120.26.172 - - [08/Nov/2014:21:59:54 -0500] "GET /rs?id=fc6f2106175a43df8ae4f3b7e6fa8c37&t=marketing&cbust=1415502000191662 HTTP/1.1" 302 - "http://ads-by.madadsmedia.com/tags/25628/10217/iframe/728x90.html" "Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.2; Trident/6.0)" "wfivefivec=c876d00e-1831-4eba-b78d-cd99188e951a" "OWW=-"
[DataXu pipeline diagram: Producers (CDN, real-time bidding, retargeting platform) → Aggregator → Continuous Processing (real-time apps, KCL apps, Archiver on Amazon Kinesis, with event replay) → Storage (Amazon S3) → Analytics and Reporting (Amazon Redshift, Qubole).]

The stages generalize as: Client/Sensor → Aggregator → Continuous Processing → Storage → Analytics + Reporting.

Sending into the stream can use the Kinesis Log4J appender: https://github.com/awslabs/kinesis-log4j-appender
Amazon Kinesis storage is replicated across Availability Zones: durable, highly consistent storage replicates data across three data centers (Availability Zones). A front end handles authentication and authorization. Millions of sources can produce hundreds of terabytes per hour, and the ordered stream of events supports multiple readers: real-time dashboards and alarms, machine learning algorithms or sliding-window analytics, and aggregate analysis in Hadoop or a data warehouse, with data aggregated and archived to S3. Inexpensive: $0.028 per million puts.
[Chart: Kinesis throughput scales linearly with shard count, from 0 to about 1,200,000 1 KB messages/sec across 0 to 1,100 shards.]

TCO for an average of 1M events/second:
• with 50:1 packing and 10:1 compression: $6,351/month
• raw: $28,610/month
[Diagram: durable archiving from an Amazon Kinesis shard. Events with sequence numbers 2, 3, 5, 8, 10, 14, 17, 18, 21, 23 flow through Shard-i. Host A and Host B coordinate through a lease table (Shard ID, Lock, Seq num) and track progress in an archive table (Shard ID, Last Archived). If the host holding the lock fails, the other host takes over and resumes from the last archived sequence number (here, event 10).]
• Unordered processing
– Randomize the partition key to distribute events over many shards, and use multiple workers.
• Exact-order processing
– Control the partition key to ensure related events are grouped onto the same shard and read by the same worker.
• Need both? Have the producer get a global sequence number, read each event's metadata, and route it to an unordered stream, a campaign-centric stream, and a fraud-inspection stream, as in the sketch after this table.

Id  Event         Stream – partition key
1   confirmation  Campaign-centric stream – UUID
2   fraud         Unordered stream; Fraud-inspection stream – session ID
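As a concrete illustration of the two partition-key strategies, the same put can either randomize the key (spreading load across shards) or derive it from a stable attribute such as a session ID (pinning related events to one shard). A minimal sketch with the AWS CLI; the stream names and payloads are illustrative:

# unordered: a random partition key spreads events across shards
aws kinesis put-record \
  --stream-name UnorderedStream \
  --partition-key "$RANDOM$RANDOM" \
  --data '{"event":"impression"}'

# exact order: keying by session ID sends all of a session's events
# to the same shard, so one worker sees them in sequence
aws kinesis put-record \
  --stream-name FraudInspectionStream \
  --partition-key "session-1234" \
  --data '{"event":"fraud"}'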
Sending: HTTP POST, AWS SDK, Log4J, Flume, Fluentd
Reading: Get* APIs, Apache Storm, Amazon Elastic MapReduce

[Diagram: the Archiver persists the stream to Amazon S3, and Amazon EMR plays back the archived events.]
http://bit.ly/aws-bdt205
General Purpose: M1, M3 (and T2)
Compute Optimized: C1, CC2, C3, C4
Memory Optimized: M2, CR1, R3
Storage Optimized: HI1, HS1, I2
GPU: CG1, G2
Micro: T1, T2
[Timeline: EC2 instance types, 2006 through December 2014. Each period adds to the existing lineup:]
• 2006: m1.small
• 2007: adds m1.large, m1.xlarge
• 2008: adds c1.medium, c1.xlarge
• 2009: adds m2.2xlarge, m2.4xlarge
• 2010: adds t1.micro, m2.xlarge, cc1.4xlarge, cg1.4xlarge
• 2011: adds cc2.8xlarge
• 2012-2013: adds m1.medium, hi1.4xlarge, m3.xlarge, m3.2xlarge, hs1.8xlarge, cr1.8xlarge, c3.large, c3.xlarge, c3.2xlarge, c3.4xlarge, c3.8xlarge
• December 2014, new alongside the existing lineup: m3.medium, m3.large, g2.2xlarge, i2.large, i2.xlarge, i2.4xlarge, i2.8xlarge, r3.large, r3.xlarge, r3.2xlarge, r3.4xlarge, r3.8xlarge, t2.micro, t2.small, t2.medium, and (introducing now) c4.large, c4.xlarge, c4.2xlarge, c4.4xlarge, c4.8xlarge
The next generation of Amazon EC2 compute-optimized instances:
• Based on Intel Xeon E5-2666 v3 (Haswell) processors
• 2.9 GHz, peaking at 3.5 GHz with Turbo Boost
• Ideal for running tier-1 applications, gaming and web servers, transcoding, and high-performance computing workloads
• EBS-optimized by default, at no additional cost

Instance Name  vCPU Count  RAM       Network Performance
c4.large       2           3.75 GiB  Moderate
c4.xlarge      4           7.5 GiB   Moderate
c4.2xlarge     8           15 GiB    High
c4.4xlarge     16          30 GiB    High
c4.8xlarge     36          60 GiB    10 Gbps

Preliminary specifications; may change prior to release.
Increases to the performance and capacity of General Purpose (SSD) and Provisioned IOPS (SSD) volumes:

EBS Volume Type                    Capacity              IOPS                    Throughput
Amazon EBS General Purpose (SSD)   16 TB (up from 1 TB)  10,000 (up from 3,000)  160 MBps *
Amazon EBS Provisioned IOPS (SSD)  16 TB (up from 1 TB)  20,000 (up from 4,000)  320 MBps *

* When attached to EBS-optimized instances