Post on 18-Jan-2017
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
5/20/2016
Scale to 12,000,000 users with AWS
Kevin Li, Product Lead, 17 Media
Principles of Architecture Evolution
Figure out the business need of the current stage
Balance the quality and time to market
Optimize the bottleneck first
What does 17's architecture need?
Scalable: Grow with the users
Available: Always there for users
Personalized: Understand the users
First 100 users
Don’t even think about scalability
Launch and verify the idea ASAP
[Diagram: Request, Amazon Route 53, Amazon EC2, MongoDB]
100,000 users
Cache the database
Use CDN to deliver the live streaming content
[Diagram: Request, CDN, Amazon Route 53, Amazon EC2, Amazon ElastiCache, MongoDB]
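A minimal sketch of the "cache the database" step, using the cache-aside pattern. This is an illustration, not 17 Media's actual code: plain dictionaries stand in for MongoDB and ElastiCache, and the class name, key format, and TTL are hypothetical.

```python
import time


class CacheAside:
    """Check the cache first; fall back to the database on a miss."""

    def __init__(self, db, ttl_seconds=60.0, clock=time.monotonic):
        self.db = db              # stand-in for MongoDB: any dict-like store
        self.cache = {}           # stand-in for ElastiCache: key -> (expires_at, value)
        self.ttl = ttl_seconds
        self.clock = clock
        self.db_reads = 0         # counts how often the database is actually hit

    def get(self, key):
        entry = self.cache.get(key)
        if entry is not None and entry[0] > self.clock():
            return entry[1]       # cache hit: the database is not touched
        self.db_reads += 1        # cache miss or expired entry
        value = self.db.get(key)
        self.cache[key] = (self.clock() + self.ttl, value)
        return value

    def update(self, key, value):
        self.db[key] = value
        self.cache.pop(key, None)  # invalidate so the next read sees fresh data


db = {"user:1": {"name": "alice"}}
store = CacheAside(db)
store.get("user:1")
store.get("user:1")
print(store.db_reads)  # the second read is served from the cache
```

Invalidating on write keeps the sketch simple; a real deployment would also have to decide how stale a cached value is allowed to be.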
Design for failure
Failures are the norm, not exceptions
Suppose one machine fails on average once every 10 years (120 months)
With 120 servers, the mean time to failure (MTTF) of the fleet is 1 month
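The arithmetic behind that number, as a small sketch. It assumes machines fail independently and at a constant rate, so the failure rates simply add up; the function name is hypothetical.

```python
def fleet_mttf(machine_mttf_months: float, servers: int) -> float:
    """Expected time until the next failure somewhere in the fleet.

    Assumes independent machines with a constant failure rate, so the
    fleet-wide failure rate is servers / machine_mttf_months.
    """
    if servers <= 0:
        raise ValueError("servers must be positive")
    return machine_mttf_months / servers


print(fleet_mttf(120, 120))  # 1.0 month
```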
Design for failure
[Diagram: Amazon Route 53, Elastic Load Balancing, three Amazon EC2 instances (Multi-AZ), three MongoDB nodes (Multi-AZ), Amazon ElastiCache]
Mix spot and on-demand instances to save cost
TIP: Use C3 instances for spot
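One way to picture "design for failure" at the load balancer: requests keep flowing to the instances that are still healthy. The sketch below is a toy round-robin balancer, not how Elastic Load Balancing is implemented; the class and the backend names are hypothetical.

```python
from itertools import cycle


class LoadBalancer:
    """Round-robin over backends, skipping any that are marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._ring = cycle(self.backends)

    def mark_down(self, backend):
        self.healthy.discard(backend)

    def mark_up(self, backend):
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Look at each backend at most once per call.
        for _ in range(len(self.backends)):
            backend = next(self._ring)
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = LoadBalancer(["api-az1", "api-az2", "api-az3"])
lb.mark_down("api-az2")                 # simulate one instance failing
picks = [lb.pick() for _ in range(4)]
print(picks)
```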
Pet servers
Unique, lovingly hand-raised servers
When they get ill, someone has to fix them at 4 am
Usually database servers, like MySQL, MongoDB, …
Cattle servers
They are almost identical
If they get ill, replace them with another one
Usually API servers, workers
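The cattle idea can be sketched as a reconcile loop: nobody repairs an ill server, it is terminated and a fresh, identical one is launched. This is an in-memory illustration only; the `Fleet` class and the server names are hypothetical, and a real setup would delegate this to something like an Auto Scaling group.

```python
from itertools import count


class Fleet:
    """Keeps a fixed number of interchangeable servers running."""

    def __init__(self, desired):
        self.desired = desired
        self.servers = {}          # server id -> healthy flag
        self._ids = count(1)
        self.reconcile()

    def _launch(self):
        server_id = f"api-{next(self._ids)}"
        self.servers[server_id] = True
        return server_id

    def reconcile(self):
        """Terminate unhealthy servers, then launch replacements."""
        replaced = [sid for sid, healthy in self.servers.items() if not healthy]
        for sid in replaced:
            del self.servers[sid]
        while len(self.servers) < self.desired:
            self._launch()
        return replaced


fleet = Fleet(desired=3)
fleet.servers["api-2"] = False     # api-2 gets ill
replaced = fleet.reconcile()
print(replaced, sorted(fleet.servers))
```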
5 million users – Build loosely coupled systems
Our system was a monolith: each application server ran an API server, a streaming server, and a worker
We discovered a bug: the API servers weren't sending requests to the worker
After the fix, the overloaded worker crashed the whole server
Build loosely coupled systems
[Diagram: the API server, streaming server, and worker split into an API cluster, a streaming cluster, and a worker cluster, with Amazon SQS]
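A sketch of the decoupling, assuming the queue sits between the API cluster and the worker cluster. Python's in-process `queue.Queue` stands in for Amazon SQS here, and the function names and job payloads are hypothetical. The point is that the API only enqueues and returns; the worker drains the queue at its own pace.

```python
import queue
import threading

jobs = queue.Queue()    # stand-in for an SQS queue
results = []


def api_handle_request(payload):
    """API side: enqueue the job and return immediately."""
    jobs.put(payload)
    return "accepted"


def worker_loop():
    """Worker side: pull jobs one at a time until the stop signal arrives."""
    while True:
        job = jobs.get()
        try:
            if job is None:              # sentinel used to stop this sketch
                break
            results.append(job.upper())  # placeholder for the real work
        finally:
            jobs.task_done()


worker = threading.Thread(target=worker_loop)
worker.start()
for name in ["gift", "follow", "comment"]:
    api_handle_request(name)
jobs.put(None)
jobs.join()      # wait until every queued job has been processed
worker.join()
print(results)
```

If the worker is slow or down, jobs wait in the queue instead of taking the API server down with it.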
Business Intelligence
Who are our best streamers?
How does retention change across versions?
Which event is the most effective?
We need a real-time data pipeline and a self-service tool for the business team
10,000,000 users – Data and Personalization
[Diagram: event data, Amazon EC2, Amazon Kinesis, AWS Lambda, Amazon S3 bucket]
A real-time self-service dashboard for
the management and marketing team
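A sketch of what the Lambda step in such a pipeline might look like. Lambda delivers Kinesis records with the payload base64-encoded under `Records[i].kinesis.data`; everything else here is an assumption: the events are taken to be JSON objects with a hypothetical `type` field, and writing the result to S3 or a dashboard is left out.

```python
import base64
import json
from collections import Counter


def handler(event, context=None):
    """Count incoming events by type for one batch of Kinesis records."""
    counts = Counter()
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        item = json.loads(payload)
        counts[item["type"]] += 1      # "type" is a hypothetical field name
    return dict(counts)


def _record(item):
    """Build a record shaped like the ones Lambda receives from Kinesis."""
    data = base64.b64encode(json.dumps(item).encode("utf-8")).decode("ascii")
    return {"kinesis": {"data": data}}


sample = {"Records": [_record({"type": "gift"}),
                      _record({"type": "follow"}),
                      _record({"type": "gift"})]}
print(handler(sample))
```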
Fraud Detection and Security Monitoring
Hackers are always trying to get valuable stuff from your
service, like virtual goods, data,…
Many spammers post offensive language or fraudulent information
You’ll need enough data to detect the fraud and prevent it
“50% of reddit’s development
time focused on stopping spam
and vote cheating” - Jeremy Edberg, Chief Architect of Reddit
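One simple signal that event data makes possible is a per-user rate check. The sketch below flags a user who produces more than a set number of events inside a sliding time window. It is an illustration of the idea, not the detection method used by 17 Media or Reddit; the class name and thresholds are hypothetical.

```python
from collections import defaultdict, deque


class RateFlagger:
    """Flags users who exceed max_events within window_seconds."""

    def __init__(self, max_events, window_seconds):
        self.max_events = max_events
        self.window = window_seconds
        self.history = defaultdict(deque)   # user id -> recent timestamps

    def record(self, user_id, timestamp):
        """Record one event; return True if the user is now over the limit."""
        events = self.history[user_id]
        events.append(timestamp)
        # Drop timestamps that have fallen out of the window.
        while events and events[0] <= timestamp - self.window:
            events.popleft()
        return len(events) > self.max_events


flagger = RateFlagger(max_events=3, window_seconds=10.0)
flags = [flagger.record("u1", t) for t in (0, 1, 2, 3, 20)]
print(flags)
```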
Log Search
Our customer service team often receives questions such as:
“I bought 1,000 points, but didn’t receive”
“My stream isn’t smooth enough, there is a bug!”
To answer those questions, we need a log search system
10,000,000 users – Log Search
[Diagram: event data, HTTP requests, Amazon EC2, Amazon Kinesis, AWS Lambda, Amazon Elasticsearch Service, Amazon S3 bucket]
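A sketch of how log events could be prepared for indexing. Elasticsearch's bulk API takes newline-delimited JSON: an action line followed by the document, with a trailing newline at the end. The index name `logs` and the event fields are hypothetical, and depending on the Elasticsearch version the action line may need additional fields. Sending the body over HTTP is left out.

```python
import json


def to_bulk_body(events, index="logs"):
    """Turn log events into a newline-delimited bulk request body."""
    lines = []
    for event in events:
        lines.append(json.dumps({"index": {"_index": index}}))  # action line
        lines.append(json.dumps(event, sort_keys=True))         # document line
    if not lines:
        return ""
    return "\n".join(lines) + "\n"   # the bulk body must end with a newline


body = to_bulk_body([{"user": "u1", "msg": "bought 1,000 points"}])
print(body)
```

Once events like this are indexed, a customer service agent can search by user to see what actually happened to a purchase.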
10,000,000 users – Data Architecture
[Diagram: event data, HTTP requests, Amazon EC2, Amazon Kinesis, two AWS Lambda functions, Amazon S3 bucket, Amazon Elasticsearch Service]