How to make a high-loaded service with no data about load
Oleg Obleukhov, Site Reliability Engineer, InnoGames

What is CRM and why do we need it

• Actions are tracked while playing

• Data in Hadoop (~25TB, 400B events)

• Templates of behavior

• Near-time campaign

• Money, money, money

• Keep players playing

• Attract new players

Blackbox of CRM

• Games send 500 M events per day in realtime to Hadoop

• Events need to be selected, filtered, used

• In-game messages are sent to the game

• As fast as possible

Questions

• Architecture?

• Reliability?

• How much load?

Service architecture

[Diagram: Frontend, Data-api, Database, Producer, Queue, Consumer]

• Microservices

• Load grows sequentially

• Components need to be reliable

• How many components?

Watts and Virtual servers

Autoscaling!

• When idle - keep only high availability

• If needed - add instances

• Not enough - add servers

• System repairs itself
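The policy above fits in a few lines. Here is a minimal Python sketch of one iteration of that decision loop; the thresholds, limits and action names are illustrative assumptions, not the actual autoscaler code.

# A minimal sketch of the scale-up/scale-down policy described above.
# All names, thresholds and limits here are illustrative assumptions.

MIN_INSTANCES = 2      # when idle, keep only what high availability needs
SCALE_UP_CPU = 75.0    # % average CPU above which capacity is added
SCALE_DOWN_CPU = 25.0  # % average CPU below which capacity is removed


def decide(avg_cpu, instances, free_slots):
    """Return one action for a single microservice: 'add_instance',
    'add_server', 'remove_instance' or 'nothing'."""
    if avg_cpu > SCALE_UP_CPU:
        # "If needed - add instances", "Not enough - add servers"
        return "add_instance" if free_slots > 0 else "add_server"
    if avg_cpu < SCALE_DOWN_CPU and instances > MIN_INSTANCES:
        # "When idle - keep only high availability"
        return "remove_instance"
    return "nothing"


if __name__ == "__main__":
    # Example: a busy service with room left on the hosts, then without room,
    # then an idle service that can shrink back to the HA minimum.
    print(decide(avg_cpu=90.0, instances=4, free_slots=3))   # -> add_instance
    print(decide(avg_cpu=90.0, instances=4, free_slots=0))   # -> add_server
    print(decide(avg_cpu=10.0, instances=4, free_slots=3))   # -> remove_instance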

Infrastructure. What we had

• 3 data centers

• Thousands of VMs, hundreds of hardware servers

• Just migrated from Xen to KVM (live migrations)

• Tested different cloud solutions; much more expensive

• Testing Docker

Infrastructure. LB pools and nodes

[Diagram: Internet → HWLB with LB pool 1 (212.53.146.1) and LB pool N (212.53.146.5) → Host 1 (eth0: 10.0.1.1) … Host N (eth0: 10.0.1.5); every host carries 212.53.146.1 and 212.53.146.5 on lo and runs Service 1 … Service N]

• Load balancing is done by PF and FreeBSD

• Every host has the IP of the service on its lo interface

• The Linux network stack wants to use the «short way»

• Traffic must always go via the LB pool; the routing trick below enforces it (a scripted sketch follows the listings)

# delete route from local table
$ ip route list table local
local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1

# Create table 42 and add there our IP
$ ip route list table 42
local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1

# Add rule for our IP
$ ip rule
42: from all to 212.53.146.1 iif eth0 lookup 42
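The listings above show the resulting state. For completeness, a hedged sketch of how a provisioning script could reach that state with iproute2 is shown below; the exact commands, table number and interface names are assumptions reconstructed from the output above, not the commands used in production.

# Sketch: move the service IP out of the "local" routing table into table 42,
# so that traffic to 212.53.146.1 does not take the loopback "short way" and
# only packets arriving on eth0 (i.e. via the LB pool) match the local route.
# Commands are an assumption reconstructed from the listings above.
import subprocess

SERVICE_IP = "212.53.146.1"
TABLE = "42"

commands = [
    # delete the auto-created route from the local table
    ["ip", "route", "del", "local", SERVICE_IP, "dev", "lo", "table", "local"],
    # re-create the same local route in table 42
    ["ip", "route", "add", "local", SERVICE_IP, "dev", "lo",
     "scope", "host", "src", SERVICE_IP, "table", TABLE],
    # only traffic for the service IP arriving on eth0 consults table 42
    ["ip", "rule", "add", "to", SERVICE_IP, "iif", "eth0",
     "table", TABLE, "priority", TABLE],
]

for cmd in commands:
    subprocess.run(cmd, check=True)  # requires root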

Infrastructure. Load balancing in CRM

[Diagram: Microservice 1 … Microservice N running on Host 1 … Host H behind LB pool 1 … LB pool N; a MariaDB LB in front of MariaDB Host 1, Host 2 … Host D; an RMQ LB in front of RMQ Host 1, Host 2 … Host R]

Autoscaling. A chain

[Diagram: Hypervisor 1 … Hypervisor H run Virtual Host 1 … Virtual Host V, each with Microservice 1 … Microservice M and a Grafsy instance feeding Graphite]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite (a protocol sketch follows below)
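Graphite's plaintext protocol, which Grafsy ultimately forwards, is just "path value timestamp" lines over TCP (port 2003 by default). Below is a minimal Python sketch of a microservice reporting its CPU usage; the relay address and the metric path are illustrative assumptions.

# Minimal sketch: send one metric in Graphite's plaintext format
# ("<path> <value> <timestamp>\n" over TCP, port 2003 by default).
# The metric path and relay address are illustrative assumptions; in the
# setup above a local Grafsy instance would buffer and forward this.
import socket
import time

GRAPHITE_HOST = "localhost"   # e.g. the local Grafsy relay
GRAPHITE_PORT = 2003          # carbon plaintext listener


def send_metric(path, value, timestamp=None):
    line = "%s %f %d\n" % (path, value, int(timestamp or time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # e.g. the CPU usage a microservice reports about itself
    send_metric("crm.af-web02.event-consumer.cpu_usage", 42.5)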

Infrastructure. Graphite

• Only 2 hosts, ~400 GB of RAM. Whisper

• 50000+ metrics per second

• Tried integration with ClickHouse, Cassandra…

• Client: Grafsy

• Notifier: Graphite2monitoring

• Statistics: igcollect

Autoscaling. A chain

[Diagram: the same chain, now with Graphite feeding Nagios]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite

• Graphite2monitoring notifies Nagios (a passive-check sketch follows below)
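One standard way for a notifier such as Graphite2monitoring to feed Nagios is a passive check result written to the Nagios external command file. A sketch of that mechanism follows; the command-file path and the host/service names are assumptions for illustration (the service name mirrors the autoscaler log later in the talk).

# Sketch: submit a passive service check result to Nagios through its
# external command file (PROCESS_SERVICE_CHECK_RESULT). The command-file
# path, host name and service name are assumptions for illustration.
import time

NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"  # assumed location


def submit_result(host, service, state, output):
    # state: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, state, output)
    with open(NAGIOS_CMD_FILE, "w") as cmd_file:
        cmd_file.write(line)


if __name__ == "__main__":
    # e.g. report that event-consumer CPU usage crossed its threshold
    submit_result("aggregator.crm", "cpu_usage_crm-event-consumer",
                  2, "CRITICAL - average CPU usage 93%")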

Infrastructure. Nagios

• Special host-type «aggregator»

• 451.44 checks/sec

• 2 fully replicated hosts

Autoscaling. A chain

[Diagram: the same chain, now with Brassmonkey watching Nagios]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite

• Graphite2monitoring notifies Nagios

• Brassmonkey reacts to events in Nagios

Infrastructure. Brassmonkey: Python-sysadmin

• Used for routine tasks (reboot server, restart daemon, cron…)

• Checking Nagios

• Notifying admins

• Performing actions

• Can do more… Autoscaling
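Brassmonkey modules are plain Python. A heavily simplified, hypothetical sketch of such a module is shown below: poll Nagios, notify admins, perform an action. Every interface in it is an assumption for illustration; the real modules and their Nagios integration are not shown in the talk.

# Hypothetical sketch of a Brassmonkey-style module: poll Nagios state,
# notify admins, and perform a corrective action. All interfaces here
# (get_problems, notify, restart_service) are illustrative assumptions.
import time


def get_problems():
    """Stand-in for asking Nagios which services are currently non-OK."""
    return [{"host": "af-web02.crm", "service": "crm-event-consumer",
             "state": "CRITICAL"}]


def notify(message):
    print("notify admins:", message)                 # mail/chat in reality


def restart_service(host, service):
    print("restarting %s on %s" % (service, host))   # a routine task


def run_once():
    for problem in get_problems():
        notify("%(service)s on %(host)s is %(state)s" % problem)
        restart_service(problem["host"], problem["service"])


if __name__ == "__main__":
    while True:               # Brassmonkey-style periodic loop
        run_once()
        time.sleep(60)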

Autoscaling. A chain

[Diagram: the full chain, now including Serveradmin]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite

• Graphite2monitoring notifies Nagios

• Brassmonkey reacts to events in Nagios

• Serveradmin makes changes (new host/Puppet)

Infrastructure. Serveradmin

• Single source of truth

• Controls roles via Puppet classes

• DNS

• LB node of LB pool

• Location of VM

• Nagios checks / Graphite graphs
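The autoscaler log on the next slides queries Serveradmin through adminapi. A hedged sketch of what such a lookup could look like is below; the import path and the query() signature are assumptions inferred from that log line, not verified against the adminapi client.

# Hypothetical sketch of querying Serveradmin for the CRM aggregator host,
# mirroring the "Query adminapi for 'hostname=aggregator.crm state=online'"
# line in the autoscaler log. The import and query() signature are
# assumptions about the adminapi client, not verified documentation.
from adminapi.dataset import query

hosts = query(hostname="aggregator.crm", state="online")
for host in hosts:
    print(host["hostname"])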

Autoscaling. Less components

2016-10-13 12:40:42,115 [DEBUG] Initilizing crm3
2016-10-13 12:40:42,115 [DEBUG] Query adminapi for 'hostname=aggregator.crm state=online'
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-batch-target-producer
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-catalog
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-content-producer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-data-api
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-consumer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-target-producer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> num_crm-frontend
2016-10-13 12:40:42,217 [DEBUG] Running crm3_instance
2016-10-13 12:40:42,311 [WARNING] [crm3_instance] Reached the lowest limit of instances for batch-target-producer
2016-10-13 12:40:42,393 [WARNING] [crm3_instance] Reached the lowest limit of instances for catalog
2016-10-13 12:40:42,472 [WARNING] [crm3_instance] Reached the lowest limit of instances for content-producer
2016-10-13 12:40:42,550 [WARNING] [crm3_instance] Reached the lowest limit of instances for data-api
2016-10-13 12:40:42,646 [WARNING] [crm3_instance] Reached the lowest limit of instances for event-consumer
2016-10-13 12:40:42,718 [WARNING] [crm3_instance] Reached the lowest limit of instances for event-target-producer
2016-10-13 12:41:35,876 [INFO] [crm3_instance] added frontend on af-web02.crm

Autoscaling. Load is coming

Autoscaling. More load

2016-10-14 18:31:06,141 [WARNING] [crm3_instance] Reached the highest limit of instances for event-consumer

Autoscaling. Load is gone

2016-10-13 15:10:05,449 [DEBUG] Initilizing crm3
2016-10-13 15:10:05,449 [DEBUG] Query adminapi for 'hostname=aggregator.crm state=online'
2016-10-13 15:10:05,524 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-batch-target-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-catalog
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-content-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-data-api
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-consumer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-target-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-frontend
2016-10-13 15:10:05,526 [DEBUG] Running crm3_instance
2016-10-13 15:10:05,604 [WARNING] [crm3_instance] Reached the instances limit of components for batch-target-producer
2016-10-13 15:10:05,709 [WARNING] [crm3_instance] Reached the instances limit of components for catalog
2016-10-13 15:10:05,789 [WARNING] [crm3_instance] Reached the instances limit of components for content-producer
2016-10-13 15:10:05,903 [WARNING] [crm3_instance] Reached the instances limit of components for data-api
2016-10-13 15:11:25,887 [INFO] [crm3_instance] removed event-consumer from af-web02.crm

The path

Needs to be written:

• Grafsy

• Graphite2monitoring

• ClusterHC - healthchecking clusters

• mmdu - management of users in MySQL (the Puppet module sucks)

• Brassmonkey modules (Python)

Architecture:

• Integral or differential way of monitoring

• Build VMs with one command

• More than one component working at the same moment. Deadlocks

Conclusion

• It is not hard to make your own autoscaling

• Much cheaper than AWS or Azure

• Does not require migration to another technology

• Saves company resources

• Keeps sysadmins calm

• Forces you to have a proper application architecture

https://github.com/leoleovich https://github.com/innogames

https://www.innogames.com