How to make High-Loaded service with no data about load
Oleg Obleukhov, Site Reliability Engineer, InnoGames
What is CRM and why do we need it
• Actions are tracked while playing
• Data in Hadoop (~25 TB, 400 B events)
• Templates of behavior
• Near-real-time campaigns
• Money, money, money
• Keep players playing
• Attract new players
Blackbox of CRM
• Games send 500 M events per day in real time to Hadoop
• Events need to be selected, filtered, and used
• In-game messages are sent back to the game
• As fast as possible

Questions
• Architecture?
• Reliability?
• How much load?
Service architecture

Components: Frontend, Data-api, Database, Queue, Consumer, Producer

• Microservices
• Load grows sequentially
• Components need to be reliable
• How many components?
Watts and Virtual servers
Autoscaling!
• When idle - keep only high availability
• If needed - add instances
• Not enough - add servers
• System repairs itself
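What that decision loop can look like in practice, as a minimal shell sketch: it assumes the load metric is readable from Graphite's render API and that instances are systemd template units; the service name, metric path, thresholds, and limits are all invented for illustration.

#!/bin/bash
# Hypothetical scaling decision for one microservice (names and limits invented)
SERVICE="event-consumer"
MIN=2    # high-availability floor kept while idle
MAX=10   # beyond this, add servers (VMs), not instances

# average CPU usage over the last 5 minutes via Graphite's render API
CPU=$(curl -s "http://graphite/render?target=crm.${SERVICE}.cpu_usage&from=-5min&format=json" |
      jq '[.[0].datapoints[][0] | select(. != null)] | add / length')

RUNNING=$(systemctl list-units --state=running --no-legend "crm-${SERVICE}@*" | wc -l)

if (( $(echo "$CPU > 80" | bc -l) )) && (( RUNNING < MAX )); then
    systemctl start "crm-${SERVICE}@$((RUNNING + 1))"   # load is coming: add an instance
elif (( $(echo "$CPU < 20" | bc -l) )) && (( RUNNING > MIN )); then
    systemctl stop "crm-${SERVICE}@${RUNNING}"          # load is gone: remove an instance
fi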
Infrastructure. What we had
• 3 datacenters
• Thousands of VMs, hundreds of hardware servers
• Just migrated from Xen to KVM (live migrations)
• Tested different cloud solutions: much more expensive
• Testing Docker
Infrastructure. LB pools and nodes

Internet -> HWLB -> LB pool 1 (212.53.146.1), LB pool N (212.53.146.5) -> Host 1 … Host N
Host 1: eth0 10.0.1.1; lo 212.53.146.1, 212.53.146.5; runs Service 1 … Service N
Host N: eth0 10.0.1.5; lo 212.53.146.1, 212.53.146.5; runs Service 1 … Service N

• Load balancing is done by PF on FreeBSD
• Every host has the IP of the service on its lo interface
• The Linux network stack wants to take the «short way»
• Traffic must always go via the LB pool:

# Delete the route from the local table
$ ip route list table local
local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1

# Create table 42 and add our IP there
$ ip route list table 42
local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1

# Add a rule for our IP
$ ip rule
42: from all to 212.53.146.1 iif eth0 lookup 42
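The slide shows only the resulting listings; the iproute2 commands that produce this state would be roughly the following (the IP and table number are taken from the slide):

# remove the kernel's automatic local route for the service IP
$ ip route del table local local 212.53.146.1 dev lo
# recreate it in the dedicated table 42
$ ip route add table 42 local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1
# consult table 42 only for traffic arriving from the network (iif eth0),
# so locally generated traffic to the service IP still leaves via the LB pool
$ ip rule add from all to 212.53.146.1 iif eth0 lookup 42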
Infrastructure. Load balancing in CRM

Host 1 … Host H, each running several microservices (Microservice 1, Microservice 3, … Microservice N)
LB pool 1 … LB pool N in front of the microservices
MariaDB LB -> MariaDBHost1, MariaDBHost2, … MariaDBHostD
RMQ LB -> RMQHost1, RMQHost2, … RMQHostR
Autoscaling. A chain

Hypervisor 1 … Hypervisor H
  Virtual Host 1 … Virtual Host V
    Microservice 1 … Microservice M
Grafsy -> Graphite -> Nagios -> Brassmonkey -> Serveradmin

• Microservices report CPU and MEM usage
• Grafsy reliably sends it to Graphite
• Graphite2monitoring notifies Nagios
• Brassmonkey reacts to events in Nagios
• Serveradmin makes changes (new host / Puppet)
Infrastructure. Graphite
• Only 2 hosts, ~400 GB of RAM, Whisper
• 50,000+ metrics per second
• Tried integration with ClickHouse, Cassandra…
• Client: Grafsy
• Notifier: Graphite2monitoring
• Statistics: igcollect
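For reference, Graphite's plaintext protocol, which a client like Grafsy ultimately speaks, is one line per metric: name, value, Unix timestamp. The metric name and Graphite hostname below are made up:

$ echo "crm.data-api.cpu_usage 42.5 $(date +%s)" | nc graphite.example.com 2003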
Infrastructure. Nagios
• Special host type «aggregator»
• 451.44 checks/sec
• 2 fully replicated hosts
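Presumably the aggregator's services are passive checks that Graphite2monitoring feeds with results, and Nagios only watches their freshness. A hypothetical object definition for one of them, in standard Nagios syntax (the service name is borrowed from the autoscaler log further down; the freshness value is invented):

define service {
    host_name               aggregator.crm
    service_description     cpu_usage_crm-data-api
    active_checks_enabled   0     ; results are pushed in, not polled
    passive_checks_enabled  1
    check_freshness         1
    freshness_threshold     300   ; complain if no result for 5 minutes
    check_command           check_dummy!3!"no data from graphite2monitoring"
    max_check_attempts      1
}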
Infrastructure. Brassmonkey: Python-sysadmin
• Used for routine tasks (reboot server, restart daemon, cron…)
• Checking Nagios
• Notifying admins
• Performing actions
• Can do more… Autoscaling
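Brassmonkey and its modules are Python; purely to illustrate the «check Nagios, then act» loop, here is a minimal shell sketch. It assumes MK Livestatus is available; the socket path, service naming convention, and restart action are all invented:

#!/bin/bash
# Illustration only: find services in CRITICAL (state 2) and restart
# the matching daemon on the affected host
QUERY='GET services\nColumns: host_name description\nFilter: state = 2'
echo -e "$QUERY" | unixcat /var/lib/nagios/rw/live |
while IFS=';' read -r host service; do
    case "$service" in
        proc_crm-*)                        # e.g. proc_crm-data-api
            daemon="${service#proc_}"
            ssh "$host" "systemctl restart $daemon" &&
                echo "restarted $daemon on $host"
            ;;
    esac
done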
Infrastructure. Serveradmin
• Single source of truth
• Controls roles via Puppet classes
• DNS
• LB nodes of LB pools
• Location of VMs
• Nagios checks / Graphite graphs
Autoscaling. Fewer components

2016-10-13 12:40:42,115 [DEBUG] Initilizing crm3
2016-10-13 12:40:42,115 [DEBUG] Query adminapi for 'hostname=aggregator.crm state=online'
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-batch-target-producer
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-catalog
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-content-producer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-data-api
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-consumer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-target-producer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> num_crm-frontend
2016-10-13 12:40:42,217 [DEBUG] Running crm3_instance
2016-10-13 12:40:42,311 [WARNING] [crm3_instance] Reached the lowest limit of instances for batch-target-producer
2016-10-13 12:40:42,393 [WARNING] [crm3_instance] Reached the lowest limit of instances for catalog
2016-10-13 12:40:42,472 [WARNING] [crm3_instance] Reached the lowest limit of instances for content-producer
2016-10-13 12:40:42,550 [WARNING] [crm3_instance] Reached the lowest limit of instances for data-api
2016-10-13 12:40:42,646 [WARNING] [crm3_instance] Reached the lowest limit of instances for event-consumer
2016-10-13 12:40:42,718 [WARNING] [crm3_instance] Reached the lowest limit of instances for event-target-producer
2016-10-13 12:41:35,876 [INFO] [crm3_instance] added frontend on af-web02.crm
Autoscaling. Load is coming
Autoscaling. More load

2016-10-14 18:31:06,141 [WARNING] [crm3_instance] Reached the highest limit of instances for event-consumer
Autoscaling. Load is gone

2016-10-13 15:10:05,449 [DEBUG] Initilizing crm3
2016-10-13 15:10:05,449 [DEBUG] Query adminapi for 'hostname=aggregator.crm state=online'
2016-10-13 15:10:05,524 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-batch-target-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-catalog
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-content-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-data-api
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-consumer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-target-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-frontend
2016-10-13 15:10:05,526 [DEBUG] Running crm3_instance
2016-10-13 15:10:05,604 [WARNING] [crm3_instance] Reached the instances limit of components for batch-target-producer
2016-10-13 15:10:05,709 [WARNING] [crm3_instance] Reached the instances limit of components for catalog
2016-10-13 15:10:05,789 [WARNING] [crm3_instance] Reached the instances limit of components for content-producer
2016-10-13 15:10:05,903 [WARNING] [crm3_instance] Reached the instances limit of components for data-api
2016-10-13 15:11:25,887 [INFO] [crm3_instance] removed event-consumer from af-web02.crm
The path
• Needs to be written:
  • Grafsy
  • Graphite2monitoring
  • ClusterHC - healthchecking clusters
  • mmdu - management of users in MySQL (the Puppet module sucks)
  • Brassmonkey modules (Python)
• Architecture:
  • Integral or differential way of monitoring
  • Build VMs with one command
  • More than one component working at the same moment: deadlocks
Conclusion
• It is not hard to make your own autoscaling
• Much cheaper than AWS or Azure
• Does not require migration to another technology
• Saves company resources
• Keeps the sysadmin calm
• Forces you to have a proper application architecture
https://github.com/leoleovich https://github.com/innogames
https://www.innogames.com