How to make a high-loaded service with no data about load
Oleg Obleukhov, Site Reliability Engineer, InnoGames

What is CRM and why do we need it

• Actions are tracked while playing

• Data in Hadoop (~25TB, 400B events)

• Templates of behavior

• Near-time campaign

• Money, money, money

• Keep players playing

• Attract new players

Blackbox of CRM

• Games send 500 M events per day in realtime to Hadoop

• Events need to be selected, filtered, used

• In-game messages are sent to the game

• As fast as possible

Questions

• Architecture?

• Reliability?

• How much load?

Service architecture

[Diagram: Frontend, Data-api, Database, Producer, Queue, Consumer]

• Microservices

• Load grows sequentially

• Components need to be reliable

• How many components?

Watts and Virtual servers

Autoscaling!

• When idle - keep only high availability

• If needed - add instances

• Not enough - add servers

• System repairs itself
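The policy above fits in a few lines. Here is a minimal Python sketch of one iteration of that decision loop; the thresholds, limits and action names are illustrative assumptions, not the actual autoscaler code.

# A minimal sketch of the scale-up/scale-down policy described above.
# All names, thresholds and limits here are illustrative assumptions.

MIN_INSTANCES = 2      # when idle, keep only what high availability needs
SCALE_UP_CPU = 75.0    # % average CPU above which capacity is added
SCALE_DOWN_CPU = 25.0  # % average CPU below which capacity is removed


def decide(avg_cpu, instances, free_slots):
    """Return one action for a single microservice: 'add_instance',
    'add_server', 'remove_instance' or 'nothing'."""
    if avg_cpu > SCALE_UP_CPU:
        # "If needed - add instances", "Not enough - add servers"
        return "add_instance" if free_slots > 0 else "add_server"
    if avg_cpu < SCALE_DOWN_CPU and instances > MIN_INSTANCES:
        # "When idle - keep only high availability"
        return "remove_instance"
    return "nothing"


if __name__ == "__main__":
    # Example: a busy service with room left on the hosts, then without room,
    # then an idle service that can shrink back to the HA minimum.
    print(decide(avg_cpu=90.0, instances=4, free_slots=3))   # -> add_instance
    print(decide(avg_cpu=90.0, instances=4, free_slots=0))   # -> add_server
    print(decide(avg_cpu=10.0, instances=4, free_slots=3))   # -> remove_instance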

Infrastructure. What we had

• 3 data centers

• Thousands of VMs, hundreds of hardware servers

• Just migrated from Xen to KVM (live migrations)

• Tested different cloud solutions; much more expensive

• Testing Docker

Infrastructure. LB pools and nodes

[Diagram: Internet → HWLB with LB pool 1 (212.53.146.1) and LB pool N (212.53.146.5) → Host 1 (eth0: 10.0.1.1) … Host N (eth0: 10.0.1.5); every host carries 212.53.146.1 and 212.53.146.5 on lo and runs Service 1 … Service N]

• Load balancing is done by PF and FreeBSD

• Every host has the IP of the service on its lo interface

• The Linux network stack wants to use the «short way»

• Traffic must always go via the LB pool; the routing trick below enforces it (a scripted sketch follows the listings)

# delete route from local table
$ ip route list table local
local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1

# Create table 42 and add there our IP
$ ip route list table 42
local 212.53.146.1 dev lo proto kernel scope host src 212.53.146.1

# Add rule for our IP
$ ip rule
42: from all to 212.53.146.1 iif eth0 lookup 42
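The listings above show the resulting state. For completeness, a hedged sketch of how a provisioning script could reach that state with iproute2 is shown below; the exact commands, table number and interface names are assumptions reconstructed from the output above, not the commands used in production.

# Sketch: move the service IP out of the "local" routing table into table 42,
# so that traffic to 212.53.146.1 does not take the loopback "short way" and
# only packets arriving on eth0 (i.e. via the LB pool) match the local route.
# Commands are an assumption reconstructed from the listings above.
import subprocess

SERVICE_IP = "212.53.146.1"
TABLE = "42"

commands = [
    # delete the auto-created route from the local table
    ["ip", "route", "del", "local", SERVICE_IP, "dev", "lo", "table", "local"],
    # re-create the same local route in table 42
    ["ip", "route", "add", "local", SERVICE_IP, "dev", "lo",
     "scope", "host", "src", SERVICE_IP, "table", TABLE],
    # only traffic for the service IP arriving on eth0 consults table 42
    ["ip", "rule", "add", "to", SERVICE_IP, "iif", "eth0",
     "table", TABLE, "priority", TABLE],
]

for cmd in commands:
    subprocess.run(cmd, check=True)  # requires root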

Infrastructure. Load balancing in CRM

[Diagram: Microservice 1 … Microservice N running on Host 1 … Host H behind LB pool 1 … LB pool N; a MariaDB LB in front of MariaDB Host 1, Host 2 … Host D; an RMQ LB in front of RMQ Host 1, Host 2 … Host R]

Autoscaling. A chain

[Diagram: Hypervisor 1 … Hypervisor H run Virtual Host 1 … Virtual Host V, each with Microservice 1 … Microservice M and a Grafsy instance feeding Graphite]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite (a protocol sketch follows below)
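Graphite's plaintext protocol, which Grafsy ultimately forwards, is just "path value timestamp" lines over TCP (port 2003 by default). Below is a minimal Python sketch of a microservice reporting its CPU usage; the relay address and the metric path are illustrative assumptions.

# Minimal sketch: send one metric in Graphite's plaintext format
# ("<path> <value> <timestamp>\n" over TCP, port 2003 by default).
# The metric path and relay address are illustrative assumptions; in the
# setup above a local Grafsy instance would buffer and forward this.
import socket
import time

GRAPHITE_HOST = "localhost"   # e.g. the local Grafsy relay
GRAPHITE_PORT = 2003          # carbon plaintext listener


def send_metric(path, value, timestamp=None):
    line = "%s %f %d\n" % (path, value, int(timestamp or time.time()))
    with socket.create_connection((GRAPHITE_HOST, GRAPHITE_PORT), timeout=5) as sock:
        sock.sendall(line.encode("ascii"))


if __name__ == "__main__":
    # e.g. the CPU usage a microservice reports about itself
    send_metric("crm.af-web02.event-consumer.cpu_usage", 42.5)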

Infrastructure. Graphite

• Only 2 hosts, ~400 GB of RAM. Whisper

• 50000+ metrics per second

• Tried integration with ClickHouse, Cassandra…

• Client: Grafsy

• Notifier: Graphite2monitoring

• Statistics: igcollect

Autoscaling. A chain

[Diagram: the same chain, now with Graphite feeding Nagios]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite

• Graphite2monitoring notifies Nagios (a passive-check sketch follows below)
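One standard way for a notifier such as Graphite2monitoring to feed Nagios is a passive check result written to the Nagios external command file. A sketch of that mechanism follows; the command-file path and the host/service names are assumptions for illustration (the service name mirrors the autoscaler log later in the talk).

# Sketch: submit a passive service check result to Nagios through its
# external command file (PROCESS_SERVICE_CHECK_RESULT). The command-file
# path, host name and service name are assumptions for illustration.
import time

NAGIOS_CMD_FILE = "/var/lib/nagios3/rw/nagios.cmd"  # assumed location


def submit_result(host, service, state, output):
    # state: 0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN
    line = "[%d] PROCESS_SERVICE_CHECK_RESULT;%s;%s;%d;%s\n" % (
        int(time.time()), host, service, state, output)
    with open(NAGIOS_CMD_FILE, "w") as cmd_file:
        cmd_file.write(line)


if __name__ == "__main__":
    # e.g. report that event-consumer CPU usage crossed its threshold
    submit_result("aggregator.crm", "cpu_usage_crm-event-consumer",
                  2, "CRITICAL - average CPU usage 93%")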

Infrastructure. Nagios

• Special host-type «aggregator»

• 451.44 checks/sec

• 2 fully replicated hosts

Autoscaling. A chain

[Diagram: the same chain, now with Brassmonkey watching Nagios]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite

• Graphite2monitoring notifies Nagios

• Brassmonkey reacts to events in Nagios

Infrastructure. Brassmonkey: Python-sysadmin

• Used for routine tasks (reboot server, restart daemon, cron…)

• Checking Nagios

• Notifying admins

• Performing actions

• Can do more… Autoscaling
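Brassmonkey modules are plain Python. A heavily simplified, hypothetical sketch of such a module is shown below: poll Nagios, notify admins, perform an action. Every interface in it is an assumption for illustration; the real modules and their Nagios integration are not shown in the talk.

# Hypothetical sketch of a Brassmonkey-style module: poll Nagios state,
# notify admins, and perform a corrective action. All interfaces here
# (get_problems, notify, restart_service) are illustrative assumptions.
import time


def get_problems():
    """Stand-in for asking Nagios which services are currently non-OK."""
    return [{"host": "af-web02.crm", "service": "crm-event-consumer",
             "state": "CRITICAL"}]


def notify(message):
    print("notify admins:", message)                 # mail/chat in reality


def restart_service(host, service):
    print("restarting %s on %s" % (service, host))   # a routine task


def run_once():
    for problem in get_problems():
        notify("%(service)s on %(host)s is %(state)s" % problem)
        restart_service(problem["host"], problem["service"])


if __name__ == "__main__":
    while True:               # Brassmonkey-style periodic loop
        run_once()
        time.sleep(60)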

Autoscaling. A chain

[Diagram: the full chain, now including Serveradmin]

• Microservices report CPU and MEM usage

• Grafsy reliably sends it to Graphite

• Graphite2monitoring notifies Nagios

• Brassmonkey reacts to events in Nagios

• Serveradmin makes changes (new host/Puppet)

Infrastructure. Serveradmin

• Single source of truth

• Controls roles via Puppet classes

• DNS

• LB node of LB pool

• Location of VM

• Nagios checks / Graphite graphs
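The autoscaler log on the next slides queries Serveradmin through adminapi. A hedged sketch of what such a lookup could look like is below; the import path and the query() signature are assumptions inferred from that log line, not verified against the adminapi client.

# Hypothetical sketch of querying Serveradmin for the CRM aggregator host,
# mirroring the "Query adminapi for 'hostname=aggregator.crm state=online'"
# line in the autoscaler log. The import and query() signature are
# assumptions about the adminapi client, not verified documentation.
from adminapi.dataset import query

hosts = query(hostname="aggregator.crm", state="online")
for host in hosts:
    print(host["hostname"])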

Autoscaling. Less components

2016-10-13 12:40:42,115 [DEBUG] Initilizing crm3
2016-10-13 12:40:42,115 [DEBUG] Query adminapi for 'hostname=aggregator.crm state=online'
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-batch-target-producer
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-catalog
2016-10-13 12:40:42,215 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-content-producer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-data-api
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-consumer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-target-producer
2016-10-13 12:40:42,216 [DEBUG] Found service: aggregator.crm -> num_crm-frontend
2016-10-13 12:40:42,217 [DEBUG] Running crm3_instance
2016-10-13 12:40:42,311 [WARNING] [crm3_instance] Reached the lowest limit of instances for batch-target-producer
2016-10-13 12:40:42,393 [WARNING] [crm3_instance] Reached the lowest limit of instances for catalog
2016-10-13 12:40:42,472 [WARNING] [crm3_instance] Reached the lowest limit of instances for content-producer
2016-10-13 12:40:42,550 [WARNING] [crm3_instance] Reached the lowest limit of instances for data-api
2016-10-13 12:40:42,646 [WARNING] [crm3_instance] Reached the lowest limit of instances for event-consumer
2016-10-13 12:40:42,718 [WARNING] [crm3_instance] Reached the lowest limit of instances for event-target-producer
2016-10-13 12:41:35,876 [INFO] [crm3_instance] added frontend on af-web02.crm

Autoscaling. Load is coming

Autoscaling. More load

2016-10-14 18:31:06,141 [WARNING] [crm3_instance] Reached the highest limit of instances for event-consumer

Autoscaling. Load is gone

2016-10-13 15:10:05,449 [DEBUG] Initilizing crm3
2016-10-13 15:10:05,449 [DEBUG] Query adminapi for 'hostname=aggregator.crm state=online'
2016-10-13 15:10:05,524 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-batch-target-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-catalog
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-content-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-data-api
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-consumer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-event-target-producer
2016-10-13 15:10:05,525 [DEBUG] Found service: aggregator.crm -> cpu_usage_crm-frontend
2016-10-13 15:10:05,526 [DEBUG] Running crm3_instance
2016-10-13 15:10:05,604 [WARNING] [crm3_instance] Reached the instances limit of components for batch-target-producer
2016-10-13 15:10:05,709 [WARNING] [crm3_instance] Reached the instances limit of components for catalog
2016-10-13 15:10:05,789 [WARNING] [crm3_instance] Reached the instances limit of components for content-producer
2016-10-13 15:10:05,903 [WARNING] [crm3_instance] Reached the instances limit of components for data-api
2016-10-13 15:11:25,887 [INFO] [crm3_instance] removed event-consumer from af-web02.crm

The path

Needs to be written:

• Grafsy

• Graphite2monitoring

• ClusterHC - healthchecking clusters

• mmdu - management of users in MySQL (the Puppet module sucks)

• Brassmonkey modules (Python)

Architecture:

• Integral or differential way of monitoring

• Build VMs with one command

• More than one component working at the same moment. Deadlocks

Conclusion

• It is not hard to make your own autoscaling

• Much cheaper than AWS or Azure

• Does not require migration to another technology

• Saves company resources

• Keeps sysadmins calm

• Forces you to have a proper application architecture

https://github.com/leoleovich https://github.com/innogames

https://www.innogames.com