Monitoring. Again, rootconf 2016
MONITORING. AGAIN.
Vsevolod Polyakov
What are metrics?
Success
Count
Time
Interaction
Internal processes
System metrics
Why do we need metrics?
Alerts
Analytics
Graphite
Default graphite architecture
what?
• RRD-like (gram.ly/gfsx)
• so.it.is.my.metric → /so/it/is/my/metric.wsp
• Fixed retention (by name/pattern)
• Fixed size (actually no)
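The name-to-file mapping on this slide can be sketched in a few lines of Go. This is only an illustration of the rule shown above: the function name `MetricPath` and the root directory `/` are assumptions made for the example, not code taken from Graphite.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// MetricPath maps a dotted metric name to the whisper file that stores it
// under root: every dot-separated segment becomes a directory, and the last
// segment becomes the file name with a .wsp extension.
func MetricPath(root, metric string) string {
	parts := strings.Split(metric, ".")
	return path.Join(append([]string{root}, parts...)...) + ".wsp"
}

func main() {
	fmt.Println(MetricPath("/", "so.it.is.my.metric"))
}
```

One consequence of this layout is that every metric is its own file, which is why the I/O pattern of the storage daemon matters so much in the load tests later in the deck.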
Retention and size
• 1s:1d → 1 036 828 bytes
• 10s:10d → 1 036 828 bytes
• 1s:365d → 378 432 028 bytes (1 TB ≈ 3 000 metrics)
• 10s:365d → 37 843 228 bytes (1 TB ≈ 30 000 metrics)
whisper calc
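The single-archive sizes above follow from the whisper file layout: a 16-byte metadata header, 12 bytes of archive info per archive, and 12 bytes per datapoint. A minimal sketch of that arithmetic, with the type and function names (`Archive`, `WhisperSize`) chosen for this example only:

```go
package main

import "fmt"

const (
	metadataSize    = 16 // aggregation method, max retention, xFilesFactor, archive count
	archiveInfoSize = 12 // offset, seconds per point, number of points
	pointSize       = 12 // 4-byte timestamp + 8-byte double value
	secondsPerDay   = 24 * 60 * 60
)

// Archive is one retention level, for example 10s:365d.
type Archive struct {
	SecondsPerPoint int64
	RetentionSecs   int64
}

// WhisperSize returns the file size implied by the header sizes above.
func WhisperSize(archives []Archive) int64 {
	size := int64(metadataSize)
	for _, a := range archives {
		points := a.RetentionSecs / a.SecondsPerPoint
		size += archiveInfoSize + points*pointSize
	}
	return size
}

func main() {
	fmt.Println(WhisperSize([]Archive{{1, secondsPerDay}}))        // 1s:1d
	fmt.Println(WhisperSize([]Archive{{10, 10 * secondsPerDay}}))  // 10s:10d
	fmt.Println(WhisperSize([]Archive{{1, 365 * secondsPerDay}}))  // 1s:365d
	fmt.Println(WhisperSize([]Archive{{10, 365 * secondsPerDay}})) // 10s:365d
}
```

The first two retentions give the same size because both hold 86 400 points; the size depends on the number of points, not on the covered time span.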
Retention and size
• 10s:30d,1m:120d,10m:365d → 4 564 864 bytes
• 240 864 metrics in 1 TB
• aggregation: average, sum, min, max, and last
• can be assigned per metric
How
• terraform (https://www.terraform.io/)
• docker (https://www.docker.com/)
• ansible (https://www.ansible.com/)
• rocker (https://github.com/grammarly/rocker)
• rocker-compose (https://github.com/grammarly/rocker-compose)
Default graphite architecture
carbon-cache.py
• single-core
• many options in config file
• default
carbon-cache.py architecture
Start load testing
• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB disk EBS gp2)
• retentions = 1s:1d
• MAX_CACHE_SIZE, MAX_UPDATES_PER_SECOND, MAX_CREATES_PER_MINUTE = inf
• defaults
• almost 1.5 h to reach the limit :(
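The deck does not say how the load was generated. As a hedged sketch, assuming the generator speaks carbon's plaintext protocol (one `<path> <value> <timestamp>` line per datapoint), a batch for one second of load could be built like this. The names `PlaintextLine`, `Batch`, and the `loadtest` prefix are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// PlaintextLine formats one datapoint in carbon's plaintext protocol:
// "<metric path> <value> <unix timestamp>\n".
func PlaintextLine(name string, value float64, ts int64) string {
	return fmt.Sprintf("%s %g %d\n", name, value, ts)
}

// Batch builds the payload for n distinct metrics sharing one timestamp.
// With retentions = 1s:1d, sending one such batch per second means
// n metrics/s of load.
func Batch(prefix string, n int, ts int64) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		name := fmt.Sprintf("%s.metric%d", prefix, i)
		b.WriteString(PlaintextLine(name, float64(i), ts))
	}
	return b.String()
}

func main() {
	fmt.Print(Batch("loadtest", 3, 1460000000))
}
```

The resulting string can be written to a TCP connection opened with `net.Dial` against the carbon line receiver (port 2003 by default); raising `n` step by step gives a ramp like the one shown on the following slides.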
carbon-cache.py cache size → 75k metrics/s
results
• 75 000 metrics/s max
• 60 000 metrics/s flagman speed
• I/O :(
Try to tune!
• WHISPER_SPARSE_CREATE = true (don’t allocate space on creation): non-linear I/O load.
• CACHE_WRITE_STRATEGY = sorted (default)
cache size 1k → 195k metrics/s
results
• 120 000 metrics/s flagman speed
• cache flush problem :(
Try to tune!
• CACHE_WRITE_STRATEGY = max: gives a strong flush preference to frequently updated metrics and also reduces random file I/O.
from 1k to 150k
results
• 90 000 metrics/s flagman speed
• cache flush problem :(
Try to tune!
• CACHE_WRITE_STRATEGY = naive: just flush, in no particular order. Better when random I/O is cheap.
from 45k to 135k
results
• 120 000 metrics/s flagman speed
• still CPU-bound
sorted
max
naive
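The difference between the three strategies is only the order in which cached metrics are flushed to disk. The sketch below models the cache as a map from metric name to the number of buffered datapoints; it is a simplified illustration of the ordering rules, not carbon-cache.py's actual code, and all names in it are assumptions.

```go
package main

import (
	"fmt"
	"sort"
)

// SortedOrder models "sorted": take a snapshot of the cache and flush metrics
// in descending order of buffered datapoints (ties broken by name).
func SortedOrder(cache map[string]int) []string {
	names := make([]string, 0, len(cache))
	for name := range cache {
		names = append(names, name)
	}
	sort.Slice(names, func(i, j int) bool {
		ci, cj := cache[names[i]], cache[names[j]]
		if ci != cj {
			return ci > cj
		}
		return names[i] < names[j]
	})
	return names
}

// PopMax models "max": every time, remove and return the metric that
// currently has the most buffered datapoints.
func PopMax(cache map[string]int) (string, bool) {
	best, found := "", false
	for name, n := range cache {
		if !found || n > cache[best] || (n == cache[best] && name < best) {
			best, found = name, true
		}
	}
	if found {
		delete(cache, best)
	}
	return best, found
}

// NaiveOrder models "naive": flush in whatever order the cache yields
// (Go map iteration order is deliberately unspecified).
func NaiveOrder(cache map[string]int) []string {
	names := make([]string, 0, len(cache))
	for name := range cache {
		names = append(names, name)
	}
	return names
}

func main() {
	cache := map[string]int{"a": 1, "b": 5, "c": 3}
	fmt.Println(SortedOrder(cache))
	fmt.Println(len(NaiveOrder(cache)))
	fmt.Println(PopMax(cache))
}
```

"sorted" pays for one sort per pass, "max" rescans on every pop, and "naive" does no ordering work at all, which is one plausible reading of why the CPU-bound single-core daemon behaves differently under each setting.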
• Maybe it is an EBS I/O limitation? → 512 GB disk.
• No.
go-carbon
• multi-core single daemon
• written in golang
• not many options to tune :(
Start load testing
• m4.xlarge instance (4 CPU, 16 GB RAM, 256 GB disk EBS gp2)
• retentions = 1s:1d
• max-size = 0
• max-updates-per-second = 0
• almost 1 h to reach the limit :(
1k → 130k metrics/s, ~3k/min
results
• 120 000 metrics/s flagman speed
• but that is without sparse file creation
• try to implement it
try to tune!
remaining := whisper.Size() - whisper.MetadataSize()
whisper.file.Seek(int64(remaining-1), 0)
whisper.file.Write([]byte{0})
chunkSize := 16384
zeros := make([]byte, chunkSize)
for remaining > chunkSize {
    // if _, err = whisper.file.Write(zeros); err != nil {
    //     return nil, err
    // }
    remaining -= chunkSize
}
if _, err = whisper.file.Write(zeros[:remaining]); err != nil {
    return nil, err
}
Already available in go-carbon
180 000 metrics/s!
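The idea behind the patch above is to skip writing the zero-filled body and only touch the last byte, so the filesystem can leave the rest as a hole. A self-contained sketch of that trick follows; the helper names (`CreateSparse`, `CreateAndStat`) and the temporary file are assumptions for the example, and whether the file really ends up sparse depends on the filesystem.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// CreateSparse creates a file with the given apparent size without writing
// its body: it seeks to the last byte and writes a single zero there.
func CreateSparse(path string, size int64) error {
	if size <= 0 {
		return fmt.Errorf("size must be positive, got %d", size)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	if _, err := f.Seek(size-1, io.SeekStart); err != nil {
		f.Close()
		return err
	}
	if _, err := f.Write([]byte{0}); err != nil {
		f.Close()
		return err
	}
	return f.Close()
}

// CreateAndStat creates a sparse file in a temporary directory, returns the
// size reported by the filesystem, and removes the directory again.
func CreateAndStat(size int64) (int64, error) {
	dir, err := os.MkdirTemp("", "sparse-demo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	p := filepath.Join(dir, "metric.wsp")
	if err := CreateSparse(p, size); err != nil {
		return 0, err
	}
	info, err := os.Stat(p)
	if err != nil {
		return 0, err
	}
	return info.Size(), nil
}

func main() {
	// 1 036 828 bytes is the 1s:1d whisper file size used in the load test.
	size, err := CreateAndStat(1036828)
	fmt.Println(size, err)
}
```

Creation becomes one small write instead of about a megabyte per new metric, which matters while hundreds of thousands of files are being created during the ramp-up; the cost is paid later, when datapoints land in blocks that were never allocated.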
try to tune!
• max update operation = 1500
results
• TL;DR: 210 000 - 240 000 metrics/s flagman speed
• 31 000 000 cache size!
try to tune!
• max update operation = 0
• input-buffer = 400 000
results
• 270 000 metrics/s flagman speed
• 10-20 million cache size!
try to tune!
• vm.dirty_background_ratio=40
• vm.dirty_ratio=60
300 000 req/s
results
• 300 000 metrics/s flagman speed
• 180k+ metrics/s more or less without cache
Re:Lays
Default graphite architecture
Architecture: forward
Architecture: named/regexp
Architecture: hash
Architecture: hash, replication factor: 2
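Hash routing with a replication factor can be sketched as: hash the metric name, pick a starting backend, then take the next backends in order until enough replicas are chosen. This is a deliberately simplified modulo scheme for illustration, not the consistent-hash ring that the real relays implement; the function name and backend names are hypothetical.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Destinations picks `replicas` distinct backends for a metric name.
// The same name always maps to the same backends, so every datapoint of a
// metric lands in the same whisper files.
func Destinations(metric string, backends []string, replicas int) []string {
	if len(backends) == 0 || replicas <= 0 {
		return nil
	}
	if replicas > len(backends) {
		replicas = len(backends)
	}
	h := fnv.New32a()
	h.Write([]byte(metric)) // hash.Hash writes never return an error
	start := int(h.Sum32() % uint32(len(backends)))
	out := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		out = append(out, backends[(start+i)%len(backends)])
	}
	return out
}

func main() {
	backends := []string{"store-a", "store-b", "store-c"}
	fmt.Println(Destinations("so.it.is.my.metric", backends, 2))
}
```

With a plain modulo, adding or removing a backend remaps most metrics; avoiding that reshuffle is the reason production relays use a consistent-hash ring instead.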
carbon-relay.py
• twisted based
• native
Start load testing
• c4.xlarge instance (4 CPU, 7.5 GB RAM)
• ~1 Gbit/s LAN
• default parameters
• hashing
• 10 connections
WTF!
carbon-relay-ng
• golang-based
• web-panel
• live-updates
• aggregators
• spooling
< 150 000 req/s
carbon-c-relay
• written in C
• advanced cluster management
from 100 000 to 1 600 000 req/s
1 400 000 flagman speed. Or not?
So… go-carbon + carbon-c-relay = ♡
Containers
Everything is mixed up
Differences
• Environment
• Role
• Track (modifier)
• IP
• Datacenter
• Anything else
Tags
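With containers, the differences listed above (environment, role, track, IP, datacenter) fit poorly into one fixed dotted path, which is the motivation for tags. A minimal sketch of a tagged series, where the `Series` type, the `name;key=value` rendering, and all tag values are assumptions made for the example:

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Series is a metric name plus free-form tags.
type Series struct {
	Name string
	Tags map[string]string
}

// Key renders the series as "name;k1=v1;k2=v2" with tag keys sorted, so the
// same set of tags always produces the same key.
func (s Series) Key() string {
	keys := make([]string, 0, len(s.Tags))
	for k := range s.Tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(s.Name)
	for _, k := range keys {
		b.WriteString(";" + k + "=" + s.Tags[k])
	}
	return b.String()
}

// Matches reports whether the series carries every tag in filter.
func (s Series) Matches(filter map[string]string) bool {
	for k, v := range filter {
		if s.Tags[k] != v {
			return false
		}
	}
	return true
}

func main() {
	s := Series{
		Name: "cpu.usage",
		Tags: map[string]string{"env": "prod", "role": "web", "dc": "eu-1"},
	}
	fmt.Println(s.Key())
	fmt.Println(s.Matches(map[string]string{"role": "web"}))
}
```

A query can then select by any subset of tags, for example every `role=web` series regardless of datacenter, without knowing the position of that attribute in a dotted path.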
TSDBs with tags
• influxDB
• openTSDB (hbase)
• cyanite (cassandra)
• newTS (cassandra)
• Prometheus
(cluster) influx, 130k metrics/s
openTSDB single instance + hbase cluster = up to 150k metrics/s
Compaction
Graphite
Find what is unique
Works with Grafana
Zipper
• https://github.com/grobian/carbonserver
• https://github.com/dgryski/carbonzipper
• https://github.com/dgryski/carbonapi
ALSO
• https://github.com/jssjr/carbonate
• https://github.com/jjneely/buckytools
• https://github.com/dgryski/carbonmem
• https://github.com/grobian/carbonwriter
Plans
• Patch: statsd → ES
• Patch: carbonserver → carbonlink
feel free to ask
• Vsevolod Polyakov
• skype: ctrlok1987
• github.com/ctrlok
• twitter.com/ctrlok
• slack: HangOps
• Gitter: dev_ua/devops
• skype: DevOps from Ukraine
• slack.ukrops.club
We are hiring!