Stateful workloads on kubernetes with ceph

NAVER, 유장선

10/28/2019 NVRAMOS 2019

Agenda

► CaaS
► Kubernetes
► Ceph Storage
► Operation

Cloud Service Model

Stack layers (top to bottom): Applications, Data, Runtime, Middleware, OS, Virtualization, Server, Storage, Network

Models compared, each drawn with the full stack: On Premises, IaaS, CaaS, PaaS, SaaS

On Premises side: built as desired (non-standard), higher cost, longer lead time
Cloud service side: standardized, lower cost, On Demand

Transformation of deployment

Traditional (App / Bin/Library / OS / Hardware)
• Apps interfere with one another
• Library compatibility issues
• Separating apps onto their own nodes increases cost

Virtualized (App / Bin/Library / OS inside each VM, on Hypervisor / OS / Hardware)
• Isolation by VM
• Improved security
• Better resource efficiency / scalability
• The OS inside each VM consumes extra resources
• Longer boot time

Containerized (App / Bin/Library inside each Container, on Container Runtime / OS / Hardware)
• Lighter than a VM (the OS is shared)
• Fast deployment
• Isolation by namespace
• Split into small, independent units
• High efficiency / high density

MSA (micro-service architecture)

Monolithic (one Application Server hosting Service A, B, C, D on a single DB)
• Code grows larger and more complex
• QA scope grows with every change
• Changes in linked services have knock-on effects
• Higher risk when a failure occurs
• Fast-changing market
• Changing development paradigm

Microservice (Service A, B, C, D, each with its own DB)
• Services are split into small pieces
• Deployment is simplified
• Technology can differ per service (libraries, languages, frameworks)
• Better scalability

Container Orchestration

• Provisioning / Deployment of containers
• Fault Tolerance (Replicas)
• Load Balancing
• Service Discovery
• Auto Scaling (Scale in/out)
• Resource Limit Control
• Scheduling
• Health Checking
• Cluster Management
• Configuration Management
• Monitoring

Diagram: replicas of Svc A, Svc B, Svc C, and Svc D spread across several worker nodes.

Orchestrators: Docker Swarm, Mesos Marathon, Cloud Foundry, CoreOS Fleet, Kubernetes, Google Container Engine, Amazon ECS, Azure Container Service

Kubernetes (K8S)

• De facto standard container orchestrator
• Open-source version of Borg, Google's internal container service (15 years of operational experience)
• Donated to the CNCF (Cloud Native Computing Foundation)
• Supports a wide range of cloud and bare-metal environments
• Written in Go
• Self-healing
• Horizontal Scaling
• Service Discovery / Load Balancing
• Automatic rollouts / rollbacks
• Secret / configuration management
• Storage orchestration
• Batch execution (crontab)
• …

Kubernetes

Flow: Kubectl -> API (K8s Master) -> Worker Nodes

YAML (deployment)
    Kind : deployment
    Selector : app: nginx
    Replicas : 3
    Template:
        image : nginx
        label : app: nginx

Result: three nginx Pods placed across the Worker Nodes

YAML (service)
    Kind : service
    Selector : app: nginx
    Type : LoadBalancer

Traffic: External IP (DNS/LB) -> client -> service (Internal IP, LB) -> BE, BE, BE
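The Deployment and Service definitions on this slide are abbreviated. As a rough sketch of what the full objects might look like, the Python below builds both manifests as dictionaries and prints them as JSON, which kubectl also accepts. The object names, the container name, and port 80 are assumptions that do not appear on the slide.

```python
import json


def nginx_manifests(replicas: int = 3) -> list:
    """Build a Deployment and a LoadBalancer Service that share one label."""
    labels = {"app": "nginx"}  # the selector and the Pod label must match
    deployment = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "nginx"},  # assumed name
        "spec": {
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {"containers": [{"name": "nginx", "image": "nginx"}]},
            },
        },
    }
    service = {
        "apiVersion": "v1",
        "kind": "Service",
        "metadata": {"name": "nginx"},  # assumed name
        "spec": {
            "type": "LoadBalancer",
            "selector": labels,
            "ports": [{"port": 80, "targetPort": 80}],  # assumed port
        },
    }
    return [deployment, service]


if __name__ == "__main__":
    for manifest in nginx_manifests():
        print(json.dumps(manifest, indent=2))
```

The Service finds its back-end Pods only through the shared `app: nginx` label, which is why the same dictionary is reused in all three places.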

CI/CD Pipeline

Developer: Commit & Push Code -> Git Repository
CI Server: Build & Run Tests -> Build Docker Image -> Push Docker Image (Docker Registry) -> Update Kubernetes Deployment
Kubernetes: Create New Pod -> Health Check (New Pod)
• Healthy Pod: Delete Old Pod
• Not Healthy Pod: Old Pod Running, Restart New Pod

Canary / Blue-Green Deployments

Autoscaling

HPA (Horizontal Pod Autoscaler)
• Metrics -> Check metrics -> Threshold is met? -> Change number of replicas (DEPLOYMENT)
• Scale in / out number of pods

VPA (Vertical Pod Autoscaler)
• Check metrics -> Threshold is met? -> Change cpu / memory values (DEPLOYMENT)
• Scale up / down the resources of pods
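The "check metrics, compare with threshold, change replicas" loop of the HPA can be sketched with the replica formula given in the Kubernetes documentation, desired = ceil(current * currentMetric / targetMetric). The function name, the replica bounds, and the sample numbers below are illustrative assumptions; the 0.1 tolerance mirrors the documented default.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """Return the replica count an HPA-style controller would aim for."""
    ratio = current_metric / target_metric
    # Within the tolerance band the controller leaves the Deployment alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    wanted = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds (assumed values here).
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    print(desired_replicas(3, 90, 60))   # scale out
    print(desired_replicas(4, 30, 60))   # scale in
```

A VPA differs in that it leaves the replica count alone and adjusts the cpu / memory requests of the Pods instead.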

Stateful workloads

Storage in K8S

Local Storage
• Ephemeral: data is stored inside the Pod (container); data is deleted together with the Pod; data is unavailable on host failure
• Host local disk: data is stored on the host's local disk; data is not deleted with the Pod; data is unavailable on host failure

Remote Storage
• Shared Storage: external network storage shared by multiple Pods; data is not deleted with the Pod; no service impact on host failure
• Block Storage: external network storage allocated per Pod; data is not deleted with the Pod; no service impact on host failure

Volume Plugin in Kubernetes

Plugin developed in-house

• Integration with OpenStack (Cinder)
• Multi-tenancy support
• Authentication / authorization integration (in-house authentication)
• Developed as a Flexvolume driver
• Applies operational know-how from the Docker Volume Plugin
• Online resize support
• Read-only multi-attach support
• Snapshot support
• CephFS FUSE / kernel mount support
• Prevention of RBD multi-attach (lock)
• Blacklist entry added on node drain
• IO Monitoring
• Front-end QoS (using cgroups)
• Quota support
• …

Statefulset

    apiVersion: apps/v1
    kind: StatefulSet
    spec:
      replicas: 3
      template:
        spec:
          containers:
          - name: nginx
            image: k8s.gcr.io/nginx-slim:0.8
            volumeMounts:
            - name: www
              mountPath: /usr/share/nginx/html
      volumeClaimTemplates:
      - metadata:
          name: www
        spec:
          accessModes: [ "ReadWriteOnce" ]
          resources:
            requests:
              storage: 1Gi

PODs     PVC          PV
WEB-0    WWW-WEB-0    PV-uuid (Vol)
WEB-1    WWW-WEB-1    PV-uuid (Vol)
WEB-2    WWW-WEB-2    PV-uuid (Vol)
…        …
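The Pod-to-PVC mapping on this slide follows the StatefulSet naming convention: Pods are named `<set name>-<ordinal>` and each claim is named `<claim template>-<set name>-<ordinal>`. A minimal sketch of that rule follows; the function name is hypothetical, and the set name `web` is an assumption inferred from the Pod names on the slide.

```python
def statefulset_names(set_name, claim_template, replicas):
    """Return (pod name, PVC name) pairs for each ordinal of a StatefulSet."""
    return [
        (f"{set_name}-{i}", f"{claim_template}-{set_name}-{i}")
        for i in range(replicas)
    ]


if __name__ == "__main__":
    for pod, pvc in statefulset_names("web", "www", 3):
        print(pod, pvc)
```

Because the claim name is derived from the ordinal, a rescheduled Pod with the same ordinal reattaches to the same claim and therefore to the same volume.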

Volume Plugin in NAVER

Provisioning path: Statefulset (YAML definition) -> Ceph Provisioner (PV Check) -> create request -> Cinder (authentication via Keystone) -> volume created in Ceph
Attach path: Ceph Driver -> Attach / Mount -> rbd (kernel map / mount) -> /dev/rbd0 -> POD

Distributed platform on distributed storage

Kafka on ceph rbd (3 copy)
• KAFKA #1, KAFKA #2, KAFKA #3, each stored on 3-copy Ceph storage
• 3 copy -> 9 copy

Elastic Search (Warm) on ceph ec (1.5 copy)
• ES #1 (Warm), ES #2 (Warm)
• ES : 2 copy x EC : 1.5 copy = 3 copy
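The copy counts on this slide are the product of application-level replication and storage-level overhead. The arithmetic can be sketched as below. The erasure-coding profile k=4, m=2 is an assumption chosen because it yields the 1.5x overhead quoted on the slide; the slide does not state the actual profile, and the function names are hypothetical.

```python
from fractions import Fraction


def replicated_overhead(copies):
    """Raw-to-usable ratio of a replicated pool."""
    return Fraction(copies)


def ec_overhead(k, m):
    """Raw-to-usable ratio of an erasure-coded pool with k data and m coding chunks."""
    return Fraction(k + m, k)


def total_copies(app_replicas, storage_overhead):
    """Physical copies stored for one logical copy of the data."""
    return app_replicas * storage_overhead


if __name__ == "__main__":
    print("Kafka on rbd:", total_copies(3, replicated_overhead(3)))
    print("ES warm on EC:", total_copies(2, ec_overhead(4, 2)))
```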

Single Copy Storage

Diagram: KAFKA #1, KAFKA #2, and KAFKA #3 connect over iSCSI to Zone Group #1, Zone Group #2, and Zone Group #3. Each zone group contains DISKs grouped into VGs, from which VOL#1, VOL#2, VOL#3, … are allocated.

Ceph Storage

Ceph Storage Service

Object storage: SWIFT and S3 with OpenStack authentication; used by the Docker Registry and in-house storage; Object -> NFS
Block storage: rbd.ko and nbd.ko for Docker/K8S; librbd with QEMU for PM/VM; iSCSI Export
File storage: CephFS through fuse or kernel mount; CEPHFS -> NFS

Migration to BlueStore (+ NVMe)

FileStore
• 6TB * 8 = 48 TB (66% used)
• 2 DISK in Raid1 for the OS, 2 SSD for the Journal, 8 DISK for DATA

BlueStore
• 6TB * 12 = 72 TB (100% used)
• 12 DISK for DATA
• NVMe for OS, Docker, DB/WAL, BCache
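The capacity figures on this slide can be checked with a few lines of arithmetic. The sketch assumes a 12-slot chassis, which is inferred from the twelve drives drawn on the slide; the function names are hypothetical.

```python
def raw_capacity_tb(disk_tb, data_disks):
    """Raw capacity contributed by the drives that hold OSD data."""
    return disk_tb * data_disks


def slot_utilization_pct(data_disks, total_slots=12):
    """Share of drive slots that hold data, as a whole percentage (rounded down)."""
    return data_disks * 100 // total_slots


if __name__ == "__main__":
    # FileStore: 4 of 12 slots go to the OS mirror and journal SSDs.
    print(raw_capacity_tb(6, 8), slot_utilization_pct(8))
    # BlueStore: OS, DB/WAL and bcache move to NVMe, freeing every slot for data.
    print(raw_capacity_tb(6, 12), slot_utilization_pct(12))
```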

CephFS offering

• Shared File System (like NFS)
• POSIX-compliant file system
• Data Pool (used concurrently with RBD)
• Metadata Pool
• Multiple MDS Server
• Hot Standby / Standby MDS
• Scheduling
• Direct access to file data blocks
• Fuse / Kernel Mount
• Quota Support

https://docs.ceph.com/docs/master/_images/cephfs-architecture.svg

MDS High Availability : Standby MDS

Floating Standby MDS
• MDS #1 (RANK: 0), MDS #2 (RANK: 1)
• One shared MDS (STANDBY)

Hot Standby MDS
• MDS #1 (RANK: 0) paired with MDS #1 (H/S)
• MDS #2 (RANK: 1) paired with MDS #2 (H/S)

Multiple MDS

Single MDS: MDS #1 (RANK: 0) becomes a bottleneck
Multiple MDS: MDS #1 (RANK: 0), MDS #2 (RANK: 1), MDS #3 (RANK: 2)

ceph fs set <fs_name> max_mds 3

Subtree Pinning (static)

setfattr -n ceph.dir.pin -v <rank> </path>

Diagram: directory tree under Root (labels: Type, Shard, Data, …) with subtrees assigned to MDS #1 (RANK: 0), MDS #2 (RANK: 1), and MDS #3 (RANK: 2)

cephfs_volume_prefix = /ceph_ssd
setfattr -n ceph.dir.layout.pool -v SSD_POOL /ceph_ssd

Directories: /ceph_hdd, /ceph_ssd
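The MDS commands shown on the Multiple MDS and Subtree Pinning slides can be assembled programmatically, for example to pin many directories in one pass. The helper below only builds argument lists and does not run anything. The helper names, the file system name `cephfs`, and the rank assignments are hypothetical; the attribute names, `SSD_POOL`, and the two directories come from the slides.

```python
import shlex


def max_mds_cmd(fs_name, count):
    """Command that raises the number of active MDS ranks."""
    return ["ceph", "fs", "set", fs_name, "max_mds", str(count)]


def pin_dir_cmd(path, rank):
    """Command that statically pins a directory subtree to one MDS rank."""
    return ["setfattr", "-n", "ceph.dir.pin", "-v", str(rank), path]


def layout_pool_cmd(path, pool):
    """Command that directs new files under a directory to a data pool."""
    return ["setfattr", "-n", "ceph.dir.layout.pool", "-v", pool, path]


if __name__ == "__main__":
    commands = [
        max_mds_cmd("cephfs", 3),          # assumed file system name
        pin_dir_cmd("/ceph_hdd", 1),       # assumed rank assignment
        pin_dir_cmd("/ceph_ssd", 2),       # assumed rank assignment
        layout_pool_cmd("/ceph_ssd", "SSD_POOL"),
    ]
    for cmd in commands:
        print(shlex.join(cmd))
```

Passing each list to `subprocess.run` on a host with the file system mounted would apply the settings; printing first makes the plan easy to review.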

CephFS : fuse / kernel

ceph-fuse: Application (user space) -> VFS (kernel space) -> FUSE (kernel space) -> CEPH-FUSE (user space)
kernel mount: Application (user space) -> VFS (kernel space) -> CephFS Kernel (kernel space)

Labels: Fast, Support Quotas

Block Cache : bcache

• bcache (writeback)
• kernel 3.10
• Flashcache (facebook), EnhanceIO
• NVMe
• Random RW

https://pommi.nethuis.nl/ssd-caching-using-linux-and-bcache/

bcache

OSD data directory
• /dev/bcache0p1 is mounted at /var/lib/ceph/osd/dd961c1a-0a05-4581-af03-77c28fb8fcbc
• That directory is bound to /var/lib/ceph/osd/ceph-2/
• ceph-osd … --id 2 reads /var/lib/ceph/osd/ceph-2/

block and block.db
• Links through /dev/disk/by-partuuid/4c06c8f6-2967-4165-9e85-fa0382d6ab17 and /dev/disk/by-partuuid/e530dbba-706a-4f2f-91e3-b90b0df3a650
• Link targets include /dev/bcache0p2

/dev/bcache0
• /sys/fs/bcache/99c3ab27-e819-4f18-8947-924c53045bbc
• bdev0: HDD (sda sdb sdc sde sdf sdx)
• cache0: NVMe

NVMe partitions
• /dev/nvme0n1p6 … /dev/nvme0n1p17: For DB / WAL
• /dev/nvme0n1p18 … /dev/nvme0n1p29: For bcache

QoS (Quality of Service)

Frontend QoS (current)
• Container #1, Container #2, Container #3 -> docker -> cgroups (cpu, memory, io, network)
• Throttle files under /sys/fs/cgroup/blkio/docker/{container id}/
    blkio.throttle.read_bps_device
    blkio.throttle.read_iops_device
    blkio.throttle.write_bps_device
    blkio.throttle.write_iops_device
• Entry format: {Major}:{minor} {value}
• Example: 252:48 1048576

Backend QoS (WIP)
• Cephfs (file): based on the token-bucket algorithm
• Ceph rbd (block): based on dmclock
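The front-end throttle works by writing a `{major}:{minor} {value}` entry into one of the four blkio files of the container's cgroup. A minimal sketch of composing the file path and the entry is shown below, using the cgroup v1 layout from the slide. The function names are hypothetical, and nothing is written to the system; `device_numbers` is included only to show where the major and minor numbers could come from.

```python
import os
from pathlib import PurePosixPath

CGROUP_ROOT = PurePosixPath("/sys/fs/cgroup/blkio/docker")


def throttle_path(container_id, direction, unit):
    """Path of the throttle file for read/write and bps/iops limits."""
    if direction not in ("read", "write") or unit not in ("bps", "iops"):
        raise ValueError("direction must be read/write and unit bps/iops")
    return CGROUP_ROOT / container_id / f"blkio.throttle.{direction}_{unit}_device"


def throttle_entry(major, minor, limit):
    """Entry in the '{major}:{minor} {value}' format expected by the kernel."""
    return f"{major}:{minor} {limit}"


def device_numbers(dev_path):
    """Look up the major and minor numbers of a block device node."""
    st = os.stat(dev_path)
    return os.major(st.st_rdev), os.minor(st.st_rdev)


if __name__ == "__main__":
    # The slide's example entry; on a bps file, 1048576 means 1 MiB per second.
    print(throttle_path("{container id}", "read", "bps"))
    print(throttle_entry(252, 48, 1048576))
```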

Single Private Docker Registry

Diagram: images are pushed (Push Images) to a single Docker Registry in KOREA. Worker #1, Worker #2, and the Workers in USA and EUROPE all pull from it (Pull Images: ?? minutes).

Replicate Repository : Harbor

Push Images -> Harbor Registry (M) on Ceph -> Replicate Images -> Harbor Registry (L) on Ceph
Worker #1, Worker #2: Pull Images
Worker #3, Worker #4: Pull Images
Diagram label: Emergency

Replication Bottleneck

• Master: push -> Harbor Master (Ceph RGW on Ceph)
• Master -> Local: replicate #1, replicate #2, … replicate #N (load builds up)
• Local: Harbor Local 1 and Harbor Local 2 (each with Ceph RGW on Ceph)
• Worker: pull from Local

Ceph RGW Multi-Site : Sync

Push Images -> Master Harbor (Ceph Radosgw)
Sync -> Europe Harbor (Ceph Radosgw) -> Workers In Europe
Sync -> USA Harbor (Ceph Radosgw) -> Workers In USA
Diagram label: Emergency

Extend PG NUM

PG Count =

• Too few PGs per OSD can lead to drawbacks regarding performance, recovery time and uneven data distribution

Doubling the PGs (e.g. from 8192 to 16384)

$ ceph osd pool set <pool> pg_num <int>
$ ceph osd pool set <pool> pgp_num <int>

• pg_num : splitting
• pgp_num : rebalancing / backfilling
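The PG count formula on this slide did not survive extraction. As a stand-in, the sketch below uses the commonly cited rule of thumb of roughly 100 PGs per OSD divided by the pool's replica count and rounded up to a power of two. That rule, the function names, and the OSD counts are assumptions, not values from the talk.

```python
def next_power_of_two(n):
    """Smallest power of two that is greater than or equal to n."""
    return 1 if n <= 1 else 1 << (n - 1).bit_length()


def suggested_pg_count(osds, pool_size, target_per_osd=100):
    """Rule-of-thumb PG count: (OSDs * target per OSD) / replicas, rounded up."""
    raw = -(-osds * target_per_osd // pool_size)  # integer ceiling division
    return next_power_of_two(raw)


if __name__ == "__main__":
    # Hypothetical cluster sizes that happen to land on the slide's example values.
    print(suggested_pg_count(200, 3))
    print(suggested_pg_count(400, 3))
```

Under this rule, doubling the OSD count of a 3-replica pool from 200 to 400 moves the suggestion from 8192 to 16384, the same doubling the slide uses as its example.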

Extend PG NUM

• Adding 1 PG causes 0.1% of objects to be misplaced

• Up to 64 PGs, misplacement grows at the same ratio

• Adding 128 PGs gives 9.8%, which is 2 ~ 3% less misplacement than the same ratio would predict

• Adding 16 or more PGs also increases the duration
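The measurements above can be compared with a simple linear estimate of 0.1% misplaced per added PG. The sketch only restates the slide's numbers; the function name is hypothetical, and 9.8% is the measured value quoted on the slide.

```python
PER_PG_MISPLACED_PCT = 0.1  # measured on the slide for a single added PG


def linear_misplaced_pct(added_pgs):
    """Misplaced percentage if every added PG cost the same 0.1%."""
    return round(added_pgs * PER_PG_MISPLACED_PCT, 1)


if __name__ == "__main__":
    measured_128 = 9.8  # from the slide
    estimate_128 = linear_misplaced_pct(128)
    print("linear estimate:", estimate_128)
    print("saved vs. linear:", round(estimate_128 - measured_128, 1))
```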

Extend PG NUM

Changing PGs 8 times, 1 at a time, caused 7 data movements

Changing 8 PGs at once caused 7 data movements

Conclusion: unless PGs are changed in units of 128 or more, roughly the same amount of data moves whether the change is made 1 PG at a time or several at a time. The time taken to change 1 PG and the time taken to change 64 PGs are also nearly the same.

Thank you.

Q&A