Stateful Workloads on Kubernetes with Ceph
유장선, NAVER
10/28/2019 NVRAMOS 2019
Agenda
▶ CaaS
▶ Kubernetes
▶ Ceph Storage
▶ Operation
Cloud Service Model
[Figure: the service stack (Applications, Data, Runtime, Middleware, OS, Virtualization, Server, Storage, Network) compared across On-Premises, IaaS, CaaS, PaaS, and SaaS; each model hands one more layer of the stack over to the provider.]

On-Premises: everything your way (non-standard), at higher cost and longer lead time. Moving toward SaaS: standardization, cost savings, on demand.
Transformation of deployment
[Figure: three stacks. Traditional: App + Bin/Library directly on the OS and hardware. Virtualized: apps inside VMs (each with its own OS and Bin/Library) on a hypervisor. Containerized: apps in containers (App + Bin/Library) on a container runtime sharing one OS.]

Traditional: apps interfere with one another and hit library compatibility issues; separating them onto dedicated nodes raises cost.
Virtualized: VM isolation improves security and raises resource efficiency and scalability, but each guest OS adds resource overhead and boot time.
Containerized: lighter than VMs (the OS kernel is shared) and faster to deploy; isolated via namespaces into small, independent units; high efficiency and high density.
MSA (micro-service architecture)
Monolithic
[Figure: one application server containing Services A-D on a single DB.]
• The codebase grows large and complex; the QA scope grows with every change
• Changes to linked services ripple outward; outages carry more risk; the market changes fast; development paradigms are shifting

Microservice
[Figure: Services A-D, each with its own DB.]
• Split services small; simplify deployment; choose technology per service (libraries, languages, frameworks); greater scalability
Container Orchestration
• Provisioning / deployment of containers
• Fault tolerance (replicas)
• Load balancing
• Service discovery
• Auto scaling (scale in/out)
• Resource limit control
• Scheduling
• Health checking
• Cluster management
• Configuration management
• Monitoring
[Figure: services A-D replicated and scheduled across multiple worker nodes.]
Docker Swarm, Mesos Marathon, Cloud Foundry, CoreOS Fleet, Kubernetes, Google Container Engine, Amazon ECS, Azure Container Service
Kubernetes (K8S)
• The de facto standard container orchestrator
• Open-source descendant of Borg, Google's internal container system (15 years of operational experience)
• Donated to the CNCF (Cloud Native Computing Foundation)
• Runs on various clouds and on bare metal
• Written in Go
• Self-healing
• Horizontal scaling
• Service discovery / load balancing
• Automatic rollouts / rollbacks
• Secret / configuration management
• Storage orchestration
• Batch execution (crontab)
• …
Kubernetes
YAML (submitted with kubectl to the API on the K8s master):

kind: Deployment
selector: app: nginx
replicas: 3
template:
  image: nginx
  label: app: nginx

kind: Service
selector: app: nginx
type: LoadBalancer

[Figure: the master schedules the three nginx pods across the worker nodes; the LoadBalancer Service exposes them through an external IP / DNS / LB, while an internal IP / LB spreads client traffic across the backend pods.]
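The same objects can also be created imperatively; a minimal sketch with stock kubectl commands (the nginx names follow the slide, the port is illustrative):

kubectl create deployment nginx --image=nginx   # pods get the label app=nginx
kubectl scale deployment nginx --replicas=3     # spread across the worker nodes
kubectl expose deployment nginx --type=LoadBalancer --port=80
kubectl get service nginx                       # shows the assigned external IP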
CI/CD Pipeline
[Figure: CI/CD flow]
1. Developer commits & pushes code to the Git repository
2. CI server builds & runs tests, builds the Docker image, and pushes it to the Docker registry
3. CI server updates the Kubernetes Deployment
4. Kubernetes creates a new pod and health-checks it
   • Healthy pod: delete the old pod
   • Unhealthy pod: keep the old pod running, restart the new pod
Canary / blue-green deployments follow the same pattern.
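The "update Kubernetes Deployment" step usually amounts to pointing the Deployment at the newly pushed image and letting the rolling update do the rest; a hedged sketch (deployment, container, and registry names are hypothetical):

kubectl set image deployment/myapp myapp=registry.example.com/myapp:v2
kubectl rollout status deployment/myapp   # waits until new pods pass health checks
kubectl rollout undo deployment/myapp     # roll back if the new pods stay unhealthy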
Autoscaling
HPA (Horizontal Pod Autoscaler)
[Figure: the HPA checks metrics; when a threshold is met, it changes the Deployment's replica count, scaling the number of pods in/out.]

VPA (Vertical Pod Autoscaler)
[Figure: the VPA checks metrics; when a threshold is met, it changes the pods' CPU / memory values, scaling each pod up/down.]
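A minimal HPA sketch with kubectl (the deployment name, bounds, and threshold are illustrative):

kubectl autoscale deployment nginx --min=3 --max=10 --cpu-percent=80
kubectl get hpa   # shows current vs. target metrics and the replica count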
Stateful workloads
Storage in K8S
• Ephemeral (local): data lives inside the pod (container); deleted together with the pod; unusable on host failure
• Host local disk (local): data on the host's local disk; survives pod deletion; unusable on host failure
• Shared storage (remote): external network storage shared by multiple pods; survives pod deletion; no service impact on host failure
• Block storage (remote): external network storage allocated per pod; survives pod deletion; no service impact on host failure
Volume Plugin in Kubernetes
• Integration with OpenStack (Cinder)
• Multi-tenancy support
• Authentication / authorization integration (in-house auth)
• Built as a Flexvolume driver
• Operational know-how carried over from our Docker volume plugin
• Online resize support
• Read-only multi-attach support
• Snapshot support
• CephFS FUSE / kernel mount support
• RBD multi-attach prevention (lock)
• Blacklist entries added on node drain
• I/O monitoring
• Front-end QoS (using cgroups)
• Quota support
• …
The plugin was developed in-house; the driver call contract is sketched below.
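For context, a Flexvolume driver is an executable the kubelet invokes with fixed subcommands, answering in JSON; a minimal sketch of that contract (the naver~cinder vendor/driver name and paths are hypothetical):

# kubelet invokes the binary installed under its volume-plugin directory:
# /usr/libexec/kubernetes/kubelet-plugins/volume/exec/naver~cinder/cinder
./cinder init                               # -> {"status":"Success","capabilities":{"attach":true}}
./cinder mount <mount-dir> <json-options>   # -> {"status":"Success"}
./cinder unmount <mount-dir>                # -> {"status":"Success"}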
Statefulset

apiVersion: apps/v1
kind: StatefulSet
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        volumeMounts:
        - name: www
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: www
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 1Gi

Each replica gets its own PVC (named <claim>-<pod>) bound to its own PV:

PODs     PVC         PV
WEB-0    WWW-WEB-0   PV-uuid
WEB-1    WWW-WEB-1   PV-uuid
WEB-2    WWW-WEB-2   PV-uuid
…        …
Volume Plugin in NAVER
[Figure: volume provisioning flow. A Statefulset is defined in YAML; the Ceph Provisioner checks for an existing PV, then sends a create request to Cinder, authenticating through Keystone; Cinder creates the volume on Ceph. On the worker node, the Ceph Driver attaches and mounts the volume into the POD via the RBD kernel client (rbd map / mount, e.g. /dev/rbd0).]
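Under the hood, the attach/mount step boils down to mapping the RBD image through the kernel client and mounting it for the pod; a hedged sketch (pool, image, and mount path are hypothetical):

rbd map rbd_pool/pvc-0123        # kernel map, yields /dev/rbd0
mkfs.ext4 /dev/rbd0              # only on the first attach
mount /dev/rbd0 /mnt/pvc-0123    # then bind-mounted into the pod by the kubelet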
Distributed platform on distributed storage
[Figure: three Kafka brokers, each holding a replica of the data, while each broker's volume is itself stored as chunks replicated across Ceph.]

Kafka on Ceph RBD (3 copies): Kafka's 3 replicas × Ceph's 3 copies = 9 copies on disk.
Elasticsearch (warm) on Ceph EC (1.5 copies): ES's 2 copies × EC's 1.5× overhead = 3 copies on disk.
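A hedged sketch of the two pool layouts (pool names and PG counts are illustrative): a 3-copy replicated pool for Kafka RBD volumes, and a k=2, m=1 erasure-code profile, which stores 1.5 bytes of raw data per byte:

ceph osd pool create kafka-rbd 1024 1024 replicated
ceph osd pool set kafka-rbd size 3                      # 3 copies
ceph osd erasure-code-profile set ec-2-1 k=2 m=1
ceph osd pool create es-warm 1024 1024 erasure ec-2-1   # 1.5x overhead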
Single Copy Storage
[Figure: since Kafka already keeps 3 replicas, each broker (KAFKA #1-#3) gets single-copy volumes from its own zone group (Zone Group #1-#3, separate failure domains). Within a zone group, disks are grouped into LVM VGs, carved into volumes (VOL#1-#3), and exported to the brokers over iSCSI.]
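A hedged sketch of carving single-copy volumes out of one zone group's disks with LVM and exporting them over iSCSI (all names and sizes are illustrative):

vgcreate vg_zone1 /dev/sdb /dev/sdc /dev/sdd            # group disks into a VG
lvcreate -L 2T -n vol1 vg_zone1                         # carve VOL#1
targetcli /backstores/block create name=vol1 dev=/dev/vg_zone1/vol1
targetcli /iscsi create iqn.2019-10.com.example:zone1   # export over iSCSI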
Ceph Storage
Ceph Storage Service
• Object: SWIFT / S3 APIs with OpenStack authentication; backs the Docker Registry and in-house storage; object → NFS gateway
• Block: rbd.ko / nbd.ko / librbd for Docker/K8S, QEMU for VM disks, iSCSI export
• File: CephFS via FUSE / kernel mounts on PMs/VMs; CephFS → NFS
FileStore
[Figure: 12-bay server; 2 disks in RAID1 for the OS, 2 SSDs for the journal, 8 HDDs for data.]
6 TB × 8 = 48 TB (66% of bays used for data)

Conversion to BlueStore (+ NVMe)
[Figure: all 12 HDDs hold data; a single NVMe device is partitioned for the OS, Docker, the RocksDB DB/WAL, and bcache.]
6 TB × 12 = 72 TB (100% of bays used for data)
Providing CephFS

• Shared file system (like NFS)
• POSIX-compliant file system
• Data pool (used alongside RBD)
• Metadata pool
• Multiple MDS servers
• Hot standby / standby MDS
• Scheduling
• Clients access file data blocks directly
• FUSE / kernel mount
• Quota support
https://docs.ceph.com/docs/master/_images/cephfs-architecture.svg
MDS High Availability : Standby MDS
Floating standby MDS
[Figure: MDS #1 (rank 0) and MDS #2 (rank 1) active, plus one shared MDS (standby) that can take over either rank.]

Hot standby MDS
[Figure: MDS #1 (rank 0) and MDS #2 (rank 1) active, each with its own dedicated hot-standby MDS.]
Multiple MDS
Single MDS
[Figure: one MDS (rank 0) handles all metadata, becoming a bottleneck.]

Multiple MDS
[Figure: MDS #1-#3 (ranks 0-2) share the metadata load.]

ceph fs set <fs_name> max_mds 3
Subtree Pinning (static)

Pin a directory subtree to a specific MDS rank:

setfattr -n ceph.dir.pin -v <rank> </path>

[Figure: the namespace root sharded across MDS #1-#3 (ranks 0-2), with the type / shard / data subtrees pinned per rank.]

Per-directory pool layout (e.g. /ceph_hdd on HDD, /ceph_ssd on SSD):

cephfs_volume_prefix = /ceph_ssd
setfattr -n ceph.dir.layout.pool -v SSD_POOL /ceph_ssd
CephFS : fuse / kernel

ceph-fuse
[Figure: Application → VFS → FUSE (kernel space) → CEPH-FUSE (user space).]
Supports quotas.

Kernel mount
[Figure: Application → VFS → CephFS kernel client, entirely in kernel space.]
Fast.
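Both mount paths in shell, hedged (monitor address, secret file, and mount point are illustrative):

ceph-fuse /mnt/cephfs                                    # user-space client; quota support
mount -t ceph 10.0.0.1:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret    # kernel client; fast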
Block Cache : bcache

• bcache (writeback mode), in the mainline kernel since 3.10
• Alternatives: Flashcache (Facebook), EnhanceIO
• NVMe as the cache device
• Targets random read/write workloads

https://pommi.nethuis.nl/ssd-caching-using-linux-and-bcache/
[Figure: OSD-on-bcache layout]
• NVMe partitions /dev/nvme0n1p6 … /dev/nvme0n1p17 hold the RocksDB DB/WAL; /dev/nvme0n1p18 … /dev/nvme0n1p29 are bcache cache devices
• Each HDD (sda, sdb, sdc, … sdx) is registered as a bcache backing device (bdev0 under /sys/fs/bcache/99c3ab27-e819-4f18-8947-924c53045bbc) and paired with an NVMe cache partition (cache0), yielding /dev/bcache0
• /dev/bcache0p1 is mounted at /var/lib/ceph/osd/dd961c1a-0a05-4581-af03-77c28fb8fcbc and bind-mounted to /var/lib/ceph/osd/ceph-2
• block links via /dev/disk/by-partuuid/4c06c8f6-2967-4165-9e85-fa0382d6ab17 to /dev/bcache0p2; block.db links via /dev/disk/by-partuuid/e530dbba-706a-4f2f-91e3-b90b0df3a650 to the NVMe DB/WAL partition
• ceph-osd … --id 2 reads block / block.db through these links
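A minimal sketch of pairing one HDD with one NVMe cache partition (device names follow the figure but are illustrative):

make-bcache -B /dev/sda          # register the HDD as a backing device
make-bcache -C /dev/nvme0n1p18   # register the NVMe partition as a cache set
echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach   # attach the cache
echo writeback > /sys/block/bcache0/bcache/cache_mode      # writeback mode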
QoS (Quality of Service)

Frontend QoS (current)
[Figure: containers #1-#3 on docker, throttled via cgroups: cpu, memory, io, network.]

/sys/fs/cgroup/blkio/docker/{container id}/
  blkio.throttle.read_bps_device
  blkio.throttle.read_iops_device
  blkio.throttle.write_bps_device
  blkio.throttle.write_iops_device

Each file takes entries of the form {major}:{minor} {value}, e.g. 252:48 1048576.

Backend QoS (WIP)
• CephFS (file): based on a token-bucket algorithm
• Ceph RBD (block): based on dmClock
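For example, capping one container's writes on device 252:48 to 1 MiB/s (the container id is hypothetical):

echo "252:48 1048576" > \
  /sys/fs/cgroup/blkio/docker/<container-id>/blkio.throttle.write_bps_device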
Single Private Docker Registry
[Figure: one private Docker registry in KOREA receives the image pushes; workers in the USA and EUROPE pull images across continents, taking ?? minutes.]
Replicate Repository : Harbor
[Figure: images are pushed to the master Harbor registry (M), backed by Ceph; Harbor replicates them to local registries (L), each backed by its own Ceph; workers #1-#4 pull images from their local registry, keeping the master as an emergency path.]
Replication Bottleneck
[Figure: a push to the Harbor master (on Ceph RGW) triggers replicate #1, #2 … #N to Harbor Local 1, Local 2, …; the fan-out puts heavy load on the master, while workers pull from their local registries (each on its own Ceph RGW).]
Ceph RGW Multi-Site : Sync
[Figure: images are pushed to the master Harbor (Ceph + radosgw); RGW multi-site sync replicates the underlying objects to the Europe and USA zones; workers in Europe and in the USA pull from their local Harbor, with the master kept as an emergency path.]
Extend PG NUM

PG count ≈ (number of OSDs × 100) / replica count, rounded to a power of two

• Too few PGs per OSD hurts performance, recovery time, and the evenness of data distribution

Doubling the PGs (e.g. from 8192 to 16384):

$ ceph osd pool set <pool> pg_num <int>
$ ceph osd pool set <pool> pgp_num <int>

• pg_num : splitting
• pgp_num : rebalancing / backfilling
Extend PG NUM

• Increasing pg_num by 1 misplaces about 0.1% of objects
• Up to an increase of 64, the misplaced ratio grows at the same (linear) rate
• An increase of 128 misplaces 9.8%, about 2-3% less than the linear trend would predict
• Increases of more than 16 at a time also lengthen the duration
Extend PG NUM

• Changing pg_num by 1, eight times in a row, causes 7 data movements
• Changing pg_num by 8 in a single step also causes 7 movements
• Conclusion: unless you change PGs in steps of 128 or more, stepping by 1 and stepping by several move a similar amount of data; the time taken for a 1-PG change and a 64-PG change is also nearly identical (see the sketch below)
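Following that conclusion, growing pg_num could be scripted in steps of at most 64, waiting for misplaced objects to settle between steps; a hedged sketch (pool name, range, and the grep on ceph status output are illustrative assumptions):

for pg in $(seq 8256 64 16384); do
  ceph osd pool set mypool pg_num  $pg
  ceph osd pool set mypool pgp_num $pg
  # wait until the cluster reports no misplaced objects
  while ceph status | grep -q misplaced; do sleep 60; done
done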
Thank you.
Q&A