4. v sphere big data extensions hadoop



VMWare Big Data Forum

Transcript of 4. v sphere big data extensions hadoop

Page 1: 4. v sphere big data extensions   hadoop

© 2009 VMware Inc. All rights reserved

vSphere Big Data Extensions: Hadoop Reference Architecture and Performance Best Practices

Li Xinhui (李欣慧), Senior Engineer, Big Data R&D, VMware China R&D Center

Page 2: 4. v sphere big data extensions   hadoop

Agenda

Recommended Deployment Topology

Plan Your Cluster

Page 3: 4. v sphere big data extensions   hadoop

Standard Deployment Configuration on Single Worker

[Diagram labels: Virtualization Host; Hadoop Virtual Node 2 with Datanode and Task-tracker; Ext4 file systems on VMDKs; mapred.local.dir; OS Image VMDK; shared storage (SAN/NAS); local disks.]

Page 4: 4. v sphere big data extensions   hadoop

Standard Deployment Configuration on Single Worker

[Diagram labels: Virtualization Host; Hadoop Virtual Node 2 with Datanode and Task-tracker; Ext4 file systems on VMDKs; mapred.local.dir; OS Image VMDK; local disks.]

Page 5: 4. v sphere big data extensions   hadoop

Standard Deployment Configuration

[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 and Hadoop Virtual Node 2, each with Datanode and Task-tracker; Ext4 file systems on VMDKs; mapred.local.dir; OS Image VMDKs; shared storage (SAN/NAS); local disks.]

Page 6: 4. v sphere big data extensions   hadoop

Standard Deployment Configuration

[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 and Hadoop Virtual Node 2, each with Datanode and Task-tracker; Ext4 file systems on VMDKs; mapred.local.dir; OS Image VMDKs; local disks.]

Page 7: 4. v sphere big data extensions   hadoop

Standard Deployment Configuration for D/C Separation

[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 with Task-tracker; Hadoop Virtual Node 2 with Datanode; Ext4 file systems on VMDKs; OS Image VMDKs; shared storage (SAN/NAS); local disks.]

Page 8: 4. v sphere big data extensions   hadoop

Data Path for Combined vs. Data/Compute Separation

[Diagram labels: two virtualization hosts, each with a virtual switch; Hadoop Virtual Node 1 and Hadoop Virtual Node 2; TaskTracker.]

Serengeti provides local-storage-based temp space for D/C separation.

• Each compute VM needs its own temp space

• The required temp space differs from one application to another

• This can result in wasted space

Page 9: 4. v sphere big data extensions   hadoop

Recommended Topology of Data/Compute Separation

[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 with Task-tracker and an Ext4 file system on a VMDK; Hadoop Virtual Node 2 with Datanode and Ext4 file systems on VMDKs; OS Image VMDKs; shared storage (SAN/NAS); local disks.]

Page 10: 4. v sphere big data extensions   hadoop

Data Path for Local TaskTracker (TT) Storage vs. NFS Temp

[Diagram labels: two virtualization hosts, each with a virtual switch; Hadoop Virtual Node 1 and Hadoop Virtual Node 2; TaskTracker.]

Serengeti provides NFS-based temp space for D/C separation.

• Improves local storage space utilization.

• Trade-off between bandwidth efficiency and the overhead of NFS.
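The space argument behind local versus NFS-based temp can be illustrated with a little arithmetic. The sketch below is only an illustration: the per-VM reservation, the usage figures, and the 20% headroom for the shared pool are hypothetical values, not numbers from the talk.

```python
def wasted_local_temp_gb(reserved_per_vm_gb, used_per_vm_gb):
    """Temp space reserved per compute VM but left unused (local temp case)."""
    return sum(max(reserved_per_vm_gb - used, 0) for used in used_per_vm_gb)


def pooled_temp_gb(used_per_vm_gb, headroom=0.2):
    """Size of one shared (NFS) temp pool; the headroom factor is an assumption."""
    return sum(used_per_vm_gb) * (1 + headroom)


# Hypothetical peak temp usage (GB) of four compute VMs running different jobs.
used = [20, 35, 90, 10]

print(100 * len(used))                   # 400 GB reserved with 100 GB per VM
print(wasted_local_temp_gb(100, used))   # 245 GB of that is never used
print(round(pooled_temp_gb(used)))       # 186 GB covers the same jobs from a pool
```

The pool is smaller because idle space is shared, but every temp read and write then crosses the network, which is the NFS overhead the slide refers to.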

Page 11: 4. v sphere big data extensions   hadoop

Consolidated Storage on Single DN VM

[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 with Task-tracker and NFS Client; Hadoop Virtual Node 2 with Datanode and NFS Server; Ext4 file systems and directories on VMDKs; OS Image VMDKs; shared storage (SAN/NAS); local disks.]

Page 12: 4. v sphere big data extensions   hadoop

Recommended Topology of Computing Only Cluster

[Diagram labels: Virtualization Host; Hadoop Virtual Node 1 with Task-tracker; Hadoop Virtual Node 2 with Datanode; Ext4 file systems on VMDKs; OS Image VMDKs; shared storage (SAN/NAS).]

Page 13: 4. v sphere big data extensions   hadoop

Plan Your Cluster

Start with a small cluster and grow it as required

• Initially just four or six nodes

• Increase amount of computation/data/memory as required

• Available HDFS space = (DFS Remaining × 95%) / dfs.replication
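The capacity formula above can be checked with a few lines of Python. This is a minimal sketch: the function name and the 60 TB example are illustrative, and "DFS Remaining" is assumed to be the value reported by the HDFS admin report.

```python
TB = 1024 ** 4  # bytes per tebibyte


def available_hdfs_space(dfs_remaining_bytes, replication, usable_fraction=0.95):
    """Apply the slide's formula: (DFS Remaining * 95%) / dfs.replication."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return dfs_remaining_bytes * usable_fraction / replication


# Hypothetical cluster: 60 TB remaining, dfs.replication = 3.
space = available_hdfs_space(60 * TB, replication=3)
print(f"{space / TB:.1f} TB")  # 19.0 TB
```

With three replicas, roughly a third of the raw remaining space is available for new data, before the 5% margin is taken off.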

Choose the right hardware: master node

• The NameNode and JobTracker often run on the same machine in smaller clusters

• Consider HA/FT settings

• Keep the NameNode and JobTracker on hosts separate from the slave nodes

• Dual power supplies

Page 14: 4. v sphere big data extensions   hadoop

Plan Your Cluster

Choose the right hardware: slave node

• At least 2 quad-core CPUs, with HT enabled

• RAM

  • Allow for 6% virtualization overhead

  • 4-8 GB of memory per core is recommended

• Storage

  • At least 8 disks per host; 12 disks per host may be ideal for absolute performance but probably not for price-performance

  • 1-1.5 disks per core is recommended

  • JBOD with 7,200 RPM SATA disks is fine

  • A good practical maximum is 24 TB or 36 TB per slave node. More than that results in massive network traffic when a node dies and its blocks must be re-replicated.
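The slave-node guidelines above can be turned into a quick sanity check. This is a sketch under stated assumptions: the `SlaveHost` class and `check_slave_host` function are hypothetical names, and the 4-8 GB per core rule is applied to the memory left after the 6% overhead, which the slide does not specify.

```python
from dataclasses import dataclass


@dataclass
class SlaveHost:
    cores: int       # physical cores (2 quad-core CPUs = 8)
    ram_gb: float    # installed memory
    disks: int       # data disks (JBOD)
    disk_tb: float   # capacity of each disk


def check_slave_host(host, virtualization_overhead=0.06):
    """Return a list of guideline violations; an empty list means all pass."""
    usable_ram = host.ram_gb * (1 - virtualization_overhead)
    ram_per_core = usable_ram / host.cores
    disks_per_core = host.disks / host.cores
    raw_tb = host.disks * host.disk_tb

    warnings = []
    if host.cores < 8:
        warnings.append("fewer than 2 quad-core CPUs")
    if not 4 <= ram_per_core <= 8:
        warnings.append(f"{ram_per_core:.1f} GB RAM per core (want 4-8)")
    if host.disks < 8:
        warnings.append("fewer than 8 disks")
    if not 1.0 <= disks_per_core <= 1.5:
        warnings.append(f"{disks_per_core:.2f} disks per core (want 1-1.5)")
    if raw_tb > 36:
        warnings.append(f"{raw_tb:.0f} TB per node exceeds the 36 TB maximum")
    return warnings


print(check_slave_host(SlaveHost(cores=8, ram_gb=64, disks=8, disk_tb=3.0)))  # []
print(len(check_slave_host(SlaveHost(cores=8, ram_gb=16, disks=4, disk_tb=2.0))))  # 3
```

The second host fails on memory per core, disk count, and disks per core, which is the kind of imbalance the guidelines are meant to catch.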

Page 15: 4. v sphere big data extensions   hadoop

Plan Your Cluster

Networking

• Use dedicated switches for your Hadoop cluster, and connect nodes to a top-of-rack switch

• Connect nodes at a minimum speed of 1 Gb/s; consider 10 Gb/s for clusters with large volumes of intermediate data

• Racks are interconnected via core switches

• Core switches should connect to top-of-rack switches over dual 10 Gb/s links

• Use redundant top-of-rack switches and core switches

• Separate the management network from the VM network

• Adopt a vDS with dvPort groups that span hosts, to keep configuration consistent for VMs and for the virtual ports used by vMotion and network storage

• Leave the management port out of your vDS
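One way to judge whether dual 10 Gb/s uplinks are enough is to compute the oversubscription ratio of a rack, that is, total node bandwidth divided by total uplink bandwidth. The sketch below assumes 20 nodes per rack, a figure not given in the slides.

```python
def uplink_oversubscription(nodes_per_rack, node_link_gbps, uplinks, uplink_gbps):
    """Ratio of total node bandwidth to total uplink bandwidth for one rack."""
    downstream = nodes_per_rack * node_link_gbps
    upstream = uplinks * uplink_gbps
    return downstream / upstream


# Hypothetical rack of 20 nodes behind dual 10 Gb/s uplinks.
print(uplink_oversubscription(20, 1.0, 2, 10.0))   # 1.0  (1 Gb/s nodes)
print(uplink_oversubscription(20, 10.0, 2, 10.0))  # 10.0 (10 Gb/s nodes)
```

A ratio of 1.0 means cross-rack traffic is never limited by the uplinks; moving nodes to 10 Gb/s without adding uplinks pushes the bottleneck to the core links for shuffle-heavy jobs.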

Page 16: 4. v sphere big data extensions   hadoop

Networking Configurations – Four 1G NICs

[Diagram labels: Virtualization Host with four 1 Gb/s uplinks (vmnic 0 to vmnic 3) to pSwitch 1 and pSwitch 2. Virtual Switch 0 carries MGMT (192.168.1.100), VMKERNEL (192.168.2.100), VMOTION (192.168.3.100), and FT (192.168.4.100). Virtual Switch 1 carries the Hadoop cluster VM portgroup.]

Hadoop VM traffic goes through vSwitch1 (vmnic2 and vmnic3, both active)

On vSwitch0, MGMT and VMkernel traffic goes through vmnic0 (active, with vmnic1 on standby)

vMotion and FT traffic goes through vmnic1 (active, with vmnic0 on standby)
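The four-NIC teaming layout on this slide can be written down as data and checked for two properties: Hadoop VM traffic shares no uplink with the management switch, and every port group has a second uplink to fail over to. The dictionary shape and helper names below are a hypothetical representation for illustration, not a vSphere API.

```python
# Teaming layout transcribed from the slide text.
TEAMING = {
    "vSwitch0": {
        "MGMT": {"active": ["vmnic0"], "standby": ["vmnic1"]},
        "VMKERNEL": {"active": ["vmnic0"], "standby": ["vmnic1"]},
        "VMOTION": {"active": ["vmnic1"], "standby": ["vmnic0"]},
        "FT": {"active": ["vmnic1"], "standby": ["vmnic0"]},
    },
    "vSwitch1": {
        "Hadoop cluster VM portgroup": {"active": ["vmnic2", "vmnic3"], "standby": []},
    },
}


def uplinks_of(portgroups):
    """Return every vmnic referenced by the port groups of one vSwitch."""
    nics = set()
    for policy in portgroups.values():
        nics.update(policy["active"])
        nics.update(policy["standby"])
    return nics


def shared_uplinks(teaming):
    """Return vmnics that are attached to more than one vSwitch."""
    seen, shared = set(), set()
    for portgroups in teaming.values():
        nics = uplinks_of(portgroups)
        shared |= seen & nics
        seen |= nics
    return shared


def portgroups_without_failover(teaming):
    """Return port groups that have fewer than two uplinks in total."""
    return [
        name
        for portgroups in teaming.values()
        for name, policy in portgroups.items()
        if len(policy["active"]) + len(policy["standby"]) < 2
    ]


print(shared_uplinks(TEAMING))               # set()
print(portgroups_without_failover(TEAMING))  # []
```

Both checks come back empty for this layout: the Hadoop port group has two active uplinks of its own, and each vSwitch0 port group has one active and one standby uplink.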

Page 17: 4. v sphere big data extensions   hadoop

Networking Configurations – 10G for Hadoop VMs

[Diagram labels: Virtualization Host with vmnic 0, vmnic 1, and vmnic 2; pSwitch 1, pSwitch 2, and pSwitch 3. Virtual Switch 0 carries MGMT (192.168.1.100), VMKERNEL (192.168.2.100), VMOTION (192.168.3.100), and FT (192.168.4.100). Virtual Switch 1 carries the Hadoop cluster VM portgroup. Link labels: two 1 Gb/s links and one 10 GbE link.]

Hadoop VM traffic goes through vSwitch1 (vmnic3)

10G for Hadoop cluster VMs

• More performance benefits

• If needed, keep redundancy with a second vmnic/pSwitch pair

Keep redundancy for the management network

Page 18: 4. v sphere big data extensions   hadoop

vSphere Configurations

Configure hosts with the NTP service to ensure the time on all nodes is synchronized

Virtual Disk Settings

• One datastore per physical disk

• Warm-up is needed on the provisioned cluster

The NUMA scheduler is important for virtualized Hadoop performance

• Poor configuration can result in 12%(1) performance degradation

• Data VMs should preferably be distributed across NUMA nodes

Provision the right VM size

• Reserve 6% of memory for vSphere usage

• Avoid over-commitment

• Enable NUMA and keep the VM size within a NUMA node
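The VM sizing rules above (6% memory reserve, no over-commitment, VM within one NUMA node) can be sketched as a simple fit check. The function names are hypothetical, and the sketch assumes memory and cores are split evenly across NUMA nodes with the 6% reserve taken proportionally from each node.

```python
def numa_node_limits(host_ram_gb, host_cores, numa_nodes, vsphere_reserve=0.06):
    """Return (max vCPUs, max RAM in GB) for a VM that stays in one NUMA node."""
    usable_ram = host_ram_gb * (1 - vsphere_reserve)
    return host_cores // numa_nodes, usable_ram / numa_nodes


def fits_numa_node(vm_vcpus, vm_ram_gb, host_ram_gb, host_cores, numa_nodes):
    """True if the VM's vCPUs and memory both fit inside a single NUMA node."""
    max_vcpus, max_ram_gb = numa_node_limits(host_ram_gb, host_cores, numa_nodes)
    return vm_vcpus <= max_vcpus and vm_ram_gb <= max_ram_gb


# Hypothetical host: 128 GB RAM, 16 cores, 2 NUMA nodes.
print(fits_numa_node(8, 56, 128, 16, 2))   # True
print(fits_numa_node(8, 64, 128, 16, 2))   # False: exceeds the node's usable memory
print(fits_numa_node(12, 32, 128, 16, 2))  # False: more vCPUs than cores in a node
```

On this example host each NUMA node offers 8 cores and about 60 GB after the reserve, so a 64 GB VM would spill onto remote memory even though it is exactly half of the installed RAM.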

Page 19: 4. v sphere big data extensions   hadoop

For Existing Devices

Roughly fit existing resource capacity to Hadoop

• CPU : RAM : Throughput = 4 × 1333 MHz : 32G : 800M/s

Use a powerful machine to run the master node/computing node

Use a high-throughput machine for the slave node/data node

Page 20: 4. v sphere big data extensions   hadoop

Q&A