Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...

26
IBM Research - Tokyo IISWC 2010 @ Atlanta © 2010 IBM Corporation Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors Hiroshi Inoue and Toshio Nakatani IBM Research – Tokyo

Transcript of Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in...

Page 1: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

IBM Research - Tokyo

IISWC 2010 @ Atlanta © 2010 IBM Corporation

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors

Hiroshi Inoue and Toshio NakataniIBM Research – Tokyo

Page 2: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

2

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

An Old Question on New Platforms

Threads vs. Processes: Which is better to achieve higher performance?

– Each process has own virtual memory space

Using processes provides better inter-process isolation

– Threads in one process shares a virtual memory space

Multi-thread processing is better for performance due to its memory efficiency (smaller footprint)

Is this answer still valid on today’s processors with multiple cores and multiple SMT threads in a core?

Page 3: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

3

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Approach

Comparing multi-thread model and multi-process model on two types of hardware parallelism

– SMT scalability

– Core scalability

Page 4: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

4

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

SMT Scalability and Core Scalability

core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

SMT scalability: performance improvement using increasing number of SMT threads in one core

threadthreadthreadthread

Page 5: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

5

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

SMT Scalability and Core Scalability

core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7

threadthreadthread thread

threadthread

threadthreadthread

threadthreadthread

threadthreadthread

threadthreadthread

threadthreadthread

threadthreadthread

SMT scalability: performance improvement using increasing number of SMT threads in one core

Core scalability: performance improvement using increasing number of cores with one thread in each core

thread thread thread thread thread thread threadthread

Page 6: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

6

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Experimental Setup

Systems– Niagara system

• UltraSPARC T1 (Niagara 1) 1.2 GHz • 8 cores with 4 SMT threads in each core• Solaris 10

– Nehalem system• Xeon X5570 (Nehalem) 2.93 GHz• 4 cores with 2 SMT threads in each core• Red Hat Enterprise Linux 5.4

Software– Benchmarks: SPECjbb2005, SPECjvm2008 – 32-bit HotSpot Server VM for Java 6 Update 17– Java heap size: 256 MB per thread using large page

Page 7: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

7

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0

2000

4000

6000

8000

10000

12000

14000

0 1 2 3 4number of SMT threads

thro

ughp

ut (j

bb_o

ps).

multi-thread modelmulti-proces model

0

10000

20000

30000

40000

50000

60000

70000

0 1 2number of SMT threads

thro

ughp

ut (j

bb_o

ps) .

multi-thread modelmulti-proces model

SMT Scalability of SPECjbb2005

multi-thread model was 9.2% faster multi-thread model was 5.5% faster

multi-thread model uses 1 JVM

multi-process model uses 4 JVMs

higher is faster

on Niagara on Nehalem

Page 8: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

8

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0

5000

10000

15000

20000

25000

30000

35000

40000

0 1 2 3 4 5 6 7 8number of cores

thro

ughp

ut (j

bb_o

ps).

multi-thread modelmulti-proces model

0

20000

40000

60000

80000

100000

120000

140000

160000

180000

0 1 2 3 4number of cores

thro

ughp

ut (j

bb_o

ps).

multi-thread modelmulti-proces model

Core Scalability of SPECjbb2005

multi-thread model was 2.1% slower

higher is faster

on Niagara

multi-thread model was 3.4% faster

on Nehalem

Page 9: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

9

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Core Scalability and SMT Scalability on Niagara

multi-thread model was 9.6% faster on average

Core scalabilitySMT scalability

No performance advantage for multi-thread model

(please refer to the paper on results for Nehalem)

0.0

0.5

1.0

1.5

2.0

2.5

3.0

SPECjbb20

05

compil

er.co

mpiler

compil

er.su

nflow

xml.v

alida

tion

xml.tr

ansfo

rmse

rial

crypto

.aes

compre

ssmpe

gaud

ioav

erage

spee

dup

with

four

thre

ads

multi-thread modelmulti-process model

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

SPECjbb20

05

compil

er.co

mpiler

compil

er.su

nflow

xml.v

alida

tion

xml.tr

ansfo

rmse

rial

crypto

.aes

compre

ssmpe

gaud

ioav

erage

spee

dup

with

eig

ht c

ores

multi-thread modelmulti-process model

higher is faster

Page 10: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

10

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0.0

0.2

0.4

0.6

0.8

1.0

1.2

L1I cache miss L2 instructionmiss

L1D cache miss L2 data miss DTLB missrela

tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel

1 thread2 threads 3 threads 4 threads

Micro Architectural Statistics for SPECjbb2005

using increasing number of SMT threads (up to 4 threads)

multi-process model = 1.0

multi-thread m

odelis better

multi-process m

odelis better

Page 11: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

11

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0.0

1.0

2.0

3.0

4.0

5.0

6.0

7.0

8.0

L1I cache miss L2 instructionmiss

L1D cache miss L2 data miss DTLB miss

rela

tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel 1 core

2 cores4 cores6 cores8 cores

Micro Architectural Statistics for SPECjbb2005

significant increase in DTLB misses for multi-thread modelwith increasing number of cores used.(7.4x on 8 cores of Niagara and 3.3x on 4 cores on Nehalem)

multi-process model = 1.0

multi-thread m

odelis better

multi-process m

odelis better

using increasing number of cores (up to 8 cores)

Page 12: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

12

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Difference in Memory Access Patterns

core 0 core 1

JVM 0 JVM 1

core 2 core 3

JVM 2 JVM 3

Javaheap

Javaheap

Javaheap

Javaheap

256 MB 256 MB 256 MB 256 MB

core 0 core 1 core 2 core 3

JVM

Javaheap

1 GB = 256 MB x 4

☺ each core accesses only 256-MB heap☺ each memory page is accessed

from only 1 core

each core accesses 1-GB memory spaceeach memory page is accessed from 4 cores

multi-process (multi-JVM) modelmulti-thread (one-JVM) model

Page 13: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

13

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Experimental Setup for a larger PHP workload

Benchmark– MediaWiki (wiki server used in Wikipedia)

PHPruntimes

mysqldclientemulator

lighttpd

client

x86 / Linux

application server

UltraSPARC T1 1.2 GHz / Solaris

databaseserver

FastCGIover Unix domain socket

x86 / Linux

Page 14: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

14

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

PHP runtime configuration

each runtime instance handles independent requests no communication among PHP runtime instances

processsharing virtual memory space

PHP runtimeinstance 1

PHP runtimeinstance 2

PHP runtimeinstance 3

PHP runtimeinstance 4

PHP runtimeinstance 1

PHP runtimeinstance 2

PHP runtimeinstance 3

PHP runtimeinstance 4

process

HTTPserver

multi-process PHP runtime (default) multi-threaded PHP runtime

HTTPserver

Page 15: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

15

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0

2

4

6

8

10

12

14

16

0 1 2 3 4number of SMT threads

thro

ughp

ut (t

rans

actio

ns/s

ec) .

multi-thread modelmulti-proces model

05

101520253035404550

0 1 2 3 4 5 6 7 8number of cores

thro

ughp

ut (t

rans

actio

ns/s

ec) .

multi-thread modelmulti-proces model

Core Scalability and SMT Scalability of MediaWiki

consistent with results for Java benchmarks

core scalabilitySMT scalability

higher is faster

multi-thread model was 2.5% slowermulti-thread model was 5.5% faster

Page 16: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

16

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0.0

0.2

0.4

0.6

0.8

1.0

1.2

L1I cache miss L2 instructionmiss

L1D cache miss L2 data miss DTLB missrela

tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel

1 thread2 threads 3 threads 4 threads

Micro Architectural Statistics for MediaWiki

multi-process model = 1.0

using increasing number of SMT threads (up to 4 threads)

multi-thread m

odelis better

multi-process m

odelis better

Page 17: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

17

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

0.0

0.2

0.4

0.6

0.8

1.0

1.2

1.4

L1I cache miss L2 instructionmiss

L1D cache miss L2 data miss DTLB missrela

tive

num

ber o

f eve

nts

for m

ulti-

thre

adm

odel

ove

r mul

ti-pr

oces

s m

odel

1 core2 cores4 cores6 cores8 cores

Micro Architectural Statistics for MediaWiki

multi-process model = 1.0

using increasing number of cores (up to 8 cores)

multi-thread m

odelis better

multi-process m

odelis better

Page 18: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

18

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Performance of MediaWiki using All SMT Threads

core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7

threadthreadthreadthread thread

threadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

☺ multi-thread model was 5.5% faster

☺ TLB misses were reduced by 60%

Page 19: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

19

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Performance of MediaWiki using All SMT Threads

core 0 core 1 core 2 core 3 core 4 core 5 core 6 core 7

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

threadthreadthreadthread

☺ multi-thread model was 5.5% faster

☺ TLB misses were reduced by 60%

multi-thread model was only 1.7% faster

TLB misses were reduced by only 19%

Page 20: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

20

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Our Technique: Core-aware Memory Allocation

core 0 core 1 core 2 core 3

multi-threaded PHP runtime

heap

Page 21: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

21

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Our Technique: Core-aware Memory Allocation

core 0 core 1 core 2 core 3

multi-threaded PHP runtime

physical page size (4 MB)

Page 22: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

22

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Our Technique: Core-aware Memory Allocation

core 0 core 1 core 2 core 3

multi-threaded PHP runtime

core 0 core 1 core 2 core 3

multi-threaded PHP runtime

physical page size (4 MB)physical page size (4 MB)

- avoid sharing the memory space among cores within a physical page

Core-aware Memory Allocation

Page 23: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

23

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Performance of MediaWiki with Our Core-aware Malloc

Our core-aware allocator improved the performance of multi-thread model by 3.0% over the default allocator in libc

relative throughput of multi-thread model over multi-process model

0.96

0.98

1

1.02

1.04

1.06

1.08

0 1 2 3 4 5 6 7 8number of cores

rela

tive

thro

ughp

ut o

f mul

ti-th

read

mod

el .

over

mul

ti-pr

oces

s m

odel

default allocatorour core-aware allocator

higher is faster

multi-process model = 1.0

(non-zeroorigin)

3.0%

Page 24: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

24

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

DTLB misses with Our Core-aware Malloc

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

1

0 1 2 3 4 5 6 7 8number of cores

rela

tive

num

ber o

f DTL

B m

isse

s fo

r mul

ti-th

read

mod

el o

ver m

ulti-

proc

ess

mod

el

default allocatorour core-aware allocator

multi-process model = 1.0

shorter is faster

reduced DTLB missesby 46.7%

Our core-aware allocator reduced the DTLB misses for the multi-thread model by 46.7%

Page 25: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

25

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Summary

The multi-thread model tends to generate fewer cache misses but more DTLB misses on multi-core processors

The increase in DTLB misses becomes more significant with increasing number of cores

Core-aware memory allocation can maximize the benefit of multi-thread processing by reducing DTLB misses

Page 26: Performance of Multi-Process and Multi-Thread Processing ......• 8 cores with 4 SMT threads in each core • Solaris 10 – Nehalem system • Xeon X5570 (Nehalem) 2.93 GHz • 4

26

IBM Research - Tokyo

Performance of Multi-Process and Multi-Thread Processing on Multi-core SMT Processors © 2010 IBM Corporation

Our Answer to the Question

Threads vs. Processes: Which is better to achieve higher performance?

Multi-thread model has advantage over multi-process model, but memory allocator need to be enhanced