Prof. Hironori Kasahara · 2006. 10. 20. · - Architecture Support Speculative Execution Runtime...

1

Prof. Hironori KasaharaHead: Department of Computer Science

Director: Advanced Chip Multiprocessor Research InstituteWaseda University

3-4-1 Ohkubo Shinjuku-ku, Tokyo, 169-8555, Japan E-mail: [email protected]

URL:http://www.kasahara.cs.waseda.ac.jpMembers:

Associate Professor: Dr. Keiji KimuraResearch Associates: Mr. Jun Shirako, Mr. Yasutaka Wada

Ph. D Students: 7, M2: 8, M1: 11, B4: 20 (inc.M0: 14)

2

Hironori Kasahara＜Personal History＞

B.S. (1980,Waseda), M.S.(1982,Waseda), Ph.D.(1985,EE, Waseda). Res.Assoc. (1983,Waseda),Special Research Fellow JSPS (1985) ,Visiting Scholar (1985.Univ.California at Berkeley), .Assist. Prof. (1986.Waseda), Assoc. Prof.(1988,Waseda), Visiting Research Scholar(1989-1990. Center for Supercomputing R&D, Univ.of Illinois at Urbana-Champaign), Prof.(1997-,Dept. CS, Waseda). , IFAC World Congress Young Author Prize (1987), IPSJ Sakai Memorial Special Award (1997), STARC Industry-Academia Cooperative Research Award (2004)

＜Activities for Societies＞IPSJ：Sig. Computer Architecture(Chair), Trans of IPSJ Editorial Board (HG Chair),

Journal of IPSJ Editorial Board (HWG Chair), 2001 Journal of IPSJ Special Issue on Parallel Processing(Chair of Editorial Board: Guest Editor, JSPP2000 (Program Chair) etc.ACM ：International Conference on Supercomputing(ICS)(Program Committee)

Int’l conf. on Supercomputing (PC, esp. ‘96 ENIAC 50th Anniversary Co-Prog. Chair). IEEE: Computer Society Japan Chapter Chair, Tokyo Section Board Member, SC07 PCOTHER: PCs of many conferences on Supercomputing and Parallel Processing.

<Activities for Governments>METI：IT Policy Proposal Forum(Architecture/HPC WG Chair),

Super Advanced Electronic Basis Technology Investigation CommitteeNEDO:Millennium Project IT21 “Advanced Parallelizing Compiler”( Project Leader),Computer Strategy WG (Chair).Multicore for Realtime Consumer Electronics Project Leader etc.MEXT:Earth Simulator project evaluation committee,JAERI: Research accomplishment evaluation committee, CCSE 1st class invited researcher. JST: Scientific Research Fund Sub Committee, COINS Steering Committee ,

Precursory Research for Embryonic Science and Technology (Research Area Adviser)Cabinet Office: CSTP Expert Panel on Basic Policy, Information & Communication Field

Promotion Strategy , R&D Infrastructure WG, Software & Security WG ＜Papers＞151 Papers with Review, 20 Papers for Symposium with Review, 105 Technicar Reports,

154 Papers for Annual Convention, 49 Invited Talks, 74 Articles in Newspaper & Web, etc.

3

Multi-core EverywhereMulti-core from embedded to supercomputers

Consumer Electronics (Embedded)Mobile Phone, Game, Digital TV, Car Navi, DVD,

CameraIBM/ Sony/ Toshiba Cell, Fuijtsu FR1000, NEC/ARMMPCore&MP211, Panasonic Uniphier, Renesas SH multi-core(RP1)

PCs, ServersIntel Dual-Core Xeon, Core 2 Duo, MontecitoAMD Quad and Dual-Core Opteron

WSs, Deskside & Highend ServersIBM Power4,5,5+, pSeries690(32way), p5 550Q(8 way) , Sun N iagara(SparcT1,T2), SGI ALTIX350,

SupercomputersEarth Simulator:40TFLOPS, 2002, 5120 vector proc.IBM Blue Gene/L: 360TFLOPS, 2005,

Low power CMP based 128K processor chipsHigh quality application software, Productivity, Costperformance, Low power consumption are important

Ex, Mobile phones, GamesCompiler cooperated multi-core processors are promising to realize the above futures

IBM First Chip Multiprocessor (CMP) Power 4 processor

2 cores on a chip

Earth simulaor：5120 vector processorsMultiprocessor Supercomputer

4

Roadmap of compiler cooperative multicore project

Mar.Oct.

02 03 04 05 06 07 0800 01 09 10

Practical ResarchBasic resrach

Arch. % CompilR&D

Plan Practical Use

STARC：Semiconductor

Technology Academic Research Center

Fujitsu,Toshiba,NEC,Renesas,Panasonic,

Sony etc.

STARC：Semiconductor

Technology Academic Research Center

Fujitsu,Toshiba,NEC,Renesas,Panasonic,

Sony etc.

Soft realtimeSoft realtime

PlanMulticore Arch. Compiler

API R&DPractical Use

（Waseda・Toshiba・Fujitsu・Panasonic・Sony）

（Waseda・Fujitsu・NEC・Toshiba）

Apply for Supercomputer CompilersCompiler development ofMultiprocessor Servers

Millennium Project IT21NEDO Advanced Parallelizing Compiler

（Waseda Univ. Fujitsu,Hitachi,JIPDEC,AIST）

STARC Compiler Cooperative Chip Multiprocessor（ Waseda Univ., Fujitsu, NEC,Toshiba, Panasonic,Sony）

NEDO Heterogeneous Multiprocessor（Waseda Univ., Hitachi）

NEDO Multicore Technology for Realtime Consumer Electronics

Waseda Univ., Hitachi, Renesas,Fujitsu, NEC, Toshiba, PanasonicPower Saving Multicore Architecture,Parallelizing Compiler,API

Mar.Mar.

5

National Project by METI/NEDOMulti-core for Real-time Consumer Electronics

<Goal> R&D of compiler cooperative multi-core processor technology for consumer electronics like Mobile phones, Games, DVD, Digital TV, Car navigation systems.

<Period> From July 2005 to March 2008

<Features> ・Good cost performance・Short hardware and software development periods・Low power consumption・Scalable performance improvement with the advancement of semiconductor ・Use of the same parallelizing compiler for multi-cores from different vendors using newly developed API

API：Application Programming Interface

開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ開発マルチコアチップは情報家電へ

CMP m(マルチ

コア・チップm)

(プロセッサ

コアn)

CSM / L2 Cache(集中共有メモリあるいはL２キャッシュ )

PC0(プロセッサコア0) PC1

(プロ

セッサコア1)

PC n

IntraCCN (チップ内結合網：　複数バス、クロスバー等 )

DSM(分散共有メモリ)

LDM/D-cache

(ローカルデータメモリ/L1データ

キャッシュ)

LPM/I-Cache

(ローカルプログラムメモリ/

命令キャッシュ)

CMP 0 （マルチコアチップ0）

InterCCN (チップ間結合網：　複数バス、クロスバー、多段ネットワーク等 )

CSM j

CSM(集中

共有メモリ)

I/O

ＣＳＰ k

(入出力用マルチコア・チップ)

NI(ネットワークインターフェイス)

CPU(プロセッサ）

DTC(データ

転送コントローラ)

I/O DevicesI/O

（入出力装置）CMP m(マルチ


(プロセッサ

コアn)



(プロ

セッサコア1)

PC n



LDM/D-cache


キャッシュ)

LPM/I-Cache





CSM j

CSM(集中

共有メモリ)

I/O

ＣＳＰ k




DTC(データ


I/O DevicesI/O

（入出力装置）

新マルチコアプロセッサ

•高性能

•低消費電力

•短HW/SW開発期間

•各チップ間でアプリケーション共用可

•高信頼性

•半導体集積度と共に性能向上


•高性能

•低消費電力



•高信頼性


マルチコア統合ECU

,,

CMP m(マルチ


(プロセッサ

コアn)



(プロ

セッサコア1)

PC n



LDM/D-cache


キャッシュ)

LPM/I-Cache





CSM j

CSM(集中

共有メモリ)

I/O

ＣＳＰ k




DTC(データ


I/O DevicesI/O

（入出力装置）CMP m(マルチ


(プロセッサ

コアn)



(プロ

セッサコア1)

PC n



LDM/D-cache


キャッシュ)

LPM/I-Cache





CSM j

CSM(集中

共有メモリ)

I/O

ＣＳＰ k




DTC(データ


I/O DevicesI/O

（入出力装置）


•高性能

•低消費電力



•高信頼性



•高性能

•低消費電力



•高信頼性


マルチコア統合ECU

,,

経済産業省/NEDOリアルタイム情報家電用マルチコア(2005.7～2008.3)**

**日立,富士通,ルネサス,東芝,松下,NEC

6

METI/NEDO Advanced Parallelizing Compiler Technology Project

Goal: Double the effective performance

1980 20001990

Contents of Research and Development① R & D of advanced parallelizing compiler

Multigrain, Data localization, Overhead hiding② R & D of Performance evaluation technology

for parallelizing compilers

Slow down

IT, Bio-tech., Device, Earth environment, Next-generation VLSI design, Financial engineering, Weather forecast, New clean energy, Space development, Automobile, Electric Commerce, etc

Ripple Effect① Development of competitive next

generation PC and HPC② Putting the innovative automatic

parallelizing compiler technology to practical use

③ Development and market acquisitionof future single-chip multiprocessors

④ Boosting R&D in the following many fields:

Background and Problems①Adoption of parallel processing as a core

technology on PC to HPC ② Increase of importance of software on IT③ Need for improvement of cost-performance

and usability

Performance

Hardware Peak performance

Effective Performance

<Purpose>

Theoretical maximum performance vs.Effective performance of HPC

year

Expa

nd G

ap

APC

Pro

ject

Improvement of① Effective performance ②Cost-performance③ Ease of use

1T

1G

Millenium Project IT21 2000.9.8 –2003.3.31Waseda Univ., Fujitsu, Hitachi, AIST

7

• Improve effective performance, cost-performance and productivity and reduce consumed power– Multigrain Parallelization

• Exploitation of parallelism from the whole program by use ofcoarse-grain parallelism among loops and subroutines, near fine grain parallelism among statements in addition to loop parallelism

– Data Localization• Automatic data distribution for distributed shared memory, cache

and local memory on multiprocessor systems.– Data Transfer Overlapping

• Data transfer overhead hiding by overlapping task execution and data transfer using DMA or data pre-fetching

– Power Reduction• Reduction of consumed power by compiler control of frequency,

voltage and power shut down with hardware supports.

OSCAR Parallelizing Compiler

8

Earliest Executable Condition Analysis for coarse grain tasks (Macro-tasks)

BPA

BPA

1 BPA

3 BPA2 BPA

4 BPA

5 BPA

6 RB

7 RB15 BPA

8 BPA

9 BPA 10 RB

11 BPA

12 BPA

13 RB

14 RB

END

RB

RB

BPA

RB

Data Dependency

Control flow

Conditional branch

Repetition Block

RB

BPA

Block of PsuedoAssignment Statements

7

11

14

1

2 3

4

5 6

15 7

8

9 10

12

13

Data dependency

Extended control dependencyConditional branch

6

ORAND

Original control flowA Macro Flow Graph

A Macro Task Graph

9

Data-Localization Loop Aligned Decomposition

• Decompose multiple loop (Doall and Seq) into CARs and LRsconsidering inter-loop data dependence.– Most data in LR can be passed through LM.– LR: Localizable Region, CAR: Commonly Accessed Region

DO I=69,101DO I=67,68DO I=36,66DO I=34,35DO I=1,33

DO I=1,33

DO I=2,34

DO I=68,100

DO I=67,67

DO I=35,66

DO I=34,34

DO I=68,100DO I=35,67

LR CAR CARLR LR

C RB2(Doseq)DO I=1,100

B(I)=B(I-1)+A(I)+A(I+1)

ENDDO

C RB1(Doall)DO I=1,101

A(I)=2*IENDDO

RB3(Doall)DO I=2,100

C(I)=B(I)+B(I-1)ENDDO

C

10

APC Compiler Organization

Parallelizing Tuning System

Program Visualization Technique

Techniques for Profiling & Utilizing Runtime Info

Feedback-directed Selection Technique of

Compiler Directives

- Hierarchical Parallel- Affine Partitioning

MultigrainParallelizationSource

Program

FORTRAN

OpenMPCode

Generation

APC Multigrain Parallelizing Compiler

Feedback

- Cache optimization- DSM optimization

Automatic DataDistribution

- Static Scheduling- Dynamic Scheduling

Scheduling

- Inter-procedural- Conditional

Data DependenceAnalysis

- Coarse Grain- Medium Grain- Architecture Support

SpeculativeExecution

RuntimeInfo

Variety of Shared Memory Parallel machines

SUNUltra 80

IBMpSeries690

IBMRS6000

IBM XL FortranVer.8.1

Sun Forte 6Update 2

IBM XL FortranVer.7.1

…

11

Image of Generated OpenMP Code for Hierarchical Multigrain Parallel Processing

Centralized scheduling code

Distributed scheduling code

T0 T1 T2 T3

Thread group0

MT1_1

SYNC SENDMT1_2

SYNC RECV

T4 T5 T6 T7

1_4_2

1_4_4

1_4_3

1_4_1

1_4_2

1_4_4

1_4_3

1_4_11_3_2

1_3_4

1_3_3

1_3_1

1_3_6

1_3_5

1_3_2

1_3_4

1_3_3

1_3_1

1_3_6

1_3_5

1_3_2

1_3_4

1_3_3

1_3_1

1_3_6

1_3_5

SECTIONSSECTION SECTION

END SECTIONS

Thread group1

MT1_1

MT1_3SB

MT1_2DOALL

MT1_4RB

1st layer

2nd layer

1_3_1

1_3_2 1_3_3

1_3_5

1_3_4

1_3_6

1_4_21_4_3 1_4_4

1_4_1

MT1-4

MT1-3

2nd layer

12

IBM pSeries690 RegattaH• Up to 16 Power4: 32 way SMP Server

– L1(D) : 64 KB (32KB/processor, 2 way assoc.), L1(I): 128 KB (64KB/processor, Direct map)– L2 : 1.5 MB (shared cache with 2 procs., 4~8 way assoc.)– L3 : 32 MB (external, 8 way assoc.) [x 8 = 256 MB]

GXGX

P

L2

PP

L2

P

P

L2

P P

L2

P

GXGX

P

L2

PP

L2

P

P

L2

P P

L2

P

GX

GX

P

L2

PP

L2

P

P

L2

P P

L2

P

GX

GX

GX

GX

GX

P

L2

PP

L2

P

P

L2

P P

L2

P

GX

GX

GX

GX

GX

MemSlot

GX Sl ot

L3 L3 L3 L3L3 L3L3 L3

L3 L3

L3 L3

L3 L3L3 L3 L3 L3

L3 L3

L3 L3

L3 L3

L3 L3

L3 L3

L3 L3

L3 L3

MCM 1

MCM 3MCM 2

MCM 0

GX Slot

MemSlot

MemSlot

MemSlot

MemSlot

MemSlot

MemSlot

MemSlot

GX Slot

GX Sl ot

Four 8-way MCM Featur es Assembled into a 32-way pSer ies 690

IBM @server pSeries690 Configuring for Performance

http://www-1.ibm.com/servers/eserver/pseries/hardware/whitepapers/p690_config.html

13

Performance on IBM pSeries690Power4 24 processors SMP server

• OSCAR compiler gave us 4.82 times speedupagainst XL Fortran Ver.8.1

0

5

10

15

20

25to

mca

tv

swim

su2c

or

hydr

o2d

mgr

id

appl

u

turb

3d

apsi

fppp

p

wav

e5

swim

mgr

id

appl

u

apsi

spec95 spec2000

spee

dup

ratio

XL Fortran Ver.8.1OSCAR

14

SGI Altix 350

2.4 times speedup against Intel Fortran 7.1 using OpenMP code for IBM p690

16 Itanium 2 (1.3GHz) SMP serverL1 16KB(4way), L2 256KB(8way), L3 3MB(12way)

15

IBM p5-550Q server (1.5 or 1.65 GHz)

I/O Hub

Four PCIX & One PCIX DDR

GX Bus4.4 GB/sec

MEMORY

Two Quad-Core modules105.6

GB/sec

Four I/O Drawers

Four I/ODrawers

GX adapters share PCIX slots 4 and 5

4.0 GB/sec

GX Bus on 2nd processor card4.4 GB/sec

MemCtl

L2 Cache

Distributed Switch

P P42.2

GB/sec 1 to 64GBDDR2

533 MHzDIMM

L3L3

MemCtlr

L2 Cache

SMP Bus

L2 Cache

2-core POWER5+ processor

2-core POWER5+ processor36MB

1.9MB

16

Performance on IBM p5 550POWER5+ 8 processors SMP server

• OSCAR compiler– 2.74 times speedup against XL Fortran Ver.10.1 without SMT– 2.78 times speedup against XL Fortran Ver.10.1 with SMT

0

1

2

3

4

5

6

7

8

9

tomcatv swim su2cor hydro2d mgrid applu turb3d apsi fpppp wave5 swim mgrid applu apsi

spec95 spec2000

spee

dup

ratio

XL Fortran Ver.10.1(SMT off)XL Fortran Ver.10.1(SMT on)OSCAR(SMT off)

17

Power Reduction by Power Supply, Clock Frequencyand Voltage Control by OSCAR Compiler

• Shortest execution time mode

18

OSCAR Multi-Core ArchitectureCMP m

CSM / L2 Cache

PE0 PE1 PE n

Intra-chip connection network(Multiple Buses, Crossbar, etc)

DSM

LDM/D-cacheLPM/

I-Cache

CMP (chip multiprocessor 0)0

Inter-chip connection network (Crossbar, Buses, Multistage network, etc)

CSM j

CSM

I/OCMP k

CPUDTC

I/ODevicesI/O

Devices

Network Interface

CSM: central shared mem.DSM: distributed shared mem.DTC: Data Transfer Controller

LDM : local data mem.LPM : local program mem.FVR: frequency / voltage control register

FVR

FVR

FVRFVR

FVR

FVR FVR FVR

19

Consumed Energy in Real-time Processing mode

• deadline = sequential execution time

0

50

100

150

200

250

300

350

400

450

1 2 4 1 2 4 1 2 4

tomcatv swim applu

ener

gy

[

w/o Savingw Saving

0500

10001500200025003000350040004500

1 2 4

mpeg2enc

ener

gy

[m

-85.6%

-42.6%

-86.7%

-46.5%

-74.0%

-82.7%

-30.8%

num. of proc. num. of proc.

0

50

100

150

200

250

300

350

400

450

1 2 4 1 2 4 1 2 4

tomcatv swim applu

ener

gy

[

w/o Savingw Saving

0500

10001500200025003000350040004500

1 2 4

mpeg2enc

ener

gy

[m

-85.6%

-42.6%

-86.7%

-46.5%

-74.0%

-82.7%

-30.8%

num. of proc. num. of proc.

20

INPUT

OUTPUT

SPICE Circuit Simulator Fortran Fortran + OpenMP

SMP

Native OpenMPFortran Compiler

WS PC

Native SequentialFortran Compiler

Single ChipMultiprocessor SMP OSCAR

VectorSuper

Computer

OSCAR MultigrainFortran Compiler

SuperComputer

Cluster

Developed Circuit SimulatorDeveloped Circuit SimulatorOptimized loop free code generarion

SPICE INPUT

Parallel Processing of VLSI Electronic Circuit Simulation

Sequential processing on PC and WS

Parallel processing on multiprocessors

Prof. Hironori Kasahara · 2006. 10. 20. · - Architecture Support Speculative Execution Runtime...

Documents

Transcript of Prof. Hironori Kasahara · 2006. 10. 20. · - Architecture Support Speculative Execution Runtime...