Sungkyunkwan University, School of Information and Communication Engineering © Jun-Dong Cho, Fall 2006

SCHEDULING AND TIMING ANALYSIS OF HW/SW ON-CHIP COMMUNICATION IN MP SOC DESIGN

© Jun-Dong Cho, Summer 2006

Contents

– Introduction

– Motivation

– Communication Scheduling and HW/SW Timing Analysis

– Experimental Results

– Conclusion


Introduction

• On-chip Communication Architecture in MP SoC design

• On-chip Communication Design
– Design of HW/SW Communication Architecture
– Mapping and scheduling of on-chip communication

• Contribution of this work is the consideration of both
– Dynamic behavior of SW communication architecture
– Physical communication buffer sharing


Preliminaries

• Mapping/Allocation

• Communication Scheduling


Motivation


Extended Task Graph


Communication Delay Model

Communication Delay

$$C_i(n) = C_i^{SW}(n) + C_i^{HW}(n) + C_i^{OCN}(n)$$

– Communication Delay of SW Communication Architecture

$$C_i^{SW}(n) = R(v_i)\,\bigl(C_{ISR}(n) + C_{CS}(n) + C_{DD}(n)\bigr)$$

– Communication Delay of HW Communication Interface

$$C_i^{HW}(n) = n \times (\text{delay of HW Communication Interface})$$

– Communication Delay of On-chip Communication Network

$$C_i^{OCN}(n) = n \cdot D_{trans} + D_{BLOCKED}$$
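The delay model on this slide has three components: the SW communication architecture, the HW communication interface, and the on-chip communication network. The sketch below shows one way they might be combined. It assumes that the total is the sum of the three components, that n is the amount of data transferred, that R(v_i) acts as a 0/1 flag, and that the ISR, context-switch, and device-driver costs are constants. All names and the HW/OCN numbers are hypothetical; only the ISR, CS, and DD(read) cycle counts come from the ISS measurements in the Experimental Results slide.

```python
from dataclasses import dataclass


@dataclass
class DelayParams:
    uses_sw_stack: int  # R(v_i), assumed here to be 1 if the SW stack is involved, else 0
    c_isr: int          # interrupt service routine cost (cycles)
    c_cs: int           # context switch cost (cycles)
    c_dd: int           # device driver cost (cycles)
    hw_if_delay: int    # assumed HW interface delay per data unit (cycles)
    d_trans: int        # assumed network transfer delay per data unit (cycles)
    d_blocked: int      # assumed cycles spent blocked on the network


def sw_delay(p: DelayParams) -> int:
    """SW communication architecture delay (treated as independent of n here)."""
    return p.uses_sw_stack * (p.c_isr + p.c_cs + p.c_dd)


def hw_delay(p: DelayParams, n: int) -> int:
    """HW communication interface delay for n data units."""
    return n * p.hw_if_delay


def ocn_delay(p: DelayParams, n: int) -> int:
    """On-chip communication network delay for n data units."""
    return n * p.d_trans + p.d_blocked


def total_delay(p: DelayParams, n: int) -> int:
    """Total communication delay: sum of the SW, HW, and OCN components."""
    return sw_delay(p) + hw_delay(p, n) + ocn_delay(p, n)


# ISR, CS, DD(read) from the slides; the last three values are placeholders.
params = DelayParams(1, 857, 1060, 5376, hw_if_delay=2, d_trans=4, d_blocked=10)
print(total_delay(params, 16))
```

With these numbers the software part dominates the total, which is consistent with the slides' emphasis on modelling the SW communication architecture.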


Communication Scheduling and Timing Analysis

• ILP for Scheduling Communication Nodes and Tasks of ETG
– Data dependency constraints
– Resource contention constraints
– Processor and on-chip communication network

$$t_1^{end} \le t_2^{start} \quad \text{and} \quad t_1^{end} \le t_3^{start}$$
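The data dependency and resource contention constraints listed on this slide can be checked against a candidate schedule with a few lines of code. The sketch below is not an ILP solver. It only verifies a given schedule, and the function names and the example data are hypothetical.

```python
def violated_dependencies(start, end, edges):
    """Return edges (u, v) that break the constraint end[u] <= start[v]."""
    return [(u, v) for (u, v) in edges if end[u] > start[v]]


def resource_clashes(start, end, resource_of):
    """Return pairs of nodes on the same resource whose time intervals overlap."""
    nodes = sorted(start)
    clashes = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            same_resource = resource_of[a] == resource_of[b]
            overlap = start[a] < end[b] and start[b] < end[a]
            if same_resource and overlap:
                clashes.append((a, b))
    return clashes


# Node 1 feeds nodes 2 and 3; node 3 starts one cycle too early.
start = {1: 0, 2: 5, 3: 4}
end = {1: 5, 2: 9, 3: 8}
edges = [(1, 2), (1, 3)]
resource_of = {1: "P0", 2: "P0", 3: "P1"}

print(violated_dependencies(start, end, edges))
print(resource_clashes(start, end, resource_of))
```

In an ILP formulation the same conditions would appear as linear inequalities over the start and end time variables, typically with binary variables choosing the order of nodes that share a resource.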


ILP for Binary variable

– Physical communication buffer contention constraints


Heuristic Algorithm

• List scheduling

LIST (G(V,E), a) {
    l = 1;
    repeat {
        for each resource type k = 1, 2, …, n_res {
            Determine candidate tasks U_{l,k};
            Determine unfinished tasks T_{l,k};
            Select S_k ⊆ U_{l,k} nodes, such that |S_k| + |T_{l,k}| ≤ a_k;
            Schedule the S_k tasks at time l by setting t_i = l ∀ i : v_i ∈ S_k;
        }
        l = l + 1;
    } until (v_n is scheduled);
    return (t);
}
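The LIST pseudocode above can be turned into a small runnable sketch. The version below is an illustration under stated assumptions, not the authors' implementation: task durations are positive integers, the graph is acyclic, candidates are picked in alphabetical order instead of by a priority function, and the task names, resource types, and capacities are invented.

```python
def list_schedule(tasks, edges, capacity):
    """Resource-constrained list scheduling.

    tasks:    {name: (resource_type, duration)} with duration >= 1
    edges:    [(pred, succ)] data dependencies (assumed acyclic)
    capacity: {resource_type: number of units a_k}
    Returns {name: start_time}, with time steps starting at 1.
    """
    for name, (k, _) in tasks.items():
        if capacity.get(k, 0) < 1:
            raise ValueError(f"no resource of type {k!r} for task {name!r}")
    preds = {v: set() for v in tasks}
    for u, v in edges:
        preds[v].add(u)

    start = {}
    step = 1
    while len(start) < len(tasks):
        for k, units in capacity.items():
            # Unfinished tasks T_{l,k}: started earlier and still running.
            running = [v for v, s in start.items()
                       if tasks[v][0] == k and s + tasks[v][1] > step]
            # Candidate tasks U_{l,k}: unscheduled, all predecessors finished.
            candidates = [
                v for v in tasks
                if v not in start and tasks[v][0] == k
                and all(p in start and start[p] + tasks[p][1] <= step
                        for p in preds[v])
            ]
            # Select S_k so that |S_k| + |T_{l,k}| <= a_k.
            free = max(0, units - len(running))
            for v in sorted(candidates)[:free]:
                start[v] = step
        step += 1
    return start


tasks = {"a": ("cpu", 2), "b": ("cpu", 1), "c": ("bus", 1), "d": ("cpu", 1)}
edges = [("a", "c"), ("b", "c"), ("c", "d")]
print(list_schedule(tasks, edges, {"cpu": 1, "bus": 1}))
print(list_schedule(tasks, edges, {"cpu": 2, "bus": 1}))
```

Treating the bus as just another resource type is what lets communication nodes and computation tasks be scheduled by the same loop. A real implementation would normally replace the alphabetical selection with a priority such as longest remaining path.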


Experiments


Experimental Results

• Execution delay of tasks in the H.263 system

Functional Block     Execution Cycles
Source               1386
Motion predictor     1123650
Macro block          875358
Encoder              875358
VLC                  2103156

• Delay of software communication architecture
– Software communication architecture delay measured by ISS.

Services      Execution Cycles
ISR           857
CS            1060
DD (read)     5376
DD (write)    6528


Execution Time of H.263 encoder


Execution Time of JPEG and IS-95 example


Conclusion

• On-chip Communication Design
– Design of HW/SW Communication Architecture
– Mapping and scheduling of on-chip communication

• Communication Scheduling and Timing Analysis
– ILP Formulation
– Heuristic

• Consideration of
– Dynamic behavior of SW communication architecture
– Physical communication buffer sharing

• Future Work
– To extend the approach to the complicated On-Chip Network
– To design On-Chip Communication Scheduler


Deep-submicron Low-Power Design


Communication Architecture of MP-SoC Platform

Sungkyunkwan University, Jun-Dong Cho


Talk Outline

• MP-SoC Architectures:

– Homogeneous Architecture
– Heterogeneous Architecture

• Crossbar Interconnection Models

• MP-SoC Evaluation Methodology


Networks-on-Silicon, Philips

Albert van der Werf, Philips Research


DSP based commercial SoC


MP-SOC Platforms?

• Platforms: an architecture that is designed for an application domain (responding to consumer demand and anticipating future changes)

• Multiprocessor systems-on-chips (usually heterogeneous multiprocessors):
– CPUs, DSPs, etc.
– Hardwired accelerators.
– Mixed-signal front end.


SoC topics …

• Task-level analysis and optimizations
– Scheduling and resource sharing
– Mapping data flow to the target HW
– Task distributed splitting and merging

• Efficient communication optimization
– BW and memory allocation
– Synchronizations
– Buffers
– Routing: shortest-path routing w/ minimal router logic
– Irregular meshes
– Mapping to topology: what topology will suit a particular application?


From computation centric to communication centric architectures

– System optimization: latency, throughput, BW.
– Interconnect consumes up to 50% of energy and growing…
– Scalable
– Should also be optimized
• Topology
• Links BW
• Place & Route
• Routing protocol


4G: Multiple standards Software Defined Radio & Multimedia

• A number of components might be placed on a single die in order to decrease the production cost. Ex) H.264 + MPEG4

• Multi-DSP: WiBro, MoIP

• Maximum parallelism!
– Parallel data transfer among the components.
– Read, write and calculate processes should be decoupled.

[Figure: DSP1, DSP2, …, DSPn, network, TX RAM]


Why Multi-Threaded Cores?

• Increasing gap: memory & processor speeds (2x / 2 years)
• Increasing gap: interconnect & gate delays (multi-clock intra-chip delay)
• More parallel processing (lower-power, higher-perf./mm2)

[Figure: STMicroelectronics MultiFlex MP-SoC. Blocks shown: DSPs, RISC cores with I$ and D$, H/W O/S schedulers, SRAM, In and Out ports, NoC]


Exploitable Parallelism

[Figure: min parallel grain size (instructions) versus exploitable task parallelism for Instruction-Level Parallelism, GP O/S Thread-Level Parallelism, and MultiFlex Thread-Level Parallelism. Grain-size labels: 1, 100's, 10 000's instructions. Parallelism labels: 1~8, 2~6, 1~100]


Execution speed-up: MPEG4 VGA Video Codec Results

[Figure: frames/sec (0 to 35) versus number of ARMs (2 to 6) for 2, 4, and 8 threads. Curves: theoretical upper bound, MultiFlex result (STBus), MultiFlex (0 latency bus)]


Intel IXP1200 Network Processor


IXP1200 MicroEngine


Holistic design of multi-core architectures

• Naïve Methodology is inefficient

• Demonstrated inefficiency for cores and proposed alternatives
– Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
– Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
– Conjoined-core Chip Multiprocessing [MICRO04]

• What about interconnects?
– How much can interconnects impact processor architecture?
– Need to be co-designed with caches and cores?


• Drew Wingard, CTO, Sonics, Inc.

– Delivering clocks is problematic
– Wire delay is dominant
– Too many constraints, from too many blocks


Various Interconnect Topologies

Linear (Pipeline)

Bus (Old style)

Ring (Cheap & Slow)

Star (Switch)

Tree (PCI-Express)

Mesh (NoC)

Fully connected (When do we need it?)

Hybrid (Dedicated)

Ring w/ bypass (Async)


State of the Art: Network on Chip

Networks are preferred over buses:

• Higher bandwidth

• Concurrency, effective spatial reuse of resources

• Higher levels of abstraction

• Modularity - Design Productivity Improvement

• Scalability


Network on Chip

The idea:
• Decouple communication and computation
• Simple routers instead of repeaters
• Route packets instead of wires
• Provide different services on the same infrastructure

[Figure: grid of modules]


Communication Centric Design Flow, Jeremy

[Figure: design-flow chart. Boxes: Application, Architecture Library, Architecture / Application Model, Configure, Evaluate, Analyse / Profile, Refine, NoC Optimisation, Synthesis, Optimized NoC. Decision: "Good?" with a "No" branch]


System Development Flow of QNoC

[Figure: development-flow chart. Boxes: Connect modules with an ideal network and measure traffic; Place Modules; Map traffic to grid using given QNoC architecture; Balance utilization, minimize cost, and meet QoS; Analyse / Profile; Refine QNoC; NoC Optimisation; Synthesis; Optimized NoC. Decision: "Good?" with a "No" branch]


MPSoC design time application optimization and exploration

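[Figure: flow chart. Boxes: Application specification; Application analysis, partitioning, transformation, and exploration; Refined application specification + Pareto curve generator; Platform manager; Resource allocation; Platform verification; Refine. Decision: "QoS satisfied?" with a "No" branch]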


QNoC (Technion) vs. Alternative Solutions

Mesh (4x4), uniform scenario (same QoS):

Arch.    Frequency    Utilization    Av. Link Width
QNoC     1 GHz        30%            28
Bus      50 MHz       50%            3 700
PTP      100 MHz      80%            6

[Figure: bar chart "Wire-Length (Area) and Power", cost of BUS, QNoC, and PTP on a logarithmic axis from 0.1 to 100.0. Series: Wire Length and Power. Bar labels: 45.0, 3.8, 1.0, 1.0, 2.9, 0.8]


NoC Design solution

• Explore various topology options to determine those that best meet the system objectives.
• Tune the arbitration scheme and network links (for example, burst lengths, communication FIFO sizes, etc.).
• NoCcompiler is used to create the NoC instance (packet format options, burst types, special transactions, maximum packet payload length, network port configurations, etc.).
• A cycle-accurate simulation model can then be generated to verify with SystemC or RTL simulation.


Danube NoC IP

• Support for OCP 2.0, AMBA AHB and AMBA AXI socket interfaces
• Clock frequency up to 750 MHz in 90 nm process
• GALS links for spanning distance and crossing clock boundaries
• Unlimited user-defined topology through Arteris NoCexplorer and NoCcompiler
• Flexible pipelining and FIFO management
• Customizable NoC Transaction and Transport Protocol (NTTP) packet format
• On-chip protocol for runtime SoC application debug
• Lightweight Service bus for runtime debug, error management, register configuration


SoftStream HERA 3000(mediaexcel)(2)

PCI board ( can see 6 DSPs)


KOMPRESSOR AVC (Ateme)

The Kompressor Board is a multi-encoding short PCI board

MPEG-4 AVC H.264 live encode, IP streaming and simultaneous file recording

Uses TMS320C64x DSPs from Texas Instruments

Uses a Cyclone FPGA from Altera

Input – NTSC or PAL

Video compression: MPEG-4 AVC / H.264 (ISO/IEC 14496-10) Baseline, Main, High Profiles


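Applications and Outlook

• Real-time video surveillance: IP cameras and streamers, network video recorders (NVR), etc.

• Meets the performance and integration requirements of the high-performance image processing market (multifunction printers and scanners, film processing equipment, image recognition systems, etc.) as well as of broadcast and IP TV (encoding, streaming, and transcoding equipment, video conferencing systems, video phones, etc.)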


HiBRID-SoC Architecture

HiBRID-SoC multi-core system-on-chip architecture

Integrates a powerful on-chip communication structure

A well-balanced memory system to account for the growing amount of data memory (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))

Dedicated chips for the MPEG-4 Simple Profile, consists of a very general processing demand

Three programmable cores, each adapted towards a specific class of algorithms

Combination of the cores and their software development environment

An extension of a programmable core with dedicated modules (e.g., TriMedia)

HiBRID-SoC multi-core

Developed at the University of Hannover


Morpho(MS1-16 v002)

• 16 16x16-bit MAC operations/cycle @500MHz
• 4.8 GMAC/s @500MHz
• 420 and 480 TBGA packaging (0.18u/0.13u)
• Core voltages at 1.8V/1.2V and 3.3V

http://www.morphotech.com/


SandBridge(SB3000)

• Applications
– MPEG capture and playback for video or videoconferencing
– JPEG capture and playback for camera and display functions
– MP3 capture and playback for music or ringtone functions
– Other computation-intensive multimedia functions (e.g. speech control)


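NEC and ARM

• Developing a new chip for phones and consumer electronics that places two or more processing cores and arithmetic units inside one microprocessor.

• Mobile phone makers also have various ways to use it. For example, one core can handle call functions while another manages Internet traffic.

• Recorded performance on par with the 1.2 GHz ARM11 processor family: up to 1440 DMIPS at about 600 mW power consumption, in a 130 nm process.

• Linux SMP OS port: reduced power consumption and automatic balancing of application load across processors.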


Efficient Shared DRAM Subsystems for SOCs, Sonics Inc.

• Increases SOC performance
– Improves efficiency of off-chip DRAM by up to 40%
– Guarantees Quality of Service for on-chip cores

• Lowers SOC costs
– Consolidates and reduces multiple distributed buffers
– Single Smart Interconnect replaces multiple layered busses

• Shortens time to market
– Smart Interconnect removes wire routing problem of classical architectures
– Accurate architectural exploration ensures functionality in first days of development

• Increased market penetration
– DRAM technology selection decoupled from the rest of the SOC
– Threaded architecture enables easy scalability without re-design of memory subsystem


Shared Bus and DRAM Subsystem

The traditional computer bus organization suffers from low DRAM efficiency and a lack of quality-of-service.


Star Topology Access to a Shared DRAM Subsystem

• DRAM controller
1) sees all initiator requests at the same time and can select the order of servicing them
2) optimizes the performance of the DRAM subsystem by reordering requests
3) provides flexible quality-of-service to each of the initiators
4) causes a large number of wires to converge on the DRAM controller, producing physical problems for the design


SiliconBackplane and DRAM Scheduler

The shared μnetwork remedies the wire congestion problem. The DRAM scheduler addresses both the DRAM performance issues and the quality-of-service guarantees by selectively scheduling the DRAM accesses.


Bandwidth Profile of Bus with Round-Robin Arbitration

Many of the initiators receive less than their required bandwidth, so overall application requirements are unsatisfied.


Bandwidth Profile of Bus with Priority Arbitration

From 5000 to 9000 cycles all but two initiators (CPU and DSP) receive no service at all. Clearly, this is unacceptable for the set-top-box application.
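The difference between the round-robin and priority bandwidth profiles can be reproduced qualitatively with a toy single-grant-per-cycle bus model. This is only a sketch under invented assumptions: four initiators that issue one request every 2, 2, 4, and 8 cycles, with unserved requests queued indefinitely. It is not the set-top-box workload or the arbiter used in the Sonics experiments, and the function name and parameters are hypothetical.

```python
def simulate(periods, cycles, policy):
    """Count bus grants per initiator under a simple arbitration policy.

    periods[i]: initiator i issues one request every periods[i] cycles.
    policy:     "priority" (lower index wins) or "round_robin".
    Exactly one pending request is served per cycle.
    """
    n = len(periods)
    pending = [0] * n
    grants = [0] * n
    last = n - 1  # most recently granted initiator (round-robin state)
    for t in range(cycles):
        # New requests arrive at the start of the cycle.
        for i, period in enumerate(periods):
            if t % period == 0:
                pending[i] += 1
        if policy == "priority":
            order = range(n)
        elif policy == "round_robin":
            order = [(last + 1 + k) % n for k in range(n)]
        else:
            raise ValueError(f"unknown policy: {policy}")
        # Grant the first initiator in the search order that has a request.
        for i in order:
            if pending[i] > 0:
                pending[i] -= 1
                grants[i] += 1
                last = i
                break
    return grants


periods = [2, 2, 4, 8]  # e.g. CPU, DSP, and two lower-rate initiators
print("priority   ", simulate(periods, 16, "priority"))
print("round robin", simulate(periods, 16, "round_robin"))
```

In this model the two highest-priority initiators saturate the bus under priority arbitration and the other two receive no grants at all, while round-robin serves everyone but leaves the two high-rate initiators short of their demand. Both effects mirror the behavior described on these slides.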


Bandwidth Profile of Sonics Solution

Each of the initiators is connected to the DRAM through a Silicon Backplane μNetwork, a DRAM scheduler, and a DRAM controller. The Silicon Backplane and DRAM bandwidth have been allocated to the different initiators according to their needs. All application requirements are met, and overall DRAM utilization is steady at around 70%.


References

• Terry Tao Ye, On-Chip Multiprocessor Communication Network Design and Analysis, Ph.D. Dissertation, Stanford Univ.

• E. Bolotin, et al., Automatic hardware-Efficient SoC Integration by QoS network on Chip, Israel Institute of Tech, Haifa, Israel.

• E. Bolotin, et al., Efficient Routing in Irregular Topology NoCs, Technion - Israel Institute of Tech


The Stanford Hydra CMP

• Lance Hammond
• Benedict A. Hubbert
• Michael Siu
• Manohar K. Prabhu
• Michael Chen
• Kunle Olukotun

Presented by Jason Davis


Introduction

• Hydra CMP with 4 MIPS Processors
• L1 cache for each CPU and L2 cache that holds the permanent states
• Why?
– Moore’s law is reaching its end
– Finite amount of ILP
– TLP (Thread Level Parallelism) vs ILP in pipelined architecture
– CMP can use ILP as well (TLP and ILP are orthogonal)
– Wire Delay
– Design Time (CPU core doesn’t need to be redesigned, just increase the number)

• Problems
– Integration densities just now giving reasons to consider new models
– Difficult to convert uniprocessor code
– Multiprogramming is hard


Base Design

• 4 MIPS Cores (250 MHz)
– Each core:
  • L1 Data Cache
  • L1 Primary Instruction Cache
– Share a single L2 Cache
– Virtual Buses (pipelined with repeaters)

• Read bus (256 bits)
– Acts as general purpose system bus for moving data between CPUs, L2, and external memory
– Wide enough to handle entire cache line (CMP explicit gain; multiprocessor systems would require too many pins)

• Write bus (64 bits)
– Writes directly from 4 CPUs to L2
– Pipelined to allow for single-cycle occupancy (not a bottleneck)
– Uses simple invalidation for caches (broadcast invalidates all other L1s)

• L2 Cache
– Point of communication (10-20 cycles)

• Bus sufficient for 4-8 MIPS cores; more cores need larger system buses


Thread Speculation

• Takes the sequence of instructions of a normal program and arbitrarily breaks it into a sequenced group of threads
– Hardware must track all interthread dependencies to ensure the program acts the same way
– Must re-execute code that follows a data violation based upon a true dependency

• Advantages:
– Does not require synchronization (different than enforcing dependencies on multiprocessor systems)
– Dynamic (done at runtime), so the programmer only needs to consider it for maximum performance
– Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted where dependencies can happen and not just where they do happen

• 5 Issues to address:


Thread Speculation

1. Forward data between parallel threads

2. Detect when reads occur too early (RAW)

3. Safely discard speculative state after violations

4. Retire speculative writes in correct order (WAW hazard)

5. Provide memory renaming (WAR hazards)


Hydra Speculation Implementation

• Takes care of the 5 issues:

– Forward data between parallel threads:
• When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated
• On a miss in L1, access L2; data in the write buffers of the current or older threads replaces data returned from L2, byte by byte

– Detect when a read occurs too early:
• Primary cache bits are set to mark possible violations; if a write to that address by an earlier thread invalidates the line, a violation is detected and the thread is restarted

– Safely discard speculative state after a violation:
• Permanent state is kept in L2; any L1 lines that hold speculative data are invalidated, and the L2 buffer for the thread is discarded (permanent state is not affected)



– Place speculative writes in memory in correct order:
• Separate speculative data L2 buffers are kept for each thread
• Must be drained into L2 in original sequence
• Thread sequencing system also sequences the buffer draining

– Memory renaming:
• Each CPU can only read data written by itself or earlier threads
• Writes from later threads don’t cause immediate invalidations (since writes from these threads should not be visible yet)
• Ignored invalidations are recorded with a pre-invalidate bit
• If a thread accesses L2, it must only access data it should be able to see from itself or earlier L2 buffers
• When the current thread completes, all currently pre-invalidated lines are checked against future threads for violations
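The forwarding, violation-detection, discard, and in-order-commit rules described on these slides can be illustrated with a very small software model. The sketch below is a toy under simplifying assumptions: it tracks whole addresses instead of cache lines and bytes, it has no caches, buses, or pre-invalidate bits, and all class and function names are hypothetical. It is not a description of the Hydra hardware.

```python
class SpeculativeThread:
    def __init__(self, seq):
        self.seq = seq          # position in original program order
        self.read_set = set()   # addresses read speculatively (stands in for L1 read bits)
        self.write_buffer = {}  # speculative writes, drained to L2 on commit
        self.restarts = 0


def spec_read(threads, t, addr, l2):
    """Return the newest value visible to t: its own or an earlier thread's write, else L2."""
    if addr not in t.write_buffer:
        t.read_set.add(addr)  # exposed read: a later write by an earlier thread violates it
    for other in sorted(threads, key=lambda x: x.seq, reverse=True):
        if other.seq <= t.seq and addr in other.write_buffer:
            return other.write_buffer[addr]
    return l2.get(addr, 0)


def spec_write(threads, t, addr, value):
    """Buffer the write and restart any later thread that already read addr (RAW violation)."""
    t.write_buffer[addr] = value
    violated = []
    for other in threads:
        if other.seq > t.seq and addr in other.read_set:
            other.read_set.clear()      # discard speculative state
            other.write_buffer.clear()
            other.restarts += 1
            violated.append(other.seq)
    return violated


def commit(threads, l2):
    """Drain write buffers into L2 in original thread order."""
    for t in sorted(threads, key=lambda x: x.seq):
        l2.update(t.write_buffer)
        t.write_buffer.clear()
        t.read_set.clear()


l2 = {"x": 1}
t0, t1 = SpeculativeThread(0), SpeculativeThread(1)
threads = [t0, t1]

early = spec_read(threads, t1, "x", l2)      # t1 reads x too early
violated = spec_write(threads, t0, "x", 5)   # t0's write exposes the violation
retry = spec_read(threads, t1, "x", l2)      # after restart, value is forwarded from t0
spec_write(threads, t1, "y", retry + 1)
commit(threads, l2)
print(early, violated, retry, l2)
```

Because later threads never see writes from threads that follow them, and buffers drain in sequence order, the final memory state matches what sequential execution would have produced for this example.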


Prototype

• MIPS-based RC32364
• SRAM macro cells
• 8-Kbyte L1 data and instruction caches
• 128 Kbytes L2
• Die is 90 mm^2, .25-micron process
• Have a Verilog model, moving to physical design using synthesis
• Central arbitration for buses will be the most difficult part: hard to pipeline, must accept many requests, and must reply with grant signals


Conclusion

• Hydra CMP
– High performance
– Cost-effective alternative to large single-processor chips
– Similar die area can achieve performance similar to a uniprocessor on integer programs using thread speculation
– Multiprogrammed or highly parallel workloads can do better than a single processor
– Hardware thread speculation is not cost intensive and can give great gains in performance
