Sungkyunkwan University, School of Information and Communication Engineering © Jun Dong Cho, Fall 2006
SCHEDULING AND TIMING ANALYSIS OF HW/SW ON-CHIP COMMUNICATION IN
MP SOC DESIGN
© Jun Dong Cho, Summer 2006
Contents
– Introduction
– Motivation
– Communication Scheduling and HW/SW Timing Analysis
– Experimental Results
– Conclusion
Introduction
• On-chip Communication Architecture in MP SoC design
• On-chip Communication Design
– Design of HW/SW Communication Architecture
– Mapping and scheduling of on-chip communication
• Contribution of this work: consideration of both
– Dynamic behavior of SW communication architecture
– Physical communication buffer sharing
Preliminaries
• Mapping/Allocation
• Communication Scheduling
Motivation
Extended Task Graph
Communication Delay Model
Communication Delay
C_i(n) = C_i^{SW}(n) + C_i^{HW}(n) + C_i^{OCN}(n)
– Communication Delay of SW Communication Architecture
C_i^{SW}(n) = R(v_i) \, [ C_{ISR}(n) + C_{CS}(n) + C_{DD}(n) ]
– Communication Delay of HW Communication Interface
C_i^{HW}(n) = n \cdot (\text{delay of HW communication interface})
– Communication Delay of On-chip Communication Network
C_i^{OCN}(n) = n \cdot D_{trans} + D_{BLOCKED}
Communication Scheduling and Timing Analysis
• ILP for Scheduling Communication Nodes and Tasks of ETG
– Data dependency constraints
t_1^{end} \le t_2^{start} and t_1^{end} \le t_3^{start}
– Resource contention constraints
– Processor and on-chip communication network
ILP for Binary variable
– Physical communication buffer contention constraints
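The constraint families named on these slides (data dependency, resource contention, buffer contention) can be illustrated by checking a candidate schedule against them. The sketch below is not the ILP formulation itself: it only evaluates the "either a finishes before b starts, or b finishes before a starts" condition that a binary ordering variable would encode. The function name, argument layout, and node names are hypothetical.

```python
def check_schedule(start, end, deps, resource_of, buffer_of=None):
    """Return a list of violated constraints for a candidate schedule.

    start/end:   {node: time}
    deps:        (pred, succ) pairs of the task graph
    resource_of: {node: processor or network name}
    buffer_of:   {comm_node: physical buffer name} (optional)
    """
    violations = []
    # Data dependency: a successor may not start before its predecessor ends.
    for p, s in deps:
        if end[p] > start[s]:
            violations.append(("dependency", p, s))
    # Resource / buffer contention: nodes sharing one resource must not overlap.
    for kind, owner in (("resource", resource_of), ("buffer", buffer_of or {})):
        nodes = sorted(owner)
        for i, a in enumerate(nodes):
            for b in nodes[i + 1:]:
                same = owner[a] == owner[b]
                if same and start[a] < end[b] and start[b] < end[a]:
                    violations.append((kind, a, b))
    return violations
```

An empty result means the schedule is feasible under these checks; an ILP solver would search for start times that make the list empty while minimizing the overall finish time.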
Heuristic Algorithm
• List scheduling
LIST (G(V,E), a) {
  l = 1;
  repeat {
    for each resource type k = 1, 2, …, n_res {
      Determine candidate tasks U_{l,k};
      Determine unfinished tasks T_{l,k};
      Select S_k ⊆ U_{l,k} nodes, such that |S_k| + |T_{l,k}| ≤ a_k;
      Schedule the S_k tasks at time l by setting t_i = l ∀ i : v_i ∈ S_k;
    }
    l = l + 1;
  } until (v_n is scheduled);
  return (t);
}
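A minimal Python sketch of the list-scheduling loop may make the pseudocode concrete. It assumes an acyclic task graph, integer durations of at least one step, and a resource limit for every resource type in use. Candidates are picked in alphabetical order, which stands in for a real priority function; the function name and data layout are hypothetical.

```python
from collections import defaultdict

def list_schedule(tasks, deps, limits):
    """Resource-constrained list scheduling.

    tasks:  {name: (resource_type, duration)}
    deps:   iterable of (pred, succ) edges
    limits: {resource_type: units available}  (the vector a)
    Returns {name: start_step}, with steps numbered from 1.
    """
    preds = defaultdict(set)
    for p, s in deps:
        preds[s].add(p)

    start = {}
    step = 1
    while len(start) < len(tasks):
        for rtype, cap in limits.items():
            # T_{l,k}: tasks of this type still running at this step
            busy = sum(1 for t, s in start.items()
                       if tasks[t][0] == rtype and s + tasks[t][1] > step)
            # U_{l,k}: unscheduled tasks whose predecessors have all finished
            ready = sorted(
                t for t in tasks
                if t not in start and tasks[t][0] == rtype
                and all(p in start and start[p] + tasks[p][1] <= step
                        for p in preds[t]))
            # S_k: as many candidates as the free units allow
            for t in ready[:max(0, cap - busy)]:
                start[t] = step
        step += 1
    return start
```

Treating communication nodes as tasks on a "bus" resource, as in the extended task graph, lets the same loop schedule both computation and communication.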
Experiments
Experimental Results
• Execution delay of tasks in the H.263 system

Functional Block    Execution Cycle
Source              1386
Motion predictor    1123650
Macro block         875358
Encoder             875358
VLC                 2103156

• Delay of software communication architecture
– Software communication architecture delay measured by ISS

Services      Execution Cycle
ISR           857
CS            1060
DD (read)     5376
DD (write)    6528
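As a worked example, the measured service costs can be plugged into the delay model. The sketch assumes, purely for illustration, that one transfer incurs exactly one ISR, one context switch (CS), and one device-driver (DD) call, and that the HW interface and network delays are supplied by the caller; the function names are hypothetical.

```python
# Cycle counts taken from the ISS measurements listed above.
ISR, CS = 857, 1060
DD = {"read": 5376, "write": 6528}

def sw_comm_delay(direction):
    """SW delay of one transfer, assuming ISR + CS + DD simply add up."""
    return ISR + CS + DD[direction]

def comm_delay(direction, hw_cycles, ocn_cycles):
    """Total delay C = C_SW + C_HW + C_OCN; hw/ocn values are caller-supplied."""
    return sw_comm_delay(direction) + hw_cycles + ocn_cycles
```

Under these assumptions a read costs 7293 cycles and a write 8445 cycles in software alone, which is several times the 1386 cycles of the Source block and shows why the SW communication architecture cannot be ignored in timing analysis.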
Execution Time of H.263 encoder
Execution Time of JPEG and IS-95 example
Conclusion
• On-chip Communication Design
– Design of HW/SW Communication Architecture
– Mapping and scheduling of on-chip communication
• Communication Scheduling and Timing Analysis
– ILP Formulation
– Heuristic
• Consideration of
– Dynamic behavior of SW communication architecture
– Physical communication buffer sharing
• Future Work
– To extend the approach to complicated on-chip networks
– To design an on-chip communication scheduler
Deep-Submicron Low-Power Design
Communication Architecture of MP-SoC Platform
Sungkyunkwan University, Jun Dong Cho
Talk Outline
• MP-SoC Architectures:
– Homogeneous Architecture
– Heterogeneous Architecture
• Crossbar Interconnection Models
• MP-SoC Evaluation Methodology
Networks-on-Silicon, Philips
Albert van der Werf, Philips Research
DSP based commercial SoC
MP-SOC Platforms?
• Platform: an architecture that is designed for an application domain (responding to consumer demand and anticipating future changes)
• Multiprocessor systems-on-chips (usually heterogeneous multiprocessors):
– CPUs, DSPs, etc.
– Hardwired accelerators
– Mixed-signal front end
SoC topics …
• Task-level analysis and optimizations
– Scheduling and resource sharing
– Mapping data flow to the target HW
– Task distributed splitting and merging
• Efficient communication optimization
– BW and memory allocation
– Synchronization
– Buffers
– Routing: shortest-path routing with minimal router logic
– Irregular meshes
– Mapping to topology: what topology will suit a particular application?
From computation-centric to communication-centric architectures
– System optimization: latency, throughput, BW
– Interconnect consumes up to 50% of energy and growing…
– Scalable
– Should also be optimized
• Topology
• Links BW
• Place & Route
• Routing protocol
4G: Multiple standards, Software Defined Radio & Multimedia
• A number of components might be placed on a single die in order to decrease the production cost, e.g., H.264 + MPEG-4
• Multi-DSP: WiBro, MoIP
• Maximum parallelism!
– Parallel data transfer among the components
– Read, write, and calculate processes should be decoupled
[Figure: block diagram with DSP1, DSP2, …, DSPn, TX, RAM, and network]
Why Multi-Threaded Cores?
• Increasing gap between memory and processor speeds (2x / 2 years)
• Increasing gap between interconnect and gate delays (multi-clock intra-chip delay)
• More parallel processing (lower power, higher performance/mm2)
[Figure: STMicroelectronics MultiFlex MP-SoC, showing DSPs, a RISC core, I$/D$ caches, SRAM, In/Out ports, H/W O/S schedulers, and a NoC]
Exploitable Parallelism
[Figure: exploitable task parallelism versus minimum parallel grain size (instructions) for Instruction-Level Parallelism, MultiFlex Thread-Level Parallelism, and GP O/S Thread-Level Parallelism; grain-size labels 1, 100's, and 10 000's of instructions; parallelism labels 1~8, 2~6, and 1~100]
Execution speed-up: MPEG4 VGA Video Codec Results
[Figure: frames/sec (0 to 35) versus number of ARMs (2 to 6) for 2, 4, and 8 threads; curves: theoretical upper bound, MultiFlex result (STBus), and MultiFlex (0-latency bus)]
Intel IXP1200 Network Processor
IXP1200 MicroEngine
Holistic design of multi-core architectures
• Naïve methodology is inefficient
• Demonstrated inefficiency for cores and proposed alternatives
– Single-ISA Heterogeneous Multi-core Architectures for Power [MICRO03]
– Single-ISA Heterogeneous Multi-core Architectures for Performance [ISCA04]
– Conjoined-core Chip Multiprocessing [MICRO04]
• What about interconnects?
– How much can interconnects impact processor architecture?
– Need to be co-designed with caches and cores?
• Drew Wingard, CTO, Sonics, Inc.
– Delivering clocks is problematic
– Wire delay is dominant
– Too many constraints, from too many blocks
Various Interconnect Topologies
• Linear (pipeline)
• Bus (old style)
• Ring (cheap & slow)
• Star (switch)
• Tree (PCI-Express)
• Mesh (NoC)
• Fully connected (when do we need it?)
• Hybrid (dedicated)
• Ring w/ bypass (async)
State of the Art: Network on Chip
Networks are preferred over buses:
• Higher bandwidth
• Concurrency, effective spatial reuse of resources
• Higher levels of abstraction
• Modularity - Design Productivity Improvement
• Scalability
Network on Chip
The idea:
• Decouple communication and computation
• Simple routers instead of repeaters
• Route packets instead of wires
• Provide different services on the same infrastructure
[Figure: array of modules]
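To illustrate "route packets instead of wires" on a mesh, the sketch below uses dimension-order (XY) routing, a common minimal-path scheme chosen here as an assumption; the slides do not say which routing algorithm is meant. Routers are identified by hypothetical (x, y) grid coordinates.

```python
def xy_route(src, dst):
    """Routers visited by a packet on a 2-D mesh: travel along X first, then Y."""
    (x, y), (dx, dy) = src, dst
    path = [(x, y)]
    while x != dx:
        x += 1 if dx > x else -1
        path.append((x, y))
    while y != dy:
        y += 1 if dy > y else -1
        path.append((x, y))
    return path
```

Each router only compares its own coordinates with the destination carried in the packet header, so no routing table is needed and the hop count equals the Manhattan distance between source and destination.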
Communication Centric Design Flow, Jeremy
[Flowchart boxes: Application; Architecture Library; Architecture / Application Model; Evaluate; Analyse / Profile; decision "Good?" with a "No" branch; Configure; Refine; NoC Optimisation; Synthesis; Optimized NoC]
System Development Flow of QNoC
[Flowchart boxes: Connect modules with an ideal network and measure traffic; Place modules; Map traffic to grid using given QNoC architecture; Balance utilization, minimize cost, and meet QoS; Analyse / Profile; decision "Good?" with a "No" branch; Refine QNoC; NoC Optimisation; Synthesis; Optimized NoC]
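MPSoC design time application optimization and exploration
[Flowchart boxes: Application specification; Application analysis, partitioning, transformation, and exploration; Refined application specification + Pareto curve generator; Platform manager; Resource allocation; Platform verification; decision "QoS satisfied?" with a "No" branch; Refine]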
QNoC (Technion) vs. Alternative Solutions
Mesh (4x4): Uniform scenario (Same QoS):

Arch.   Frequency   Utilization   Av. Link Width
QNoC    1 GHz       30%           28
Bus     50 MHz      50%           3 700
PTP     100 MHz     80%           6

[Figure: "Wire-Length (Area) and Power" and "Cost" bar charts for Bus, QNoC, and PTP on a log scale from 0.1 to 100; bar labels in the source: 45.0, 3.8, 1.0, 1.0, 2.9, 0.8]
NoC Design solution
• Explore various topology options to determine those that best meet the system objectives.
• Tune the arbitration scheme and network links (for example, burst lengths, communication FIFO sizes, etc.).
• NoCcompiler is used to create the NoC instance (packet format options, burst types, special transactions, maximum packet payload length, network port configurations, etc.).
• A cycle-accurate simulation model can then be generated to verify with SystemC or RTL simulation.
Danube NoC IP
• Support for OCP 2.0, AMBA AHB, and AMBA AXI socket interfaces
• Clock frequency up to 750 MHz in a 90 nm process
• GALS links for spanning distance and crossing clock boundaries
• Unlimited user-defined topology through Arteris NoCexplorer and NoCcompiler
• Flexible pipelining and FIFO management
• Customizable NoC Transaction and Transport Protocol (NTTP) packet format
• On-chip protocol for runtime SoC application debug
• Lightweight service bus for runtime debug, error management, and register configuration
SoftStream HERA 3000(mediaexcel)(2)
PCI board ( can see 6 DSPs)
KOMPROCESSOR AVC (Ateme)
• The Kompressor board is a multi-encoding short PCI board
• MPEG-4 AVC / H.264 live encode, IP streaming, and simultaneous file recording
• TMS320C64x DSPs from Texas Instruments used
• FPGA: Cyclone from Altera used
• Input: NTSC or PAL
• Video compression: MPEG-4 AVC / H.264 (ISO/IEC 14496-10) Baseline, Main, High Profiles
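Applications and Outlook
• Real-time video surveillance: IP cameras and streamers, network video recorders (NVR), etc.
• Meets the performance and integration requirements of the high-performance image-processing market (multifunction printers and scanners, film-processing equipment, image-recognition systems), as well as broadcast and IP TV uses such as encoding, streaming, and transcoding equipment, video-conferencing systems, and video phones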
HiBRID-SoC Architecture
HiBRID-SoC multi-core system-on-chip architecture
Integrates a powerful on-chip communication structure
A well-balanced memory system to account for the growing amount of data (e.g., in the area of video, MPEG-4 Part 10 or Advanced Video Coding (AVC))
Dedicated chips for the MPEG-4 Simple Profile, which consists of a very general processing demand
Three programmable cores, each adapted towards a specific class of algorithms
Combination of the cores and their software development environment
An extension of a programmable core with dedicated modules (e.g., TriMedia)
HiBRID-SoC multi-core
Developed at the University of Hannover
Morpho (MS1-16 v002)
• 16 16x16-bit MAC operations/cycle @500MHz
• 4.8 GMAC/s @500MHz
• 420 and 480 TBGA packaging (0.18u/0.13u)
• Core voltages at 1.8V/1.2V and 3.3V
http://www.morphotech.com/
SandBridge (SB3000)
• Applications
– MPEG capture and playback for video or videoconferencing
– JPEG capture and playback for camera and display functions
– MP3 capture and playback for music or ringtone functions
– Other computation-intensive multimedia functions (e.g., speech control)
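NEC and ARM
• Developing a new chip for phones and consumer electronics that places two or more processing cores and compute units inside a single microprocessor.
• Handset makers also have many ways to use it. For example, one core can handle voice calls while another manages Internet traffic.
• Recorded performance on par with the 1.2 GHz ARM11 processor family: up to 1440 DMIPS at about 600 mW power consumption, in a 130 nm process.
• Linux SMP OS port: reduced power consumption and automatic balancing of application load across processors.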
Efficient Shared DRAM Subsystems for SOCs, Sonics Inc.
• Increases SOC performance
– Improves efficiency of off-chip DRAM by up to 40%
– Guarantees Quality of Service for on-chip cores
• Lowers SOC costs
– Consolidates and reduces multiple distributed buffers
– Single Smart Interconnect replaces multiple layered busses
• Shortens time to market
– Smart Interconnect removes the wire routing problem of classical architectures
– Accurate architectural exploration ensures functionality in the first days of development
• Increased market penetration
– DRAM technology selection decoupled from the rest of the SOC
– Threaded architecture enables easy scalability without re-design of the memory subsystem
Shared Bus and DRAM Subsystem
The traditional computer bus organization suffers from low DRAM efficiency and a lack of quality-of-service.
Star Topology Access to a Shared DRAM Subsystem
• The DRAM controller
1) sees all initiator requests at the same time and can select the order of servicing them;
2) optimizes the performance of the DRAM subsystem by reordering requests;
3) provides flexible quality-of-service to each of the initiators;
4) causes a large number of wires to converge on the DRAM controller, producing physical problems for the design.
SiliconBackplane and DRAM Scheduler
The shared μnetwork remedies the wire congestion problem. The DRAM scheduler addresses both the DRAM performance issues and the quality-of-service guarantees by selectively scheduling the DRAM accesses.
Unified DRAM and QoS Scheduling
Set-top-box System
Bandwidth Profile of Bus with Round-Robin Arbitration
Many of the initiators receive less than their required bandwidth, so overall application requirements are unsatisfied.
Bandwidth Profile of Bus with Priority Arbitration
From 5000 to 9000 cycles all but two initiators (CPU and DSP) receive no service at all. Clearly, this is unacceptable for the set-top-box application.
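The starvation effect described for the two arbitration schemes can be reproduced with a toy simulation. This is only a sketch under stated assumptions: the bus serves one request per cycle, each initiator issues a request at a fixed period, and the initiator names and periods are hypothetical rather than taken from the set-top-box profile.

```python
def simulate(periods, cycles, policy):
    """Count grants per initiator on a bus that serves one request per cycle.

    periods: {name: cycles between new requests}; dict order is priority order.
    policy:  "priority" or "round_robin".
    """
    names = list(periods)
    pending = {n: 0 for n in names}
    served = {n: 0 for n in names}
    last = -1  # index of the most recently served initiator
    for cycle in range(cycles):
        for n in names:
            if cycle % periods[n] == 0:
                pending[n] += 1
        if policy == "priority":
            order = names
        else:
            # Round-robin: start searching just after the last winner.
            order = names[last + 1:] + names[:last + 1]
        for n in order:
            if pending[n]:
                pending[n] -= 1
                served[n] += 1
                last = names.index(n)
                break
    return served
```

With `{"CPU": 2, "DSP": 2, "Video": 4}` the offered load exceeds the bus capacity: fixed priority gives every grant to CPU and DSP and none to Video, while round-robin serves all three but cannot honor any individual bandwidth requirement. Neither policy allocates bandwidth according to need, which is the gap a QoS-aware scheduler is meant to close.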
Bandwidth Profile of Sonics Solution
Each of the initiators is connected to the DRAM through a Silicon Backplane μNetwork, a DRAM scheduler, and a DRAM controller. The Silicon Backplane and DRAM bandwidth have been allocated to the different initiators according to their needs. All application requirements are met, and overall DRAM utilization stays steady at around 70%.
References
• Terry Tao Ye, On-Chip Multiprocessor Communication Network Design and Analysis, Ph.D. Dissertation, Stanford Univ.
• E. Bolotin, et al., Automatic hardware-Efficient SoC Integration by QoS network on Chip, Israel Institute of Tech, Haifa, Israel.
• E. Bolotin, et al., Efficient Routing in Irregular Topology NoCs, Technion - Israel Institute of Tech
The Stanford Hydra CMP
• Lance Hammond
• Benedict A. Hubbert
• Michael Siu
• Manohar K. Prabhu
• Michael Chen
• Kunle Olukotun
Presented by Jason Davis
Introduction
• Hydra CMP with 4 MIPS processors
• L1 cache for each CPU and an L2 cache that holds the permanent state
• Why?
– Moore's law is reaching its end
– Finite amount of ILP
– TLP (Thread-Level Parallelism) vs. ILP in pipelined architecture
– CMP can use ILP as well (TLP and ILP are orthogonal)
– Wire delay
– Design time (the CPU core doesn't need to be redesigned; just increase the number of cores)
• Problems
– Integration densities are just now giving reasons to consider new models
– Difficult to convert uniprocessor code
– Multiprogramming is hard
Base Design
• 4 MIPS cores (250 MHz)
– Each core:
• L1 data cache
• L1 primary instruction cache
– Share a single L2 cache
– Virtual buses (pipelined with repeaters)
• Read bus (256 bits)
– Acts as a general-purpose system bus for moving data between CPUs, L2, and external memory
– Wide enough to handle an entire cache line (an explicit gain of CMP; multiprocessor systems would require too many pins)
• Write bus (64 bits)
– Writes directly from the 4 CPUs to L2
– Pipelined to allow for single-cycle occupancy (not a bottleneck)
– Uses simple invalidation for caches (broadcast invalidates all other L1s)
• L2 cache
– Point of communication (10-20 cycles)
• Bus sufficient for 4-8 MIPS cores; more cores need larger system buses
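A back-of-the-envelope calculation gives a feel for what these bus widths mean. The sketch assumes the buses are clocked at the 250 MHz core frequency and move one full-width word per cycle; neither assumption is stated on the slide, so the numbers are upper bounds for illustration only.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits, clock_hz, transfers_per_cycle=1):
    """Upper bound on bus throughput; ignores arbitration and stalls."""
    return bus_width_bits // 8 * clock_hz * transfers_per_cycle

CLOCK_HZ = 250_000_000  # assumption: buses run at the 250 MHz core clock

read_bw = peak_bandwidth_bytes_per_s(256, CLOCK_HZ)   # 8_000_000_000 bytes/s
write_bw = peak_bandwidth_bytes_per_s(64, CLOCK_HZ)   # 2_000_000_000 bytes/s
```

Under these assumptions the read bus peaks at 8 GB/s and the write bus at 2 GB/s, shared by all cores, which is consistent with the remark that more than 4-8 cores would need larger system buses.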
Base Design
Parallel Software Performance
Thread Speculation
• Takes the sequence of instructions of a normal program and arbitrarily breaks it into a sequenced group of threads
– Hardware must track all inter-thread dependencies to ensure the program acts the same way
– Must re-execute code that follows a data violation based upon a true dependency
• Advantages:
– Does not require synchronization (different from enforcing dependencies on multiprocessor systems)
– Dynamic (done at runtime), so the programmer only needs to consider it for maximum performance
– Conventional parallelizing compilers miss a lot of TLP because synchronization points must be inserted where dependencies can happen, not just where they do happen
• 5 issues to address:
1. Forward data between parallel threads
2. Detect when reads occur too early (RAW)
3. Safely discard speculative state after violations
4. Retire speculative writes in correct order (WAW hazard)
5. Provide memory renaming (WAR hazards)
Hydra Speculation Implementation
• Takes care of the 5 issues:
– Forward data between parallel threads:
• When a thread writes to the bus, newer threads that need the data have their current cache lines for that data invalidated
• On a miss in L1, access L2; the write buffers of the current or older threads replace the data returned from L2 byte-by-byte
– Detect when a read occurs too early:
• Primary cache bits are set to mark possible violations; if a write to that address by an earlier thread invalidates the line, a violation is detected and the thread is restarted
– Safely discard speculative state after a violation:
• Permanent state is kept in L2; any L1 lines that hold speculative data are invalidated, and the L2 buffer for the thread is discarded (permanent state is not affected)
Hydra Speculation Implementation
– Place speculative writes in memory in correct order:
• Separate speculative-data L2 buffers are kept for each thread
• Must be drained into L2 in original sequence
• Thread sequencing system also sequences the buffer draining
– Memory renaming:
• Each CPU can only read data written by itself or earlier threads
• Writes from later threads don't cause immediate invalidations (since writes from these threads should not be visible yet)
• Ignored invalidations are recorded with a pre-invalidate bit
• If a thread accesses L2, it must only access data it should be able to see from itself or earlier L2 buffers
• When the current thread completes, all currently pre-invalidated lines are checked against future threads for violations
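The five mechanisms can be mimicked in a few lines of software. The class below is a toy model, not Hydra's hardware: it keeps one write buffer and one read set per thread instead of cache tag bits, works at whole-address granularity, and restarts only the thread that read too early. All names are hypothetical.

```python
class SpeculativeMemory:
    """Toy model: per-thread write buffers in front of a shared L2 dict."""

    def __init__(self, n_threads):
        self.l2 = {}                                          # permanent state
        self.buffers = [dict() for _ in range(n_threads)]     # speculative writes
        self.read_sets = [set() for _ in range(n_threads)]    # exposed reads
        self.restarts = []                                    # restarted thread ids

    def read(self, tid, addr):
        # Forwarding and renaming: newest value from this thread or an
        # earlier one, else L2. Later threads' buffers are never consulted.
        for t in range(tid, -1, -1):
            if addr in self.buffers[t]:
                if t != tid:
                    self.read_sets[tid].add(addr)
                return self.buffers[t][addr]
        self.read_sets[tid].add(addr)
        return self.l2.get(addr, 0)

    def write(self, tid, addr, value):
        self.buffers[tid][addr] = value
        # RAW violation: a later thread already read this address too early.
        for later in range(tid + 1, len(self.buffers)):
            if addr in self.read_sets[later]:
                self.buffers[later].clear()    # discard speculative state
                self.read_sets[later].clear()
                self.restarts.append(later)

    def commit(self, tid):
        # Called in thread order, so buffers drain into L2 in program order.
        self.l2.update(self.buffers[tid])
        self.buffers[tid].clear()
        self.read_sets[tid].clear()
```

For example, if thread 1 reads `x` before thread 0 writes it, the write records a restart of thread 1; when thread 1 re-executes its read, it receives the forwarded value from thread 0's buffer, and committing the threads in order leaves L2 in the same state a sequential run would produce.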
Hydra Speculation Implementation
Hydra Speculation Implementation
Speculation Performance
Prototype
• MIPS-based RC32364
• SRAM macro cells
• 8-Kbyte L1 data and instruction caches
• 128-Kbyte L2
• Die is 90 mm^2, 0.25-micron process
• Have a Verilog model; moving to physical design using synthesis
• Central arbitration for buses will be the most difficult part: hard to pipeline, must accept many requests, and must reply with grant signals
Prototype
Prototype
Conclusion
• Hydra CMP
– High performance
– Cost-effective alternative to large single-chip processors
– A similar die area can achieve performance similar to a uniprocessor on integer programs using thread speculation
– Multiprogrammed or highly parallel workloads can do better than a single processor
– Hardware thread speculation is not cost intensive and can give great gains in performance