Contents
• Section I
  – Introduction to reconfigurable computing
  – FPGA logic/routing architecture
• Section II
  – Core-embedded FPGA
  – ALTERA/XILINX/TRISCEND/SiDSA
• Section III
  – Multiple-FPGA architecture
  – Emulation/simulation acceleration using FPGAs
Introduction
• Design execution methodology
  – Hardware
    • Very fast and efficient
    • No alteration after fabrication
    • Redesign and refabrication are expensive
  – Software-programmed processors
    • A set of instructions determines a specific operation.
    • Functionality can be easily changed.
    • Performance is far below that of an ASIC.
Reconfigurable Computing
• Fills the gap between hardware and software
  – An FPGA is an array of computational elements and the routing wires among them.
  – The configuration is determined by programmable configuration bits.
• Development
  – 1963: Concept of “restructurable computing” appeared.
  – 1980s: FPGA technology developed as a hybrid device between PALs and MPGAs (Mask-Programmable Gate Arrays) by Xilinx, Altera, Lucent, QuickLogic, and others.
  – SRAM-programmable FPGA: high density
  – 1999 to now: Core-embedded FPGAs incorporate both a programmable processor and an FPGA.
Logic Block
• LUT-based logic block
  – Efficient logic block architecture adopted in many commercial FPGAs
  – Composed of a LUT, a DFF (latch), and a mux
[Figure: LUT-based logic block with a 4-LUT (inputs I1 to I4), carry logic (Cin, Cout), a DFF, and output Out]
Logic Block
• 4-LUT
  – Any function of 4 input variables can be implemented.
• FF
  – Used for pipelining and registers
  – Can be configured as a latch
  – Clock signals come from global signals routed on special resources (global net)
• Carry logic
  – Speeds up carry-based arithmetic functions
  – Bypasses the general routing resources and connects directly to the neighboring CLB
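The claim that a 4-LUT can implement any function of 4 inputs can be illustrated with a minimal sketch: the LUT is simply a 16-entry truth table stored in configuration bits, and the inputs form the address into that table. The function names `make_lut4` and `lut4_eval` and the bit ordering (I1 as the least significant address bit) are assumptions made for this illustration, not a vendor format.

```python
def make_lut4(func):
    """Build the 16 configuration bits of a 4-input LUT from a Boolean function."""
    bits = 0
    for index in range(16):
        # Assumed ordering: I1 is address bit 0, I4 is address bit 3.
        i1 = (index >> 0) & 1
        i2 = (index >> 1) & 1
        i3 = (index >> 2) & 1
        i4 = (index >> 3) & 1
        if func(i1, i2, i3, i4):
            bits |= 1 << index
    return bits


def lut4_eval(config_bits, i1, i2, i3, i4):
    """Evaluate the LUT: the inputs select one of the 16 stored bits."""
    index = i1 | (i2 << 1) | (i3 << 2) | (i4 << 3)
    return (config_bits >> index) & 1


# Two different functions, same hardware, different configuration bits.
parity_bits = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
and_bits = make_lut4(lambda a, b, c, d: a & b & c & d)
```

Changing the function only changes the 16 stored bits, which is why the same logic block can be reused for arbitrary 4-input logic.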
Interconnection Architecture
• Island-style FPGA routing architecture
  – Routing architecture of most FPGAs
  – Sea of routing resources for connection between rows and columns of logic blocks
  – Connection blocks: programmable multiplexers that select which signals in the given routing channel are connected to the logic block’s terminals
  – Switch box: connections between horizontal and vertical routing resources
Interconnection Architecture
[Figure: island-style routing architecture]
Interconnection Architecture
• Routing resources with various lengths
  – Local interconnections: routing between logic blocks (e.g., dedicated carry chain)
  – Medium-length lines: routing wires that span the width of several logic blocks
  – Long lines: routing wires that run the whole chip height or width
  – Global lines: routing wires that cover the entire area of the chip
    • High-speed, low-skew connections to all logic blocks
    • Usually used for clocks and resets
Two Routing Architectures
• Segmented routing architecture
  – Short wires carry local communication traffic.
  – Long wires are frequently used to travel long distances without passing through many switches.
  – Research questions
    • How many wires should each channel contain?
    • How many types of long wires would be efficient?
    • What proportion of the routing resources should each wire type take?
  – Companies: Xilinx, Lucent, Vantis
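The benefit of long wires can be made concrete with a rough sketch that counts how many programmable switches a signal passes through when a route is built from segments of different lengths. This is a simplified, hypothetical model (straight-line route, greedy choice of the longest wire that fits, one switch at each junction between consecutive segments); the function names are illustrative only.

```python
def segments_used(distance, wire_lengths):
    """Greedily cover `distance` logic blocks with the longest wires that fit.

    Returns the list of segment lengths used, longest first.
    """
    if distance < 0:
        raise ValueError("distance must be non-negative")
    if 1 not in wire_lengths:
        raise ValueError("length-1 wires are needed so every distance is reachable")
    remaining = distance
    used = []
    for length in sorted(set(wire_lengths), reverse=True):
        count, remaining = divmod(remaining, length)
        used.extend([length] * count)
    return used


def switches_traversed(distance, wire_lengths):
    """Assumed cost model: one switch at each junction between two segments."""
    return max(len(segments_used(distance, wire_lengths)) - 1, 0)
```

Under these assumptions, a 13-block route needs 12 switches with only length-1 wires but 3 switches when length-4 wires are also available, which is the trade-off behind the research questions listed above.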
Two Routing Architectures
• Hierarchical routing architecture
  – Cluster-based routing architecture
    • Routing within a cluster is at the local level, connecting only within that cluster.
    • Longer wires connect different clusters together.
  – Each routing level contains several clusters.
  – Background
    • Most connections between logic blocks are local, with only a limited amount of communication traversing long distances.
  – A good placement algorithm is required.
  – Company: ALTERA
Two Routing Architectures
[Figure: segmented routing vs. hierarchical routing; labels: logic blocks, connection switches, cluster]
Heterogeneous Architecture
• Multiplier embedding
  – Multiplier implementation in FPGA logic is usually inefficient.
  – Custom/configurable hardware for multiplication with various operand widths and a choice of signed/unsigned can be embedded using a reconfigurable array of FABs (special full-adder blocks).
  – (Haynes, Field-Programmable Custom Computing Machines, 1998)
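To see why a multiplier built from general-purpose logic is costly, the following sketch multiplies two unsigned numbers using only AND gates and full adders, the same primitive an adder-block array is built around. It is a generic shift-and-add illustration, not the FAB design from the cited paper; all function names are assumptions for this example, and signed operands are not handled.

```python
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout


def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists; result has one extra bit."""
    carry = 0
    out = []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    out.append(carry)
    return out


def to_bits(value, width):
    return [(value >> i) & 1 for i in range(width)]


def from_bits(bits):
    return sum(bit << i for i, bit in enumerate(bits))


def multiply_unsigned(x, y, width):
    """Shift-and-add multiplication of two `width`-bit unsigned values."""
    acc = [0] * (2 * width)
    x_bits = to_bits(x, width)
    for j, y_bit in enumerate(to_bits(y, width)):
        # Partial product: x AND-ed with one bit of y, shifted left by j.
        partial = [0] * j + [x_bit & y_bit for x_bit in x_bits]
        partial += [0] * (2 * width - len(partial))
        # The product fits in 2*width bits, so the extra carry bit is always 0.
        acc = ripple_add(acc, partial)[: 2 * width]
    return from_bits(acc)
```

Each of the `width` partial products goes through a `2 * width`-bit adder row, so the number of adder cells grows roughly with the square of the operand width, which is the overhead that dedicated multiplier hardware avoids.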
Heterogeneous Architecture
• Embedded memory blocks
  – Use of available LUTs as RAM structures (Xilinx XC4000, Virtex FPGAs)
  – Dedicated memory blocks within the array (Xilinx Virtex and Altera FPGAs)
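Using a LUT as RAM works because the LUT's truth-table cells are already a small memory addressed by the LUT inputs. The sketch below models a 4-input LUT as a 16 x 1-bit RAM; the class name `LutRam16x1` and its methods are hypothetical names for this illustration and do not correspond to a vendor primitive.

```python
class LutRam16x1:
    """A 4-input LUT viewed as a 16 x 1-bit RAM: the LUT inputs are the address."""

    def __init__(self):
        # The same 16 cells that would otherwise hold a fixed truth table.
        self.bits = 0

    def _check(self, addr):
        if not 0 <= addr < 16:
            raise ValueError("address must be in 0..15")

    def write(self, addr, value):
        """Store one bit at the given 4-bit address."""
        self._check(addr)
        if value & 1:
            self.bits |= 1 << addr
        else:
            self.bits &= ~(1 << addr)

    def read(self, addr):
        """Read back the bit, exactly as a LUT lookup would."""
        self._check(addr)
        return (self.bits >> addr) & 1
```

Wider memories follow by placing several such LUTs side by side, one per data bit.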
Xilinx Virtex Architecture
• Block SelectRAM is embedded in the logic block array as a column.
Heterogeneous Architecture
• Processor embedding
  – In late 2000, several commercial FPGA companies announced plans to include entire microprocessors.
  – Altera
    • ARM9-based Excalibur device
  – Xilinx
    • PowerPC-based Virtex-II device
  – Triscend
    • 8051/ARM-based SoC integration platform
Core-Embedded FPGAs
• ALTERA
  – Excalibur™
    • ARM-embedded FPGA
  – Stratix™
    • Currently without an ARM core. Excalibur’s next version is under development.
• XILINX
  – Virtex-II Pro™
    • IBM PowerPC-embedded FPGA
• Triscend
  – A7
    • ARM-embedded FPGA
  – E5
    • 8051-embedded FPGA
ALTERA’s Excalibur
• ARM9 core integrated with FPGA
  – “SOPC (System On a Programmable Chip)”
  – C/C++ compiler/debugger integrated in the FPGA compiler
• Interface between processor and FPGA
  – AMBA (Advanced Microcontroller Bus Architecture)
  – A widely used internal bus architecture for SoCs
  – The ARM processor and the FPGA block are connected through the AMBA bus.
ALTERA’s Excalibur
[Figure: Excalibur block diagram with three clock domains]
• Clock Domain 1 (AHB1): up to 180 MHz
• Clock Domain 2 (AHB2): up to 90 MHz
• Clock Domain 3 (PLD): up to 100 MHz
ALTERA’s Excalibur
• AHB1
  – Bridge to AHB2
  – Interrupt controller, watchdog timer
  – Single-port and dual-port SRAM
  – The embedded processor is the only bus master on AHB1.
ALTERA’s Excalibur
• AHB2
  – The PLD exchanges data with the memories, the UART, or a PLD slave.
  – Dedicated interfaces between the stripe (processor and peripherals) and the PLD
XILINX’s Virtex-II Pro
• PowerPC core integrated with FPGA
  – “Platform FPGA architecture”
  – Up to four PPC cores can be integrated.
• Interface between processor and FPGA
  – CoreConnect bus
    • PLB (Processor Local Bus)
    • DCR (Device Control Register) bus
  – OCM (On-Chip Memory) interface
    • Dedicated interface between the block RAM and the OCM signals of the PPC core
Virtex-II Pro Block Diagram
[Figure: Virtex-II Pro block diagram]
• PowerPC core (this block diagram contains two PPC cores)
• Block RAM and multiplier blocks
• Configurable logic block array
PPC Core Block
[Figure: PPC 405 core with OCM controllers, control blocks, block RAMs, and the DCR bus]
• The OCM controller is a dedicated interface between the PPC and block RAM.
• Block RAM can be configured as instruction-side block RAM (ISBRAM) or data-side block RAM (DSBRAM).
• Fixed memory-access latency guarantees higher-speed execution.
• Block RAM can be configured as dual-port RAM (data communication between the PPC and the FPGA).
• PLB master interface ports are at the boundary of the PPC core.
Triscend’s E5/A7
• E5/A7
  – “CSoC (Configurable System-on-Chip)”
  – E5 contains an 8051 core, a CSL (Configurable System Logic) matrix, and peripheral interfaces (JTAG, DMA, timer, FIFO).
  – A7 contains an ARM core instead of the 8051.
• CSI (Configurable System Interconnect)
  – Bus developed by Triscend
  – Pipelined bus architecture for performance optimization
Triscend E5/A7
• The bus architecture allows the bus to be extended throughout the whole chip while preserving high performance.
  – The internal system bus is extended throughout the user-configurable system logic.
• Objectives
  – Inclusion of any processor is possible.
  – High performance is assured regardless of the CSL size.
Triscend’s A7 Architecture
• CSI bus
  – Configurable System Interconnect
  – Masters of CSI
    • ARM
    • JTAG (configuration)
    • DMA0, DMA1, DMA2, DMA3
  – Sideband signals
    • A small number of dedicated signals for the UART and timer
Triscend’s CSL Matrix
• Vertical/horizontal breakers
  1. Vertical: address decoder part of CSI
  2. Horizontal: data read/write port of CSI
• Selector
  1. Decodes the address
  2. Registers are arranged in a vertical column of CSL cells
  3. Pre-programmed at initialization
Triscend’s System Architecture
[Figure: CPU, DMA, and JTAG masters connected through a bus FIFO/arbiter for multiple masters to the CSL, RAM, ROM, and memory interface]
• A bus master requires a grant signal from the arbiter.
• The CPU initially runs boot code, which configures the CSL and stores program/data.
CSI Bus Architecture
[Figure: CSI bus architecture. Masters and the arbiter connect through the bus FIFO, selectors, and pipe registers to the CSL and the dedicated slaves. Signal groups: Master Write (address/data/control), Slave Write (address/data/control), Slave Read (data/control), Master Read (data/control)]
Pipelined Write Transaction
[Figure: CSI bus diagram (masters, arbiter, bus FIFO, selectors and pipe registers, CSL, dedicated slaves) annotated with time slots T, T+1, and T+2 for a write transaction]
Pipelined Read Transaction
[Figure: CSI bus diagram (masters, arbiter, bus FIFO, selectors and pipe registers, CSL, dedicated slaves) annotated with time slots T, T+1, T+2, and T+3 for a read transaction]
Pipeline in View of Bus Logic
[Figure: pipeline stages across time slots T, T+1, T+2, and T+3; labels: master, arbiter, address/data, configure selector/decode, read from CSL, bus FIFO, data from CSL to master]
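The effect of pipelining the bus can be estimated with a small sketch. The slot counts come from the transaction slides (a write spans T to T+2, a read spans T to T+3); the assumption that a new transaction can be issued in every time slot, and the function name `completion_slots`, are illustrative and not taken from Triscend documentation.

```python
READ_SLOTS = 4   # a read spans time slots T .. T+3
WRITE_SLOTS = 3  # a write spans time slots T .. T+2


def completion_slots(n_transactions, slots_per_transaction, pipelined=True):
    """Time slots until the last of n back-to-back transactions finishes."""
    if n_transactions <= 0:
        return 0
    if pipelined:
        # Assumption: one new transaction enters the pipeline every slot,
        # so only the first transaction pays the full latency.
        return slots_per_transaction + (n_transactions - 1)
    # Without pipelining, each transaction waits for the previous one.
    return slots_per_transaction * n_transactions
```

Under these assumptions, eight back-to-back reads finish in 11 slots when pipelined versus 32 slots when serialized: latency per access is unchanged, but throughput approaches one access per slot.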
Wait State
• Why is it generated?
  – 1. The logic implemented in the CSL performs a handshake operation.
  – 2. The CSL logic is too slow to respond in one cycle.
• Sequence of generation
  – 1. The “Address Selector” in the CSL generates a wait state if the system tries to access the Selector’s address.
  – 2. If more than one wait state is required, the CSL function inserts additional wait states.
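The two-step wait-state sequence above can be sketched as a simple cost model. It assumes that each wait state stretches the access by exactly one time slot and that an access without waits takes four slots (the read case); the function `access_slots` and its parameters are hypothetical names for this illustration.

```python
def access_slots(address, selector_address, base_slots=4, extra_waits=0):
    """Time slots for one access under an assumed one-slot-per-wait model.

    address          -- address driven by the bus master
    selector_address -- address the selector was programmed to decode
    base_slots       -- slots for an access with no wait states (assumed 4)
    extra_waits      -- additional wait states requested by the CSL function
    """
    waits = 0
    if address == selector_address:
        # Step 1: the address selector generates the first wait state.
        # Step 2: the CSL function may insert further wait states.
        waits = 1 + extra_waits
    return base_slots + waits
```

For example, an access that hits the selector's address and needs two more wait states from the CSL function would take 7 slots in this model, while an access to any other address is unaffected.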
Wait State Insertion
[Figure: pipeline stages across time slots T, T+1, T+2, and T+3 (master, arbiter, address/data, configure selector/decode, read from CSL, bus FIFO, data from CSL to master), with an OR gate and the Waitnow signal]
CSL Physical Structure
• Bus pipeline registers sit at each bank boundary, so the time slots for user logic are independent of the signal transport time between banks.
• The write/read bus is distributed throughout the CSL and is buffered and piped into each bank, as shown by the red arrows.
• The wait signals generated by each bank are propagated to the pipeline registers in all other banks.
[Figure: 4 x 4 array of banks, each with a wait distribution logic cell; labels: logic tile, 16x8 RAM, system logic, 8K RAM]
Structure of Bank/Bus/Selector
[Figure: one bank drawn as an 8 x 8 array of tiles above a row of eight selectors; 4 wires per tile]
• The horizontal data line writes data to the CSL cells. Read data is OR’ed onto the horizontal read data line.
• The selectors are configured initially for column selection and wait generation.
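The OR’ed read data line can be modeled in a few lines: every column contributes a value to the shared line, and only the column whose selector matches the address contributes something other than zero. This sketch assumes that unselected columns drive zeros; the function name `read_bus` and the (selector address, data) pair layout are illustrative choices, not taken from Triscend documentation.

```python
def read_bus(address, columns):
    """Return the value seen on the shared read line.

    columns -- list of (selector_address, data) pairs, one per CSL column.
    """
    value = 0
    for selector_address, data in columns:
        selected = selector_address == address
        # Assumption: an unselected column drives all zeros, so OR-ing
        # every column together leaves only the selected column's data.
        value |= data if selected else 0
    return value


columns = [(0x10, 0xAB), (0x20, 0x5C)]
```

With this scheme no multiplexer tree is needed on the read path; the address decode in each selector is enough to make the OR behave like a selection.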
E5 Physical Implementation
• 8051 CPU core
• 0.35 µm, 40 MHz CSL operation
[Figure: chip layout showing the 8051 CPU core and RAM/ROM, and the CSL matrix]
SiDSA’s FIPSOC
• Integration of CAB (Configurable Analog Block)
  – 8051 microcontroller
  – FPGA
  – Configurable analog cells optimized for data acquisition applications
• Dynamic reconfiguration
  – Two configuration bits for each CLB
  – The user can download extra configuration data while the cells are in operation.
Analog Subsystem
• Configurable Analog Blocks (CAB)
  – Differential amplification
  – Comparison
  – Data conversion (ADC, DAC)
• Digital part
  – The digital part that configures the CAB is controlled by the µP or the programmable logic.
Comparison
• Xilinx
  – Uses the CoreConnect bus to connect processor and FPGA.
  – Multiple processor cores can be used simultaneously.
• ALTERA
  – Uses the AMBA bus to connect processor and FPGA.
• Triscend
  – The processor can read/write any register inside the CSL matrix. (The CSL matrix can be considered a functional block of the processor.)
  – Pipelining is used intensively to maintain or increase throughput, because the latency of a bus distributed throughout the CSL matrix could otherwise be excessive.