Doctoral Thesis
Design and Implementation of Low-Power 3D Graphics SoC
for Mobile Multimedia Applications
Woo, Ramchan (우람찬, 禹籃燦)
Department of Electrical Engineering and Computer Science
Division of Electrical Engineering
Korea Advanced Institute of Science and Technology
2004
Design and Implementation of Low-Power 3D
Graphics SoC for Mobile Multimedia Applications
Advisor: Professor Yoo, Hoi-Jun
By Ramchan Woo
Department of Electrical Engineering and Computer Science
Division of Electrical Engineering
Korea Advanced Institute of Science and Technology
A thesis submitted to the faculty of the Korea Advanced Institute of Science and Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Electrical Engineering and Computer Science, Division of Electrical Engineering.
Daejeon, Korea
2004. 6. 1
Approved by
Professor Yoo, Hoi-Jun
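Design and Implementation of Low-Power 3D Graphics SoC for Mobile Multimedia Applications
Woo, Ramchan
This thesis has been approved by the thesis committee as a doctoral thesis of the Korea Advanced Institute of Science and Technology.
May 12, 2004
Committee Chair: Yoo, Hoi-Jun (유회준) (seal)
Committee Member: 나종범 (seal)
Committee Member: 김이섭 (seal)
Committee Member: 박인철 (seal)
Committee Member: 원광연 (seal)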
Dedicated To My Beloved Parents
DEE
20015171
Woo, Ramchan (우람찬). Design and Implementation of Low-Power 3D Graphics SoC for Mobile Multimedia Applications. Department of Electrical Engineering and Computer Science, Division of Electrical Engineering. 2004. 116p. Advisor: Professor Yoo, Hoi-Jun. Text in English.
Abstract
A low-power graphics SoC implementing a full 3D pipeline with texture mapping and special rendering effects is designed for mobile multimedia applications such as PDAs and cell phones. The chip contains a RISC processor with a MAC as the geometry engine, a 3D rendering engine, a programmable power optimizer, and 29Mb of embedded DRAM. Low power consumption is achieved by applying various techniques to the instruction set architecture, pipeline structure, shading and texturing datapath, memory architecture, clock control, and embedded DRAM. Programmable clocking allows the chip to operate in lower-power modes for various applications. The chip consumes less than 210mW while delivering 1Mvertices/s, 66Mpixels/s, and 264Mtexels/s of texture-mapped pixels with real-time special effects. The 121mm² chip is fabricated in a 0.16um 256Mb-compatible DRAM process to reduce the fabrication cost. The graphics SoC is successfully demonstrated on two system evaluation boards running real-time applications ported with the custom-designed MobileGL.
Table of Contents
1. Introduction
  1.1 Mobile 3D Graphics
  1.2 Limitations
  1.3 Design Philosophy
  1.4 Previous Work
    1.4.1 RAMP-I by KAIST
    1.4.2 RAMP-II by KAIST
  1.5 Recent Work
    1.5.1 Z-3D by Mitsubishi
    1.5.2 MBX by ARM
    1.5.3 Others
  1.6 Architecture Summary of Mobile-3D Hardware
  1.7 Contribution of This Research
    1.7.1 Design and Implementation of 3D Graphics SoC for Mobile Multimedia Applications
    1.7.2 From Application to Demonstration
2. System Architecture
  2.1 Target Specification
  2.2 Simulation Environment
  2.3 SoC Architecture
    2.3.1 Geometry Engine with Intelligent Buffer
    2.3.2 Rendering Engine
    2.3.3 Graphics Memories
    2.3.4 Power Management Unit
3. Low-Power Rendering Engine
  3.1 3D Rendering Engine
  3.2 SlimShader : Main Rendering Pipeline
    3.2.1 Instruction Set Architecture
    3.2.2 Low-Power Pipeline Structure
    3.2.3 Triangle Setup Engine
  3.3 Energy-Efficient Texturing Unit
    3.3.1 Consideration of Energy Efficiency
    3.3.2 Approximation of Perspective Division
    3.3.3 Address Alignment Logic
  3.4 Memory Programmer : Post Processing Unit
  3.5 Memory Access
4. Chip Implementation
  4.1 Process Technology
  4.2 Chip Fabrication
  4.3 Power Consumption
  4.4 Performance
    4.4.1 Performance Summary
    4.4.2 Performance Comparison
    4.4.3 Performance of SlimShader with External SDRAM
  4.5 Appendix : Design Information
    4.5.1 Area Information
    4.5.2 Cell Utilization
5. System Evaluation
  5.1 Target Configurations
  5.2 REMY : System Evaluation Board
    5.2.1 System Architecture
    5.2.2 REMY-I : First Evaluation Board
    5.2.3 REMY-II : PDA Prototype
  5.3 Graphics Library : MobileGL
  5.4 Demonstration
6. Conclusions and Further Work
  6.1 Conclusions
  6.2 Further Work
7. Summary
8. Bibliography
Chapter 1 Introduction
1.1 Mobile 3D Graphics
As the mobile electronics market grows rapidly, 3G multimedia terminals such as PDAs and smart cell phones are gaining popularity. PDA applications are already migrating from text-based PIM (Personal Information Management) to real-time multimedia such as MP3 audio, MPEG-4 video [1-2], and even 3D computer graphics [3-4]. Likewise, today’s cell phones are no longer designed only for voice communication; they are evolving into Mobile Multimedia Centers. Taking pictures with a built-in camera, watching 2D graphics animations and MPEG-4 videos, listening to MP3 audio, and even playing Java games are no longer stories of the future. They are already part of everyday life. Looking back on the evolution of the PC, it is natural to expect 3D computer graphics to be the next step. Real-time 3D applications are especially attractive for games, advertisements, and avatars, whose data can be downloaded over the wireless network while occupying only a limited bandwidth. Since complex 3D scenes can be represented simply by a list of vertices, texture images, and the corresponding camera movements, which are naturally compressed, 3D graphics is well suited to bandwidth-critical wireless applications [5]. To satisfy these market demands, much research on realizing 3D graphics for handheld devices has recently been attempted, including the design of hardware accelerators for mobile platforms [6-8] as well as the definition of software libraries [3-4]. However, these hardware accelerators fall far short of the market requirements: they offer only limited shading operations, without the texture mapping and special rendering effects that are mandatory for 3D game applications.
1.2 Limitations
Since real-time 3D computer graphics requires huge computing power and correspondingly high memory bandwidth, it has been a critical issue even on PC and console platforms over the past ten years [9-11, 46]. Today’s PC graphics accelerators can draw high-quality 3D images with a high-performance GPU (Graphics Processing Unit), but handheld devices cannot tolerate such tens-of-watts power monsters. The problem is more challenging on the mobile platform because power consumption and physical dimensions are much more stringently limited.
1) The most critical factor is the limited energy supplied by the battery. Given the system power budget, which also covers the host processor, system memory, input interfaces, and LCD display, the power allocated to the 3D graphics system is confined to less than 300mW to 400mW for 2 to 3 hours of continuous playback [12].
2) The limited computing power of a mobile system, which has a host processor without an FPU and a 400MB/s memory system, makes it difficult to run 3D applications with current software alone.
3) Since users hold wireless terminals in their hands and watch pixels on small displays, the average eye-to-pixel angle is wider than in a PC graphics system. Each pixel should therefore be drawn with higher fidelity even though the screen is far smaller than that of a PC. Although recent attempts to optimize 3D graphics software for handheld devices achieve significant improvements with an integer-only datapath, their performance and quality are still below the market requirements [4].
4) There is hardly any extra space for a graphics accelerator and its graphics memories, since the PCB footprint is limited.
5) Low cost cannot be ignored, because the target systems will be carried in everybody’s hand.
6) Standard graphics APIs, which define the reference platforms, have not yet been defined for mobile applications. We therefore also need to define the hardware guidelines, including the supported functions, datapath precision, and other necessities for 3D graphics, based on PC APIs such as OpenGL [14] or DirectX [15].
1.3 Design Philosophy
The real-time 3D graphics pipeline is composed of computation-intensive geometry operations, which calculate the positions of the triangle vertices, and memory-access-intensive rendering operations, which fill colors inside the triangles [35, 50]. Although the bottleneck in the geometry stage can be relieved by a fast, parallel datapath, rendering performance cannot be improved easily, since up to tens of bytes must be accessed for every pixel. Because energy consumption is proportional to the number of memory accesses, recent research mainly focuses on reducing off-chip bandwidth to extend battery lifetime for mobile 3D applications. The MBX architecture reduces memory access with tile-based rendering, but its performance is still limited by the system bus and by the tiling overhead itself [24]. Moller’s POOMA texturing system proposes several schemes for reducing texture requests, but no measurement results from a hardware implementation have been reported yet [26]. Since 3D rendering requires various buffers to store frame, depth, and texture images, merging their requests and serving them through a limited number of off-chip ports makes the interface circuitry more complex. Solving the bandwidth bottleneck with traditional approaches such as prefetching, caching, and scheduling can be another burden on energy.
If we move our viewpoint from reducing off-chip bandwidth to integrating the memory itself for efficient 3D rendering, more effective architectures and implementation schemes emerge in terms of performance, area, and cost as well as power consumption. On-chip memory clearly provides more bandwidth while eliminating power-consuming off-chip accesses [51]. If the various buffers are integrated, each of them can be activated separately and selectively to reduce power consumption further. For wireless applications, whose screen resolution will stay below QVGA for some time, the required on-chip memory capacity is affordable, ranging from several megabits to tens of megabits. Z-3D, a 3D rendering core designed by Mitsubishi [23], contains about 1Mbit of SRAM assigned to rendering. Its 53mW power consumption allows it to be implemented inside cell phones. However, the use of SRAM still limits the storage capacity, which in turn limits performance and functionality. New architectures with embedded DRAM must therefore be explored to accelerate realistic 3D graphics for wireless applications.
To realize low-power 3D rendering at high performance, I proposed an application-specific embedded DRAM architecture, the RAMP-IV architecture. Instead of merely integrating a global DRAM and connecting it through a huge number of wires and a corresponding crossbar switch, I determined the memory configuration after analyzing the bandwidth requirements and access patterns of the application. The various buffers and the pixel-parallel nature of the 3D rendering operation allowed me to distribute the memory accesses, which not only provides sufficient bandwidth but also reduces power consumption by activating only one or a few of the memories locally. After that, I specified the design of each embedded DRAM according to its location and access pattern. The latency, throughput, number of buses, and commands of the DRAM are therefore not taken as given parameters; they are all treated as application-specific variables. Then, I tuned the logic pipeline to take full advantage of the modified timing and functions of the DRAM. Finally, I applied various low-power techniques inside the memory and the logic themselves. This design methodology is backed by the ITRS roadmap [52], which predicts that memory size keeps increasing as silicon process scaling advances and notes that more than half of the chip area is already occupied by on-chip memory. That is, memory can no longer be treated as a passive device, nor called a sub-system.
[Fig. 1.3-1 : Example of PC Graphics Architecture]
Fig. 1.3-1 shows a typical example of today’s PC graphics architecture, in which the GPU has evolved to attain huge memory bandwidth [44, 53]. The GPU does not frequently access data stored in the system memory, since the system memory cannot supply enough bandwidth for both the CPU and the GPU. Inside the GPU, several pixel engines work in parallel to boost performance, fetching data from dedicated T$ (texture cache) and F$ (frame cache) memories. The External Memory Interface (EMI) then merges the various transactions from the cache memories and transfers them to off-chip DDR-SDRAMs dedicated to graphics processing. The memories are connected to the EMI through a high-speed crossbar switch, and their data are accessed in burst mode to fully utilize their bandwidth. However, this makes the architecture power hungry. Since the required data in the SRAM cache is transferred together with adjacent data that may not be used at all, integrating cache memories can waste power. Merging many transactions from different cache memories also makes the EMI circuitry more complex and more power consuming. Moreover, prefetching data from DDR-SDRAM implies that unwanted data may be fetched as well through the high-speed signal interface, wasting power.
Therefore, I proposed the RAMP-IV architecture, in which the pixel engine is directly connected to local DRAM instead of going through a complex cache and memory interface. As shown in fig. 1.3-2, which depicts a typical cell-phone architecture, the baseband modem and the CPU access the data stored in the SDRAM through the limited bandwidth of the power-consuming system bus. The graphics data should therefore be stored in the local DRAM and accessed simply through the local interconnection.
[Fig. 1.3-2 : Example of Cell-Phone Architecture]
1.4 Previous Work
1.4.1 RAMP-I by KAIST
RAMP-I is a single-chip rendering engine for low-power 3D graphics that consists of 64 DRAM frame buffers, 64 pixel processors (PPs), 8 edge processors (EPs), and a 32bit RISC core, as shown in fig. 1.4-1 [6]. The PPs are distributed over their corresponding DRAMs and work in parallel to fill the pixels inside a polygon. Each PP and DRAM can also be activated selectively according to the shape of the polygon to save overall power. Although the architecture of RAMP-I reaches a rendering speed of 11.1Mpolygons/s, it performs only simple shading, alpha blending, and depth comparison for 8x8 pre-clipped polygons. It also contains as many as 64 pixel processors, some of which can hardly be utilized. In this architecture, the 64 DRAMs are controlled independently by their own controllers. Each DRAM covers only a small portion of the screen area, since the small screen of the target PDA is distributed over them. As a result, this architecture cannot be implemented easily even in a 0.18um CMOS process, because the total area including the memories is too large. In fact, the fabricated RAMP-I chip contains only 1/8 of the full architecture in 0.35um technology. Even if RAMP-I were designed in 0.18um CMOS, it would take about 100mm² as shown in the following estimation:
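\[
\begin{aligned}
8 \times \mathrm{EP} + 64 \times \mathrm{PP} + 64 \times \mathrm{FB}
&= 8 \times 5.3 + 64 \times 0.9 + 64 \times 3.7 + (\text{Routing Area}) \\
&= 400\ \mathrm{mm}^2 \quad (\text{with } 0.35\ \mu\mathrm{m}) \\
&= 100\ \mathrm{mm}^2 \quad (\text{with } 0.18\ \mu\mathrm{m})
\end{aligned}
\]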
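The arithmetic behind this estimation can be checked with a short sketch. It assumes that the per-block areas in the estimation read as 5.3, 0.9, and 3.7 mm² for one EP, PP, and frame buffer, and that area shrinks with the square of the feature size; the routing share is derived here rather than stated in the thesis, and the function names are hypothetical.

```python
# Per-block areas in mm^2 at 0.35 um, as read from the estimation above.
# These readings and the ideal scaling rule are assumptions of this sketch.
EP_AREA = 5.3   # one edge processor
PP_AREA = 0.9   # one pixel processor
FB_AREA = 3.7   # one DRAM frame buffer
TOTAL_AT_035 = 400.0  # quoted total at 0.35 um, routing included


def block_area(n_ep=8, n_pp=64, n_fb=64):
    """Sum of the EP, PP and FB areas, without routing."""
    return n_ep * EP_AREA + n_pp * PP_AREA + n_fb * FB_AREA


def scale_area(area_mm2, from_um, to_um):
    """Ideal shrink: area scales with the square of the feature size."""
    return area_mm2 * (to_um / from_um) ** 2


blocks = block_area()
routing = TOTAL_AT_035 - blocks
scaled = scale_area(TOTAL_AT_035, 0.35, 0.18)
print(f"blocks {blocks:.1f} mm^2, routing {routing:.1f} mm^2, "
      f"0.18 um estimate {scaled:.1f} mm^2")
```

Under these assumptions the blocks alone account for roughly 337 mm², and the ideal shrink lands slightly above the "about 100mm²" quoted in the text.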
Also, its distributed architecture makes it difficult to implement general 3D graphics functions such as texture mapping or special rendering effects.
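The selective activation described for RAMP-I earlier in this section can be pictured with a small sketch. It assumes, purely for illustration, one pixel processor per pixel of an 8x8 tile and a simple edge-function coverage test. The thesis does not say how RAMP-I derives its activation mask, so `edge` and `active_mask` are hypothetical names.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area test: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def active_mask(tri, size=8):
    """Return a size x size grid of booleans, True where a PP would be active.

    Each pixel is sampled at its center; a PP is active when its pixel center
    lies inside the triangle (either winding order, edges inclusive).
    """
    (x0, y0), (x1, y1), (x2, y2) = tri
    mask = []
    for y in range(size):
        row = []
        for x in range(size):
            px, py = x + 0.5, y + 0.5
            w0 = edge(x1, y1, x2, y2, px, py)
            w1 = edge(x2, y2, x0, y0, px, py)
            w2 = edge(x0, y0, x1, y1, px, py)
            inside = (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
                     (w0 <= 0 and w1 <= 0 and w2 <= 0)
            row.append(inside)
        mask.append(row)
    return mask


tri = ((0, 0), (8, 0), (0, 8))  # half of the 8x8 tile
mask = active_mask(tri)
active = sum(sum(row) for row in mask)
print(f"{active} of {len(mask) * len(mask[0])} pixel processors active")
```

For this half-tile triangle only 36 of the 64 processors would be active, which illustrates both the power saving from gating the rest and the utilization problem noted in the text.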
[Fig. 1.4-1 : RAMP-I] (architecture block diagram and fabricated test chip)
1.4.2 RAMP-II by KAIST
RAMP-II is a low-power 3D rendering engine implemented as part of a mobile PDA chip [7, 16]. 6Mb of embedded DRAM macros attached to 8-pixel-parallel rendering logic are logically localized with a 3.2GByte/s run-time reconfigurable bus, as shown in fig. 1.4-2, which reduces the area by 25% compared with a conventional local frame-buffer architecture such as RAMP-I. It is the world’s first 3D core integrated into a PDA chip, consuming 120mW and occupying 24mm² in a 0.18um CMOS process. Although its maximum drawing rate reaches 70Mpixels/s, the low utilization of the 8 pixel processors and the unbalanced load among the PPs cut the sustained fill rate to less than 20Mpixels/s. Moreover, the run-time reconfigurable bus accounts for about 80% of the power consumed in the rendering logic, because the embedded DRAM has too many data bits (2048 bits) and their routes change at 20MHz. The supported 3D functions are exactly the same as those of RAMP-I: simple shading, alpha blending, and depth comparison for 8x8 pre-clipped triangles, without texture mapping or programmability for special rendering effects.
[Fig. 1.4-2 : RAMP-II]
1.5 Recent Work
Much research on mobile 3D graphics acceleration has been reported since this work was first presented [17-22], and this section summarizes the architectures and features.
1.5.1 Z-3D by Mitsubishi
[Fig. 1.5-1 : Z-3D]
Z-3D is the world’s first commercial implementation of hardware-accelerated 3D graphics on cell phones. It targets 3D games, walk-throughs, and advertisements. Designed by Mitsubishi, Z-3D was commercialized by NTT DoCoMo and built into the D504i and D505i phones. As shown in fig. 1.5-1, Z-3D is composed of a geometry engine, a rendering engine, a pixel engine, and on-chip SRAM [23]. The geometry engine reads vertex data from a 120kB display list buffer and performs coordinate transformation, lighting calculation, and clipping with one 24bit integer processing unit and two 24bit floating-point processing units in its datapath. After the rendering engine fills triangles, performing smooth shading and texture mapping with the 53kB on-chip texture memory, the pixel engine performs hidden surface removal and opacity display (alpha blending) with the 93kB on-chip frame and Z buffers at the end of the 3D pipeline. In this architecture, the small capacity of the on-chip memories therefore limits the contents, textures, and screen resolution. Also, although Z-3D consumes a relatively small amount of power, 38mW, its performance of 185Kvertices/s transformation and 5.1Mpixels/s fill rate at 30MHz is still below the requirements of real-time 3D gaming applications.
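A rough capacity check illustrates how the on-chip buffer size ties Z-3D to its small screen. The sketch below assumes 16-bit color and 16-bit depth per pixel with a single color buffer; the thesis states only the 93kB buffer size and the 176 x 132 LCD, so the pixel formats and the function name `buffer_bytes` are assumptions.

```python
def buffer_bytes(width, height, color_bits=16, depth_bits=16, color_buffers=1):
    """Bytes needed for the color buffer(s) plus one depth buffer."""
    bits_per_pixel = color_buffers * color_bits + depth_bits
    return width * height * bits_per_pixel // 8


z3d_screen = buffer_bytes(176, 132)   # LCD size shown for Z-3D
qvga_screen = buffer_bytes(320, 240)  # QVGA with the same assumed formats

print(f"176x132: {z3d_screen} bytes")
print(f"320x240: {qvga_screen} bytes")
```

With these assumed formats, the 176 x 132 screen needs 92,928 bytes, which is consistent with the 93kB frame and Z buffers, while QVGA would need more than three times that capacity.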
1.5.2 MBX by ARM
MBX is a 2D/3D graphics core co-developed by Imagination Technologies and ARM to accelerate 3D graphics on ARM-based mobile platforms [24]. As shown in fig. 1.5-2, it contains a tile accelerator, an HSR (Hidden Surface Removal) engine, a texture shading unit, a pixel blender, and a 512Byte texture cache. Containing only a minimal set of rendering memories, it shares the system memory with the main processor to store frame, depth, and texture data as well as the display list. Unlike the conventional graphics pipeline [11], MBX performs tile-based rendering to save memory bandwidth. This deferred-rendering technique may reduce the bandwidth needed to access frame and texture data, but it requires extra time and bandwidth to set up the parameters for the tiling itself. Besides, overall system performance is severely degraded because the limited bandwidth of the 32bit 100MHz AMBA AHB is also shared with the CPU core. Assuming that 50% of the 400MByte/s bus can be utilized and that half of the acquired bandwidth is shared with the CPU, the bandwidth assigned to MBX is only about 100MByte/s. Therefore, the sustained pixel fill rate can be only 9Mpixels/s at 100MHz, which is less than 10% of the maximum rendering performance, when drawing 16bpp pixels with 16bpp textures.
\[
\text{Pixel Fill Rate} = 30\,\text{Hz (Screen Refresh Rate)} \times 20\text{k (Number of Vertices)} \times 16\ \text{(Average Number of Pixels/Polygon)} \approx 9\,\text{Mpixels/s}
\]
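The bandwidth budget in this argument can be reproduced with a few lines. The sketch below uses only figures quoted in the text (32bit 100MHz bus, 50% utilization, half shared with the CPU, 30Hz, 20k vertices, 16 pixels per polygon). The bytes-per-pixel value it derives is an implication of those figures, not a number given in the thesis, and the function name is hypothetical.

```python
def bus_peak_bytes_per_s(width_bits, clock_hz):
    """Peak bandwidth of a bus that moves one word per clock cycle."""
    return width_bits // 8 * clock_hz


peak = bus_peak_bytes_per_s(32, 100_000_000)   # 32bit AHB at 100MHz
usable = peak * 0.5                            # 50% bus utilization
mbx_share = usable * 0.5                       # half goes to the CPU

# Pixel workload from the fill-rate equation above.
workload_pixels = 30 * 20_000 * 16             # Hz x vertices x pixels/polygon

# Memory traffic per pixel implied by the quoted 9Mpixels/s (derived value).
implied_bytes_per_pixel = mbx_share / 9_000_000

print(f"peak {peak / 1e6:.0f} MB/s, MBX share {mbx_share / 1e6:.0f} MB/s")
print(f"workload {workload_pixels / 1e6:.1f} Mpixels/s")
print(f"implied traffic {implied_bytes_per_pixel:.1f} bytes/pixel")
```

The exact product of the three workload factors is 9.6Mpixels/s, which the text rounds to 9Mpixels/s; the implied traffic of about 11 bytes per pixel is in line with the "tens of bytes per pixel" mentioned in section 1.3.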
Also, since MBX is ported directly from a PC graphics architecture, Kyro [25], with little modification, it cannot be called an architecture optimized for low-power platforms. It supports too many functions, such as anisotropic texture filtering and vertex programming, some of which are of little use for now and only add power and area overhead.
[Fig. 1.5-2 : MBX]
1.5.3 Others
Recently, Akenine-Moller presented a hardware rasterization architecture for mobile phones that mainly focuses on low cost [26]. The proposed architecture concentrates on rasterizing textured triangles while saving memory bandwidth to the external memory. Based only on software simulations, the work proposes an inexpensive multisampling antialiasing scheme, a new filtering method with texture minification and compression, and a scan-line-based culling scheme that avoids a significant amount of z-buffer access.
SONY announced a 3D graphics engine intended for products ranging from mobile to stationary [27]. It contains a floating-point geometry engine and a rasterization engine with color, z, and texture caches, as illustrated in fig. 1.5-3. The memory interface merges the requests from the three caches, which lets the core interface easily with the system memory. The chip can draw pixels at a rate of 600Mtexels/s while consuming 109.5mW.
[Fig. 1.5-3 : 3D Graphics Engine by SONY]
Also, a 3G baseband processor with 3D capability [28] and an application processor [29], which includes a geometry FPU [30] and a 3D rendering engine, are attempting to realize 3D graphics on 3G multimedia terminals. Concurrently, standard graphics APIs such as OpenGL-ES [31] and JSR-184 [32] are currently being defined.
1.6 Architecture Summary of Mobile-3D Hardware
Because supplying sufficient bandwidth to the rendering engine determines the overall graphics performance, the mobile 3D hardware listed in the previous sections can be categorized by its memory access into two types: bus-attached system memory and local graphics memory [45].
Fig. 1.6-1 shows an example of bus-attached system memory. The 3DCG-IP (3D Computer Graphics IP), integrated into the application processor together with the host CPU, draws pixels by accessing the system memory, where frame, depth, and texture data are stored, through the system bus. The processed pixels stored in the system memory are also transferred through the system bus to the LCD display. In this architecture, 3D graphics can be accelerated easily with the least area overhead, since the memory is shared with the host CPU. This architecture is therefore adopted by hardware vendors such as SONY [27], ARM [24], and Qualcomm [30]. However, the slow and narrow system bus, which provides only about 400MByte/s at 32bit 100MHz, is far below the bandwidth requirement of real-time 3D graphics applications: several GByte/s of sustained bandwidth for 100Mpixels/s with bilinear texturing. The bus is moreover shared with other IPs such as the host CPU and the LCD interface, so the rendering performance is severely limited. Also, even as process technology shrinks, there is not much room to increase the bus frequency and data width, since they in turn increase power consumption. Cache memories can alleviate the bandwidth bottleneck, but they also add area and power overhead.
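The gap between the system bus and the rendering requirement can be estimated with a simple per-pixel traffic model. The model below is an assumption made for illustration: one depth read and one depth write, one color write (plus a destination read when blending), and four texel fetches for bilinear filtering, with no caching. The thesis gives only the totals, namely about 400MByte/s for the bus and several GByte/s for 100Mpixels/s.

```python
def bytes_per_pixel(color_bytes=2, depth_bytes=2, texel_bytes=2,
                    texels=4, alpha_blend=False):
    """Uncached memory traffic for one textured pixel (illustrative model)."""
    traffic = 2 * depth_bytes          # depth read + depth write
    traffic += color_bytes             # color write
    if alpha_blend:
        traffic += color_bytes         # destination color read for blending
    traffic += texels * texel_bytes    # bilinear filtering reads 4 texels
    return traffic


def required_bandwidth(pixels_per_s, **formats):
    """Sustained bytes per second needed at the given fill rate."""
    return pixels_per_s * bytes_per_pixel(**formats)


BUS_BYTES_PER_S = 400_000_000          # 32bit x 100MHz system bus

low = required_bandwidth(100_000_000)  # 16-bit formats, no blending
high = required_bandwidth(100_000_000, color_bytes=4, depth_bytes=4,
                          texel_bytes=4, alpha_blend=True)

print(f"16-bit formats: {low / 1e9:.1f} GB/s, {low / BUS_BYTES_PER_S:.1f}x the bus")
print(f"32-bit + blend: {high / 1e9:.1f} GB/s, {high / BUS_BYTES_PER_S:.1f}x the bus")
```

Even the lighter 16-bit case asks for several times the peak bus bandwidth before any sharing with the host CPU or LCD interface is taken into account.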
Fig. 1.6-2 shows an example of local graphics memory. The graphics IP is integrated into the application processor together with a dedicated graphics memory, which supplies the required rendering data over a wide local bus between the 3DCG-IP and the graphics memory. The rendered pixels are then transferred directly to the LCD display. Low power consumption can be achieved in this architecture because the necessary rendering data is obtained over a short local memory bus rather than the capacitive system bus. Although this architecture requires additional area for the graphics memory, this will be resolved as process technology shrinks. More bandwidth is also left for other IPs in the application processor, such as the host CPU and possibly MPEG-4, since the system bus is almost free of rendering traffic. Solving the area overhead will be easier than solving the bus-bandwidth bottleneck in the near future, because process technology is heading past 90nm toward several nanometers. Therefore, this work (RAMP-IV) and Mitsubishi [23] adopt this architecture and show greater rendering performance.
[Fig. 1.6-1 : Bus-attached System Memory]
[Fig. 1.6-2 : Local Graphics Memory]
1.7 Contribution of This Research
1.7.1 Design and Implementation of 3D Graphics SoC for Mobile Multimedia Applications
This work is the world’s first publication on a 3D graphics SoC implementing a full 3D pipeline with texturing and special effects for PDAs and 3G cell phones.
1) The chip is highly integrated, containing a geometry engine, a rendering engine, 29Mb of embedded DRAM, and a power management unit. The proposed rendering engine shows the highest performance announced to date, with the help of an energy-efficient texturing architecture and local graphics DRAMs.
2) Low power consumption is achieved by applying various techniques to the instruction set architecture, pipeline structure, shading and texturing datapath, memory architecture, and clock control.
3) It is also the world’s first implementation of a mobile graphics processor in pure DRAM technology, chosen to reduce the fabrication cost. The chip is fabricated in a 256Mb-compatible DRAM process. The DRAM process also suppresses leakage current, which is as important as run-time current in mobile devices.
1.7.2 From Application to Demonstration
In addition to the chip implementation, a complete flow from application analysis to system demonstration was organized for the SoC design.
1) Prior to the chip implementation, I analyzed real-time applications to propose and optimize the new architecture using the simulation environment 3D-Glamor. Since no publication on a full 3D graphics pipeline for mobile devices existed before, I also defined the required functions and precisions based on 3D-Glamor.
2) A mobile graphics library, MobileGL, was designed to port applications to the proposed system. MobileGL is the world’s first attempt at reducing OpenGL for a mobile platform.
3) Since the chip is implemented in a DRAM process, for which only the full-custom method had been tried, I applied the ASIC design flow to the DRAM process by designing the standard cells and porting them to CAD tools at various levels.
4) I also developed two evaluation systems for real-time demonstration. The 3D graphics images are successfully displayed with the fabricated chip. It is the world’s first demonstration of a mobile 3D graphics SoC in real silicon.
Chapter 2 System Architecture
2.1 Target Specification
Fig. 2.1-1 illustrates a full 3D pipeline, which includes a geometry engine, a vertex buffer, a rendering engine, and the corresponding rendering memories. For real-time 3D graphics on handheld devices, the geometry engine needs fast calculation of more than 0.5Mvectors/s and programmability for transformation and lighting (T&L). The vertex buffer is necessary for efficient data transfer. The rendering engine requires parallel calculation of more than 10Mpixels/s and a memory bandwidth of more than 1GByte/s for shading, depth comparison, and texturing. A large rendering memory of more than 10Mbits, with a bandwidth reaching several GByte/s, must also be provided to store frame, depth, and various texture images. In this work, I implemented all these features in a single chip, and this chapter explains the architectural details.
2.2 Simulation Environment
To find the optimum pipeline architecture, memory size, and bandwidth, I developed a 3D graphics simulator, 3D-Glamor (3D Graphics Library and Memory Simulator). The simulation architecture of 3D-Glamor is illustrated in fig. 2.2-1. Real-time 3D graphics applications running on OpenGL are converted to vertex lists, material properties, camera movements, and texture images. The geometry and rendering codes are then executed on MobileGL, a custom-designed graphics library. Since the conventional 3D graphics libraries for PC platforms are optimized only for a power-consuming floating-point datapath, they are not suitable for a low-power RISC geometry engine with an integer-only datapath. I therefore designed a 3D graphics library with 32bit fixed-point arithmetic to make optimal use of the ARM9 datapath while maintaining compatibility with the de facto standard, OpenGL. Various rendering algorithms and architectures are simulated with rendering models at the following levels:
Reference Renderer : Functional C-model
Cycle-Accurate Renderer : Cycle-Accurate C-model
Verilog PLI : Verilog-model, but datapath is described by C/C++
Verilog RTL : Verilog RTL (Register-Transfer-Level) model
Verilog GATE : P&R-Ready Verilog-model after synthesis
I gathered the necessary information, such as the optimum precision of each
datapath, memory bandwidth and utilization, and pipeline efficiency, by running
real-time applications. In order to simulate real-life workloads, four
distinct test vectors are selected and classified by the number of pixels per
polygon, texture size, and the presence of texturing. Since a larger texture image
generally shows poorer texturing performance [33], texture images of 128x128 or
256x256, which are relatively large for the small screen resolution of cell
phones, are used to get worst-case results. Also, all vectors are rotated in
all directions and zoomed in and out to average over the direction, shape, and
size of the triangles, which affect many results such as the memory access pattern,
bus utilization, and pixel processor load balance. The characteristics of the test
vectors are summarized in table 2.2-1.
[Fig. 2.1-1 : Integration of Full 3D Pipeline]
(Pipeline blocks and their requirements. Geometry Engine (T&L): fast calculation
(>0.5M Vec/s), programmability. Vertex Buffer: efficient data transfer, scalability.
Rendering Engine (shading, texturing): parallel calculation (>10M Pix/s), huge
memory bandwidth (>1GB/s). Rendering Memories (textures, frame/depth): large
capacity (>10Mb), fast cycle time, many access ports.)
[Fig. 2.2-1 : 3D-Glamor Architecture]
(Flow shown: OpenGL applications (C/C++ on PC) pass through code conversion into
vertex list, material, camera, and texture data plus geometry and rendering code;
MobileGL feeds the rendering interface with the Reference Renderer and the
Cycle-Accurate Renderer (C/C++ on UNIX, with virtual frame buffer, depth buffer,
and texture memory) and the hardware interface with SlimShader code for the
Verilog PLI, RTL, and GATE models of the rendering engine; ARM code runs on the
ARMulator.)
Test Vector | Texture Size | Triangle Count | Average Pixel Count / Triangle | Number of Animated Frames | Comments
A | 128 x 128 | 6,833 | 11.2 | 104 | Textured by 128x128 image, small polygons
B | 256 x 256 | 6,833 | 11.2 | 104 | Textured by 256x256 image, small polygons
C | 256 x 256 | 2 | 15,300 | 30 | Textured by 256x256 image, large polygons
D | No texture | 5,878 | 16.5 | 105 | Non-textured, small polygons
[Table 2.2-1 : Characteristics of Test Vectors]
2.3 SoC Architecture
Based on the simulation results of 3D-Glamor, I propose the architecture of the
graphics SoC as shown in fig. 2.3-1. It consists of a 32bit RISC processor that is
assigned to the geometry engine, a bandwidth equalizer (BEQ) for the vertex buffer,
a 3D rendering engine (3DRE), 29Mb of embedded DRAM, and a programmable
power optimizer (PPO). Dedicated hardware engines and the 1.6GByte/s bandwidth
of the 416bit-wide DRAM allow the operating frequency of the 3DRE and the
DRAM to be lowered to 33MHz, while the RISC operates at 132MHz. The programmable
power optimizer manages the power consumption of the chip by controlling four
clock domains: software can gate the clocks and change their frequencies at
run-time. Each of these IP blocks is discussed in detail in the following
sections.
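The quoted bandwidth can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name is hypothetical, and reading 1 GByte as 2^30 bytes is an assumption that happens to reproduce the figures quoted in this chapter (about 1.6GByte/s at 33MHz, and about 2.4GByte/s at the 50MHz maximum mentioned in section 2.3.3).

```python
def peak_bandwidth_gbyte(bus_bits: int, clock_hz: float) -> float:
    """Peak bandwidth in GByte/s, taking 1 GByte = 2**30 bytes (an assumption)."""
    bytes_per_cycle = bus_bits / 8          # one full-width transfer per clock
    return bytes_per_cycle * clock_hz / 2**30


bw_33mhz = peak_bandwidth_gbyte(416, 33e6)  # 3DRE/DRAM clock in FAST mode
bw_50mhz = peak_bandwidth_gbyte(416, 50e6)  # maximum DRAM clock

print(f"{bw_33mhz:.2f} GByte/s at 33MHz, {bw_50mhz:.2f} GByte/s at 50MHz")
```

With decimal gigabytes the same bus would give slightly higher numbers, so the choice of unit matters when comparing against other designs.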
[Fig. 2.3-1 : Block Diagram of Graphics SoC]
(Blocks shown: RISC (ARM-9 core with 32x32 MAC, caches, and memory interface,
132MHz); BEQ (flow control, entry pointer, four 256Byte SRAM banks, 32b/128b
ports); 3DRE (SlimShader with triangle setup engine, PP0/PP1, TE0/TE1, and address
alignment logic; memory programmer; 24b display output, 33MHz); PPO (PLL and
clock control unit, 4 clocks); embedded DRAM (2Mb depth buffer, 3Mb frame buffer,
24Mb texture memory; 416b, 1.6GB/s @ 33MHz); external interface.)
2.3.1 Geometry Engine with Intelligent Buffer
The RISC processor with 4KB I/D caches is compatible with the ARM-9
architecture and operates at 132MHz [34]. It has a single-cycle 32bit x 32bit
Multiply-Accumulate Unit (MAC) in its datapath to accelerate the 3D geometry
operations. It can calculate as many as 1.04Mvertices/s for model-view
transformations when running a customized fixed-point graphics library, which
is a 43% improvement over the conventional ARM9 processor [35]. When the
geometry engine calculates the model-view transformation, perspective
projection, and 6-side clipping together, 300Kvertices/s is obtained. If lighting
(a single directional light source with an infinite viewer and a one-sided lighting
model with ambient, diffuse, and specular highlighting) is added, the
performance drops to 70Kvertices/s. The MAC also accelerates the processing of
MPEG-4 SP@L1 video streams. It reduces the cycle time by more than 30% when
executing the IDCT routines, which are basically the same operation as the
geometry vector calculation. The memory interface is optimized for real-
time multimedia applications so that the RISC can supply the 3D data directly to
the rendering engine through the bandwidth equalizer (BEQ), bypassing the data cache.
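To make the role of the MAC in fixed-point geometry concrete, the following is a minimal sketch of a model-view transformation written as multiply-accumulate steps. It is not the MobileGL code: the Q16.16 split of the 32bit word, the helper names, and the example matrix are assumptions for illustration, and Python integers do not model 32bit overflow.

```python
FRAC_BITS = 16              # assumed Q16.16 split of the 32bit fixed-point word
ONE = 1 << FRAC_BITS


def to_fixed(value: float) -> int:
    return int(round(value * ONE))


def to_float(fixed: int) -> float:
    return fixed / ONE


def transform(matrix, vertex):
    """4x4 matrix times 4-vector; every row is four multiply-accumulate steps."""
    result = []
    for row in matrix:
        acc = 0
        for m, v in zip(row, vertex):
            acc += m * v                    # MAC step: product has 2*FRAC_BITS fraction bits
        result.append(acc >> FRAC_BITS)     # rescale once per row
    return result


# Hypothetical model-view matrix: a pure translation by (2.5, -1.0, 0.25).
model_view = [[to_fixed(c) for c in row] for row in (
    (1, 0, 0, 2.5),
    (0, 1, 0, -1.0),
    (0, 0, 1, 0.25),
    (0, 0, 0, 1),
)]
vertex = [to_fixed(c) for c in (1.0, 2.0, 3.0, 1.0)]
moved = [to_float(c) for c in transform(model_view, vertex)]
```

Accumulating the full-width products and shifting once per row keeps the rounding error to a single truncation per output coordinate, which is one reason a single-cycle MAC suits this kind of code.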
To compensate for the difference in processing speed and data width between
the RISC and the 3D rendering engine, the BEQ buffers the vertex data in a
1KByte Dual-Ported SRAM (DP-SRAM). The data in the vertex buffer are
128bit-encoded instructions containing vertex coordinates, texture coordinates,
and colors. Revised from the previous implementation [8], the current BEQ saves
more than 20% of the power consumption in the SRAM with the help of adaptive bank
activation. It partially activates the banks of the DP-SRAM according to the required
buffer size, which is decided by the entry pointer. The flow controller keeps track
of the requests from the RISC and the 3DRE, and activates only the necessary
SRAM banks. Since the BEQ is also revised so that it can be configured as a 1KByte
bidirectional scratch-pad RAM, the RISC can read data from the BEQ for DSP
applications in which software-addressable on-chip memory is preferable for storing
coefficients.
2.3.2 Rendering Engine
The rendering engine is the core of this graphics SoC. I designed the rendering
engine as a scalable IP core to satisfy the performance requirements of various
mobile platforms within the allowed power budget, since the target applications
range from simple avatars, user interfaces, and commercials on a QCIF
(176x144) display to real-time 3D games on a QVGA (320x240) display. More
details of the rendering engine are discussed in chapter 3.
2.3.3 Graphics Memories
Since the DRAM is integrated together with the rendering logic in this
architecture, we can optimize each memory for the corresponding operation,
instead of using the conventional SDRAM of PC graphics
architectures. To reduce the power consumption of the embedded DRAMs as well
as to make optimal use of their bandwidth, I propose three different DRAM types:
Frame Buffer, Depth Buffer, and Texture Memory. As described in table 2.3-1, the
characteristics of each memory are optimized according to the operation
requirements. In order to provide the pixels for depth comparison and alpha
blending, the frame and depth buffers support a read-modify-write data transaction
in a single cycle with separate read and write buses. This drastically simplifies the
memory interface of the rendering engine and the pipeline, because the data
required to process two pixels are read from the frame and depth buffers,
calculated in the pixel processors, and written back to the buffers
within a single clock period without any latency. Therefore, caching
[33, 36, 37] and prefetching [38], which may cause power and area overhead, are
not necessary in this architecture. An example of the operation timing of the frame or
depth buffer is shown in fig. 2.3-2. The Write-Mask signal, which is generated by
the pixel processor, decides whether the write operation is activated. Non-multiplexed
addressing enables the DRAM to activate only the necessary
wordline block, which saves power inside the memory [6-8].
To draw pixels on a 256x256 screen, which covers the resolution of most
current cell phones, 4 frame macros and 4 depth macros are used in the chip.
Also, 4 texture macros, or 24Mb, store enough MIPMAP texture images for
3D game applications; this capacity is equivalent to 12 24bit 256x256
MIPMAP textures or 48 24bit 128x128 MIPMAP textures. Therefore, the use of
graphics DRAMs can completely eliminate the need for external texture
memory. These memories are distributed, enabling the rendering engine to use
only the necessary memories and thus to reduce the power consumption. The embedded
DRAM can operate at scalable clock frequencies ranging from 5MHz to 50MHz
to match the speed of the rendering logic, providing up to 2.4GByte/s of bandwidth
over the 416bit-wide bus. The configuration and activation of the graphics memories
are discussed in detail in section 3.5.
 | Frame Buffer | Depth Buffer | Texture Memory
TRC | 20ns | 20ns | 20ns
Macro Size | 768Kbits | 512Kbits | 6Mbits
I/O Interface | 24bit read, 24bit write | 16bit read, 16bit write | 24bit I/O
Commands | Read-Modify-Write, Read, Write, Auto-Refresh | Read-Modify-Write, Read, Write, Auto-Refresh | Read, Write, Auto-Refresh
Latency | 0 | 0 | 1
[Table 2.3-1 : Characteristics of Embedded DRAM]
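The capacities and bus width quoted in this section can be cross-checked against the macro parameters in table 2.3-1. The sketch below is a consistency check under stated assumptions rather than the sizing method actually used: it assumes 1Kbit = 1024 bits, four macros of each type as stated in the text, and a MIPMAP chain that goes down to 1x1; all names are hypothetical.

```python
KBIT = 1024
MBIT = 1024 * KBIT

# (macro size in bits, I/O width in bits, number of macros)
MACROS = {
    "frame":   (768 * KBIT, 24 + 24, 4),   # separate 24bit read and write buses
    "depth":   (512 * KBIT, 16 + 16, 4),   # separate 16bit read and write buses
    "texture": (6 * MBIT,   24,      4),   # shared 24bit I/O
}

total_bits = sum(size * count for size, _, count in MACROS.values())
bus_bits = sum(width * count for _, width, count in MACROS.values())


def mipmap_bits(side: int, bits_per_texel: int = 24) -> int:
    """Bits for a square texture plus all smaller MIPMAP levels down to 1x1."""
    total = 0
    while side >= 1:
        total += side * side * bits_per_texel
        side //= 2
    return total


texture_bits = MACROS["texture"][0] * MACROS["texture"][2]
textures_256 = texture_bits // mipmap_bits(256)
textures_128 = texture_bits // mipmap_bits(128)
frames_256 = (MACROS["frame"][0] * MACROS["frame"][2]) // (256 * 256 * 24)

print(total_bits // MBIT, "Mb total,", bus_bits, "bit bus,",
      textures_256, "or", textures_128, "MIPMAP textures,", frames_256, "frames")
```

Under these assumptions the twelve macros add up to 29Mb and a 416bit aggregate bus, the frame macros hold two 24bit 256x256 frames, and the texture macros hold 12 or 48 MIPMAP textures, matching the numbers in the text.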
[Fig. 2.3-2 : Timing Diagram of Frame/Depth Buffer]
(Signals shown: Clock, CMD & ADDR, Read-bus, Write-bus, Write-Mask, and the
internal operation (precharge, activate and read, modify, hold, write). The
Write-Mask is decided by the pixel processor; a masked write leaves the buffer
without update.)
2.3.4 Power Management Unit
The Programmable Power Optimizer (PPO) manages the power consumption of the
chip. Each clock can be selectively turned on or off, and its frequency can be scaled
by the software program or by hardware buttons to adjust the frame rate at
run-time, as illustrated in fig. 2.3-3. RISCclk and BEQclk run at the full speed of the
RISC core, and REclk and MEMclk operate at a quarter of that frequency:
132/33MHz (RISCclk/REclk) for FAST mode, 66/16.5MHz for NORMAL, and
33/8.25MHz for SLOW. The PPO provides zero-latency frequency scaling to
allow abrupt switching of the operating frequencies during the execution of software.
The transition from SLOW mode to FAST mode can be completed quickly without
any hazard.
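The relation between the three modes and the four clock domains can be written down as a small table-driven sketch. This is only a software model of the behavior described above: the function and dictionary names are hypothetical, and the real PPO register interface is not described here.

```python
RISC_FULL_MHZ = 132.0
MODE_DIVIDER = {"FAST": 1, "NORMAL": 2, "SLOW": 4}


def clock_frequencies(mode: str, gated=()):
    """Return the frequency (MHz) of each clock domain for a mode.

    `gated` lists clock names that are switched off (block-level clock gating).
    """
    risc = RISC_FULL_MHZ / MODE_DIVIDER[mode]
    clocks = {
        "RISCclk": risc,
        "BEQclk": risc,
        "REclk": risc / 4,     # rendering engine runs at a quarter of RISCclk
        "MEMclk": risc / 4,    # embedded DRAM follows the rendering engine
    }
    for name in gated:
        clocks[name] = 0.0
    return clocks


normal = clock_frequencies("NORMAL")
standby = clock_frequencies("SLOW", gated=("REclk", "MEMclk"))
```

Keeping the 4:1 ratio between RISCclk and REclk in every mode means the BEQ sees the same relative producer and consumer speeds regardless of the selected frame rate.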
[Fig. 2.3-3 : Operation of Programmable Power Optimizer]
(The PPO, under software or hardware control, applies frequency scaling and
block-level clock gating to the RISC, BEQ, 3DRE, and DRAM clocks. Clock
frequencies in MHz: FAST (1x) 132 and 33, NORMAL (1/2x) 66 and 16.5, SLOW (1/4x)
33 and 8.25.)
Chapter 3 Low-Power Rendering Engine
3.1 3D Rendering Engine
Fig. 3.1-1 shows the block diagram of the 3D rendering engine (3DRE). It consists
of a SlimShader, a Memory Programmer (MP), and a dozen rendering DRAMs.
The SlimShader performs the main rendering operations such as texturing, shading,
blending, and depth comparison. The MP makes special effects such as
antialiasing, motion blur, and fog programmable by software. The 29Mb
rendering DRAM contains frame buffers, depth buffers, and texture memories.
The 12 independently controlled DRAMs reduce the power consumption, since
only the necessary memories are selectively activated. The 3DRE can accelerate
the drawing of points, lines, and rectangles for 2D graphics as well.
[Fig. 3.1-1 : Low-Power 3D Rendering Engine]
(Blocks shown: SlimShader (command registers, pipe control, triangle setup engine
(TSE), two pixel processors PP0/PP1 each with interpolation / depth comparison,
texture address, texture filter, and pixel blending, and the address alignment
logic (AAL)); Memory Programmer with SIMD datapath; four 512kb depth buffers
(DB0-DB3), four 768kb frame buffers (FB0-FB3), four 6Mb texture memories
(TM0-TM3); BEQ and RISC interface; LCD interface with 24b display output.)
3.2 SlimShader : Main Rendering Pipeline
3.2.1 Instruction Set Architecture
In order to execute the rendering programs and to control the datapath, 13
128bit-encoded instructions are defined. Since transferring the vertices takes
most of the rendering cycles, the instruction format is optimized for this operation,
RDAT. As shown in fig. 3.2-1, the instruction is a 128bit fixed format so that the
whole vertex information is transferred in every single rendering cycle.
Therefore, colors (R, G, B, A), screen coordinates (X, Y), screen depth (Z), and
homogeneous texture coordinates (u, v, 1/w) are transferred together with the
command information. Each color component (R, G, B, A) is represented by an 8bit
integer to support true-color rendering with alpha blending. Each screen
coordinate (X, Y) is an 8bit integer to cover the 256x256 screen resolution. Each
homogeneous texture coordinate (u, v, 1/w) is represented in a 16bit fixed-point
format (8bit integer + 8bit fraction) to preserve the necessary dynamic range and
precision for texture calculation.
128bit instruction: CMD [127:96] | DATA0 [95:64] | DATA1 [63:32] | DATA2 [31:0]
CMD word: MODE [31:28] (Processor Mode) | TYPE [27:22] (Instruction Type) |
OP1 [21:20] (OP-Code 1) | OP2 [19:16] (OP-Code 2) | EXTRA [15:0] (Extra Command)
[Fig. 3.2-1 : Instruction Format]
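As an illustration of the instruction format, the sketch below packs one RDAT instruction into a 128bit integer. The bit positions of the CMD fields follow fig. 3.2-1, and the contents of EXTRA and DATA0 to DATA2 follow the RDAT entry of table 3.2-1. Two points are assumptions: within each data word the first-listed field is placed in the most significant bits, and the encodings of OP1 (TRI) and OP2 (POS) are not given here, so they are passed in as plain numbers. The function names are hypothetical.

```python
MODE_NORMAL = 0b1111
TYPE_RDAT = 0b100000


def pack_rdat(op1, op2, inv_w, u, v, x, y, z, a, r, g, b) -> int:
    """Pack one vertex into a 128bit RDAT instruction (all fields unsigned)."""
    fields = [(op1, 2), (op2, 4), (inv_w, 16), (u, 16), (v, 16),
              (x, 8), (y, 8), (z, 16), (a, 8), (r, 8), (g, 8), (b, 8)]
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"value {value} does not fit in {width} bits")
    cmd = (MODE_NORMAL << 28) | (TYPE_RDAT << 22) | (op1 << 20) | (op2 << 16) | inv_w
    data0 = (u << 16) | v                          # 8.8 fixed-point u and v
    data1 = (x << 24) | (y << 16) | z              # screen X, Y and 16bit depth
    data2 = (a << 24) | (r << 16) | (g << 8) | b   # A:R:G:B
    return (cmd << 96) | (data0 << 64) | (data1 << 32) | data2


def field(instruction: int, high: int, low: int) -> int:
    """Extract bits [high:low] of the 128bit instruction."""
    return (instruction >> low) & ((1 << (high - low + 1)) - 1)


word = pack_rdat(op1=0, op2=0, inv_w=0x0100, u=0x1280, v=0x0340,
                 x=100, y=50, z=0xABCD, a=255, r=10, g=20, b=30)
```

On a 32bit geometry engine such a word would be written as four 32bit stores, which is what the multiple register transfer mentioned in the text provides.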
Although using 128bit fixed-length instructions rather than variable-length
packets may waste bandwidth for other operations such as RBUF,
RCLR, TSTR, and ASTR, which occur less often than RDAT, it simplifies the
design of the decoding and control unit. Fetching one vertex every cycle
enables the rendering engine to continuously calculate two pixels per cycle for
strips and fans of even smaller triangles. The 128bit instruction additionally
requires the bandwidth equalizer, which is described in section 2.3.1, to adapt to the
32bit geometry engine in this graphics SoC. However, it also means that the SlimShader
can be attached to any other geometry engine by changing the design of the bandwidth
equalizer, without touching the SlimShader rendering core. The 128bit
instruction can be easily transferred from the ARM9 geometry engine by using
the multiple register transfer instruction [39].
The number of instructions is chosen to support a subset of the OpenGL
rendering operations, since OpenGL provides many high-level functions, such as
trilinear texture filtering, non-linear fog, and some blending modes, which are
rarely used in real gaming applications. Additional instructions to support
real-time special rendering effects, to control the embedded DRAMs, and to
manage the standby power are also defined. The instructions and supported
functions are listed in tables 3.2-1 and 3.2-2, respectively. The Power Control
instructions control the refresh commands of the embedded DRAMs; more
details are discussed in section 3.5.
Type | Instruction | OP Code | Description
Power Control | XXXX | MODE = 1111 | Normal Mode
Power Control | PHLD | MODE = 1011 | Hold
Power Control | PIDL | MODE = 0011 | Idle
Power Control | PSLP | MODE = 0001 | Sleep
Power Control | POFF | MODE = 0000 | Off
- | RDAT | MODE = 1111, TYPE = 1000 00, OP1 = TRI, OP2 = POS, EXTRA = W, DATA | Fetch Vertex Data: W[16b] = 1/w, DATA0[16b:16b] = u:v, DATA1[8b:8b:16b] = X:Y:Z, DATA2[8b:8b:8b:8b] = A:R:G:B
- | RBUF | MODE = 1111, TYPE = 1000 01, OP1 = FB, OP2 = ZB | Set Front Buffer Rendering
- | RCLR | MODE = 1111, TYPE = 1000 10, OP1 = FZ | Clear Front Buffer with All-Zero
Texture | TSTR | MODE = 1111, TYPE = 0100 00, OP1:OP2:EXTRA = ADDR, DATA0 | Store Texture Map: ADDR[22b] = Texture Address, DATA0[Xb:8b:8b:8b] = R:G:B
Texture | TMOD | MODE = 1111, TYPE = 0100 01, OP1:OP2:EXTRA = ADDR, DATA0 = BLND:FILT:ID:LOD:SIZE | Set Texture Mode: ADDR[22b] = Base Address, BLND[4b] = Blending Mode, FILT[4b] = MIPMAP Filtering, ID[8b] = Texture ID, LOD[4b] = LOD Bias, SIZE[12b] = Texture Size
Texture | TF2T | MODE = 1111, TYPE = 0100 10, OP1:OP2:EXTRA = ADDR | Transfer FB to TM: ADDR[22b] = Base Texture Address (Front Buffer Contents are Transferred)
Auxiliary | ASTR | MODE = 1111, TYPE = 0010 00, OP1 = FZ, EXTRA = ADDR, DATA0 | Store Data to Front Buffer FZ: ADDR = FZB Address, DATA0[Xb:8b:8b:8b] = R:G:B (only G:B is stored into Z-Buffer)
MP | MRPG | MODE = 1111, TYPE = 0001 00, EXTRA = MOP | Load Memory Program: MOP[16b] Defined by Memory Programmer ISA
[Table 3.2-1 : Instruction Set]
Category | Supported Features
Screen Resolution | 256 x 256
Color Depth | 24bit True Color
Shading | Triangle Fan and Strip Support; Gouraud Shading; Pixel Alpha Blending; Texture Blending; Programmable Shading through MemoryProgrammer™
Hidden Surface Removal | 16bit Hardware Accelerated Double Z-Buffer
Texture Mapping | Perspective Correct Texture Address Calculation; Power-Efficient Texture Fetch through Address Alignment Logic™; LOD Bias; Texture Filtering (MIPMAP, No MIPMAP, Point sampling, Bilinear); Allowable Texture Size : 2x2 ~ 256 x 256 (Power of 2); Maximum Number of Textures : 255
Special Features | 2D Graphics Acceleration (Line, Triangle, Rectangular Acceleration); Memory Programmer™ (Post Rendering Processing with FB and ZB, Linear Expression Evaluator); Special Rendering Effects (Antialiasing, Motion Blur, Artistic Trajectory, Other Special Rendering Effects)
Power Management | Scene-Dependent Clock Variation Control (FAST : 33MHz, NORMAL : 16.5MHz, SLOW : 8.25MHz); Instruction-Level Power Management Control (Normal : Normal Rendering Operation, Hold : Waiting for Geometry Pipeline, Idle : No Operation with FB, ZB, TM Refresh, Sleep : No Operation with TM Refresh only, Off : No Operation without eDRAM Refresh)
[Table 3.2-2 : Supported Rendering Features]
3.2.2 Low-Power Pipeline Structure
Fig. 3.2-2 shows the main rendering pipeline with the attached graphics memories,
and table 3.2-3 describes its operation. It is composed of 14 pipeline
stages so that power can be saved by activating only the
necessary stages. The graphics memories are accessed at distributed
pipeline stages: the depth buffer at the PI stage, the texture memory at the TP2 stage, and
the frame buffer at the PB stage. Since each pipeline stage is designed as a module with
its own controller, additional rendering features can be easily inserted in the next
revision without modifying the entire pipeline. After fetching the instructions, the
rendering engine shapes the triangle and varies the number of operation cycles in the
following stages according to the size (HOLD#1) and the shape (HOLD#2) of the triangle
by pausing the preceding pipeline stages. As the pipeline example in fig.
3.2-3 shows, the rendering engine can calculate 2 pixels every cycle.
[Fig. 3.2-2 : Main Rendering Pipeline]
(Front-pipe: IF, ID1, ID2, TS, EP, HS. Back-pipe, one per pixel processor (PP#0
and PP#1): PI, TA1, TA2, TP1, TP2, TP3, TF, PB. HOLD #1 and HOLD #2 pause the
preceding stages. The PPO supplies REclk with clock gating, and the depth buffer,
texture memory, and frame buffer are attached to the back-pipe.)
Shaping the triangle is accelerated in the TS stage, which performs horizontal-
order rasterization (scanline-based rasterization) as in fig. 3.2-4. Although this
rasterization simplifies memory addressing and pipeline control, the rendering
performance can be degraded when the triangle falls across DRAM pages in
the conventional DRAM architecture [40, 44, 46]. Therefore, I re-defined the
timing of the graphics DRAM and assigned the frame and depth buffers in a vertical-
stripe pattern, instead of prefetching data from a standard SDRAM. Since the row
of the DRAM can be changed without any latency at a 50MHz random row cycle
(TRC=20ns) and each memory (A or B) has its own read/write ports as described
in table 2.3-1, the graphics DRAM can continuously provide the bandwidth
required to process two pixels together. This rasterization order also reduces the
power consumption, since only the memories corresponding to the necessary
pixels are activated. Fig. 3.2-5 shows the rasterization order of GeForce4 [44],
where the 2x2 tiles are traversed in a memory-page-friendly order to make maximum
use of the column access of the external frame buffer. However, unnecessary pixels
are also transferred through the capacitive memory bus, wasting
power and bandwidth. Although this tiled rasterization order is known to
improve texture cache performance by reducing the miss rate [33], it has
little effect on the texturing performance of the proposed rendering engine, in which
no cache is implemented.
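The effect of the vertical-stripe assignment can be illustrated with a short sketch. It assumes that even screen columns map to memory A and odd columns to memory B (the actual polarity is not stated) and that a scanline span is processed two horizontally adjacent pixels per cycle; the function names are hypothetical.

```python
def memory_for_pixel(x: int) -> str:
    """Vertical-stripe assignment: columns alternate between memories A and B."""
    return "A" if x % 2 == 0 else "B"      # assumed polarity: even column -> A


def span_accesses(x_start: int, x_end: int):
    """Split a scanline span [x_start, x_end] into per-cycle pixel pairs."""
    cycles = []
    x = x_start
    while x <= x_end:
        pair = [x] if x == x_end else [x, x + 1]
        cycles.append([(p, memory_for_pixel(p)) for p in pair])
        x += 2
    return cycles


for cycle in span_accesses(3, 7):
    print(cycle)
```

Whatever column the span starts on, the two pixels of a cycle always fall into different memories, so both can be served in the same cycle through their own ports, and a single leftover pixel touches only one memory.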
Since the rendering engine contains two pixel processors (PP) and each PP has
its own texture unit fetching 4 texels/cycle, the pixel fill rate and the texel rate
are 100Mpixels/s and 400Mtexels/s at 50MHz, respectively. The two pixel
processors are simply assigned to render horizontally adjacent pixels. This makes it
easy to gather texture addresses, which is exploited by the energy-efficient
texture unit covered in the next section.
In order to eliminate the power consumption of unused blocks as much as
possible, the datapath transition is controlled by clock-gating and latch-enabling.
I put the depth-compare unit into an earlier pipeline stage (inside the PI stage) and
apply a depth-first clock-gating (DFCG) scheme in order to reduce the power
consumption, as shown in fig. 3.2-6. In 3D graphics, if a new pixel to be drawn is
already covered by an old pixel that is nearer to the view point, the new pixel
does not need to be processed further. DFCG prevents the unnecessary
shading and texturing by gating off the clock in the remaining datapath according
to the result of the depth comparison. It also eliminates the unnecessary requests
to the corresponding memories. Besides, the pipeline latches of the shading and
texturing units can be independently enabled or disabled to avoid
unnecessary datapath transitions as much as possible. DFCG violates the
OpenGL semantics [14], which do not allow the depth buffer to be updated before
texture mapping because textured pixels may be completely transparent; however, this
violation can be resolved by removing such triangles in software prior to the rendering
operation.
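The idea behind DFCG can be modeled in software as a depth test that runs before any shading or texturing work. The sketch below is a behavioral illustration only: it assumes that a smaller depth value means nearer to the view point, uses a hypothetical callback in place of the shading and texturing datapath, and counts how many pixels would have their remaining pipeline stages gated off.

```python
def render_pixels(pixels, depth_buffer, shade_and_texture):
    """Depth-first processing: test depth first, then shade only visible pixels.

    pixels           -- iterable of (x, y, z, attributes)
    depth_buffer     -- 2D list indexed as depth_buffer[y][x]
    shade_and_texture -- callback standing in for the shading/texturing datapath
    Returns (executed, gated) pixel counts.
    """
    executed = gated = 0
    for x, y, z, attributes in pixels:
        if z < depth_buffer[y][x]:          # assumed: smaller z is nearer
            depth_buffer[y][x] = z          # depth written before texturing
            shade_and_texture(x, y, attributes)
            executed += 1
        else:
            gated += 1                      # remaining stages would be clock-gated
    return executed, gated


FAR = 0xFFFF                                # 16bit depth buffer cleared to farthest
depth = [[FAR] * 4]
drawn = []
stream = [(0, 0, 100, "red"), (0, 0, 200, "green"), (1, 0, 50, "blue"), (0, 0, 80, "white")]
counts = render_pixels(stream, depth, lambda x, y, attrs: drawn.append((x, y, attrs)))
```

Because the depth value is written before texturing, a fully transparent textured pixel would still occlude later pixels in this model, which is the OpenGL deviation noted above.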
Pipe | Description
IF | Instruction Fetch, Main Power Control
ID1 | Instruction Decode #1
ID2 | Instruction Decode #2, Triangle Shaping
TS | Triangle Setup
EP | Edge Processor
HS | Horizontal Setup, Span Generation
PI | Pixel Interpolation, Depth Comparison, Depth-Buffer Interface, Clock Gating Control
TA1 | Texture Address #1, LOD Calculation, 1/w Division
TA2 | Texture Address #2, Address Merging
TP1 | Texture Prefetch #1, Bank Address Aggregation, Texture Memory Command Generation
TP2 | Texture Prefetch #2, Texture Memory Read
TP3 | Texture Prefetch #3, Texture Data Alignment, Reverse Procedure of Address Alignment
TF | Texture Filter
PB | Pixel Blending
[Table 3.2-3 : Pipeline Description]
[Fig. 3.2-3 : Example of Pipeline Timing]
(Timing of vertices V1 to V8 through the front-pipe stages IF to HS and of their
pixels through the back-pipe stages PI to PB on 2 PPs, showing HOLD #1 and HOLD #2
stalls, a DRAM refresh (RF), and a texture bank conflict (BC).)
[Fig. 3.2-4 : Rasterization Order and Frame/Depth Buffer Assignment]
(Screen columns alternate between memory A and memory B; pixels are traversed in
scanline order, and only the necessary pixels are transferred over local bus A and
local bus B in each 20ns cycle.)
[Fig. 3.2-5 : Rasterization Order of GeForce4]
(2x2 tiles are traversed in page-friendly order from an external SDRAM with a
70ns access; unnecessary pixels are transferred together over the memory bus,
wasting power and bandwidth.)
[Fig. 3.2-6 : Datapath Transition Control]
(PI stage of PP0 and PP1: depth interpolation and depth compare against the 16bit
old depth read from the depth buffers through a crossbar; the result drives the
write mask, gates REclk for the shading and texturing datapaths (clock-gating), and
controls the latch enables of the color / screen coordinate and texture coordinate
interpolation toward the next pipeline stage.)
3.2.3 Triangle Setup Engine
To render triangles with a modified Bresenham's incremental line drawing
algorithm [47], the positions of the input vertices must be identified, and the
increments of colors and coordinates must be calculated early in the rendering
pipeline, inside the Triangle Setup Engine. Although triangle setup in 3D
graphics took more than 7,000 cycles when it was calculated by the general-purpose
RISC processor, the previous implementations [6-8] did not contain a
hard-wired setup engine because of its logic complexity. In this work, however, I
simplify the algorithm, optimize the precision of the datapath, and implement the
triangle setup engine (TSE), which contains three 9-way SIMD SUBs, three 8-way
SIMD DIVs, and a mid-point interpolation unit, inside the 3DRE to
enhance the overall 3D performance, as shown in fig. 3.2-7. SORT_T2B sorts the 3
vertices from top to bottom by subtracting the vertices from each other and checking
the sign of the results. Then, VERT_DIV calculates ∆(X,Z,R,G,B,U,V,W)/∆Y. At the last
stage, MID_INTPL checks the type of triangle (P0 or P1) by comparing the mid-point
of the longest edge with the interpolated point. The total calculation time from the
vertex register to the final MUX is less than 20ns, and it determines the maximum
operating frequency of the rendering engine, 50MHz. Therefore, the triangles do not need
to be pre-clipped anymore, unlike in the previous implementations [6-7].
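The three setup steps can be sketched in software as follows. This is a behavioral illustration under assumptions, not the hardware algorithm: it uses exact rational arithmetic instead of the fixed-point datapath, assumes that a smaller Y is nearer the top of the screen, and reports the triangle type only as whether the middle vertex lies to the left of the longest edge, since the mapping to P0 and P1 is not spelled out here. All function names are hypothetical.

```python
from fractions import Fraction

ATTRS = ("x", "z", "r", "g", "b", "u", "v", "w")   # the eight lanes divided by dY


def make_vertex(x, y, **attrs):
    vertex = {name: 0 for name in ATTRS}
    vertex.update(attrs)
    vertex["x"], vertex["y"] = x, y
    return vertex


def sort_top_to_bottom(v0, v1, v2):
    """SORT_T2B: order by Y using the sign of pairwise subtractions."""
    a, b, c = v0, v1, v2
    if a["y"] - b["y"] > 0:
        a, b = b, a
    if b["y"] - c["y"] > 0:
        b, c = c, b
    if a["y"] - b["y"] > 0:
        a, b = b, a
    return a, b, c


def edge_gradients(start, end):
    """VERT_DIV: per-scanline increments along one edge (None for a flat edge)."""
    dy = end["y"] - start["y"]
    if dy == 0:
        return None
    return {name: Fraction(end[name] - start[name], dy) for name in ATTRS}


def triangle_setup(v0, v1, v2):
    top, mid, bottom = sort_top_to_bottom(v0, v1, v2)
    long_edge = edge_gradients(top, bottom)
    mid_is_left = None
    if long_edge is not None:
        # MID_INTPL: X of the longest edge at the scanline of the middle vertex
        x_on_long_edge = top["x"] + long_edge["x"] * (mid["y"] - top["y"])
        mid_is_left = mid["x"] < x_on_long_edge
    return {
        "vertices": (top, mid, bottom),
        "long": long_edge,
        "upper": edge_gradients(top, mid),
        "lower": edge_gradients(mid, bottom),
        "mid_is_left": mid_is_left,
    }


setup = triangle_setup(make_vertex(0, 10, r=255),
                       make_vertex(10, 0),
                       make_vertex(20, 20, r=55))
```

The three edges correspond to the three SIMD dividers, and each divider shares one divisor across its eight lanes, which is what makes a shared reciprocal look-up attractive.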
In order to develop applications quickly for mobile 3D graphics, the model
data may be shrunk from the PC platform, where triangles are optimized for
a higher screen resolution (640x480, 1024x768 or more), to the mobile platform, which
has a much lower screen resolution (176x144 or 320x240). Therefore, the average
number of pixels inside a triangle will be smaller in mobile 3D, which means that
the setup time may become the bottleneck of pixel throughput. Although the exact
latency and throughput of the triangle setup engine are not disclosed for
conventional high-end processors [44, 46], they are more than one cycle and vary
from triangle to triangle. In contrast, the proposed setup engine is designed to
ensure that the triangle-setup time is always shorter than the pixel-filling time, even
for a small triangle: triangle setup takes one cycle without latency.
[Fig. 3.2-7 : Triangle Setup Engine]
(Vertex registers #0 to #2 feed SORT_T2B (3 x 9-way SIMD SUBs), VERT_DIV (3 x
8-way SIMD DIVs), and MID_INTPL. Each SIMD divider looks up 1/dY in a 256 entry
LUT (8bit mantissa, 3bit exponent) and uses MUL9 multipliers for dX, dR, dG, dB,
MUL17 multipliers for dZ, dU, dV, d(1/W), and one shifter per lane to produce
dX/dY, dZ/dY, dR/dY, dG/dY, dB/dY, dU/dY, dV/dY, and d(1/W)/dY.)
During the setup calculation, insufficient precision can cause significant
degradation of image quality, since the errors are accumulated in the following
stages. Even though inaccurate colors are tolerable to the eye, inaccurate
coordinates lead to distortion in shape. Therefore, high-end graphics platforms use
a floating-point datapath for the setup operation. However, using a conventional
floating-point divider inside the SIMD datapath of a mobile graphics SoC can be
an overhead in terms of area and power consumption, since the screen resolution
is limited. Therefore, the proposed engine uses fixed-point arithmetic instead. Once
colors and coordinates are fed into the rendering engine, they are calculated and
stored as fixed-point numbers. However, when division operations are performed
in the TSE, the data are temporarily processed by a floating-point divider. Because the
TSE requires three 8-way SIMD dividers, the dividers can take a significant amount
of silicon area. For the SIMD divider, using multipliers with a LUT is the inevitable
choice considering the power and the area, since all lanes of a divider share the
divisor, ∆Y. The 8-way SIMD divider is built from 8 integer multipliers, 8 shifters,
and one floating-point LUT (Look-Up Table).
Here, optimizing the datapath width is important to implement the TSE with
a small number of gates while preserving the necessary precision. The
derivatives of the setup operation can be written as follows:
$$\Delta P = \Delta x / \Delta y = \Delta x \times (1/\Delta y)$$
(∆P = derivative, ∆x = dividend, ∆y = divisor)
Here, × and 1/∆y can be implemented by a multiplier and a LUT, respectively.
Since the error of ∆P is accumulated through the incremental shading datapath,
the fraction of 1/∆y must keep the required precision: an m-bit fraction is
required for an m-bit screen resolution [48]. An insufficient number of fraction bits
results in noticeable distortion, as shown in fig. 3.2-8. Also, because ∆y varies
from 2 (2^1) to 255 (2^8-1), 1/∆y changes from 0.5 to 0.003921568..., requiring an 8bit
dynamic range to hold the MSB position. Therefore, a 16bit width is necessary to
store the dynamic range and the fraction of 1/∆y. Cutting off the LSBs of
1/∆y can distort the images, as shown in fig. 3.2-9. However, storing 16bit fixed-point
values in the LUT and calculating the corresponding data with the multipliers can be
an area burden for mobile applications. Estimating the gate count of three 8-way
SIMD dividers, 16bit fixed-point division takes about 54,780 gates, which is
even slightly larger than the ARM9 processor (about 50k gates) in the
geometry engine. The total area of the dividers is estimated with the following
calculation, and the results are summarized in fig. 3.2-10.
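$$
\begin{aligned}
AREA_{LUT11(FLOAT)} &= 3 \times (AREA_{LUT_{11}} + 4 \times AREA_{MUL_{9 \times 8}} + 4 \times AREA_{MUL_{17 \times 8}} + AREA_{SHIFTERS}) \\
AREA_{LUT8(FIXED)} &= 3 \times (AREA_{LUT_{8}} + 4 \times AREA_{MUL_{9 \times 8}} + 4 \times AREA_{MUL_{17 \times 8}}) \\
AREA_{LUT11(FIXED)} &= 3 \times (AREA_{LUT_{11}} + 4 \times AREA_{MUL_{9 \times 11}} + 4 \times AREA_{MUL_{17 \times 11}}) \\
AREA_{LUT16(FIXED)} &= 3 \times (AREA_{LUT_{16}} + 4 \times AREA_{MUL_{9 \times 16}} + 4 \times AREA_{MUL_{17 \times 16}})
\end{aligned}
$$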
Therefore, the 8bit dynamic range and the 8bit fraction are stored separately in the
LUT as floating-point numbers. All leading zeros of 1/∆y are removed, and only the
meaningful 8bit integer after the leading zeros and the 3bit location of the
corresponding fractional point are stored in the LUT as a mantissa and an exponent,
respectively. Although the shifters at the last stage of the floating-point LUT division,
LUT11(FLOAT), increase the gate count by 14%, the total area of the three SIMD
dividers is smaller than that of the 16bit fixed-point LUT divider, LUT16(FIXED),
by 40%. The area is also 15% smaller than that of LUT11(FIXED), while
suppressing unwanted image distortion. FLOAT(SINGLE) shows the image
directly calculated by a standard floating-point datapath supporting IEEE-754
single precision, without using the shared LUT and multipliers. The image of
LUT11(FLOAT) is comparable to that of FLOAT(SINGLE), while the power and
the area are reduced by 95% and 85%, respectively.
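The mantissa/exponent storage can be illustrated in a few lines. This is only a sketch: it assumes 1/∆y is first held as a 16bit fraction and that the exponent is the number of leading zeros; the function names are hypothetical and the actual LUT contents are not given in the text.

```python
FRAC_BITS = 16  # assumed width of the fixed-point reciprocal


def encode_reciprocal(dy):
    """Return (mantissa, exponent) for 1/dy: 8 significant bits plus leading-zero count."""
    fixed = (1 << FRAC_BITS) // dy                 # 1/dy as a 16bit fraction (truncated)
    exponent = FRAC_BITS - fixed.bit_length()      # number of leading zeros
    mantissa = (fixed << exponent) >> (FRAC_BITS - 8)  # top 8 bits after the leading zeros
    return mantissa, exponent


def decode_reciprocal(mantissa, exponent):
    """Rebuild the 16bit fraction by shifting the mantissa back into place."""
    return (mantissa << 8) >> exponent


for dy in (2, 3, 100, 255):
    m, e = encode_reciprocal(dy)
    print(dy, m, e, decode_reciprocal(m, e) / (1 << FRAC_BITS))
```

Under these assumptions every ∆y from 2 to 255 needs at most 7 leading zeros, which is why a 3bit exponent is enough, and the mantissa always has its MSB set.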
(Panels: 0-bit fraction, 4-bit fraction, and 8-bit fraction (proposed))
[Fig. 3.2-8 : Fractional Point]
(Panels: FLOAT (Single), IEEE-754 single precision; LUT8 (FIXED), 8bit fixed (8bit integer); LUT11 (FIXED), 11bit fixed (8bit integer + 3bit fraction); LUT16 (FIXED), 16bit fixed (8bit integer + 8bit fraction); LUT11 (FLOAT), proposed, 11bit float (8bit mantissa + 3bit exponent))
[Fig. 3.2-9 : Precision of LUT]
(a) Area of Each Block (number of gates)
Bit-width    MUL17    MUL9     LUT
8            1,374    797      430
11           1,890    1,100    590
16           2,750    1,600    860

(b) Total Area of Three 8-way SIMD Dividers (number of gates)
Divider         LUT    MUL       SHIFT    SIMD Div    TOTAL
LUT11(FLOAT)    590    8,684     1,507    10,781      32,343
LUT8(FIXED)     430    8,684     0        9,114       27,342
LUT11(FIXED)    590    11,960    0        12,550      37,650
LUT16(FIXED)    860    17,400    0        18,260      54,780
(LUT11(FLOAT) gives a 40% area reduction relative to LUT16(FIXED).)
[Fig. 3.2-10 : Divider Area in Triangle Setup Engine]
3.3 Energy-Efficient Texturing Unit
3.3.1 Consideration of Energy Efficiency
(Plot of power consumption over Frame #1 to Frame #3, annotated with: Low-Power (Run-time), High Performance, Low-Power (Standby Power), and Energy)
[Fig. 3.3-1 : Power and Energy Consumption]
Reducing the power consumption is sometimes believed to be the ultimate goal
of designing circuits for mobile applications. However, this is not always true. The
amount of energy consumed stays the same when the power consumption is
cut in half but the calculation time is doubled, since the energy
consumption is the product of the power and the operation time. Because mobile
devices are driven by a battery, reducing the energy consumption is the key to
extending the operation lifetime. Fig. 3.3-1 shows the power and energy
consumption when the texturing unit renders consecutive frames. Once the frame
rate is fixed, the rendering engine waits, after drawing in the frame slot, until
the next frame starts, since the job assigned to each frame is finite. Therefore,
reducing the operation time by achieving high performance, as well as
reducing the operation power, must be taken into account for long-lasting
operation. Also, suppressing the standby current is necessary to minimize the
overall energy consumption. Therefore, I propose two schemes to achieve high
performance while keeping the power consumption low: 1) Approximation of
perspective division, and 2) Address alignment logic.
Even though the screen resolution of the target PDA is limited, the rendering
quality itself cannot be sacrificed much. The rendering engine must calculate the
pixels correctly, within the required power budget, at a high pixel
fill rate. Therefore, the SlimShader contains two texture units, each of which
supports perspective-correct address calculation [42] and bilinear MIPMAP
texture filtering [41].
3.3.2 Approximation of Perspective Division
In the calculation of the perspective-correct texture address, per-pixel division is
required, and this operation can be described by the following equations [42].
U = u/(1/w) and V = v/(1/w), ……….… [Eq. 1]
0 ≤ (U, V) ≤ 1, …………………… [Eq. 2]
(1/w) ≥ u, (1/w) ≥ v, ………… [Eq. 3]
(where (u, v, 1/w) and (U, V) are the homogeneous texture addresses and the texture
addresses, respectively)
Direct calculation of the above equations in a single cycle is difficult even in
high-end 3D graphics systems [46], because of the gate count overhead of the divider.
Although a division-free algorithm was introduced [49], its cycle time varies
depending on the inputs, which means slower pixel throughput and more
complex pipeline control. Therefore, this architecture uses the direct division method
for sustained pixel throughput, keeping the pipeline control simple. Since each
operand (u, v, and 1/w) has 16-bit precision in the datapath, a 16-bit/16-bit divider
is required to calculate the perspective-correct texture address (U and V).
However, by the definition of the texture address in Eq. 2, the range of
1/w is limited as in Eq. 3. These facts can be used to reduce the power
consumption and the area of the address calculation circuit.
(Bit layouts. 1/w: leading zeros, 8-bit data sent to the 8-bit LUT, and LSBs that become the approximation error. u, v: leading zeros and data before reformatting; shifted data with zero padding after the LSB after reformatting.)
[Fig. 3.3-2 : u, v, 1/w formatting]
The following approximation method enables us to use a 16/8 divider instead of
a 16/16 divider. The 1/w can be represented in binary form as the
composition of leading zeros, m-bit data, and LSBs, as shown in fig. 3.3-2. Since u
and v are always equal to or smaller than 1/w, removing the same number of
leading zeros from u and v still preserves their data. Therefore, we have a chance to use
a divider with a smaller bit-width. In the LUT divider, the bit-width of the divisor (m bits)
decides the LUT area, which in turn may occupy most of the divider area.
However, using only the 8bit data and discarding the LSBs leads to an approximation
error, as described in the following equations:
Let
\[
(1/w)_0,\ u_0,\ v_0 = \text{original } 1/w,\ u,\ v, \qquad (1/w)_a = \text{approximated } 1/w
\]
Then, (1/w)0 and (1/w)a can be represented as follows:
\[
\begin{aligned}
(1/w)_0 &= 0 \cdot 2^{16-L} + a_0 \cdot 2^{16-L-m} + a_e = \text{Leading Zeros} + m\text{-bit Data} + \text{LSBs} \\
(1/w)_a &= 0 \cdot 2^{16-L} + a_0 \cdot 2^{16-L-m} = \text{Leading Zeros} + m\text{-bit Data}
\end{aligned}
\]
\[
1 \le L + m \le 16, \qquad 0 \le a_e \le 2^{16-L-m} - 1
\]
Where,
L = Number of Leading Zeros
m = Number of bit-width of DATA to search LUT
Since a0 is the number after the leading zeros, the MSB of a0 must not be zero, and
it can be written as follows:
\[
a_0 = 2^{m-1} + a_1 \quad (\text{where } 0 \le a_1 \le 2^{m-1} - 1)
\]
The texture coordinates U0 (original) and Ua (approximated) can be written as
follows:
\[
U_0 = \frac{u_0}{(1/w)_0}, \qquad U_a = \frac{u_0}{(1/w)_a}
\]
Thus, the approximation error can be written as follows:
\[
\begin{aligned}
E(1/w) &= \frac{U_a - U_0}{U_0} \\
&= \left( \frac{u_0}{(1/w)_a} - \frac{u_0}{(1/w)_0} \right) \times \left( \frac{(1/w)_0}{u_0} \right) \\
&= \frac{(1/w)_0}{(1/w)_a} - 1 \\
&= \frac{(1/w)_0 - (1/w)_a}{(1/w)_a} \\
&= \frac{a_e}{a_0 \cdot 2^{16-L-m}}
\end{aligned}
\]
Therefore, the maximum error is
\[
E_{MAX}(1/w) = \frac{2^{16-L-m} - 1}{2^{m-1} \cdot 2^{16-L-m}}
\]
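The maximum-error bound is easy to evaluate for candidate bit-widths. The sketch below assumes the bound has the form (2^(16-L-m) - 1) / (2^(m-1) · 2^(16-L-m)), i.e. the largest discarded LSB value divided by the smallest m-bit data value in the same position; the function name is hypothetical.

```python
from fractions import Fraction


def max_error(m, L, width=16):
    """Upper bound of the relative error when only m bits of 1/w are kept."""
    if not 1 <= L + m <= width:
        raise ValueError("L + m must be between 1 and the operand width")
    lsb_span = 1 << (width - L - m)          # number of values the discarded LSBs can take
    return Fraction(lsb_span - 1, (1 << (m - 1)) * lsb_span)


for m in (6, 7, 8, 9):
    print(m, float(max_error(m, 0)) * 100)   # worst case: no leading zeros
```

With no leading zeros, m = 8 is the smallest bit-width for which this bound drops below 1%.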
Fig. 3.3-4 shows the estimated gate counts of the perspective division unit and
the approximation errors as a function of the bit-width. The total area of the division
unit per pixel processor can be calculated as follows:
\[
AREA_{TOTAL} = AREA_{LUT} + AREA_{MUL\ for\ U} + AREA_{MUL\ for\ V} = AREA_{LUT} + 2 \times AREA_{MUL}
\]
Since each PP contains the division unit, the overall gate count is doubled in
the chip, which has two PPs. The maximum error occurs when there are no leading
zeros. I choose m to be 8 to keep the maximum error below 1%.
Selecting 8 also shortens the design time, since the 8bit LUT divider is
already designed for the TSE and is available when designing the TA1 stage. This
reduces the divisor bit-width from 16 to 8, resulting in more than 95%
area reduction in the divider, if we can sacrifice the image quality within the
0.78% error boundary as shown in fig. 3.3-3. Before being fed into the LUT divider, u
and v are also reformatted to match 1/w, which is done by left-shifting them by
the same number of leading zeros as 1/w and padding zeros after the LSBs.
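The reformatting and the 16/8 division can be modelled in software to see the error stay inside the stated boundary. This is a behavioural sketch only, assuming 16bit unsigned operands with u ≤ 1/w; it uses an ordinary division in place of the LUT and multipliers, and the function name is hypothetical.

```python
def approx_perspective_divide(u, inv_w, m=8, width=16):
    """Approximate u / inv_w using only the m most significant bits of inv_w."""
    if not 0 < inv_w < (1 << width) or not 0 <= u <= inv_w:
        raise ValueError("expected 0 <= u <= inv_w < 2**width and inv_w > 0")
    leading_zeros = width - inv_w.bit_length()
    divisor = (inv_w << leading_zeros) >> (width - m)   # m-bit data, MSB is 1
    dividend = u << leading_zeros                       # u shifted by the same amount
    # Dividing by the m-bit data padded with zeros equals dividing by (1/w)_a.
    return dividend / (divisor << (width - m))


u, inv_w = 0x2000, 0x80FF                 # worst case: all discarded LSBs are 1
exact = u / inv_w
approx = approx_perspective_divide(u, inv_w)
print((approx - exact) / exact * 100)     # relative error in percent
```

The example operands are chosen so that the discarded LSBs are all ones and there are no leading zeros, which is the worst case for m = 8.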
(Rendered images: 16bit / 8bit division (proposed) and 16bit / 16bit division)
[Fig. 3.3-3 : Error on Perspective Division]
(a) Area: gate counts (log scale) of LUT, MUL, and total versus m (bit-width of data), with the selected m marked (95% area reduction)
(b) Maximum Calculation Error: maximum error (%) versus the number of leading zeros, for m = 1 to 9
[Fig. 3.3-4 : Area and Error Estimation of Division Approximation]
3.3.3 Address Alignment Logic
Eight texel requests are generated every cycle because the two texture units perform
bilinear MIPMAP texture filtering to draw more realistic images [41].
Although the on-chip DRAM is capable of supplying the bandwidth for 8 texels
per cycle, fetching 8 texels directly from 8 texture memories (TMs) may
consume a large amount of power due to the concurrent data transitions in many
capacitive I/Os and the activation power of the TMs themselves. Therefore, I propose
Address Alignment Logic (AAL) to reduce the number of memory requests, as
illustrated in fig. 3.3-6. Because four texel requests are generated by each pixel
processor in bilinear MIPMAP filtering, the total number of requests is 8.
However, several requests overlap because their
footprints are separated by approximately a 1-texel distance, as shown in fig. 3.3-6,
based on the definition of MIPMAP filtering.
Fig. 3.3-5 shows the block diagram of the Address Alignment Logic. After the texture
addresses (U and V) are calculated at the TA1 stage, four bilinear addresses are
generated from each pixel processor. In this stage, the LOD (Level of Detail) is also
calculated. Fig. 3.3-7 shows the variation of the integer part of the LOD according to the
calculation method [44]. Although LODMAXx shows some difference from the widely
used LODMAXall or LODSQRT [46], I chose LODMAXx since it reduces the
hardware cost by eliminating the square-root logic and the Y-value registers. Also,
PP0 and PP1 share the LOD unit to further minimize the hardware, because
the LODs of PP1 differ from those of PP0 by only about 10% on average, as
summarized in table 3.3-1. An 82% reduction in gate count is achieved compared
with LODSQRT.
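The four LOD calculation methods compared here can be written side by side. This is a sketch under the assumption that s_x, t_x, s_y, t_y are non-negative texture-coordinate increments along the screen x and y directions; the function names are hypothetical, and the conversion of the result to an integer MIPMAP level is left out because the text does not specify it.

```python
import math


def lod_sqrt(sx, tx, sy, ty):
    """Reference method: longest footprint edge, needs square roots."""
    return max(math.hypot(sx, tx), math.hypot(sy, ty))


def lod_max_all(sx, tx, sy, ty):
    """Largest of all four increments, no square root."""
    return max(sx, tx, sy, ty)


def lod_max_x(sx, tx):
    """Only the x-direction increments, so no Y-value registers are needed."""
    return max(sx, tx)


def lod_max_y(sy, ty):
    """Only the y-direction increments."""
    return max(sy, ty)


print(lod_sqrt(3, 4, 1, 1), lod_max_all(3, 4, 1, 1), lod_max_x(3, 4), lod_max_y(1, 1))
```

The x-only variant reads half of the inputs and avoids the square root, which is the hardware saving the text attributes to LODMAXx.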
(Pipeline stages, shown for PP0 and PP1:
TA1: UVW Division (uIN[15:0], vIN[15:0], wIN[15:0] to uOUT[11:0], vOUT[11:0])
TA2: TA2_ADDR_LOD, TA2_ADDR_BILINEAR (PP0UV0 ~ PP0UV3, PP1UV0 ~ PP1UV3), TA2_SPATIAL_ALIGN (SA0 ~ SA3), TA2_TEMPORAL_ALIGN (TA0 ~ TA7), TA2_MASK_GEN (SPmask[7:0], TMmask[7:0]), TA2_ADDR_TRANSLATION (TMaddr0 ~ TMaddr7)
TP1: TP1_BANK_AGGREGATION (TP1_MULTI, BAsel0 ~ BAsel3), TP1_ADDR_SELECT (TM0_ADDRESS ~ TM3_ADDRESS[17:0] to the texture memories, TP1_BAON[3:0])
TP2: TP2_DATA_DISTRIBUTE (TM0_DATA ~ TM3_DATA[23:0] from the texture memories)
TP3: TP3_DATA_DISTRIBUTE (TMmask[7:0], SPmask[7:0])
TF: TEXTURE_FILTER (TEXEL0[23:0], TEXEL1[23:0]))
[Fig. 3.3-5 : Block Diagram of Address Alignment Logic]
(Panels: LOD selection from the MIPMAP texture (LOD0 ~ LOD3), where the two texture addresses of PP0 and PP1 satisfy |AddrPP0 - AddrPP1| ~= 1 by the definition of LOD; texture addresses, 4 requests per PP; spatial aligner, which removes spatially overlapped requests and reduces the requests to ~5; temporal aligner, which removes temporally overlapped requests and reduces the requests to ~2.5; remaining requests and TM assignment, with texels interleaved over TM 0 ~ 3 in a 2 x 2 pattern (0 1 / 2 3))
[Fig. 3.3-6 : AAL Operation]
(Panels: original image and images rendered with SQRT, MAXall, MAXx (proposed), and MAXy)
\[
\begin{aligned}
LOD_{SQRT} &= \max\left(\sqrt{s_x^2 + t_x^2},\ \sqrt{s_y^2 + t_y^2}\right) \\
LOD_{MAXall} &= \max(s_x, t_x, s_y, t_y) \\
LOD_{MAXx} &= \max(s_x, t_x) \\
LOD_{MAXy} &= \max(s_y, t_y)
\end{aligned}
\]
[Fig. 3.3-7 : LOD Calculation Method]
In fig. 3.3-5, the spatial aligner (TA2_SPATIAL_ALIGN) compares the texture
addresses of PP0 (PP0UV0 ~ PP0UV3) with those of PP1 (PP1UV0 ~ PP1UV3),
setting the overlapped position flag (OPF) on SA0 ~ SA3. Then, the temporal aligner
(TA2_TEMPORAL_ALIGN) compares the current texture requests (PP0UV0 ~
PP0UV3, PP1UV0 ~ PP1UV3) with the previous ones, which are stored in
registers, setting the OPF on TA0 ~ TA7. Finally, the mask generation block
(TA2_MASK_GEN) merges the OPFs from the spatial and temporal aligners
and generates the bit-masks (SPmask, TMmask), which indicate the texel
positions to be newly fetched from the texture memories. The simulation results
show that the average numbers of mask bits are 5 for SPmask and 2.5 for TMmask.
Fig. 3.3-8 shows the circuit diagrams of the spatial aligner and the temporal aligner.
The temporal aligner is basically similar to an 8-entry fully-associative L1 texture cache
[33]. In the proposed architecture, however, texels are simply stored in the
pipeline latches instead of a power-consuming SRAM [26]. Also, the caching
concept is extended to dual pixel processors in this work.
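The two comparison steps can be sketched in software as follows. This is a simplified model, not the circuit of fig. 3.3-8: it assumes texel addresses are compared for exact equality, that a set mask bit marks a request that still has to be fetched, and that the previous cycle's eight addresses are available; all names are hypothetical.

```python
def align_requests(pp0, pp1, previous):
    """Return (sp_mask, tm_mask) for the 8 texel requests of one cycle.

    pp0, pp1: four texel addresses each; previous: the addresses of the last cycle.
    A True mask bit marks a request that still has to be served.
    """
    current = list(pp0) + list(pp1)
    # Spatial alignment: a PP1 request equal to any PP0 request is redundant.
    sp_mask = [True] * len(pp0) + [addr not in pp0 for addr in pp1]
    # Temporal alignment: a surviving request equal to any previous request
    # is already held in the pipeline latches and needs no memory access.
    tm_mask = [keep and addr not in previous
               for keep, addr in zip(sp_mask, current)]
    return sp_mask, tm_mask


pp0 = [(0, 0), (1, 0), (0, 1), (1, 1)]          # 2x2 bilinear footprint of PP0
pp1 = [(1, 0), (2, 0), (1, 1), (2, 1)]          # footprint of PP1, one texel to the right
previous = [(-1, 0), (0, 0), (-1, 1), (0, 1), (0, 0), (1, 0), (0, 1), (1, 1)]
sp_mask, tm_mask = align_requests(pp0, pp1, previous)
print(sum(sp_mask), sum(tm_mask))
```

In this example the eight requests shrink to six after the spatial step and to two after the temporal step, in line with the averages of about 5 and 2.5 reported by the simulation.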
Although the average number of texture memories activated per cycle can be
reduced to 2.5 through the operation of the spatial and temporal aligners, the
maximum number is still 8. In this implementation, a texture image is stored
across 4 texture memories as shown in fig. 3.3-6, where adjacent texels are
assigned to different texture memories. Texture memory conflicts are scheduled
in a round-robin manner by TP1_BANK_AGGREGATION. When the same
texture memory is accessed by more than one request, this block sets TP1_MULTI
to 1, extending the operation cycles. Then the TP2 and TP3 stages redistribute the
texel data from the 4 texture memories to the 8 corresponding positions, feeding
4 texels per PP for bilinear texture filtering. Although the number of texture
prefetch stages (TP1, TP2, and TP3) is optimized to 3 for this implementation,
where the latency of the texture DRAM is 1, it can easily be scaled up for
longer-latency DRAM, such as off-chip texture memory, by simply inserting more
pipeline latches at TP2.
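The effect of the interleaved assignment on the cycle count can be estimated with a small model. It assumes the 2 x 2 interleaving pattern (0 1 / 2 3) suggested by fig. 3.3-6 and that each texture memory serves one request per cycle; the round-robin order itself is not modelled, and the function names are hypothetical.

```python
from collections import Counter


def texture_memory(u, v):
    """2 x 2 interleaving: horizontally and vertically adjacent texels use different TMs."""
    return (u & 1) | ((v & 1) << 1)


def fetch_cycles(requests):
    """Cycles needed when every texture memory can serve one request per cycle."""
    per_memory = Counter(texture_memory(u, v) for u, v in requests)
    return max(per_memory.values(), default=0)


footprint = [(4, 6), (5, 6), (4, 7), (5, 7)]    # one bilinear footprint
print(fetch_cycles(footprint))                   # all four TMs used once
print(fetch_cycles([(0, 0), (2, 0)]))            # both requests hit the same TM
```

A single bilinear footprint never conflicts under this mapping; an extra cycle is only needed when the remaining requests of the two PPs fall on the same texture memory.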
Vector   PP1 Utilization   LOD Change Rate   Spatial Aligner      Temporal Aligner     Cycle Overhead
                                             Remaining Texels     Remaining Texels
A        56.71%            8.28%             4.82                 2.30                 1.09
B        56.71%            12.74%            5.27                 3.64                 1.21
C        97.42%            0.00%             5.35                 2.78                 1.03
D        78.10%            No Texture        No Texture           No Texture           No Texture
[Table 3.3-1 : Simulation Results of Texturing Unit]
(a) Spatial Aligner: each of the 4 texel requests from PP1 (PP1UV0 ~ PP1UV3) is compared (=?) with each of the 4 texel requests from PP0 (PP0UV0 ~ PP0UV3), producing SA0 ~ SA3
(b) Temporal Aligner: the 8 current texel requests (PP0UV0 ~ PP0UV3, PP1UV0 ~ PP1UV3), together with SPmask[7:0] and the LOD, are compared (=?) with the previous requests latched on TEXclk, and the comparison results are combined by bitwise AND into TMmask0 ~ TMmask7
[Fig. 3.3-8 : Spatial Aligner and Temporal Aligner]
Fig. 3.3-9 and fig. 3.3-10 show the analysis results of the AAL. Fig. 3.3-9 displays
how the number of texture requests is reduced in the AAL as the frames proceed.
Fig. 3.3-10 summarizes how the number of texture memories affects the power
consumption and the cycle time. Since the AAL limits the number of
texture requests to 2.5 ~ 3.5 on average, the power consumption
saturates at a certain level. Also, a larger number of texture memories means that less
time is needed to fetch the same amount of data from the memory, at the cost of
more area. Therefore, I set the number of texture memories to four, so as to
minimize the energy consumption, which is the product of those two factors.
(Plot of the number of remaining requests (0 ~ 9) versus frame number (0 ~ 100) for vectors A and B: original requests, after the spatial aligner, and after the temporal aligner)
[Fig. 3.3-9 : Remaining Requests : Frame by Frame]
(Plots of power and time versus the number of texture memories (1 ~ 8) for vectors A, B, and C)
[Fig. 3.3-10 : AAL Analysis Results : Power and Time]
Fig. 3.3-11(a) shows the power consumption required to activate the texture
memories, which is proportional to the number of texture memories
activated per cycle. Fig. 3.3-11(b) shows the number of cycles required to draw
two bilinear-filtered pixels, which is proportional to the time required to
complete the drawing of a scene. The average number of cycles with 4 TMs
and the AAL increases slightly, to 1.1. Therefore, the energy consumption
required to access the texture memory, which is the product of time and
power, is reduced by 66% on average, as illustrated in fig. 3.3-11(c).
Although a single-PP architecture seems to consume less power than the AAL
architecture, it needs much more time to finish the drawing. Therefore, the proposed
architecture, 2 PPs with AAL, is better suited to mobile platforms driven by
the limited energy of a battery.
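The 66% figure follows directly from the numbers of activated texture memories and cycles. The following sketch assumes the three configurations and values shown in fig. 3.3-11 and treats energy as simply proportional to their product; the names are hypothetical.

```python
# (texture memories activated per cycle, cycles per two bilinear-filtered pixels)
CONFIGS = {
    "8 TMs, 2 PPs (no AAL)": (8.0, 1.0),
    "4 TMs, 1 PP (no AAL)": (4.0, 2.0),
    "4 TMs, 2 PPs + AAL": (2.5, 1.1),
}


def access_energy(memories_activated, cycles):
    """Energy is proportional to power (memories activated) times time (cycles)."""
    return memories_activated * cycles


baseline = access_energy(*CONFIGS["8 TMs, 2 PPs (no AAL)"])
normalized = {name: access_energy(*cfg) / baseline for name, cfg in CONFIGS.items()}

for name, value in normalized.items():
    print(f"{name}: {value:.2f}")
```

The single-PP configuration halves the activated memories but doubles the cycles, so its energy matches the baseline; only the AAL configuration lowers both factors together.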
(Bar charts for three configurations: 8 TMs, 2 PPs (no AAL); 4 TMs, 1 PP (no AAL); 4 TMs, 2 PPs + AAL)
(a) Number of Texture Memories Activated (= Power): 8, 4, and 2.5
(b) Number of Cycles (= Time): 1, 2, and 1.1
(c) Energy used in the Texture Memories (Normalized): 1, 1, and 0.34 (66% reduction)
\[
\begin{aligned}
POWER_{TM\,Access} &\propto \text{Number of Texture Memories Activated} \\
Time_{TM\,Access} &\propto \text{Number of Cycles} \\
Energy_{TM\,Access} &= POWER_{TM\,Access} \times Time_{TM\,Access} \\
&= \text{Number of Texture Memories Activated} \times \text{Number of Cycles}
\end{aligned}
\]
[Fig. 3.3-11 : Energy-Efficiency of AAL]
3.4 Memory Programmer : Post Processing Unit
For real-time special rendering effects, the Memory Programmer (MP)
post-processes the rendered pixels while transferring them to the display controller,
in parallel with the SlimShader. It contains crossbar switches for front/back
selection and a SIMD-parallel datapath, which is controlled by its own 16bit
commands, as shown in fig. 3.4-1. Since each memory has separate read and write buses,
the total bit-width of the crossbar is 160. The LCD interface reads out the pixels from
the front buffer through the SIMD datapath and writes them back to the buffer, while
the SlimShader performs rendering operations with the back buffer. The post-processing
doesn't slow down the pixel throughput because the MP processes one pixel per
LCD clock period. Special effects such as full-scene antialiasing,
motion blur, and fog can be programmed by software and downloaded to the
command registers. Full-scene antialiasing (FSAA) can be performed by 2x1
filtering, and linear fog is calculated with the help of the double depth buffers.
The following equations are examples of post-filters that can be evaluated by
the SIMD datapath. Fig. 3.4-2 shows the block diagram of the SIMD datapath, and fig.
3.4-3 shows examples of special effects and their assembly codes.
FSAA : OUT[x][y] = (a*FB[x][y] + b*FB[x+1][y])/c
(for example, a=3, b=1, c=4)
Fog : OUT[x][y] = a*(FB[x][y]-color) + color
( a=(ZB[x][y]+bias/SCREEN_DEPTH), 0<a<1 saturated )
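The two post-filters can be tried out on small pixel tuples. This is a software sketch, not the SIMD datapath itself: it assumes 8bit RGB channels, integer division for FSAA, and a fog factor a that is supplied by the caller and saturated to the range 0 to 1; the function names are hypothetical.

```python
def fsaa(fb, x, y, a=3, b=1, c=4):
    """OUT[x][y] = (a*FB[x][y] + b*FB[x+1][y]) / c, applied per colour channel."""
    return tuple((a * p + b * q) // c for p, q in zip(fb[x][y], fb[x + 1][y]))


def fog(pixel, fog_color, a):
    """OUT = a*(FB - color) + color, with the fog factor a saturated to [0, 1]."""
    a = min(max(a, 0.0), 1.0)
    return tuple(round(a * (p - f) + f) for p, f in zip(pixel, fog_color))


fb = [[(0, 0, 0)], [(255, 255, 255)]]            # fb[x][y], two pixels in the x direction
print(fsaa(fb, 0, 0))                             # blend of a black and a white pixel
print(fog((200, 100, 0), (100, 100, 100), 0.5))   # pixel pulled halfway to the fog colour
```

How the fog factor is derived from the depth buffer value is left to the caller here, since it depends on the bias and the screen depth programmed into the constant registers.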
(Block diagram: depth buffers DB A0, A1, B0, B1 and frame buffers FB A0, A1, B0, B1 connected through 160b crossbars to the SlimShader and to the Memory Programmer, which consists of the command registers (16b commands), pipe control, the SIMD datapath, and the LCD interface with 24b display output)
[Fig. 3.4-1 : Memory Programmer]
(Block diagram: input registers for FB[x], FB[x+1] (24b each) and ZB[x], ZB[x+1] (16b each); constant registers a, b, c, d, e, f, and mask; arithmetic blocks Z+e, LUT, SAT, a*Y, f(X-c)+c, (A+b*B)/d, and MASK; output register for pixel out / FB write)
[Fig. 3.4-2 : SIMD-parallel Datapath]
FSAA Motion Blur
Fog Others
MOVR a 0x001;MOVR b 0x011;MOVR d 0x100;DISB LUT;MASK 0xFFFFFF;CLRZ;CLRC;SWAP;
MOVR a 0x001;MOVR b 0x010;MOVR d 0x011;DISB LUT;MASK 0xFFFFFF;CLRZ;CLRC;SWAP;
MOVR a 0x000;MOVR b 0x001;MOVR d 0x010;DISB LUT;MASK 0xFFFFFF;CLRZ;SWAP;
MOVR a 0x000;MOVR b 0x011;MOVR d 0x100;DISB LUT;MASK 0xFFFFFF;CLRZ;SWAP;
MOVR a 0x000;MOVR b 0x001;MOVR c 0xFFFFFF;MOVR d 0x001;MOVR e 0x0000;(MOVR e 0x9C40;)ENAB LUT;ENAB POSSAT;MASK 0xFFFFFF;CLRZ;CLRC;SWAP;
MOVR a 0x000;MOVR b 0x001;(MOVR b 0x011);MOVR d 0x001;DISB LUT;MASK 0xFF0000;CLRZ;SWAP;
MOVR a 0x000;MOVR b 0x001;MOVR d 0x001;DISB LUT;MASK 0xCCCCCC;CLRZ;CLRC;SWAP;
[Fig. 3.4-3 : Examples of Special Effects]
3.5 Memory Access
To cover the 256 x 256 screen resolution, which matches the screen resolution
of most current cell-phones, 4 frame buffers and 4 depth buffers with zero
latency are used in the chip. Also, 4 texture memories amount to 24Mb and store
the MIPMAP texture images for 3D gaming applications. Fig. 3.5-1 and fig. 3.5-2
illustrate the memory configuration and the access timing, respectively. The latency,
cycle time, and bus configuration are optimized individually for each memory. The
frame and depth buffers are optimized for single-cycle read-modify-write data
transactions without latency. The texture memory is optimized for continuous read
operation, allowing one-cycle latency in exchange for larger capacity. Also, the
memories are mapped differently for better performance: vertical stripe assignment
for the frame/depth buffers, and 2D interleaved assignment for the texture memory.
(Block diagram: Depth Buffers A0, A1, B0, B1 (512kb each) and Frame Buffers A0, A1, B0, B1 (768kb each) connected to the Memory Programmer and the SlimShader; Texture Memories 0 ~ 3 (6Mb each))
[Fig. 3.5-1 : Memory Configuration]
(a) Depth Buffer Timing: REclk and MEMclk with a 20ns cycle; each command (CMD) triggers an internal DRAM READ followed by a WRITE; separate READ and WRITE buses; SlimShader PI stage (interpolation) and HS stage (Z-comp); data latched at the falling edge
(b) Frame Buffer Timing: same command and bus timing; SlimShader TF stage (TEX-blend) and PB stage (A-blend)
(c) Texture Memory Timing: READ, WRITE, and NOP commands on a single I/O bus; TM command at TP1, TM data at TP2 (latency-controllable stage) and TP3 (data alignment)
[Fig. 3.5-2 : Memory Access Timing]
Although it looks simple, satisfying the pipeline timing is a big
challenge in terms of DRAM design. The cycle time (TRC) of the embedded DRAM
must be less than 20ns, while commodity SDRAMs work at 65ns or more.
The timing budget of the frame and depth buffers is even stricter, because the read
data must be written back to the same address within a cycle for an efficient RMW
(Read-Modify-Write) transaction. To support RMW operation at 50MHz, the timing
of the core used in the 256Mb SDRAM is modified and optimized to 20ns by
reconfiguring each cell array to 256 rows x 192 bitline pairs, as shown in fig.
3.5-3(a). Also, the atomic commands defined by the standard SDRAM, ACT
(Row-Active), PCG (Precharge), READ (Column-Active and Read), and WRITE
(Column-Active and Write), are packed into a single instruction that the DRAM
decodes internally. For example, a single RMW instruction is the composition of
ACT – READ – HOLD – WRITE – PCG operations. Using the single instruction
also helps reduce the cycle time, because fetching each command individually
with multiplexed row and column addresses requires extra timing margin to
preserve the setup and hold time of each command. Splitting the cell-MAT and
adding extra circuits for faster cycle time and lower power consumption,
however, bring an area overhead to the embedded DRAM. Fig. 3.5-3(b) shows
the total area of each DRAM normalized by its capacity, which indicates the area
overhead or cell efficiency. As shown in this graph, the area/bit of the 512kb DRAM
used in the depth buffer is 3.8 times larger than that of the 256Mb SDRAM, although
they use the same 0.16µm cell structure. The 6Mb DRAM, optimized for the texture
memory, which reads data continuously without requiring RMW operations, shows
better area efficiency. Overall, the embedded DRAM occupies 1.8 times more area
per bit than the standard 256Mb SDRAM.
(a) Critical-Path Timing: TRC (ns, 10 ~ 65) versus the number of rows (64 ~ 1024) for 48, 96, 192, 384, and 768 bitline pairs
TRC = TACT + TROW + TCOL + THOLD + TWRITE + TPCG
TRC : Random Row Cycle
TACT : Address Decoding
TROW : Wordline Activation
TCOL : Read-out Path Activation
THOLD : 5ns Hold for Modify
TWRITE : Write-Driver Activation
TPCG : Bitline Pre-Charge
(b) Area Overhead: area/bit (um2/kbit), split into cell and periphery, for the 512kb DRAM (depth buffer), 768kb DRAM (frame buffer), 6Mb DRAM (texture memory), 29Mb eDRAM, and 256Mb SDRAM; annotations: 3.8x, 3.2x, 1.5x, 1.8x and 20.2%, 23.8%, 46.1%, 38.3%, 65%
[Fig. 3.5-3: Characteristics of Embedded DRAM]
In contrast, the distributed architecture of the embedded DRAM saves run-time power
consumption, since only the necessary memories out of the 12 are selectively
activated. In this architecture, the overall power of the rendering memories per two
pixels can be written as follows:
\[
Power_{Memory} = (1+\alpha) \times [Power_{ZB} + \beta \times Power_{FB}] + \beta \times \gamma \times Power_{TM}
\]
Where PowerFB, PowerZB (also written PowerDB), and PowerTM are the power
consumption of the frame buffer, depth buffer, and texture memory,
α = PP1 utilization, β = Depth-gated ratio, γ = Texture-access ratio
Here, α depends on the size and the shape of the triangle, and it tends to decrease
as the triangles get smaller, as table 3.3-1 shows. β depends on the depth
complexity, and it can be reduced by extensive clock gating according to the
depth-comparison results, as described in section 3.3-2. In a scene with a depth
complexity of three, 7/18 of the pixels fail the Z-test [46]. Finally, γ is reduced by the
Address Alignment Logic, as described in section 3.3-3.
Based on the power consumption of each DRAM (PowerFB = 18mW, PowerDB
= 12.5mW, PowerTM = 18mW), PowerMemory can be plotted as in fig. 3.5-4(a).
More power is saved as the triangles get smaller and the scenes get more
complex, which is likely in gaming applications on the small LCD
screens of mobile devices. When α = 0.5, β = 1/3, and γ = 2.5 (highly probable
values in gaming applications), the power is reduced by 65% compared
with the unified memory architecture, where all memories are activated together.
Fig. 3.5-4(b) shows the normalized energy consumption until the drawing job is
finished, which is proportional to β since the (1+α) term is cancelled out by
the cycle time as follows:
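Let N = total number of pixels to be drawn,
Then, the time required to finish the drawing is
\[
Time_{Total} = \frac{N}{1+\alpha}
\]
Therefore, the energy consumption to finish the drawing is
\[
\begin{aligned}
Energy_{Total} &= Time_{Total} \times Power_{Total} \\
&= \frac{N}{1+\alpha} \times \left( (1+\alpha) \times [Power_{ZB} + \beta \times Power_{FB}] + \beta \times \gamma \times Power_{TM} \right) \\
&= N \times \left[ Power_{ZB} + \beta \times Power_{FB} + \left( \frac{\beta \times \gamma}{1+\alpha} \right) \times Power_{TM} \right]
\end{aligned}
\]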
As shown in fig. 3.5-4(b), the distributed memory system saves more energy as
3D applications get more complex in depth: a 63% reduction for α = 0.5, β =
1/3, and γ = 2.5.
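The power and energy expressions can be evaluated for given parameters. The sketch below rests on an assumed reading of the formulas: the depth and frame buffer terms scale with (1+α), while the texture memory term is β × γ × PowerTM, and the drawing time is N / (1+α). The parameter values are the ones quoted in the text, with α taken as 0.5; the function names are hypothetical, and the baseline of the unified memory architecture is not modelled.

```python
POWER_FB = 18.0   # mW, frame buffer
POWER_ZB = 12.5   # mW, depth buffer
POWER_TM = 18.0   # mW, texture memory


def memory_power(alpha, beta, gamma):
    """Power of the rendering memories per two pixels (assumed form of the formula)."""
    return (1 + alpha) * (POWER_ZB + beta * POWER_FB) + beta * gamma * POWER_TM


def drawing_energy(n_pixels, alpha, beta, gamma):
    """Energy to draw n_pixels: drawing time N/(1+alpha) times the memory power."""
    return n_pixels / (1 + alpha) * memory_power(alpha, beta, gamma)


alpha, beta, gamma = 0.5, 1 / 3, 2.5
print(memory_power(alpha, beta, gamma))          # mW
print(drawing_energy(1000, alpha, beta, gamma))  # mW x pixel-time units
```

Dividing by (1+α) removes the α dependence of the depth and frame buffer terms, so under this reading only the texture memory term still varies with the PP1 utilization.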
(a) Power Consumption: power consumption (mW, 0 ~ 140) versus PP1 utilization (α, 0.0 ~ 1.0) for β = 1/1, 1/2, 1/3, and 1/4; 65% power reduction annotated
(b) Energy Consumption: normalized energy (0.0 ~ 1.0) versus PP1 utilization (α, 0.0 ~ 1.0) for β = 1/1, 1/2, 1/3, and 1/4; 63% energy reduction annotated
[Fig. 3.5-4 : Memory Access]
Also, the memories can be selectively refreshed for data retention in standby
modes by power-control instructions, as shown in fig. 3.5-4: PHLD (Hold), PIDL
(Idle), PSLP (Sleep), and POFF (Off). PHLD can be used to hold the datapath and
the memories temporarily during normal rendering operations, while waiting for a
geometry operation. All memories are refreshed in this mode. PIDL turns off the
rendering clock but refreshes all graphics memories. In PSLP mode, only the texture
memory is refreshed to hold the texture images, since they may have been
downloaded from the wireless network. Finally, POFF turns off all operations.
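The refresh policy of the four standby modes can be summarized as a small lookup table. This is an illustrative sketch only; the instruction mnemonics come from the text, while the memory names and the helper function are hypothetical.

```python
from enum import Enum


class PowerMode(Enum):
    PHLD = "hold"
    PIDL = "idle"
    PSLP = "sleep"
    POFF = "off"


ALL_MEMORIES = frozenset({"depth_buffer", "frame_buffer", "texture_memory"})

# Which memories keep being refreshed in each mode, as described in the text.
REFRESHED = {
    PowerMode.PHLD: ALL_MEMORIES,
    PowerMode.PIDL: ALL_MEMORIES,
    PowerMode.PSLP: frozenset({"texture_memory"}),
    PowerMode.POFF: frozenset(),
}


def is_refreshed(mode, memory):
    """True if the given memory retains its data in the given standby mode."""
    return memory in REFRESHED[mode]


print(is_refreshed(PowerMode.PSLP, "texture_memory"))
print(is_refreshed(PowerMode.PSLP, "frame_buffer"))
```

Keeping only the texture memory alive in sleep mode avoids downloading the texture images again after wake-up.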
(Panels: HOLD, IDLE, SLEEP, and OFF modes, each showing the state of the rendering logic, depth buffer, frame buffer, and texture memory)
[Fig. 3.5-4 : Standby Power Modes]
Chapter 4 Chip Implementation
4.1 Process Technology
To implement mobile 3D graphics chips, previous designs integrated DRAM
using the EML technology [6-8]. However, the fabrication process costs too
much, because the logic must be built with transistors separate from those of the
DRAM, requiring more mask layers [13]. Therefore, the EML process has
seldom been used on low-cost mobile platforms. In this work, I implemented
the SoC with a pure DRAM process instead of the EML to reduce the fabrication
cost. The logic components, SRAM, and analog blocks are drawn with
the design rules of the peripheral transistors of the DRAM. However, the DRAM process
has some drawbacks for logic design: 1) slower transistor speed, and
2) fewer metal layers. As summarized in table 4.1-1, the process
characteristics of the 0.16um pure DRAM process are even poorer than those of the
0.18um merged DRAM process. Although the transistor performance does not seem to
satisfy the requirements of a high-end microprocessor, the high-speed state
machines and interface circuitry of RAMBUS-DRAM and DDR-SDRAM have
been successfully implemented with the peripheral transistors of the DRAM process.
Therefore, I implemented the chip with the pure DRAM process, successfully
achieving 133MHz and 50MHz for the RISC processor and the 3D rendering
engine, respectively. The negligible sub-threshold leakage current of the DRAM process
also helps reduce the standby current, which has become a critical issue for
battery-driven devices. Since the original 256Mb SDRAM process
was not intended to support logic synthesis, a Verilog-synthesis methodology
for the SDRAM process had to be set up by drawing and characterizing 73 standard
cells and porting them to various CAD tools. Table 4.1-2 summarizes the transistor
and metal usage. Since M0 is resistive, it is not used for global routing, as
shown in fig. 4.1-1.
(Cross-section: DRAM core with cell capacitor and bitline; periphery with M0 (W), which is resistive and used for standard cell routing, and M1 ~ M3 (Al), used for global routing)
[Fig. 4.1-1 : Process Technology]
Process               Cell Tr.       Logic Tr.      Vdd (V)   M0 Width (um)   Rs (ohm/sq)                Metal Layers
                      Ldrawn (um)    Ldrawn (um)
0.16um Pure-DRAM      0.16           0.28           2.5       0.35            M0: 1.8, M1~M3: ~0.05      M0: W (Bitline), M1~M3: Al
0.18um Merged-DRAM    0.18           0.18           1.8       0.23            M0: 2, M1~M5: ~0.05        M0: W (Bitline), M1~M5: Al
[Table 4.1-1 : Process Comparison of Pure-DRAM and Merged-DRAM]
Applied Blocks Transistors Metal 0 (W) M1 M2 M3
Standard Cell RoutingAll Synthesizable Logic Global Routing
Dual-Port SRAM Not Used RISC Cache
SRAM I/O
Analog Circuits
DRAM Periphery
Block Routing
Not Used All DRAM DRAM Core Bitline Wordline DBLine I/O Top Routing Not Used Not Used Horizontal Vertical Horizontal
[Table 4.1-2 : Transistor and Metal Usage]
4.2 Chip Fabrication
The graphics SoC is implemented using a typical 0.16um DRAM process with
1-W 3-Al metal layers, and its die area is 121mm2. The chip contains 1M logic
transistors, 29Mb DRAM, 72kb SRAM, and a PLL. The top level is verified in
Verilog, where custom blocks such as the DRAM, SRAM, and PLL are
characterized and ported. The external interfaces, such as the boot-up ROM and
SRAM, are also modeled to emulate the board-level environment. Compiled
ARM9 codes capable of controlling the full functionality are executed, and the
results are compared with 3D-Glamor. Finally, the GDS file extracted from the
Apollo P&R tool is converted to schematics and simulated by the transistor-level
simulator, EPIC, running worst-case vectors, as illustrated in fig. 4.2-1.
Normally, this last step is skipped in ASIC design, since simulating
transistor-level netlists takes a huge amount of time. However, I performed this
low-level simulation to minimize the uncertainties and mistakes in the setup process,
because this is the first trial of designing an SoC with a pure DRAM process. Fig.
4.2-2 shows the die photograph, and table 4.2-1 summarizes the physical
characteristics. The first silicon was packaged and tested as shown in fig. 4.2-3,
where the first waveform (fig. 4.2-4) appeared after the built-in self-calibrating
operations. As shown in the measured waveforms in fig. 4.2-5, the transition
from slow mode to fast mode completes quickly without any hazard.
[Fig. 4.2-1 : Implementation Flow]
(Flow diagram: soft-IP blocks (3DRE, RISC, BEQ, clock control) in Verilog and hard-IP blocks (DRAM, SRAM, PLL), together with board models (SRAM, ROM), are verified functionally and for timing at PLI, RTL and gate level using compiled ARM9 code and 3D-Glamor; after P&R, the GDS is converted to a netlist and simulated in EPIC with SPICE model parameters and worst-case vectors before tapeout.)
[Fig. 4.2-2 : Die Photograph]
Process Technology | 0.16um CMOS DRAM with 1-W, 3-Al (256Mb Compatible)
Power Supplies | I/O : 3.3V (VDQ : 3.3V, VSQ : 0V)
               | Internal : 2.5V (VDD : 2.5V, VSS : 0V)
               | Digital : VPP : 3.5V, VINT : 2.0V, VCP : 1.0V, VBLP : 1.0V, VBB : -0.8V
               | Analog : VCCA : 2.5V, VCCVCO : 2.5V, GNDA : 0V
Clock Frequency (RISC,BEQ/3DRE,DRAM) | Fast : 132MHz / 33MHz
               | Normal : 66MHz / 16.5MHz
               | Slow : 33MHz / 8.25MHz
Chip Size | 11mm x 11mm (including I/O Pad)
Transistor Counts | 1M Logic, 29Mbit DRAM, 72kbit SRAM (9kByte)
Analog Blocks | Programmable PLL, 2.4nF Decoupling Capacitor
Power Consumption | 210mW
Package | 240pin QFP
[Table 4.2-1 : Physical Characteristics]
[Fig. 4.2-3 : Device Under Test]
[Fig. 4.2-4 : Measured Waveform : After Built-in Self-Calibrating Test]
(Traces: RISCclk at 25MHz and REclk at 6.25MHz.)
[Fig. 4.2-5 : Measured Waveform : Mode Change]
(Traces: RISC clock and 3DRE clock switching from slow mode, 33MHz / 8.25MHz, to fast mode, 132MHz / 33MHz. V: 2V/div, H: 33ns/div.)
4.3 Power Consumption
Fig. 4.3-1 shows the breakdown of the power consumption for various
applications. The implemented graphics SoC consumes 210mW in continuous
calculation of bilinear texture-mapped and antialiased 3D graphics applications in
FAST mode (33MHz REclk and 132MHz RISCclk). The embedded DRAM
drastically reduces the power consumption since the external I/Os for 3D
rendering are eliminated, and an additional 22% reduction is obtained by AAL
(Address Alignment Logic) and DFCG (Depth-First Clock-Gating). For point-sampled
texturing, the power drops to 185mW. Textured 3D rendering
consumes 110mW in NORMAL mode (16.5MHz REclk and 66MHz RISCclk) and
65mW in SLOW mode (8.25MHz REclk and 33MHz RISCclk).
Non-textured (but Gouraud-shaded) 3D applications consume 145mW in FAST
mode. The power consumption of the MP (Memory Programmer) is about 5mW,
which is low because it is synchronized with the slow LCD clock. The power
consumption of each block is summarized in table 4.3-1.
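As a quick check of the quoted 22% figure, the sketch below recomputes the saving from the FAST-mode totals reported in this section (270mW without AAL and DFCG, 210mW with them). The dictionary keys and the function name are illustrative only and do not come from the thesis.

```python
# Total chip power in FAST mode, in mW, as reported in this section.
POWER_MW = {
    "no_aal_no_dfcg": 270.0,  # texture mapping without AAL and DFCG
    "bilinear": 210.0,        # bilinear texture mapping with AAL and DFCG
    "point_sampled": 185.0,
    "no_texture": 145.0,
}


def reduction_percent(before_mw: float, after_mw: float) -> float:
    """Relative power saving, in percent, when going from before_mw to after_mw."""
    return 100.0 * (before_mw - after_mw) / before_mw


aal_dfcg_saving = reduction_percent(POWER_MW["no_aal_no_dfcg"], POWER_MW["bilinear"])
print(f"AAL + DFCG saving: {aal_dfcg_saving:.1f}%")
```

The same helper can be applied to any other pair of operating points in the table.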
Block | Conventional Texture | Bilinear Texture | Point Sampled Texture | No Texture
3DRE (with DRAM) | 200mW | 140mW | 115mW | 80mW
  SS+MP | 77.6 (68.1/9.5) | 58.14 (48.64/9.5) | 53.2 (43.7/9.5) | 41.14 (31.64/9.5)
  ZB/FB/TM | (21.9/26.25/72) | (21.9/15.75/45) | (21.9/15.75/20) | (21.9/15.75/0)
RISC (with $) | 54.8mW
BEQ (with SRAM) | 3.5mW
PMU (with PLL) | 5mW
Total | 270mW | 210mW | 185mW | 145mW
[Table 4.3-1 : Block Power Consumption]
[Fig. 4.3-1 : Power Consumption @ Fast Mode]
(Stacked bar chart of power consumption in mW for five cases. A : 3D Graphics with Texture Mapping (External Memory), the conventional system with external I/O and DRAM. B : 3D Graphics with Texture Mapping (No AAL, No DFCG), 270mW. C : 3D Graphics with Texture Mapping (AAL, DFCG), Bilinear MIPMAP, 210mW, a 22% reduction. D : 3D Graphics with Texture Mapping, Point-Sampled, 185mW. E : 3D Graphics without Texture Mapping, 145mW. Components: Depth Buffer, Frame Buffer, Texture Memory, 3D Rendering Engine, BEQ with DP-SRAM, RISC with Cache, Power Management Unit, Others including pad.)
4.4 Performance
4.4.1 Performance Summary
This chip draws 24bit texture-mapped pixels at 66Mpixels/s and 264Mtexels/s
at 33MHz. Its features are summarized in table 4.4-1.
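The headline rates can be broken down per clock cycle. The sketch below does only that arithmetic; reading the result as two pixels per cycle with four texels each (one bilinear footprint per pixel) is an interpretation, not something this paragraph states.

```python
RE_CLOCK_HZ = 33e6   # rendering-engine clock in FAST mode
PIXEL_RATE = 66e6    # pixels per second
TEXEL_RATE = 264e6   # texels per second

pixels_per_cycle = PIXEL_RATE / RE_CLOCK_HZ   # pixels produced each clock
texels_per_pixel = TEXEL_RATE / PIXEL_RATE    # texels fetched for each pixel
texels_per_cycle = TEXEL_RATE / RE_CLOCK_HZ   # texels fetched each clock

print(pixels_per_cycle, texels_per_pixel, texels_per_cycle)
```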
Target Applications | Realtime 2D/3D Graphics Pipeline
                    | MPEG-4 SP@L1 Decoding
                    | MP3 Audio Decoding
3D Rendering Performance | 66Mpixels/s, 264Mtexels/s
                    | Hardware Triangle Setup Engine
                    | Perspective-Correct Bilinear MIPMAP Texturing
                    | Gouraud Shading, Alpha Blending, Texture Blending
                    | Antialiasing, Motion Blur, Fog, Special Effects
Embedded Graphics Memory | 5Mb Double Depth / Frame Buffer (256 x 256 Resolution, 24bit Color, 16bit Depth)
                    | 24Mb Texture Memory
3D Geometry Performance with Fixed-Point Graphics Library | 1.04Mvertices/s : Model-View Transformation
                    | 300kvertices/s : Model-View Transformation + Perspective Projection + 6-Side Clipping
                    | 70kvertices/s : Model-View Transformation + Perspective Projection + 6-Side Clipping + Lighting (single directional light source from infinite viewer, one-sided, ambient + diffuse + specular highlighting)
[Table 4.4-1 : Chip Features]
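The 5Mb figure for the double depth / frame buffer can be reproduced from the resolution and bit depths in the table. The sketch assumes that both buffers are double-buffered and that 1Mb means 2^20 bits; the function name is illustrative.

```python
def buffer_mbit(width: int, height: int, bits_per_pixel: int, copies: int = 1) -> float:
    """Buffer size in Mbit (1 Mbit = 2**20 bits, an assumption)."""
    return width * height * bits_per_pixel * copies / 2**20


frame_mbit = buffer_mbit(256, 256, 24, copies=2)  # 24bit color, double-buffered
depth_mbit = buffer_mbit(256, 256, 16, copies=2)  # 16bit depth, double-buffered
total_mbit = frame_mbit + depth_mbit

print(frame_mbit, depth_mbit, total_mbit)
```

Under these assumptions the split comes out as 3Mb for the frame buffer and 2Mb for the depth buffer, which adds up to the 5Mb listed above.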
4.4.2 Performance Comparison
Fig. 4.4-1 and table 4.4-2 compare the performance of the proposed graphics SoC
with recently announced implementations [45] and previous work [6-8, 23]. The
geometry performance reaches up to 1Mvertices/s with the help of the MobileGL
library. Also, the rendering engine shows the highest fill rate, taking advantage of
local graphics memories and an energy-efficient texturing unit.
Fig. 4.4-2 shows the performance indices. The 3D graphics accelerators in PC
platforms draw at high rendering rates and perform many advanced rendering
functions, but they consume a great deal of power, more than several tens of watts.
The proposed graphics rendering engine for the mobile platform, in contrast, shows
a slower rendering rate and performs restricted functions while consuming
less than 140mW (rendering power). Therefore, I propose new performance
indices that compare embedded 3D graphics rendering engines while taking
the power consumption into account [7]. They are analogous to the well-known
MIPS/mW.
\[ \text{Performance of Mobile 3D Graphics} = \frac{\text{3D Rendering Speed}}{\text{Power Consumption}} \]
The 3D rendering speed can be expressed in pixels/s (PxPS) or texels/s (TxPS).
Therefore, PxPS/mW stands for the pixel fill rate per 1mW of power, and
TxPS/mW describes the texture fill rate per 1mW. The pixel rate of this SoC is
about 0.8-MPxPS/mW, which is 1.6 times greater than that of the previous work.
The texel rate is about 1.88-MTxPS/mW (MTxPS : Mtexels/s), which is, to the
best of my knowledge, the highest ever published for mobile handheld devices. The PC
graphics system [44] shows about 40kPxPS/mW and 80kTxPS/mW. Although
SONY shows better performance indices than this work, the advantage mainly
comes from the difference in operating voltage (1.2V for SONY and 2.5V for
this work). If the voltage is taken into account, both architectures may show
similar indices.
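The indices defined above are plain ratios, so they can be recomputed from the fill rates and rendering powers quoted in this section and in table 4.4-2. In the sketch below, `index_per_mw` follows the definition in the text, while `voltage_normalized` is a hypothetical helper based on the first-order assumption that dynamic power scales with the square of the supply voltage; that scaling rule is not taken from the thesis.

```python
def index_per_mw(rate_per_s: float, power_mw: float) -> float:
    """Performance index: rendering speed (pixels/s or texels/s) per mW."""
    return rate_per_s / power_mw


def voltage_normalized(index: float, vdd: float, vdd_ref: float) -> float:
    """Index rescaled to a reference supply, assuming power ~ Vdd**2 (assumption)."""
    return index * (vdd / vdd_ref) ** 2


# Pixel indices (PxPS/mW) from shading performance and non-textured rendering power.
ramp1_px = index_per_mw(40e6, 590)   # RAMP-I
ramp2_px = index_per_mw(70e6, 120)   # RAMP-II
ramp4_px = index_per_mw(66e6, 80)    # this work, no texturing
# Texel index (TxPS/mW) from texture fill rate and textured rendering power.
ramp4_tx = index_per_mw(264e6, 140)  # this work, texturing

print(f"{ramp1_px / 1e3:.0f}k, {ramp2_px / 1e3:.0f}k, {ramp4_px / 1e3:.0f}k PxPS/mW")
print(f"{ramp4_tx / 1e6:.2f}M TxPS/mW")
print(f"{voltage_normalized(ramp4_tx, 2.5, 1.2) / 1e6:.2f}M TxPS/mW if scaled to 1.2V")
```

The last line is only a rough what-if for the voltage argument in the text; real designs do not scale this cleanly.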
[Fig. 4.4-1 : Performance Comparison]
(a) Geometry Performance (vertices/s)
(b) Rendering Performance
(Bar charts comparing S1D13721 (Seiko Epson / Futrek), Z-3D (Mitsubishi), GShark+ (Sanshin), ATI2300, SONY (assuming 100% cache hit) and RAMP-IV (this work).)
[Fig. 4.4-2 : Performance Indices]
(a) Pixel Rate (MPxPS/mW)
(b) Texel Rate (MTxPS/mW)
(Bar charts comparing RAMP-I, RAMP-II, this work, Z3D, PC graphics and SONY (assuming 100% cache hit), annotated with supply voltages.)
RAMP-I [6] RAMP-II [7] Z3D [23] RAMP-IV (This Work)
Maximum Shading
Performance40Mpixels/sec 70Mpixels/sec 5.2Mpixels/sec 66Mpixels/sec
Texture Fill Rate 0 0 5.2Mtexels/sec
(?) 264Mtexels/sec
Screen Resolution 256 x 32 256 x 256 132 x 176 256 x 256
Color Depth24bit True Color
Double Buffering
24bit True Color Double Buffering 18bit Color 24bit True Color
Double Buffering
Z Depth 16bit 16bit 12bit 16bit Power Supply 3.5V 1.8V 1.8V (?) 2.5V
Power Consumption 590mW 120mW 38mW 140mW (Texturing)
80mW (No Texturing) Process
Technology 0.35um EML 0.18um EML 0.18um Logic 0.16um DRAM + M3
Area 45mm2 24mm2 30mm2 44mm2
Embedded Memory 512kb DRAM
6Mb DRAM (4Mb FB, 2Mb
ZB)
2.3Mb SRAM (768Kb TM)
29Mb DRAM (3Mb FB, 2Mb ZB,
24Mb TM) Performance
Indices 68KPxPS/mW No Texturing
580KPxPS/mW No Texturing
100KPxPS/mW100KTxPS/mW
825KPxPS/mW 1.88MTxPS/mW
No. of Logic Transistors 220k Logic
Transistors 300k Logic Transistors 150k Logic Gates
Shading Features
Gouraud Shading
Z-Buffered HSR Alpha Blending
Gouraud ShadingZ-Buffered HSR Alpha Blending
Gouraud Shading
HSR (method ?)Alpha Blending
Gouraud Shading Z-Buffered HSR Alpha Blending
(No perf. degradation) Programmable Shading
Texturing Features X X Bilinear
Mapping
Bilinear MIPMAP Texture Blending
Perspective Correct LOD bias
Special Effects X X Antialiasing
2D Acceleration
Antialiasing (No perf. degradation) Programmable Shading Motion Blur by Memory
Programmer 2D Acceleration
Geometry Engine X X
185Ktris/sec SPP
(FPU x 2, INT x 1)
150Ktris/sec GPP (ARM9 + MAC + BEQ)
Software Support X X Z3D-lib MobileGL
[Table 4.4-2 : Performance Comparison]
4.4.3 Performance of SlimShader with External SDRAM
Since the SlimShader is scalable, it can be ported to any platform regardless
of the fabrication process. Although it shows the highest performance in an MDL
(Merged-DRAM and Logic) process with on-chip frame buffer, depth buffer, and
texture memory, it can also be integrated as an IP core into application processors
where the graphics data are stored in external SDRAM. Fig. 4.4-3 and table 4.4-3
compare the performance degradation of SlimShader under various
configurations (fig. 4.4-4), assuming the core runs at 50MHz and the SDRAM is
attached to graphics-dedicated memory ports. Fig. 4.4-3(a) and (b) show the
performance when 100MHz 32bit SDR-SDRAM and 100MHz 32bit DDR-SDRAM
are attached to the rendering engine, respectively. Although the
performance drops to 20% of its maximum in the worst case
(without any graphics memories on chip, fig. 4.4-4(d)), it is still higher than that of MBX.
When the rendering engine is integrated with a texture cache and a depth buffer
(fig. 4.4-4(b)), the performance with 32bit mobile DDR-SDRAM is even comparable
to that of embedding all memories (fig. 4.4-4(a)).
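The degradation percentages follow directly from the measured fill rates. The sketch below uses the all-external-memory numbers listed in the table of this section and the 100Mpixels/s fully embedded rate as the reference. The peak-bandwidth helper uses the standard bus-width formula, with DDR transferring on both clock edges; it is included only to show why DDR roughly doubles the fill rate, and its name is illustrative.

```python
PEAK_MPIX = 100.0  # all memories embedded, Mpixels/s at 50MHz

# Fill rates in Mpixels/s with all graphics memories external.
ALL_EXTERNAL_MPIX = {
    ("SDR", "point"): 22.0,
    ("SDR", "bilinear"): 19.3,
    ("DDR", "point"): 44.3,
    ("DDR", "bilinear"): 41.3,
}


def degradation_percent(rate_mpix: float) -> float:
    """Performance loss relative to the fully embedded configuration."""
    return 100.0 * (1.0 - rate_mpix / PEAK_MPIX)


def peak_bandwidth_mbytes(clock_mhz: float, bus_bits: int, ddr: bool) -> float:
    """Theoretical peak bandwidth of an SDRAM port in MByte/s."""
    return clock_mhz * bus_bits / 8 * (2 if ddr else 1)


for key, rate in ALL_EXTERNAL_MPIX.items():
    print(key, f"{degradation_percent(rate):.1f}% degradation")
print(peak_bandwidth_mbytes(100, 32, ddr=False), peak_bandwidth_mbytes(100, 32, ddr=True))
```

The bilinear cases come out at roughly 80% (SDR) and 60% (DDR) degradation, in line with the annotations in fig. 4.4-3.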
Rendering performance in pixels/s; all versions at 50MHz @ 0.18um.
Version | SDR-SDRAM, PT Sample | SDR-SDRAM, Bilinear MIPMAP | DDR-SDRAM, PT Sample | DDR-SDRAM, Bilinear MIPMAP
(a) RAMP-IV MDL (All Embedded MEM) | 100M | 100M | 100M | 100M
(b) RAMP-IV Logic #1 (All External MEM) | 22M | 19.3M | 44.3M | 41.3M
(c) RAMP-IV Logic #2 (T$ Embedded) | 24.3M | 23.8M | 48.9M | 47.7M
(d) RAMP-IV Logic #3 (T$ + ZB Embedded) | 47.9M | 45.7M | 96M | 91.7M
[Table 4.4-3 : Rendering Performance with External SDRAM]
[Fig. 4.4-3 : Rendering Performance with External SDRAM]
(Bar chart of pixel fill rate in Mpixels/s for (a) MDL, (b) Logic #1, (c) Logic #2 and (d) Logic #3, with point-sampled and bilinear texturing on SDR-SDRAM and DDR-SDRAM. Annotations mark 60% and 80% performance degradation and the ARM-MBX level.)
[Fig. 4.4-4 : Configuration Example of External SDRAM]
(Block diagrams of the 3DCG-IP: (a) MDL with on-chip frame buffer, depth buffer and texture memory; (b) Logic #1 with depth buffer, texture cache and external SDRAM; (c) Logic #2 with texture cache and external SDRAM; (d) Logic #4 with external SDRAM only.)
4.5 Appendix : Design Information
4.5.1 Area Information
Major Blocks | GDS (mm2) | Silicon, 80% Shrunk (mm2)
SlimShader | 29.5 | 18.88
Texture Memory | 23.6 | 15.104
ARM9 Core with Cache Controller | 16 | 10.24
Frame Buffer | 7.9 | 5.056
Instruction Cache | 7.3 | 4.672
Data Cache | 7.3 | 4.672
Depth Buffer | 5.3 | 3.392
Polygon Buffer | 3.7 | 2.368
Power Management Unit | 2.7 | 1.728
Bandwidth Equalizer | 2.3 | 1.472
[Fig. 4.5-1 : Chip and Major Blocks]
(Chip layout with labeled blocks: SlimShader, Texture Memory, ARM9 Core with Cache Controller, Frame Buffer, Instruction Cache, Data Cache, Depth Buffer, Polygon Buffer, Power Management Unit, Bandwidth Equalizer, Memory Programmer.)
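The silicon areas in the table above are consistent with an 80% linear shrink, which scales area by 0.8 squared. The sketch below reproduces a few rows under that reading; the function name is illustrative.

```python
LINEAR_SHRINK = 0.8  # drawn dimensions are scaled to 80% on silicon


def silicon_area_mm2(gds_area_mm2: float) -> float:
    """Area on silicon after a linear shrink: both dimensions scale, so area scales by 0.8**2."""
    return gds_area_mm2 * LINEAR_SHRINK ** 2


for block, gds in [("SlimShader", 29.5), ("Texture Memory", 23.6), ("Frame Buffer", 7.9)]:
    print(f"{block}: {silicon_area_mm2(gds):.3f} mm2")
```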
Gate counts per pipeline stage (gates):
Pipe | Combinational | Noncombinational | Total
IF | 9.3 | 4.7 | 14
ID1 | 242 | 1,059 | 1,301
ID2 | 52 | 3,119 | 3,171
TS | 42,243 | 3,435 | 45,678
EP | 6,187 | 11,460 | 17,647
HS | 11,103 | 2,947 | 14,050
PI | 4,937 | 5,972 | 10,910
TA1 | 9,426 | 2,491 | 11,917
TA2 | 8,828 | 2,764 | 11,592
TP1 | 1,604 | 6,617 | 8,222
TP2 | 1,332 | 2,711 | 4,044
TP3 | 4,225 | 5,157 | 9,382
TF | 11,269 | 3,814 | 15,084
PB | 6,352 | 1,193 | 7,545
Total | 107,830 | 52,749 | 160,580
[Fig. 4.5-2 : SlimShader Gate Counts - Pipeline]
(Bar chart of combinational, noncombinational and total gate counts for stages IF, ID1, ID2, TS, EP, HS, PI, TA1, TA2, TP1, TP2, TP3, TF and PB.)
Gate counts per functional block (gates):
Block | Combinational | Noncombinational | Total
Interface | 303.3 | 4,182.7 | 4,486
Triangle Setup | 42,243 | 3,435 | 45,678
Edge Processor | 17,290 | 14,407 | 31,697
Pixel Processor | 11,289 | 7,165 | 18,455
Texture Unit | 36,684 | 23,554 | 60,241
[Fig. 4.5-3 : SlimShader Gate Counts – Functional Block]
(Bar chart of combinational, noncombinational and total gate counts for Interface, Triangle Setup, Edge Processor, Pixel Processor and Texture Unit.)
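The per-block gate counts can be obtained by summing the per-stage counts. The grouping of stages into blocks in the sketch below is an assumption inferred from the numbers (it is the grouping under which the sums match the block table), not a mapping stated in the text.

```python
# (combinational, noncombinational) gate counts per pipeline stage.
STAGE_GATES = {
    "IF": (9.3, 4.7), "ID1": (242, 1059), "ID2": (52, 3119),
    "TS": (42243, 3435), "EP": (6187, 11460), "HS": (11103, 2947),
    "PI": (4937, 5972), "TA1": (9426, 2491), "TA2": (8828, 2764),
    "TP1": (1604, 6617), "TP2": (1332, 2711), "TP3": (4225, 5157),
    "TF": (11269, 3814), "PB": (6352, 1193),
}

# Assumed stage-to-block grouping, inferred from the matching sums.
BLOCK_STAGES = {
    "Interface": ["IF", "ID1", "ID2"],
    "Triangle Setup": ["TS"],
    "Edge Processor": ["EP", "HS"],
    "Pixel Processor": ["PI", "PB"],
    "Texture Unit": ["TA1", "TA2", "TP1", "TP2", "TP3", "TF"],
}


def block_totals() -> dict:
    """Sum combinational and noncombinational gates over the stages of each block."""
    totals = {}
    for block, stages in BLOCK_STAGES.items():
        comb = sum(STAGE_GATES[s][0] for s in stages)
        noncomb = sum(STAGE_GATES[s][1] for s in stages)
        totals[block] = (comb, noncomb)
    return totals


for block, (comb, noncomb) in block_totals().items():
    print(f"{block}: {comb:,.1f} combinational, {noncomb:,.1f} noncombinational")
```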
4.5.2 Cell Utilization
Cell Name Utilization
(%) Description
1 LND02D1 20,919 2-input NAND, 1x drive
2 LND02D2 13,418 2-input NAND, 2x drive
3 LMX21D1 11,102 2 > 1 Mux, 1x drive
4 LMFQTNB 8,524 d-enabled f/f, active-low enable, positive-edge, Q only (no reset)
5 LXR02D1 7,504 2-input XOR, 1x drive
6 LIN01D2 4,492 inverter, 2x drive
7 LAN02D1 4,436 2-input AND, 1x drive
8 LNR02D1 4,167 2-input NOR, 1x drive
9 LND12D1 3,658 2-input NAND with /A, 1x drive
10 LIN01D1 3,260 inverter, 1x drive
11 LND03D1 3,254 3-input NAND, 1x drive
12 LND04D1 2,914 4-input NAND, 1x drive
13 LAD01D1 2,619 1-bit full adder
14 LNR03D1 1,375 3-input NOR, 1x drive
15 LLANTNQ 1,166 d-latch active-high enabled, Q only
16 LIN01D4 978 inverter, 4x drive
17 LNI01D2 828 buffer, 2x drive
18 LXR02D2 775 2-input XOR, 2x drive
19 LOR02D1 767 2-input OR, 1x drive
20 LIN01D7 704 inverter, 7x drive
21 LIN01DA 697 inverter, 10x drive
22 LND02D4 657 2-input NAND, 4x drive
23 LDFNTNB 531 d-f/f positive-edge with Q, Qb (no reset)
24 LDFQTNC 478 d-f/f positive-edge with Q only (no reset), 2x drive
25 LDFBFNC 453 d-f/f negative-edge with set and clear, Q, Qb, 2x drive
26 LDFBFNB 393 d-f/f negative-edge with set and clear, Q, Qb
27 LNR02D2 195 2-input NOR, 2x drive
28 LNT01D1 166 tri-state buffer with active high enable, 1x drive
29 LXR03D2 143 3-input XOR, 2x drive
30 LMX21D4 140 2 > 1 Mux, 4x drive
31 LNI01DC 135 buffer, 20x drive
32 LND03D2 129 3-input NAND, 2x drive
33 LNI01DD 124 buffer, 40x drive
34 LNT01D2 96 tri-state buffer with active high enable, 2x drive
35 LNR02D4 91 2-input NOR, 4x drive
36 LND03D4 86 3-input NAND, 4x drive
37 LXN02D2 84 2-input XNOR, 2x drive
38 LNI01D7 81 buffer, 7x drive
39 LNR04D1 81 4-input NOR, 1x drive
40 LMFNTNB 49 d-enabled f/f, active-low enable, positive-edge, Q and Qb (no reset)
41 LNI01DB 46 buffer, 15x drive
42 LNR03D2 45 3-input NOR, 2x drive
43 LIN03DD 41 inverter, 40x drive
44 LHA01D1 39 1-bit half adder
45 LXR02D4 31 2-input XOR, 4x drive
46 LIN02DB 30 inverter, 15x drive
47 LIN03DC 29 inverter, 20x drive
48 LDFNTNC 25 d-f/f positive-edge with Q, Qb (no reset), 2x drive
49 LNI01DA 23 buffer, 10x drive
50 LNR03D4 17 3-input NOR, 4x drive
51 LMX21DA 4 2 > 1 Mux, 10x drive
52 LXR03D1 4 3-input XOR, 1x drive
53 LMX21D7 2 2 > 1 Mux, 7x drive
54 LNT01D4 2 tri-state buffer with active high enable, 4x drive
55 LDFBTNB 0 d-f/f positive-edge with set and clear, Q, Qb
56 LDFBTNC 0 d-f/f positive-edge with set and clear, Q, Qb, 2x drive
57 LDFCTNB 0 d-f/f positive-edge with clear, Q, Qb
58 LDFCTNC 0 d-f/f positive-edge with clear, Q, Qb, 2x drive
59 LDFPTNB 0 d-f/f positive-edge with set, Q, Qb
60 LDFPTNC 0 d-f/f positive-edge with set, Q, Qb, 2x drive
61 LDFQTNB 0 d-f/f positive-edge with Q only (no reset)
62 LIT01D1 0 tri-state inverter with active high enable, 1x drive
63 LIT01D2 0 tri-state inverter with active high enable, 2x drive
64 LIT01D4 0 tri-state inverter with active high enable, 4x drive
65 LIT01D7 0 tri-state inverter with active high enable, 7x drive
66 LIT01DA 0 tri-state inverter with active high enable, 10x drive
67 LND12D2 0 2-input NAND with /A, 2x drive
68 LND12D4 0 2-input NAND with /A, 4x drive
69 LNI01D1 0 buffer, 1x drive
70 LNI01D4 0 buffer, 4x drive
71 LNT01D7 0 tri-state buffer with active high enable, 7x drive
72 LNT01DA 0 tri-state buffer with active high enable, 10x drive
73 LXN02D1 0 2-input XNOR, 1x drive
[Fig. 4.5-4 : SlimShader – Cell Utilization]
(Bar chart of usage per standard cell, ordered as in the table above.)
Chapter 5 System Evaluation
5.1 Target Configurations
[Fig. 5.1-1 : Target Configurations]
(a) Integration with existing A.P. (baseband processor, application processor and RAMP-IV)
(b) Replacement of existing A.P. (baseband processor and RAMP-IV)
(c) Attachment to main A.P. (main application processor and RAMP-IV)
(d) Standalone Processor (RAMP-IV)
The fabricated SoC has four major target configurations, which are summarized in
fig. 5.1-1. The upper examples are for cell phones and the lower ones are for
other devices. Fig. 5.1-1(a) is the integration with an existing application processor
and baseband processor, which can be applied to high-end cell phones. Fig. 5.1-1(b)
shows the replacement of the existing application processor for low-cost cell phones,
where this chip, RAMP-IV, takes the role of the application processor (AP).
Fig. 5.1-1(c) is the attachment to the main application processor, an
example for high-end PDAs and game terminals. Finally, fig. 5.1-1(d) is a standalone
processor for low-cost game terminals, in which the graphics SoC also controls the
overall system.
5.2 REMY : System Evaluation Board
5.2.1 System Architecture
To demonstrate the fabricated chip, I designed an evaluation board, REMY,
choosing the standalone configuration. REMY consists of a RAMP-IV, the
proposed graphics SoC, as the main processor, 32MByte of main memory, a bus
controller, a USB interface, and an LCD display, as shown in fig. 5.2-1. The 3D
applications, ported with MobileGL, are downloaded to the system memory
through the USB interface. Then, RAMP-IV starts drawing pixels on the LCD
screen, interfacing with the main memory.
5.2.2 REMY-I : First Evaluation Board
REMY-I is the first prototype board for chip evaluation and debugging. Fig. 5.2-2
shows its photograph. The first silicon works successfully, and the images are
drawn on a 256x256 area of the 640x480 LCD screen.
5.2.3 REMY-II : PDA Prototype
I also revised the board into a PDA prototype by using smaller parts and
eliminating the debug pins. The images are rendered on a 240x256 area of the
240x320 LCD. The board is shown in fig. 5.2-3.
[Fig. 5.2-1 : REMY System]
(Block diagram: graphics SoC (RAMP-IV); 32MB system memory (Samsung Ut-SRAM, eight SRAM devices); bus interface (Altera EPF10K100) with 8-entry FIFO, SRAM control, synchronizer, init and LCD timing generation; USB interface with USB firmware (Atmel AT89LV92) and USB link (Philips PDUISBD12); LCD.)
[Fig. 5.2-2 : REMY-I]
(Board photograph with RAMP-IV, bus interface, USB, system memory and power supply labeled.)
[Fig. 5.2-3 : REMY-II]
(Board photograph with RAMP-IV, system memory, bus interface, USB and power supply labeled.)
5.3 Graphics Library: MobileGL
To accelerate 3D graphics applications on wireless devices, MobileGL, an
OpenGL-ES compatible graphics library, is proposed and developed. MobileGL
is optimized with hand-written assembly to boost performance
on ARMv4-based platforms. As shown in fig. 5.3-1, MobileGL fills the gap
between 3D games or the MMI (Man-Machine Interface) and the hardware blocks on
the application platform of wireless devices. MobileGL consists of a fixed-point
math library, a geometry engine, and a rendering engine that also includes a S/W
renderer. To reduce the size of the graphics library and to enhance its
performance for mobile 3D gaming applications, many unused functions are cut
out from OpenGL-ES v1.0 and some functions are ported from OpenGL v1.2.
Out of 106 functions in OpenGL-ES v1.0, the 70 most frequently used functions are
chosen, and 6 more functions, related to glBegin and glEnd, are adopted from
OpenGL v1.2 and added to MobileGL. Fig. 5.3-2 shows a code example of
MobileGL and the corresponding assembly language of SlimShader.
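To illustrate what a fixed-point math library of this kind provides, the sketch below implements a few operations in a signed 16.16 format. The 16.16 layout is the one used by the OpenGL ES fixed-point type and is assumed here; this section does not state MobileGL's actual number format, and all function names are hypothetical.

```python
FRAC_BITS = 16          # assumed 16.16 format: 16 integer bits, 16 fractional bits
ONE = 1 << FRAC_BITS    # fixed-point representation of 1.0


def to_fixed(x: float) -> int:
    """Convert a float to 16.16 fixed point (round to nearest)."""
    return int(round(x * ONE))


def to_float(a: int) -> float:
    """Convert a 16.16 fixed-point value back to float."""
    return a / ONE


def fx_mul(a: int, b: int) -> int:
    """Fixed-point multiply: the raw product has 32 fractional bits, so shift back by 16."""
    return (a * b) >> FRAC_BITS


def fx_div(a: int, b: int) -> int:
    """Fixed-point divide: pre-shift the dividend so the quotient keeps 16 fractional bits."""
    return (a << FRAC_BITS) // b


def fx_dot3(u, v) -> int:
    """Dot product of two 3-vectors in fixed point, as used in transforms and lighting."""
    return sum(fx_mul(x, y) for x, y in zip(u, v))


normal = [to_fixed(0.0), to_fixed(0.0), to_fixed(1.0)]
light = [to_fixed(0.5), to_fixed(0.5), to_fixed(0.5)]
print(to_float(fx_dot3(normal, light)))
```

On a processor without a floating-point unit, operations like these reduce to integer multiplies and shifts, which is the usual motivation for a fixed-point library.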
5.4 Demonstration
Fig. 5.4-1 shows pictures captured from the REMY system running real-time
3D applications.
[Fig. 5.3-1 : Application Platform of Wireless Devices]
(Layer diagram. Application: MPEG-4 video, MP3 audio, 3D games, MMI (menu). Mobile host S/W: application platform (BREW), operating system, and MobileGL with geometry engine, rendering engine, fixed-point math library, S/W renderer, H/W port, ARMv4 ASM and SlimShader ASM. Hardware: baseband modem, application processor, 3D graphics accelerator.)
Code Example: MobileGL
>> glGenTextures(1,&texName);
>> glBindTexture(GL_TEXTURE_2D, texName);
>> glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MIN_FILTER,GL_LINEAR_MIPMAP_NEAREST);
>> glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MAG_FILTER,GL_LINEAR);
>> glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 256, 256, GL_RGB, GL_UNSIGNED_BYTE, *texels256);
>> glTexImage2D(GL_TEXTURE_2D, 1, GL_RGB, 128, 128, GL_RGB, GL_UNSIGNED_BYTE, *texels128);
>> glTexImage2D(GL_TEXTURE_2D, 2, GL_RGB, 64, 64, GL_RGB, GL_UNSIGNED_BYTE, *texels64);
>> glTexImage2D(GL_TEXTURE_2D, 3, GL_RGB, 32, 32, GL_RGB, GL_UNSIGNED_BYTE, *texels32);
>> glTexImage2D(GL_TEXTURE_2D, 4, GL_RGB, 16, 16, GL_RGB, GL_UNSIGNED_BYTE, *texels16);
>> glTexImage2D(GL_TEXTURE_2D, 5, GL_RGB, 8, 8, GL_RGB, GL_UNSIGNED_BYTE, *texels8);
>> glTexImage2D(GL_TEXTURE_2D, 6, GL_RGB, 4, 4, GL_RGB, GL_UNSIGNED_BYTE, *texels4);
>> glTexImage2D(GL_TEXTURE_2D, 7, GL_RGB, 2, 2, GL_RGB, GL_UNSIGNED_BYTE, *texels2);
>> glTexImage2D(GL_TEXTURE_2D, 8, GL_RGB, 1, 1, GL_RGB, GL_UNSIGNED_BYTE, *texels1);
>> glTexEnvf(GL_TEXTURE_ENV,GL_TEXTURE_ENV_MODE,GL_MODULATE);
>> glEnable(GL_TEXTURE_2D);
Code Example: SlimShader ASM
>> TSTR 0x00000 R G B .....
>> TSTR 0x10000 R G B .....
>> TSTR 0x13ED4 R G B .....
>> TSTR 0x15000 R G B .....
>> TSTR 0x15400 R G B .....
>> TSTR 0x15500 R G B .....
>> TSTR 0x15540 R G B .....
>> TSTR 0x15550 R G B .....
>> TMOD 0x0000 0010 0010 0x01 0000 256
[Fig. 5.3-2 : MobileGL Code Example]
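The MIPMAP levels uploaded in the example halve in size from 256x256 down to 1x1. The sketch below computes where each level would start if the levels were packed back to back with one address per texel. That packing rule is an assumption, not a documented SlimShader layout, although most of the TSTR addresses in the figure coincide with the offsets it produces.

```python
def mip_base_addresses(size: int) -> list:
    """Base address of each MIPMAP level, assuming contiguous packing and one address per texel."""
    addresses = []
    address = 0
    while size >= 1:
        addresses.append(address)
        address += size * size   # a square level occupies size*size texels
        size //= 2               # the next level is half as wide and half as tall
    return addresses


for level, base in enumerate(mip_base_addresses(256)):
    print(f"level {level}: 0x{base:05X}")
```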
[Fig. 5.4-1 : Demonstration Results]
Chapter 6 Conclusions and Further Work
6.1 Conclusions
A low-power graphics SoC implementing a full 3D pipeline with texturing and
special rendering effects is designed, implemented and demonstrated for mobile
multimedia applications, and published for the first time in the world.
The graphics SoC contains a 32bit RISC processor with an enhanced MAC as a
geometry engine, a hard-wired 3D rendering engine, a programmable power
optimizer and 29Mb of embedded DRAM. The chip can perform 1Mvertices/s
transformation with the custom-designed MobileGL, and it can draw pixels at
66Mpixels/s and 264Mtexels/s on a 256x256 LCD display. Dedicated
hardware engines and embedded DRAM lower the operating frequency to
33MHz. The row cycle and latency of the embedded DRAM are optimized for
the frame buffer, depth buffer, and texture memory. The overall power can be
reduced further by three-step frequency scaling and block-level clock gating.
The rendering engine consists of SlimShader, the main rendering pipeline, and
the Memory Programmer, a post-processing unit. Both are designed with a primary
focus on low power consumption. SlimShader supports a subset of OpenGL
rendering functions with 13 128bit-encoded instructions. It is composed of 14
pipeline stages, and power is saved by activating only the necessary
stages. Depth-first clock gating and latch-enabling remove
unnecessary datapath transitions as much as possible. The SlimShader performs
horizontal-order rasterization with 3D-optimized DRAM to simplify the design.
The hard-wired triangle setup engine is implemented by simplifying the algorithm
and optimizing the datapath precision for low power and small area.
Using multipliers and a shared 11bit floating-point LUT for the SIMD divider saves
40% of the area compared with 16bit fixed-point calculation, while delivering
the required precision.
The energy-efficient texturing unit performs perspective correction and
bilinear MIPMAP filtering for better image quality. For perspective correction,
an approximate division scheme that rounds off the LSBs of 1/w is proposed,
reducing the divider area by 95% within a 0.78% error bound.
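One way to see how dropping LSBs of 1/w leads to a bound of this size: truncating a normalized mantissa to 7 fractional bits gives a relative error below 2^-7, which is about 0.78%. The sketch below demonstrates that bound numerically. The 7-bit width is an assumption chosen because it matches the quoted figure; this section does not say how many bits the hardware keeps, and the function name is hypothetical.

```python
import math

FRAC_BITS = 7  # assumed number of mantissa fraction bits kept (2**-7 is about 0.78%)


def truncate_mantissa(x: float, frac_bits: int = FRAC_BITS) -> float:
    """Keep only frac_bits fractional bits of the normalized mantissa of x > 0."""
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= m < 1
    mantissa = m * 2.0                   # renormalize to 1 <= mantissa < 2
    scale = 1 << frac_bits
    truncated = math.floor(mantissa * scale) / scale
    return math.ldexp(truncated, e - 1)


worst = 0.0
for i in range(1, 5000):
    w = 1.0 + i * 0.013                  # sample homogeneous w values
    inv_w = 1.0 / w
    approx = truncate_mantissa(inv_w)
    worst = max(worst, (inv_w - approx) / inv_w)

print(f"worst relative error of truncated 1/w: {100 * worst:.3f}%")
```

A narrower 1/w operand lets the following divider or multiplier shrink, which is the kind of area saving the paragraph above describes.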
Address Alignment Logic (AAL) reduces the texture requests with a spatial
aligner and a temporal aligner. The spatial aligner compares the requests of the two
pixel processors and eliminates the overlapping ones. Then, the temporal aligner
reduces the requests further by comparing the current requests with recently used
ones, without using a power-consuming SRAM cache. The AAL reduces the
energy consumption by 66% since it reduces both the power consumption and the
number of operation cycles.
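The two aligners can be modeled in a few lines. In the sketch below, each of two pixel processors requests the four texels of a bilinear footprint per cycle; the spatial step merges the two footprints, and the temporal step drops texels already fetched in the previous cycle. The one-cycle history, the data structures and all names are assumptions made for illustration, not the actual AAL design.

```python
def bilinear_footprint(u: float, v: float) -> set:
    """Addresses (x, y) of the four texels used to filter sample (u, v), with u, v >= 0."""
    x0, y0 = int(u), int(v)
    return {(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)}


def align_requests(samples_a, samples_b):
    """Return (raw, aligned) texel request counts for two pixel processors."""
    raw = 0
    aligned = 0
    recent = set()                       # texels fetched in the previous cycle
    for (ua, va), (ub, vb) in zip(samples_a, samples_b):
        fa = bilinear_footprint(ua, va)
        fb = bilinear_footprint(ub, vb)
        raw += len(fa) + len(fb)         # eight requests per cycle without alignment
        merged = fa | fb                 # spatial aligner: drop overlaps between the two pixels
        needed = merged - recent         # temporal aligner: drop texels held from the last cycle
        aligned += len(needed)
        recent = merged
    return raw, aligned


# Two processors walking a scanline at about one texel per pixel.
samples_a = [(2 * i + 0.5, 0.5) for i in range(4)]
samples_b = [(2 * i + 1.5, 0.5) for i in range(4)]
raw, aligned = align_requests(samples_a, samples_b)
print(raw, aligned)
```

In this toy scanline the request count drops from 32 to 18; the actual savings depend on the texture access pattern.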
The Memory Programmer post-processes the rendered pixels while transferring them to
the display controller, in parallel with the SlimShader. It contains crossbar
switches and a SIMD-parallel datapath which is controlled by its own 16bit
commands. Special rendering effects such as full-scene antialiasing, motion
blur, and fog can be programmed without degrading the performance of the main
pipeline because the pixels are transferred directly to the LCD controller.
Also, 12 distributed DRAMs reduce the operating power of the graphics
memory by up to 75%, since only the necessary memories are selectively
activated, while providing up to 2.4GByte/s at 50MHz. The memories can be
selectively refreshed for data retention in standby modes by 4 power-control
instructions.
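The quoted peak bandwidth can be broken down per clock cycle and per memory. The sketch below assumes 1GByte = 10^9 bytes and an even split of the bandwidth across the 12 DRAMs; neither assumption is stated in the text, and the per-DRAM width it derives is only what those assumptions imply.

```python
PEAK_BANDWIDTH_BYTES_PER_S = 2.4e9   # "up to 2.4GByte/s", taking 1 GByte = 1e9 bytes
CLOCK_HZ = 50e6                      # 50MHz
NUM_DRAMS = 12                       # distributed DRAM macros

bytes_per_cycle = PEAK_BANDWIDTH_BYTES_PER_S / CLOCK_HZ
bits_per_dram_per_cycle = bytes_per_cycle * 8 / NUM_DRAMS   # assumes an even split

print(bytes_per_cycle, bits_per_dram_per_cycle)
```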
The chip is implemented in a 0.16um 256Mb-compatible DRAM process to
reduce the fabrication cost. The logic components, SRAM and analog blocks are
drawn with the design rules of the peripheral transistors of the DRAM. This
DRAM-based SoC implementation makes it possible to integrate a large on-chip
memory with inherently little leakage current, which is important for mobile multimedia
applications. The full 3D graphics pipeline, featuring 1Mvertices/s, 66Mpixels/s and
264Mtexels/s texture-mapped 3D graphics, consumes less than 210mW and occupies
121mm2 of chip area. The embedded DRAM drastically reduces the power
consumption since the external I/Os for 3D rendering are completely eliminated,
and an additional 22% reduction is obtained by Address Alignment Logic and
Depth-First Clock-Gating. This chip achieves the highest performance among
previous and recently implemented chips.
Two evaluation boards are designed to demonstrate the fabricated chip. The
3D graphics images are successfully demonstrated on each board with MobileGL,
running real-time applications.
Therefore, this work allows 3D graphics to be implemented for mobile
multimedia applications.
6.2 Further Work
Since this research solves the bottleneck of 3D rendering in mobile
applications, the next step is to accelerate the geometry stage to balance and speed up
the entire graphics pipeline. Also, as the direction of today's PC graphics shows,
programmable shading must be implemented on mobile devices to draw pixels
with higher fidelity.
Chapter 7 Summary
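Summary (translated from Korean)
Design and Implementation of Low-Power 3D Graphics SoC
for Mobile Multimedia Applications
This research addresses 3D graphics accelerators for low-power information
terminals such as mobile phones. A low-power 3D rendering engine called SlimShader
is newly proposed. It consists of a 14-stage low-power pipeline. It adopts horizontal
scan rasterization to improve the efficiency of memory access, and a divider built
from a look-up table and multipliers to increase the calculation speed and simplify
the design. It also adopts Address Alignment Logic, which reduces the memory
accesses during texture mapping to as little as 1/4 and the energy consumption to 1/3.
To verify the proposed architecture, an ARM9, the 3D rendering engine, caches,
texture memory, depth buffer, frame buffer and power management unit were
implemented in a 0.16um DRAM process with 29Mbit DRAM, 72kbit SRAM and 1M logic
transistors. The chip operates in three modes, Fast, Normal and Slow, and was
confirmed to consume 210mW (Fast mode, 33MHz) when rendering 3D images with
bilinear MIPMAP. A system evaluation board called REMY was built and, together with
the MobileGL S/W library, demonstrated that 3D images are displayed at high speed
on an LCD screen at 256x256 resolution.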
Chapter 8 Bibliography
[1] Takashi Hashimoto, et al, “A 27-MHz/54-MHz 11-mW MPEG-4 Video Decoder
LSI for Mobile Applications,” IEEE J. Solid-State Circuits, vol. 37, pp. 1574-1581,
Nov. 2002
[2] Tsuyoshi Nishikawa, et al., “A 60Mhz 230mW MPEG-4 Video-Phone LSI with
16Mb Embedded DRAM,” in ISSCC Digest of Technical Papers, pp. 230-231, Feb.
2000
[3] Khronos Group, “Bringing 3D Gaming to Cell Phones,” Game Developers
Conference 2003
[4] G. K. Kolli, S. Junkins, H. Barad, ”3D Graphics Optimizations for ARM
Architecture,” Proceedings of the Game Developers Conference 2002, March 2002
[5] Alan Watt, “3D Computer Graphics,” 3rd Ed, 2000, Addison-Wesley
[6] Yong-Ha Park, et al, “A 7.1-GB/s Low-Power Rendering Engine in 2-D Array-
Embedded Memory Logic CMOS for Portable Multimedia System,” IEEE J. Solid-
State Circuits, vol. 36, pp. 944-955, Jun. 2001
[7] Ramchan Woo, et al., “A 120mW 3D Rendering Engine with 6Mb Embedded
DRAM and 3.2Gbyte/s Runtime Reconfigurable Bus for PDA-Chip,” IEEE J. Solid-
State Circuits, vol. 37, pp. 1352-1355, Oct. 2002
[8] Chi-Weon Yoon et al, “A 80/20MHz 160mW Multimedia Processor integrated
with Embedded DRAM, MPEG-4 and 3D Rendering Engine for Mobile
Applications,” IEEE J. Solid-State Circuits, vol. 36, pp. 1758-1767, Nov. 2001
[9] Se-Jeong Park, et al, “A Reconfigurable Multilevel Parallel Texture Cache
Memory With 75-GB/s Parallel Cache Replacement Bandwidth,” IEEE J. Solid-State
Circuits, pp. 612-623, May. 2002
[10] Aurangzeb K. Khan et al, “A 150MHz Graphics Rendering Processor with
256Mb Embedded DRAM,” in ISSCC Dig. Tech. Papers, pp. 150-151, Feb. 2001
[11] John S. Montrym et al, “InfiniteReality: A Real-Time Graphics System,” in Proc.
SIGGRAPH, pp. 293-302, 1997
[12] W.R. Hamburgen, et al, “Itsy : stretching the bounds of mobile computing,”
IEEE Computer, vol 34, pp. 28-36, Apr. 2001
[13] Dennis D. Buss, “Technology in the Internet Age,” ISSCC Digest of Technical
Papers, pp. 18-21, Feb. 2002
[14] www.opengl.org
[15] www.microsoft.com/directx
[16] Ramchan Woo, “Design and Implementation of Low-Power Embedded 3D
Graphics Rendering Engine for Mobile Applications using the Embedded Memory
Logic Technology,” M.S. Dissertation, KAIST 2001.
[17] Ramchan Woo, et al, “A 210mW Graphics LSI Implementing Full 3D Pipeline
with 264Mtexels/s Texturing for Mobile Multimedia Applications,” in ISSCC Digest
of Technical Papers, pp. 44-45, Feb. 2003
[18] Ramchan Woo, et al, “A Low-Power and high-Performance 2D/3D Graphics
Accelerator for Mobile Multimedia Applications,” Hot Chips 2003
[19] Ramchan Woo, et al, “A Low Power 3D Rendering Engine with Two Texture
Units and 29Mb Embedded DRAM for 3G Multimedia Terminals,” in Proc. of
European Solid-State Circuits Conference, pp. 53 – 56, 2003
[20] Ramchan Woo, et al, “A 210mW Graphics LSI Implementing Full 3D Pipeline
with 264Mtexels/s Texturing for Mobile Multimedia Applications,” IEEE J. Solid-
State Circuits, Accepted for Publication
[21] Ramchan Woo, et al, “A Low-Power 3D Rendering Engine with Two Texture
Units and 29Mb Embedded DRAM for 3G Multimedia Terminals,” IEEE J. Solid-
State Circuits, Accepted for Publication
[22] Ramchan Woo, et al, “A Low-Power Graphics LSI integrating 29Mb Embedded
DRAM for Mobile Multimedia Applications,” University Design Contest, Asia and
South Pacific Design Automation Conference 2004, Accepted for Presentation
[23] Masatoshi Kameyama, et al, "3D Graphics LSI Core for Mobile Phone Z3D,"
ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2003
[24] “ARM MBX HR-S 3D Graphics Core Technical Overview,” Technical
Document, ARM DTO-0003B, 2002
[25] Feng Xie and Michael Shantz, “Adaptive Hierarchical Visibility in a Tiled
Architecture,” ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp.
75-84, 1999
[26] Tomas Akenine-Moller, Jacob Strom, “Graphics for the Masses: A Hardware
Rasterization Architecture for Mobile Phones,” Proc. of ACM SIGGRAPH, pp. 801-
808, 2003
[27] Junichi Fujita, et al, “A 109.5mW 1.2V, 600Mtexels/s 3-D Graphics Engine,” in
ISSCC Digest of Technical Papers, pp. 332-333, Feb. 2004
[28] Gregory A. Uvieghara, et al, “A Highly-Integrated 3G CDMA2000 1X Cellular
Baseband Chip with GSM/AMPS/GPS/Bluetooth/Multimedia Capabilities and ZIF
RF Support,” in ISSCC Digest of Technical Papers, pp. 422-423, Feb. 2004
[29] T. Kamei, et al, “A Resume-Standby Application Processor for 3G Cellular
Phones,” in ISSCC Digest of Technical Papers, pp. 336-337, Feb. 2004
[30] Fumio Arakawa, et al, “An Embedded Processor Core for Consumer Appliances
with 2.8GFLOPS and 36M polygons/s FPU,” in ISSCC Digest of Technical Papers, pp.
334-335, Feb. 2004
[31] Khronos Group, “OpenGL ES Common/Common-Lite Profile Specification,”
version 1.0 (Annotated)
[32] JSR-184 Expert Group, “Mobile 3D Graphics API for Java 2 Micro Edition,”
Public Review Draft, Apr. 30, 2003.
[33] Z.S. Hakura, et al, “The Design and Analysis of a Cache Architecture for Texture
Mapping,” Proc. of the 24th International Symposium on Computer Architecture,
1997
[34] Young-Don Bae, et al., “A Single-Chip Programmable Platform Based on a
Multithreaded Processor and Configurable Logic Clusters,” ISSCC Digest of
Technical Papers, pp. 336-337, Feb. 2002
[35] Ju-Ho Sohn, et al, “Optimization of Portable System Architecture for Real-time
3D Graphics,” in IEEE International Symposium on Circuits and Systems
Proceedings, pp. I769-I772, 2002
[36] Michael Cox, et al, “Multi-Level Texture Caching for 3D Graphics Hardware,”
ACM/IEEE International Symposium on Computer Architecture, pp. 86-97, 1998
[37] Homan Igehy, et al, “Parallel Texture Caching,” ACM SIGGRAPH/Eurographics
Workshop, pp. 95 – 106, 1999
[38] Homan Igehy, et al, “Prefetching in a Texture Cache Architecture,” ACM
SIGGRAPH/Eurographics Workshop, 1998
[39] “ARM Architecture Reference Manual,” Technical Document, ARM DUI-0100B,
1996
[40] Michael F. Deering, et al, “FBRAM : A new Form of Memory Optimized for 3D
Graphics,” SIGGRAPH, pp. 167-173, 1994
[41] L. Williams, “Pyramidal Parametrics,” SIGGRAPH, pp. 1-11, 1983
[42] Paul S. Heckbert, “Survey of Texture Mapping,” IEEE Computer Graphics and
Applications, vol. 6, no. 11, Nov. 1986, pp. 56-67
[43] Jon P. Ewins, et al, “MIP-Map Level Selection for Texture Mapping,” IEEE
Transactions on Visualization and Computer Graphics, vol. 4, no. 4, pp 317-329, Oct.-
Dec., 1998
[44] John Montrym and Henry Moreton, “nVidia GeForce4,” HotChips 2002
[45] “3 次元グラフィックスで変するモバイル。ゲーム機,” Nikkei Electronics
pp. 77 – 86, 10-27, 2003
[46] Joel McCormack, et al, “Neon : A (Big) (Fast) Single-Chip 3D Workstation
Graphics Accelerator,” Research Report 98/1, Compaq Computer Corporation
Western Research Laboratory, 1999.
[47] O. Lathrop, D. Kirk, et al, “Accurate Rendering by Subpixel Addressing,” IEEE
Computer Graphics and Applications, pp 45-52, Sep., 1990.
[48] Anders Kugler, “The Setup for Triangle Rasterization,” Eurographics, pp 49-58,
1996
[49] B. Barenbrug, et al, “Algorithms for Division Free Perspective Correct
Rendering,” ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2000
[50] Tomas Akenine-Moller and Eric Haines, “Real-Time Rendering,” 2nd Ed, 2002,
AK Peters
[51] Christoforos E. Kozyrakis, David A. Patterson, “Scalable Vector Processors for
Embedded Systems,” IEEE Micro, pp. 36-45, Nov.-Dec., 2003
[52] Semiconductor Industry Association, “International Technology Roadmap for
Semiconductors,” 2002
[53] http://www.ati.com
Acknowledgment
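Since I joined the Semiconductor System Laboratory in 1999 after finishing my undergraduate studies, a great many people have helped me become who I am today, and I bow my head in sincere gratitude to all of them: my parents, my professors, the members of SSL, the best laboratory in the world, my friends and colleagues, and everyone at the company.
What would it mean to list every name one by one and set down my gratitude in a few written words? As I have been taught to do until now, I will act on it myself and, without fail, repay you with the best results within ten years.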
Design is not the creation,
but the process of decision.
DEPT. OF E.E, KAIST • GUSEONG-DONG, YUSEONG-KU, 305-701 • DAEJEON, KOREA • +82-42-869-8068 [email protected] • http://ssl.kaist.ac.kr/~ramchan/main.html
RAMCHAN WOO
EDUCATION
Korea Advanced Institute of Science and Technology
- Full Scholarship from the Korean Government
3/01 – 8/04 Ph.D. in Electrical Engineering
Thesis : Design and Implementation of Low-Power 3D Graphics SoC for Mobile Multimedia Applications
3/99 – 2/01 M.S. in Electrical Engineering
Thesis : Design and Implementation of Low-Power Embedded 3D Graphics Rendering Engine for Mobile Applications using the Embedded Memory Logic Technology
Course GPA : 3.78/4.3
3/95 – 2/99 B.S. in Electrical Engineering, Summa Cum Laude
Overall GPA : 3.95/4.3 – Major GPA : 4.15/4.3
Taejon Science High School
- Scholarship from Samsung Heavy Industry
3/93 – 2/95 Valedictorian, one-year-early graduation
INTERNATIONAL JOURNAL PAPERS (4 FIRST AUTHORED)
IEEE Micro
RAMP: Bringing 3D Graphics Hardware to Wireless Applications with Embedded DRAM Technology
Ramchan Woo and Hoi-Jun Yoo
IEEE Micro, Submitted
JSSC 2004
A Low-Power 3D Rendering Engine with Two Texture Units and 29Mb Embedded DRAM for 3G Multimedia Terminals
Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae and Hoi-Jun Yoo
IEEE Journal of Solid-State Circuits, Vol. 39, No. 7, July, 2004
JSSC 2004
A 210mW Graphics LSI Implementing Full 3D Pipeline with 264Mtexels/s Texturing for Mobile Multimedia Applications
Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, and Hoi-Jun Yoo
IEEE Journal of Solid-State Circuits, Vol. 39, No. 2, February, 2004
JSSC 2002
A 120mW 3D Graphics Rendering Engine with 6Mb Embedded DRAM and 3.2Gbyte/s Runtime Reconfigurable Bus for PDA-Chip
Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, and Hoi-Jun Yoo
IEEE Journal of Solid-State Circuits, Vol. 37, No. 10, October, 2002
JSSC 2002
A Reconfigurable Multilevel Parallel Texture Cache Memory With 75-GB/s Parallel Cache Replacement Bandwidth
Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, Tae-Hum Yang, Jin-Yong Jung and Hoi-Jun Yoo
IEEE Journal of Solid-State Circuits, Vol. 37, No. 5, May, 2002
JSSC 2001
An 80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications
Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo
IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, November, 2001
INTERNATIONAL CONFERENCE PAPERS (6 FIRST AUTHORED)
ISSCC 2003
A 210mW Graphics LSI implementing Full 3D Pipeline with 264Mtexels/s Texturing for Mobile Multimedia Applications
Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae, Chi-Weon Yoon, Byeong-Gyu Nam, Jeong-Ho Woo, Sung-Eun Kim, In-Cheol Park, Sungwon Shin, Kyung-Dong Yoo, Jin-Yong Chung, and Hoi-Jun Yoo
IEEE International Solid-State Circuits Conference (ISSCC 2003 Proceedings)
Hot Chips 2003
A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications
Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae, and Hoi-Jun Yoo
15th International Hot Chips Conference
Graphics Hardware 2004
A Programmable Vertex Shader with Fixed-Point SIMD Datapath for Low Power Wireless Applications
Ju-Ho Sohn, Ramchan Woo, and Hoi-Jun Yoo
Eurographics / Graphics Hardware Workshop 2004, Accepted for Presentation
ESSCIRC 2003
A Low-Power 3D Rendering Engine with Two Texture Units and 29Mb Embedded DRAM for 3G Multimedia Terminals
Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, and Hoi-Jun Yoo
IEEE European Solid-State Circuits Conference
ASP-DAC Design Contest 2004
A Low-Power Graphics LSI integrating 29Mb Embedded DRAM for Mobile Multimedia Applications
Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae, and Hoi-Jun Yoo
Asia and South Pacific Design Automation Conference 2004 University Design Contest
ISSCC 2001
A 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications
Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae, In-Cheol Park, and Hoi-Jun Yoo
IEEE International Solid-State Circuits Conference (ISSCC 2001 Proceedings)
ISSCC 2000
A 7.1GB/s Low Power 3D Rendering Engine in 2D Array Embedded Memory Logic CMOS
Yong-Ha Park, Seon-Ho Han, Jung-Su Kim, Se-Joong Lee, Jeong-Hun Kook, Jae-Won Lim, Ramchan Woo, Hoi-Jun Yoo, Jeong-Hwan Lee, and Jay-Hyun Lee
IEEE International Solid-State Circuits Conference (ISSCC 2000 Proceedings)
Symp. on VLSI Circuits 2001
A 120mW Embedded 3D Graphics Rendering Engine with 6Mb Logically Local Frame-Buffer and 3.2GByte/s Run-time Reconfigurable Bus for PDA-Chip
Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Yong-Ha Park and Hoi-Jun Yoo
IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)
Symp. on VLSI Circuits 2001
Low Power Motion Compensation Block IP with embedded DRAM Macro for Portable Multimedia Applications
Chi-Weon Yoon, Jeonghoon Kook, Ramchan Woo, Se-Joong Lee, Kangmin Lee and Hoi-Jun Yoo
IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)
Symp. on VLSI Circuits 2001
A Reconfigurable Multimedia Parallel Graphics Cache Memory with 75GB/s Parallel Cache Replacement Bandwidth
Se-Jeong Park, Jeongsu Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, T.H. Yang, J.Y. Jung and Hoi-Jun Yoo
IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)
Symp. on VLSI Circuits 2001
480ps 64bit Race Logic Adder
Se-Joong Lee, Ramchan Woo and Hoi-Jun Yoo
IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)
ISCAS 2002
Optimization of Portable System Architecture for Real-Time 3D Graphics
Juho Sohn, Ramchan Woo, and Hoi-Jun Yoo
IEEE International Symposium on Circuits and Systems (ISCAS 2002 Proceedings)
ISCAS 2001
A Comparative Analysis of a DDR-SDRAM and a D-RDRAM using a POPeye Simulator
Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Ja-Il Ku, Tae-Sung Jung, and Hoi-Jun Yoo
IEEE International Symposium on Circuits and Systems (ISCAS 2001 Proceedings)
ISCAS 2000
A 670ps, 64bit Dynamic Low-Power Adder Design
Ramchan Woo, Se-Joong Lee, and Hoi-Jun Yoo
IEEE International Symposium on Circuits and Systems (ISCAS 2000 Proceedings)
Others
7.1GB/s Bandwidth 3D Rendering Engine using the EML Technology
Yong-Ha Park, Ramchan Woo, Seon-Ho Han, Jung-Su Kim, Se-Joong Lee, Jeong-Hun Kook, Jae-Won Lim, and Hoi-Jun Yoo
IEEE International Conference on VLSI and CAD (ICVC 1999 Proceedings)
DOMESTIC PAPERS
Magazines
The Technology Trends of Embedded Processors on Portable Systems
Hoi-Jun Yoo and Ramchan Woo
The Magazine of the IEEK, July, 2001
Journals
POPeye : A System Analysis Simulator for DRAM Performance Evaluation
Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeong-Hun Kook, Yon-Kyun Im, and Hoi-Jun Yoo
Journal of Semiconductor Technology and Science, Vol. 1, No. 2, June, 2001
WORK EXPERIENCE
Korea Advanced Institute of Science and Technology
3/99 – 8/04 Research Assistant – Performed research on circuit and system design and chip implementation, with a primary focus on mobile 3D computer graphics.
3/99 – 8/04 Teaching Assistant – Assisted in teaching Electronics Laboratory and Microelectronic Circuit Design.
Sandcraft, Santa Clara, CA, USA
1/99 – 2/99 Winter Intern – Interned in the circuit division, designing a high-speed adder.
LG Semiconductor, Cheong-ju, Korea
1/98 – 2/98 Winter Intern – Interned in the flash-memory division.
INDUSTRY PROJECTS
RAMP (RAM Processor)
Development of Application Specific Embedded Memory Logic Design Technology
Sponsored by Korea Ministry of Science and Technology, Korea Ministry of Commerce, Industry and Energy.
7/02 – 6/03 Technical Advisor
7/01 – 6/02 Chief Researcher, Team Leader
Responsible for 3D-enhanced multimedia PDA-system architecture and design
Responsible for full-chip architecture and design
Responsible for portable 3D graphics accelerator architecture and design
8/00 – 6/01 Responsible for portable 3D graphics accelerator architecture design.
10/99 – 7/00 Responsible for “Embedded 3D Graphics Rendering Engine for PDA-Chip” design.
2/99 – 9/99 “DRAM-embedded high performance 3D rendering engine” layout.
DA-1
Development of 3D Graphics Accelerator IP for Mobile Application Processor SoC
Sponsored by Samsung Electronics
5/03 – 08/04 Responsible for 3D Rendering Engine Architecture
MobileGL-C1
Development of 3D Graphics Library for Wireless Cellular Phones
Sponsored by Mcres
3/04 – 5/04 Team Leader
Responsible for Library Specification and 3D Rendering Code Optimization for ARM7
RAMP-C1
Development of Low-Power Graphics SoC Platform
Sponsored by Korea Ministry of Information and Communication
3/04 – 8/04 Technical Advisor
Responsible for Hardware Specification for 3D Graphics
POPeye
Development of Emulator to Analyze DRAM Architecture and Performance
Sponsored by Samsung Electronics
2/99 – 10/00 Modeling and Performance Analysis of DDR-SDRAM.
PATENTS
Method for Memory Addressing
Ramchan Woo, Chi-Weon Yoon, and Hoi-Jun Yoo
U. S. Patent 6,400,640 B2 (Jun. 4, 2002), Korea Patent 368132 (Jan. 14, 2003)
A Low-Power Instruction Decoding Method for Microprocessor
Ramchan Woo and Hoi-Jun Yoo
Korea Patent 0324253 (Jan. 30, 2002)
U. S. Application Number 09/964,387, Pending
Japan Application Number 2000-363741, Pending
Europe Application Number 100 54 434.7, Pending
Taiwan Application Number 89,123,526, Pending
Virtually Spanning 2D Array (ViSTA) Architecture and Memory Mapping Method for Embedded 3D Graphics Rendering Accelerator
Ramchan Woo and Hoi-Jun Yoo
Korea Patent 372090
System for Calculating 3D Computer Graphics on Portable Devices
Ramchan Woo, Se-Joong Lee, Jeonghoon Kook, Chi-Weon Yoon and Hoi-Jun Yoo
Korea Application Number 2001-53827, Pending
Method and Apparatus for Enhancing Texture Memory Access Performance for 3D Computer Graphics
Ramchan Woo and Hoi-Jun Yoo
Korea Application Number 2002-7868, Pending
Method and Apparatus for Efficient Buffer Memory Utilization with Adaptive Flow Control in the Queue System
Ju-ho Sohn, Ramchan Woo, and Hoi-Jun Yoo
Korea Application Number 2002-13883, Pending
Method and Apparatus for Accelerating 2D/3D Multimedia Processing by Using the Coprocessor
Ju-ho Sohn, Ramchan Woo, and Hoi-Jun Yoo
Korea Application Number 2003-14021, Pending
Method and Apparatus for Accelerating 2D/3D Multimedia Operations by Using the Streaming SIMD Coprocessor in Portable System
Ju-ho Sohn, Ramchan Woo, and Hoi-Jun Yoo
Pending
RESEARCH INTERESTS
Mobile 2D/3D Graphics Architecture and its Circuit Design
Graphics Library and Software Platform for Cell-Phones
Multimedia Signal Processor for Consumer Electronics
SKILLFUL TOOLS
Graphics Library : OpenGL
High-level Simulation : C/C++, SystemC
Logic Design : Verilog-XL, Synopsys Design Compiler, Apollo P&R Tools, Dynacell
Circuit Design : Cadence Opus, Hspice, EPIC nanosim, Calibre, Hercules
PCB Design : Orcad
LANGUAGES
Korean (native)
English and Japanese (fluent)
Chinese (intermediate)