
박 사 학 위 논 문

Doctoral Thesis

휴대용 멀티미디어 기기를 위한

저전력 3 차원 그래픽 SoC 의 설계 및 구현

Design and Implementation of Low-Power 3D Graphics SoC

for Mobile Multimedia Applications

우 람 찬 (禹籃燦 Woo, Ramchan)

전자전산학과 전기및전자공학 전공

Department of Electrical Engineering and Computer Science

Division of Electrical Engineering

한 국 과 학 기 술 원

Korea Advanced Institute of Science and Technology

2004



Design and Implementation of Low-Power 3D

Graphics SoC for Mobile Multimedia Applications

Advisor : Professor Yoo, Hoi-Jun

By

Ramchan Woo

Department of Electrical Engineering and Computer Science

Division of Electrical Engineering

Korea Advanced Institute of Science and Technology

A thesis submitted to the faculty of the Korea Advanced Institute of Science and Technology in partial fulfillment of requirements of the degree of Doctor of Philosophy in the Department of Electrical Engineering and Computer Science, Division of Electrical Engineering.

Daejeon, Korea

2004. 6. 1

Approved by

Professor Yoo, Hoi-Jun



우 람 찬

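This thesis has passed the review of the thesis committee as a doctoral thesis of the Korea Advanced Institute of Science and Technology.

May 12, 2004

Committee Chair 유 회 준 (seal)

Committee Member 나 종 범 (seal)

Committee Member 김 이 섭 (seal)

Committee Member 박 인 철 (seal)

Committee Member 원 광 연 (seal)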

사랑하는 부모님께 바칩니다

Dedicated To My Beloved Parents

DEE

20015171

우 람 찬, Woo, Ramchan. Design and Implementation of Low-Power 3D Graphics SoC for Mobile Multimedia Applications. 휴대용 멀티미디어 기기를 위한 저전력 3 차원 그래픽 SoC 의 설계 및 구현. Department of Electrical Engineering and Computer Science, Division of Electrical Engineering. 2004. 116p. Advisor Professor Yoo, Hoi-Jun. Text in English

Abstract

A low-power graphics SoC implementing a full 3D pipeline with texture mapping and special rendering effects is designed for mobile multimedia applications such as PDAs and cell-phones. The chip contains a RISC processor with MAC as a geometry engine, a 3D rendering engine, a programmable power optimizer, and 29Mb of embedded DRAM. Low power consumption is achieved by applying various techniques to the instruction set architecture, pipeline structure, shading and texturing datapath, memory architecture, clock control, and embedded DRAM. Programmable clocking allows the chip to operate in lower-power modes for various applications. The chip consumes less than 210mW while delivering 1Mvertices/s, 66Mpixels/s, and 264Mtexels/s texture-mapped pixels with real-time special effects. The 121mm2 chip is fabricated in a 0.16um 256Mb-compatible DRAM process to reduce the fabrication cost. The graphics SoC is successfully demonstrated on two system evaluation boards running real-time applications ported with the custom-designed MobileGL.

Table of Contents

1. Introduction
    1.1 Mobile 3D Graphics
    1.2 Limitations
    1.3 Design Philosophy
    1.4 Previous Work
        1.4.1 RAMP-I by KAIST
        1.4.2 RAMP-II by KAIST
    1.5 Recent Work
        1.5.1 Z-3D by Mitsubishi
        1.5.2 MBX by ARM
        1.5.3 Others
    1.6 Architecture Summary of Mobile-3D Hardware
    1.7 Contribution of This Research
        1.7.1 Design and Implementation of 3D Graphics SoC for Mobile Multimedia Applications
        1.7.2 From Application to Demonstration

2. System Architecture
    2.1 Target Specification
    2.2 Simulation Environment
    2.3 SoC Architecture
        2.3.1 Geometry Engine with Intelligent Buffer
        2.3.2 Rendering Engine
        2.3.3 Graphics Memories
        2.3.4 Power Management Unit

3. Low-Power Rendering Engine
    3.1 3D Rendering Engine
    3.2 SlimShader : Main Rendering Pipeline
        3.2.1 Instruction Set Architecture
        3.2.2 Low-Power Pipeline Structure
        3.2.3 Triangle Setup Engine
    3.3 Energy-Efficient Texturing Unit
        3.3.1 Consideration of Energy Efficiency
        3.3.2 Approximation of Perspective Division
        3.3.3 Address Alignment Logic
    3.4 Memory Programmer : Post Processing Unit
    3.5 Memory Access

4. Chip Implementation
    4.1 Process Technology
    4.2 Chip Fabrication
    4.3 Power Consumption
    4.4 Performance
        4.4.1 Performance Summary
        4.4.2 Performance Comparison
        4.4.3 Performance of SlimShader with External SDRAM
    4.5 Appendix : Design Information
        4.5.1 Area Information
        4.5.2 Cell Utilization

5. System Evaluation
    5.1 Target Configurations
    5.2 REMY : System Evaluation Board
        5.2.1 System Architecture
        5.2.2 REMY-I : First Evaluation Board
        5.2.3 REMY-II : PDA Prototype
    5.3 Graphics Library : MobileGL
    5.4 Demonstration

6. Conclusions and Further Work
    6.1 Conclusions
    6.2 Further Work

7. Summary

8. Bibliography

Chapter 1 Introduction

1.1 Mobile 3D Graphics

As the mobile electronics market grows rapidly, 3G multimedia terminals such as PDAs and smart cell-phones are gaining popularity. PDA applications are already migrating from text-based PIM (Personal Information Management) to real-time multimedia such as MP3 audio, MPEG-4 video [1-2], and even 3D computer graphics [3-4]. Likewise, today’s cell-phones are no longer designed only for voice communication; they are evolving into Mobile Multimedia Centers. Taking pictures with a built-in camera, watching 2D graphics animations and MPEG-4 videos, listening to MP3 audio, and even enjoying Java games are no longer stories of the future. They already happen in everyday life. Therefore, looking back on the evolution of the PC, it is natural to expect that 3D computer graphics will be the next step. Real-time 3D applications are especially attractive for games, advertisements, and avatars whose data can be downloaded over the wireless network while occupying only a limited bandwidth. Since complex 3D scenes can be represented simply by a list of vertices, texture images, and corresponding camera movements, which are naturally compressed, 3D graphics is well suited to bandwidth-critical wireless applications [5]. To satisfy these market demands, much research on realizing 3D graphics for handheld devices has recently been attempted, including the design of hardware accelerators for mobile platforms [6-8] as well as the definition of software libraries [3-4]. However, these hardware accelerators fall far below the market requirements: they offer only limited shading operations, without the texture mapping and special rendering effects that are mandatory for 3D game applications.

1.2 Limitations

Since real-time 3D computer graphics requires huge computing power and corresponding memory bandwidth, it has been a critical issue even on PC and console platforms over the past ten years [9-11, 46]. Although today’s PC graphics accelerators can draw high-quality 3D images with a high-performance GPU (Graphics Processing Unit), handheld devices cannot tolerate those tens-of-watts power monsters. The problem is more challenging on the mobile platform because the power consumption and physical dimensions are much more stringently limited. 1) The most critical factor is the limited energy supplied by the battery. Based on the system power budget, which includes the host processor, system memory, input interfaces, and LCD display, the power allocated to the 3D graphics system is confined to less than 300mW ~ 400mW for 2~3 hours of continuous playback [12]. 2) The limited computing power of a mobile system, which has a host processor without an FPU and a 400MB/s memory system, makes it difficult to draw 3D applications with current software alone. 3) Since users hold wireless terminals in their hands and watch pixels on small display devices, the average eye-to-pixel angle is wider than that of a PC graphics system; that is, each pixel should be drawn with higher fidelity even though the screen is far smaller than that of a PC. Although recent attempts to optimize 3D graphics software on handheld devices achieve significant improvement with an integer-only datapath, their performance and quality are still below the market requirements [4]. 4) There is hardly any extra space for the graphics accelerator and corresponding graphics memories since the PCB footprint is limited. 5) The low-cost aspect cannot be ignored because the target systems will be carried in everybody’s hand. 6) Besides, standard graphics APIs, which define the reference platforms, are not yet defined for mobile applications. That is, we also need to define the hardware guidelines, including the supported functions, datapath precision, and other necessities for 3D graphics, based on PC APIs such as OpenGL [14] or DirectX [15].

1.3 Design Philosophy

The real-time 3D graphics pipeline is composed of computation-intensive geometry operations, which calculate the positions of the vertices of triangles, and memory-access-intensive rendering operations, which fill colors inside the triangles [35, 50]. Although the bottleneck in the geometry stage can be relieved by using a fast and parallel datapath, the rendering performance cannot be easily improved since up to tens of bytes must be accessed for every pixel. As the energy consumption is proportional to the number of memory accesses, recent research mainly focuses on reducing off-chip bandwidth to extend the battery lifetime for mobile 3D applications. The MBX architecture reduces memory accesses with tile-based rendering, but its performance is still limited by the system bus and the tiling overhead itself [24]. Moller’s POOMA texturing system proposes several schemes for reducing texture requests, but measurement results from a hardware implementation have not been reported yet [26]. Since 3D rendering requires various buffers to store frame, depth, and texture images, merging their requests and accessing them through a limited number of off-chip ports can make the interface circuitry more complex. Solving the bandwidth bottleneck with traditional approaches such as prefetching, caching, and scheduling can be another energy burden.

If we move our viewpoint from off-chip bandwidth reduction to the integration of the memory itself for efficient 3D rendering, more effective architectures or implementation schemes can emerge in terms of performance, area, and cost as well as power consumption. It is clear that on-chip memory can provide more bandwidth while eliminating power-consuming off-chip accesses [51]. If various buffers are integrated, each of them can be separately and selectively activated to reduce its power consumption further. For wireless applications, whose screen resolution will be limited to less than QVGA for some time, the required capacity of on-chip memory is affordable, ranging from several megabits to tens of megabits. Z-3D, a 3D rendering core designed by Mitsubishi [23], contains about 1Mbit of SRAM assigned to rendering. Its 53mW power consumption allows it to be implemented inside cell-phones. However, the use of SRAM still limits the storage capacity and thus, in turn, the performance and functionality. Therefore, new architectures with embedded DRAM must be explored to accelerate realistic 3D graphics drawing for wireless applications.

To realize low-power 3D rendering at high performance, I proposed an application-specific embedded DRAM architecture, the RAMP-IV architecture. Instead of merely integrating a global DRAM and connecting it through a huge number of wires and a corresponding crossbar switch, I determined the memory configuration after analyzing the bandwidth requirements and access patterns of the application. The various buffers and pixel-parallel characteristics of the 3D rendering operation allow me to distribute the memory accesses, which not only provides sufficient bandwidth but also reduces the power consumption by activating only one or some of the memories locally. After that, I specify the design of each embedded DRAM according to its location and access pattern. Therefore, the latency, throughput, number of buses, and commands of the DRAM are not assumed to be given parameters; they are all treated as application-specific variables. Then, I tune the logic pipeline to take full advantage of the modified timing and functions of the DRAM. Finally, I applied various low-power techniques inside the memory and the logic themselves. This design methodology is backed up by the prediction of the ITRS roadmap [52], which emphasizes that the size of memory keeps increasing as the scaling of the silicon process advances and that more than half of the chip area is already occupied by on-chip memory. That is, memory can no longer be treated as a passive device, nor called a sub-system.

(Block diagram: a main board with CPU, north bridge chipset, and DDR-SDRAM system memory, connected through AGP to a graphics card whose GPU contains pixel engines with T$ and F$ SRAM caches, prefetching and request merging, and an external memory interface with a high-speed off-chip crossbar switch to dedicated DDR-SDRAMs.)

[Fig. 1.3-1 : Example of PC Graphics Architecture]

Fig. 1.3-1 shows a typical example of today’s PC graphics architecture, in which the GPU has evolved to attain huge memory bandwidth [44, 53]. The data stored in the system memory are not frequently accessed by the GPU since the system memory cannot supply enough bandwidth for both the CPU and the GPU. In the GPU architecture, several pixel engines work in parallel to boost the performance, fetching data from dedicated T$ (texture cache) and F$ (frame cache) memories. Then, the External Memory Interface (EMI) merges the various transactions from the cache memories and transfers them to off-chip DDR-SDRAMs dedicated to graphics processing. The memories are connected to the EMI through a high-speed crossbar switch, and their data are accessed by burst-mode operations to fully utilize their bandwidth. However, this makes the architecture power hungry. Since the required data in the SRAM cache is transferred together with adjacent data which may not be used at all, integrating cache memories can waste power. Also, merging many transactions from different cache memories can make the EMI circuitry more complex and more power consuming. Moreover, prefetching data from DDR-SDRAM implies that unwanted data may be accessed together through the high-speed signal interface, wasting power.

Therefore, I proposed the RAMP-IV architecture, in which the pixel engine is directly connected to the local DRAM instead of using the complex cache and the memory interface. As shown in fig. 1.3-2, which describes a typical example of cell-phone architecture, the baseband modem and the CPU access the data stored in the SDRAM through the limited bandwidth of the power-consuming system bus. Therefore, the graphics data should be stored inside the local DRAM and simply accessed through local interconnection.

(Block diagram: the baseband modem for communication and the CPU for applications share SRAM, Flash, and SDRAM system memories over the system bus; RAMP-IV attaches pixel engines directly to integrated DRAMs through a DRAM-optimized logic pipeline, with no cache systems and no bus transactions for 3D graphics rendering.)

[Fig. 1.3-2 : Example of Cell-Phone Architecture]


1.4 Previous Work

1.4.1 RAMP-I by KAIST

RAMP-I is a single-chip rendering engine for low-power 3D graphics which consists of 64 DRAM frame buffers, 64 pixel processors (PP), 8 edge processors (EP), and a 32bit RISC core, as shown in fig. 1.4-1 [6]. The PPs are distributed over the corresponding DRAMs and work in parallel to fill the pixels inside the polygon. Also, each PP and DRAM can be selectively activated according to the shape of the polygon to save overall power consumption. Although the architectural performance of RAMP-I reaches a rendering speed of 11.1Mpolygons/s, it performs only simple shading, alpha blending, and depth comparison for 8x8 pre-clipped polygons. Also, it contains too many (64) pixel processors, some of which can hardly be utilized. In this architecture, the 64 DRAMs are independently controlled by their own controllers. Each DRAM covers only a small portion of the screen area since the small screen resolution of the target PDA is distributed among them. Therefore, this architecture cannot be easily implemented even in a 0.18um CMOS process because the total area including the memories is too large. In fact, the fabricated chip of RAMP-I contains only 1/8 of the full architecture in 0.35um technology. Even if RAMP-I were designed in 0.18um CMOS, it would take about 100mm2, as shown in the following estimation:

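$$
\begin{aligned}
8 \times \mathrm{EP} + 64 \times \mathrm{PP} + 64 \times \mathrm{FB}
  &= (8 \times 5.3 + 64 \times 0.9 + 64 \times 3.7) + \text{Routing Area} \\
  &= 400\,\mathrm{mm}^2 \quad (\text{with } 0.35\,\mu\mathrm{m}) \\
  &= 100\,\mathrm{mm}^2 \quad (\text{with } 0.18\,\mu\mathrm{m})
\end{aligned}
$$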
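As a rough check of the last step of this estimation, the sketch below scales the 400mm2 area at 0.35um to 0.18um. It assumes ideal scaling, where area shrinks with the square of the feature size; that assumption and the helper name scale_area are illustrative and are not taken from the RAMP-I design data.

```python
def scale_area(area_mm2: float, from_um: float, to_um: float) -> float:
    """Scale a layout area, assuming it shrinks with the square of the feature size."""
    return area_mm2 * (to_um / from_um) ** 2


# Full RAMP-I architecture at 0.35 um, as quoted in the estimation (mm^2).
full_area_035 = 400.0

# Hypothetical port of the same layout to 0.18 um under ideal scaling.
full_area_018 = scale_area(full_area_035, from_um=0.35, to_um=0.18)

print(f"{full_area_018:.0f} mm^2 at 0.18 um")
```

Under this assumption the result lands slightly above 100mm2, in line with the rounded figure quoted in the text.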

Also, its distributed architecture makes it difficult to implement general 3D graphics functionalities such as texture mapping or special rendering effects.


(Block diagram and photograph of the fabricated test chip: eight groups of 8 PPs + 8 DRAMs, each driven by one edge processor, EP0 to EP7, through a queue; inside a group, each pixel processor, PP0 to PP7, is paired with a 64kb DRAM and a SAM.)

[Fig. 1.4-1 : RAMP-I]

1.4.2 RAMP-II by KAIST

RAMP-II is a low-power 3D rendering engine which is implemented as part of a mobile PDA chip [7, 16]. 6Mb embedded DRAM macros attached to 8-pixel-parallel rendering logic are logically localized with a 3.2GByte/s run-time reconfigurable bus as shown in fig. 1.4-2, reducing the area by 25% compared with a conventional local frame-buffer architecture such as RAMP-I. It is the world’s first 3D core integrated into a PDA chip, consuming 120mW and occupying 24mm2 in a 0.18um CMOS process. Although its maximum drawing rate reaches up to 70Mpixels/s, the low utilization of the 8 pixel processors and the unbalanced load between PPs cut the sustained fill rate down to less than 20Mpixels/s. Moreover, the run-time reconfigurable bus takes about 80% of the power consumption in the rendering logic because the embedded DRAM has too many data bits (2048-bit) and their routes are changed at 20MHz. The supported 3D functions are exactly the same as those of RAMP-I: simple shading, alpha blending, and depth comparison for 8x8 pre-clipped triangles, without texture mapping or programmability for special rendering effects.

(Block diagram: rendering logic with fetch and control, 1 edge processor, and 8 pixel processors performing 8-pixel-parallel rendering at every clock cycle, connected through the run-time reconfigurable bus to a 6Mb eDRAM frame buffer (Z-buffer + double color-buffer) built from 12 independent 512kb macros, with a SAM (1.5Kb SRAM) for RGB output.)

[Fig. 1.4-2 : RAMP-II]

1.5 Recent Work

Much research on mobile 3D graphics acceleration has been reported since this work was first presented [17-22], and this section summarizes the architectures and features.


1.5.1 Z-3D by Mitsubishi

(Block diagram: CPU, DMAC, and host interface feeding a geometry engine (two FPUs and one INT unit) with a 120kB SRAM display list buffer, a rendering engine (setup, raster, and texture pipeline plus a 2D engine) with 53kB SRAM texture memory, and a pixel engine with 93kB SRAM frame buffer and Z buffer memory, driving a 176 x 132 LCD through the LCD interface.)

[Fig. 1.5-1 : Z-3D]

Z-3D is the world’s first commercial implementation of hardware-accelerated 3D graphics on cell-phones. It is targeted at 3D games, walk-throughs, and advertisements. Designed by Mitsubishi, the Z-3D was commercialized by NTT DoCoMo and built into the D504i and D505i phones. As shown in fig. 1.5-1, Z-3D is composed of a geometry engine, a rendering engine, a pixel engine, and on-chip SRAM [23]. The geometry engine reads vertex data from the 120kB display list buffer and performs coordinate transformation, lighting calculation, and clipping with one 24bit integer processing unit and two 24bit floating-point processing units in its datapath. After the rendering engine fills triangles, performing smooth shading and texture mapping with the 53kB on-chip texture memory, the pixel engine performs hidden surface removal and opacity display (alpha blending) with the 93kB on-chip frame and Z buffers at the end of the 3D pipeline. In this architecture, the small capacity of the on-chip memories limits the contents, textures, and screen resolution. Also, its performance, 185Kvertices/s transformation and a 5.1Mpixels/s fill rate at 30MHz, is still below the requirements of real-time 3D gaming applications, although Z-3D consumes a relatively small amount of power, 38mW.

1.5.2 MBX by ARM

MBX is a 2D/3D graphics core co-developed by Imagination Technologies and ARM to accelerate 3D graphics on ARM-based mobile platforms [24]. As shown in fig. 1.5-2, it contains a tile accelerator, an HSR (Hidden Surface Removal) engine, a texture shading unit, a pixel blender, and a 512Byte texture cache. Containing only a minimal set of rendering memories, it shares the system memory with the main processor to store frame, depth, and texture data as well as the display list. Unlike the conventional graphics pipeline [11], MBX performs tile-based rendering to save memory bandwidth. This deferred-rendering technique may reduce the bandwidth needed to access frame and texture data; however, it needs extra time and bandwidth to set up parameters for the tiling itself. Besides, the overall performance is severely degraded in the system since the limited bandwidth of the 32bit 100MHz AMBA AHB is also shared with the CPU core. Assuming that the 400MByte/s bus can be utilized at 50% and half of the acquired bandwidth is shared with the CPU, the bandwidth assigned to the MBX is only about 100MByte/s. Therefore, the sustained pixel fill rate can be only 9Mpixels/s at 100MHz, which is less than 10% of the maximum rendering performance, when drawing 16bpp pixels with 16bpp textures.

$$
\text{Pixel Fill Rate} = 30\,\mathrm{Hz}\ (\text{Screen Refresh Rate}) \times 20\mathrm{k}\ (\text{Number of Vertices}) \times 16\ (\text{Average Number of Pixels/Polygon}) \approx 9\,\mathrm{Mpixels/s}
$$
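The bandwidth and fill-rate figures for MBX can be retraced with a few lines of arithmetic. The following sketch uses only the numbers quoted in the text (a 32bit 100MHz bus, 50% utilization, half of the bandwidth shared with the CPU, 30Hz, 20k, and 16 pixels per polygon); the per-pixel byte budget at the end is a derived quantity for illustration, not a figure reported for MBX.

```python
def bus_peak_bandwidth(width_bits: int, clock_hz: float) -> float:
    """Peak bus bandwidth in bytes per second."""
    return width_bits / 8 * clock_hz


peak = bus_peak_bandwidth(32, 100e6)   # 32bit AHB at 100 MHz -> 400 MByte/s
usable = peak * 0.5                    # 50% bus utilization
mbx_share = usable * 0.5               # half of that is left after the CPU

# Fill rate from the equation: refresh rate x vertices x pixels per polygon.
fill_rate = 30 * 20_000 * 16           # 9.6M pixels/s, quoted as about 9Mpixels/s

# Derived for illustration: bytes of bus traffic available per drawn pixel.
bytes_per_pixel = mbx_share / fill_rate

print(f"peak {peak / 1e6:.0f} MB/s, MBX share {mbx_share / 1e6:.0f} MB/s")
print(f"fill rate {fill_rate / 1e6:.1f} Mpixels/s, "
      f"{bytes_per_pixel:.1f} bytes available per pixel")
```

A budget of roughly ten bytes per pixel is quickly used up by color, depth, and texture traffic, which is consistent with the limit described for 16bpp pixels with 16bpp textures.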

Also, since MBX is directly ported from the PC graphics architecture, Kyro [25], with little modification, it cannot be called an architecture optimized for low-power platforms. It supports too many functions such as anisotropic texture filtering and vertex programming, some of which are useless for now and can be power and area overheads.
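To illustrate where a tiling overhead of this kind comes from, the sketch below bins triangles into screen tiles by bounding box, which is the sort of per-tile list a tile-based renderer has to build before any pixel is shaded. It is a generic illustration, not the MBX implementation: the 16x16 tile size, the function name bin_triangles, the integer pixel coordinates, and the bounding-box test are all assumptions.

```python
from collections import defaultdict

TILE = 16  # tile edge in pixels (assumed)


def bin_triangles(triangles, width, height, tile=TILE):
    """Return {(tx, ty): [triangle indices]} using bounding-box overlap.

    Each triangle is a list of three (x, y) integer pixel coordinates.
    """
    bins = defaultdict(list)
    max_tx = (width - 1) // tile
    max_ty = (height - 1) // tile
    for idx, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Triangles completely off screen touch no tile.
        if max(xs) < 0 or max(ys) < 0 or min(xs) >= width or min(ys) >= height:
            continue
        tx0, tx1 = max(0, min(xs) // tile), min(max_tx, max(xs) // tile)
        ty0, ty1 = max(0, min(ys) // tile), min(max_ty, max(ys) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                bins[(tx, ty)].append(idx)
    return dict(bins)


triangles = [
    [(2, 2), (10, 3), (5, 12)],       # fits inside a single tile
    [(10, 10), (40, 12), (20, 30)],   # spans several tiles
]
bins = bin_triangles(triangles, width=64, height=48)
entries = sum(len(v) for v in bins.values())
print(f"{len(triangles)} triangles -> {entries} tile-list entries")
```

Every triangle that crosses a tile boundary is recorded once per tile it touches, so the tile lists grow faster than the triangle count; this extra setup data is the cost paid for keeping frame and depth traffic on-chip.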

(Block diagram of the MBX HR-S core: VGP with vertex feed and four FPUs for T&L, clipping, and viewport and screen transform; tile accelerator with input stream parser, region generator, culling, and pointer cache; HSR engine with PE0 to PE15 and a 16 x 16 tiled Z-buffer; texture shading unit iterating U, V, 1/W, R, G, B, and A; pixel blender with blending unit and accumulation buffer; 512Byte texture cache with texture address generators; arbiter, event manager, register block, and SoC interface, connected through the AHB and MBX memory interfaces to the CPU and external SDRAM.)

[Fig. 1.5-2 : MBX]


1.5.3 Others

Recently, Akenine-Moller presented a hardware rasterization architecture for mobile phones focusing mainly on the low-cost aspect [26]. The proposed architecture concentrates on rasterizing textured triangles while saving bandwidth to the external memory. Evaluated only with software simulations, this work proposes an inexpensive multisampling antialiasing scheme, a new filtering method with texture minification and compression, and a scan-line-based culling scheme that avoids a significant amount of z-buffer accesses.

SONY announced a 3D graphics engine for products ranging from mobile to stationary [27]. It contains a floating-point geometry engine and a rasterization engine with color, z, and texture caches, as illustrated in fig. 1.5-3. The memory interface merges the requests from the three caches, so the core can easily interface with the system memory. The chip can draw pixels at a rate of 600Mtexels/s while consuming 109.5mW.

(Block diagram: the 3DCG IP contains a geometry engine with FMAC and FDIV units and a rendering engine (triangle setup, pixel generation, texture blending, alpha testing, depth testing, alpha blending, and pixel operation) with texture, color, and Z caches; a memory interface and a bus bridge connect it over 128bit and 32bit paths to the CPU bus and the system memory.)

[Fig. 1.5-3 : 3D Graphics Engine by SONY]


Also, a 3G baseband processor with 3D capability [28] and an application processor [29], which includes a geometry FPU [30] and a 3D rendering engine, are attempting to realize 3D graphics on 3G multimedia terminals. Concurrently, standard graphics APIs such as OpenGL-ES [31] and JSR-184 [32] are being defined.

1.6 Architecture Summary of Mobile-3D Hardware

Because supplying sufficient bandwidth to the rendering engine determines the overall graphics performance, the mobile 3D hardware designs listed in the previous sections can be categorized by their memory access scheme: bus-attached system memory and local graphics memory [45].

Fig. 1.6-1 shows an example of bus-attached system memory. The 3DCG-IP (3D Computer Graphics IP), integrated into the application processor with the host CPU, draws pixels by accessing the system memory, where frame, depth, and texture data are stored, through the system bus. The processed pixels stored in the system memory are also transferred through the system bus to the LCD display. In this architecture, 3D graphics can be easily accelerated with the least area overhead since the memory is shared with the host CPU. Therefore, this architecture is adopted by hardware vendors such as SONY [27], ARM [24], and Qualcomm [30]. However, the slow and narrow system bus, which delivers only about 400MByte/s at 32bit 100MHz, is far below the bandwidth requirement of real-time 3D graphics applications: several GByte/s of sustained bandwidth for 100Mpixels/s with bilinear texturing. Moreover, it is shared with other IPs such as the host CPU and the LCD interface. Therefore, the rendering performance is severely limited. Also, even as the process technology shrinks, there is not much room to increase the bus frequency and data width since they, in turn, increase the power consumption. Cache memories can alleviate the bandwidth bottleneck; however, they also add area and power overhead.
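A back-of-the-envelope estimate shows why the 400MByte/s bus falls short of this requirement. The sketch assumes that each pixel reads four texels for bilinear filtering and reads and writes both depth and color once; this access mix and the function name rendering_bandwidth are assumptions for illustration, not measured figures from this work.

```python
def rendering_bandwidth(pixels_per_s, bytes_per_texel, bytes_per_color,
                        bytes_per_depth, texels_per_pixel=4):
    """Sustained memory traffic in bytes/s under an assumed per-pixel access mix."""
    per_pixel = (texels_per_pixel * bytes_per_texel  # bilinear filter reads 4 texels
                 + 2 * bytes_per_depth               # depth read + depth write
                 + 2 * bytes_per_color)              # color read + color write
    return pixels_per_s * per_pixel


BUS_PEAK = 400e6  # 32bit 100MHz system bus, bytes/s

bw_16bpp = rendering_bandwidth(100e6, 2, 2, 2)  # 16bit texels, color, and depth
bw_32bpp = rendering_bandwidth(100e6, 4, 4, 4)  # 32bit texels, color, and depth

for name, bw in (("16bpp", bw_16bpp), ("32bpp", bw_32bpp)):
    print(f"{name}: {bw / 1e9:.1f} GB/s, {bw / BUS_PEAK:.0f}x the bus peak")
```

Even before the bus is shared with the host CPU and the LCD interface, the assumed traffic exceeds its peak bandwidth several times over.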

Fig. 1.6-2 shows an example of local graphics memory. The graphics IP is integrated into the application processor together with a corresponding graphics memory, which supplies the required rendering data through the wide local bus between the 3DCG-IP and the graphics memory. Then, the rendered pixels are directly transferred to the LCD display. Low power consumption can be achieved in this architecture since the necessary rendering data is acquired by accessing only the short-distance local memory bus, not the capacitive system bus. Although this architecture requires additional area for the graphics memory, this will be resolved as the process shrinks. Also, more bandwidth is left for other IPs in the application processor, such as the host CPU and possibly MPEG-4, since the system bus is almost free from the rendering operation. Solving the area overhead will be easier than solving the bus-bandwidth bottleneck in the near future because the process technology is heading past 90nm toward several nanometers. Therefore, this work (RAMP-IV) and Mitsubishi [23] adopt this architecture, showing greater rendering performance.

(Block diagram: the baseband processor on the communication side (RX, TX, Flash/SRAM) and the application side (host CPU, 3DCG-IP, LCD display, Flash/SRAM, peripheries) share the system memory over the system bus.)

[Fig. 1.6-1 : Bus-attached System Memory]

(Block diagram: the same system, but the 3DCG-IP has its own graphics memory and drives the LCD display directly.)

[Fig. 1.6-2 : Local Graphics Memory]

1.7 Contribution of This Research

1.7.1 Design and Implementation of 3D Graphics SoC for Mobile Multimedia Applications

This work is the world’s first publication on a 3D graphics SoC implementing a full 3D pipeline with texturing and special effects for PDAs or 3G cell-phones.

1) The chip is highly integrated, containing a geometry engine, a rendering engine, 29Mb embedded DRAM, and a power management unit. The proposed rendering engine shows the highest performance ever announced in the world, with the help of the energy-efficient texturing architecture and local graphics DRAMs.

2) Low power consumption is achieved by applying various techniques to the instruction set architecture, pipeline structure, shading and texturing datapath, memory architecture, and clock control.

3) It is also the world’s first implementation of a mobile graphics processor in pure DRAM technology to reduce the fabrication cost. The chip is fabricated in a 256Mb-compatible DRAM process. The DRAM process also suppresses leakage current, which is as important as run-time current in mobile devices.

1.7.2 From Application to Demonstration

In addition to the chip implementation, a complete flow from application analysis to system demonstration is organized for the SoC design.

1) Prior to the chip implementation, I analyzed real-time applications to propose and optimize the new architecture using the simulation environment, 3D-Glamor. Since there had been no publication on a full 3D graphics pipeline for mobile devices, I also defined the required functions and precisions based on 3D-Glamor.

2) A mobile graphics library, MobileGL, is designed to port applications to the proposed system. MobileGL is the world’s first attempt at a reduced OpenGL for mobile platforms.

3) Since the chip is implemented in a DRAM process, where only the full-custom method had been tried, I applied the ASIC design flow to the DRAM process, designing the standard cells and porting them to various levels of CAD tools.

4) Also, I developed two evaluation systems for real-time demonstration. 3D graphics images are successfully displayed with the fabricated chip. It is the world’s first demonstration of a mobile 3D graphics SoC with real silicon.

Chapter 2 System Architecture

2.1 Target Specification

Fig. 2.1-1 illustrates a full 3D pipeline, which includes a geometry engine, a vertex buffer, a rendering engine, and corresponding rendering memories. For real-time 3D graphics on handheld devices, the geometry engine needs fast calculation of more than 0.5Mvertices/s and programmability for transformation and lighting (T&L). The vertex buffer is necessary for efficient data transfer. The rendering engine requires parallel calculation of more than 10Mpixels/s and a huge memory bandwidth of more than 1GByte/s for shading, depth comparison, and texturing. Also, a large rendering memory of more than 10Mbits, with a high bandwidth reaching several GByte/s, must be provided to store frame, depth, and various texture images. In this work, I implemented all these features in a single chip, and this chapter explains the architectural details.
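The capacity requirement can be sanity-checked with a small calculation. The sketch assumes a QVGA (320x240) screen with a 16bpp double color buffer and a 16bit depth buffer, plus eight 256x256 textures at 16bpp; these parameters and the function names are hypothetical examples chosen for illustration, not the configuration of the chip.

```python
MBIT = 1 << 20  # bits per megabit (binary convention assumed)


def frame_memory_bits(width, height, color_bits, depth_bits, color_buffers=2):
    """Bits needed for the color buffers plus one depth buffer."""
    return width * height * (color_buffers * color_bits + depth_bits)


def texture_bits(size, texel_bits, count):
    """Bits needed for `count` square textures of `size` x `size` texels."""
    return size * size * texel_bits * count


frame = frame_memory_bits(320, 240, color_bits=16, depth_bits=16)
textures = texture_bits(256, texel_bits=16, count=8)
total = frame + textures

print(f"frame/depth: {frame / MBIT:.1f} Mb")
print(f"textures:    {textures / MBIT:.1f} Mb")
print(f"total:       {total / MBIT:.1f} Mb")
```

With these assumptions the frame and depth buffers alone take several megabits and a handful of textures pushes the total past 10Mbits, which matches the order of magnitude stated for the rendering memory.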

2.2 Simulation Environment In order to find out an optimum pipeline architecture, memory size, and

bandwidth, I developed a 3D graphics simulator - 3D-Glamor (3D Graphics


Library and Memory Simulator). The simulation architecture of 3D-Glamor is

illustrated in fig. 2.2-1. Real-time 3D graphics applications running on OpenGL

are converted to vertex lists, material properties, camera movements, and texture

images. Then, the geometry and rendering codes are executed on MobileGL, a

custom-designed graphics library. Since the conventional 3D graphics libraries

for the PC platforms are optimized only to the power-consuming floating-point

datapath, they are not suitable for the low-power RISC geometry engine with

integer-only datapath. Therefore, I designed a 3D graphics library with 32bit fixed-

point arithmetic to make optimal use of the ARM9 datapath while maintaining compatibility

with the de-facto standard OpenGL. Various rendering algorithms and architectures

are simulated by various levels of rendering models as follows:

Reference Renderer : Functional C-model

Cycle-Accurate Renderer : Cycle-Accurate C-model

Verilog PLI : Verilog-model, but datapath is described by C/C++

Verilog RTL : Verilog RTL (Register-Transfer-Level) model

Verilog GATE : P&R-Ready Verilog-model after synthesis
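The fixed-point approach behind MobileGL can be illustrated with a small sketch of a 4x4 matrix-vector transformation on an integer-only datapath. This is not the MobileGL source: the Q16.16 split of the 32bit word and the names `to_fixed`, `fx_mul`, and `transform` are assumptions made only for illustration, since the text states no more than "32bit fixed-point arithmetic".

```python
FRAC_BITS = 16           # assumed Q16.16 split of the 32bit word (not stated in the thesis)
ONE = 1 << FRAC_BITS     # fixed-point representation of 1.0


def to_fixed(value: float) -> int:
    """Convert a real number to the assumed Q16.16 representation."""
    return int(round(value * ONE))


def to_float(fixed: int) -> float:
    """Convert a Q16.16 number back to a real number."""
    return fixed / ONE


def fx_mul(a: int, b: int) -> int:
    """Integer multiply followed by a shift, the way a MAC datapath would do it."""
    return (a * b) >> FRAC_BITS


def transform(matrix, vertex):
    """Multiply a 4x4 fixed-point matrix by a 4-element fixed-point vertex."""
    result = []
    for row in matrix:
        acc = 0
        for m, v in zip(row, vertex):
            acc += fx_mul(m, v)   # one multiply-accumulate per element
        result.append(acc)
    return result


# Usage: translate the vertex (0.5, -0.25, 4.0) by (1, 2, 3).
translation = [[to_fixed(c) for c in row] for row in (
    (1, 0, 0, 1),
    (0, 1, 0, 2),
    (0, 0, 1, 3),
    (0, 0, 0, 1),
)]
vertex = [to_fixed(c) for c in (0.5, -0.25, 4.0, 1.0)]
moved = [to_float(c) for c in transform(translation, vertex)]
```

Each output coordinate costs four multiply-accumulate steps, which is the kind of operation that a single-cycle MAC in the geometry engine speeds up.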

I gathered the necessary information such as the optimum precision of each

datapath, memory bandwidth and utilization, and pipeline efficiency by running

real-time applications. In order to simulate real-life workloads, four

distinct test vectors are selected and classified by the number of pixels per



polygon, texture size, and existence of texturing. Since the larger texture image

shows poorer texturing performance in general [33], 128x128 or 256x256 sized

texture images, which are relatively large for small screen resolution of cell-

phones, are used to get the worst-case results. Also, all vectors are rotated in

all directions and zoomed in or out to average over the direction, shape, and size of

the triangles which affect many results like memory access pattern, bus

utilization, and pixel processor load-balance. The characteristics of test vectors

are summarized in table 2.2-1.

[Fig. 2.1-1 : Integration of Full 3D Pipeline]



[Fig. 2.2-1 : 3D-Glamor Architecture]

| Test Vector | Texture Size | Triangle Count | Average Pixel Count / Triangle | Number of Animated Frames | Comments |
|---|---|---|---|---|---|
| A | 128 x 128 | 6,833 | 11.2 | 104 | Textured by 128x128 image, small polygons |
| B | 256 x 256 | 6,833 | 11.2 | 104 | Textured by 256x256 image, small polygons |
| C | 256 x 256 | 2 | 15,300 | 30 | Textured by 256x256 image, large polygons |
| D | No texture | 5,878 | 16.5 | 105 | Non-textured, small polygons |

[Table 2.2-1 : Characteristics of Test Vectors]



2.3 SoC Architecture

Based on the simulation results of 3D-Glamor, I propose the architecture of the

graphics SoC as shown in fig. 2.3-1. It consists of a 32bit RISC processor that is

assigned to the geometry engine, a bandwidth equalizer (BEQ) for vertex buffer,

a 3D rendering engine (3DRE), 29Mb of embedded DRAM, and a programmable

power optimizer (PPO). Dedicated hardware engines and a 1.6GByte/s bandwidth

through the 416bit-wide DRAM interface can lower the operating frequency of the 3DRE and

DRAM to as low as 33MHz, while the RISC operates at 132MHz. The programmable

power optimizer manages the power consumption of the chip by controlling four

clock domains: the software gates the clocks and changes their frequencies at run time.

Each of these IP blocks is discussed in detail in the following

sections.

[Fig. 2.3-1 : Block Diagram of Graphics SoC]



2.3.1 Geometry Engine with Intelligent Buffer

The RISC processor with 4KB I/D caches is compatible with ARM-9

architecture and operates at 132MHz [34]. It has a single-cycle 32bit x 32bit

Multiply-Accumulate Unit (MAC) in its datapath to accelerate the 3D geometry

operations. It can calculate as many as 1.04Mvertices/s model-view

transformations when running a customized fixed-point graphics library, which

shows 43% improvement over the conventional ARM9 processor [35]. When the

geometry engine calculates the model-view transformation, perspective

projection and 6-side clipping together, 300Kvertices/s is obtained. If the lighting

(single directional light source of infinite viewer and one-sided lighting model

with ambient, diffuse and specular highlighting) is appended, the rendering

performance is 70Kvertices/s. The MAC also accelerates the processing of

MPEG-4 SP@L1 video streams. It reduces the cycle count by more than 30% when

executing the IDCT routines, which are basically the same operation as the

geometry vector calculation. The memory interface is optimized for real-

time multimedia applications so that the RISC can directly supply the 3D data to

the rendering engine through the bandwidth equalizer (BEQ), bypassing the data cache.

To compensate for the difference in processing speed and data width between

the RISC and the 3D rendering engine, the BEQ buffers the vertex data with

1KByte Dual-Ported SRAM (DP-SRAM). The data in the vertex buffer are

128bit-encoded instructions containing vertex coordinates, texture coordinates

and colors. Revised from the previous implementation [8], the current BEQ saves



more than 20% power consumption in the SRAM with the help of adaptive bank

activation. It partially activates the banks of DP-SRAM according to the required

buffer size, which is decided by the entry pointer. The flow controller keeps track

of the requests from the RISC and the 3DRE, and activates only the necessary

SRAM banks. Since the BEQ is also revised to be configured as 1KByte

bidirectional scratch-pad RAM, the RISC can read data from the BEQ for DSP

applications in which software-addressable on-chip memory is preferable to store

coefficients.

2.3.2 Rendering Engine

The rendering engine is the core of this graphics SoC. I designed the rendering

engine as a scalable IP core to satisfy the performance requirement on various

mobile platforms within the allowed power budget, since the target applications

range from simple avatars, user interfaces, and commercials on a QCIF

(176x144) display to real-time 3D games on a QVGA (320x240) display. More

details of the rendering engine are discussed in chapter 3.

2.3.3 Graphics Memories

Since the DRAM is integrated together with the rendering logic in this

architecture, we can optimize each memory for the corresponding operation,



instead of using the conventional SDRAM in the conventional PC graphics

architectures. To save the power consumption of the embedded DRAMs as well

as to optimally utilize their bandwidth, I propose three different DRAM types –

Frame Buffer, Depth Buffer, and Texture memory. As described in table 2.3-1, the

characteristics of each memory are optimized according to the operation

requirements. In order to provide the pixels for depth comparison and alpha

blending, the frame and depth buffers support a read-modify-write data transaction

in a single cycle with separate read and write buses. This drastically simplifies the

memory interface of the rendering engine and the pipeline, because the data

required to process two pixels are read from the frame and depth buffers,

processed in the pixel processors, and written back to the buffers

within a single clock period without any latency. Therefore, caching

[33, 36, 37] and prefetching [38], which may cause power and area overhead, are

not necessary in this architecture. An example of the operation timing of the frame or

depth buffer is shown in fig. 2.3-2. The Write-Mask signal, which is generated by

the pixel processor, decides the activation of the write operation. Non-

multiplexed addressing enables the DRAM to partially activate the necessary

wordline block to save the power consumption inside the memory [6-8].

To draw pixels on the 256x256 screen which covers the resolution of most of

the current cell phones, 4 frame macros and 4 depth macros are used in the chip.

Also, 4 texture macros, or 24Mb, store enough MIPMAP texture images for

3D game applications; this capacity can hold 12 x 24bit 256x256



MIPMAP textures or 48 x 24bit 128x128 MIPMAP images. Therefore, the use of

graphics DRAMs can completely eliminate the necessity of external texture

memory. These memories are distributed, enabling the rendering engine to utilize

only the necessary memories and to reduce the power consumption. The embedded

DRAM can operate at scalable clock frequencies ranging from 5MHz to 50MHz

to match the speed of the rendering logic, providing up to 2.4GByte/s of bandwidth

over the 416bit-wide bus. The configuration and activation of the graphics memories

are discussed in detail in chapter 3.5.
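The capacity and bandwidth figures quoted in this section can be cross-checked with a few lines of arithmetic. The sketch below is a consistency check, not part of the design flow. It assumes that the 416bit bus is the sum of all macro ports listed in table 2.3-1, that 1GByte means 2^30 bytes, and that a full MIPMAP chain costs 4/3 of its base level; none of these is spelled out in the text, but together they reproduce the quoted numbers.

```python
# Macro counts and per-macro figures taken from table 2.3-1 and section 2.3.3.
MACROS = {
    "frame":   {"count": 4, "bits_per_cycle": 24 + 24, "size_kbit": 768},
    "depth":   {"count": 4, "bits_per_cycle": 16 + 16, "size_kbit": 512},
    "texture": {"count": 4, "bits_per_cycle": 24,      "size_kbit": 6 * 1024},
}


def bus_width_bits(macros=MACROS) -> int:
    """Total number of data bits moved per cycle by all macros together."""
    return sum(m["count"] * m["bits_per_cycle"] for m in macros.values())


def capacity_mbit(macros=MACROS) -> float:
    """Total embedded DRAM capacity in Mbit."""
    return sum(m["count"] * m["size_kbit"] for m in macros.values()) / 1024


def bandwidth_gbyte_per_s(clock_hz: float, width_bits: int) -> float:
    """Peak bandwidth, with 1 GByte taken as 2**30 bytes (assumption)."""
    return clock_hz * width_bits / 8 / 2**30


def mipmap_textures(capacity_mbits: int, side: int, bits_per_texel: int = 24) -> int:
    """Number of square MIPMAP textures that fit, assuming a 4/3 chain overhead."""
    base_bits = side * side * bits_per_texel
    return (capacity_mbits * 2**20 * 3) // (base_bits * 4)


width = bus_width_bits()                                   # 416
total = capacity_mbit()                                    # 29.0
bw_33 = round(bandwidth_gbyte_per_s(33e6, width), 1)       # 1.6
bw_50 = round(bandwidth_gbyte_per_s(50e6, width), 1)       # 2.4
```

Under these assumptions, 24Mb of texture memory holds 12 textures of 256x256 or 48 textures of 128x128, matching the numbers given above.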

| | Frame Buffer | Depth Buffer | Texture Memory |
|---|---|---|---|
| TRC | 20ns | 20ns | 20ns |
| Macro Size | 768Kbits | 512Kbits | 6Mbits |
| I/O Interface | 24bit read, 24bit write | 16bit read, 16bit write | 24bit I/O |
| Commands | Read-Modify-Write, Read, Write, Auto-Refresh | Read-Modify-Write, Read, Write, Auto-Refresh | Read, Write, Auto-Refresh |
| Latency | 0 | 0 | 1 |

[Table 2.3-1 : Characteristics of Embedded DRAM]

[Fig. 2.3-2 : Timing Diagram of Frame/Depth Buffer]



2.3.4 Power Management Unit

Programmable Power Optimizer (PPO) manages the power consumption of the

chip. Each clock can be selectively turned on or off and its frequency is scalable

by the software program or hardware buttons to adjust the frame rate during run-

time as illustrated in fig. 2.3-3. RISCclk and BEQclk run at the full speed of the

RISC core, and REclk and MEMclk operate at a quarter of that frequency:

132/33MHz (RISCclk/REclk) for FAST mode, 66/16.5MHz for NORMAL, and

33/8.25MHz for SLOW. The PPO provides zero-latency frequency-scaling to

allow abrupt switching of operating frequencies during the execution of software.

The transition from slow mode to fast mode can be completed quickly without

any hazard.
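The relationship between the three operating modes and the four clock domains can be summarized in a small model. This is only a behavioral sketch of the frequencies listed above; the function name `clock_plan` and the way gating is expressed (a gated clock reported as 0 MHz) are illustrative assumptions, not the PPO register interface.

```python
BASE_MHZ = 132.0                                   # RISCclk in FAST mode
MODE_DIVIDER = {"FAST": 1, "NORMAL": 2, "SLOW": 4}
RE_RATIO = 4                                       # REclk and MEMclk run at 1/4 of RISCclk
DOMAINS = ("RISCclk", "BEQclk", "REclk", "MEMclk")


def clock_plan(mode: str, gated=()) -> dict:
    """Return the frequency in MHz of each clock domain for a given mode.

    Domains listed in `gated` are reported as 0.0 to model block-level
    clock gating.
    """
    risc = BASE_MHZ / MODE_DIVIDER[mode]
    plan = {
        "RISCclk": risc,
        "BEQclk": risc,
        "REclk": risc / RE_RATIO,
        "MEMclk": risc / RE_RATIO,
    }
    for name in gated:
        if name not in DOMAINS:
            raise ValueError(f"unknown clock domain: {name}")
        plan[name] = 0.0
    return plan


fast = clock_plan("FAST")
slow_idle = clock_plan("SLOW", gated=("REclk", "MEMclk"))
```

For example, `clock_plan("NORMAL")` gives 66MHz for the RISC and BEQ and 16.5MHz for the rendering engine and the memories.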

[Fig. 2.3-3 : Operation of Programmable Power Optimizer]


Chapter 3 Low-Power Rendering Engine

3.1 3D Rendering Engine

Fig. 3.1-1 shows the block diagram of the 3D rendering engine (3DRE). It consists

of a SlimShader, a Memory Programmer (MP), and a dozen rendering DRAMs.

The SlimShader performs main rendering operations such as texturing, shading,

blending, and depth comparison. MP enables the special effects such as

antialiasing, motion blur and fog to be programmable by the software. The 29Mb

rendering DRAM contains frame buffers, depth buffers, and texture memories.

The 12 independently controlled DRAMs reduce the power consumption since only

the necessary memories are selectively activated. The 3DRE can accelerate

the drawing of points, lines, and rectangles for 2D graphics as well.


[Fig. 3.1-1 : Low-Power 3D Rendering Engine]



3.2 SlimShader : Main Rendering Pipeline

3.2.1 Instruction Set Architecture

In order to execute the rendering programs and to control the datapath, 13

128bit-encoded instructions are defined. Since transferring the vertices takes

most of the rendering cycles, the instructions are optimized for this operation,

RDAT. As shown in fig. 3.2-1, a 128bit fixed-length format is selected so that

the whole vertex information is transferred in every single rendering cycle.

Therefore, colors (R, G, B, A), screen coordinates (X, Y), screen depth (Z),

homogeneous texture coordinates (u, v, 1/w) are transferred together with the

command information. Each color component (R, G, B, A) is represented by an 8bit

integer to support true-color rendering with alpha-blending. Each screen

coordinate (X, Y) is an 8bit integer to cover the 256x256 screen resolution. The

homogeneous texture coordinates (u, v, 1/w) are represented in a 16bit fixed-point

format (8bit integer + 8bit fraction) to preserve the necessary dynamic range and

precision for texture calculation.
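To make the field layout concrete, the following sketch packs one RDAT instruction into a 128bit integer using the bit positions of fig. 3.2-1 and the field widths of table 3.2-1. Two points are assumptions: the numeric encodings of the OP1 (TRI) and OP2 (POS) fields are not given in the text and are exposed here as hypothetical parameters defaulting to 0, and the sub-fields of each data word are assumed to be ordered from the most significant bits in the order they are written (for example u in the upper half of DATA0).

```python
def _field(value: int, bits: int) -> int:
    """Check that an unsigned value fits into the given number of bits."""
    if not 0 <= value < (1 << bits):
        raise ValueError(f"{value} does not fit in {bits} bits")
    return value


def pack_rdat(x, y, z, u, v, inv_w, a, r, g, b, op1=0, op2=0) -> int:
    """Pack one vertex into a 128bit RDAT word: CMD | DATA0 | DATA1 | DATA2.

    op1/op2 are hypothetical placeholders for the TRI/POS encodings.
    u, v and inv_w are raw 16bit fixed-point values (8bit integer + 8bit fraction).
    """
    cmd = (
        (0b1111 << 28)              # MODE = 1111 (normal mode)
        | (0b100000 << 22)          # TYPE = 1000 00 (RDAT)
        | (_field(op1, 2) << 20)    # OP1, bits 21..20
        | (_field(op2, 4) << 16)    # OP2, bits 19..16
        | _field(inv_w, 16)         # EXTRA = W = 1/w
    )
    data0 = (_field(u, 16) << 16) | _field(v, 16)                  # u:v
    data1 = (_field(x, 8) << 24) | (_field(y, 8) << 16) | _field(z, 16)  # X:Y:Z
    data2 = (
        (_field(a, 8) << 24) | (_field(r, 8) << 16)
        | (_field(g, 8) << 8) | _field(b, 8)                       # A:R:G:B
    )
    return (cmd << 96) | (data0 << 64) | (data1 << 32) | data2


word = pack_rdat(x=10, y=20, z=0x1234, u=0x0180, v=0x0240,
                 inv_w=0x0100, a=255, r=1, g=2, b=3)
```

On the 32bit geometry engine such a word corresponds to four registers, which is why a multiple register transfer is a natural way to send it.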

Instruction word (128bit): CMD [127:96] | DATA0 [95:64] | DATA1 [63:32] | DATA2 [31:0]

CMD word (32bit): MODE [31:28] (Processor Mode) | TYPE [27:22] (Instruction Type) | OP1 [21:20] (OP-Code 1) | OP2 [19:16] (OP-Code 2) | EXTRA [15:0] (Extra Command)

[Fig. 3.2-1 : Instruction Format]



Although using 128bit fixed-length instructions rather than variable-

length packets may waste bandwidth for other operations such as RBUF,

RCLR, TSTR and ASTR, which occur less often than RDAT, it simplifies the

design of the decoding and control unit. Fetching one vertex at every cycle

enables the rendering engine to continuously calculate two pixels per cycle for

strips and fans of even small triangles. The 128bit instruction additionally

requires the bandwidth equalizer, which is described in section 2.3.1, to adapt to

the 32bit geometry engine in this graphics SoC. However, this also means that the SlimShader

can be attached to any other geometry engine by changing the design of the bandwidth

equalizer, without touching the SlimShader rendering core. This 128bit

instruction can be easily transferred from the ARM9 geometry engine by using

the multiple register transfer instruction [39].

The number of instructions is determined to support a subset of the OpenGL

rendering operations, since OpenGL provides many high-level functions such as

trilinear texture filtering, non-linear fog, and some blending modes which are

rarely used in real gaming applications. Additional instructions to support

real-time special rendering effects, to control the embedded DRAMs, and to

manage the standby power are also defined. The instructions and supported

functions are listed in table 3.2-1 and 3.2-2, respectively. The Power Control

instructions control the refresh commands of the embedded DRAMs and more

details will be discussed in chapter 3.5.



| Type | Instruction | OP Code | Description |
|---|---|---|---|
| Power Control | XXXX | MODE = 1111 | Normal Mode |
| Power Control | PHLD | MODE = 1011 | Hold |
| Power Control | PIDL | MODE = 0011 | Idle |
| Power Control | PSLP | MODE = 0001 | Sleep |
| Power Control | POFF | MODE = 0000 | Off |
| Rendering | RDAT | MODE = 1111, TYPE = 1000 00, OP1 = TRI, OP2 = POS, EXTRA = W, DATA | Fetch Vertex Data: W[16b] = 1/w, DATA0[16b:16b] = u:v, DATA1[8b:8b:16b] = X:Y:Z, DATA2[8b:8b:8b:8b] = A:R:G:B |
| Rendering | RBUF | MODE = 1111, TYPE = 1000 01, OP1 = FB, OP2 = ZB | Set Front Buffer Rendering |
| Rendering | RCLR | MODE = 1111, TYPE = 1000 10, OP1 = FZ | Clear Front Buffer with All-Zero |
| Texture | TSTR | MODE = 1111, TYPE = 0100 00, OP1:OP2:EXTRA = ADDR, DATA0 | Store Texture Map: ADDR[22b] = Texture Address, DATA0[Xb:8b:8b:8b] = R:G:B |
| Texture | TMOD | MODE = 1111, TYPE = 0100 01, OP1:OP2:EXTRA = ADDR, DATA0 = BLND:FILT:ID:LOD:SIZE | Set Texture Mode: ADDR[22b] = Base Address, BLND[4b] = Blending Mode, FILT[4b] = MIPMAP Filtering, ID[8b] = Texture ID, LOD[4b] = LOD Bias, SIZE[12b] = Texture Size |
| Texture | TF2T | MODE = 1111, TYPE = 0100 10, OP1:OP2:EXTRA = ADDR | Transfer FB to TM: ADDR[22b] = Base Texture Address (Front Buffer Contents are Transferred) |
| Auxiliary | ASTR | MODE = 1111, TYPE = 0010 00, OP1 = FZ, EXTRA = ADDR, DATA0 | Store Data to Front Buffer FZ: ADDR = FZB Address, DATA0[Xb:8b:8b:8b] = R:G:B (only G:B is stored into the Z-Buffer) |
| MP | MRPG | MODE = 1111, TYPE = 0001 00, EXTRA = MOP | Load Memory Program: MOP[16b] Defined by Memory Programmer ISA |

[Table 3.2-1 : Instruction Set]



| Category | Supported Features |
|---|---|
| Screen Resolution | 256 x 256 |
| Color Depth | 24bit True Color |
| Shading | Triangle Fan and Strip Support; Gouraud Shading; Pixel Alpha Blending; Texture Blending; Programmable Shading through MemoryProgrammer™ |
| Hidden Surface Removal | 16bit Hardware Accelerated Double Z-Buffer |
| Texture Mapping | Perspective Correct Texture Address Calculation; Power-Efficient Texture Fetch through Address Alignment Logic™; LOD Bias; Texture Filtering (MIPMAP, No MIPMAP, Point sampling, Bilinear); Allowable Texture Size: 2x2 ~ 256x256 (Power of 2); Maximum Number of Textures: 255 |
| Special Features | 2D Graphics Acceleration (Line, Triangle, Rectangular Acceleration); Memory Programmer™ (Post Rendering Processing with FB and ZB, Linear Expression Evaluator); Special Rendering Effects (Antialiasing, Motion Blur, Artistic Trajectory, Other Special Rendering Effects) |
| Power Management | Scene-Dependent Clock Variation Control (FAST: 33MHz, NORMAL: 16.5MHz, SLOW: 8.25MHz); Instruction-Level Power Management Control (Normal: Normal Rendering Operation; Hold: Waiting for Geometry Pipeline; Idle: No Operation with FB, ZB, TM Refresh; Sleep: No Operation with TM Refresh only; Off: No Operation without eDRAM Refresh) |

[Table 3.2-2 : Supported Rendering Features]



3.2.2 Low-Power Pipeline Structure

Fig. 3.2-2 shows the main rendering pipeline attached with graphics memories

and table 3.2-3 describes its operation. It is composed of 14 pipeline

stages so that power is saved by activating only the

necessary stages. The graphics memories are accessed through distributed

pipeline stages: the depth buffer at the PI stage, the texture memory at the TP2 stage, and

the frame buffer at the PB stage. Since each pipeline stage is designed as a module with

its own controller, additional rendering features can be easily inserted in the next

revision without modifying the entire pipeline. After fetching the instructions, the

rendering engine shapes the triangle and varies the operation cycles in the next

stages according to the size (HOLD#1) and the shape of the triangles (HOLD#2)

by pausing the previous pipeline stages. As the pipeline example in fig.

3.2-3 shows, the rendering engine can calculate 2 pixels at every cycle.

[Fig. 3.2-2 : Main Rendering Pipeline]



Shaping the triangle is accelerated in the TS stage, performing the horizontal-

order rasterization (scanline-based rasterization) as in fig. 3.2-4. Although this

rasterization simplifies memory address and pipeline control, the rendering

performance can be degraded when the triangle falls across the DRAM pages in

the conventional DRAM architecture [40, 44, 46]. Therefore, I re-defined the

timing of graphics DRAM and assigned the frame and depth buffers as a vertical-

stripe pattern, instead of prefetching data from standard SDRAM. Since the row

of the DRAM can be changed without any latency at 50MHz random row cycle

(TRC=20ns) and each memory (A or B) has its own read/write ports as described

in table 2.3-1, the graphics DRAM can continuously provide the bandwidth

required to process two pixels together. This rasterization order also reduces the

power consumption since only the memories corresponding to the necessary

pixels are activated. Fig. 3.2-5 shows the rasterization order of GeForce4 [44],

where the 2x2 tiles are traversed in memory page friendly order to maximally

utilize the column access of external frame buffer. However, unnecessary pixels

are also transferred together through the capacitive memory bus, wasting

power and bandwidth. Although this rasterization order is known to

improve the texture cache performance by reducing the miss rate [33], it has

little effect on the texturing performance of the proposed rendering engine, where a cache

is not implemented.

Since the rendering engine contains two pixel processors (PP) and each PP has

its own texture unit fetching 4 texels/cycle, the pixel fill rate and the texel rate



are 100Mpixels/s and 400Mtexels/s at 50MHz, respectively. The two pixel

processors are simply assigned to render horizontally adjacent pixels, so it is

easy to gather their texture addresses; this property is exploited by the energy-

efficient texture unit, which is covered in the next section.

In order to eliminate the power consumption of the unused blocks as much as

possible, the datapath transition is controlled by clock-gating and latch-enabling.

I put the depth-compare-unit into the earlier pipeline stage (inside PI stage) and

apply a depth-first clock-gating (DFCG) scheme in order to reduce the power

consumption as shown in fig. 3.2-6. In 3D graphics, if a new pixel to be drawn is

hidden behind an old pixel that is nearer to the viewpoint, the new pixel

does not need to be processed further. DFCG prevents the unnecessary

shading and texturing by gating off the clock in the remaining datapath according

to the results of the depth-comparison. It also eliminates the unnecessary requests

to the corresponding memories. Besides, the pipeline latches of the shading and

texturing units can be independently enabled or disabled to avoid

unnecessary datapath transitions as much as possible. DFCG violates the

OpenGL semantics [14], which do not allow the depth buffer to be updated before

texture mapping because textured pixels may be completely transparent; this violation

can be resolved by removing such triangles in software prior to the rendering

operation.
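The effect of the depth-first ordering can be sketched in software. The model below is a behavioral illustration only, not the hardware control logic: a `continue` statement stands in for gating off the clock of the shading and texturing datapath, and the convention that a smaller z value is nearer to the viewpoint, as well as the names `render_span` and `shade`, are assumptions.

```python
def render_span(fragments, depth_buffer, frame_buffer, shade):
    """Process fragments with the depth test placed before shading.

    fragments: iterable of (x, z, attributes) tuples.
    shade: callable standing in for the shading and texturing datapath.
    Returns the number of fragments whose shading was skipped.
    """
    skipped = 0
    for x, z, attributes in fragments:
        if z >= depth_buffer[x]:
            # Hidden pixel: the remaining datapath is never exercised
            # and no texture or frame-buffer request is issued.
            skipped += 1
            continue
        # The depth buffer is updated before texturing, which is the
        # deviation from the OpenGL ordering discussed above.
        depth_buffer[x] = z
        frame_buffer[x] = shade(attributes)
    return skipped


shaded = []


def record_and_shade(attributes):
    """Toy shader that records which fragments actually reached it."""
    shaded.append(attributes)
    return attributes.upper()


depth = [10, 10, 10]
frame = [None, None, None]
skipped = render_span([(0, 5, "a"), (1, 12, "b"), (2, 10, "c")],
                      depth, frame, record_and_shade)
```

In this example only the first fragment passes the depth test, so the shading function runs once instead of three times.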



| Pipe | Description |
|---|---|
| IF | Instruction Fetch, Main Power Control |
| ID1 | Instruction Decode #1 |
| ID2 | Instruction Decode #2, Triangle Shaping |
| TS | Triangle Setup |
| EP | Edge Processor |
| HS | Horizontal Setup, Span Generation |
| PI | Pixel Interpolation, Depth Comparison, Depth-Buffer Interface, Clock Gating Control |
| TA1 | Texture Address #1, LOD Calculation, 1/w Division |
| TA2 | Texture Address #2, Address Merging |
| TP1 | Texture Prefetch #1, Bank Address Aggregation, Texture Memory Command Generation |
| TP2 | Texture Prefetch #2, Texture Memory Read |
| TP3 | Texture Prefetch #3, Texture Data Alignment, Reverse Procedure of Address Alignment |
| TF | Texture Filter |
| PB | Pixel Blending |

[Table 3.2-3 : Pipeline Description]

[Fig. 3.2-3 : Example of Pipeline Timing]



[Fig. 3.2-4 : Rasterization Order and Frame/Depth Buffer Assignment]



[Fig. 3.2-5 : Rasterization Order of GeForce4]



[Fig. 3.2-6 : Datapath Transition Control]



3.2.3 Triangle Setup Engine

To render triangles with a modified Bresenham's incremental line drawing

algorithm [47], the positions of the input vertices must be identified, and the

increments of colors and coordinates must be calculated early in the rendering

pipeline, inside the Triangle Setup Engine. Although triangle setup in 3D

graphics took more than 7,000 cycles when it was calculated by the general-

purpose RISC processor, the previous implementations [6-8] did not contain a

hard-wired setup engine because of its logic complexity. In this work, however, I

simplify the algorithm, optimize the precision of the datapath, and implement the

triangle setup engine (TSE) which contains three 9-way SIMD SUBs, three 8-

way SIMD DIVs, and a mid-point interpolation unit inside of the 3DRE to

enhance the overall 3D performance as shown in fig. 3.2-7. SORT_T2B sorts 3

vertices from top to bottom by subtracting their coordinates pairwise and checking the signs of

the results. Then, VERT_DIV calculates ∆(X,Z,R,G,B,U,V,W)/∆Y. At the last stage,

MID_INTPL determines the type of triangle (P0 or P1) by comparing the mid-point

of the longest edge with the interpolated point. The total calculation time from the vertex

register to the final MUX is less than 20ns, and it determines the maximum operating

frequency of the rendering engine, 50MHz. Therefore, the triangles do not need

to be pre-clipped anymore, unlike the previous implementations [6-7].
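The three setup steps (sorting, per-edge division by ∆Y, and the mid-point comparison) can be mirrored by a short functional model. The sketch below uses ordinary floating-point division instead of the fixed-point LUT datapath of the TSE. The vertex representation as a dictionary, the function names, and the reporting of the triangle type as a boolean (whether the middle vertex lies to the left of the longest edge) are all illustrative assumptions; the mapping of that boolean to the P0/P1 types is not specified here.

```python
ATTRS = ("x", "z", "r", "g", "b", "u", "v", "w")   # the eight divided attributes


def edge_slopes(start, end):
    """Per-scanline increment of every attribute along one edge."""
    dy = end["y"] - start["y"]
    if dy == 0:
        return None          # horizontal edge: nothing to step along
    return {k: (end[k] - start[k]) / dy for k in ATTRS}


def triangle_setup(v0, v1, v2):
    """Sort vertices top to bottom, compute edge increments, classify the shape."""
    top, mid, bot = sorted((v0, v1, v2), key=lambda vert: vert["y"])
    long_edge = edge_slopes(top, bot)
    if long_edge is None:
        return None          # zero-height triangle covers no scanline
    # Point on the longest edge at the height of the middle vertex.
    x_on_long_edge = top["x"] + long_edge["x"] * (mid["y"] - top["y"])
    return {
        "sorted": (top, mid, bot),
        "long": long_edge,
        "upper": edge_slopes(top, mid),
        "lower": edge_slopes(mid, bot),
        "mid_left_of_long_edge": mid["x"] < x_on_long_edge,
    }
```

A usage example: for vertices at (10, 0), (0, 10) and (10, 20) given in any order, the longest edge is vertical (x increment 0) and the middle vertex is reported as lying to its left.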

In order to develop mobile 3D graphics applications quickly, the model

data may be scaled down from the PC platform, where triangles are optimized for

a higher screen resolution (640x480, 1024x768 or more), to the mobile platform, which



has a much lower screen resolution (176x144 or 320x240). Therefore, the average

number of pixels inside a triangle will be smaller in mobile 3D, which means that the

setup time may become the bottleneck of pixel throughput. Although the exact

latency and throughput of the triangle setup engines in

conventional high-end processors are not published [44, 46], they take more than one cycle and vary

from triangle to triangle. In contrast, the proposed setup engine is designed to

ensure that the triangle-setup time is always shorter than the pixel-filling time even

for a small triangle: it performs triangle setup in one cycle without latency.

[Fig. 3.2-7 : Triangle Setup Engine]



During the setup calculation, insufficient precision can cause significant

degradation of image quality, since the errors are accumulated in the following

stages. Even though inaccurate colors are tolerable to the eye, inaccurate

coordinates lead to distortion in shape. Therefore, high-end graphics platforms use a

floating-point datapath for the setup operation. However, using a conventional

floating-point divider inside the SIMD datapath of a mobile graphics SoC is

an overhead in terms of area and power consumption, considering that the screen resolution

is limited. Therefore, the proposed engine uses fixed-point arithmetic instead. Once

colors and coordinates are fed into the rendering engine, they are calculated and

stored as fixed-point numbers. However, when division operations are performed

in the TSE, the data are temporarily handled by a floating-point divider. Because the

TSE requires three 8-way SIMD dividers, the dividers can take a significant amount

of silicon area. For the SIMD divider, using multipliers with a LUT is the inevitable

choice considering the power and the area, since all lanes of a divider share the divisor,

∆Y. The 8-way SIMD divider is built from 8 integer multipliers, 8 shifters,

and one floating-point LUT (Look-Up Table).
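The structure of such a divider (one shared reciprocal look-up, then a multiply and a shift per lane) can be sketched as follows. The 8bit mantissa and 3bit exponent follow the LUT11(FLOAT) format used in this section. The truncation of the mantissa, the 8 fractional bits of the result, the pass-through for ∆Y = 1, and the function names are assumptions for illustration, not the actual circuit.

```python
def build_reciprocal_lut(max_dy: int = 255) -> dict:
    """Map dy to (mantissa, exponent) with 1/dy ~= mantissa * 2**-(8 + exponent)."""
    lut = {}
    for dy in range(2, max_dy + 1):
        exponent = (dy - 1).bit_length() - 1       # leading zero bits of 1/dy (0..7)
        mantissa = (1 << (8 + exponent)) // dy     # the following 8 bits, truncated
        lut[dy] = (mantissa, exponent)
    return lut


RECIP_LUT = build_reciprocal_lut()


def simd_divide(deltas, dy, lut=RECIP_LUT):
    """Divide every lane by the shared divisor dy.

    deltas: integer dividends, one per SIMD lane.
    Returns fixed-point quotients with 8 fractional bits.
    """
    if dy == 1:
        return [d << 8 for d in deltas]            # 1/1 needs no look-up (assumption)
    mantissa, exponent = lut[dy]                   # one shared look-up
    return [(d * mantissa) >> exponent for d in deltas]   # multiply and shift per lane


halves = simd_divide([10, -4], 2)
third = simd_divide([3], 3)
```

Every stored mantissa lies between 128 and 255, so the relative error of the approximated reciprocal stays below 1/128 regardless of how small 1/∆Y becomes, which is the benefit of removing the leading zeros.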

Here, optimizing the datapath width is important to implement the TSE with

a small number of gates while preserving the necessary precision. The

derivatives of the setup operation can be written as follows:

$$\Delta P = \Delta x / \Delta y = \Delta x \times (1/\Delta y)$$

(∆P = derivative, ∆x = dividend, ∆y = divisor)

Here, the multiplication (×) and 1/∆y can be implemented by a multiplier and a LUT, respectively.



Since the error of ∆P is accumulated through the incremental shading datapath,

the fractional part of 1/∆y must keep the required precision: an m-bit fraction is

required for an m-bit screen resolution [48]. An insufficient number of fractional bits

results in noticeable distortion as shown in fig. 3.2-8. Also, because ∆y varies

from 2 (2^1) to 255 (2^8-1), 1/∆y changes from 0.5 to 0.003921568..., requiring an 8bit

dynamic range to hold the MSB position. Therefore, a 16bit width is necessary to

store the dynamic range and fractional part of 1/∆y. Cutting off the LSBs of

1/∆y can distort the images as shown in fig. 3.2-9. However, storing 16bit fixed-

point values in the LUT and calculating the corresponding data with the MULs leads to an area

burden for mobile applications. Estimating the gate count of three 8-way

SIMD dividers, 16bit fixed-point division will take about 54,780 gates, which is

even slightly larger than that of the ARM9 processor (about 50k gates) in the

geometry engine. The total area of the dividers is estimated with the following

calculation and the results are summarized in fig. 3.2-10.

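$$
\begin{aligned}
AREA_{LUT16(FIXED)} &= 3 \times (AREA_{LUT16} + 4 \times AREA_{MUL17 \times 16} + 4 \times AREA_{MUL9 \times 16}) \\
AREA_{LUT11(FIXED)} &= 3 \times (AREA_{LUT11} + 4 \times AREA_{MUL17 \times 11} + 4 \times AREA_{MUL9 \times 11}) \\
AREA_{LUT8(FIXED)} &= 3 \times (AREA_{LUT8} + 4 \times AREA_{MUL17 \times 8} + 4 \times AREA_{MUL9 \times 8}) \\
AREA_{LUT11(FLOAT)} &= 3 \times (AREA_{LUT11} + 4 \times AREA_{MUL17 \times 8} + 4 \times AREA_{MUL9 \times 8} + AREA_{SHIFTERS})
\end{aligned}
$$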

Therefore, the 8bit dynamic range and the 8bit fraction are stored separately in the

LUT as floating-point numbers. All leading zeros of 1/∆y are removed, and only the

meaningful 8 bits after the leading zeros and the corresponding 3bit fractional



point location are stored in the LUT as a mantissa and an exponent, respectively.

Although the shifters at the last stage in the floating-point LUT division,

LUT11(FLOAT), increases the gate counts by 14%, the total area of three SIMD

dividers is smaller than that of 16bit fixed-point LUT divider, LUT16(FIXED),

by 40%. The area is even smaller by 15% than LUT11(FIXED), while

suppressing unwanted image distortion. FLOAT(SINGLE) shows the image

directly calculated by standard floating-point datapath supporting IEEE-754

single precision, without using shared LUT and multipliers. The image of

LUT11(FLOAT) is even compared to that of FLOAT(SINGLE), while reducing

the power and the area by 95% and 85%, respectively.
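The mantissa/exponent storage described above can be sketched in software. This is only an illustration under stated assumptions: 1/∆y is held as a 16-bit fraction and truncated (the rounding mode is not stated in the text), and the function names are hypothetical.

```python
def encode_reciprocal(dy, total_bits=16, mant_bits=8):
    """Return (mantissa, exponent) for 1/dy, with 2 <= dy <= 255.

    The exponent is the number of leading zeros of the 16-bit fraction;
    the mantissa is the 8 bits that follow those leading zeros.
    """
    q = (1 << total_bits) // dy             # 1/dy as a 16-bit fraction (truncated)
    leading_zeros = total_bits - q.bit_length()
    mantissa = (q << leading_zeros) >> (total_bits - mant_bits)
    return mantissa, leading_zeros


def decode_reciprocal(mantissa, exponent, total_bits=16, mant_bits=8):
    """Rebuild the approximate value of 1/dy from the stored pair."""
    q = (mantissa << (total_bits - mant_bits)) >> exponent
    return q / (1 << total_bits)


mant, exp = encode_reciprocal(255)
print(mant, exp, decode_reciprocal(mant, exp))
```

For every ∆y from 2 to 255 the exponent stays within 0 to 7, so 3 bits are enough, which is consistent with the 11-bit format (8-bit mantissa plus 3-bit exponent).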

[Fig. 3.2-8 : Fractional Point] (rendered images with 0-bit, 4-bit, and 8-bit fraction; one panel is marked "Proposed")

[Fig. 3.2-9 : Precision of LUT] (rendered images for each divider precision; LUT11 (FLOAT) is marked "Proposed")
FLOAT (Single) : IEEE-754 single precision (1 sign + 23 + 8 exp)
LUT11 (FLOAT) : 11-bit float (8-bit mantissa + 3-bit exponent)
LUT8 (FIXED) : 8-bit fixed (8-bit integer)
LUT11 (FIXED) : 11-bit fixed (8-bit integer + 3-bit fraction)
LUT16 (FIXED) : 16-bit fixed (8-bit integer + 8-bit fraction)


(a) Area of Each Block (number of gates)

Bit-width   MUL17   MUL9    LUT
8           1,374     797   430
11          1,890   1,100   590
16          2,750   1,600   860

(b) Total Area of Three 8-way SIMD Dividers (number of gates)

Divider Precision   LUT   MUL      SHIFT   SIMD Div   TOTAL
LUT11(FLOAT)        590    8,684   1,507   10,781     32,343
LUT8(FIXED)         430    8,684       0    9,114     27,342
LUT11(FIXED)        590   11,960       0   12,550     37,650
LUT16(FIXED)        860   17,400       0   18,260     54,780

LUT11(FLOAT) achieves a 40% area reduction compared with LUT16(FIXED).

[Fig. 3.2-10 : Divider Area in Triangle Setup Engine]


3.3 Energy-Efficient Texturing Unit

3.3.1 Consideration of Energy Efficiency

[Fig. 3.3-1 : Power and Energy Consumption] (power consumption over Frame #1, Frame #2, and Frame #3; annotations: low power (run-time), high performance, low power (standby power), and energy as the area under the power curve)

Reducing the power consumption is sometimes believed to be the ultimate goal of designing circuits for mobile applications. However, this is not always true. The amount of energy consumed stays the same when the power consumption is cut in half but the calculation time is doubled, since the energy consumption is the product of the power and the time. Because mobile devices are driven by a battery, reducing the energy consumption is the key to extending the operating lifetime. Fig. 3.3-1 shows the power and energy consumption when the texturing unit renders consecutive frames. Once the frame rate is fixed, the rendering engine, after drawing within the frame slot, waits until the next frame starts, since the job assigned to each frame is finite. Therefore, reducing the operation time by achieving high performance, as well as reducing the operating power, must be taken into account for long-lasting operation. Also, suppressing the standby current is necessary to minimize the overall energy consumption. Therefore, I propose two schemes to achieve high performance while keeping the power consumption low: 1) Approximation of perspective division, and 2) Address alignment logic.
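The power-versus-energy argument can be made concrete with a small calculation. This is a minimal sketch; the power, time, and frame-slot values are hypothetical and are not measurements from the chip.

```python
def frame_energy_uj(p_active_mw, t_active_ms, frame_ms, p_standby_mw=0.0):
    """Energy per frame slot in microjoules (mW x ms = uJ).

    The engine draws for t_active_ms and then waits in standby
    until the frame slot of frame_ms ends.
    """
    idle_ms = frame_ms - t_active_ms
    return p_active_mw * t_active_ms + p_standby_mw * idle_ms


# Halving the power while doubling the time leaves the energy unchanged.
fast = frame_energy_uj(p_active_mw=100.0, t_active_ms=10.0, frame_ms=33.0)
slow = frame_energy_uj(p_active_mw=50.0, t_active_ms=20.0, frame_ms=33.0)

# A non-zero standby power adds energy during the waiting time.
fast_with_standby = frame_energy_uj(100.0, 10.0, 33.0, p_standby_mw=5.0)

print(fast, slow, fast_with_standby)
```

The sketch shows why both the active energy and the standby power matter once the frame rate is fixed.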

Even though the screen resolution of the target PDA is limited, the rendering quality itself cannot be sacrificed much. The rendering engine must calculate the pixels correctly at a high pixel fill rate, within the required power budget. Therefore, the SlimShader contains two texture units, each of which supports perspective-correct address calculation [42] and bilinear MIPMAP texture filtering [41].

3.3.2 Approximation of Perspective Division

In the calculation of the perspective-correct texture address, per-pixel division is required; this operation can be described by the following equations [42].

$$U = \frac{u}{1/w}, \qquad V = \frac{v}{1/w} \qquad \text{[Eq. 1]}$$

$$0 \le (U, V) \le 1 \qquad \text{[Eq. 2]}$$

$$(1/w) \ge u, \qquad (1/w) \ge v \qquad \text{[Eq. 3]}$$

(where (u, v, 1/w) are the homogeneous texture addresses and (U, V) are the texture addresses)


Direct calculation of the above equations in a single cycle is difficult even in high-end 3D graphics systems [46], because of the gate count overhead of the divider. Although a division-free algorithm was introduced [49], its cycle time varies depending on the inputs, which means slower pixel throughput and more complex pipeline control. Therefore, this architecture uses the direct division method for sustained pixel throughput, keeping the pipeline control simple. Since each operand (u, v, and 1/w) has 16-bit precision in the datapath, a 16-bit / 16-bit divider is required to calculate the perspective-correct texture address (U and V). However, by the definition of the texture address in Eq. 2, the range of 1/w can be limited as in Eq. 3. These facts can be used to reduce the power consumption and the area of the address calculation circuit.

[Fig. 3.3-2 : u, v, 1/w formatting] (1/w is split into leading zeros, 8-bit data sent to the 8-bit LUT, and LSBs that are dropped as approximation error; u and v are shown before and after reformatting, where they are shifted by the same number of leading zeros and zero-padded after the LSB)

The following approximation method enables the use of a 16/8 divider instead of a 16/16 divider. The value 1/w can be represented in binary form as the composition of leading zeros, m-bit data, and LSBs, as shown in fig. 3.3-2. Since u and v are always equal to or smaller than 1/w, removing the same number of leading zeros from u and v still preserves their data. Therefore, a divider with a smaller bit-width can be used. In a LUT divider, the bit-width of the divisor (m bits) determines the LUT area, which in turn may occupy most of the divider area. However, using only 8-bit data and discarding the LSBs leads to an approximation error, as described in the following equations:

Let $(1/w)_0$, $u_0$, $v_0$ be the original $1/w$, $u$, $v$, and let $(1/w)_a$ be the approximated $1/w$.

Then, $(1/w)_0$ and $(1/w)_a$ can be represented as follows:

$$(1/w)_0 = 0 \cdot 2^{16-L} + a_0 \cdot 2^{16-L-m} + a_e = \text{Leading Zeros} + m\text{-bit Data} + \text{LSBs}$$

$$(1/w)_a = 0 \cdot 2^{16-L} + a_0 \cdot 2^{16-L-m} = \text{Leading Zeros} + m\text{-bit Data}$$

$$1 \le L + m \le 16, \qquad 0 \le a_e \le 2^{16-L-m} - 1$$

Where,

L = Number of Leading Zeros

m = Bit-width of DATA used to search the LUT

Since $a_0$ is the number after the leading zeros, the MSB of $a_0$ must not be zero, and it can be written as follows:

$$a_0 = 2^{m-1} + a_1, \qquad \text{where } 0 \le a_1 \le 2^{m-1} - 1$$

The texture coordinates $U_0$ (original) and $U_a$ (approximated) can be written as follows:

$$U_0 = \frac{u_0}{(1/w)_0}, \qquad U_a = \frac{u_0}{(1/w)_a}$$

Thus, the approximation error can be written as follows:

$$
\begin{aligned}
E(1/w) &= \frac{U_a - U_0}{U_0}
= \frac{\dfrac{u_0}{(1/w)_a} - \dfrac{u_0}{(1/w)_0}}{\dfrac{u_0}{(1/w)_0}}
= \left( \frac{1}{(1/w)_a} - \frac{1}{(1/w)_0} \right) \times (1/w)_0 \\
&= \frac{(1/w)_0 - (1/w)_a}{(1/w)_a}
= \frac{a_e}{a_0 \cdot 2^{16-L-m}}
\end{aligned}
$$

Therefore, the maximum error is

$$MAX(E(1/w)) = \frac{2^{16-L-m} - 1}{2^{m-1} \cdot 2^{16-L-m}}$$
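The maximum-error expression can be evaluated directly to see how the choice of m affects the bound. The following is a small sketch; the function name is hypothetical, and the formula is the one given above for a 16-bit 1/w.

```python
def max_error(m, leading_zeros, width=16):
    """Maximum relative error when only m bits after the leading zeros are kept."""
    remaining = width - leading_zeros - m   # number of discarded LSBs
    if remaining < 0:
        raise ValueError("L + m must not exceed the operand width")
    return (2 ** remaining - 1) / (2 ** (m - 1) * 2 ** remaining)


for m in (6, 7, 8, 9):
    # The bound is largest when there are no leading zeros (L = 0).
    print(f"m={m}: max error = {100 * max_error(m, 0):.3f}%")
```

With no leading zeros, m = 8 gives 255/32768, about 0.78%, while m = 7 gives about 1.56%.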

Fig. 3.3-4 shows the estimated gate counts of the perspective division unit and the approximation errors as functions of the bit-width. The total area of the division unit per pixel processor can be calculated as follows:

$$AREA_{TOTAL} = AREA_{LUT} + AREA_{MUL\ for\ U} + AREA_{MUL\ for\ V} = AREA_{LUT} + 2 \times AREA_{MUL}$$

Since each PP contains the division unit, the overall gate count is doubled in the chip, which has two PPs. The maximum error occurs when there are no leading zeros (L = 0). I choose m = 8 to keep the maximum error below 1%. Selecting 8 also shortens the design time, since the 8-bit LUT divider was already designed for the TSE and is available when designing the TA1 stage. This reduces the divisor bit-width from 16 to 8, resulting in more than 95% area reduction in the divider, provided that the image quality can be sacrificed within the 0.78% error bound, as shown in fig. 3.3-3. Before being fed into the LUT divider, u and v are also reformatted to match 1/w, by left-shifting them by the same number of leading zeros as 1/w and padding zeros after the LSBs.
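The reformatting and the reduced-width division can be sketched as follows. This is an illustration only: the hardware uses a LUT and multipliers, whereas the sketch uses an ordinary division on the truncated divisor, and the function name and test operands are hypothetical.

```python
def approx_divide(u, inv_w, m=8, width=16):
    """Approximate U = u / (1/w) using only the top m bits of 1/w.

    u and inv_w are unsigned 16-bit integers with 0 <= u <= inv_w.
    """
    assert 0 < inv_w < (1 << width) and 0 <= u <= inv_w
    leading_zeros = width - inv_w.bit_length()
    u_fmt = u << leading_zeros                          # same shift as 1/w
    divisor = (inv_w << leading_zeros) >> (width - m)   # m-bit data, MSB is 1
    return u_fmt / (divisor << (width - m))             # dropped LSBs act as zeros


u, inv_w = 0x0100, 0x0123
exact = u / inv_w
approx = approx_divide(u, inv_w)
print(exact, approx, (approx - exact) / exact)
```

Because the divisor is truncated, the approximated address is never smaller than the exact one, and for m = 8 the relative error stays below 1/128.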

[Fig. 3.3-3 : Error on Perspective Division] (rendered images with the 16bit / 8bit divider (proposed) and the 16bit / 16bit divider)

[Fig. 3.3-4 : Area and Error Estimation of Division Approximation]
(a) Area: gate counts (log scale) of LUT, MUL, and Total versus m (bit-width of data); the selected point is marked with 95% area reduction
(b) Maximum Calculation Error: maximum error (%) versus number of leading zeros, for m = 1 ~ 5 and m = 6 ~ 9


3.3.3 Address Alignment Logic

Eight texel requests are generated every cycle because two texture units perform bilinear MIPMAP texture filtering to draw more realistic images [41]. Although the on-chip DRAM is capable of supplying the bandwidth for 8 texels per cycle, fetching 8 texels directly from 8 texture memories (TMs) may consume a large amount of power, due to the concurrent data transitions in many capacitive I/Os and the activation power of the TMs themselves. Therefore, I propose Address Alignment Logic (AAL) to reduce the number of memory requests, as illustrated in fig. 3.3-6. Because four texel requests are generated by each pixel processor in bilinear MIPMAP filtering, the total number of requests is 8. However, several requests can overlap because their footprints are separated by approximately a 1-texel distance, as shown in fig. 3.3-6, based on the definition of MIPMAP filtering.

Fig. 3.3-5 shows the block diagram of the Address Alignment Logic. After the texture addresses (U and V) are calculated at the TA1 stage, four bilinear addresses are generated by each pixel processor. In this stage, the LOD (Level of Detail) is also calculated. Fig. 3.3-7 shows the variation of the integer part of the LOD according to the calculation method [44]. Although LODMAXx shows some difference from the widely-used LODMAXall or LODSQRT [46], I chose LODMAXx since it reduces the hardware cost by eliminating the square-root logic and the Y-value registers. Also, PP0 and PP1 share the LOD unit to further minimize the hardware, because the LODs of PP1 differ from those of PP0 by only about 10% on average, as summarized in table 3.3-1. An 82% reduction in gate count is achieved compared with LODSQRT.

[Fig. 3.3-5 : Block Diagram of Address Alignment Logic] (pipeline stages and blocks, from top to bottom:
TA1: UVW Division in PP0 and PP1, with inputs uIN[15:0], vIN[15:0], wIN[15:0] and outputs uOUT[11:0], vOUT[11:0];
TA2: TA2_ADDR_LOD, TA2_ADDR_BILINEAR (PP0UV0 ~ PP0UV3, PP1UV0 ~ PP1UV3), TA2_SPATIAL_ALIGN (SA0 ~ SA3), TA2_TEMPORAL_ALIGN (TA0 ~ TA7), TA2_MASK_GEN (SPmask[7:0], TMmask[7:0]), and TA2_ADDR_TRANSLATION (TMaddr0 ~ TMaddr7);
TP1: TP1_BANK_AGGREGATION, TP1_MULTI, TP1_ADDR_SELECT, and TP1_BAON[3:0], driving TM0_ADDRESS ~ TM3_ADDRESS [17:0] to the texture memories;
TP2: TP2_DATA_DISTRIBUTE, receiving TM0_DATA ~ TM3_DATA [23:0] from the texture memories;
TP3: TP3_DATA_DISTRIBUTE, using TMmask[7:0] and SPmask[7:0];
TF: TEXTURE_FILTER in PP0 and PP1, producing TEXEL0[23:0] and TEXEL1[23:0])


[Fig. 3.3-6 : AAL Operation] (a MIPMAP texture with levels LOD0 ~ LOD3 whose texels are assigned to texture memories 0 ~ 3 in a repeating 2 x 2 pattern, 0 1 / 2 3; after LOD selection, the two texture addresses of PP0 and PP1 are about one texel apart by the definition of LOD. Texture address: 4 requests per PP; spatial aligner: reduced to ~5 by removing spatially overlapped requests; temporal aligner: reduced to ~2.5 by removing requests that overlap with the previous ones; the remaining requests are assigned to the TMs)

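(rendered images: Original Image, SQRT, MAXall, MAXx (proposed), and MAXy)

$$
\begin{aligned}
LOD_{SQRT} &= \max\left(\sqrt{s_x^2 + t_x^2},\ \sqrt{s_y^2 + t_y^2}\right) \\
LOD_{MAXall} &= \max(s_x, t_x, s_y, t_y) \\
LOD_{MAXx} &= \max(s_x, t_x) \\
LOD_{MAXy} &= \max(s_y, t_y)
\end{aligned}
$$

[Fig. 3.3-7 : LOD Calculation Method]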
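The LOD calculation methods compared above can be sketched as follows. The sketch assumes that s_x, t_x, s_y, t_y are non-negative texel steps per pixel in the x and y directions, and that the integer MIPMAP level is the floor of the base-2 logarithm of the selected value (the usual MIPMAP convention, assumed here rather than taken from the text). All function names are hypothetical.

```python
import math


def lod_sqrt(sx, tx, sy, ty):
    """Reference method: needs square-root logic and the y-direction values."""
    return max(math.hypot(sx, tx), math.hypot(sy, ty))


def lod_maxall(sx, tx, sy, ty):
    """Maximum over all four steps; still needs the y-direction values."""
    return max(sx, tx, sy, ty)


def lod_maxx(sx, tx):
    """Maximum over the x-direction steps only; no square root, no y registers."""
    return max(sx, tx)


def integer_level(rho):
    """Integer MIPMAP level for a texel-step magnitude rho (assumed convention)."""
    return 0 if rho <= 1.0 else math.floor(math.log2(rho))


steps = (3.0, 4.0, 1.0, 1.0)
print(lod_sqrt(*steps), lod_maxall(*steps), lod_maxx(*steps[:2]))
```

For this example the three methods give 5.0, 4.0, and 4.0, which all map to the same integer level.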



In fig. 3.3-5, the spatial aligner (TA2_SPATIAL_ALIGN) compares the texture addresses of PP0 (PP0UV0 ~ PP0UV3) with those of PP1 (PP1UV0 ~ PP1UV3), setting the overlapped position flag (OPF) on SA0 ~ SA3. Then, the temporal aligner (TA2_TEMPORAL_ALIGN) compares the current texture requests (PP0UV0 ~ PP0UV3, PP1UV0 ~ PP1UV3) with the previous ones, which are stored in registers, setting the OPF on TA0 ~ TA7. Finally, the mask generation block (TA2_MASK_GEN) merges the OPFs from the spatial and temporal aligners and generates the bit-masks (SPmask, TMmask), which indicate the texel positions to be newly fetched from the texture memories. The simulation results show that the average number of mask bits is 5 for SPmask and 2.5 for TMmask. Fig. 3.3-8 shows the circuit diagrams of the spatial aligner and the temporal aligner. The temporal aligner is basically similar to an 8-entry fully-associative L1 texture cache [33]. In the proposed architecture, however, texels are simply stored in the pipeline latches instead of a power-consuming SRAM [26]. Also, the caching concept is extended to dual pixel processors in this work.
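The behavior of the two aligners can be illustrated with a small software model. This is a sketch under simplifying assumptions: texel addresses are modeled as hypothetical (u, v) tuples, a mask entry of True means "fetch this texel from a texture memory", and the function name is made up for the example.

```python
def aal_masks(pp0, pp1, previous):
    """Return (sp_mask, tm_mask) for the 8 current texel requests.

    pp0, pp1 : four texel addresses each (current cycle)
    previous : the eight texel addresses of the previous cycle
    """
    current = list(pp0) + list(pp1)
    # Spatial alignment: a PP1 request that equals a PP0 request is dropped.
    sp_mask = [True] * len(pp0) + [addr not in pp0 for addr in pp1]
    # Temporal alignment: a request already held from the previous cycle is dropped.
    tm_mask = [need and addr not in previous
               for need, addr in zip(sp_mask, current)]
    return sp_mask, tm_mask


pp0 = [(10, 20), (11, 20), (10, 21), (11, 21)]
pp1 = [(11, 20), (12, 20), (11, 21), (12, 21)]
prev = [(8, 20), (9, 20), (8, 21), (9, 21),
        (9, 20), (10, 20), (9, 21), (10, 21)]

sp_mask, tm_mask = aal_masks(pp0, pp1, prev)
print(sum(sp_mask), sum(tm_mask))
```

In this example the eight requests shrink to six after spatial alignment and to four after temporal alignment.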

Although the average number of texture memories activated per cycle can be reduced to 2.5 through the operation of the spatial and temporal aligners, the maximum number is still 8. In this implementation, a texture image is stored across 4 texture memories as shown in fig. 3.3-6, where adjacent texels are assigned to different texture memories. Texture memory conflicts are scheduled in a round-robin manner by TP1_BANK_AGGREGATION. When the same texture memory is accessed more than once, this block sets TP1_MULTI to 1, extending the operation cycles. Then the TP2 and TP3 stages redistribute the texel data from the 4 texture memories to the 8 corresponding positions, feeding 4 texels per PP for bilinear texture filtering. Although the number of texture prefetch stages (TP1, TP2, and TP3) is optimized to 3 for this implementation, where the latency of the texture DRAM is 1, it can easily be scaled up for longer-latency DRAM, such as off-chip texture memory, by simply inserting more pipeline latches at TP2.
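The effect of texture memory conflicts on the cycle count can be sketched as follows. The sketch assumes a 2 x 2 interleaving of texels over the four texture memories (bank = low bit of u plus twice the low bit of v) and assumes that each texture memory serves one request per cycle; both the mapping function and the names are hypothetical.

```python
from collections import Counter


def bank_of(u, v):
    """Texture memory index under an assumed 2 x 2 interleaved assignment."""
    return (u & 1) | ((v & 1) << 1)


def cycles_needed(fetches):
    """Cycles to serve the remaining fetches, one request per bank per cycle."""
    per_bank = Counter(bank_of(u, v) for u, v in fetches)
    return max(per_bank.values(), default=1)


no_conflict = [(11, 20), (11, 21), (12, 20), (12, 21)]
conflict = [(11, 20), (13, 20), (12, 21)]

print(cycles_needed(no_conflict), cycles_needed(conflict))
```

Four mutually adjacent texels fall into four different banks and finish in one cycle, while two fetches that map to the same bank extend the operation to two cycles.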

Vector   PP1           LOD           Spatial Aligner     Temporal Aligner    Cycle
         Utilization   Change Rate   Remaining Texels    Remaining Texels    Overhead
A        56.71%        8.28%         4.82                2.30                1.09
B        56.71%        12.74%        5.27                3.64                1.21
C        97.42%        0.00%         5.35                2.78                1.03
D        78.10%        No Texture    No Texture          No Texture          No Texture

[Table 3.3-1 : Simulation Results of Texturing Unit]

(a) Spatial Aligner (a 4 x 4 array of equality comparators, "=?", between the four texel requests from PP0, PP0UV0[15:0] ~ PP0UV3[15:0], and the four texel requests from PP1, producing SA0 ~ SA3)

(b) Temporal Aligner (the eight current texel requests, PP0UV0[15:0] ~ PP0UV3[15:0] and PP1UV0[15:0] ~ PP1UV3[15:0], are each compared by eight "=?" comparators with the previous requests latched on TEXclk; each group of comparator outputs feeds a block labeled "Bitwise AND", together with SPmask[7] ~ SPmask[0] and the LOD comparison, producing TMmask0 ~ TMmask7)

[Fig. 3.3-8 : Spatial Aligner and Temporal Aligner]

Fig. 3.3-9 and fig. 3.3-10 show the analysis results of the AAL. Fig. 3.3-9 displays how the number of texture requests is reduced by the AAL as the frames go on. Fig. 3.3-10 summarizes how the number of texture memories affects the power consumption and the cycle time. Since the AAL limits the number of texture requests to 2.5 ~ 3.5 on average, the power consumption saturates at a certain level. Also, more texture memories mean that less time is needed to fetch the same amount of data, at the cost of more area. Therefore, I set the number of texture memories to four, so as to minimize the energy consumption, which is the product of those two factors.

[Fig. 3.3-9 : Remaining Requests : Frame by Frame] (number of remaining requests versus frame number, 0 ~ 100, for the original requests, after the spatial aligner, and after the temporal aligner, for vectors A and B)

[Fig. 3.3-10 : AAL Analysis Results : Power and Time] (power and time versus number of texture memories, for vectors A, B, and C)


Fig. 3.3-11(a) shows the power consumption required to activate the texture memories, which is proportional to the number of texture memories activated per cycle. Fig. 3.3-11(b) shows the number of cycles required to draw two bilinear-filtered pixels, which is proportional to the time required to complete the drawing of a scene. The average number of cycles with 4 TMs and AAL increases slightly, to 1.1. Therefore, the energy consumption required to access the texture memory, which is the product of time and power, can be reduced by 66% on average, as illustrated in fig. 3.3-11(c). Although a single-PP architecture may seem to consume less power than the AAL architecture, it needs much more time to finish the drawing. Therefore, this architecture, 2 PPs with AAL, is better suited to mobile platforms driven by the limited energy of a battery.
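The 66% figure follows from multiplying the two proportional quantities. The following sketch uses the values shown in fig. 3.3-11 (number of texture memories activated and number of cycles per two pixels); the variable names are hypothetical.

```python
# (texture memories activated per cycle, cycles per two pixels)
configs = {
    "8 TMs, 2 PPs (No AAL)": (8.0, 1.0),
    "4 TMs, 1 PP (No AAL)": (4.0, 2.0),
    "4 TMs, 2 PPs + AAL": (2.5, 1.1),
}

# Energy is proportional to power x time.
energy = {name: tms * cycles for name, (tms, cycles) in configs.items()}
baseline = energy["8 TMs, 2 PPs (No AAL)"]
normalized = {name: e / baseline for name, e in energy.items()}

for name, value in normalized.items():
    print(f"{name}: {value:.2f}")
```

The single-PP configuration halves the activated memories but doubles the cycles, so its energy equals the baseline, whereas the AAL configuration ends at about 0.34.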

(a) Number of Texture Memories Activated (= Power): 8 for 8 TMs, 2 PPs (No AAL); 4 for 4 TMs, 1 PP (No AAL); 2.5 for 4 TMs, 2 PPs + AAL

(b) Number of Cycles (= Time): 1, 2, and 1.1, respectively

(c) Energy used in the Texture Memories (Normalized): 1, 1, and 0.34, respectively (66% reduction)

$$
\begin{aligned}
POWER_{TM\,Access} &\propto \text{Number of Texture Memories Activated} \\
Time_{TM\,Access} &\propto \text{Number of Cycles} \\
Energy_{TM\,Access} &= POWER_{TM\,Access} \times Time_{TM\,Access} \\
&= \text{Number of Texture Memories Activated} \times \text{Number of Cycles}
\end{aligned}
$$

[Fig. 3.3-11 : Energy-Efficiency of AAL]


3.4 Memory Programmer : Post Processing Unit

For real-time special rendering effects, the Memory Programmer (MP) post-processes the rendered pixels while transferring them to the display controller, in parallel with the SlimShader. It contains crossbar switches for front/back selection and a SIMD-parallel datapath controlled by its own 16-bit commands, as shown in fig. 3.4-1. Since each memory has separate read and write buses, the total bit-width of the crossbar is 160. The LCD interface reads out the pixels from the front buffer through the SIMD datapath and writes them back to the buffer, while the SlimShader performs rendering operations with the back buffer. The post-processing does not slow down the pixel throughput because the MP processes one pixel per LCD clock period. Special effects such as full-scene antialiasing, motion blur, and fog can be programmed by software and downloaded to the command registers. Full-scene antialiasing (FSAA) can be performed by 2x1 filtering, and linear fog is calculated with the help of the double depth buffers. The following equations are examples of post-filters that can be evaluated by the SIMD datapath. Fig. 3.4-2 shows the block diagram of the SIMD datapath, and fig. 3.4-3 shows examples of special effects and their assembly codes.

FSAA : OUT[x][y] = (a*FB[x][y] + b*FB[x+1][y])/c

(for example, a=3, b=1, c=4)

Fog : OUT[x][y] = a*(FB[x][y]-color) + color

( a=(ZB[x][y]+bias)/SCREEN_DEPTH, saturated to 0<a<1 )
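The two post-filters can be sketched in software for a single color channel of one scan line. This is an illustration only: the fog coefficient is read as (ZB + bias) / SCREEN_DEPTH, the handling of the last pixel of a line and the 16-bit depth range are assumptions, and the function names are hypothetical.

```python
def fsaa_2x1(row, a=3, b=1, c=4):
    """2x1 filter: OUT[x] = (a*FB[x] + b*FB[x+1]) / c, with integer division."""
    out = []
    for x, pixel in enumerate(row):
        # Assumption: the last pixel of the line is filtered with itself.
        right = row[x + 1] if x + 1 < len(row) else pixel
        out.append((a * pixel + b * right) // c)
    return out


def linear_fog(pixel, z, fog_color, bias=0, screen_depth=65535):
    """Fog: OUT = a*(FB - color) + color, with a saturated to the range 0..1."""
    a = min(max((z + bias) / screen_depth, 0.0), 1.0)
    return a * (pixel - fog_color) + fog_color


print(fsaa_2x1([0, 100, 200]))
print(linear_fog(200, 65535, 50), linear_fog(200, 0, 50))
```

With a = 1 the pixel keeps its rendered color, and with a = 0 it takes the fog color, which matches the form of the fog equation.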

[Fig. 3.4-1 : Memory Programmer] (depth buffers DB A0, A1, B0, B1 and frame buffers FB A0, A1, B0, B1 connected through 160b crossbars to the SlimShader and to the SIMD datapath; the SIMD-parallel datapath is driven by 16b commands from the command registers and feeds the LCD interface, which produces the 24b display output)

[Fig. 3.4-2 : SIMD-parallel Datapath] (input registers for FB[x], FB[x+1] (24b) and ZB[x], ZB[x+1] (16b); a Z+e adder followed by a LUT with saturation; operators a*Y, (A+b*B)/d, and f(X-c)+c on the RGB values; masks; a constant register holding a, b, c, d, e, f, and mask; and an output register for pixel out / FB write)


FSAA Motion Blur

Fog Others

MOVR a 0x001;MOVR b 0x011;MOVR d 0x100;DISB LUT;MASK 0xFFFFFF;CLRZ;CLRC;SWAP;

MOVR a 0x001;MOVR b 0x010;MOVR d 0x011;DISB LUT;MASK 0xFFFFFF;CLRZ;CLRC;SWAP;

MOVR a 0x000;MOVR b 0x001;MOVR d 0x010;DISB LUT;MASK 0xFFFFFF;CLRZ;SWAP;

MOVR a 0x000;MOVR b 0x011;MOVR d 0x100;DISB LUT;MASK 0xFFFFFF;CLRZ;SWAP;

MOVR a 0x000;MOVR b 0x001;MOVR c 0xFFFFFF;MOVR d 0x001;MOVR e 0x0000;(MOVR e 0x9C40;)ENAB LUT;ENAB POSSAT;MASK 0xFFFFFF;CLRZ;CLRC;SWAP;

MOVR a 0x000;MOVR b 0x001;(MOVR b 0x011);MOVR d 0x001;DISB LUT;MASK 0xFF0000;CLRZ;SWAP;

MOVR a 0x000;MOVR b 0x001;MOVR d 0x001;DISB LUT;MASK 0xCCCCCC;CLRZ;CLRC;SWAP;

[Fig. 3.4-3 : Examples of Special Effects]



3.5 Memory Access

To cover the 256 x 256 screen resolution, which matches that of most current cell-phones, 4 frame buffers and 4 depth buffers with zero latency are used in the chip. Also, 4 texture memories amounting to 24Mb store the MIPMAP texture images for 3D gaming applications. Fig. 3.5-1 and fig. 3.5-2 illustrate the memory configuration and access timing, respectively. The latency, cycle time, and bus configuration are optimized for each memory. The frame and depth buffers are optimized for single-cycle read-modify-write data transactions without latency, and the texture memory is optimized for continuous read operations, allowing a one-cycle latency in exchange for larger capacity. Also, the memories are mapped differently for better performance: vertical stripe assignment for the frame/depth buffers, and 2D interleaved assignment for the texture memory.

[Fig. 3.5-1 : Memory Configuration] (the SlimShader and the Memory Programmer are connected to Depth Buffers A0, A1, B0, B1 of 512kb each, Frame Buffers A0, A1, B0, B1 of 768kb each, and Texture Memories 0 ~ 3 of 6Mb each)

[Fig. 3.5-2 : Memory Access Timing]
(a) Depth Buffer Timing (REclk, MEMclk, and the DRAM-internal CMD, READ, and WRITE phases within a 20ns cycle; interpolation and Z-comparison in the SlimShader stages; separate READ and WRITE buses)
(b) Frame Buffer Timing (same cycle structure, with texture blending and alpha blending in the SlimShader stages)
(c) Texture Memory Timing (continuous READ commands issued at TP1, with the data returned at TP2 and TP3 for data alignment; TP2 is the stage where the TM latency can be adjusted)


Although it looks simple, satisfying the pipeline timing is a big challenge in terms of DRAM design. The cycle time (TRC) of the embedded DRAM must be less than 20ns, while commodity SDRAMs work at 65ns or more. The timing budget of the frame and depth buffers is even stricter, because the read data must be written back to the same address within a cycle for an efficient RMW (Read-Modify-Write) transaction. To support RMW operation at 50MHz, the timing of the core used in the 256Mb SDRAM is modified and optimized to 20ns by reconfiguring each cell array to 256 rows x 192 bitline pairs, as shown in fig. 3.5-3(a). Also, the atomic commands defined by the standard SDRAM, ACT (Row-Active), PCG (Precharge), READ (Column-Active and Read), and WRITE (Column-Active and Write), are packed into a single instruction that the DRAM decodes internally. For example, a single RMW instruction is the composition of the ACT, READ, HOLD, WRITE, and PCG operations. Using a single instruction also helps reduce the cycle time, because fetching each command individually with multiplexed row and column addresses requires extra timing margin to preserve the setup and hold times of each command. Splitting the cell MAT and adding extra circuits for a faster cycle time and lower power consumption, however, bring an area overhead to the embedded DRAM. Fig. 3.5-3(b) shows the total area of each DRAM normalized by its capacity, which reflects the area overhead, or cell efficiency. As shown in this graph, the area/bit of the 512kb DRAM used in the depth buffer is 3.8 times larger than that of the 256Mb SDRAM, although they use the same 0.16µm cell structure. The 6Mb DRAM, optimized for the texture memory, which reads data continuously without requiring RMW operations, shows better area efficiency. In total, the embedded DRAM occupies 1.8 times more area per bit than the standard 256Mb SDRAM.

(a) Critical-Path Timing (TRC in ns versus the number of rows, 64 ~ 1024, and the number of bitline pairs, 768 ~ 48)

TRC = TACT + TROW + TCOL + THOLD + TWRITE + TPCG

TRC : Random Row Cycle
TACT : Address Decoding
TROW : Wordline Activation
TCOL : Read-out Path Activation
THOLD : 5ns Hold for Modify
TWRITE : Write-Driver Activation
TPCG : Bitline Pre-Charge

(b) Area Overhead (area/bit in um2/kbit, split into cell and periphery, for the 512kb depth-buffer DRAM, the 768kb frame-buffer DRAM, the 6Mb texture-memory DRAM, the 29Mb eDRAM in total, and the 256Mb SDRAM; bar labels: 3.8x, 3.2x, 1.5x, and 1.8x; percentage labels: 20.2%, 23.8%, 46.1%, 38.3%, and 65%)

[Fig. 3.5-3 : Characteristics of Embedded DRAM]


In contrast, the distributed architecture of the embedded DRAM saves run-time power consumption, since only the necessary memories are selectively activated out of 12. In this architecture, the overall power of the rendering memories per two pixels can be written as follows:

$$Power_{Memory} = (1 + \alpha) \times \left[ Power_{ZB} + \beta \times Power_{FB} + \beta \times \gamma \times Power_{TM} \right]$$

where $Power_{FB}$, $Power_{ZB}$ (written PowerDB below), and $Power_{TM}$ are the power consumption of the frame buffer, depth buffer, and texture memory, and

α = PP1 utilization, β = Depth-gated ratio, γ = Texture-access ratio

Here, α depends on the size and the shape of the triangle, and it tends to decrease when the triangles get smaller, as table 3.3-1 shows. β depends on the depth complexity, and it can be reduced by extensive clock gating according to the depth-comparison results, as described in section 3.3-2. In a scene with a depth complexity of three, 7/18 of the pixels fail the Z-test [46]. Finally, γ is reduced by the Address Alignment Logic, as described in section 3.3-3.

Based on the power consumption of each DRAM (PowerFB = 18mW, PowerDB = 12.5mW, PowerTM = 18mW), PowerMemory can be illustrated as in fig. 3.5-4(a). More power is saved as the triangles get smaller and the scenes get more complex, which is likely in gaming applications on the small LCD screens of mobile devices. When α = 0.5, β = 1/3, γ = 2.5 (highly probable values in gaming applications), the power can be reduced by 65% compared with the unified memory architecture, where all memories are activated together. Fig. 3.5-4(b) shows the normalized energy consumption until the drawing job is finished, which scales with β since the (1+α) term is cancelled out by the cycle time, as follows:

Let N = total number of pixels to be drawn.

Then, the time required to finish the drawing is

$$Time_{Total} = \frac{N}{1 + \alpha}$$

Therefore, the energy consumption to finish the drawing is

$$
\begin{aligned}
Energy_{Total} &= Time_{Total} \times Power_{Total} \\
&= \frac{N}{1 + \alpha} \times (1 + \alpha) \times \left[ Power_{ZB} + \beta \times Power_{FB} + \beta \times \gamma \times Power_{TM} \right] \\
&= N \times \left[ Power_{ZB} + \beta \times \left( Power_{FB} + \gamma \times Power_{TM} \right) \right]
\end{aligned}
$$

As shown in fig. 3.5-4(b), the distributed memory system saves more energy as 3D applications get more complex in depth: a 63% reduction for α = 0.5, β = 1/3, and γ = 2.5.
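The power and energy model of the rendering memories can be evaluated numerically. The following sketch uses the per-DRAM power values quoted in the text and takes the PP1 utilization as the fraction 0.5; the function names are hypothetical, and the energy is expressed in arbitrary units (mW times pixel slots).

```python
P_FB, P_ZB, P_TM = 18.0, 12.5, 18.0   # mW: frame buffer, depth buffer, texture memory


def memory_power(alpha, beta, gamma):
    """Power of the rendering memories per two pixels, in mW."""
    return (1 + alpha) * (P_ZB + beta * P_FB + beta * gamma * P_TM)


def memory_energy(n_pixels, alpha, beta, gamma):
    """Energy to finish drawing n_pixels: total time x total power."""
    total_time = n_pixels / (1 + alpha)
    return total_time * memory_power(alpha, beta, gamma)


print(memory_power(0.5, 1 / 3, 2.5))            # about 50.25 mW
print(memory_energy(1000, 0.2, 1 / 3, 2.5))     # same value for any alpha
print(memory_energy(1000, 0.9, 1 / 3, 2.5))
```

The energy does not change with α, because a higher PP1 utilization raises the power and shortens the drawing time by the same factor.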

(a) Power Consumption (power consumption in mW versus PP1 utilization (α), for β = 1/1, 1/2, 1/3, and 1/4; 65% power reduction is marked)

(b) Energy Consumption (normalized energy versus PP1 utilization (α), for β = 1/1, 1/2, 1/3, and 1/4; 63% energy reduction is marked)

[Fig. 3.5-4 : Memory Access]


Also, the memories can be selectively refreshed for data retention in standby modes by power-control instructions, as shown in fig. 3.5-4: PHLD (Hold), PIDL (Idle), PSLP (Sleep), and POFF (Off). PHLD can be used to hold the datapath and memory temporarily during normal rendering operations, while waiting for a geometry operation. All memories are refreshed in this mode. PIDL turns off the rendering clock but refreshes all graphics memories. In PSLP mode, only the texture memory is refreshed, to hold the texture images, since they may have been downloaded from the wireless network. Finally, POFF turns off all operations.
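The refresh behavior of the four standby modes can be summarized as a small lookup table. This is only a restatement of the description above in code form; the dictionary layout and the helper name are hypothetical, and the sketch models data retention only, not clocking.

```python
ALL_MEMORIES = {"depth_buffer", "frame_buffer", "texture_memory"}

# Memories that are still refreshed in each standby mode, as described in the text.
REFRESHED = {
    "PHLD": set(ALL_MEMORIES),       # hold: all memories refreshed
    "PIDL": set(ALL_MEMORIES),       # idle: rendering clock off, all refreshed
    "PSLP": {"texture_memory"},      # sleep: only texture images are kept
    "POFF": set(),                   # off: nothing is refreshed
}


def retains_data(mode, memory):
    """True if the given memory keeps its contents in the given mode."""
    return memory in REFRESHED[mode]


for mode in ("PHLD", "PIDL", "PSLP", "POFF"):
    print(mode, sorted(REFRESHED[mode]))
```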

[Fig. 3.5-4 : Standby Power Modes] (four panels, HOLD, IDLE, SLEEP, and OFF, each showing the depth buffer, frame buffer, texture memory, and rendering logic)

Chapter 4 Chip Implementation

4.1 Process Technology

To implement a mobile 3D graphics chip, previous chips integrated DRAM using the EML technology [6-8]. However, the fabrication process costs too much, because the logic must be built with transistors separate from those of the DRAM, requiring more mask layers [13]. Therefore, the EML process has seldom been used on low-cost mobile platforms. In this work, I implemented the SoC in a pure DRAM process, instead of EML, to reduce the fabrication cost. The logic components, SRAM, and analog blocks are drawn with the design rules of the peripheral transistors of the DRAM. However, the DRAM process has some drawbacks for logic design: 1) slower transistors, and 2) fewer metal layers. As summarized in table 4.1-1, the process characteristics of the 0.16um pure DRAM process are even poorer than those of the 0.18um merged DRAM process. Although the transistor performance does not seem to satisfy the requirements of a high-end microprocessor, the high-speed state machines and interface circuitry of RAMBUS-DRAM and DDR-SDRAM have been successfully implemented with the peripheral transistors of the DRAM process. Therefore, I implemented the chip in the pure DRAM process, successfully achieving 133MHz and 50MHz for the RISC processor and the 3D rendering engine, respectively. The negligible sub-threshold leakage current of the DRAM process also helps reduce the standby current, which has become a critical issue for battery-driven devices. Since the original 256Mb SDRAM process was not intended to support logic synthesis, a Verilog-synthesis methodology for the SDRAM process had to be set up by drawing and characterizing 73 standard cells and porting them to various CAD tools. Table 4.1-2 summarizes the transistor and metal usage. Since M0 is resistive, it is not used for global routing, as shown in fig. 4.1-1.

[Fig. 4.1-1 : Process Technology] (cross-section of the DRAM core, with cell capacitor and bitline, and of the periphery, with metal layers M0 (W) and M1 ~ M3 (Al); the resistive M0 is used for standard-cell routing, and the upper layers are used for global routing)

Process               Cell Tr.      Logic Tr.     Vdd   M0 Width   Rs (ohm/sq)                Metal Layers
                      Ldrawn (um)   Ldrawn (um)   (V)   (um)
0.16um Pure-DRAM      0.16          0.28          2.5   0.35       M0: 1.8, M1~M3: ~0.05      M0: W (Bitline), M1~M3: Al
0.18um Merged-DRAM    0.18          0.18          1.8   0.23       M0: 2, M1~M5: ~0.05        M0: W (Bitline), M1~M5: Al

[Table 4.1-1 : Process Comparison of Pure-DRAM and Merged-DRAM]

Applied Blocks Transistors Metal 0 (W) M1 M2 M3

Standard Cell RoutingAll Synthesizable Logic Global Routing

Dual-Port SRAM Not Used RISC Cache

SRAM I/O

Analog Circuits

DRAM Periphery

Block Routing

Not Used All DRAM DRAM Core Bitline Wordline DBLine I/O Top Routing Not Used Not Used Horizontal Vertical Horizontal

[Table 4.1-2 : Transistor and Metal Usage]



4.2 Chip Fabrication

The graphics SoC is implemented in a typical 0.16um DRAM process with 1-W, 3-Al metal layers, and its die area is 121mm2. The chip contains 1M logic transistors, 29Mb DRAM, 72kb SRAM, and a PLL. The top level is verified in Verilog, where custom blocks such as the DRAM, SRAM, and PLL are characterized and ported. External interfaces such as the boot-up ROM and SRAM are also modeled to emulate the board-level environment. Compiled ARM9 codes capable of controlling the full functionality are executed, and the results are compared with 3D-Glamor. Finally, the GDS file extracted from the Apollo P&R tool is converted to schematics and simulated with the transistor-level simulator, EPIC, running worst-case vectors, as illustrated in fig. 4.2-1. Normally, this last step is skipped in ASIC design, since simulating transistor-level netlists takes a huge amount of time. However, I performed this low-level simulation to minimize the uncertainties and mistakes in the setup process, because this is the first attempt at designing an SoC in a pure DRAM process. Fig. 4.2-2 shows the die photograph, and table 4.2-1 summarizes the physical characteristics. The first silicon was packaged and tested as shown in fig. 4.2-3, where the first waveform (fig. 4.2-4) appeared after the built-in self-calibrating operations. As shown in the measured waveforms in fig. 4.2-5, the transition from slow mode to fast mode completes quickly without any hazard.


[Fig. 4.2-1 : Implementation Flow]
(Flow diagram. Verilog soft-IP (3DRE, RISC, BEQ, clock control, MEM I/F) and
hard-IP (DRAM, SRAM, PLL) are combined with board models (SRAM, ROM) and
compiled ARM9 application code (C/ASM) built on MobileGL and the RAMP-IV
library. Functional and timing verification runs at PLI, RTL, and gate level
against 3D-Glamor rendering data. The P&R netlist and GDS, built from the cell,
pad, and GDS libraries, are simulated in EPIC with SPICE model parameters and
worst-case vectors before tapeout.)

[Fig. 4.2-2 : Die Photograph]



Process Technology : 0.16um CMOS DRAM with 1-W, 3-Al (256Mb Compatible)
Power Supplies :
    I/O : 3.3V (VDQ : 3.3V, VSQ : 0V)
    Internal : 2.5V (VDD : 2.5V, VSS : 0V)
    Digital : VPP : 3.5V, VINT : 2.0V, VCP : 1.0V, VBLP : 1.0V, VBB : -0.8V
    Analog : VCCA : 2.5V, VCCVCO : 2.5V, GNDA : 0V
Clock Frequency (RISC,BEQ/3DRE,DRAM) :
    Fast : 132MHz / 33MHz
    Normal : 66MHz / 16.5MHz
    Slow : 33MHz / 8.25MHz
Chip Size : 11mm x 11mm (including I/O Pad)
Transistor Counts :
    1M Logic
    29Mbit DRAM
    72kbit SRAM (9kByte)
Analog Blocks :
    Programmable PLL
    2.4nF Decoupling Capacitor
Power Consumption : 210mW
Package : 240pin QFP

[Table 4.2-1 : Physical Characteristics]

Fabricated Chip

[Fig. 4.2-3 : Device Under Test]

[Fig. 4.2-4 : Measured Waveform : After Built-in Self-Calibrating Test]
(Oscilloscope capture of RISCclk at 25MHz and REclk at 6.25MHz.)

[Fig. 4.2-5 : Measured Waveform : Mode Change]
(Oscilloscope capture of the RISC clock and 3DRE clock switching from slow
mode, 33MHz / 8.25MHz, to fast mode, 132MHz / 33MHz. V: 2V/div, H: 33ns/div.)


4.3 Power Consumption

Fig. 4.3-1 shows the composition of the power consumption for various
applications. The implemented graphics SoC consumes 210mW in continuous
calculation of bilinear texture-mapped and antialiased 3D graphics applications
in FAST mode (33MHz REclk and 132MHz RISCclk). The embedded DRAM drastically
reduces the power consumption since the external I/Os for 3D rendering are
eliminated, and an additional 22% reduction is obtained by AAL (Address
Alignment Logic) and DFCG (Depth-First Clock-Gating). For point-sampled
texturing, the power drops to 185mW. Textured 3D rendering consumes 110mW in
NORMAL mode (16.5MHz REclk and 66MHz RISCclk) and 65mW in SLOW mode (8.25MHz
REclk and 33MHz RISCclk). Non-textured (but Gouraud-shaded) 3D applications
consume 145mW in FAST mode. The power consumption of MP is about 5mW, which is
low because it is synchronized with the slow LCD clock. The power consumption
of each block is summarized in table 4.3-1.
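The 22% figure can be checked against the totals reported for this chip. The
sketch below is only an arithmetic illustration: it takes the 270mW total
(texture mapping without AAL and DFCG) and the 210mW total from table 4.3-1 and
fig. 4.3-1, and the per-mode textured rendering powers from the text. The
function name is hypothetical.

```python
# Totals at FAST mode, in mW, as reported in table 4.3-1 and fig. 4.3-1.
no_aal_dfcg_mw = 270.0   # texture mapping without AAL and DFCG
bilinear_mw = 210.0      # bilinear MIPMAP texturing with AAL and DFCG


def reduction_percent(before_mw, after_mw):
    """Relative power saving in percent."""
    return 100.0 * (before_mw - after_mw) / before_mw


saving = reduction_percent(no_aal_dfcg_mw, bilinear_mw)
print(f"AAL + DFCG saving: {saving:.1f}%")

# Textured rendering: (RISC clock in MHz, total power in mW) per mode.
modes = {"FAST": (132.0, 210.0), "NORMAL": (66.0, 110.0), "SLOW": (33.0, 65.0)}
for name, (risc_mhz, power_mw) in modes.items():
    print(f"{name}: {power_mw / risc_mhz:.2f} mW per RISC MHz")
```

The power per MHz rises as the clock drops, which suggests that part of the
power does not scale with frequency.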

Block            | Conventional Texture | Bilinear Texture  | Point Sampled Texture | No Texture
3DRE (with DRAM) | 200mW                | 140mW             | 115mW                 | 80mW
  SS+MP          | 77.6 (68.1/9.5)      | 58.14 (48.64/9.5) | 53.2 (43.7/9.5)       | 41.14 (31.64/9.5)
  ZB/FB/TM       | (21.9/26.25/72)      | (21.9/15.75/45)   | (21.9/15.75/20)       | (21.9/15.75/0)
RISC (with $)    | 54.8mW
BEQ (with SRAM)  | 3.5mW
PMU (with PLL)   | 5mW
Total            | 270mW                | 210mW             | 185mW                 | 145mW

[Table 4.3-1 Block Power Consumption]


[Fig. 4.3-1 : Power Consumption @ Fast Mode]
(Stacked bar chart of power consumption in mW for cases A to E, comparing a
conventional system with external I/O and DRAM against the implemented graphics
LSI with embedded DRAM. Annotated totals: 270mW, 210mW (22% reduction), 185mW,
and 145mW.
A : 3D Graphics with Texture Mapping (External Memory)
B : 3D Graphics with Texture Mapping (No AAL, No DFCG)
C : 3D Graphics with Texture Mapping (AAL, DFCG), Bilinear MIPMAP
D : 3D Graphics with Texture Mapping, Point-Sampled
E : 3D Graphics without Texture Mapping
Bar segments: Depth Buffer, Frame Buffer, Texture Memory, 3D Rendering Engine,
BEQ with DP-SRAM, RISC with Cache, Power Management Unit, Others including pad.)


4.4 Performance

4.4.1 Performance Summary

This chip can draw 24bit texture-mapped pixels at 66Mpixels/s and 264Mtexels/s
at 33MHz. Its features are summarized in table 4.4-1.
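As a rough consistency check, both fill rates follow from the 33MHz rendering
clock under two assumptions that are not stated at this point in the text: the
two pixel processors each produce one pixel per cycle, and each bilinear-filtered
pixel reads a 2x2 texel footprint. The variable names are illustrative only.

```python
re_clock_mhz = 33        # rendering-engine clock in FAST mode (from the text)
pixels_per_cycle = 2     # assumption: two pixel processors, one pixel each per cycle
texels_per_pixel = 4     # assumption: bilinear filtering reads a 2x2 texel footprint

pixel_rate = re_clock_mhz * pixels_per_cycle    # Mpixels/s
texel_rate = pixel_rate * texels_per_pixel      # Mtexels/s
print(f"{pixel_rate} Mpixels/s, {texel_rate} Mtexels/s")
```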

Target Applications :
    Realtime 2D/3D Graphics Pipeline
    MPEG-4 SP@L1 Decoding
    MP3 Audio Decoding
3D Rendering Performance :
    66Mpixels/s, 264Mtexels/s
    Hardware Triangle Setup Engine
    Perspective-Correct Bilinear MIPMAP Texturing
    Gouraud Shading, Alpha Blending, Texture Blending
    Antialiasing, Motion Blur, Fog, Special Effects
Embedded Graphics Memory :
    5Mb Double Depth / Frame Buffer (256 x 256 Resolution, 24bit Color, 16bit Depth)
    24Mb Texture Memory
3D Geometry Performance with Fixed-Point Graphics Library :
    1.04Mvertices/s : Model-View Transformation
    300kvertices/s : Model-View Transformation + Perspective Projection + 6-Side Clipping
    70kvertices/s : Model-View Transformation + Perspective Projection + 6-Side Clipping
        + Lighting (single directional light source from infinite viewer,
        one-sided, ambient + diffuse + specular highlighting)

[Table 4.4-1 : Chip Features]
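The 5Mb depth / frame buffer figure can be reproduced from the stated
resolution and bit depths. The sketch assumes that both buffers are
double-buffered and that 1Mb means 2^20 bits; the helper name is hypothetical.

```python
def buffer_mbit(width, height, bits_per_pixel, copies=2):
    """Buffer size in Mbit (2**20 bits), for `copies` full-screen buffers."""
    return width * height * bits_per_pixel * copies / 2**20


frame_mbit = buffer_mbit(256, 256, 24)   # 24bit color, double-buffered
depth_mbit = buffer_mbit(256, 256, 16)   # 16bit depth, double-buffered
texture_mbit = 24                        # texture memory, from table 4.4-1

print(frame_mbit, depth_mbit, frame_mbit + depth_mbit + texture_mbit)
```

Under these assumptions the frame buffer takes 3Mb and the depth buffer 2Mb,
and together with the 24Mb texture memory they account for the 29Mb of
embedded DRAM.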



4.4.2 Performance Comparison

Fig. 4.4-1 and table 4.4-2 compare the performance of the proposed graphics SoC
with recently announced implementations [45] and previous work [6-8, 23]. The
geometry performance reaches up to 1Mvertices/s with the help of the MobileGL
library. The rendering engine also shows the highest fill rate, taking
advantage of local graphics memories and an energy-efficient texturing unit.

Fig. 4.4-2 shows the performance indices. The 3D graphics accelerators in PC
platforms draw at high rendering rates and perform many advanced rendering
functions, but they consume a great deal of power, more than several tens of
watts. The proposed graphics rendering engine for the mobile platform, by
contrast, has a slower rendering rate and performs restricted functions while
consuming less than 140mW (rendering power). Therefore, I propose new
performance indices that compare embedded 3D graphics rendering engines with
power consumption taken into account [7]. They are analogous to the well-known
MIPS/mW.

\[
\text{Performance of Mobile 3D Graphics} = \frac{\text{3D Rendering Speed}}{\text{Power Consumption}}
\]

The 3D rendering speed can be expressed in pixels/s (PxPS) or texels/s (TxPS).
Therefore, PxPS/mW stands for the pixel fill rate per 1mW of power, and TxPS/mW
describes the texture fill rate per 1mW. The pixel rate of this SoC is about
0.8-MPxPS/mW (MPxPS : Mpixels/s), which is 1.6 times greater than that of the
previous work. The texel rate is about 1.88-MTxPS/mW (MTxPS : Mtexels/s), which
is, to the best of my knowledge, the highest published for mobile handheld
devices. A PC graphics system [44] shows about 40kPxPS/mW and 80kTxPS/mW.
Although SONY shows better performance indices than this work, the advantage
comes mainly from the difference in operating voltage: 1.2V for SONY and 2.5V
for this work. If the voltage is taken into account, both architectures may
show similar indices.
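The indices can be recomputed from the fill rates and rendering powers listed
in table 4.4-2. The sketch assumes that the pixel index of this work pairs
66Mpixels/s with the 80mW non-textured rendering power and that the texel index
pairs 264Mtexels/s with the 140mW textured rendering power; this pairing is
inferred from the tabulated values rather than stated in the text.

```python
# (pixel rate in Mpixels/s, rendering power in mW), from table 4.4-2.
pixel_cases = {
    "RAMP-I": (40.0, 590.0),
    "RAMP-II": (70.0, 120.0),
    "RAMP-IV": (66.0, 80.0),   # assumption: power without texturing
}
texel_rate_mtexels = 264.0     # RAMP-IV texture fill rate
texturing_power_mw = 140.0     # RAMP-IV rendering power with texturing


def index(rate_m_per_s, power_mw):
    """Rendering speed per unit power, in M(pixels or texels)/s per mW."""
    return rate_m_per_s / power_mw


pixel_index = {name: index(r, p) for name, (r, p) in pixel_cases.items()}
texel_index = index(texel_rate_mtexels, texturing_power_mw)

for name, value in pixel_index.items():
    print(f"{name}: {value * 1000:.0f} kPxPS/mW")
print(f"RAMP-IV: {texel_index:.3f} MTxPS/mW")
```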

[Fig. 4.4-1 : Performance Comparison]
(Bar charts comparing S1D13721 (Seiko Epson / Futrek), Z-3D (Mitsubishi),
GShark+ (Sanshin), RAMP-IV (This Work), ATI2300, and SONY.
(a) Geometry Performance, in vertices/s, with a 3.7x annotation.
(b) Rendering Performance, with SONY assuming a 100% cache hit and an 11x
annotation.)


[Fig. 4.4-2 : Performance Indices]
(Bar charts comparing RAMP-I, RAMP-II, this work, Z3D, and PC graphics, with
bars annotated by supply voltage.
(a) Pixel Rate, in MPxPS/mW.
(b) Texel Rate, in MTxPS/mW.)


Item | RAMP-I [6] | RAMP-II [7] | Z3D [23] | RAMP-IV (This Work)
Maximum Shading Performance | 40Mpixels/sec | 70Mpixels/sec | 5.2Mpixels/sec | 66Mpixels/sec
Texture Fill Rate | 0 | 0 | 5.2Mtexels/sec (?) | 264Mtexels/sec
Screen Resolution | 256 x 32 | 256 x 256 | 132 x 176 | 256 x 256
Color Depth | 24bit True Color, Double Buffering | 24bit True Color, Double Buffering | 18bit Color | 24bit True Color, Double Buffering
Z Depth | 16bit | 16bit | 12bit | 16bit
Power Supply | 3.5V | 1.8V | 1.8V (?) | 2.5V
Power Consumption | 590mW | 120mW | 38mW | 140mW (Texturing), 80mW (No Texturing)
Process Technology | 0.35um EML | 0.18um EML | 0.18um Logic | 0.16um DRAM + M3
Area | 45mm2 | 24mm2 | 30mm2 | 44mm2
Embedded Memory | 512kb DRAM | 6Mb DRAM (4Mb FB, 2Mb ZB) | 2.3Mb SRAM (768Kb TM) | 29Mb DRAM (3Mb FB, 2Mb ZB, 24Mb TM)
Performance Indices | 68KPxPS/mW, No Texturing | 580KPxPS/mW, No Texturing | 100KPxPS/mW, 100KTxPS/mW | 825KPxPS/mW, 1.88MTxPS/mW
Shading Features | Gouraud Shading, Z-Buffered HSR, Alpha Blending | Gouraud Shading, Z-Buffered HSR, Alpha Blending | Gouraud Shading, HSR (method ?), Alpha Blending | Gouraud Shading, Z-Buffered HSR, Alpha Blending (No perf. degradation), Programmable Shading
Texturing Features | X | X | Bilinear Mapping | Bilinear MIPMAP, Texture Blending, Perspective Correct, LOD bias
Special Effects | X | X | Antialiasing, 2D Acceleration | Antialiasing (No perf. degradation), Programmable Shading, Motion Blur by Memory Programmer, 2D Acceleration
Geometry Engine | X | X | 185Ktris/sec SPP (FPU x 2, INT x 1) | 150Ktris/sec GPP (ARM9 + MAC + BEQ)
Software Support | X | X | Z3D-lib | MobileGL

No. of Logic Transistors : 220k Logic Transistors, 300k Logic Transistors, 150k Logic Gates

[Table 4.4-2 : Performance Comparison]


4.4.3 Performance of SlimShader with External SDRAM

Since the SlimShader is scalable, it can be ported to any platform regardless
of fabrication process. Although it shows the highest performance in an MDL
(Merged-DRAM and Logic) process with frame buffer, depth buffer, and texture
memory, it can also be integrated as an IP core into application processors
where the graphics data are stored in external SDRAM. Fig. 4.4-3 and table
4.4-3 compare the performance degradation of SlimShader in various
configurations (fig. 4.4-4), assuming the core runs at 50MHz and the SDRAM is
attached to graphics-dedicated memory ports. Fig. 4.4-3(a) and (b) show the
performance when 100MHz 32bit SDR-SDRAM and 100MHz 32bit DDR-SDRAM are attached
to the rendering engine, respectively. Although the performance drops to 20% of
its maximum in the worst case (without any graphics memories on chip, fig.
4.4-4(d)), it is still higher than that of MBX. When the rendering engine is
integrated with a texture cache and depth buffer (fig. 4.4-4(b)), the
performance with 32bit mobile DDR-SDRAM is even comparable to that of embedding
all memories (fig. 4.4-4(a)).

External Memory Interface (pixel fill rate, pixels/s)

Version                                 | Frequency       | SDR-SDRAM | SDR-SDRAM       | DDR-SDRAM | DDR-SDRAM
                                        |                 | PT Sample | Bilinear MIPMAP | PT Sample | Bilinear MIPMAP
(a) RAMP-IV MDL (All Embedded MEM)      | 50MHz @ 0.18um  | 100M      | 100M            | 100M      | 100M
(b) RAMP-IV Logic #1 (All External MEM) | 50MHz @ 0.18um  | 22M       | 19.3M           | 44.3M     | 41.3M
(c) RAMP-IV Logic #2 (T$ Embedded)      | 50MHz @ 0.18um  | 24.3M     | 23.8M           | 48.9M     | 47.7M
(d) RAMP-IV Logic #3 (T$ + ZB Embedded) | 50MHz @ 0.18um  | 47.9M     | 45.7M           | 96M       | 91.7M

[Table 4.4-3 : Rendering Performance with External SDRAM]
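The degradation levels quoted for the all-external configuration can be
recomputed from the tabulated fill rates. The sketch uses the bilinear MIPMAP
values for the configuration with all graphics memories external and the
100Mpixels/s all-embedded reference; the variable names are illustrative.

```python
embedded_mpixels = 100.0                       # all memories embedded (MDL)
all_external = {"SDR-SDRAM": 19.3, "DDR-SDRAM": 41.3}   # bilinear MIPMAP, Mpixels/s

degradation = {
    iface: 100.0 * (1.0 - rate / embedded_mpixels)
    for iface, rate in all_external.items()
}
for iface, percent in degradation.items():
    print(f"{iface}: {percent:.1f}% slower than the all-embedded configuration")
```

These values come out close to the 80% and 60% degradation levels marked in
fig. 4.4-3.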

[Fig. 4.4-3 : Rendering Performance with External SDRAM]
(Bar chart of pixel fill rate in Mpixels/s for (a) MDL, (b) Logic #1, (c) Logic
#2, and (d) Logic #3, with point-sampled and bilinear texturing on SDR-SDRAM
and DDR-SDRAM. Annotations mark 60% and 80% performance degradation and the
ARM-MBX level.)

[Fig. 4.4-4 : Configuration Example of External SDRAM]
(Block diagrams of the 3DCG-IP with frame buffer, depth buffer, texture memory,
and texture cache placed either on chip or in external SDRAM. Panels: (a) MDL,
(b) Logic #1, (c) Logic #2, (d) Logic #4.)


4.5 Appendix : Design Information

4.5.1 Area Information

Major Blocks                     | GDS  | Silicon (80% Shrunk)
SlimShader                       | 29.5 | 18.88
Texture Memory                   | 23.6 | 15.104
ARM9 Core with Cache Controller  | 16   | 10.24
Frame Buffer                     | 7.9  | 5.056
Instruction Cache                | 7.3  | 4.672
Data Cache                       | 7.3  | 4.672
Depth Buffer                     | 5.3  | 3.392
Polygon Buffer                   | 3.7  | 2.368
Power Management Unit            | 2.7  | 1.728
Bandwidth Equalizer              | 2.3  | 1.472

(mm2)
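The silicon column is consistent with an 80% linear shrink of the drawn GDS
dimensions, which scales area by 0.8 x 0.8 = 0.64. The short sketch below
checks this for a few rows; the function name is hypothetical.

```python
LINEAR_SHRINK = 0.8   # 80% shrink applied to each linear dimension


def silicon_area_mm2(gds_area_mm2, shrink=LINEAR_SHRINK):
    """Area after a linear shrink: both dimensions scale, so area scales by shrink**2."""
    return gds_area_mm2 * shrink ** 2


for block, gds in [("SlimShader", 29.5), ("Texture Memory", 23.6), ("Frame Buffer", 7.9)]:
    print(f"{block}: {gds} mm2 drawn -> {silicon_area_mm2(gds):.3f} mm2 on silicon")
```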

[Fig. 4.5-1 : Chip and Major Blocks]
(Chip layout with labeled blocks: SlimShader, Texture Memory, ARM9 Core with
Cache Controller, Frame Buffer, Instruction Cache, Data Cache, Depth Buffer,
Polygon Buffer, Power Management Unit, Bandwidth Equalizer, Memory Programmer.)


Pipe  | Combinational | Noncombinational | Total
IF    | 9.3           | 4.7              | 14
ID1   | 242           | 1,059            | 1,301
ID2   | 52            | 3,119            | 3,171
TS    | 42,243        | 3,435            | 45,678
EP    | 6,187         | 11,460           | 17,647
HS    | 11,103        | 2,947            | 14,050
PI    | 4,937         | 5,972            | 10,910
TA1   | 9,426         | 2,491            | 11,917
TA2   | 8,828         | 2,764            | 11,592
TP1   | 1,604         | 6,617            | 8,222
TP2   | 1,332         | 2,711            | 4,044
TP3   | 4,225         | 5,157            | 9,382
TF    | 11,269        | 3,814            | 15,084
PB    | 6,352         | 1,193            | 7,545
Total | 107,830       | 52,749           | 160,580

(gates)

[Fig. 4.5-2 : SlimShader Gate Counts - Pipeline]
(Bar chart of combinational, noncombinational, and total gate counts per
pipeline stage.)


Block           | Combinational | Noncombinational | Total
Interface       | 303.3         | 4182.7           | 4486
Triangle Setup  | 42,243        | 3,435            | 45,678
Edge Processor  | 17,290        | 14,407           | 31,697
Pixel Processor | 11,289        | 7,165            | 18,455
Texture Unit    | 36,684        | 23,554           | 60,241

(gates)

[Fig. 4.5-3 : SlimShader Gate Counts – Functional Block]
(Bar chart of combinational, noncombinational, and total gate counts per
functional block.)
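The per-block counts appear to be sums of the per-stage counts in fig. 4.5-2.
The stage-to-block grouping below is an assumption inferred from the numbers
(it is not given explicitly in the text); with it, the sums reproduce the
combinational and noncombinational columns of the block table.

```python
# (combinational, noncombinational) gate counts per pipeline stage, from fig. 4.5-2.
stages = {
    "IF": (9.3, 4.7), "ID1": (242, 1059), "ID2": (52, 3119),
    "TS": (42243, 3435), "EP": (6187, 11460), "HS": (11103, 2947),
    "PI": (4937, 5972), "TA1": (9426, 2491), "TA2": (8828, 2764),
    "TP1": (1604, 6617), "TP2": (1332, 2711), "TP3": (4225, 5157),
    "TF": (11269, 3814), "PB": (6352, 1193),
}

# Assumed grouping of stages into functional blocks (inferred from the sums).
blocks = {
    "Interface": ["IF", "ID1", "ID2"],
    "Triangle Setup": ["TS"],
    "Edge Processor": ["EP", "HS"],
    "Pixel Processor": ["PI", "PB"],
    "Texture Unit": ["TA1", "TA2", "TP1", "TP2", "TP3", "TF"],
}


def block_gates(stage_names):
    """Sum combinational and noncombinational gates over the given stages."""
    comb = sum(stages[name][0] for name in stage_names)
    noncomb = sum(stages[name][1] for name in stage_names)
    return comb, noncomb


for block, names in blocks.items():
    comb, noncomb = block_gates(names)
    print(f"{block}: {comb:,.1f} combinational, {noncomb:,.1f} noncombinational")
```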



4.5.2 Cell Utilization

Cell Name Utilization (%) Description

1 LND02D1 20,919 2-input NAND, 1x drive

2 LND02D2 13,418 2-input NAND, 2x drive

3 LMX21D1 11,102 2 > 1 Mux, 1x drive

4 LMFQTNB 8,524 d-enabled f/f, active-low enable, positive-edge, Q only (no reset)

5 LXR02D1 7,504 2-input XOR, 1x drive

6 LIN01D2 4,492 inverter, 2x drive

7 LAN02D1 4,436 2-input AND, 1x drive

8 LNR02D1 4,167 2-input NOR, 1x drive

9 LND12D1 3,658 2-input NAND with /A, 1x drive

10 LIN01D1 3,260 inverter, 1x drive



13 LAD01D1 2,619 1-bit full adder

14 LNR03D1 1,375 3-input NOR, 1x drive

15 LLANTNQ 1,166 d-latch active-high enabled, Q only

16 LIN01D4 978 inverter, 4x drive

17 LNI01D2 828 buffer, 2x drive

18 LXR02D2 775 2-input XOR, 2x drive

19 LOR02D1 767 2-input OR, 1x drive

20 LIN01D7 704 inverter, 7x drive

21 LIN01DA 697 inverter, 10x drive

22 LND02D4 657 2-input NAND, 4x drive

23 LDFNTNB 531 d-f/f positive-edge with Q, Qb (no reset)

24 LDFQTNC 478 d-f/f positive-edge with Q only (no reset), 2x drive

25 LDFBFNC 453 d-f/f negative-edge with set and clear, Q, Qb, 2x drive

26 LDFBFNB 393 d-f/f negative-edge with set and clear, Q, Qb

27 LNR02D2 195 2-input NOR, 2x drive

28 LNT01D1 166 tri-state buffer with active high enable, 1x drive


30 LMX21D4 140 2 > 1 Mux, 4x drive

31 LNI01DC 135 buffer, 20x drive


33 LNI01DD 124 buffer, 40x drive




37 LXN02D2 84 2-input XNOR, 2x drive



40 LMFNTNB 49 d-enabled f/f, active-low enable, positive-edge, Q and Qb (no reset)

41 LNI01DB 46 buffer, 15x drive


43 LIN03DD 41 inverter, 40x drive

44 LHA01D1 39 1-bit half adder


46 LIN02DB 30 inverter, 15x drive

47 LIN03DC 29 inverter, 20x drive

48 LDFNTNC 25 d-f/f positive-edge with Q, Qb (no reset), 2x drive

49 LNI01DA 23 buffer, 10x drive


51 LMX21DA 4 2 > 1 Mux, 10x drive




53 LMX21D7 2 2 > 1 Mux, 7x drive


55 LDFBTNB 0 d-f/f positive-edge with set and clear, Q, Qb

56 LDFBTNC 0 d-f/f positive-edge with set and clear, Q, Qb, 2x drive

57 LDFCTNB 0 d-f/f positive-edge with clear, Q, Qb

58 LDFCTNC 0 d-f/f positive-edge with clear, Q, Qb, 2x drive

59 LDFPTNB 0 d-f/f positive-edge with set, Q, Qb

60 LDFPTNC 0 d-f/f positive-edge with set, Q, Qb, 2x drive

61 LDFQTNB 0 d-f/f positive-edge with Q only (no reset)

62 LIT01D1 0 tri-state inverter with active high enable, 1x drive




66 LIT01DA 0 tri-state inverter with active high enable, 10x drive

67 LND12D2 0 2-input NAND with /A, 2x drive

68 LND12D4 0 2-input NAND with /A, 4x drive




72 LNT01DA 0 tri-state buffer with active high enable, 10x drive

73 LXN02D1 0 2-input XNOR, 1x drive

[Fig. 4.5-4 : SlimShader – Cell Utilization]
(Bar chart of usage count per standard cell, on a scale from 0 to 25,000,
starting from the most used cell, LND02D1.)

Chapter 5 System Evaluation

5.1 Target Configurations

[Fig. 5.1-1 : Target Configurations]
(Block diagrams of four configurations.
(a) Integration with existing A.P. : baseband processor, application processor, and RAMP-IV.
(b) Replacement of existing A.P. : baseband processor and RAMP-IV.
(c) Attachment to main A.P. : main application processor and RAMP-IV.
(d) Standalone Processor : RAMP-IV.)

The fabricated SoC can be used in four major configurations, summarized in fig.
5.1-1. The upper examples are for cell-phones and the lower ones are for other
devices. Fig. 5.1-1(a) is the integration with an existing application
processor and baseband processor, which can be applied to high-end cell-phones.
Fig. 5.1-1(b) shows the replacement of the existing application processor for
low-cost cell-phones, where this chip, RAMP-IV, takes the role of the
application processor (AP). Fig. 5.1-1(c) is the attachment to the main
application processor, an example for high-end PDAs and game terminals.
Finally, fig. 5.1-1(d) is a standalone processor for low-cost game terminals,
in which the graphics SoC also controls the overall system.

5.2 REMY : System Evaluation Board

5.2.1 System Architecture

To demonstrate the fabricated chip, I designed the evaluation board, REMY,
choosing a standalone configuration. REMY consists of a RAMP-IV, the proposed
graphics SoC, as the main processor, 32MByte of main memory, a bus controller,
a USB interface, and an LCD display, as shown in fig. 5.2-1. The 3D
applications, ported with MobileGL, are downloaded to the system memory through
the USB interface. Then, RAMP-IV starts drawing pixels on the LCD screen,
interfacing with the main memory.

5.2.2 REMY-I : First Evaluation Board

REMY-I is the first prototype board for chip evaluation and debugging. Fig.
5.2-2 shows its photo. The first silicon works successfully, and the images are
drawn on a 256x256 area of the 640x480 LCD screen.

5.2.3 REMY-II : PDA Prototype

I also revised the board into a PDA prototype by using smaller parts and
eliminating the debug pins. The images are rendered on a 240x256 area of the
240x320 LCD. The board is shown in fig. 5.2-3.

[Fig. 5.2-1 : REMY System]
(Block diagram. The graphics SoC (RAMP-IV) connects through a bus interface
(Altera EPF10K100, containing an 8-entry FIFO, SRAM control, synchronizer,
init, and LCD timing generation) to the LCD, to the 32MB system memory (eight
Samsung Ut-SRAM devices), and to the USB interface (USB firmware on Atmel
AT89LV92, USB link on Philips PDUISBD12).)

[Fig. 5.2-2 : REMY-I]
(Board photograph with labels: RAMP-IV, B.I., USB, System Memory, Power Supply.)

[Fig. 5.2-3 : REMY-II]
(Board photograph with labels: RAMP-IV, System Memory, Bus Interface, USB,
Power Supply.)


5.3 Graphics Library: MobileGL

To accelerate 3D graphics applications on wireless devices, MobileGL, an
OpenGL-ES compatible graphics library, is proposed and developed. MobileGL is
optimized with hand-written assembly language to boost performance on
ARMv4-based platforms. As shown in fig. 5.3-1, MobileGL fills the gap between
3D games or the MMI (Man-Machine Interface) and the hardware blocks on the
application platform of wireless devices. MobileGL consists of a fixed-point
math library, a geometry engine, and a rendering engine that also includes a
S/W renderer. To reduce the size of the graphics library and to enhance its
performance for mobile 3D gaming applications, many unused functions are cut
from OpenGL-ES v1.0 and some functions are ported from OpenGL v1.2. Out of the
106 functions in OpenGL-ES v1.0, the 70 most frequently used are chosen, and 6
more functions, related to glBegin and glEnd, are adopted from OpenGL v1.2 and
added to MobileGL. Fig. 5.3-2 shows a code example of MobileGL and the
corresponding assembly language of SlimShader.

5.4 Demonstration

Fig. 5.4-1 shows pictures captured from the REMY system running real-time 3D
applications.


[Fig. 5.3-1 : Application Platform of Wireless Devices]
(Layer diagram. Application: MPEG-4 Video, MP3 Audio, 3D Games, MMI (Menu).
Mobile Host S/W: Application Platform (BREW), Operating System, and MobileGL,
which contains the Geometry Engine, Rendering Engine, Fixed-Point Math Library,
S/W Renderer, and H/W Port, written in ARMv4 ASM and SlimShader ASM. Hardware:
Baseband Modem, Application Processor, 3D Graphics Accelerator.)

Code Example: MobileGL
>> glGenTextures(1,&texName);
>> glBindTexture(GL_TEXTURE_2D, texName);
>> glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MIN_FILTER,GL_LINEAR_MIPMAP_NEAREST);
>> glTexParameteri(GL_TEXTURE_2D,GL_TEXTURE_MAG_FILTER,GL_LINEAR);
>> glTexImage2D(GL_TEXTURE_2D, 0, GL_RGB, 256, 256, GL_RGB, GL_UNSIGNED_BYTE, *texels256);
>> glTexImage2D(GL_TEXTURE_2D, 1, GL_RGB, 128, 128, GL_RGB, GL_UNSIGNED_BYTE, *texels128);
>> glTexImage2D(GL_TEXTURE_2D, 2, GL_RGB, 64, 64, GL_RGB, GL_UNSIGNED_BYTE, *texels64);
>> glTexImage2D(GL_TEXTURE_2D, 3, GL_RGB, 32, 32, GL_RGB, GL_UNSIGNED_BYTE, *texels32);
>> glTexImage2D(GL_TEXTURE_2D, 4, GL_RGB, 16, 16, GL_RGB, GL_UNSIGNED_BYTE, *texels16);
>> glTexImage2D(GL_TEXTURE_2D, 5, GL_RGB, 8, 8, GL_RGB, GL_UNSIGNED_BYTE, *texels8);
>> glTexImage2D(GL_TEXTURE_2D, 6, GL_RGB, 4, 4, GL_RGB, GL_UNSIGNED_BYTE, *texels4);
>> glTexImage2D(GL_TEXTURE_2D, 7, GL_RGB, 2, 2, GL_RGB, GL_UNSIGNED_BYTE, *texels2);
>> glTexImage2D(GL_TEXTURE_2D, 8, GL_RGB, 1, 1, GL_RGB, GL_UNSIGNED_BYTE, *texels1);
>> glTexEnvf(GL_TEXTURE_ENV,GL_TEXTURE_ENV_MODE,GL_MODULATE);
>> glEnable(GL_TEXTURE_2D);

Code Example: SlimShader ASM
>> TSTR 0x00000 R G B .....
>> TSTR 0x10000 R G B .....
>> TSTR 0x13ED4 R G B .....
>> TSTR 0x15000 R G B .....
>> TSTR 0x15400 R G B .....
>> TSTR 0x15500 R G B .....
>> TSTR 0x15540 R G B .....
>> TSTR 0x15550 R G B .....
>> TMOD 0x0000 0010 0010 0x01 0000 256

[Fig. 5.3-2 : MobileGL Code Example]
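The TSTR base addresses 0x00000, 0x10000, 0x15000, 0x15400, 0x15500, 0x15540,
and 0x15550 in the figure are consistent with MIPMAP levels packed back to back
in texture memory, with one address per texel. The sketch below computes such
offsets for a 256x256 base texture. The packing rule and the function name are
assumptions made for illustration, not a description of the SlimShader memory
map.

```python
def mipmap_offsets(base_size=256):
    """Return (level size, base address) pairs for levels packed contiguously,
    assuming one address per texel and square power-of-two levels."""
    offsets = []
    address = 0
    size = base_size
    while size >= 1:
        offsets.append((size, address))
        address += size * size   # the next level starts right after this one
        size //= 2
    return offsets


for size, address in mipmap_offsets():
    print(f"{size:3d} x {size:<3d} at 0x{address:05X}")
```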

[Fig. 5.4-1 : Demonstration Results]

Chapter 6 Conclusions and Further Work

6.1 Conclusions

A low-power graphics SoC implementing a full 3D pipeline with texturing and
special rendering effects is designed, implemented, and demonstrated for mobile
multimedia applications, and it is the first of its kind to be published
worldwide.

The graphics SoC contains a 32bit RISC processor with an enhanced MAC as a
geometry engine, a hard-wired 3D rendering engine, a programmable power
optimizer, and 29Mb of embedded DRAM. The chip can perform 1Mvertices/s
transformation with the custom-designed MobileGL, and it can draw pixels at
66Mpixels/s and 264Mtexels/s on a 256x256 LCD display. Dedicated hardware
engines and embedded DRAM lower the operating frequency to 33MHz. The row cycle
and latency of the embedded DRAM are optimized for the frame buffer, depth
buffer, and texture memory. The overall power can be controlled further by
three-step frequency scaling and block-level clock gating.

The rendering engine consists of SlimShader and Memory Programmer, a main
rendering pipeline and a post-processing unit, respectively. They are designed
with a primary focus on low power consumption. SlimShader supports a subset of
OpenGL rendering functions with 13 128bit-encoded instructions. It is composed
of 14 pipeline stages so that power is saved by activating only the necessary
stages. Depth-first clock gating and latch enabling remove unnecessary datapath
transitions as much as possible. The SlimShader performs horizontal-order
rasterization with 3D-optimized DRAM to simplify the design. The hard-wired
triangle setup engine is implemented by simplifying the algorithm and
optimizing the datapath precision for low power and small area. Using
multipliers and a shared 11bit floating-point LUT for the SIMD divider reduces
the area by 40% compared with 16bit fixed-point calculation, while delivering
the required precision.

The energy-efficient texturing unit performs perspective correction and
bilinear MIPMAP filtering for better image quality. For the perspective
correction calculation, an approximated division scheme that rounds off the
LSBs of 1/w is proposed; it reduces the divider area by 95% while keeping the
error within 0.78%.
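The 0.78% bound is numerically close to 2^-7 = 0.78125%, which is the
worst-case relative error of truncating a normalized mantissa to 7 fractional
bits. The sketch below illustrates that relationship only; the mantissa format,
the bit width, and the function name are assumptions, not the documented
datapath of the texturing unit.

```python
import math

FRACTION_BITS = 7   # assumption: 1/w keeps 7 fractional mantissa bits


def truncate_mantissa(x, fraction_bits=FRACTION_BITS):
    """Write x > 0 as m * 2**e with 1 <= m < 2, then drop the mantissa LSBs."""
    m, e = math.frexp(x)          # x = m * 2**e with 0.5 <= m < 1
    m, e = m * 2.0, e - 1         # renormalize so that 1 <= m < 2
    scale = 1 << fraction_bits
    m_truncated = math.floor(m * scale) / scale
    return math.ldexp(m_truncated, e)


# Sweep w and record the worst relative error of the truncated reciprocal.
worst = 0.0
for i in range(1, 20001):
    w = 0.5 + i * 0.01
    exact = 1.0 / w
    approx = truncate_mantissa(exact)
    worst = max(worst, (exact - approx) / exact)

print(f"worst relative error: {100 * worst:.3f}% (bound {100 * 2**-FRACTION_BITS:.3f}%)")
```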

Address Alignment Logic (AAL) reduces the texture requests with a spatial
aligner and a temporal aligner. The spatial aligner compares the requests of
the two pixel processors and eliminates the overlapping ones. The temporal
aligner then reduces the requests further by comparing the current requests
with recently used ones, without using a power-consuming SRAM cache. AAL
reduces the energy consumption by 66%, since it reduces both the power
consumption and the number of operation cycles.
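The two alignment steps can be sketched in software as request deduplication.
This is a behavioral illustration under stated assumptions: texel requests are
modeled as integer addresses, the temporal aligner is modeled as a small
fixed-length history of recently fetched addresses, and the function name and
history length are hypothetical. It does not describe the actual AAL circuit.

```python
from collections import deque


def align_requests(pp0_addrs, pp1_addrs, recent):
    """Return the texel addresses that still need a memory access.

    pp0_addrs, pp1_addrs: texel addresses requested by the two pixel processors.
    recent: deque of recently fetched addresses (updated in place).
    """
    # Spatial aligner: merge both processors' requests and drop overlaps.
    merged = []
    for addr in list(pp0_addrs) + list(pp1_addrs):
        if addr not in merged:
            merged.append(addr)
    # Temporal aligner: drop requests that were served recently.
    needed = [addr for addr in merged if addr not in recent]
    recent.extend(needed)
    return needed


# Two horizontally adjacent pixels per cycle, 2x2 bilinear footprints on a
# 256-texel-wide texture (address = row * 256 + column).
recent = deque(maxlen=8)   # assumed history length
first = align_requests([0, 1, 256, 257], [1, 2, 257, 258], recent)
second = align_requests([2, 3, 258, 259], [3, 4, 259, 260], recent)
print(len(first), len(second))
```

In this toy trace, 16 raw texel requests over two cycles shrink to 10 memory
accesses; the actual saving depends on the access pattern.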

Memory Programmer post-processes the rendered pixels while transferring them to
the display controller, in parallel with the SlimShader. It contains crossbar
switches and a SIMD-parallel datapath controlled by its own 16bit commands.
Special rendering effects such as full-scene antialiasing, motion blur, and fog
can be programmed without degrading the performance of the main pipeline
because the pixels are transferred directly to the LCD controller.

Also, 12 distributed DRAMs reduce the operating power of the graphics memory by
up to 75%, since only the necessary memories are activated, while providing up
to 2.4GByte/s at 50MHz. The memories can be selectively refreshed for data
retention in standby modes by 4 power-control instructions.

The chip is implemented in a 0.16um 256Mb-compatible DRAM process to reduce the
fabrication cost. The logic components, SRAM, and analog blocks are drawn with
the design rules of the peripheral transistors of the DRAM. This DRAM-based SoC
implementation allows a large on-chip memory with inherently low leakage
current, which is important for mobile multimedia applications. The full 3D
graphics pipeline, featuring 1Mvertices/s, 66Mpixels/s, and 264Mtexels/s
texture-mapped 3D graphics, consumes less than 210mW and occupies 121mm2 of
chip area. The embedded DRAM drastically reduces the power consumption since
the external I/Os for 3D rendering are completely eliminated, and an additional
22% reduction is obtained by Address Alignment Logic and Depth-First
Clock-Gating. This chip achieves the highest performance among previous and
recently implemented chips.

Two evaluation boards are designed to demonstrate the fabricated chip. 3D
graphics images are successfully demonstrated on each board with MobileGL,
running real-time applications.

Therefore, this work allows 3D graphics to be implemented for mobile multimedia
applications.

6.2 Further Work

Since this research solves the 3D rendering bottleneck in mobile applications,
the next step is to accelerate the geometry stage to balance and speed up the
entire graphics pipeline. Also, as the direction of today's PC graphics shows,
programmable shading must be implemented on mobile devices to draw pixels with
higher fidelity.

Chapter 7 Summary

Abstract

Design and Implementation of a Low-Power 3D Graphics SoC

This work studies a 3D graphics accelerator for low-power information terminals
such as cell-phones. A low-power 3D rendering engine called SlimShader is newly
proposed. It consists of a 14-stage low-power pipeline. It adopts horizontal
scan rasterization to improve memory access efficiency, a divider built from a
look-up table and multipliers to increase the calculation speed and simplify
the design, and Address Alignment Logic, which reduces memory accesses during
texture mapping to as little as 1/4 and the energy consumption to 1/3.

To verify the proposed architecture, an ARM9, a 3D rendering engine, caches,
texture memory, depth buffer, frame buffer, and a power management unit were
implemented in a 0.16um DRAM process with 29Mbit DRAM, 72kbit SRAM, and 1M
logic transistors. The chip operates in three modes, Fast, Normal, and Slow,
and consumes 210mW (Fast mode, 33MHz) when rendering 3D images with bilinear
MIPMAP. A system evaluation board called REMY was built, and the high-speed
display of 3D images on a 256x256-resolution LCD screen was demonstrated
together with the MobileGL S/W library.

Chapter 8 Bibliography

[1] Takashi Hashimoto, et al, “A 27-MHz/54-MHz 11-mW MPEG-4 Video Decoder
LSI for Mobile Applications,” IEEE J. Solid-State Circuits, vol. 37, pp. 1574-1581,
Nov. 2002

[2] Tsuyoshi Nishikawa, et al., “A 60Mhz 230mW MPEG-4 Video-Phone LSI with

16Mb Embedded DRAM,” in ISSCC Digest of Technical Papers, pp. 230-231, Feb.

2000

[3] Khronos Group, “Bringing 3D Gaming to Cell Phones,” Game Developers
Conference 2003

[4] G. K. Kolli, S. Junkins, H. Barad, ”3D Graphics Optimizations for ARM

Architecture,” Proceedings of the Game Developers Conference 2002, March 2002

[5] Alan Watt, “3D Computer Graphics,” 3rd Ed, 2000, Addison-Wesley

[6] Yong-Ha Park, et al, “A 7.1-GB/s Low-Power Rendering Engine in 2-D Array-

Embedded Memory Logic CMOS for Portable Multimedia System,” IEEE J. Solid-

State Circuits, vol. 36, pp. 944-955, Jun. 2001

[7] Ramchan Woo, et al., “A 120mW 3D Rendering Engine with 6Mb Embedded


DRAM and 3.2Gbyte/s Runtime Reconfigurable Bus for PDA-Chip,” IEEE J. Solid-

State Circuits, vol. 37, pp. 1352-1355, Oct. 2002

[8] Chi-Weon Yoon et al, “A 80/20MHz 160mW Multimedia Processor integrated

with Embedded DRAM, MPEG-4 and 3D Rendering Engine for Mobile

Applications,” IEEE J. Solid-State Circuits, vol. 36, pp. 1758-1767, Nov. 2001

[9] Se-Jeong Park, et al, “A Reconfigurable Multilevel Parallel Texture Cache

Memory With 75-GB/s Parallel Cache Replacement Bandwidth,” IEEE J. Solid-State

Circuits, pp. 612-623, May. 2002

[10] Aurangzeb K. Khan et al, “A 150MHz Graphics Rendering Processor with

256Mb Embedded DRAM,” in ISSCC Dig. Tech. Papers, pp. 150-151, Feb. 2001

[11] John S. Montrym et al, “InfiniteReality: A Real-Time Graphics System,” in Proc.
SIGGRAPH, pp. 293-302, 1997

[12] W.R. Hamburgen, et al, “Itsy : stretching the bounds of mobile computing,”

IEEE Computer, vol 34, pp. 28-36, Apr. 2001

[13] Dennis D. Buss, “Technology in the Internet Age,” ISSCC Digest of Technical

Papers, pp. 18-21, Feb. 2002

[14] www.opengl.org

[15] www.microsoft.com/directx

[16] Ramchan Woo, “Design and Implementation of Low-Power Embedded 3D

Graphics Rendering Engine for Mobile Applications using the Embedded Memory

Logic Technology,” M.S. Dissertation, KAIST 2001.

[17] Ramchan Woo, et al, “A 210mW Graphics LSI Implementing Full 3D Pipeline
with 264Mtexels/s Texturing for Mobile Multimedia Applications,” in ISSCC Digest
of Technical Papers, pp. 44-45, Feb. 2003

[18] Ramchan Woo, et al, “A Low-Power and high-Performance 2D/3D Graphics

Accelerator for Mobile Multimedia Applications,” Hot Chips 2003

[19] Ramchan Woo, et al, “A Low Power 3D Rendering Engine with Two Texture

Units and 29Mb Embedded DRAM for 3G Multimedia Terminals,” in Proc. Of

European Solid-State Circuits Conference, pp. 53 – 56, 2003

[20] Ramchan Woo, et al, “A 210mW Graphics LSI Implementing Full 3D Pipeline

with 264Mtexels/s Texturing for Mobile Multimedia Applications,” IEEE J. Solid-

State Circuits, Accepted for Publication

[21] Ramchan Woo, et al, “A Low-Power 3D Rendering Engine with Two Texture

Units and 29Mb Embedded DRAM for 3G Multimedia Terminals,” IEEE J. Solid-

State Circuits, Accepted for Publication

[22] Ramchan Woo, et al, “A Low-Power Graphics LSI integrating 29Mb Embedded
DRAM for Mobile Multimedia Applications,” University Design Contest, Asia-
South-Pacific Design Automation Conference 2004, Accepted for Presentation

[23] Masatoshi Kameyama, et al, "3D Graphics LSI Core for Mobile Phone Z3D,"

ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2003

[24] “ARM MBX HR-S 3D Graphics Core Technical Overview,” Technical

Document, ARM DTO-0003B, 2002

[25] Xie, Feng, and Michael Shantz, “Adaptive Hierarchical Visibility in a Tiled

Architecture,” ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, pp.

75-84, 1999

[26] Tomas Akenine-Moller, Jacob Strom, “Graphics for the Masses: A Hardware

Rasterization Architecture for Mobile Phones,” Proc. of ACM SIGGRAPH,

pp. 801-808, 2003

[27] Junichi Fujita, et al, “A 109.5mW 1.2V, 600Mtexels/s 3-D Graphics Engine,” in

ISSCC Digest of Technical Papers, pp. 332-333, Feb. 2004

[28] Gregory A. Uvieghara, et al, “A Highly-Integrated 3G CDMA2000 1X Cellular

Baseband Chip with GSM/AMPS/GPS/Bluetooth/Multimedia Capabilities and ZIF

RF Support,” in ISSCC Digest of Technical Papers, pp. 422-423, Feb. 2004

[29] T. Kamei, et al, “A Resume-Standby Application Processor for 3G Cellular

Phones,” in ISSCC Digest of Technical Papers, pp. 336-337, Feb. 2004

[30] Fumio Arakawa, et al, “An Embedded Processor Core for Consumer Appliances

with 2.8GFLOPS and 36M polygons/s FPU,” in ISSCC Digest of Technical Papers, pp.

334-335, Feb. 2004

[31] Khronos Group, “OpenGL ES Common/Common-Lite Profile Specification,”

version 1.0 (Annotated)

[32] JSR-184 Expert Group, “Mobile 3D Graphics API for Java 2 Micro Edition,”

Public Review Draft, Apr. 30, 2003.

[33] Z.S. Hakura, et al, “The Design and Analysis of a Cache Architecture for Texture

Mapping,” Proc. of the 24th International Symposium on Computer Architecture,

1997

[34] Young-Don Bae, et al., “A Single-Chip Programmable Platform Based on a

Multithreaded Processor and Configurable Logic Clusters,” ISSCC Digest of

Technical Papers, pp. 336-337, Feb. 2002

[35] Ju-Ho Sohn, et al, “Optimization of Portable System Architecture for Real-time

3D Graphics,” in IEEE International Symposium on Circuits and Systems

Proceedings, pp. I769-I772, 2002

[36] Michael Cox, et al, “Multi-Level Texture Caching for 3D Graphics Hardware,”

ACM/IEEE International Symposium on Computer Architecture, pp. 86-97, 1998

[37] Homan Igehy, et al, “Parallel Texture Caching,” ACM SIGGRAPH/Eurographics

Workshop, pp. 95 – 106, 1999

[38] Homan Igehy, et al, “Prefetching in a Texture Cache Architecture,” ACM

SIGGRAPH/Eurographics Workshop, 1998

[39] “ARM Architecture Reference Manual,” Technical Document, ARM DUI-0100B,

1996

[40] Michael F. Deering, et al, “FBRAM : A new Form of Memory Optimized for 3D

Graphics,” SIGGRAPH, pp. 167-173, 1994

[41] L. Williams, “Pyramidal Parametrics,” SIGGRAPH, pp. 1-11, 1983

[42] Paul S. Heckbert, “Survey of Texture Mapping,” IEEE Computer Graphics and

Applications, vol. 6, no. 11, Nov. 1986, pp. 56-67

[43] Jon P. Ewins, et al, “MIP-Map Level Selection for Texture Mapping,” IEEE

Transactions on Visualization and Computer Graphics, vol. 4, no. 4, pp 317-329, Oct.-

Dec., 1998

[44] John Montrym, and Henry Moreton, “nVidia GeForce4,” HotChips 2002

[45] “3 次元グラフィックスで変するモバイル。ゲーム機,” Nikkei Electronics

pp. 77 – 86, 10-27, 2003

[46] Joel McCormack, et al, “Neon : A (Big) (Fast) Single-Chip 3D Workstation

Graphics Accelerator,” Research Report 98/1, Compaq Computer Corporation

Western Research Laboratory, 1999.

[47] O. Lathrop, D. Kirk, et al, “Accurate Rendering by Subpixel Addressing,” IEEE

Computer Graphics and Applications, pp 45-52, Sep., 1990.

[48] Anders Kugler, “The Setup for Triangle Rasterization,” Eurographics, pp 49-58,

1996

[49] B. Barenbrug, et al, “Algorithms for Division Free Perspective Correct

Rendering,” ACM SIGGRAPH/Eurographics Workshop on Graphics Hardware, 2000

[50] Tomas Akenine-Moller, and Eric Haines, “Real-Time Rendering,” 2nd Ed, 2002,

AK Peters

[51] Christoforos E. Kozyrakis, David A. Patterson, “Scalable Vector Processors for

Embedded Systems,” IEEE Micro, pp. 36-45, Nov.-Dec., 2003

[52] Semiconductor Industry Association, “International Technology Roadmap for

Semiconductors,” 2002

[53] http://www.ati.com


Acknowledgment

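Since I joined the Semiconductor System Laboratory in 1999, after finishing my undergraduate studies, many people have helped me become who I am today, and I bow my head to them in sincere gratitude: my parents, my professors, the members of SSL, the best laboratory in the world, my friends and colleagues, and the people at the company.

What meaning would there be in listing every name one by one and writing down a few words of thanks? In the way I have been taught until now, by putting it into practice myself, I will without fail repay you with the best results within ten years.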

Design is not the creation,

but the process of decision.

DEPT. OF E.E, KAIST • GUSEONG-DONG, YUSEONG-KU, 305-701 • DAEJEON, KOREA • +82-42-869-8068 [email protected] • http://ssl.kaist.ac.kr/~ramchan/main.html

RAMCHAN WOO

EDUCATION

Korea Advanced Institute of Science and Technology - Full Scholarship from the Korea Government

3/01 – 8/04 Ph.D. in Electrical Engineering

Thesis : Design and Implementation of Low-Power 3D Graphics SoC for Mobile Multimedia Applications

3/99 – 2/01 M.S. in Electrical Engineering

Thesis : Design and Implementation of Low-Power Embedded 3D Graphics Rendering Engine for Mobile Applications using the Embedded Memory Logic Technology

Course GPA : 3.78/4.3

3/95 – 2/99 B.S. in Electrical Engineering, Summa Cum Laude

Overall GPA : 3.95/4.3 – Major GPA : 4.15/4.3

Taejon Science High School - Scholarship from Samsung Heavy Industry

3/93 – 2/95 Valedictorian, one-year-early graduation

INTERNATIONAL JOURNAL PAPERS (4 FIRST AUTHORED)

IEEE Micro

RAMP: Bringing 3D Graphics Hardware to Wireless Applications with Embedded DRAM Technology

Ramchan Woo, and Hoi-Jun Yoo IEEE Micro, Submitted

JSSC 2004

A Low-Power 3D Rendering Engine with Two Texture Units and 29Mb Embedded DRAM for 3G Multimedia Terminals

Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits, Vol. 39, No. 7, July, 2004

JSSC 2004

210mW Graphics LSI Implementing Full 3D Pipeline with 264Mtexels/s Texturing for Mobile Multimedia Applications

Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits, Vol. 39, No. 2, February, 2004

JSSC 2002

A 120mW 3D Graphics Rendering Engine with 6Mb Embedded DRAM and 3.2Gbyte/s Runtime Reconfigurable Bus for PDA-Chip

Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits, Vol. 37, No. 10, October, 2002

JSSC 2002

A Reconfigurable Multilevel Parallel Texture Cache Memory With 75-GB/s Parallel Cache Replacement Bandwidth

Se-Jeong Park, Jeong-Su Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, Tae-Hum Yang, Jin-Yong Jung and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits, Vol. 37, No. 5, May, 2002

JSSC 2001

An 80/20-MHz 160-mW Multimedia Processor Integrated With Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications

Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, and Hoi-Jun Yoo IEEE Journal of Solid-State Circuits, Vol. 36, No. 11, November, 2001


INTERNATIONAL CONFERENCE PAPERS (6 FIRST AUTHORED)

ISSCC 2003

A 210mW Graphics LSI implementing Full 3D Pipeline with 264Mtexels/s Texturing for Mobile Multimedia Applications

Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae, Chi-Weon Yoon, Byeong-Gyu Nam, Jeong-Ho Woo, Sung-Eun Kim, In-Cheol Park, Sungwon Shin, Kyung-Dong Yoo, Jin-Yong Chung, and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference (ISSCC 2003 Proceedings)

Hot Chips 2003

A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications

Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae, and Hoi-Jun Yoo 15th International Hot Chips Conference

Graphics Hardware 2004

A Programmable Vertex Shader with Fixed-Point SIMD Datapath for Low Power Wireless Applications

Ju-Ho Sohn, Ramchan Woo, and Hoi-Jun Yoo Eurographics / Graphics Hardware Workshop 2004, Accepted for Presentation

ESSCIRC 2003

A Low-Power 3D Rendering Engine with Two Texture Units and 29Mb Embedded DRAM for 3G Multimedia Terminals

Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, and Hoi-Jun Yoo IEEE European Solid-State Circuits Conference

ASP-DAC Design Contest 2004

A Low-Power Graphics LSI integrating 29Mb Embedded DRAM for Mobile Multimedia Applications

Ramchan Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae, and Hoi-Jun Yoo Asia and South Pacific Design Automation Conference 2004 University Design Contest

ISSCC 2001

An 80/20MHz 160mW Multimedia Processor integrated with Embedded DRAM, MPEG-4 Accelerator, and 3D Rendering Engine for Mobile Applications

Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Young-Don Bae, In-Cheol Park, and Hoi-Jun Yoo IEEE International Solid-State Circuits Conference (ISSCC 2001 Proceedings)

ISSCC 2000

A 7.1GB/s Low Power 3D Rendering Engine in 2D Array Embedded Memory Logic CMOS

Yong-Ha Park, Seon-Ho Han, Jung-Su Kim, Se-Joong Lee, Jeong-Hun Kook, Jae-Won Lim, Ramchan Woo, Hoi-Jun Yoo, Jeong-Hwan Lee, and Jay-Hyun Lee IEEE International Solid-State Circuits Conference (ISSCC 2000 Proceedings)

Symp. on VLSI Circuits 2001

A 120mW Embedded 3D Graphics Rendering Engine with 6Mb Logically Local Frame-Buffer and 3.2GByte/s Run-time Reconfigurable Bus for PDA-Chip

Ramchan Woo, Chi-Weon Yoon, Jeonghoon Kook, Se-Joong Lee, Kangmin Lee, Yong-Ha Park and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)


Low Power Motion Compensation Block IP with embedded DRAM Macro for Portable Multimedia Applications

Chi-Weon Yoon, Jeonghoon Kook, Ramchan Woo, Se-Joong Lee, Kangmin Lee and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)


A Reconfigurable Multimedia Parallel Graphics Cache Memory with 75GB/s Parallel Cache Replacement Bandwidth

Se-Jeong Park, Jeongsu Kim, Ramchan Woo, Se-Joong Lee, Kangmin Lee, T.H. Yang, J.Y. Jung and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)


480ps 64bit Race Logic Adder

Se-Joong Lee, Ramchan Woo and Hoi-Jun Yoo IEEE Symposium on VLSI Circuits (SOVC 2001 Proceedings)

ISCAS 2002

Optimization of Portable System Architecture for Real-Time 3D Graphics

Juho Sohn, Ramchan Woo, and Hoi-Jun Yoo IEEE International Symposium on Circuits and Systems (ISCAS 2002 Proceedings)


ISCAS 2001

A Comparative Analysis of a DDR-SDRAM and a D-RDRAM using a POPeye Simulator

Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeonghoon Kook, Ja-Il Ku, Tae-Sung Jung, and Hoi-Jun Yoo IEEE International Symposium on Circuits and Systems (ISCAS 2001 Proceedings)

ISCAS 2000

A 670ps, 64bit Dynamic Low-Power Adder Design

Ramchan Woo, Se-Joong Lee, and Hoi-Jun Yoo IEEE International Symposium on Circuits and Systems (ISCAS 2000 Proceedings)

Others

7.1GB/s Bandwidth 3D Rendering Engine using the EML Technology

Yong-Ha Park, Ramchan Woo, Seon-Ho Han, Jung-Su Kim, Se-Joong Lee, Jeong-Hun Kook, Jae-Won Lim, and Hoi-Jun Yoo IEEE International Conference on VLSI and CAD (ICVC 1999 Proceedings)

DOMESTIC PAPERS

Magazines

The Technology Trends of Embedded Processors on Portable Systems

Hoi-Jun Yoo and Ramchan Woo The Magazine of the IEEK, July, 2001

Journals

POPeye : A System Analysis Simulator for DRAM Performance Evaluation

Kangmin Lee, Chi-Weon Yoon, Ramchan Woo, Jeong-Hun Kook, Yon-Kyun Im, and Hoi-Jun Yoo Journal of Semiconductor Technology and Science, Vol. 1, No. 2, June, 2001

WORK EXPERIENCE

Korea Advanced Institute of Science and Technology

3/99 – 8/04 Research Assistant – Performed research on circuit and system design and chip implementation. Major research area: mobile 3D computer graphics.

3/99 – 8/04 Teaching Assistant – Assisted teaching for Electronics Laboratory and Microelectronic Circuit Design.

Sandcraft, Santa Clara, CA, USA

1/99 – 2/99 Winter Intern – Interned in the circuit division, designing a high-speed adder.

LG Semiconductor, Cheong-ju, Korea

1/98 – 2/98 Winter Intern – Interned in the flash-memory division.

INDUSTRY PROJECTS

RAMP (RAM Processor)

Development of Application Specific Embedded Memory Logic Design Technology Sponsored by Korea Ministry of Science and Technology, Korea Ministry of Commerce, Industry and Energy.

7/02 – 6/03 Technical Advisor

7/01 – 6/02 Chief Researcher, Team Leader

Responsible for 3D-enhanced multimedia PDA-system architecture and design

Responsible for full-chip architecture and design

Responsible for portable 3D graphics accelerator architecture and design

8/00 – 6/01 Responsible for portable 3D graphics accelerator architecture design.

10/99 – 7/00 Responsible for “Embedded 3D Graphics Rendering Engine for PDA-Chip” design.

2/99 – 9/99 “DRAM-embedded high performance 3D rendering engine” layout.

DA-1

Development of 3D Graphics Accelerator IP for Mobile Application Processor SoC Sponsored by Samsung Electronics

5/03 – 08/04 Responsible for 3D Rendering Engine Architecture


MobileGL-C1

Development of 3D Graphics Library for Wireless Cellular Phones Sponsored by Mcres

3/04 – 5/04 Team Leader Responsible for Library Specification and 3D Rendering Code Optimization for ARM7

RAMP-C1

Development of Low-Power Graphics SoC Platform Sponsored by Korea Ministry of Information and Communication

3/04 – 8/04 Technical Advisor Responsible for Hardware Specification for 3D Graphics

POPeye

Development of Emulator to Analyze DRAM Architecture and Performance Sponsored by Samsung Electronics

2/99 – 10/00 Modeling and Performance Analysis of DDR-SDRAM.

PATENTS

Method for Memory Addressing

Ramchan Woo, Chi-Weon Yoon, and Hoi-Jun Yoo U. S. Patent 6,400,640 B2 (Jun. 4, 2002), Korea Patent 368132 (Jan. 14, 2003)

A Low-Power Instruction Decoding Method for Microprocessor

Ramchan Woo and Hoi-Jun Yoo Korea Patent 0324253 (Jan. 30, 2002); U. S. Application Number 09/964,387, Pending; Japan Application Number 2000-363741, Pending; Europe Application Number 100 54 434.7, Pending; Taiwan Application Number 89,123,526, Pending

Virtually Spanning 2D Array (ViSTA) Architecture and Memory Mapping Method for Embedded 3D Graphics Rendering Accelerator

Ramchan Woo and Hoi-Jun Yoo Korea Patent 372090

System for Calculating 3D Computer Graphics on Portable Devices

Ramchan Woo, Se-Joong Lee, Jeonghoon Kook, Chi-Weon Yoon and Hoi-Jun Yoo Korea Application Number 2001-53827, Pending

Method and Apparatus for Enhancing Texture Memory Access Performance for 3D Computer Graphics

Ramchan Woo, and Hoi-Jun Yoo Korea Application Number 2002-7868, Pending

Method and Apparatus for Efficient Buffer Memory Utilization with Adaptive Flow Control in the Queue System

Ju-ho Sohn, Ramchan Woo, and Hoi-Jun Yoo Korea Application Number 2002-13883, Pending

Method and Apparatus for accelerating 2D/3D multimedia processing by using the coprocessor

Ju-ho Sohn, Ramchan Woo, and Hoi-Jun Yoo Korea Application Number 2003-14021, Pending

Method and apparatus for accelerating 2D/3D multimedia operations by using the streaming SIMD coprocessor in portable system

Ju-ho Sohn, Ramchan Woo, and Hoi-Jun Yoo Pending

RESEARCH INTERESTS

Mobile 2D/3D Graphics Architecture and its Circuit Design

Graphics Library and Software Platform for Cell-Phones

Multimedia Signal Processor for Consumer Electronics


SKILLFUL TOOLS

Graphics Library : OpenGL

High-level Simulation : C/C++, SystemC

Logic Design : VerilogXL, Synopsys Design Compiler, Apollo P&R Tools, Dynacell

Circuit Design : Cadence Opus, Hspice, EPIC nanosim, Calibre, Hercules

PCB Design : Orcad

LANGUAGES

Korean as a native language

Fluent English and Japanese

Intermediate Chinese
