SoC Low-Power Design Techniques


Jun-Dong Cho, SungKyunKwan University

VADA Lab.

Contents

• Introduction
• SOC Design Trends
• System Level Low Power Design
• Architecture Level Low Power Design
• Conclusion

SOC Design Trends

• SOCs are expected to integrate increasingly complex functions:
  – Web browsing, real-time video processing, speech recognition and synthesis
• Average operating power at or below 100 mW and standby power at or below 2 mW
• Performance must increase from 300 million operations per second (MOPS) today to 2500 MOPS by 2016

Achieving functionality while maximizing battery life and minimizing size

Example applications: medical devices, watches, cellular phones, digital still cameras, hearing aids, cochlear implants, GPS, portable audio, digital radio, noise-cancellation headphones

QoS vs. Power

• How accurate should my FDCT (forward discrete cosine transform) be?

SOC Design Characteristics

The new version of the ITRS predicts that Moore's law will continue on a two- to three-year cycle throughout this period (2001-2016).

One of the key design challenges is to use the dramatically increasing transistor counts effectively, given certain power and productivity constraints.

• “Bottom-up”: based on system constraints
• “Top-down”: based on design resource constraints

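Energy-Flexibility Gap

[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility]
• Embedded processors (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• Reconfigurable architectures: 10-80 MOPS/mW
• Dedicated signal-processing ASICs: 200 MOPS/mW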
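Dividing a throughput target by each architecture's energy efficiency shows why the gap matters. The sketch below uses the efficiency figures from this chart together with the 2500 MOPS target and 100 mW budget from the SOC Design Trends slide. Using both ends of the 10-80 MOPS/mW reconfigurable range is our choice for illustration.

```python
# Efficiency figures from the Energy-Flexibility Gap chart (MOPS/mW).
EFFICIENCY_MOPS_PER_MW = {
    "Embedded processor (ARM)": 0.5,
    "ASIP / DSP": 3.0,
    "Reconfigurable, low end": 10.0,
    "Reconfigurable, high end": 80.0,
    "Dedicated ASIC": 200.0,
}

def required_power_mw(throughput_mops: float, efficiency_mops_per_mw: float) -> float:
    """Average power (mW) needed to sustain throughput_mops at a given efficiency."""
    return throughput_mops / efficiency_mops_per_mw

TARGET_MOPS = 2500.0  # 2016 performance target quoted in SOC Design Trends
BUDGET_MW = 100.0     # average operating power budget quoted in SOC Design Trends

for name, eff in EFFICIENCY_MOPS_PER_MW.items():
    p = required_power_mw(TARGET_MOPS, eff)
    status = "fits" if p <= BUDGET_MW else "exceeds"
    print(f"{name:26s} {p:8.1f} mW ({status} the budget)")
```

Under these assumptions, only the high end of the reconfigurable range (31.25 mW) and dedicated ASICs (12.5 mW) fit the 100 mW budget. An ARM-class core would need about 5 W.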

Radio Systems

• WiFi: 10-100 Mbit/s, unlicensed band; OFDM, M-ary coding
• 3G: 0.1-2 Mbit/s, wide-area cellular; CDMA, GMSK
• Bluetooth: 0.8 Mbit/s, cable replacement; frequency hopping
• ZigBee: 0.02-0.2 Mbit/s, low power, low cost; QPSK
• UWB: recently allowed by the FCC; short pulses (no carrier), bi-phase or PPM

Data Rate

[Figure: data rate (10 kbit/s to 100 Mbit/s) versus operating frequency (0-6 GHz) for ZigBee, Bluetooth, 802.11b, 802.11g, 802.11a, 3G, and UWB]

Cost (Projections)

[Figure: projected cost ($0.10 to $1000) for ZigBee, Bluetooth, UWB, 802.11b/g, 802.11a, and 3G]

Power Dissipation

[Figure: power dissipation (1 mW to 10 W) for ZigBee, Bluetooth, UWB, 802.11b/g, 802.11a, and 3G]

Why Low-Power Devices?

• Practical reasons: reducing the power requirements of high-throughput portable applications
• Financial reasons: reducing packaging costs and achieving memory savings
• Technological reasons: excessive heat prevents the realization of high-density chips and limits their functionality

Different Constraints for Different Application Fields

• Portable devices: battery lifetime
• Telecom and military: reliability (reduced power decreases electromigration and hence increases reliability)
• High-volume products: unit cost (reduced power decreases packaging cost)

Driving Forces for Low Power: Deep-Submicron Technology

• Advantages: smaller geometries, higher clock frequencies
• Disadvantages: higher power consumption, lower reliability

Dynamic Power Consumption

• Average power consumed by a node that completes one full cycle (one 0→1 and one 1→0 transition) in every clock period T_cycle = 1/f_CLK:

$$P_{battery} = \frac{E_{switching}}{T_{cycle}} = C_L V_{DD}^2 f_{CLK}$$

• Average power consumed by a node with partial activity (only a fraction α_{0→1} of the periods has a 0→1 transition):

$$P_{battery} = \alpha_{0\to 1} C_L V_{DD}^2 f_{CLK}$$
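The formula can be evaluated directly to show the linear effect of activity and the quadratic effect of supply voltage. This is a minimal sketch. The 100 fF load, 100 MHz clock, and 1.8 V / 0.9 V supplies are hypothetical values, not taken from the slides.

```python
def dynamic_power(alpha_01: float, c_load: float, vdd: float, f_clk: float) -> float:
    """P = alpha_{0->1} * C_L * V_DD^2 * f_CLK, in watts (SI units in)."""
    return alpha_01 * c_load * vdd ** 2 * f_clk

C_L = 100e-15   # 100 fF node capacitance (hypothetical)
F_CLK = 100e6   # 100 MHz clock (hypothetical)

p_full = dynamic_power(1.0, C_L, 1.8, F_CLK)      # one full 0->1->0 cycle per period
p_partial = dynamic_power(0.25, C_L, 1.8, F_CLK)  # a 0->1 transition in one period out of four
p_half_v = dynamic_power(0.25, C_L, 0.9, F_CLK)   # same activity at half the supply

for label, p in (("full", p_full), ("partial", p_partial), ("half VDD", p_half_v)):
    print(f"{label:9s} {p * 1e6:6.2f} uW")
```

Quartering the activity cuts power by 4. Halving V_DD cuts it by another 4, which is why voltage scaling is the strongest lever.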

Power Model

• Power dissipation in logic blocks consists of both dynamic (switching) and static (standby) components.

• Memory power is due primarily to the row/column decoders and to bit-line and word-line switching activity.
• Consider the power dissipated when the bit lines are switched by approximately V_DD during write cycles.

Chip Composition (Future)

Low-power digital SOC designs of the future will be 90-95% memory and 5-10% logic, including overhead.

Future chips may be dominated by memory due to power and resource constraints.

Three Factors Affecting Energy

– Reducing waste by hardware simplification: redundant hardware extraction, locality of reference, demand-driven/data-driven computation, application-specific processing, preservation of data correlations
– All-in-one approach (SOC): I/O pin and buffer reduction
– Voltage-reducible hardware: 2-D pipelining (systolic arrays), parallel processing

Low-Power Design Techniques

• Voltage and process scaling
• Design methodologies
  – Power-aware design flows and tools; trading area for lower power
• Architecture design
• Power-down techniques
  – Clock gating, dynamic power management
• Dynamic voltage scaling based on workload
• Power-conscious RT/logic synthesis
• Better cell library design and resizing methods
  – Capacitance reduction, threshold control, transistor layout
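Workload-based dynamic voltage scaling picks the slowest clock, and therefore the lowest supply, that still meets the deadline. This is a minimal sketch. The operating-point table and the workload numbers are hypothetical, and switching capacitance is assumed to stay the same at every point.

```python
from typing import List, Optional, Tuple

# Hypothetical DVS operating points: (clock frequency in MHz, supply voltage in V).
OPERATING_POINTS: List[Tuple[float, float]] = [(50, 0.9), (100, 1.1), (150, 1.3), (200, 1.5)]

def pick_operating_point(cycles: float, deadline_s: float) -> Optional[Tuple[float, float]]:
    """Return the slowest (lowest-voltage) point that still finishes `cycles` by the deadline."""
    for f_mhz, vdd in sorted(OPERATING_POINTS):
        if cycles / (f_mhz * 1e6) <= deadline_s:
            return f_mhz, vdd
    return None  # the workload cannot meet the deadline at any operating point

def energy_ratio(vdd: float, v_max: float = 1.5) -> float:
    """Switching energy for the same cycle count scales with V^2 (C_eff assumed unchanged)."""
    return (vdd / v_max) ** 2

point = pick_operating_point(cycles=2e6, deadline_s=0.025)  # light frame: 2M cycles in 25 ms
if point is not None:
    print(point, f"{energy_ratio(point[1]):.2f} of full-speed energy")
```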

SoC Design Flow

Power Analysis

• Fast and accurate analysis early in the design process
  – Power budgeting
  – Knowledge-based architectural and implementation decisions
  – Package selection
  – Identification of power-hungry modules
• Detailed and comprehensive analysis at the later stages
  – Satisfaction of the power budget and constraints
  – Hot spots

Power Savings

Estimation Expectations

System Level Power Optimization

• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low-power data encoding
• Quality of service vs. power
• Low-power memory mapping
• Resource sharing / allocation

Flow

• C/C++ compilation
• Program execution
• Building the design representation
• Loading profiling data
• Setting constraints
• Power estimation
• Identification of hot spots

IBM's PowerPC

• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
  – Five instructions in parallel (IU, FPU, BPU, LSU, SRU), RISC
  – The FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
  – Low-power 3.3 V design
  – The 603e provides four software-controllable power-saving modes
• Copper processor with SOI
• IBM's Blue Logic ASIC: the new design reduces power by a factor of 10

Silicon-on-Insulator

• How does SOI reduce capacitance?

An insulating layer (similar to glass) is placed between the doped (impurity) regions and the silicon substrate, which eliminates most of the junction capacitance. The result is high performance, low power, and a low soft-error rate.

Why Copper Processor?

• Motivation: aluminum wires become increasingly resistive as they are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: lower battery drain
• Chip size: 60% smaller than an aluminum chip

Factors Influencing C_eff

• Circuit function
• Circuit technology
• Input probabilities
• Circuit topology

Some Basic Definitions

• The signal probability of a signal g(t) is given by

$$P(g) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} g(t)\, dt$$

• The signal activity of a logic signal g(t) is given by

$$A(g) = \lim_{T \to \infty} \frac{n_g(T)}{T}$$

where n_g(T) is the number of transitions of g(t) in the time interval between -T/2 and T/2.
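For a finite, clock-sampled waveform, both limits reduce to simple counts: P(g) becomes the fraction of time g is 1, and A(g) becomes transitions per unit time. This is a minimal sketch. The waveform and the 10 ns sampling period are hypothetical.

```python
from typing import Sequence, Tuple

def signal_stats(samples: Sequence[int], dt: float) -> Tuple[float, float]:
    """Estimate P(g) and A(g) from a 0/1 waveform sampled every dt seconds.

    P(g): fraction of the observation time T during which g = 1.
    A(g): number of transitions n_g divided by T.
    """
    if not samples:
        raise ValueError("need at least one sample")
    total_time = len(samples) * dt
    prob = sum(samples) / len(samples)
    transitions = sum(1 for a, b in zip(samples, samples[1:]) if a != b)
    return prob, transitions / total_time

wave = [0, 1, 1, 0, 1, 0, 0, 0]        # hypothetical clock-sampled signal
p, a = signal_stats(wave, dt=10e-9)    # 10 ns sampling period (assumed)
print(f"P(g) = {p:.3f}, A(g) = {a:.3g} transitions/s")
```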

Factors Influencing C_eff: Circuit Function

• Assume that there are M mutually independent signals g_1, g_2, ..., g_M, each having a signal probability P_i and a signal activity A_i, for i = 1, ..., M.
• For static CMOS, the signal probability at the output of a gate is determined by the probability of 1s (or 0s) in the logic description of the gate:
  – Inverter: input P_1, output 1 - P_1
  – AND: inputs P_1, P_2, output P_1 P_2
  – OR: inputs P_1, P_2, output 1 - (1 - P_1)(1 - P_2)
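These gate rules compose into a simple probability propagator. This is a minimal sketch. It is valid only when the gate inputs are mutually independent, so it breaks down on circuits with reconvergent fanout. The example circuit y = NOR(a AND b, c) and its input probabilities are hypothetical.

```python
def p_not(p1: float) -> float:
    """Inverter: P(out = 1) = 1 - P1."""
    return 1.0 - p1

def p_and(*ps: float) -> float:
    """AND gate with independent inputs: product of the input 1-probabilities."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def p_or(*ps: float) -> float:
    """OR gate with independent inputs: 1 - product of (1 - P_i)."""
    all_zero = 1.0
    for p in ps:
        all_zero *= 1.0 - p
    return 1.0 - all_zero

def p_nor(*ps: float) -> float:
    return p_not(p_or(*ps))

pa, pb, pc = 0.5, 0.5, 0.25          # hypothetical input probabilities
py = p_nor(p_and(pa, pb), pc)        # y = NOR(a AND b, c)
print(f"P(y = 1) = {py:.4f}")
```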

Factors Influencing C_eff: Circuit Function (Static CMOS)

• Transistors connected to the same input turn on and off simultaneously when the input changes.
• C_L of a static CMOS gate is charged to V_DD whenever a 0→1 transition at the output node is required.
• C_L of a static CMOS gate is discharged to ground whenever a 1→0 transition at the output node is required.

[Figure: NOR gate]

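Factors Influencing C_eff: Circuit Function (Static CMOS)

• State transition diagram of the NOR gate

[Figure: state transition diagram of the NOR gate]

$$p_{0\to 1} = p_{Y'}\, p_Y = (1 - p_Y)\, p_Y, \qquad p_Y = (1 - p_A)(1 - p_B)$$

where p_Y is the probability that the NOR output Y is 1 and p_{Y'} = 1 - p_Y.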

For temporally independent inputs, the NOR output makes 1→0 transitions as often as 0→1 transitions:

$$p_{1\to 0} = p_Y\, p_{Y'} = p_{0\to 1}$$
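The NOR transition probability can be checked by brute force. The sketch enumerates every pair of consecutive input vectors and sums the probability of those that drive the output from 0 to 1. It assumes inputs that are independent of each other and from one cycle to the next. For equiprobable inputs the result is 3/16, matching the closed form (1 - p_Y) p_Y with p_Y = (1 - p_A)(1 - p_B) = 1/4.

```python
from itertools import product
from typing import Callable, Sequence

def nor2(a: int, b: int) -> int:
    return int(not (a or b))

def transition_prob_01(gate: Callable[..., int], p_inputs: Sequence[float]) -> float:
    """P(output goes 0 -> 1 between two consecutive cycles).

    Assumes mutually independent inputs that are also temporally independent
    (an input's value in one cycle does not depend on the previous cycle).
    """
    def prob_of(vector) -> float:
        pr = 1.0
        for bit, p in zip(vector, p_inputs):
            pr *= p if bit else 1.0 - p
        return pr

    n = len(p_inputs)
    total = 0.0
    for prev in product((0, 1), repeat=n):
        for curr in product((0, 1), repeat=n):
            if gate(*prev) == 0 and gate(*curr) == 1:
                total += prob_of(prev) * prob_of(curr)
    return total

alpha = transition_prob_01(nor2, [0.5, 0.5])
print(f"alpha_0->1 = {alpha:.4f}")
```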

Factors Influencing C_eff: Input Probabilities (Static CMOS)

• Signal activity calculation: Boolean difference

$$\frac{\partial f}{\partial x_i} = f|_{x_i = 1} \oplus f|_{x_i = 0}$$

It signifies the condition under which output f is sensitized to input x_i.

If the primary inputs to function f are not spatially correlated, the signal activity at f is

$$A(f) = \sum_{i=1}^{N} P\left(\frac{\partial f}{\partial x_i}\right) A(x_i)$$
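The Boolean-difference formula can be evaluated by enumerating the truth table. The sketch below assumes independent inputs, computes P(∂f/∂x_i) for each input, and weights it by that input's activity. The AND and XOR examples and their probabilities and activities are hypothetical. For AND, ∂f/∂a = b, so each input's activity is scaled by the other input's probability. For XOR, the output is always sensitized, so the activities simply add.

```python
from itertools import product
from typing import Callable, Sequence

def boolean_difference_prob(f: Callable[..., int], i: int, probs: Sequence[float]) -> float:
    """P(df/dx_i), with df/dx_i = f(x_i = 1) XOR f(x_i = 0) and independent inputs."""
    total = 0.0
    for bits in product((0, 1), repeat=len(probs)):
        if bits[i]:                    # visit each assignment of the other inputs once
            continue
        hi = list(bits)
        hi[i] = 1
        if f(*hi) ^ f(*bits):          # output is sensitized to x_i
            weight = 1.0
            for j, (b, p) in enumerate(zip(bits, probs)):
                if j != i:
                    weight *= p if b else 1.0 - p
            total += weight
    return total

def output_activity(f, probs, activities) -> float:
    """A(f) = sum_i P(df/dx_i) * A(x_i) for spatially uncorrelated inputs."""
    return sum(boolean_difference_prob(f, i, probs) * a for i, a in enumerate(activities))

def and2(a: int, b: int) -> int:
    return a & b

def xor2(a: int, b: int) -> int:
    return a ^ b

a_and = output_activity(and2, probs=[0.5, 0.5], activities=[2e6, 4e6])
a_xor = output_activity(xor2, probs=[0.3, 0.6], activities=[1e6, 1e6])
print(f"A(AND) = {a_and:.3g}/s, A(XOR) = {a_xor:.3g}/s")
```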

Power Reduction Methods: Architecture-Driven Supply Voltage Scaling

• Strategy:
  1. Modify the architecture of the system to make it faster.
  2. Reduce V_DD to restore the original speed.
  The power consumption decreases.
• The most common architectural changes exploit parallelization and pipelining.
• Drawback: the additional circuitry required to compensate for the speed degradation may dominate, and the power consumption may increase.
• Consequence: parallelism and pipelining do not always pay off.

Parallel Architectures

P_par = 0.36 P_ref

Parallel-Pipelined Architectures

P_par-pipe = 0.2 P_ref
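Both ratios follow from P = C V² f once the capacitance, voltage, and frequency scale factors of each architecture are known. The factors below do not appear on these slides. They are the values commonly quoted from the datapath example of Chandrakasan, Sheng, and Brodersen (1992) and are assumed here for illustration.

```python
def relative_power(c_scale: float, v_scale: float, f_scale: float) -> float:
    """P_new / P_ref for dynamic power P = C * V^2 * f."""
    return c_scale * v_scale ** 2 * f_scale

# (capacitance, voltage, frequency) scale factors relative to the reference datapath.
# Assumed values, as commonly quoted from Chandrakasan, Sheng and Brodersen (1992).
ARCHITECTURES = {
    "reference": (1.0, 1.0, 1.0),
    "parallel": (2.15, 0.58, 0.5),
    "parallel-pipelined": (2.5, 0.4, 0.5),
}

for name, (c, v, f) in ARCHITECTURES.items():
    print(f"{name:20s} P/Pref = {relative_power(c, v, f):.2f}")
```

Duplicating hardware more than doubles the capacitance, but the quadratic voltage term dominates. This only holds while the extra routing and multiplexing overhead stays modest, which is the drawback noted on the previous slide.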

Loop Unrolling

• The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality.

Loop overhead is cut in half because two iterations are performed in each iteration. If array elements are assigned to registers, register locality is improved because A(i) and A(i +1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.

Original loop:

    for i = 2 to N-1
        A(i) = A(i) + A(i-1) * A(i+1)

Unrolled by 2:

    for i = 2 to N-2 step 2
        A(i)   = A(i)   + A(i-1) * A(i+1)
        A(i+1) = A(i+1) + A(i)   * A(i+2)
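A runnable version makes the equivalence easy to check. The slide's unrolled loop assumes an even trip count. The sketch below adds a clean-up iteration for odd counts, which is our addition. The second statement must use the A(i) just computed, exactly as the original loop would.

```python
def loop_original(values):
    """for i = 2 to N-1: A(i) = A(i) + A(i-1) * A(i+1)   (1-based indices on the slide)."""
    a = list(values)
    n = len(a)
    for i in range(1, n - 1):          # 0-based i = 1..n-2 covers slide indices 2..N-1
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def loop_unrolled(values):
    """Same computation unrolled by u = 2, plus a clean-up step for an odd trip count."""
    a = list(values)
    n = len(a)
    i = 1
    while i + 1 < n - 1:               # both i and i+1 are still valid iterations
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]   # reuses the freshly updated a[i]
        i += 2
    if i < n - 1:                      # leftover iteration when N-2 is odd
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

print(loop_unrolled([1, 2, 3, 4, 5, 6]))
```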

Loop Unrolling (IIR Filter Example)

Two output samples are computed in parallel based on two input samples:

$$Y_n = X_n + A\,Y_{n-1}$$
$$Y_{n-1} = X_{n-1} + A\,Y_{n-2}, \qquad Y_n = X_n + A\left(X_{n-1} + A\,Y_{n-2}\right)$$

Unrolling alone alters neither the switched capacitance nor the voltage. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation,

$$Y_{n-1} = X_{n-1} + A\,Y_{n-2}, \qquad Y_n = X_n + A\,X_{n-1} + A^2\,Y_{n-2}$$

the transformation yields a critical path of 3, so the supply voltage can be lowered.
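The unrolled recurrence can be checked against the direct first-order IIR filter Y_n = X_n + A·Y_{n-1}. In the sketch, both outputs of an iteration depend only on Y_{n-2}, and A² is precomputed once (constant propagation). The input samples and the coefficient 0.75 are hypothetical.

```python
def iir_direct(x, a, y_init=0.0):
    """Y_n = X_n + A * Y_{n-1}: one output per iteration."""
    y, prev = [], y_init
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y

def iir_unrolled(x, a, y_init=0.0):
    """Two outputs per iteration, both computed from Y_{n-2}:
    Y_{n-1} = X_{n-1} + A*Y_{n-2}
    Y_n     = X_n + A*X_{n-1} + A^2*Y_{n-2}
    """
    a2 = a * a                          # constant propagation: A^2 computed once
    y, prev2 = [], y_init
    i = 0
    while i + 1 < len(x):
        y_nm1 = x[i] + a * prev2
        y_n = x[i + 1] + a * x[i] + a2 * prev2
        y.extend((y_nm1, y_n))
        prev2 = y_n
        i += 2
    if i < len(x):                      # odd length: finish with one direct step
        y.append(x[i] + a * prev2)
    return y

samples = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5, -1.0]   # hypothetical input
print(iir_unrolled(samples, 0.75))
```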

Loop Unrolling for Low Power

Encoding

• Bus-invert (BI) code
  – Appropriate for random data patterns
  – Redundant code (1 extra bus line)
  – Reduces average transitions by up to 25%

R. J. Fletcher, “Integrated circuit having outputs configured for reduced state changes,” U.S. Patent 4,667,337, May 1987.
M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. on VLSI Systems, Mar. 1995, pp. 49-58.

Example (4-bit bus; Z is the encoded bus word followed by the INV bit):

X (data)   Z (bus, INV)
0000       0000 0
1010       1010 0
0100       1011 1
1111       1111 0
1010       1010 0
0100       1011 1
1101       0010 1
0011       0011 0

[Figure: bus-invert encoder (majority voter driving the INV line) and decoder]
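The sketch below implements the bus-invert rule as we understand it from Stan and Burleson. The Hamming distance between the next data word and the current bus value, counting the INV line, is computed. If it exceeds n/2, the inverted word is sent with INV = 1. It reproduces the example table and counts the transitions with and without encoding. The bus and the INV line are assumed to start at 0.

```python
from typing import List, Sequence, Tuple

def popcount(v: int) -> int:
    return bin(v).count("1")

def bi_encode(words: Sequence[int], n: int) -> List[Tuple[int, int]]:
    """Bus-invert encoding: returns (bus, inv) per word; bus and INV start at 0."""
    mask = (1 << n) - 1
    bus, inv = 0, 0
    out = []
    for x in words:
        # Transitions if x were sent unmodified with INV = 0 (the INV line counts too).
        cost = popcount((x ^ bus) & mask) + inv
        if cost > n / 2:
            bus, inv = ~x & mask, 1
        else:
            bus, inv = x & mask, 0
        out.append((bus, inv))
    return out

def bi_decode(encoded: Sequence[Tuple[int, int]], n: int) -> List[int]:
    mask = (1 << n) - 1
    return [(~bus & mask) if inv else bus for bus, inv in encoded]

def transitions(states: Sequence[Tuple[int, int]]) -> int:
    """Total bit toggles over a sequence of (bus, inv) states, starting from all zeros."""
    total, prev_bus, prev_inv = 0, 0, 0
    for bus, inv in states:
        total += popcount(bus ^ prev_bus) + (inv ^ prev_inv)
        prev_bus, prev_inv = bus, inv
    return total

X = [0b0000, 0b1010, 0b0100, 0b1111, 0b1010, 0b0100, 0b1101, 0b0011]
Z = bi_encode(X, n=4)
print(transitions([(x, 0) for x in X]), "->", transitions(Z))
```

For this sequence the raw bus toggles 18 times, while the encoded bus, INV line included, toggles 14 times.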

Different Supply Voltages for Different Units

• Partition the chip into multiple sub-units, each designed to operate at a specific supply voltage

[Figure: a chip partitioned into a FAST unit supplied at 5 V and SLOW units supplied at 3 V]
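The benefit of supply partitioning can be estimated by summing C·V²·f over the units. This is a minimal sketch of the FAST/SLOW partition shown above. The equal capacitances and the shared clock are hypothetical, and the overhead of level shifters between voltage domains is not modeled.

```python
from typing import List, Tuple

def unit_power(c_eff: float, vdd: float, f: float) -> float:
    """Dynamic power of one sub-unit, P = C_eff * V^2 * f (normalized units)."""
    return c_eff * vdd ** 2 * f

# Hypothetical partition: (name, switched capacitance, supply that still meets its timing).
UNITS: List[Tuple[str, float, float]] = [
    ("FAST", 1.0, 5.0),
    ("SLOW-1", 1.0, 3.0),
    ("SLOW-2", 1.0, 3.0),
    ("SLOW-3", 1.0, 3.0),
]
F = 1.0  # same normalized clock frequency for every unit

single_supply = sum(unit_power(c, 5.0, F) for _, c, _ in UNITS)   # everything at 5 V
multi_supply = sum(unit_power(c, v, F) for _, c, v in UNITS)      # 5 V only where needed
print(f"multi-supply / single-supply power = {multi_supply / single_supply:.2f}")
```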

COFDM Modem Block Diagram for Eureka 147/KDMB

[Figure: COFDM modem block diagram. Transmitter blocks: scrambler, Reed-Solomon encoder, convolutional interleaver, convolutional encoder, time interleaver, COFDM modulator (IFFT). Channel models: Gaussian, Rician, Rayleigh. Receiver blocks: COFDM demodulator (FFT, phase/timing lock, frame sync), time deinterleaver, Viterbi decoder, convolutional deinterleaver, Reed-Solomon decoder, descrambler. Serial data in and out; BERT (bit-error-ratio tester).]

DMB Modem: Domestic and International Status

Company | Product and key features
TI (USA) | DRE200: COFDM, audio, and FEC decoding performed on a general-purpose DSP; 160 mW
ATMEL (Germany) | U2739M: COFDM demodulation on an Oak DSP; hardware audio/FEC decoding; 860 mW
Panasonic (Japan) | MN66720UC: SDSP for COFDM, MDSP for audio
Frontier Silicon (UK) | Chorus FS1010: special-purpose DSP for COFDM/audio; 100 mW

Status of Low-Power Technology Development

Developer | Application / product | Features
IBM Austin Low Power Computing Research | DPM (PowerPC 405LP), portable processor | Linux power management (90% power reduction)
DoD DARPA | Power Aware Communication | Power management, scheduling, OS systems
Philips, STMicroelectronics, Atmel | PCF50606: single-chip power management unit (for smartphones and wireless PDAs) | Programmed power management (70% power reduction)
Atrenta | SpyGlass CAD tool | Generates gated-clock structures in RTL HDL and SystemC

VADA Lab's Low-Power IPs

[Figure: block diagrams of the IPs listed below, including an OFDM receiver partitioned between DSP and ASIC (ADC, AGC, NCO, GI removal, FFT, phase rotator, CR, coarse/fine STR, channel estimator/equalizer, Viterbi FEC), a Rijndael crypto processor with a host-CPU interface, a conventional versus low-power frequency-domain equalizer (FEQ), a Viterbi search engine, and a motion-estimation array of modified processing elements]

• Low-power equalizer for xDSL: 21% power reduction, SNR = 40 dB
• Fast and low-power Viterbi search engine using an inverse hidden Markov model: 68% power reduction, 71% speed improvement, 1.9× area increase (Samsung Humantech Paper Award, excellence prize, 2002)
• Maximizing memory data reuse for lower-power motion estimation: 33% power reduction, 52 MHz, 2.1× area increase (SCI journal paper)
• Low-power design and co-design of a double-dwell searcher for IS-95-based CDMA: 67% power reduction, 41% area reduction
• OFDM-based high-speed wireless LAN platform: 20.7 MHz, 237,000 gates
• Next-generation low-power security processor chip for smart cards: ECC, Rijndael, DES, SHA
• Highly flexible design of an OFDM transceiver for DVB-T (under development)

Other Examples of Low-Power Design Techniques

• Use of alternative number systems
• Scheduling/ordering
• Algorithm substitution
• Signal and statistical analysis

Low-Power Techniques by Number-System Conversion – I.1

• Use of the logarithmic number system (LNS)
• Log number system
  – Applied to the FFT, the largest of the arithmetic modules
  – The look-up table size is a key design variable
  – A number is split into a sign field and a magnitude field, and the base-2 logarithm of the magnitude is computed
  – The resulting log value is represented as a binary number with a limited n-bit range
• LNS arithmetic
  – Multiplication: addition
  – Addition/subtraction: addition and subtraction plus a table look-up
• Arithmetic accuracy
  – No BER degradation when the fractional part has 2 or more bits
• Power consumption
  – Experiments show up to about 60% lower power than a conventional butterfly FFT
  – 7.8 mW → 3.1 mW

$$S_A = \begin{cases} 0, & A \ge 0 \\ 1, & A < 0 \end{cases}, \qquad L_A = \log_2 |A|$$
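LNS arithmetic can be prototyped in a few lines. Multiplication XORs the sign bits and adds the logarithms. Same-sign addition uses max(a, b) + log2(1 + 2^-|a-b|), where the correction term is what the hardware look-up table would supply. This is a minimal sketch. The 4 fractional bits are an assumption, the correction term is computed rather than read from a table, and zero and mixed-sign subtraction are not handled.

```python
import math
from typing import Tuple

FRAC_BITS = 4  # fractional bits of the log value (assumption)

LnsNumber = Tuple[int, float]  # (sign bit S_A, quantized log2|A|)

def quantize(v: float, frac_bits: int = FRAC_BITS) -> float:
    """Round v to frac_bits fractional bits (fixed-point log representation)."""
    scale = 1 << frac_bits
    return math.floor(v * scale + 0.5) / scale

def to_lns(a: float) -> LnsNumber:
    if a == 0:
        raise ValueError("zero needs a separate flag in a real LNS; not modelled here")
    return (0 if a >= 0 else 1), quantize(math.log2(abs(a)))

def from_lns(x: LnsNumber) -> float:
    sign, log_mag = x
    return (-1.0 if sign else 1.0) * 2.0 ** log_mag

def lns_mul(x: LnsNumber, y: LnsNumber) -> LnsNumber:
    """Multiplication: XOR the signs, add the logarithms."""
    return x[0] ^ y[0], x[1] + y[1]

def lns_add_same_sign(x: LnsNumber, y: LnsNumber) -> LnsNumber:
    """Same-sign addition: max(a, b) + log2(1 + 2^-(|a - b|)).

    Hardware reads the correction term from a table indexed by |a - b|;
    here it is computed and quantized instead. Subtraction needs a second table.
    """
    if x[0] != y[0]:
        raise NotImplementedError("mixed signs (subtraction) are not covered by this sketch")
    hi, lo = max(x[1], y[1]), min(x[1], y[1])
    return x[0], hi + quantize(math.log2(1.0 + 2.0 ** (lo - hi)))

approx_product = from_lns(lns_mul(to_lns(3.0), to_lns(5.0)))
approx_sum = from_lns(lns_add_same_sign(to_lns(3.0), to_lns(5.0)))
print(f"3*5 ~ {approx_product:.3f}, 3+5 ~ {approx_sum:.3f}")
```

Increasing FRAC_BITS reduces the rounding error at the cost of a larger correction table, which matches the slide's point that the table size is the key design variable.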

Low-Power Techniques by Number-System Conversion – I.2

Low-Power Techniques by Operation Reordering – I.1

• Coefficient ordering
  – The order of operations is changed to reduce the power consumption of a radix-4 pipelined low-power FFT processor
  – Reduces switching activity at the fixed-coefficient input of the complex multiplier
• New commutator structure
  – Uses an additional dual-port RAM
  – 23% and 9% power reduction for 16- and 64-point FFTs, respectively
  – The benefit decreases for larger FFTs
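The idea behind coefficient ordering is to feed the multiplier's coefficient port a sequence whose consecutive words differ in as few bits as possible. This sketch quantizes the twiddle factors of a 16-point radix-4 first stage to 8-bit cos/sin words and reorders them with a greedy nearest-neighbor heuristic. The word length, the packing, and the heuristic are all assumptions, not the processor's actual scheme. Reordering is valid only when the corresponding multiplications are independent and can be buffered, which the extra dual-port RAM may provide.

```python
import math
from typing import List, Sequence

BITS = 8  # coefficient word length (assumption)

def to_word(v: float, bits: int = BITS) -> int:
    """Quantize v in [-1, 1] to a `bits`-bit two's-complement word."""
    lim = 1 << (bits - 1)
    q = max(-lim, min(lim - 1, round(v * lim)))
    return q & ((1 << bits) - 1)

def twiddle_word(k: int, n: int) -> int:
    """cos/sin of W_n^k packed into one word, as seen on the coefficient port."""
    angle = -2.0 * math.pi * k / n
    return (to_word(math.cos(angle)) << BITS) | to_word(math.sin(angle))

def switching(seq: Sequence[int]) -> int:
    """Total bit toggles on the coefficient port for a sequence of words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(seq, seq[1:]))

def greedy_order(words: Sequence[int]) -> List[int]:
    """Nearest-neighbour reordering by Hamming distance (a heuristic, not optimal)."""
    remaining = list(words)
    order = [remaining.pop(0)]
    while remaining:
        nxt = min(remaining, key=lambda w: bin(w ^ order[-1]).count("1"))
        remaining.remove(nxt)
        order.append(nxt)
    return order

# Exponents of W_16 in the first radix-4 stage of a 16-point FFT: n1 * k2 for n1, k2 in 0..3.
exponents = [n1 * k2 for k2 in range(4) for n1 in range(4)]
coeffs = [twiddle_word(e, 16) for e in exponents]
reordered = greedy_order(coeffs)
print("toggles in original order:", switching(coeffs), "after reordering:", switching(reordered))
```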

Low-Power Techniques by Operation Reordering – I.2

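Low Power by Algorithm Substitution – I.1

• Applied to a 64-point FFT
  – The 64-point FFT equation is rewritten by an algorithm transformation
  – It is decomposed into two stages of 8-point FFTs in a two-dimensional structure
• Complex multiplications are implemented with shift-and-add
• Power consumption
  – Average dynamic power of 41 mW at 20 MHz with a 1.8 V supply in an in-house 0.25 µm BiCMOS process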

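Low Power by Algorithm Substitution – I.2

$$A(r) = \sum_{k=0}^{N-1} B(k)\, W_N^{rk}, \qquad W_N = e^{-j 2\pi / N}$$

$$A(s + 8t) = \sum_{l=0}^{7} W_8^{lt}\, W_{64}^{sl} \sum_{m=0}^{7} B(8m + l)\, W_8^{sm}, \qquad s, t = 0, \ldots, 7$$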
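The 8 × 8 decomposition can be checked numerically against a direct 64-point DFT. The inner and outer sums need only powers of W_8, which are ±1, ±j, or (±1 ± j)/√2. The W_64 twiddles between the stages are the constant multiplications that a shift-and-add implementation would handle. This is a minimal floating-point sketch with a hypothetical test vector.

```python
import cmath
from typing import List, Sequence

def w(n: int, e: int) -> complex:
    """Twiddle factor W_n^e = exp(-j 2 pi e / n)."""
    return cmath.exp(-2j * cmath.pi * e / n)

def dft(b: Sequence[complex]) -> List[complex]:
    """Direct N-point DFT: A(r) = sum_k B(k) W_N^{rk}."""
    n = len(b)
    return [sum(b[k] * w(n, r * k) for k in range(n)) for r in range(n)]

def fft64_8x8(b: Sequence[complex]) -> List[complex]:
    """A(s + 8t) = sum_l W_8^{lt} W_64^{sl} sum_m B(8m + l) W_8^{sm}."""
    assert len(b) == 64
    # Stage 1: eight 8-point DFTs over m, one for each l.
    inner = [[sum(b[8 * m + l] * w(8, s * m) for m in range(8)) for s in range(8)]
             for l in range(8)]
    # Inter-stage twiddles W_64^{sl}.
    for l in range(8):
        for s in range(8):
            inner[l][s] *= w(64, s * l)
    # Stage 2: eight 8-point DFTs over l, one for each s.
    a = [0j] * 64
    for s in range(8):
        for t in range(8):
            a[s + 8 * t] = sum(inner[l][s] * w(8, l * t) for l in range(8))
    return a

x = [complex(k % 5 - 2, (3 * k) % 7 - 3) for k in range(64)]   # hypothetical input
print(max(abs(p - q) for p, q in zip(dft(x), fft64_8x8(x))))
```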

Low Power by Signal and Statistical Analysis – I.1

• Power breakdown
  – About half of the total power is consumed in the complex multipliers
• Analysis of the butterfly multiplications
  – Coefficient multiplications: in a generic stage, 0.25·M + 3 of the M coefficients equal 1
  – Clock gating can be applied when the cosine and sine are (1, 0)
• Frequency-division duplex (FDD) modems
  – 4096-point DMT with the ETSI-standard 4.3125 kHz tone spacing: 41% of the carriers are upstream, 26% downstream, and the remaining 30% are unused
  – 1024-point DMT with the ETSI-standard 4.3125 kHz tone spacing: 13%, 68%, and 18%, respectively
  – 59-87% of the IFFT (upstream) inputs and 31-74% of the FFT (downstream) inputs are zero
  – Clock gating can be applied
  – Applicable at the initial input stage
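Zero-valued inputs to the first stage can be exploited by gating off, or simply skipping, every multiplication whose data operand is zero. The sketch uses a direct-form IDFT so that the count of skipped multiplications is easy to read. A real FFT pipeline would gate first-stage butterflies instead. The 16-tone symbol with 5 active tones is hypothetical.

```python
import cmath
from typing import List, Sequence, Tuple

def idft_skip_zeros(x: Sequence[complex]) -> Tuple[List[complex], int]:
    """Direct-form IDFT that skips every multiplication whose input tone is zero.

    Returns the output samples and the number of complex multiplications performed.
    """
    n = len(x)
    active = [k for k, v in enumerate(x) if v != 0]   # unused tones never reach the multiplier
    out = []
    for t in range(n):
        acc = 0j
        for k in active:
            acc += x[k] * cmath.exp(2j * cmath.pi * k * t / n)
        out.append(acc / n)
    return out, n * len(active)

# Hypothetical 16-tone symbol in which only tones 2..6 carry data.
symbol = [0j] * 16
for k in range(2, 7):
    symbol[k] = 1 + 1j
y, mults = idft_skip_zeros(symbol)
print(f"{mults} of {16 * 16} multiplications performed")
```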

Clock Network Power Management

• The clock network can account for about 50% of the total power
• FIR filters (massively pipelined circuits): video processing (edge detection), voice processing, data transmission (e.g., xDSL)
• Telephony: 50% (70%/30%) idle, because the two parties do not talk at the same time
• With every clock cycle, data are loaded into the working register banks even if the data have not changed
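The waste described in the last bullet can be counted directly. An ungated register bank receives a clock edge every cycle, while a gated bank is clocked only when new data arrive. This sketch derives the enable by comparing the new word with the stored one, which is an assumption. Real designs usually take the enable from control signals such as a data-valid flag. The data stream is hypothetical.

```python
from typing import Iterable, Tuple

def register_loads(stream: Iterable[int], gated: bool) -> Tuple[int, int]:
    """Count clock edges delivered to a register bank and the data bits that toggle.

    Ungated: the bank is clocked every cycle.
    Gated: the clock is enabled only when the incoming word differs from the stored one.
    """
    stored, clocked, toggled = 0, 0, 0
    for word in stream:
        enable = (word != stored) if gated else True
        if enable:
            clocked += 1
            toggled += bin(word ^ stored).count("1")
            stored = word
    return clocked, toggled

samples = [5, 5, 5, 9, 9, 9, 9, 0, 0, 0]   # hypothetical bursty data with idle runs
print("ungated:", register_loads(samples, gated=False))
print("gated:  ", register_loads(samples, gated=True))
```

The data toggles are identical in both cases. The saving is in the clock edges, and therefore in the clock-network and register-internal power.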

Wireless Interface Power-Saving
Ronny Krashinsky and Hari Balakrishnan, MIT Laboratory for Computer Science

• Sleep to save energy, periodically wake to check for pending data – PSM protocol: when to sleep and when to wake?

• A PSM-static protocol has a regular sleep/wake cycle

[Figure: power versus time with PSM off and PSM on; 750 mW awake, 50 mW asleep, 100 ms sleep/wake period]

Measurements of an Enterasys Networks RoamAbout 802.11 NIC
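With the measured 750 mW awake and 50 mW asleep, the average power of a static sleep/wake cycle is a duty-cycle-weighted mean. This is a minimal sketch. The 10 ms listening window per 100 ms period is a hypothetical value, and wake-up transition energy is ignored.

```python
def average_power_mw(p_awake_mw: float, p_sleep_mw: float,
                     awake_ms_per_period: float, period_ms: float = 100.0) -> float:
    """Duty-cycle-weighted average power of a PSM-static sleep/wake cycle."""
    duty = awake_ms_per_period / period_ms
    return duty * p_awake_mw + (1.0 - duty) * p_sleep_mw

psm_off = average_power_mw(750, 50, awake_ms_per_period=100)  # radio always awake
psm_on = average_power_mw(750, 50, awake_ms_per_period=10)    # hypothetical 10 ms listen window
print(f"PSM off: {psm_off:.0f} mW, PSM on: {psm_on:.0f} mW")
```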

[Figure: message timelines (SYN, ACK, DATA) between the mobile device, the access point, and the server with PSM off and PSM on; time axis 0, 100, and 200 ms, with the device alternating between AWAKE and SLEEP when PSM is on]

The PSM-static Dilemma

Compromise between performance and energy:
• If PSM-static is too coarse-grained, it harms performance by delaying network data.
• If PSM-static is too fine-grained, it wastes energy by waking unnecessarily.

Solution: dynamically adapt to network activity to maintain performance while minimizing energy
– Stay awake to avoid delaying very fast RTTs
– Back off (listen to fewer beacons) while idle
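The adaptive remedy (listen to every beacon while traffic flows, back off while idle) can be sketched as a beacon-listening schedule. This is a simplified illustration, not Krashinsky and Balakrishnan's actual protocol. The doubling back-off rule and max_interval = 8 are assumptions.

```python
from typing import List, Sequence

def adaptive_listen_schedule(pending: Sequence[bool], max_interval: int = 8) -> List[bool]:
    """Decide, for each beacon period, whether the radio wakes up to listen.

    pending[i] is True if data is queued for the device at beacon i. While traffic
    keeps arriving the device listens to every beacon; while idle it doubles the
    listen interval, up to max_interval beacons.
    """
    listens: List[bool] = []
    interval, countdown, buffered = 1, 0, False
    for has_data in pending:
        buffered = buffered or has_data          # data waits at the access point
        if countdown == 0:                       # this beacon is listened to
            listens.append(True)
            if buffered:                         # activity: listen to every beacon
                interval, buffered = 1, False
            else:                                # idle: listen to fewer beacons
                interval = min(interval * 2, max_interval)
            countdown = interval - 1
        else:                                    # sleeping through this beacon
            listens.append(False)
            countdown -= 1
    return listens

schedule = adaptive_listen_schedule([True] + [False] * 9)
print(schedule)
```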

Why Hardware for Motion Estimation?

• The most computationally demanding part of video encoding
• Example: CCIR 601 format
  – 720 × 576 pixels
  – 16 × 16 macroblock (n = 16)
  – 32 × 32 search area (p = 8)
  – 25 Hz frame rate (f_frame = 25)
  – 9 giga-operations per second (GOPS) are needed for the full-search block-matching algorithm
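The 9 GOPS figure can be reproduced from the listed parameters. The sketch assumes 3 operations per pixel per candidate (subtract, absolute value, accumulate), (2p + 1)² candidate positions per macroblock, and (720/16) × (576/16) macroblocks per frame.

```python
def fsbm_ops_per_second(width: int, height: int, n: int, p: int, frame_rate: float,
                        ops_per_pixel: int = 3) -> float:
    """Operations per second for full-search block matching.

    Each candidate costs n*n*ops_per_pixel operations (subtract, absolute value,
    accumulate); there are (2p+1)^2 candidates per macroblock and
    (width/n)*(height/n) macroblocks per frame.
    """
    blocks = (width // n) * (height // n)
    candidates = (2 * p + 1) ** 2
    return blocks * candidates * n * n * ops_per_pixel * frame_rate

ops = fsbm_ops_per_second(720, 576, n=16, p=8, frame_rate=25)
print(f"{ops / 1e9:.2f} GOPS")   # 8.99 GOPS
```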

Why Reconfiguration in Motion Estimation?

• Adjusting the search area at frame rate according to the changing characteristics of video sequences
• Reducing power consumption by avoiding unnecessary computation

[Figure: motion vector distributions]

Architecture for Motion Estimation

[Figure: motion estimation architecture]
Source: P. Pirsch et al., “VLSI Architectures for Video Compression,” Proceedings of the IEEE, 1995.

DIGLOG Multiplier

$$C_{mult}(n) = 253\, n^2, \qquad C_{add}(n) = 214\, n, \qquad \text{where } n \text{ is the word length in bits}$$

$$A = 2^j + A_R, \qquad B = 2^k + B_R$$
$$A \cdot B = (2^j + A_R)(2^k + B_R) = 2^{j+k} + 2^j B_R + 2^k A_R + A_R B_R$$

Iteration | 1st | 2nd | 3rd
Worst-case error | -25% | -6% | -1.6%
Probability of error < 1% | 10% | 70% | 99.8%

With an 8 × 8 multiplier, the exact result is obtained after at most seven iteration steps (worst case).
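The DIGLOG expansion can be sketched recursively. Each step keeps the shift-and-add terms 2^{j+k} + 2^j B_R + 2^k A_R and approximates the residual product A_R B_R by applying the same step to the residues. This is a minimal integer sketch of that idea, not the original hardware. Under this counting, 8-bit operands are always exact after 8 iterations, and the slide's "seven steps" may count iterations differently. For 255 × 255 the errors after one and two iterations are about -24.8% and -6.1%, close to the table's worst-case -25% and -6%.

```python
def diglog(a: int, b: int, iterations: int) -> int:
    """Approximate a*b (unsigned) with the DIGLOG scheme.

    With A = 2^j + A_R and B = 2^k + B_R:
        A*B = 2^(j+k) + 2^j*B_R + 2^k*A_R + A_R*B_R
    The first three terms need only shifts and adds; the residual A_R*B_R is
    approximated by one more application of the same step per iteration.
    """
    if a == 0 or b == 0 or iterations == 0:
        return 0
    j, k = a.bit_length() - 1, b.bit_length() - 1   # leading-one positions
    a_r, b_r = a - (1 << j), b - (1 << k)
    approx = (1 << (j + k)) + (b_r << j) + (a_r << k)
    return approx + diglog(a_r, b_r, iterations - 1)

def relative_error(a: int, b: int, iterations: int) -> float:
    return (diglog(a, b, iterations) - a * b) / (a * b)

for it in (1, 2, 3):
    print(it, f"{relative_error(255, 255, it):+.2%}")
```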

Low Power CDMA Searcher

RTL-level low-power design and implementation of the searcher engine in an MSM (Mobile Station Modem) chip for CDMA handsets. Operating frequency: 12.5 MHz. The low-power design uses rescheduling based on a data flow graph, pre-computation, strength reduction, and a synchronous accumulator, and reduces area and power by up to 67.68% and 41.35%, respectively.

• San Kim and Jun-Dong Cho, “Low Power CDMA Searcher,” CAD and VLSI Workshop, May 1999.
• Inki Hwang, San Kim, and Jun-Dong Cho, “CDMA Searcher Co-Design,” ASIC Workshop, Sep. 1999.

CDMA Searcher

Figure 1. Detailed block diagram

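Searcher

• A unit that acquires initial synchronization in an IS-95-based DS/CDMA system, using the pilot channel transmitted by the base station as its input
• Types of searchers
  – Correlator-based and matched-filter-based approaches
  – This design uses serial search with a correlator and the double-dwell method
• Local (handset) PN code generator
  – Generated using 15 registers
  – Generator polynomial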
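The generator polynomial on this slide did not survive extraction. As an illustration only, the sketch below uses the IS-95 I-channel short-code polynomial P_I(x) = x^15 + x^13 + x^9 + x^8 + x^7 + x^5 + 1 with a 15-stage register. Treat it as an assumption, not as the polynomial of this particular design. The extra zero that IS-95 inserts to stretch the period to 2^15 is omitted.

```python
from typing import List

# Assumed polynomial P_I(x) = x^15 + x^13 + x^9 + x^8 + x^7 + x^5 + 1, i.e. the recurrence
# i(n) = i(n-15) ^ i(n-10) ^ i(n-8) ^ i(n-7) ^ i(n-6) ^ i(n-2).
TAPS = (0, 5, 7, 8, 9, 13)   # register bits holding i(n-15), i(n-10), i(n-8), i(n-7), i(n-6), i(n-2)

def step(state: int) -> int:
    """Advance the 15-bit register: bit 0 holds i(n-15), bit 14 holds i(n-1)."""
    new = 0
    for t in TAPS:
        new ^= (state >> t) & 1
    return (state >> 1) | (new << 14)

def pn_bits(count: int, state: int = 1) -> List[int]:
    """First `count` output chips (0/1), starting from a non-zero register state."""
    bits = []
    for _ in range(count):
        bits.append(state & 1)   # output the oldest bit, i(n-15)
        state = step(state)
    return bits

def period(state: int = 1) -> int:
    """Number of steps until the register returns to its starting state."""
    s, n = step(state), 1
    while s != state:
        s, n = step(s), n + 1
    return n

print(period(), sum(pn_bits(2 ** 15 - 1)))
```

A maximal-length 15-stage register repeats after 2^15 - 1 = 32767 chips and contains 16384 ones per period.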

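Operation Flow

1. The pilot channel transmitted by the base station is despread with the PN code sequence generated in the handset.
2. The despread result is accumulated Nc times (coherent, or synchronous, accumulation), and its energy is then computed (squaring).
3. Each energy value is compared with the first threshold; if it exceeds the threshold, non-coherent (asynchronous) accumulation over Nn values is performed in the following stage.
4. Otherwise, the PN code sequence is generated one chip earlier and the above process is repeated on the incoming signal.
5. The non-coherently accumulated result is compared with the second threshold.
6. If it exceeds the threshold, the search ends; otherwise, the PN code sequence is generated one chip earlier and the process is repeated.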
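The six steps map onto a short serial-search loop. This is a simplified sketch. It uses a real-valued (I-only) signal with no noise and a period-31 m-sequence from x^5 + x^2 + 1 instead of the IS-95 code. The code phase of 7 chips and the thresholds, set at a quarter of the ideal energies, are hypothetical.

```python
from typing import List, Optional, Sequence

def m_sequence(taps: Sequence[int], degree: int, seed: int = 1) -> List[int]:
    """One period of ±1 chips from the LFSR recurrence s[n] = XOR of s[n-d] for d in taps."""
    state = [(seed >> i) & 1 for i in range(degree)]   # state[0] is the oldest bit
    chips = []
    for _ in range((1 << degree) - 1):
        new = 0
        for d in taps:
            new ^= state[degree - d]
        chips.append(1 - 2 * state[0])                  # bit 0 -> +1, bit 1 -> -1
        state = state[1:] + [new]
    return chips

def correlate(rx: Sequence[int], pn: Sequence[int], hyp: int, start: int, length: int) -> int:
    """Despread `length` chips from `start` with the local PN delayed by `hyp` chips."""
    n = len(pn)
    return sum(rx[start + i] * pn[(start + i - hyp) % n] for i in range(length))

def double_dwell_search(rx, pn, nc, nn, theta1, theta2) -> Optional[int]:
    """Serial search: advance the local PN by one chip after every rejected hypothesis."""
    for hyp in range(len(pn)):
        e1 = correlate(rx, pn, hyp, 0, nc) ** 2                       # steps 1-2: Nc coherent sum, energy
        if e1 <= theta1:                                              # steps 3-4: first threshold
            continue
        e2 = sum(correlate(rx, pn, hyp, m * nc, nc) ** 2 for m in range(nn))  # Nn non-coherent sum
        if e2 > theta2:                                               # steps 5-6: second threshold
            return hyp
    return None

PN = m_sequence(taps=(3, 5), degree=5)                        # period 31, from x^5 + x^2 + 1
DELAY = 7                                                     # hypothetical pilot code phase
RX = [PN[(i - DELAY) % len(PN)] for i in range(4 * len(PN))]  # noiseless received chips
found = double_dwell_search(RX, PN, nc=31, nn=4, theta1=(0.5 * 31) ** 2, theta2=4 * (0.5 * 31) ** 2)
print("acquired code phase:", found)
```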

Pre-computation

◈ A comparator example: Srinivas Devadas, 1994
◈ Pre-computation for external idleness: M. Alidina, 1994

Low Power Comparator
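In the classic pre-computation comparator, small logic looks only at the most significant bits. When they differ, the result of A > B is already known, and the registers feeding the lower bits are not loaded in that cycle. This is a minimal behavioral sketch. For uniformly random inputs the lower-bit logic stays idle half the time.

```python
from itertools import product
from typing import Tuple

def precomputed_greater(a: int, b: int, n: int) -> Tuple[bool, bool]:
    """Return (a > b, full_compare_needed) for n-bit unsigned a and b.

    The pre-computation logic inspects only the MSBs. If they differ, the answer
    is known and the load enable of the lower n-1 bit registers stays off.
    """
    msb_a, msb_b = (a >> (n - 1)) & 1, (b >> (n - 1)) & 1
    if msb_a != msb_b:
        return msb_a > msb_b, False
    low_mask = (1 << (n - 1)) - 1
    return (a & low_mask) > (b & low_mask), True

def idle_fraction(n: int) -> float:
    """Fraction of uniformly random input pairs for which the lower-bit logic is idle."""
    pairs = list(product(range(1 << n), repeat=2))
    idle = sum(1 for a, b in pairs if not precomputed_greater(a, b, n)[1])
    return idle / len(pairs)

print(f"lower-bit registers idle for {idle_fraction(4):.0%} of random inputs")
```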

Three-Input ALU (Ovadia Bat-Sheva, 1998)

The three-input ALU (3I-ALU) consumes much less power than a separate ALU and ASU.

A drawback of using a 3I-ALU is the added complexity in calculating the carry and overflow.

[Figure: two-ALU structure (MUL0 and MUL1 produce P0 and P1, which feed an ALU and an ALU/ASU with accumulators acc0 and acc1) versus three-input ALU structure (P0 and P1 feed a single 3I-ALU with accumulator acc1)]
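A three-operand addition such as acc + P0 + P1 is commonly built from one carry-save level followed by a single carry-propagate adder. This is a plausible core for a 3I-ALU, assumed here rather than taken from the cited design. The sketch below works on unsigned integers. It also hints at the carry/overflow drawback: the final result must be formed from two words, a sum word and a carry word.

```python
def carry_save_add(a: int, b: int, c: int):
    """One carry-save level: three operands in, a sum word and a carry word out, no ripple."""
    s = a ^ b ^ c                                   # bitwise sum of the three inputs
    carry = ((a & b) | (b & c) | (a & c)) << 1      # bitwise majority, shifted to its weight
    return s, carry

def three_input_add(a: int, b: int, c: int) -> int:
    """acc + P0 + P1 using a single carry-propagate addition at the end (unsigned operands)."""
    s, carry = carry_save_add(a, b, c)
    return s + carry                                # the only carry-propagate add

print(three_input_add(5, 3, 6))
```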

Applying Carry-Save Adders and Pre-computation

[Figure: searcher data path before and after applying carry-save adders (CSA). Despreading XORs of RX_I·TX_I, RX_Q·TX_Q, RX_I·TX_Q, and RX_Q·(-TX_I) feed the coherent accumulation stage; the energy computation stage squares the results ((·)²); the maximum is selected and compared with θ1; the result of the non-coherent accumulation stage is compared with θ2.]

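Rescheduled Data Flow Graph

• Coherent accumulation stage: uses carry-save adders (or a three-input ALU)
• Threshold comparison: pre-computation is applied
• Energy computation stage: the number of multiplications is reduced by changing the order of the data flow

[Figure: rescheduled data flow graph with XOR despreading, CSA-based coherent accumulation, absolute values |·|, maximum selection, a single squarer (·)², comparison with θ1, non-coherent accumulation, and comparison with θ2]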
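One way to read "reducing multiplications by changing the data-flow order" is this: squaring is monotonic for non-negative values, so the largest magnitude can be selected before squaring. That replaces one squarer per candidate with a single squarer. The sketch below illustrates only this principle and does not claim to reproduce the exact rescheduled graph.

```python
from typing import Sequence, Tuple

def max_energy_square_first(values: Sequence[int]) -> Tuple[int, int]:
    """Square every correlation result, then select the maximum (one multiply per value)."""
    energies = [v * v for v in values]
    return max(energies), len(values)

def max_energy_select_first(values: Sequence[int]) -> Tuple[int, int]:
    """Select the largest magnitude first, then square once: max(v^2) == max(|v|)^2."""
    largest = max(abs(v) for v in values)
    return largest * largest, 1

corr = [3, -7, 5, -2]   # hypothetical coherent-accumulator outputs
print(max_energy_square_first(corr), max_energy_select_first(corr))
```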

Image Compression

Link Adaptation Technique: Adaptive Modulation and Coding

[Figure: throughput versus C/I for QPSK R=1/4, 8PSK R=1/4, 16QAM R=1/4, and 16QAM R=1/2; the hull of AMC follows the best scheme at each C/I, with a modulation/coding transition from 8PSK to 16QAM]
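Following the AMC hull amounts to choosing, at each measured C/I, the highest-throughput scheme whose required C/I is met. This is a minimal sketch. The C/I thresholds are hypothetical, and throughput is approximated by information bits per symbol (bits per symbol × code rate).

```python
from typing import List, Optional, Tuple

# (scheme, bits per symbol, code rate, minimum C/I in dB); the thresholds are hypothetical.
MCS_TABLE: List[Tuple[str, int, float, float]] = [
    ("QPSK, R=1/4", 2, 0.25, -2.0),
    ("8PSK, R=1/4", 3, 0.25, 1.0),
    ("16QAM, R=1/4", 4, 0.25, 3.0),
    ("16QAM, R=1/2", 4, 0.5, 8.0),
]

def select_mcs(ci_db: float) -> Optional[Tuple[str, float]]:
    """Pick the highest-throughput scheme whose C/I threshold is met (the AMC hull)."""
    best = None
    for name, bits, rate, min_ci in MCS_TABLE:
        if ci_db >= min_ci:
            eff = bits * rate   # information bits per symbol
            if best is None or eff > best[1]:
                best = (name, eff)
    return best

for ci in (-5.0, 0.0, 5.0, 10.0):
    print(ci, select_mcs(ci))
```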
