Low-Power SoC Design Techniques
1
Low-Power SoC Design Techniques
Jun-Dong Cho, SungKyunKwan University
VADA Lab.
2
Contents
• Introduction
• SOC Design Trends
• System Level Low Power Design
• Architecture Level Low Power Design
• Conclusion
3
SOC Design Trends
• Expected to integrate increasingly complex functionality: web browsing, real-time video processing, speech recognition and synthesis
• Average operating power at or below 100 mW and standby power at or below 2 mW
• Performance must increase from 300 million operations per second (MOPS) today to 2500 MOPS in 2016
4
Achieving functionality while maximizing battery life and minimizing size
Example devices: medical watch, cellular phone, digital still camera, hearing aid, cochlear implant, GPS, portable audio, digital radio, noise-cancellation headphones
5
QoS vs. Power
• How accurate should I make my FDCT?
6
SOC Design Characteristics
• The new version of the ITRS predicts that Moore’s law will continue on a two to three year cycle throughout this period (2001-2016)
• One of the key design challenges is to use the dramatically increasing transistor counts effectively, given certain power and productivity constraints
• “Bottom-up”: based on system constraints
• “Top-down”: based on design resource constraints
7
Energy-Flexibility Gap
[Figure: energy efficiency (MOPS/mW, log scale from 0.1 to 1000) versus flexibility]
• Embedded processor (ARM): 0.5 MOPS/mW
• Signal-processing processors (ASIPs, DSPs): 3 MOPS/mW
• Reconfigurable architectures: 10-80 MOPS/mW
• Signal-processing ASIC: 200 MOPS/mW
8
Radio systems
• WiFi: 10-100 Mbit/s, unlicensed band (OFDM, M-ary coding)
• 3G: 0.1-2 Mbit/s, wide-area cellular (CDMA, GMSK)
• Bluetooth: 0.8 Mbit/s, cable replacement (frequency hopping)
• ZigBee: 0.02-0.2 Mbit/s, low power, low cost (QPSK)
• UWB: recently allowed by the FCC (short pulses with no carrier, bi-phase or PPM)
9
Data rate
[Figure: data rate (10 kbit/s to 100 Mbit/s, log scale) versus frequency (0 to 6 GHz) for ZigBee, Bluetooth, 3G, 802.11a, 802.11b, 802.11g, and UWB]
10
Cost (projections)
[Figure: projected cost ($0.10 to $1000, log scale) versus frequency (0 to 6 GHz) for ZigBee, Bluetooth, 3G, 802.11a, 802.11b/g, and UWB]
11
Power Dissipation
[Figure: power dissipation (1 mW to 10 W, log scale) versus frequency (0 to 6 GHz) for ZigBee, Bluetooth, 3G, 802.11a, 802.11b/g, and UWB]
12
Why Low-Power Devices?
• Practical reasons (reducing power requirements of high-throughput portable applications)
• Financial reasons (reducing packaging costs and achieving memory savings)
• Technological reasons (excessive heat prevents the realization of high-density chips and limits their functionality)
13
Different Constraints for Different Application Fields
• Portable devices: battery lifetime
• Telecom and military: reliability (reduced power decreases electromigration, hence increases reliability)
• High-volume products: unit cost (reduced power decreases packaging cost)
14
Driving Forces for Low-Power: Deep-Submicron Technology
• Advantages: smaller geometries, higher clock frequencies
• Disadvantages: higher power consumption, lower reliability
15
Dynamic Power Consumption
• Average power consumption by a node cycling at each period T (each period has a 0→1 or a 1→0 transition):
$P_{switching} = \frac{E_{cycle}}{T} = C \, V_{DD}^{2} \, f_{CLK}$
• Average power consumed by a node with partial activity (only a fraction $\alpha$ of the periods has a transition):
$P_{switching} = \alpha \, C \, V_{DD}^{2} \, f_{CLK}$
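As a rough illustration of the switching-power relation on this slide, the sketch below evaluates P = α · C · V_DD² · f_CLK for a single node. The function name and all numeric values (load capacitance, supply voltages, clock frequency, activity factor) are assumptions chosen for illustration, not figures from the slides.

```python
def switching_power(c_farads, vdd_volts, f_clk_hz, activity=1.0):
    """Average switching power of one node: activity * C * VDD^2 * f_CLK."""
    return activity * c_farads * vdd_volts ** 2 * f_clk_hz

# Hypothetical node: 1 pF load, 100 MHz clock, a transition in 25% of the periods.
p_5v = switching_power(1e-12, 5.0, 100e6, activity=0.25)
p_3v3 = switching_power(1e-12, 3.3, 100e6, activity=0.25)

print(f"5.0 V: {p_5v * 1e3:.4f} mW")
print(f"3.3 V: {p_3v3 * 1e3:.4f} mW")
print(f"ratio: {p_3v3 / p_5v:.4f}")
```

Because the supply voltage enters quadratically, lowering V_DD is the most effective single lever in this expression, which is why the later slides trade architecture changes for a lower supply.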
16
Power Model
• Power dissipation in logic blocks consists of both dynamic (switching) and static (standby) components
17
Power Model
• Memory power is due primarily to the row/column decoders and to bit-line and word-line switching activity
• Consider the power dissipated when the bit lines are switched by approximately VDD during write cycles
18
Chip Composition (Future)
• Low-power digital SOC designs of the future will be 90-95% memory and 5-10% logic, including overhead
• Future chips may be dominated by memory due to power and resource constraints
19
Three Factors Affecting Energy
– Reducing waste by hardware simplification: redundant hardware extraction, locality of reference, demand-driven / data-driven computation, application-specific processing, preservation of data correlations
– All-in-one approach (SOC): I/O pin and buffer reduction
– Voltage-reducible hardware: 2-D pipelining (systolic arrays), parallel processing
20
Low-Power Design Techniques
• Voltage and process scaling
• Design methodologies
– Power-aware design flows and tools, trade area for lower power
• Architecture design
• Power-down techniques
– Clock gating, dynamic power management
• Dynamic voltage scaling based on workload
• Power-conscious RT/logic synthesis
• Better cell library design and resizing methods
– Capacitance reduction, threshold control, transistor layout
21
SoC Design Flow
22
Power Analysis
• Fast and accurate analysis early in the design process
– Power budgeting
– Knowledge-based architectural and implementation decisions
– Package selection
– Power-hungry module identification
• Detailed and comprehensive analysis at the later stages
– Satisfaction of power budget and constraints
– Hot spots
23
Power Savings
24
Estimation Expectations
25
System Level Power Optimization
• Algorithm selection / algorithm transformation
• Identification of hot spots
• Low-power data encoding
• Quality of Service vs. power
• Low-power memory mapping
• Resource sharing / allocation
26
Flow
• C/C++ compilation
• Program execution
• Building design representation
• Loading profiling data
• Setting constraints
• Power estimation
• Identification of hot spots
27
IBM’s PowerPC
• Optimum supply voltage through hardware parallelism, pipelining, and parallel instruction execution
– Five instructions in parallel (IU, FPU, BPU, LSU, SRU), RISC
– FPU is pipelined, so a multiply-add instruction can be issued every clock cycle
– Low-power 3.3-volt design
– 603e provides four software-controllable power-saving modes
• Copper processor with SOI
• IBM’s Blue Logic ASIC: new design reduces power by a factor of 10
28
Silicon-on-Insulator
• How does SOI reduce capacitance?
Junction capacitance is eliminated by placing an insulator layer (similar to glass) between the impurities and the silicon substrate, giving high performance, low power, and a low soft-error rate
29
Why Copper Processor?
• Motivation: Aluminum resists the flow of electricity as wires are made thinner and narrower.
• Performance: 40% speed-up
• Cost: 30% less expensive
• Power: Less power from batteries
• Chip Size: 60% smaller than Aluminum chip
30
Factors Influencing C_eff
• Circuit function
• Circuit technology
• Input probabilities
• Circuit topology
31
Some Basic Definitions
• Signal probability of a signal g(t) is given by
$P(g) = \lim_{T \to \infty} \frac{1}{T} \int_{-T/2}^{T/2} g(t)\,dt$
• Signal activity of a logic signal g(t) is given by
$A(g) = \lim_{T \to \infty} \frac{n_g(T)}{T}$
where $n_g(T)$ is the number of transitions of g(t) in the time interval between –T/2 and T/2.
32
Factors Influencing C_eff: Circuit Function
• Assume that there are M mutually independent signals g1, g2, ..., gM, each having a signal probability Pi and a signal activity Ai, for i = 1, ..., M.
• For static CMOS, the signal probability at the output of a gate is determined according to the probability of 1s (or 0s) in the logic description of the gate:
– Inverter: input P1, output 1-P1
– AND: inputs P1 and P2, output P1P2
– OR: inputs P1 and P2, output 1-(1-P1)(1-P2)
33
Factors Influencing C_eff: Circuit Function (Static CMOS)
• Transistors connected to the same input turn on and off simultaneously when the input changes.
• C_L of a static CMOS gate is charged to VDD any time a 0→1 transition at the output node is required.
• C_L of a static CMOS gate is discharged to ground any time a 1→0 transition at the output node is required.
[Figure: NOR gate]
34
Factors Influencing C_eff: Circuit Function (Static CMOS)
• State transition diagram of the NOR gate
[Figure: state transition diagram of the NOR gate, with transition probabilities written in terms of p_Y and p_Y']
35
Factors Influencing C_eff: Circuit Function (Static CMOS)
• State transition diagram of the NOR gate
[Figure: state transition diagram of the NOR gate, continued, with transition probabilities written in terms of p_Y and p_Y']
36
Factors Influencing C_eff: Input Probabilities (Static CMOS)
• Signal activity calculation: Boolean difference
$\frac{\partial f}{\partial x_i} = f|_{x_i=1} \oplus f|_{x_i=0}$
It signifies the condition under which output f is sensitized to input x_i.
If the primary inputs to function f are not spatially correlated, the signal activity at f is
$A(f) = \sum_{i=1}^{N} P\left(\frac{\partial f}{\partial x_i}\right) A(x_i)$
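The Boolean-difference calculation on this slide can be sketched for small gates by enumerating the other inputs: the probability that f is sensitized to x_i is the total probability of the input combinations for which flipping x_i flips f, and the output activity is the sum of those probabilities weighted by each input's activity. This is a minimal sketch assuming spatially uncorrelated inputs; the function names and the example probabilities and activities are hypothetical.

```python
from itertools import product

def boolean_difference_prob(f, n, i, probs):
    """P(df/dx_i): probability that f changes when x_i flips (independent inputs)."""
    others = [k for k in range(n) if k != i]
    total = 0.0
    for bits in product((0, 1), repeat=n - 1):
        x = [0] * n
        weight = 1.0
        for k, b in zip(others, bits):
            x[k] = b
            weight *= probs[k] if b else 1.0 - probs[k]
        x[i] = 1
        f1 = f(x)
        x[i] = 0
        f0 = f(x)
        if f1 != f0:          # f|x_i=1 XOR f|x_i=0 is true for this combination
            total += weight
    return total

def signal_activity(f, probs, activities):
    """A(f) = sum over i of P(df/dx_i) * A(x_i)."""
    n = len(probs)
    return sum(boolean_difference_prob(f, n, i, probs) * activities[i]
               for i in range(n))

def and2(x):
    return x[0] & x[1]

def nor2(x):
    return 1 - (x[0] | x[1])

probs = [0.25, 0.5]        # hypothetical signal probabilities P1, P2
activities = [0.2, 0.4]    # hypothetical signal activities A1, A2
print(signal_activity(and2, probs, activities))
print(signal_activity(nor2, probs, activities))
```

For the AND gate the Boolean difference with respect to one input is simply the other input, so the result reduces to P2·A1 + P1·A2; for the NOR gate it is the complement of the other input.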
37
Power Reduction Methods: Architecture Driven Supply Voltage Scaling
• Strategy:
1. Modify the architecture of the system so as to make it faster.
2. Reduce VDD so as to restore the original speed. Power consumption has decreased.
• The most common architectural changes rely on the exploitation of parallelization and pipelining.
• Drawback: the additional circuitry required to compensate for the speed degradation may dominate, and the power consumption may increase.
• Consequence: parallelism and pipelining do not always pay off.
38
Parallel Architectures
P_par = 0.36 P_ref
39
Parallel-Pipelined Architectures
P_par = 0.2 P_ref
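The two ratios on these slides can be reproduced from the relation P ∝ C · V_DD² · f. The scaling factors below are assumptions, not values stated in this transcript: they are the figures commonly quoted for the textbook adder-comparator datapath (a duplicated datapath with 2.15x the capacitance at 2.9 V and half the clock rate, and a parallel-pipelined version with 2.5x the capacitance at 2.0 V and half the clock rate, both against a 5 V reference).

```python
def relative_power(cap_ratio, vdd, vdd_ref, freq_ratio):
    """Power relative to the reference design: (C/Cref) * (V/Vref)^2 * (f/fref)."""
    return cap_ratio * (vdd / vdd_ref) ** 2 * freq_ratio

# Assumed factors (see the note above); the reference runs at 5 V and full clock rate.
p_parallel = relative_power(cap_ratio=2.15, vdd=2.9, vdd_ref=5.0, freq_ratio=0.5)
p_parallel_pipelined = relative_power(cap_ratio=2.5, vdd=2.0, vdd_ref=5.0, freq_ratio=0.5)

print(f"parallel:           {p_parallel:.2f} x P_ref")
print(f"parallel-pipelined: {p_parallel_pipelined:.2f} x P_ref")
```

Under these assumptions the extra capacitance more than doubles, yet the quadratic voltage term still leaves a net saving, which is the trade-off described by the strategy and drawback bullets.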
40
Loop unrolling
• The technique of loop unrolling replicates the body of a loop some number of times (unrolling factor u) and then iterates by step u instead of step 1. This transformation reduces the loop overhead, increases the instruction parallelism and improves register, data cache or TLB locality.
Original loop:
for i = 2 to N - 1
  A(i) = A(i) + A(i - 1) * A(i + 1)
Unrolled loop (u = 2):
for i = 2 to N - 2 step 2
  A(i) = A(i) + A(i - 1) * A(i + 1)
  A(i + 1) = A(i + 1) + A(i) * A(i + 2)
Loop overhead is cut in half because two iterations of the original loop are performed in each iteration of the unrolled loop. If array elements are assigned to registers, register locality is improved because A(i) and A(i + 1) are used twice in the loop body. Instruction parallelism is increased because the second assignment can be performed while the results of the first are being stored and the loop variables are being updated.
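A small executable version of the unrolling transformation may help check that it preserves the result. This is a sketch with hypothetical function names; it uses 0-based Python lists, assumes the loop body is A(i) = A(i) + A(i-1) * A(i+1), and adds a cleanup step for the iteration left over when the trip count is odd.

```python
def update_original(a):
    """Reference loop: one element updated per iteration."""
    a = list(a)
    last = len(a) - 1
    for i in range(1, last):              # i = 1 .. last-1
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

def update_unrolled(a):
    """Same computation unrolled by a factor of 2 (step 2, two updates per pass)."""
    a = list(a)
    last = len(a) - 1
    i = 1
    while i + 1 < last:                   # both i and i+1 are valid loop indices
        a[i] = a[i] + a[i - 1] * a[i + 1]
        a[i + 1] = a[i + 1] + a[i] * a[i + 2]
        i += 2
    if i < last:                          # leftover iteration for an odd trip count
        a[i] = a[i] + a[i - 1] * a[i + 1]
    return a

sample = [3, 1, 4, 1, 5, 9, 2, 6]
print(update_original(sample) == update_unrolled(sample))
```

The unrolled version evaluates the loop condition and increments the index half as often, which is the overhead reduction the slide refers to.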
41
Loop Unrolling (IIR filter example)
Two output samples are computed in parallel based on two input samples:
$Y_{n-1} = X_{n-1} + A \, Y_{n-2}$
$Y_n = X_n + A \, Y_{n-1} = X_n + A \, (X_{n-1} + A \, Y_{n-2})$
Neither the capacitance switched nor the voltage is altered. However, loop unrolling enables several other transformations (distributivity, constant propagation, and pipelining). After distributivity and constant propagation:
$Y_{n-1} = X_{n-1} + A \, Y_{n-2}$
$Y_n = X_n + A \, X_{n-1} + A^2 \, Y_{n-2}$
The transformation yields a critical path of 3, thus the voltage can be dropped.
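The equivalence of the unrolled first-order IIR recursion can be checked numerically. The sketch below assumes the filter Y(n) = X(n) + A·Y(n-1) and a block form that produces two outputs per step from Y(n-2), with A² precomputed as a constant; the function names and sample values are hypothetical.

```python
def iir_direct(x, a, y_init=0.0):
    """Reference recursion: y[n] = x[n] + a * y[n-1]."""
    y, prev = [], y_init
    for xn in x:
        prev = xn + a * prev
        y.append(prev)
    return y

def iir_unrolled(x, a, y_init=0.0):
    """Two outputs per step, both computed from y[n-2]; len(x) must be even."""
    a_squared = a * a                     # constant propagation: A^2 computed once
    y, y_nm2 = [], y_init
    for n in range(1, len(x), 2):
        y_nm1 = x[n - 1] + a * y_nm2
        y_n = x[n] + a * x[n - 1] + a_squared * y_nm2
        y.extend([y_nm1, y_n])
        y_nm2 = y_n
    return y

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
direct = iir_direct(x, 0.5)
unrolled = iir_unrolled(x, 0.5)
print(max(abs(p - q) for p, q in zip(direct, unrolled)))
```

In the block form, Y(n) no longer waits for Y(n-1), so the two outputs can be evaluated side by side; that shortening of the dependency chain is what creates the timing slack traded for a lower supply voltage.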
42
Loop Unrolling for Low Power
43
Loop Unrolling for Low Power
44
Loop Unrolling for Low Power
45
Encoding
• Bus-invert (BI) code
– Appropriate for random data patterns
– Redundant code (1 extra bus line)
– Reduces average transitions by up to 25%
Example (4-bit data words; the coded bus carries 4 data lines plus the inv line):
Data word | Coded bus | inv
0 0 0 0 | 0 0 0 0 | 0
1 0 1 0 | 1 0 1 0 | 0
0 1 0 0 | 1 0 1 1 | 1
1 1 1 1 | 1 1 1 1 | 0
1 0 1 0 | 1 0 1 0 | 0
0 1 0 0 | 1 0 1 1 | 1
1 1 0 1 | 0 0 1 0 | 1
0 0 1 1 | 0 0 1 1 | 0
[Figure: bus-invert encoder/decoder block diagram with signals X, D, Z, a majority voter, and the inv line]
R. J. Fletcher, “Integrated circuit having outputs configured for reduced state changes,” U.S. Patent 4667337, May 1987.
M. R. Stan and W. P. Burleson, “Bus-invert coding for low-power I/O,” IEEE Trans. on VLSI Systems, Mar. 1995, pp. 49-58.
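A minimal sketch of a bus-invert encoder follows. It assumes the decision rule that the next word is sent inverted when the Hamming distance between the current bus state (data lines plus the invert line) and the next word exceeds half the data width; the function names are hypothetical. The eight 4-bit words are the data sequence shown on this slide.

```python
def bus_invert_encode(words, width):
    """Return a list of (bus_value, inv) pairs for a sequence of data words."""
    mask = (1 << width) - 1
    bus, inv = 0, 0                       # assumed initial bus state: all lines low
    coded = []
    for word in words:
        # Lines that would toggle if the word were sent as-is (inv driven to 0).
        distance = bin((bus ^ word) & mask).count("1") + inv
        if distance > width / 2:
            bus, inv = ~word & mask, 1    # send the complement, raise inv
        else:
            bus, inv = word & mask, 0
        coded.append((bus, inv))
    return coded

def count_transitions(values):
    """Total number of bit toggles between consecutive values."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))

data = [0b0000, 0b1010, 0b0100, 0b1111, 0b1010, 0b0100, 0b1101, 0b0011]
coded = bus_invert_encode(data, 4)
coded_lines = [(bus << 1) | inv for bus, inv in coded]   # 5 physical lines

print([f"{bus:04b} {inv}" for bus, inv in coded])
print(count_transitions(data), count_transitions(coded_lines))
```

The receiver recovers each word by complementing the data lines whenever inv is 1, so the only cost is the extra line and the majority-voter logic.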
46
Different Supply Voltages for Different Units
• Partition the chip into multiple sub-units, each of which is designed to operate at a specific supply voltage
[Figure: chip floorplan with FAST and SLOW units in 5 V and 3 V supply domains]
47
COFDM Modem Block Diagram for Eureka 147/KDMB
[Figure: block diagram with the following blocks: serial data, Reed-Solomon encoder, Reed-Solomon decoder, convolutional interleaver, convolutional deinterleaver, two scrambler blocks, convolutional encoder, Viterbi decoder, time interleaver, time deinterleaver, COFDM modulator (FFT), COFDM modulator (IFFT, phase/timing lock, frame sync), channel (Gaussian, Ricean, Rayleigh), and BERT (bit-error-ratio tester)]
48
DMB Modem: Domestic and International Status
Vendor | Product and key features
TI (USA) | DRE200: COFDM, audio, and FEC decoding on a general-purpose DSP, 160 mW
ATMEL (Germany) | U2739M: COFDM demodulation on an Oak DSP, hardware audio/FEC decoding, 860 mW
Panasonic (Japan) | MN66720UC: SDSP for COFDM, MDSP for audio
Frontier Silicon (UK) | Chorus FS1010: special DSP for COFDM/audio, 100 mW
49
Status of Low-Power Technology Development
(developer: application or product; features)
• IBM Austin, Low Power Computing Research: DPM (PowerPC 405LP), portable processor; Linux power management (90% power reduction)
• DoD DARPA, Power Aware Communication: power management, scheduling, OS systems
• Philips, STMicroelectronics, Atmel: PCF50606, single-chip power management unit (for smart phones and wireless PDAs); programmed power management (70% power reduction)
• Atrenta, SpyGlass CAD tool: generates gated-clock structures from RTL-level HDL and SystemC
50
VADA Lab’s Low-Power IPs
• Low-Power Equalizer for xDSL: 21% power reduction, SNR = 40 dB
• Fast and Low Power Viterbi Search Engine using Inverse Hidden Markov Model: 68% power reduction, 71% speed improvement, 1.9x area increase (Samsung Humantech outstanding paper award, 2002)
• Maximizing Memory Data Reuse for Lower Power Motion Estimation: 33% power reduction, 52 MHz, 2.1x area increase (SCI journal paper)
• Low-power design and co-design of the Double Dwell Searcher for IS-95 based CDMA: 67% power reduction, 41% area reduction
• OFDM-based high-speed wireless LAN platform: 20.7 MHz, 237,000 gates
• Next-generation low-power security processor chip for smart cards: ECC, Rijndael, DES, SHA
• Highly flexible design of an OFDM transceiver for DVB-T (in development)
[Figures: OFDM receiver partitioned between DSP and ASIC (ADC, AGC, NCO, GI removal, GI/FFT detector, FFT, phase rotator, coarse and fine STR, CR, channel estimator/equalizer, Viterbi FEC, timing processor); crypto processor connected to a host CPU over an 8-bit address bus and 32-bit data buses (Key_add, Byte_Sub, Shift_Low, Mix_Column, key generation, control); frequency-domain equalizer with coefficient update, conjugator, error control, and learning-constant control; bar chart comparing the conventional FEQ and the low-power FEQ; Viterbi search engine with buffer, PEs, comparators, control generator, and memory; motion estimation array of modified PEs with search and reference data buffers, address generator, shift registers, comparators, and motion vector output]
51
Other Low-Power Design Examples
• Use of alternative number systems
• Scheduling/ordering
• Algorithm substitution
• Signal and statistical analysis
52
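Low Power through Number System Conversion – I.1
• Use of the Logarithmic Number System (LNS)
• Log number system
– Applied to the FFT, the largest of the arithmetic modules
– The look-up table is the dominant factor in size
– A number is split into a sign field and a magnitude field, and the base-2 logarithm of the magnitude is computed.
– The resulting log value is represented as a binary number whose range is limited to n bits.
• LNS arithmetic
– Multiplication: addition
– Addition/subtraction: addition, subtraction, and a look-up table
• Accuracy
– No BER degradation when the fractional part has 2 or more bits
• Power consumption
– In experiments, power consumption is reduced by up to about 60% compared with a conventional butterfly FFT
– 7.8 mW -> 3.1 mW
LNS representation of a number A (sign $S_A$, log magnitude $L_A$, b fractional bits):
$S_A = 0$ if $A \ge 0$, $S_A = 1$ if $A < 0$
$L_A = \log_2 |A|$ for $A \ne 0$
$A = (1 - 2 S_A) \cdot 2^{L_A}$
$\hat{L}_A$ is $L_A$ rounded to b fractional bits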
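The central property of the logarithmic number system, that a multiplication becomes an addition of log magnitudes plus an XOR of signs, can be sketched as follows. The function names and the choice of 8 fractional bits are assumptions for illustration; zero is not representable in this simplified form.

```python
import math

def to_lns(value, frac_bits=8):
    """Encode a non-zero number as (sign, log2 magnitude rounded to frac_bits)."""
    sign = 1 if value < 0 else 0
    scale = 1 << frac_bits
    log_mag = round(math.log2(abs(value)) * scale) / scale
    return sign, log_mag

def from_lns(sign, log_mag):
    """Decode: (1 - 2*sign) * 2^log_mag."""
    return (1 - 2 * sign) * 2.0 ** log_mag

def lns_multiply(x, y):
    """Multiplication in LNS: XOR the signs, add the log magnitudes."""
    return x[0] ^ y[0], x[1] + y[1]

product = from_lns(*lns_multiply(to_lns(3.0), to_lns(-4.0)))
print(product)
```

The rounding of the log magnitude is the only source of error here, which is why the number of fractional bits sets the accuracy of the result.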
53
Low Power through Number System Conversion – I.2
54
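Low Power through Reordering of Operations – I.1
• Coefficient ordering
– The order of operations is changed to reduce the power consumption of a radix-4 pipelined low-power FFT processor
– Reduces switching activity at the fixed coefficient inputs of the complex multiplier
• New commutator structure
– Uses an additional dual-port RAM
– Power reductions of 23% and 9% for 16-point and 64-point FFTs, respectively
• The benefit decreases for larger FFTs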
55
Low Power through Reordering of Operations – I.2
56
Low Power through Algorithm Substitution – I.1
• Applied to a 64-point FFT
– The 64-point FFT equation is rewritten by an algorithm transformation
– It is decomposed into a two-dimensional structure of two 8-point FFTs.
• Complex multiplications are implemented with shift-and-add.
• Power consumption
– Average dynamic power of 41 mW at 20 MHz with a 1.8 V supply in an in-house 0.25 µm BiCMOS technology
57
Low Power through Algorithm Substitution – I.2
$A(r) = \sum_{k=0}^{N-1} B(k) \, W_N^{rk}$
$A(s + 8t) = \sum_{l=0}^{7} W_8^{lt} \left[ W_{64}^{sl} \sum_{m=0}^{7} B(l + 8m) \, W_8^{sm} \right]$
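The decomposition of the 64-point transform into 8-point transforms can be verified numerically. The sketch assumes the usual convention W_N = exp(-j2π/N) and the index mapping r = s + 8t, k = l + 8m, under which W_64^{rk} factors into W_64^{sl} · W_8^{sm} · W_8^{lt}; the function names and the test signal are hypothetical.

```python
import cmath

def twiddle(n, k):
    """W_n^k with W_n = exp(-2*pi*j / n)."""
    return cmath.exp(-2j * cmath.pi * k / n)

def dft_direct(b):
    """Direct N-point DFT: A(r) = sum_k B(k) * W_N^(r*k)."""
    n = len(b)
    return [sum(b[k] * twiddle(n, r * k) for k in range(n)) for r in range(n)]

def dft64_as_8x8(b):
    """64-point DFT built from 8-point DFTs and W_64 twiddle factors."""
    assert len(b) == 64
    a = [0j] * 64
    for s in range(8):
        # Inner 8-point sums over m for each l, scaled by the twiddle W_64^(s*l).
        inner = [twiddle(64, s * l) *
                 sum(b[l + 8 * m] * twiddle(8, s * m) for m in range(8))
                 for l in range(8)]
        # Outer 8-point sums over l give the outputs with index s + 8t.
        for t in range(8):
            a[s + 8 * t] = sum(inner[l] * twiddle(8, l * t) for l in range(8))
    return a

signal = [complex(k % 5, -(k % 3)) for k in range(64)]
reference = dft_direct(signal)
decomposed = dft64_as_8x8(signal)
print(max(abs(p - q) for p, q in zip(reference, decomposed)))
```

Only the W_64^{sl} factors need general complex multipliers in this form, since the 8-point twiddles are limited to a few fixed values.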
58
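Low Power through Signal and Statistical Analysis – I.1
• Share of power consumption
– About half of the total power is consumed in the complex multipliers.
• Analysis of the butterfly multiplications
– For coefficient multiplications
• In a generic stage, 0.25*M+3 of the M coefficients are 1
– Clock gating can be used for cosine and sine values of (1, 0)
• Frequency division duplex modems
– 4096 DMT with the ETSI-standard 4.3125 kHz tone spacing
• 41% of the carriers are upstream, 26% are downstream, and the remaining 30% are unused.
– 1024 DMT with the ETSI-standard 4.3125 kHz tone spacing
• The shares are 13%, 68%, and 18%, respectively.
– 59~87% of the IFFT (upstream) inputs are 0, and 31~74% of the FFT (downstream) inputs are 0.
– Clock gating is possible.
– It can be applied at the initial input stage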
59
Clock Network Power Management
• 50% of the total power
• FIR (massively pipelined circuit):
– Video processing: edge detection
– Voice processing (data transmission like xDSL)
– Telephony: 50% (70%/30%) idle, since the two parties do not talk at the same time
• With every clock cycle, data are loaded into the working register banks, even if there are no data changes.
60
Wireless Interface Power-Saving
Ronny Krashinsky and Hari Balakrishnan, MIT Laboratory for Computer Science
• Sleep to save energy, periodically wake to check for pending data
– PSM protocol: when to sleep and when to wake?
• A PSM-static protocol has a regular sleep/wake cycle
[Figure: power versus time with PSM off and PSM on; labels: 750 mW, 50 mW, 100 ms]
Measurements of Enterasys Networks RoamAbout 802.11 NIC
61
[Figure: timing diagrams (0 ms, 100 ms, 200 ms) of SYN, ACK, and DATA exchanges among the mobile device, access point, and server with PSM off and PSM on, showing the AWAKE and SLEEP intervals]
Ronny Krashinsky and Hari Balakrishnan, MIT
62
The PSM-static Dilemma
• Compromise between performance and energy
– If PSM-static is too coarse-grained, it harms performance by delaying network data
– If PSM-static is too fine-grained, it wastes energy by waking unnecessarily
• Solution: dynamically adapt to network activity to maintain performance while minimizing energy
– Stay awake to avoid delaying very fast RTTs
– Back off (listen to fewer beacons) while idle
63
Why Hardware for Motion Estimation?
• Most computationally demanding part of video encoding
• Example: CCIR 601 format
– 720 by 576 pixels
– 16 by 16 macroblock (n = 16)
– 32 by 32 search area (p = 8)
– 25 Hz frame rate (f_frame = 25)
– 9 giga-operations per second are needed for the full-search block matching algorithm.
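The 9 giga-operations figure can be reproduced with a short count. The sketch assumes (2p+1)² candidate positions per macroblock and 3 operations per pixel comparison (subtract, absolute value, accumulate); both the operation count per pixel and the function name are assumptions, not values given on the slide.

```python
def full_search_ops_per_second(width, height, frame_rate, p, ops_per_pixel=3):
    """Operation rate of full-search block matching over a +/-p search range."""
    candidates = (2 * p + 1) ** 2         # candidate displacements per macroblock
    # Every pixel of every macroblock is compared once per candidate position,
    # so the macroblock size cancels out of the total.
    return width * height * candidates * ops_per_pixel * frame_rate

ops = full_search_ops_per_second(width=720, height=576, frame_rate=25, p=8)
print(f"{ops / 1e9:.2f} giga-operations per second")
```

The count grows with the square of the search range p, which is why adapting the search area to the video content is an effective power lever.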
64
Why Reconfiguration in Motion Estimation?
• Adjusting the search area at frame rate according to the changing characteristics of video sequences
• Reducing power consumption by avoiding unnecessary computation
[Figure: motion vector distributions]
65
Architecture for Motion Estimation
From P. Pirsch et al., VLSI Architectures for Video Compression, Proc. of the IEEE, 1995
66
DIGLOG multiplier
$C_{mult}(n) = 253 n^2, \quad C_{add}(n) = 214 n$, where n = word length in bits
$A = 2^j + A_R, \quad B = 2^k + B_R$
$A \cdot B = (2^j + A_R)(2^k + B_R) = 2^{j+k} + 2^j B_R + 2^k A_R + A_R B_R$
Iteration | 1st | 2nd | 3rd
Worst-case error | -25% | -6% | -1.6%
Prob. of error < 1% | 10% | 70% | 99.8%
With an 8 by 8 multiplier, the exact result can be obtained at a maximum of seven iteration steps (worst case)
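The iterative structure of the DIGLOG multiplier can be sketched as follows. Each pass splits both operands into their leading power of two and a residual, A = 2^j + A_R and B = 2^k + B_R, accumulates 2^(j+k) + A_R·2^k + B_R·2^j using only shifts and adds, and leaves the residual product A_R·B_R for the next pass. The function name and the iteration-counting convention are assumptions of this sketch.

```python
def diglog_multiply(a, b, iterations):
    """Approximate a * b for non-negative integers using shifts and adds only."""
    result = 0
    for _ in range(iterations):
        if a == 0 or b == 0:
            break                         # residual product is zero: result exact
        j = a.bit_length() - 1            # position of the leading one of a
        k = b.bit_length() - 1            # position of the leading one of b
        a_r = a - (1 << j)                # residual A_R
        b_r = b - (1 << k)                # residual B_R
        result += (1 << (j + k)) + (a_r << k) + (b_r << j)
        a, b = a_r, b_r                   # next pass refines the A_R * B_R term
    return result

exact = 200 * 150
for steps in (1, 2, 3):
    approx = diglog_multiply(200, 150, steps)
    print(steps, approx, f"{(approx - exact) / exact:.2%}")
```

The approximation always underestimates the product, because the dropped term A_R·B_R is non-negative, and each additional pass reduces the bound on the relative error by a factor of four.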
67
Low Power CDMA Searcher
• RTL-level low-power design and implementation of the Searcher Engine of an MSM (Mobile Station Modem) chip for CDMA handsets.
• Operating frequency: 12.5 MHz
• Low-power design using a data flow graph for rescheduling, pre-computation, strength reduction, and a Synchronous Accumulator, reducing area and power by up to 67.68% and 41.35%, respectively.
• San Kim and Jun-Dong Cho, “Low Power CDMA Searcher”, CAD and VLSI Workshop, May 1999.
• Inki Hwang, San Kim and Jun-Dong Cho, “CDMA Searcher Co-Design”, ASIC Workshop, Sep. 1999.
68
CDMA Searcher
Figure 1. Detailed block diagram
69
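Searcher
• In an IS-95 based DS/CDMA system, the searcher takes the pilot channel transmitted by the base station as input and acquires initial synchronization.
• Types of searcher
– Correlator-based and matched-filter-based approaches
– This design uses a correlator-based serial search with the Double Dwell scheme.
• Local (handset) PN code generator
– Generated using 15 registers
– Generator polynomial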
70
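Operation Flow
1. The pilot channel transmitted by the base station is despread with the PN code sequence generated in the handset.
2. The despread result is accumulated Nc times (coherent accumulation), and then the energy is calculated (squaring).
3. If the energy exceeds the first threshold, noncoherent accumulation (Nn) is performed in the following stage.
4. Otherwise, the PN code sequence is advanced by one chip and the process is repeated on the incoming signal.
5. The result of the noncoherent accumulation is compared with the second threshold.
6. If it exceeds the threshold, the search ends; otherwise, the PN code sequence is advanced by one chip and the process is repeated.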
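The double-dwell search flow can be sketched in a few lines. This is a simplified, real-valued illustration under stated assumptions: it uses a 7-chip ±1 m-sequence instead of the 15-register PN generator, a noise-free received signal, no I/Q channels, and hypothetical names and threshold values.

```python
def double_dwell_search(rx, pn, nc, nn, theta1, theta2):
    """Return the PN phase that passes both dwells, or None if none does."""
    period = len(pn)

    def energy(phase, block):
        # Despread nc chips and accumulate coherently, then square (energy).
        start = block * nc
        acc = sum(rx[(start + i) % len(rx)] * pn[(start + i + phase) % period]
                  for i in range(nc))
        return acc * acc

    for phase in range(period):           # advance the local PN code chip by chip
        if energy(phase, 0) <= theta1:    # first dwell: quick rejection
            continue
        total = sum(energy(phase, block) for block in range(nn))  # noncoherent sum
        if total > theta2:                # second dwell: confirmation
            return phase
    return None

PN = [1, 1, 1, -1, 1, -1, -1]             # 7-chip m-sequence in +/-1 form

def received(offset, length):
    """Noise-free pilot: the PN sequence delayed by an unknown offset."""
    return [PN[(i + offset) % len(PN)] for i in range(length)]

found = double_dwell_search(received(4, 21), PN, nc=7, nn=3, theta1=25, theta2=75)
print(found)
```

Most wrong phases are rejected after the short first dwell, so the longer noncoherent accumulation runs only for promising candidates; that early rejection is where the scheme saves computation.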
71
Pre-computation
◈ A comparator example: Shrinivas Devadas, 1994
◈ Precomputation for external idleness: M. Alidina, 1994
72
Low Power Comparator
73
Three Input ALU (Ovadia Bat-Sheva, 1998)
• The three-input ALU consumes much less power than an ALU and an ASU.
• A drawback of using a 3I-ALU is the added complexity in calculating the carry and overflow.
[Figure: two-ALU structure (MUL0 and MUL1 produce P0 and P1, feeding an ALU and an ALU/ASU with accumulators acc0 and acc1) versus three-input ALU structure (MUL0 and MUL1 produce P0 and P1, feeding a single 3I-ALU with accumulator acc1)]
74
Applying Carry Save Adders and Pre-computation
[Figure: searcher data flow before and after introducing carry-save adders (CSA): XOR despreading of RX I and RX Q with TX I and TX Q, coherent accumulation stage, energy calculation stage (squaring), selection of the max value, comparison with θ1, noncoherent accumulation stage, comparison with θ2]
75
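Rescheduled Data Flow Graph
• Coherent accumulation stage
– Uses a carry-save adder (or 3-input ALU)
• Threshold comparison
– Pre-computation applied
• Energy calculation stage
– The data flow order is changed to reduce the number of multiplications
[Figure: rescheduled data flow graph: XOR despreading of RX I and RX Q with TX I and TX Q, CSA-based coherent accumulation stage, |·| blocks, energy calculation stage with a single squarer, selection of the max value, comparison with θ1, noncoherent accumulation stage, comparison with θ2]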
76
Image Compression
77
Link Adaptation Technique: Adaptive Modulation and Coding
[Figure: throughput versus C/I for QPSK R=1/4, 8PSK R=1/4, 16QAM R=1/4, and 16QAM R=1/2; the hull of the AMC curves; modulation/coding transition marked at 8PSK->16QAM]