
IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, No. 9, September 2010

Parallel and Pipeline Architectures for High-Throughput Computation of Multilevel 3-D DWT

Basant K. Mohanty, Member, IEEE, and Pramod K. Meher, Senior Member, IEEE

Abstract—In this paper, we present a throughput-scalable parallel and pipeline architecture for high-throughput computation of multilevel 3-D discrete wavelet transform (3-D DWT). The computation of 3-D DWT for each level of decomposition is split into three distinct stages, and all three stages are implemented in parallel by a processing unit consisting of an array of processing modules. The processing unit for the first-level decomposition of a video stream of frame-size (M × N) consists of Q/2 processing modules, where Q is the number of input samples available to the structure in each clock cycle. The processing unit for a higher level of decomposition requires 1/8 times the number of processing modules required by the processing unit for its preceding level. For J-level 3-D DWT of a video stream, each of the proposed structures involves J processing units in a cascaded pipeline. The proposed structures have a small output latency, and can perform multilevel 3-D DWT computation with 100% hardware utilization efficiency. The throughput rates of the proposed structures are Q/7 times higher than the best of the corresponding existing structures. Interestingly, the proposed structures involve a frame-buffer of O(MN), while the frame-buffer size of the existing structures is O(MNR). Besides, the on-chip storage and the frame-buffer size of the proposed structure are independent of the input block size, which favors the derivation of highly concurrent parallel architectures for high-throughput implementation. The overall area-delay products of the proposed structure are significantly lower than those of the existing structures, although it involves slightly higher multiplier-delay and adder-delay products, since it involves a significantly smaller frame-buffer and storage-word-delay product. The throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage and frame-memory, by using a larger number of processing modules, and it provides greater advantage over the existing designs for higher frame-rates and higher input block sizes. The fully parallel implementation of the proposed scalable structure provides the best of its performance. When the very high throughput generated by such a parallel structure is not required, the structure could be operated by a slower clock, where speed could be traded for power by scaling down the operating voltage, and/or the processing modules could be implemented by slower but hardware-efficient arithmetic circuits.

Manuscript received January 17, 2009; revised September 2, 2009 and December 24, 2009. Date of publication July 26, 2010; date of current version September 9, 2010. This paper was recommended by Associate Editor S.-Y. Chien.

B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee Institute of Engineering and Technology, Guna 473226, Madhya Pradesh, India (e-mail: [email protected]).

P. K. Meher is with the Department of Embedded Systems, Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TCSVT.2010.2056950

Index Terms—3-D DWT, discrete wavelet transform, parallel and pipeline architecture, very large scale integration (VLSI).

    I. Introduction

THE DISCRETE wavelet transform (DWT) is widely used due to its multiple time-frequency resolution, which gives it a remarkable advantage over unitary transforms such as the discrete Fourier transform (DFT), discrete cosine transform (DCT), and discrete sine transform (DST) for various applications. DWT of different dimensions has emerged as a powerful tool for speech and image coding in recent years. The 3-D DWT is found to provide superior performance in video compression by eliminating the temporal redundancies within the video sequences for motion compensation [1]. Apart from that, 3-D DWT has been used popularly for the compression of 3-D and 4-D medical images, volumetric image compression, video watermarking, etc. [2]–[6]. The multidimensional DWTs are particularly computation intensive and, therefore, need to be implemented in very large scale integrated (VLSI) systems for real-time applications.

The 2-D and 3-D DWTs can be realized either by a separable approach or by a non-separable approach. The separable approach is more popular than the other, because it involves less computation for the same throughput performance. The main building blocks of separable multidimensional DWT cores are constituted by lower dimensional transform modules and one or more transposition units. Although the speed performance and hardware complexity of a multidimensional DWT architecture substantially depend on the implementation of its lower dimensional modules, the area and time overhead of the transposition unit significantly affects the overall performance of the computing structure [7]–[9]. In order to avoid the transposition unit of 2-D DWT computation, some pipeline architectures have been proposed for the implementation of non-separable 2-D DWT [10]–[12]. Attempts have also been made to reduce the complexities of the transposition unit. Wu et al. [13] have proposed a line-based folded architecture to reduce the size of the transposition buffer and the overall complexity of the separable 2-D DWT device. In this paper, we aim at presenting another possible approach for separable implementation of multidimensional DWT by parallel processing with appropriate scheduling of computation to achieve significant reduction in storage requirement and area-delay product.



As of now, only a few designs for VLSI implementation of 3-D DWT are found in the literature [16]–[20]. Weeks et al. have proposed two separate designs (3DW-I and 3DW-II) for computing 3-D DWT [16]–[18]. The 3DW-I design involves on-chip memory of O(MNR), which is impractical to implement in a chip, where (M × N) is the image size and R is the frame-rate of the given video. The 3DW-II structure performs block-by-block processing of 3-D data, and involves very small on-chip memory of O(K³). It, however, involves complex control circuitry and a frame-buffer of size MNR to feed the blocks of input data. The average computation time (ACT) of 3DW-II is also significantly higher than that of 3DW-I. Das et al. [19] have proposed a separable architecture for 3-D DWT, which involves less memory space and lower output latency compared with the earlier designs. Dai et al. [20] have applied the polyphase decomposition technique, and mapped the computation of 3-D DWT onto a systolic architecture. They have used the conventional separable method efficiently in their structure for reducing the on-chip storage space. The systolic design of [20] requires four times the resources of [19], calculates the 3-D DWT nearly four times faster, and involves significantly less storage space than the other. One major problem with the designs proposed in [19] and [20] is that they compute the multilevel 3-D DWT by a level-by-level approach (similar to the folded scheme proposed by Wu et al. [13]) using an external frame-buffer of size (MNR)/8.

The external frame-buffer is a major hardware component in the existing structures, since the frame-size for practical video transmission applications may vary from 176 × 144 (screen size of mobile phones) to 1920 × 1080 [screen size of high-definition television (HDTV)], and the frame-rate can vary from 15 f/s in mobile phones to 60 f/s for HDTV applications. The on-chip storage and the frame-buffer contribute more than 90% to the total area of the existing structures. A significant amount of memory bandwidth and computation time is also wasted in accessing the external frame-buffer. It is also observed that the on-chip storage and frame-buffer size remain independent of the throughput rate. This motivates us to apply a concurrent design method, which has a two-fold advantage: the frame-buffer size could be reduced, and the on-chip memory of the 3-D structure can be used more efficiently to calculate multiple outputs per cycle to improve the overall performance of the chip. Since silicon devices are progressively scaling according to Moore's law [21]–[23], more components are accommodated in integrated circuits, and at the same time silicon cost has been declining fast over the years. Hence, it may be considered an appropriate strategy to design parallel architectures where area can be traded either for time, or for power if faster computation is not required by the application. If high throughput is not required for a given application, then the clock frequency could be reduced and a lower operating voltage could be used for reducing the power consumption [24]. Keeping this in mind, we have proposed a parallel architecture for multilevel 3-D DWT. The key ideas used in our proposed approach are:

1) to process each decomposition level of 3-D DWT in separate computing blocks of a cascaded pipeline structure for concurrent computation of the multilevel DWT, in order to reduce the size of the frame-buffer used for buffering of the subband components and to maximize hardware utilization efficiency;

2) to fold the input rows of each level appropriately to meet the desired throughput rate and to achieve 100% hardware utilization efficiency (HUE) of the processing unit.

Using the above approach, we have reduced the frame-buffer size, and have obtained Q/7 times higher throughput compared with the best of the existing structures, using on-chip memory of the same order. It is shown that the proposed structure can calculate the DWT coefficients of an input video signal of size (M × N × R) in MNR/Q cycles. The proposed parallel implementation of 3-D DWT structures offers an additional advantage, since the size of the frame-buffer could be reduced, and it does not demand higher on-chip memory and frame-buffer for higher input block sizes, which contribute most of the hardware in the existing designs.

The remainder of this paper is organized as follows. The mathematical formulation of the 3-D DWT computation is presented in Section II. The proposed architecture for 1-level 3-D DWT is presented in Section III. A pipeline architecture for multilevel 3-D DWT is presented in Section IV. The hardware complexity and performance of the proposed structure are discussed in Section V. Our conclusion is presented in Section VI.

    II. Mathematical Formulation

The 3-D DWT coefficients of any decomposition level can be obtained from the scaling coefficients of its previous level according to the pyramid algorithm, given by

lll_j(n_1, n_2, n_3) = \sum_{i_1=0}^{K_h-1} \sum_{i_2=0}^{K_h-1} \sum_{i_3=0}^{K_h-1} h_1(i_1)\, h_2(i_2)\, h_3(i_3)\; lll_{j-1}(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (1)

llh_j(n_1, n_2, n_3) = \sum_{i_1=0}^{K_h-1} \sum_{i_2=0}^{K_h-1} \sum_{i_3=0}^{K_g-1} h_1(i_1)\, h_2(i_2)\, g_3(i_3)\; lll_{j-1}(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (2)

\vdots

hhh_j(n_1, n_2, n_3) = \sum_{i_1=0}^{K_g-1} \sum_{i_2=0}^{K_g-1} \sum_{i_3=0}^{K_g-1} g_1(i_1)\, g_2(i_2)\, g_3(i_3)\; lll_{j-1}(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (3)

for n1 = 0, 1, ..., (R/2)−1, n2 = 0, 1, ..., (M/2)−1, and n3 = 0, 1, ..., (N/2)−1, where Kh and Kg are, respectively, the lengths of the low-pass and high-pass filters, M and N are, respectively, the height and width of the image, and R is the frame-rate of the video stream.


Fig. 1. Proposed structure for computation of 1-level convolution-based 3-D DWT of frame-size (M × N). z1_l(n1, n2, q)p and z1_h(n1, n2, q)p represent the four subband components {z1_lll(n1, n2, q)p, z1_hll(n1, n2, q)p, z1_lhl(n1, n2, q)p, z1_hhl(n1, n2, q)p} and {z1_llh(n1, n2, q)p, z1_hlh(n1, n2, q)p, z1_lhh(n1, n2, q)p, z1_hhh(n1, n2, q)p}, respectively. z(n1, n2, q)p is defined as the qth data sample of the pth block of the n2th row of the n1th frame of the subband matrix, where 0 ≤ q ≤ (Q/2)−1, 0 ≤ p ≤ P−1, 0 ≤ n2 ≤ (M/2)−1, 0 ≤ n1 ≤ (R/2)−1, and N = PQ.

Assuming K = Kh = Kg, (1)–(3) can be represented in the generalized form

z(n_1, n_2, n_3) = \sum_{i_1=0}^{K-1} \sum_{i_2=0}^{K-1} \sum_{i_3=0}^{K-1} w_1(i_1)\, w_2(i_2)\, w_3(i_3)\; x(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (4)

where z(·) corresponds to lll_j(·), llh_j(·), ..., hhh_j(·), and x(·) corresponds to lll_{j−1}(·) in (1)–(3), while w1(i), w2(i), and w3(i), respectively, correspond to the filter coefficients (h1(i) or g1(i)), (h2(i) or g2(i)), and (h3(i) or g3(i)).

The computation of (4) can be decomposed into three distinct stages as

z(n_1, n_2, n_3) = \sum_{i=0}^{K-1} w_3(i)\, v(2n_1 - i,\, n_2,\, n_3)    (5)

v(n_1, n_2, n_3) = \sum_{i=0}^{K-1} w_2(i)\, u(n_1,\, 2n_2 - i,\, n_3)    (6)

u(n_1, n_2, n_3) = \sum_{i=0}^{K-1} w_1(i)\, x(n_1,\, n_2,\, 2n_3 - i).    (7)

[u(n1, n2, n3)] represents the low-pass and high-pass output matrices [u_l(n1, n2, n3)] and [u_h(n1, n2, n3)], respectively, corresponding to the 3-D input [x(n1, n2, n3)], while [v(n1, n2, n3)] of (6) represents the four subband output matrices [v_ll(n1, n2, n3)], [v_lh(n1, n2, n3)], [v_hl(n1, n2, n3)], and [v_hh(n1, n2, n3)], where (v_ll(n1, n2, n3), v_lh(n1, n2, n3)) constitute the low-pass and high-pass outputs resulting from the intermediate output u_l(n1, n2, n3), and (v_hl(n1, n2, n3), v_hh(n1, n2, n3)) constitute the low-pass and high-pass outputs resulting from the intermediate output u_h(n1, n2, n3). Similarly, z(n1, n2, n3) of (4) represents the eight oriented selective subband outputs z_lll(n1, n2, n3), z_llh(n1, n2, n3), z_lhl(n1, n2, n3), z_lhh(n1, n2, n3), z_hll(n1, n2, n3), z_hlh(n1, n2, n3), z_hhl(n1, n2, n3), and z_hhh(n1, n2, n3), corresponding to the low-pass and high-pass outputs of the four intermediate outputs {v_ll(n1, n2, n3), v_lh(n1, n2, n3), v_hl(n1, n2, n3), v_hh(n1, n2, n3)}.

Using the decomposition scheme of (5)–(7), the computation of 3-D DWT can be performed in three distinct stages as follows (a behavioral sketch of these stages is given after the list).

1) In stage-1, low-pass and high-pass filtering is performed row-wise on each input frame (intra-frame) to produce the intermediate matrices [U_l] and [U_h] according to (7).

2) In stage-2, low-pass and high-pass filtering is performed column-wise on each of the intermediate coefficient matrices [U_l] and [U_h] to generate the four subband matrices [V_ll], [V_lh], [V_hl], and [V_hh] according to (6).

3) Finally, in stage-3, low-pass and high-pass filtering is performed on the inter-frame subbands to obtain the eight oriented selective subband matrices [Z_lll], [Z_llh], [Z_lhl], [Z_lhh], [Z_hll], [Z_hlh], [Z_hhl], and [Z_hhh] of size [M/2 × N/2 × R/2].
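As a cross-check of this stage-wise dataflow, the following is a minimal behavioral sketch in Python/NumPy (a reference model only, not the proposed hardware): it filters row-wise, column-wise, and then across frames with decimation by 2, using cyclic extension in the spirit of the row/column extension described in Section III. The function names, the boundary handling, and the Daub-4 taps of the example are our assumptions.

```python
import numpy as np

def analysis_1d(x, h, g, axis):
    """One level of convolution-based 1-D DWT analysis along `axis`
    with decimation by 2 and cyclic extension; returns (low, high)."""
    K = len(h)
    xm = np.moveaxis(np.asarray(x, dtype=float), axis, -1)
    n = xm.shape[-1]
    # window indices for y(k) = sum_i w(i) * x(2k - i), taken modulo n
    idx = (2 * np.arange(n // 2)[:, None] - np.arange(K)[None, :]) % n
    win = xm[..., idx]                                   # (..., n//2, K)
    lo = win @ np.asarray(h, dtype=float)
    hi = win @ np.asarray(g, dtype=float)
    return np.moveaxis(lo, -1, axis), np.moveaxis(hi, -1, axis)

def dwt3d_one_level(video, h, g):
    """Three-stage separable 1-level 3-D DWT of (5)-(7): stage-1 row-wise,
    stage-2 column-wise, stage-3 inter-frame.  `video` has shape (R, M, N);
    returns the eight (R/2, M/2, N/2) subbands keyed 'lll', ..., 'hhh'
    (letters in row/column/temporal filter order)."""
    ul, uh = analysis_1d(video, h, g, axis=2)            # stage-1, per (7)
    v = {}
    for a, u in (("l", ul), ("h", uh)):                  # stage-2, per (6)
        v[a + "l"], v[a + "h"] = analysis_1d(u, h, g, axis=1)
    z = {}
    for ab, vb in v.items():                             # stage-3, per (5)
        z[ab + "l"], z[ab + "h"] = analysis_1d(vb, h, g, axis=0)
    return z

if __name__ == "__main__":
    # Daub-4 analysis filters (K = 4); high-pass taken as quadrature mirror of low-pass
    h = np.array([0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226])
    g = h[::-1] * np.array([1.0, -1.0, 1.0, -1.0])
    sub = dwt3d_one_level(np.random.rand(8, 8, 8), h, g)
    print(sorted(sub), sub["lll"].shape)                 # eight subbands, each (4, 4, 4)
```

For an 8 × 8 × 8 test volume, each of the eight subbands has size 4 × 4 × 4, consistent with the [M/2 × N/2 × R/2] subband size stated in stage-3 above.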

    III. Proposed Architecture for 1-Level 3-D DWT

Based on the mathematical formulation of the previous section, we derive here a throughput-scalable structure for the implementation of 1-level 3-D DWT, as shown in Fig. 1. It consists of Q/2 regularly arranged processing modules, where Q samples are available to the structure per clock cycle, and N = PQ, such that P clock cycles are involved to feed each of the rows of a frame of the 3-D input. The input data of each frame of size (M × N) is cyclically extended by (K−2) rows and (K−2) columns; the rows of the 3-D input data-matrix are then folded by a factor P, where P is assumed to be a power of 2, and fed block-by-block serially in P successive clock cycles to the input data distribution unit (IDU), such that one complete frame is fed in (M + K − 2)P successive clock cycles. Note that for P = 1, one complete row of data is fed to the IDU in each cycle, which corresponds to the fully parallel structure. Each input data-block (X_{n1,n2,p}) is comprised of (Q + K − 2) consecutive samples of a given row, and the successive data-blocks, for 0 ≤ p ≤ P−1, are overlapped by (K − 2) samples, where 0 ≤ n1 ≤ R−1 and 0 ≤ n2 ≤ M−1. In every cycle, the IDU derives Q/2 input-vectors I(n1, n2, q)p such that each input-vector I(n1, n2, q)p, for 0 ≤ q ≤ Q/2−1, consists of K consecutive values of a particular input block, where I(n1, n2, q)p = [x(n1, n2, pQ+2q+K−1), x(n1, n2, pQ+2q+K−2), ..., x(n1, n2, pQ+2q+1), x(n1, n2, pQ+2q)].


Fig. 2. Structure of a processing module. Output-1 and output-2, respectively, represent the subband components {Z1_lll, Z1_hll, Z1_lhl, Z1_hhl} and {Z1_llh, Z1_hlh, Z1_lhh, Z1_hhh}.

Fig. 3. Internal structure of subcell-1. Output-1 and output-2, respectively, represent the low-pass and the high-pass components u_l and u_h. For K = 4, I(n1, n2, q)p = [x(n1, n2, pQ + 2q + 3), x(n1, n2, pQ + 2q + 2), x(n1, n2, pQ + 2q + 1), x(n1, n2, pQ + 2q)].

Note that, for K = 4, two adjacent input-vectors of a particular input block are overlapped by two samples. The IDU feeds the Q/2 input-vectors in parallel to the Q/2 processing modules such that the (Q/2 − q)th module receives the input-vector I(n1, n2, q)p during the pth cycle of a period of P successive cycles.
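As an illustration of this data arrangement, the following sketch (helper name and zero-based indexing are ours) mimics how the IDU forms the pth data-block of a cyclically extended row and derives from it the Q/2 overlapping K-sample input-vectors defined above.

```python
import numpy as np

def idu_vectors(row, Q, K, p):
    """Illustrative model of the input data distribution unit (IDU): the row
    is cyclically extended by (K-2) samples, the pth data-block of (Q+K-2)
    samples is taken (successive blocks overlap by K-2 samples), and Q/2
    input-vectors I(q) = [x(pQ+2q+K-1), ..., x(pQ+2q+1), x(pQ+2q)] are formed."""
    ext = np.concatenate([row, row[:K - 2]])        # cyclic column extension
    block = ext[p * Q : p * Q + Q + K - 2]          # pth input data-block
    return np.array([block[2 * q : 2 * q + K][::-1] for q in range(Q // 2)])

# e.g. one row of N = 16 samples, Q = 8 samples per cycle (P = 2), Daub-4 (K = 4):
print(idu_vectors(np.arange(16), Q=8, K=4, p=0))    # 4 vectors, adjacent ones overlap by 2 samples
```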

The internal structure of the processing module is shown in Fig. 2. It consists of three subcells working in separate pipeline stages. Subcell-1, subcell-2, and subcell-3, respectively, perform the computations pertaining to stage-1, stage-2, and stage-3. The arithmetic operations pertaining to the low-pass and high-pass filtering of stage-1 [as given by (7)] are mapped into subcell-1. The internal structure of subcell-1 for K = 4 is shown in Fig. 3. It consists of K multiplication units (MUs) and (2K − 2) adders. Each of the MUs stores a pair of filter coefficients of the low-pass and high-pass filters, such that the (k+1)th MU stores the coefficients h1(k) and g1(k). Four input samples (for K = 4) are fed in parallel to the MUs through the input-vector I(n1, n2, q)p during each clock cycle. The four pairs of multiplications required for computing a pair of filter outputs are implemented concurrently by the four MUs; the outputs of the MUs are added concurrently by an adder-tree. The duration of a clock cycle of the structure is equal to (T_M + 2T_A), where T_M and T_A are, respectively, the multiplication-time and the addition-time. During each cycle, subcell-1 produces a low-pass component u_l(n1, n2, q)p and a high-pass component u_h(n1, n2, q)p of an intermediate matrix. Note that P successive pairs of such outputs of subcell-1 correspond to a particular input row, and form a row of the intermediate matrix, which can be used directly as the input for stage-2 to perform the wavelet filtering along the column direction.
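The per-cycle behavior of subcell-1 amounts to a pair of K-tap dot products on the same input-vector; the hypothetical helper below (name ours) makes this explicit.

```python
def subcell1_cycle(I, h1, g1):
    """Behavioral sketch of one subcell-1 clock cycle: MU-(k+1) holds the
    coefficient pair (h1(k), g1(k)) and multiplies it by sample I[k]; the two
    adder-trees built from the (2K-2) adders then reduce the products to the
    low-pass/high-pass pair (u_l, u_h)."""
    u_l = sum(h1[k] * I[k] for k in range(len(h1)))     # low-pass filter output
    u_h = sum(g1[k] * I[k] for k in range(len(g1)))     # high-pass filter output
    return u_l, u_h

# e.g. feed one of the input-vectors produced by the IDU sketch above:
# u_l, u_h = subcell1_cycle(idu_vectors(row, Q=8, K=4, p=0)[0], h, g)
```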

Fig. 4. Structure of subcell-2. Output-1 and output-2 of subcell-2, respectively, represent the subband components {v_ll, v_hl} and {v_lh, v_hh}.

Fig. 5. (a) Function of the LC. (b) Function of an AC.

The computation of stage-2 [given by (6)] is mapped into subcell-2, where the filtering of the intermediate outputs u_l(n1, n2, q)p and u_h(n1, n2, q)p, for 0 ≤ p ≤ P−1, is time-multiplexed to take advantage of the down-sampling of DWT components. The internal structure of subcell-2 for K = 4 is depicted in Fig. 4. It consists of four MUs, three adder cells (ACs), three serial-in serial-out shift-registers (SRs) of size P words, and one line-changer (LC). During every cycle, a pair of outputs from subcell-1 is fed to the MUs of subcell-2 through the LC. The structure and the function of the LC are shown in Fig. 5(a). In each set of P successive clock cycles, the output lines of a pair of outputs of subcell-1 are interchanged by the LC, such that if the low-pass intermediate output is on a particular output line during a particular set of P successive cycles, then the high-pass intermediate output appears on that output line in the next set of P cycles. The sample values on line-1 are fed to the odd-numbered MUs (MU-1 and MU-3), while the sample values on line-2 are fed to the even-numbered MUs (MU-2 and MU-4). This simple technique of sample loading introduces embedded decimation in the filter computation of u_l(n1, n2, q)p and u_h(n1, n2, q)p along the column direction. Each of the MUs stores a pair of coefficients (h2(k) and g2(k)) of the low-pass and high-pass filters of stage-2. The four MUs of subcell-2 perform the multiplications concurrently. The addition operations are implemented by the ACs operated in a systolic pipeline. The function of each AC is depicted in Fig. 5(b). After a latency of (K + 3P − 1) cycles (where 3P cycles are required to fill the SRs), subcell-2 produces two subband components in each cycle, and in P successive cycles it computes the subband components of a particular row. During two consecutive sets of P cycles, it computes four subband components of a given column. All the Q/2 modules of the linear array structure thus produce a block of Q/2 pairs of 2-D DWT subbands, and one complete row of each of the four subbands of a given frame in every set of 2P cycles. The first-level decomposition of the n1th frame of size (M × N), therefore, can be obtained in MP cycles after an initial latency of (K + 3P) cycles.


Fig. 6. Internal structure of subcell-3 and the intermediate buffer. Input-1 and input-2, respectively, represent [v_ll(n1, n2, q)p] or [v_hl(n1, n2, q)p] and [v_lh(n1, n2, q)p] or [v_hh(n1, n2, q)p]. Output-1 and output-2, respectively, represent the 3-D DWT coefficients [z_lll(n1, n2, q)p], [z_hll(n1, n2, q)p], [z_lhl(n1, n2, q)p], [z_hhl(n1, n2, q)p] and [z_llh(n1, n2, q)p], [z_hlh(n1, n2, q)p], [z_lhh(n1, n2, q)p], [z_hhh(n1, n2, q)p] in time-multiplexed form.

Each of the subcell-2 units gives a pair of outputs in parallel in each cycle, such that in MP successive cycles it gives four columns of the subbands [V_ll], [V_lh], [V_hl], and [V_hh] of a particular frame, where the subband components of ([V_ll] and [V_hl]) or ([V_lh] and [V_hh]) are obtained in time-multiplexed form.

The output of subcell-2 is queued in the intermediate buffer (shown in Fig. 6) to be processed in stage-3 of the computation. The intermediate buffer consists of seven SRs. Each of the SRs holds MP words of either the columns of ([V_ll] and [V_hl]) or of ([V_lh] and [V_hh]), such that the even-numbered SRs hold the elements of [V_ll] and [V_hl] alternately, while the odd-numbered SRs hold the elements of [V_lh] and [V_hh] alternately. Note that successive SRs hold the subband components corresponding to successive frames. The buffering of the subband components of successive frames in the SRs is done to perform the inter-frame wavelet filtering of stage-3.

In stage-3 of the computation, low-pass and high-pass filtering are performed across the frames on the four subband components of stage-2 to obtain the eight oriented selective subband components [Z_lll], [Z_llh], [Z_hll], [Z_hlh], [Z_lhl], [Z_lhh], [Z_hhl], and [Z_hhh] of the 3-D DWT. The computation of each subcell-3 is thus performed by a pair of low-pass and high-pass filters, where the filtering of [V_ll], [V_hl], [V_lh], and [V_hh] is time-multiplexed. The structure of subcell-3 (shown in Fig. 6) is similar to the structure of subcell-1, except that it contains four MUXes to select the required input samples from the intermediate buffer. The intermediate buffer provides the required subband components [V_ll], [V_hl], [V_lh], and [V_hh] of a given frame to subcell-3. The extra SR-1 provides one complete column-delay to the input [V_lh]/[V_hh], so that for the first MP cycles the inter-frame wavelet filtering is performed on the components of [V_ll]/[V_hl], and in the second MP cycles such filtering is performed on the components of [V_lh]/[V_hh]. This process is repeated such that the components of [V_ll]/[V_hl] and [V_lh]/[V_hh] are processed in alternate sets of MP cycles.

The MUXes of each subcell-3 select the inputs of ([V_ll] or [V_hl]) and ([V_lh] or [V_hh]) of consecutive frames from the intermediate buffer in alternate sets of MP cycles, and compute the 3-D DWT coefficients such that, during every odd-numbered set of MP successive cycles, one complete column of the DWT components of the four subbands [Z_lll], [Z_llh], [Z_hll], and [Z_hlh] is obtained from subcell-3, where the subband components of ([Z_lll], [Z_llh]) and ([Z_hll], [Z_hlh]) are time-multiplexed. Similarly, during every even-numbered set of MP successive cycles, one column of the other four subband coefficients [Z_lhl], [Z_lhh], [Z_hhl], and [Z_hhh] is obtained from subcell-3. The linear array of Q/2 processing modules thus produces Q/2 coefficients of each of the eight subbands in every computational cycle, such that the 3-D DWT computation of a given frame is completed in MP cycles. The complete 3-D DWT of size (M × N × R) can be obtained in MPR cycles, where R is the frame-rate of the video stream and M and N are, respectively, the height and width of each frame. The entire linear array of Q/2 processing modules can be implemented in a processing unit (PU-1) for the first-level DWT computation. Similar processing units can also be designed for the computation of higher level decompositions, and all those processing units can be integrated into a pipeline structure for concurrent implementation of multilevel 3-D DWT.

IV. Proposed Pipeline Architecture for Multilevel 3-D DWT

In multilevel 3-D DWT, the [Z_lll] subband of the current decomposition level is further processed to calculate the 3-D DWT of the next higher level of decomposition. Since the 3-D DWT structure for each level of decomposition performs decimated filtering, the number of arithmetic operations required to calculate the 3-D DWT of each higher level decreases consistently by a factor of eight. The amount of hardware resources for calculating the DWT coefficients of every higher level of decomposition should, therefore, be reduced by a factor of 8 in order to achieve 100% HUE. Based on this point of view, we have derived a fully pipelined structure for the implementation of J-level 3-D DWT as shown in Fig. 7, where J = log_8(Q/2) + 1. It is comprised of J PUs, where PU-j, for 1 ≤ j ≤ J, performs the computation of the jth decomposition level. The PUs are connected in a linear structure, and work in separate pipeline stages. The structure of each PU is similar to the 1-level 3-D DWT structure shown in Fig. 1.
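The level-wise scaling just described can be summarized in a small sketch (helper name ours): the number of processing modules of PU-j follows from reducing the Q/2 modules of PU-1 by a factor of 8 per level, i.e., Q/2^(3j−2), with J = log_8(Q/2) + 1 levels in total.

```python
from math import log2

def pu_module_counts(Q):
    """Number of decomposition levels J = log8(Q/2) + 1 and the number of
    processing modules of PU-j, which shrinks by 8 per level:
    (Q/2) / 8**(j-1) = Q / 2**(3j-2)."""
    J = int(round(log2(Q // 2) / 3)) + 1
    return J, [Q // 2 ** (3 * j - 2) for j in range(1, J + 1)]

print(pu_module_counts(Q=16))    # (2, [8, 1]): PU-1 has Q/2 = 8 modules, PU-2 has Q/16 = 1
print(pu_module_counts(Q=128))   # (3, [64, 8, 1])
```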

PU-1 consists of Q/2 processing modules to calculate the 3-D DWT of the first-level decomposition. It gives one row (of size N/2) of four out of the eight subbands in 2P successive cycles, such that all the M/2 rows of those subbands are obtained in MP successive cycles. Similarly, all the M/2 rows of the other four subbands of the 3-D DWT are obtained during the next MP cycles.


Fig. 7. Proposed pipeline structure for multilevel 3-D DWT, where J = log_8(Q/2) + 1 and Q/2^(3J−3) ≥ 2.

Note that the low-low subband components of the first level (Z1_lll) of a particular frame are obtained from PU-1 during alternate sets of MP cycles. Each output block (consisting of Q/2 samples) corresponding to Z1_lll is folded by a factor of 4, and fed to PU-2 in four successive cycles, such that one row of Z1_lll is fed in 4P cycles, and a complete frame in 2MP cycles. PU-2, therefore, performs the processing of half of the components of the matrix Z1_lll (of size [M/2 × N/2]) of a particular frame in MP cycles, while PU-1 generates the entire DWT components of Z1_lll of a particular frame in alternate sets of MP cycles. PU-2 uses an input-buffer (IB2) of size MN/8 words to store half of the output values of Z1_lll corresponding to a frame. PU-2 is comprised of Q/16 processing modules arranged in a linear array structure similar to that of PU-1. It receives a block of Q/8 intermediate outputs through IB2 in every cycle, and calculates a block of Q/16 components of a pair of subband matrices in every cycle. The subcell-1 and intermediate buffer of each processing module of PU-2 are the same as those of PU-1, but each SR of subcell-2 (Fig. 4) and subcell-3 (Fig. 6) of PU-1 is replaced by SRs of size 4P and 2MP, respectively, to obtain the corresponding subcells of PU-2. PU-2 calculates one complete row of N/4 coefficients of all the eight subband matrices in 16P cycles, and completes the 2-level decomposition in MPR cycles. Similarly, the low-low subband components of Z^(J−1)_lll are buffered in IB_J of size MN/2^(2J−1) words. The Jth PU is comprised of Q/2^(3J−2) identical modules, and receives a block of Q/2^(3J−3) intermediate outputs from IB_J in every cycle, where Q/2^(3J−3) ≥ 2. The SR sizes of subcell-2 and subcell-3 of each module of the Jth PU are, respectively, 4^(J−1)P and 2^(J−1)MP words. It calculates one row of N/2^J coefficients of a pair of subbands in 4^(J−1)P cycles, and one row of all the eight subband matrices in 4^J P cycles. It takes MPR cycles to complete the calculation of all the eight subbands [each of size (MR/4^J × N/2^J)]. The PUs of the proposed structure work in separate pipeline stages, and compute the J-level 3-D DWT of an input video stream (M × N × R) in MPR cycles. The latency of the proposed structure is estimated to be

\text{Latency} = \sum_{j=1}^{J} (\text{PU-}j)_{\text{latency}} = \sum_{j=1}^{J} \left[ K + 4^{j-1}(K-1)P + 2^{j-1}(K-1)MP \right] = KJ + \frac{(4^{J}-1)(K-1)P}{3} + (2^{J}-1)(K-1)MP.    (8)

Note that a delay of (4^J − 1)(K − 1)P/3 + (2^J − 1)(K − 1)MP cycles is introduced to fill the SRs of subcell-2 and the intermediate buffer of each module, where K is the filter order.
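As a worked check of (8), the sketch below (helper name ours) sums the per-PU latencies and compares them with the closed form, taking the Daub-4 filter (K = 4), J = 2 levels, and M = 176 with P = N/Q = 144/16 = 9, following the (M, N, R) ordering used in Table III.

```python
def pipeline_latency(K, J, M, P):
    """Latency of the J-level pipeline per (8): sum over j of the per-PU
    latencies K + 4**(j-1)*(K-1)*P + 2**(j-1)*(K-1)*M*P, and its closed form."""
    per_pu = [K + 4 ** (j - 1) * (K - 1) * P + 2 ** (j - 1) * (K - 1) * M * P
              for j in range(1, J + 1)]
    closed = K * J + (4 ** J - 1) * (K - 1) * P // 3 + (2 ** J - 1) * (K - 1) * M * P
    assert sum(per_pu) == closed       # (4**J - 1) is always divisible by 3
    return closed

print(pipeline_latency(K=4, J=2, M=176, P=9))   # 14399 cycles for this example
```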

    V. Hardware and Time Complexity

The proposed 3-D DWT structure for J levels of decomposition involves J processing units, where J = log_8(Q/2) + 1. Each of the processing units (PUs) operates in a separate pipeline stage. The jth PU involves Q/2^(3j−2) identical modules to perform the 3-D DWT of the jth level with 100% hardware utilization efficiency. Each module has three subcells (subcell-1, subcell-2, and subcell-3).

Each subcell of the proposed structure requires 2K multipliers and (2K − 2) adders. In addition to this, subcell-2 and the intermediate buffer of a module of the jth PU, respectively, involve 2^(2j−1)(K − 1)P and 2^(j−1)(2K − 1)MP delay registers. Each module, therefore, involves 6K multipliers, 6(K − 1) adders, and 2^(2j−1)(K − 1)P + 2^(j−1)(2K − 1)MP registers. Along with the Q/2^(3j−2) processing modules, the jth PU (except the first PU) involves an input-buffer (IB_j) of size MN/2^(2j−1) words, for 2 ≤ j ≤ J. The size of each subband matrix of the jth-level 3-D DWT is [(MR/4^j) × (N/2^j)] for input size (M × N × R). Eight such subband matrices are computed by the jth PU in MPR cycles. The PUs of the proposed pipeline structure concurrently calculate the 3-D DWT of an input video stream of size (M × N) and frame-rate R in approximately MPR cycles. The hardware complexity of the proposed structure for J-level 3-D DWT is estimated as follows.

Number of multipliers:

3KQ + 3KQ/8 + \cdots + 6KQ/2^{3J-2} = \frac{24}{7} K Q (1 - 2^{-3J})

Number of adders:

3(K-1)Q + 3(K-1)Q/8 + \cdots + 6(K-1)Q/2^{3J-2} = \frac{24}{7}(K-1)Q(1 - 2^{-3J})

Number of pipeline/data registers:

[(K-1)N + (2K-1)MN/2] + [(K-1)N/2 + (2K-1)MN/8] + \cdots + [(K-1)N/2^{J-1} + (2K-1)MN/2^{2J-1}] = 2(K-1)N(1 - 2^{-J}) + \frac{2}{3}(2K-1)(1 - 2^{-2J})MN


Input-buffer size (in words):

MN/8 + MN/32 + \cdots + MN/2^{2J-1} = MN(1 - 2^{-2J+2})/6
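These closed-form counts can be evaluated directly; the sketch below (function name ours) reproduces the "Proposed" rows of Table III for the Daub-4 filter (K = 4) and J = 2, taking (M, N) = (176, 144) in the order used in Table III.

```python
def complexity(K, Q, J, M, N):
    """Multipliers, adders, pipeline/data registers (on-chip storage), and
    input-buffer words of the proposed J-level structure, per the closed-form
    expressions above; the ACT is MNR/Q cycles (R supplied by the caller)."""
    x1 = 1 - 2 ** (-3 * J)
    mult = 24 / 7 * K * Q * x1
    add = 24 / 7 * (K - 1) * Q * x1
    regs = 2 * (K - 1) * N * (1 - 2 ** (-J)) \
         + 2 / 3 * (2 * K - 1) * (1 - 2 ** (-2 * J)) * M * N
    in_buf = M * N * (1 - 2 ** (-2 * J + 2)) / 6
    return round(mult), round(add), round(regs), round(in_buf)

# Daub-4 (K = 4), J = 2, QCIF frames taken as (M, N) = (176, 144):
for Q in (16, 48, 72, 144):
    print(Q, complexity(K=4, Q=Q, J=2, M=176, N=144))
# Q = 16 gives (216, 162, 111528, 3168), matching the corresponding row of Table III
```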

    A. Performance Comparison

The hardware and time complexities of the proposed structure and the existing structures of [16], [19], and [20] are listed in Table I in terms of cycle period, ACT¹ in clock cycles, storage space and frame-buffer size in words, along with the number of multipliers and adders, for comparison. The overall hardware complexity of the structures has two major components: 1) the complexity of the arithmetic units, and 2) the on-chip storage and external frame-buffer. When the frame-size (M × N) is small, the complexity of the arithmetic units is more significant than that of the storage component. But the frame-size for video processing varies from as low as 176 × 144 for 3G phones to 640 × 480 in the standard-definition format and 1280 × 720 in the commonly used high-definition format, while the frame-rate can vary from 15 f/s in mobile phones to 60 f/s for HDTV applications. Therefore, in practice, the on-chip storage and frame-buffer make the dominant contribution to the hardware of the overall structure.

As shown in Table I, the proposed structure requires 4Q/7 times more multipliers and adders than those of [16] and [19]. Compared with [20], the proposed one requires Q/7 times more multipliers and adders. The on-chip storage of the proposed structure is 4/3 times and 2M/3R times that of the structures of [19] and [20], respectively. Compared with the 3DW-I and 3DW-II structures of [16], it involves nearly 2K/3R times less and 4MN/3K² times more on-chip storage, respectively. But, it is interesting to note that, unlike the existing structures, the frame-buffer of the proposed structure is independent of the frame-rate R; it involves (3R/4) times less frame-buffer than the others; and it offers a higher throughput rate. Moreover, the proposed structure provides 4Q/7 and Q/7 times higher throughput per cycle compared with ([16], [19]) and [20], respectively, with a slightly higher clock period. Compared with the block-based 3DW-II structure of [16], the proposed one can offer K²Q/7 times higher throughput rate. Since the frame-buffer as well as the on-chip storage of the proposed structure is independent of the input block size Q, the throughput per clock cycle of the proposed structure can be increased proportionately, by using a larger number of processing modules, without increasing its memory component. When higher processing speed is not required, the structure could be operated by a slower clock, where speed could be traded for power by scaling down the operating voltage [24]. Alternatively, the multiplications and additions of the proposed structure either could be time-multiplexed inside the processing modules, or could be implemented by slower but hardware-efficient multipliers and adders.

¹ACT is the number of cycles required for the computation of all the J levels of 3-D DWT after the initial latency. In the case of the proposed structures, it is calculated by dividing the total number of 3-D DWT coefficients by the throughput per cycle. In the case of the structures of [19] and [20], the ACT is calculated as the sum of the ACTs of the individual levels, because they compute the 3-D DWT of different levels sequentially. For the proposed structures the ACT is MNR/Q cycles, since all the levels of the multilevel DWT computation are performed in separate pipeline stages.

Moreover, the proposed structure involves a small output latency of O(MN/Q), and performs the multilevel 3-D DWT computation with 100% utilization efficiency.

Since the throughput rates of the existing structures and the proposed structures are different, we have compared the area-delay products pertaining to different hardware components, namely the multiplier-delay product (MDP), adder-delay product (ADP), storage-word-delay product (SWDP), and frame-buffer-delay product (FBDP), estimated as the product of the respective hardware component with the computation time, where the computation time is estimated as the product of the cycle time with the ACT in cycles (since the structures are fully pipelined and there is no inter-frame latency). We have estimated the MDP, ADP, SWDP, and FBDP of the proposed structure and those of the existing structures, and listed them in Table II. The proposed structure is found to involve nearly 1.33 times more MDP and ADP than the structures of [19], [20], and 3DW-I of [16] for the first-level DWT computation, if we assume the multiplication time to be twice the addition time.² For each of the subsequent levels, the MDP and ADP of the proposed structure fall by a factor of 8. The SWDPs of the 3DW-I structure of [16] and of the structures of [19] and [20] are, respectively, QR times, Q times, and QR/M times higher than that of the proposed structure. However, the SWDP of the block-based (3DW-II) structure of [16] is (MN/K⁴Q) times less than that of the proposed structure. It is found that the FBDP of the proposed structure is QR times less than that of the existing structures.

The hardware and time complexities pertaining to a frame-size of 176 × 144 and frame-rates of 15, 30, and 60 f/s are estimated for the proposed structure for input block sizes (16, 48, 72, 144) and for J = 2, and are compared with those of the existing structures in Table III. As shown in Table III, the structure of [20] involves four times the multipliers and adders of [19], but offers nearly four times the throughput of the other. The structure of [20] requires nearly 11.6 times less on-chip storage for the Daub-4 filter compared with that of [19] for 15 f/s. The block-based structure of [16] requires nearly 78.2 times less on-chip storage for Daub-4 filters and offers a nearly 28 times lower throughput rate than that of [20] for 15 f/s. As shown in Table III, the proposed structure for the Daub-4 filter requires nearly (2.25, 6.75, 10.125, 20.25) times more multipliers and adders than [20] for input block sizes (Q = 16, 48, 72, 144), respectively, and provides nearly (2.25, 6.75, 10.125, 20.25) times more throughput rate than the latter. For 15 f/s, it requires 12.73 times more on-chip storage than the structure of [20], but for 60 f/s it involves 3.27 times more on-chip storage than [20] for input block sizes (Q = 16, 48, 72, 144). The storage complexity of the proposed structure remains the same for different frame-rates, but in the case of [20] the on-chip storage as well as the frame-buffer size increase linearly with the frame-rate.

To estimate the transistor counts, we have assumed ripple-carry adders (RCAs) and RCA-based multipliers of 8-bit input width for all the structures.

²We have synthesized the multipliers and adders for 8-bit and 12-bit signed as well as unsigned numbers with Synopsys Design Compiler using the DesignWare building block library, and estimated the multiplication time and addition time pertaining to TSMC 90 nm process technology. The average multiplication time is found to be nearly 2.1 times the average addition time.


TABLE I
General Comparison of Hardware and Time Complexities of the Proposed Structure and the Existing Structures

Structures | Multipliers | Adders | On-Chip Storage | Frame-Buffer | Cycle Period | ACT
Weeks et al. [16] (3DW-I) | 6K | 6(K−1) | 2MNR + 2MN + 6K | (1/8)MNR | T_M + T_A | (4/7)MNR·x1
Weeks et al. [16] (3DW-II) | 2K | 2(K−1) | K³ + 2K² + 4K | MNR | T_M + T_A | (1/7)(K² + 2K + 4)MNR·x1
Das et al. [19] | 6K | 6(K−1) | KMN + (K−2)N + 2K | (1/8)MNR | T_M + T_A | (4/7)MNR·x1
Dai et al. [20] | 24K | 24(K−1) | 2R(K−2)(N + 2) + 8K | (1/8)MNR | T_M + T_A | (1/7)MNR·x1
Proposed | (24/7)KQ·x1 | (24/7)(K−1)Q·x1 | 2(K−1)N·x2 + (2/3)(2K−1)MN·x3 | MN·x4/6 | T_M + 2T_A | MNR/Q

J: maximum number of levels of 3-D DWT decomposition; M: image height; N: image width; R: frame-rate; K: order (length) of the low-pass/high-pass filter; N is a power of 2. x1 = (1 − 2^−3J), x2 = (1 − 2^−J), x3 = (1 − 2^−2J), x4 = (1 − 2^−2J+2).

TABLE II
Comparison of Hardware-Component-Delay Products of the Proposed Structure and the Existing Structures

Structures | MDP | ADP | SWDP | FBDP
Weeks et al. [16] (3DW-I) | 10.28K·MNR·x1·T_A | 10.28(K−1)·MNR·x1·T_A | 3.42(MNR + MN + 3K)·MNR·x1·T_A | (3/14)M²N²R²·x1·T_A
Weeks et al. [16] (3DW-II) | 0.85K·MNR·z·x1·T_A | 0.85(K−1)·MNR·z·x1·T_A | 0.42(K³ + 2K² + 4K)·MNR·z·x1·T_A | (3/7)M²N²R²·z·x1·T_A
Das et al. [19] | 10.28K·MNR·x1·T_A | 10.28(K−1)·MNR·x1·T_A | 1.71(KMN + (K−2)N + 2K)·MNR·x1·T_A | (3/14)M²N²R²·x1·T_A
Dai et al. [20] | 10.28K·MNR·x1·T_A | 10.28(K−1)·MNR·x1·T_A | 0.85(R(K−2)(N + 2) + 4K)·MNR·x1·T_A | (3/14)M²N²R²·x1·T_A
Proposed | 13.71K·MNR·x1·T_A | 13.71(K−1)·MNR·x1·T_A | 8[(K−1)·x2 + (2K−1)M·x3/3]·MN²R·T_A/Q | (2/(3Q))M²N²R·T_A

K: filter length; J: number of 3-D DWT levels; M: image height; N: image width; R: frame-rate. MDP: multiplier-delay product; ADP: adder-delay product; SWDP: storage-word-delay product; FBDP: frame-buffer-delay product. x1 = (1 − 2^−3J), x2 = (1 − 2^−J), x3 = (1 − 2^−2J), z = (K² + 2K + 4). We have assumed T_M = 2T_A.

TABLE III
Comparison of Hardware and Time Complexities of the Proposed and the Existing Structures for Different Video Applications Using Four-Tap Daub-4 Wavelet Filters, J = 2

Structures | Frame-Size and Rate (M × N × R) | Multipliers | Adders | On-Chip Storage (Register Words) | Frame-Buffer (RAM Words) | ACT in Cycles
Das et al. [19] | 176 × 144 × 15 | 24 | 18 | 101 672 | 47 520 | 213 840
Dai et al. [20] | 176 × 144 × 15 | 96 | 72 | 8760 | 47 520 | 53 460
Weeks et al. [16] (3DW-II) | 176 × 144 × 15 | 8 | 6 | 112 | 380 160 | 1 496 880
Proposed (Q = 16) | 176 × 144 × 15 | 216 | 162 | 111 528 | 3168 | 23 760
Proposed (Q = 48) | 176 × 144 × 15 | 648 | 486 | 111 528 | 3168 | 7920
Proposed (Q = 72) | 176 × 144 × 15 | 972 | 729 | 111 528 | 3168 | 5280
Proposed (Q = 144) | 176 × 144 × 15 | 1944 | 1458 | 111 528 | 3168 | 2640
Das et al. [19] | 176 × 144 × 30 | 24 | 18 | 101 672 | 95 040 | 427 680
Dai et al. [20] | 176 × 144 × 30 | 96 | 72 | 17 520 | 95 040 | 106 920
Weeks et al. [16] (3DW-II) | 176 × 144 × 30 | 8 | 6 | 112 | 760 320 | 2 993 760
Proposed (Q = 16) | 176 × 144 × 30 | 216 | 162 | 111 528 | 3168 | 47 520
Proposed (Q = 48) | 176 × 144 × 30 | 648 | 486 | 111 528 | 3168 | 15 840
Proposed (Q = 72) | 176 × 144 × 30 | 972 | 729 | 111 528 | 3168 | 10 560
Proposed (Q = 144) | 176 × 144 × 30 | 1944 | 1458 | 111 528 | 3168 | 5280
Das et al. [19] | 176 × 144 × 60 | 24 | 18 | 101 672 | 190 080 | 855 360
Dai et al. [20] | 176 × 144 × 60 | 96 | 72 | 34 042 | 190 080 | 213 840
Weeks et al. [16] (3DW-II) | 176 × 144 × 60 | 8 | 6 | 112 | 1 520 640 | 5 987 520
Proposed (Q = 16) | 176 × 144 × 60 | 216 | 162 | 111 528 | 3168 | 95 040
Proposed (Q = 48) | 176 × 144 × 60 | 648 | 486 | 111 528 | 3168 | 31 680
Proposed (Q = 72) | 176 × 144 × 60 | 972 | 729 | 111 528 | 3168 | 21 120
Proposed (Q = 144) | 176 × 144 × 60 | 1944 | 1458 | 111 528 | 3168 | 10 560

M: image height, N: image width, R: frame-rate.


Fig. 8. Transistor counts of the proposed structure and the existing structures using Daub-4 filters for frame-size (176 × 144) and frame-rate 15 f/s.

Fig. 9. Transistor counts of the proposed structure and the existing structures using Daub-4 filters for frame-size (176 × 144) and frame-rate 60 f/s.

The on-chip storage and the frame-buffer are assumed to be implemented by D flip-flops and SRAM, respectively. The multiplier, adder, 8-bit register, and 8-bit SRAM word are taken to be 1085, 248, 128, and 48 transistors, respectively [25]. The transistor counts of the proposed structure using Daub-4 filters are estimated accordingly for 2-level decompositions for an input frame-size of (176 × 144) and frame-rates of 15 and 60 f/s (shown in Figs. 8 and 9). It can be observed from these figures that, for 15 f/s, the structure of [20] involves considerably fewer transistors than all the other structures. It is shown that the proposed structures for Q = 16, 48, 72, and 144, respectively, involve 4.17, 4.32, 4.44, and 4.79 times more transistors than that of [20], and offer 2.25, 6.75, 10.125, and 20.25 times more throughput rate. For 60 f/s (Fig. 9), the proposed structures involve 1.08, 1.12, 1.15, and 1.24 times more transistors than the corresponding structure of [20], but offer nearly 2.25, 6.75, 10.125, and 20.25 times more throughput than the other. An interesting feature of the proposed structures is that their throughput rate can easily be scaled by simply increasing the number of processing modules without increasing the on-chip storage and frame-buffer size. The transistor counts of the proposed structures therefore do not increase remarkably with the throughput rate.
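Under these per-component weights, the transistor counts behind Figs. 8 and 9 can be re-derived as a rough sketch; the component figures for the structure of [20] are taken from Table III at 15 f/s.

```python
# Weights from [25]: multiplier 1085, adder 248, 8-bit register 128, 8-bit SRAM word 48.
def transistor_count(multipliers, adders, register_words, sram_words):
    return 1085 * multipliers + 248 * adders + 128 * register_words + 48 * sram_words

proposed_q16 = transistor_count(216, 162, 111528, 3168)   # proposed, Q = 16 (Table III)
dai_15fps    = transistor_count(96, 72, 8760, 47520)      # Dai et al. [20], 15 f/s
print(round(proposed_q16 / dai_15fps, 2))                 # ~4.17, as reported for Fig. 8
```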

We have coded the proposed structure (based on Daub-4 filters) for input block sizes Q = 16, 48, 72, and 144 for the first-level decomposition of input size (176 × 144 × 15) in very high speed integrated circuit hardware description language, and synthesized it using Xilinx ISE 11i. The proposed designs are then implemented on a Virtex-5 field-programmable gate array (FPGA) platform using the 5VLX330 device. The results obtained from the synthesis report are listed in Table IV in terms of the number of slices and the maximum clock frequency in MHz. The designs for different input block sizes have the same cycle period, as expected from the theoretical estimation shown in the comparison Table I. It can be observed from Table IV that the proposed designs for Q = 48, 72, and 144, respectively, involve 1.47, 1.79, and 2.75 times more slices than the design for Q = 16, and offer, respectively, 3, 4.5, and 9 times more throughput rate.

TABLE IV
Synthesis Results of the Proposed Structures for FPGA Device 5VLX330

Proposed Designs | Number of Slices | Max. Clock Frequency
Q = 16 | 29 812 | 582.74 MHz
Q = 48 | 43 898 | 582.74 MHz
Q = 72 | 53 532 | 582.74 MHz
Q = 144 | 82 008 | 582.74 MHz

The increase in the number of slices could be due to differences in the way adders and multipliers are implemented in the FPGA, although we find that the transistor count does not increase remarkably with Q.

    VI. Conclusion

A throughput-scalable parallel and pipeline architecture was proposed for high-throughput computation of multilevel 3-D DWT with 100% hardware utilization efficiency. Each level of 3-D DWT computation was split into three distinct stages, and the computations of all three stages are implemented concurrently in a parallel array of pipelined processing modules. For multilevel DWT computation, we have proposed a cascaded structure where each level of decomposition is performed by a processing unit in a separate pipeline stage. An interesting feature of the proposed structure is that it involves a relatively small frame-buffer compared with the existing structures, and the size of the frame-buffer is independent of the frame-rate. Besides, the size of the on-chip storage and frame-buffer is independent of the input block size. The latency of the proposed structure for multilevel 3-D DWT is O(MN/Q), while that of the existing structures is O(MNR), where (M × N) is the frame-size. Compared with the best of the existing structures, we find that for frame-size (176 × 144) and frame-rates of 15 and 60 f/s, the proposed structure for input block sizes (16, 48, 72, 144), respectively, involves (4.17, 4.32, 4.44, 4.79) and (1.08, 1.12, 1.15, 1.24) times more transistors than [20] and offers (2.25, 6.75, 10.125, 20.25) times more throughput rate. The overall area-delay products of the proposed structure are significantly lower than those of the existing structures, although it involves slightly higher multiplier-delay and adder-delay products, since it involves a significantly smaller frame-buffer and storage-word-delay product. It was found that the throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage and frame-memory, by using a larger number of processing modules; and for higher frame-rates and higher input block sizes it provides greater advantage over the existing designs. It was also found that the fully parallel implementation of the proposed scalable structure provides the best of its performance. But the computational speed of the fully parallel structure, on the one hand, might be too high for the required application and, on the other hand, it may involve too large a hardware unit. Therefore, when higher processing speed is not required, the structure could be operated by a slower clock, where speed could be traded for power by scaling down the operating voltage [24].


To explore other possibilities of using the proposed high-throughput structure, we have shown that it could be implemented on an FPGA for slow and low-cost implementation. Alternatively, the multiplications and additions of the proposed structure either could be time-multiplexed inside the processing modules, and/or could be implemented by slower (possibly bit-serial or digit-serial) but hardware-efficient multipliers and adders, such that the clock period would increase and the hardware requirement could be reduced. Future work still needs to be carried out to determine how the fully parallel structure could be used for slow and low-cost implementation by appropriate algorithm-architecture co-design to match the real-time processing requirement.

    References

[1] A. Wang, Z. Xiong, P. A. Chou, and S. Mehrotra, "3-D wavelet coding of video with global motion compensation," in Proc. IEEE CS Data Compression Conf., May 1999, p. 404.
[2] G. Minami, Z. Xiong, A. Wang, and S. Mehrotra, "3-D wavelet coding of video with arbitrary regions of support," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 9, pp. 1063–1068, Sep. 2001.
[3] A. M. Baskurt, H. Benoit-Cattin, and C. Odet, "3-D medical image coding method using a separable 3-D wavelet transform," in Proc. SPIE Med. Imaging: Image Display, vol. 2431, Apr. 1995, pp. 173–183.
[4] V. Sanchez, P. Nasiopoulos, and R. Abugharbieh, "Lossless compression of 4D medical images using H.264/AVC," in Proc. IEEE ICASSP, vol. II, May 2006, pp. 1116–1119.
[5] J. Wei, P. Saipetch, R. K. Panwar, D. Chen, and B. K. Ho, "Volumetric image compression by 3-D DWT," in Proc. SPIE Med. Imaging: Image Display, vol. 2431, Apr. 1995, pp. 184–194.
[6] L. Anqiang and L. Jing, "A novel scheme for robust video watermark in the 3-D DWT domain," in Proc. ISDPE, Nov. 2007, pp. 514–516.
[7] A. Grzeszczak, M. K. Manal, S. Panchanathan, and T. H. Yeap, "VLSI implementation of discrete wavelet transform," IEEE Trans. VLSI Syst., vol. 4, no. 4, pp. 421–433, Dec. 1996.
[8] J. C. Limpueco and M. A. Bayoumi, "A VLSI architecture for separable 2-D discrete wavelet transform," J. VLSI Signal Process. Syst., vol. 18, pp. 125–140, Feb. 1998.
[9] C. Chrysafis and A. Ortega, "Line-based, reduced memory, wavelet image compression," IEEE Trans. Image Process., vol. 9, no. 3, pp. 378–389, Mar. 2000.
[10] F. Marino, "Efficient high-speed/low-power pipelined architectures for direct 2-D wavelet transform," IEEE Trans. Circuits Syst. II: Analog Digital Signal Process., vol. 47, no. 12, pp. 1476–1491, Dec. 2000.
[11] F. Marino, "Two fast architectures for direct 2-D wavelet transform," IEEE Trans. Signal Process., vol. 49, no. 6, pp. 1248–1259, Jun. 2001.
[12] C. Yu and S. J. Chen, "VLSI implementation of 2-D wavelet transform for real-time video signal processing," IEEE Trans. Consumer Electron., vol. 43, no. 4, pp. 1270–1279, Nov. 1997.
[13] P.-C. Wu and L.-G. Chen, "An efficient architecture for 2-D discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 536–545, Apr. 2001.
[14] S. B. Pan and R. H. Park, "Systolic array architectures for computation of the discrete wavelet transform," J. Visual Commun. Image Representat., vol. 14, no. 3, pp. 217–231, Sep. 2003.
[15] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, "Generic RAM-based architectures for 2-D discrete wavelet transform with line-based method," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. 910–920, Jul. 2005.
[16] M. Weeks and M. A. Bayoumi, "3-D discrete wavelet transform architecture," IEEE Trans. Signal Process., vol. 50, no. 8, pp. 2050–2063, Aug. 2002.
[17] M. Weeks and M. A. Bayoumi, "Wavelet transform: Architecture, design and performance issues," J. VLSI Signal Process., vol. 35, no. 2, pp. 155–178, Sep. 2003.
[18] W. Badawy, M. Talley, G. Zhang, M. Weeks, and M. A. Bayoumi, "Low power very large scale integration prototype for 3-D discrete wavelet transform processor with medical application," J. Electron. Imaging, vol. 12, no. 2, pp. 270–277, Apr. 2003.
[19] B. Das and S. Banerjee, "A memory efficient 3-D DWT architecture," in Proc. 16th Int. Conf. VLSI Design, Aug. 2003, p. 208.
[20] Q. Dai, X. Chen, and C. Lin, "A novel VLSI architecture for multidimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, pp. 1105–1110, Aug. 2004.
[21] ITRS. (2005). International Technology Roadmap for Semiconductors [Online]. Available: http://public.itrs.net/
[22] G. E. Moore, "Cramming more components onto integrated circuits," Proc. IEEE, vol. 86, no. 1, pp. 82–85, Jan. 1998.
[23] G. E. Moore, "Lithography and the future of Moore's law," in Proc. 12th SPIE Adv. Resist Technol. Process., vol. 2438, Jun. 1995, pp. 2–17.
[24] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[25] N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Boston, MA: Pearson/Addison-Wesley, 2005.
[26] A. Vincent, D. Wang, and L. Zhang. (2006, Dec.). Codec CRC-WVC Outperforms H.264 Video with Wavelets [Online]. Available: http://www.byte.com/documents

Basant K. Mohanty (M'06) received the B.Sc. and M.Sc. degrees in physics from Sambalpur University, Orissa, India, in 1987 and 1989, respectively, and the Ph.D. degree in the field of VLSI for digital signal processing from Berhampur University, Orissa, in 2000.

In 1992, he was selected by the Orissa Public Service Commission to become a Faculty Member with the Department of Physics, SKCG College, Paralakhemundi, Orissa. In 2001, he was a Lecturer with the Department of Electrical and Electronic Engineering, BITS Pilani, Rajasthan, India. Then, he was an Assistant Professor with the Department of Electrical and Computer Engineering, Mody Institute of Science and Technology, Deemed University, Rajasthan. In 2003, he joined the Jaypee Institute of Engineering and Technology, Guna, Madhya Pradesh, India, as an Assistant Professor. He was promoted to Associate Professor in 2005 and Professor in 2007. He has published nearly 25 technical papers. His name appeared in Marquis Who's Who in the World in 1999. His current research interests include the design and implementation of reconfigurable VLSI architectures for resource-constrained digital signal processing applications.

Dr. Mohanty is a lifetime member of the Institution of Electronics and Telecommunication Engineering, New Delhi, India.

Pramod Kumar Meher (SM'03) received the B.Sc. and M.Sc. degrees in physics and the Ph.D. degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

He has a wide scientific and technical background covering physics, electronics, and computer engineering. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment, he was a Visiting Faculty Member with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of computer applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in physics with various government colleges in India from 1981 to 1993. He has published nearly 150 technical papers in various reputed journals and conference proceedings. His current research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal processing, image processing, communication, and intelligent computing.

Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India, and the Institution of Engineering and Technology, U.K. He is currently serving as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs, the IEEE Transactions on Very Large Scale Integration Systems, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for Excellence in Research in Engineering and Technology for 1999.