
IEEE Transactions on Circuits and Systems for Video Technology, Vol. 20, No. 9, September 2010

Parallel and Pipeline Architectures for High-Throughput Computation of Multilevel 3-D DWT

Basant K. Mohanty, Member, IEEE, and Pramod K. Meher, Senior Member, IEEE

Abstract—In this paper, we present a throughput-scalable parallel and pipeline architecture for high-throughput computation of multilevel 3-D discrete wavelet transform (3-D DWT). The computation of 3-D DWT for each level of decomposition is split into three distinct stages, and all three stages are implemented in parallel by a processing unit consisting of an array of processing modules. The processing unit for the first-level decomposition of a video stream of frame-size (M × N) consists of Q/2 processing modules, where Q is the number of input samples available to the structure in each clock cycle. The processing unit for a higher level of decomposition requires 1/8 times the number of processing modules required by the processing unit for its preceding level. For J-level 3-D DWT of a video stream, each of the proposed structures involves J processing units in a cascaded pipeline. The proposed structures have a small output latency, and can perform multilevel 3-D DWT computation with 100% hardware utilization efficiency. The throughput rates of the proposed structures are Q/7 times higher than the best of the corresponding existing structures. Interestingly, the proposed structures involve a frame-buffer of O(MN), while the frame-buffer size of the existing structures is O(MNR). Besides, the on-chip storage and the frame-buffer size of the proposed structure are independent of the input block size, which favors the derivation of highly concurrent parallel architectures for high-throughput implementation. The overall area-delay products of the proposed structure are significantly lower than those of the existing structures, although it involves slightly higher multiplier-delay and adder-delay products, since it involves a significantly smaller frame-buffer and storage-word-delay product. The throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage and frame-memory, by using a larger number of processing modules, and it provides greater advantage over the existing designs for higher frame-rates and higher input block sizes. The fully parallel implementation of the proposed scalable structure provides the best of its performance. When the very high throughput generated by such a parallel structure is not required, the structure could be operated by a slower clock, where speed could be traded for power by scaling down the operating voltage, and/or the processing modules could be implemented by slower but hardware-efficient arithmetic circuits.

Manuscript received January 17, 2009; revised September 2, 2009 and December 24, 2009. Date of publication July 26, 2010; date of current version September 9, 2010. This paper was recommended by Associate Editor S.-Y. Chien.

B. K. Mohanty is with the Department of Electronics and Communication Engineering, Jaypee Institute of Engineering and Technology, Guna 473226, Madhya Pradesh, India (e-mail: [email protected]).

P. K. Meher is with the Department of Embedded Systems, Institute for Infocomm Research, Singapore 138632 (e-mail: [email protected]).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

    Digital Object Identifier 10.1109/TCSVT.2010.2056950

Index Terms—3-D DWT, discrete wavelet transform, parallel and pipeline architecture, very large scale integration (VLSI).

    I. Introduction

THE DISCRETE wavelet transform (DWT) is widely used due to its multiple time-frequency resolution, which gives it a remarkable advantage over unitary transforms such as the discrete Fourier transform (DFT), discrete cosine transform (DCT), and discrete sine transform (DST) for various applications. DWT of different dimensions has emerged as a powerful tool for speech and image coding in recent years. The 3-D DWT is found to provide superior performance in video compression by eliminating the temporal redundancies within the video sequences for motion compensation [1]. Apart from that, 3-D DWT has been used popularly for the compression of 3-D and 4-D medical images, volumetric image compression, video watermarking, etc. [2]–[6]. The multidimensional DWTs are particularly computation intensive and, therefore, need to be implemented in very large scale integrated (VLSI) systems for real-time applications.

The 2-D and 3-D DWTs can be realized either by a separable approach or by a non-separable approach. The separable approach is more popular than the other, because it involves less computation for the same throughput performance. The main building blocks of separable multidimensional DWT cores are constituted by lower dimensional transform modules and one or more transposition units. Although the speed performance and hardware complexity of a multidimensional DWT architecture substantially depend on the implementation of its lower dimensional modules, the area and time overhead of the transposition unit significantly affects the overall performance of the computing structure [7]–[9]. In order to avoid the transposition unit of 2-D DWT computation, some pipeline architectures have been proposed for the implementation of non-separable 2-D DWT [10]–[12]. Attempts have also been made to reduce the complexities of the transposition unit. Wu et al. [13] have proposed a line-based folded architecture to reduce the size of the transposition buffer and the overall complexity of the separable 2-D DWT device. In this paper, we aim at presenting another possible approach for separable implementation of multidimensional DWT by parallel processing with appropriate scheduling of computation to achieve significant reduction in storage requirement and area-delay product.



As of now, only a few designs for VLSI implementation of 3-D DWT are found in the literature [16]–[20]. Weeks et al. have proposed two separate designs (3DW-I and 3DW-II) for computing 3-D DWT [16]–[18]. The 3DW-I design involves on-chip memory of O(MNR), which is impractical to implement in a chip, where (M × N) is the image size and R is the frame-rate of the given video. The 3DW-II structure performs block-by-block processing of 3-D data, and involves very small on-chip memory of O(K³). It, however, involves complex control circuitry and a frame-buffer of size MNR to feed the blocks of input data. The average computation time (ACT) of 3DW-II is also significantly higher than that of 3DW-I. Das et al. [19] have proposed a separable architecture for 3-D DWT, which involves less memory space and lower output latency compared with the earlier designs. Dai et al. [20] have applied the polyphase decomposition technique, and mapped the computation of 3-D DWT onto a systolic architecture. They have used the conventional separable method efficiently in their structure for reducing the on-chip storage space. The systolic design of [20] requires four times the resources of [19], calculates the 3-D DWT nearly four times faster, and involves significantly less storage space than the other. One major problem with the designs proposed in [19] and [20] is that they compute the multilevel 3-D DWT by a level-by-level approach (similar to the folded scheme proposed by Wu et al. [13]) using an external frame-buffer of size (MNR)/8.

The external frame-buffer is a major hardware component in the existing structures, since the frame-size for practical video transmission applications may vary from 176 × 144 (screen size of mobile phones) to 1920 × 1080 [screen size of high-definition television (HDTV)], and the frame-rate can vary from 15 f/s in mobile phones to 60 f/s for HDTV applications. The on-chip storage and the frame-buffer contribute more than 90% to the total area of the existing structures. A significant amount of memory bandwidth and computation time is also wasted in accessing the external frame-buffer. It is also observed that the on-chip storage and frame-buffer size remain independent of the throughput rate. This motivates us to apply a concurrent design method, which has a two-fold advantage: the frame-buffer size could be reduced, and the on-chip memory of the 3-D structure can be used more efficiently to calculate multiple outputs per cycle to improve the overall performance of the chip. Since silicon devices are progressively scaling according to Moore's law [21]–[23], more components are accommodated in integrated circuits, and at the same time silicon cost has been declining fast over the years. Hence, it may be considered an appropriate strategy to design parallel architectures where area can be traded either for time, or for power if faster computation is not required by the application. If high throughput is not required for a given application, then the clock frequency could be reduced and a lower operating voltage could be used for reducing the power consumption [24]. Keeping this in mind, we have proposed a parallel architecture for multilevel 3-D DWT. The key ideas used in our proposed approach are:

1) to process each decomposition level of 3-D DWT in separate computing blocks of a cascaded pipeline structure for concurrent computation of the multilevel DWT, in order to reduce the size of the frame-buffer used for buffering of the subband components and to maximize hardware utilization efficiency;

2) to fold the input rows of each level appropriately to meet the desired throughput rate and to achieve 100% hardware utilization efficiency (HUE) of the processing unit.

Using the above approach, we have reduced the frame-buffer size, and have obtained Q/7 times higher throughput compared with the best of the existing structures, using on-chip memory of the same order. It is shown that the proposed structure can calculate the DWT coefficients of an input video signal of size (M × N × R) in MNR/Q cycles. The proposed parallel implementation of 3-D DWT structures offers an additional advantage, since the size of the frame-buffer could be reduced, and it does not demand higher on-chip memory and frame-buffer for higher input block sizes, which contribute most of the hardware in the existing designs.

The remainder of this paper is organized as follows. The mathematical formulation of the 3-D DWT computation is presented in Section II. The proposed architecture for 1-level 3-D DWT is presented in Section III. A pipeline architecture for multilevel 3-D DWT is presented in Section IV. The hardware complexity and performance of the proposed structure are discussed in Section V. Our conclusion is presented in Section VI.

    II. Mathematical Formulation

The 3-D DWT coefficients of any decomposition level can be obtained from the scaling coefficients of its previous level according to the pyramid algorithm, given by

lll_j(n_1, n_2, n_3) = \sum_{i_1=0}^{K_h-1} \sum_{i_2=0}^{K_h-1} \sum_{i_3=0}^{K_h-1} h_1(i_1)\, h_2(i_2)\, h_3(i_3)\; lll_{j-1}(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (1)

llh_j(n_1, n_2, n_3) = \sum_{i_1=0}^{K_h-1} \sum_{i_2=0}^{K_h-1} \sum_{i_3=0}^{K_g-1} h_1(i_1)\, h_2(i_2)\, g_3(i_3)\; lll_{j-1}(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (2)

\vdots

hhh_j(n_1, n_2, n_3) = \sum_{i_1=0}^{K_g-1} \sum_{i_2=0}^{K_g-1} \sum_{i_3=0}^{K_g-1} g_1(i_1)\, g_2(i_2)\, g_3(i_3)\; lll_{j-1}(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (3)

for n1 = 0, 1, ..., (R/2)−1, n2 = 0, 1, ..., (M/2)−1, and n3 = 0, 1, ..., (N/2)−1, where Kh and Kg are, respectively, the lengths of the low-pass and high-pass filters, M and N are, respectively, the height and width of the image, and R is the frame-rate of the video stream.


Fig. 1. Proposed structure for computation of 1-level convolution-based 3-D DWT of frame-size (M × N). z1_l(n1, n2, q)p and z1_h(n1, n2, q)p represent the four subband components {z1_lll(n1, n2, q)p, z1_hll(n1, n2, q)p, z1_lhl(n1, n2, q)p, z1_hhl(n1, n2, q)p} and {z1_llh(n1, n2, q)p, z1_hlh(n1, n2, q)p, z1_lhh(n1, n2, q)p, z1_hhh(n1, n2, q)p}, respectively. z(n1, n2, q)p is defined as the qth data sample of the pth block of the n2th row of the n1th frame of the subband matrix, where 0 ≤ q ≤ (Q/2)−1, 0 ≤ p ≤ P−1, 0 ≤ n2 ≤ (M/2)−1, 0 ≤ n1 ≤ (R/2)−1, and N = PQ.

Assuming K = Kh = Kg, (1)–(3) can be represented in the generalized form

z(n_1, n_2, n_3) = \sum_{i_1=0}^{K-1} \sum_{i_2=0}^{K-1} \sum_{i_3=0}^{K-1} w_1(i_1)\, w_2(i_2)\, w_3(i_3)\; x(2n_1 - i_1,\, 2n_2 - i_2,\, 2n_3 - i_3)    (4)

where z(·) corresponds to lll_j(·), llh_j(·), ..., hhh_j(·), and x(·) corresponds to lll_{j−1}(·) in (1)–(3), while w1(i), w2(i), and w3(i), respectively, correspond to the filter coefficients (h1(i) or g1(i)), (h2(i) or g2(i)), and (h3(i) or g3(i)).

The computation of (4) can be decomposed into three distinct stages as

z(n_1, n_2, n_3) = \sum_{i=0}^{K-1} w_3(i)\, v(2n_1 - i,\, n_2,\, n_3)    (5)

v(n_1, n_2, n_3) = \sum_{i=0}^{K-1} w_2(i)\, u(n_1,\, 2n_2 - i,\, n_3)    (6)

u(n_1, n_2, n_3) = \sum_{i=0}^{K-1} w_1(i)\, x(n_1,\, n_2,\, 2n_3 - i).    (7)

[u(n1, n2, n3)] represents the low-pass and high-pass output matrices [u_l(n1, n2, n3)] and [u_h(n1, n2, n3)], respectively, corresponding to the 3-D input [x(n1, n2, n3)], while [v(n1, n2, n3)] of (6) represents the four subband output matrices [v_ll(n1, n2, n3)], [v_lh(n1, n2, n3)], [v_hl(n1, n2, n3)], and [v_hh(n1, n2, n3)], where (v_ll(n1, n2, n3), v_lh(n1, n2, n3)) constitute the low-pass and high-pass outputs resulting from the intermediate output u_l(n1, n2, n3), and (v_hl(n1, n2, n3), v_hh(n1, n2, n3)) constitute the low-pass and high-pass outputs resulting from the intermediate output u_h(n1, n2, n3). Similarly, z(n1, n2, n3) of (4) represents the eight oriented selective subband outputs z_lll(n1, n2, n3), z_llh(n1, n2, n3), z_lhl(n1, n2, n3), z_lhh(n1, n2, n3), z_hll(n1, n2, n3), z_hlh(n1, n2, n3), z_hhl(n1, n2, n3), and z_hhh(n1, n2, n3), corresponding to the low-pass and high-pass outputs of the four intermediate outputs {v_ll(n1, n2, n3), v_lh(n1, n2, n3), v_hl(n1, n2, n3), v_hh(n1, n2, n3)}.

Using the decomposition scheme of (5)–(7), the computation of 3-D DWT can be performed in three distinct stages as follows (a behavioral sketch of these stages is given after the list).

1) In stage-1, low-pass and high-pass filtering is performed row-wise on each input frame (intra-frame) to produce the intermediate matrices [U_l] and [U_h] according to (7).

2) In stage-2, low-pass and high-pass filtering is performed column-wise on each of the intermediate coefficient matrices [U_l] and [U_h] to generate the four subband matrices [V_ll], [V_lh], [V_hl], and [V_hh] according to (6).

3) Finally, in stage-3, low-pass and high-pass filtering is performed on the inter-frame subbands to obtain the eight oriented selective subband matrices [Z_lll], [Z_llh], [Z_lhl], [Z_lhh], [Z_hll], [Z_hlh], [Z_hhl], and [Z_hhh] of size [M/2 × N/2 × R/2].
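As a cross-check of this stage-wise dataflow, the following is a minimal behavioral sketch in Python/NumPy (a reference model only, not the proposed hardware): it filters row-wise, column-wise, and then across frames with decimation by 2, using cyclic extension in the spirit of the row/column extension described in Section III. The function names, the boundary handling, and the Daub-4 taps of the example are our assumptions.

```python
import numpy as np

def analysis_1d(x, h, g, axis):
    """One level of convolution-based 1-D DWT analysis along `axis`
    with decimation by 2 and cyclic extension; returns (low, high)."""
    K = len(h)
    xm = np.moveaxis(np.asarray(x, dtype=float), axis, -1)
    n = xm.shape[-1]
    # window indices for y(k) = sum_i w(i) * x(2k - i), taken modulo n
    idx = (2 * np.arange(n // 2)[:, None] - np.arange(K)[None, :]) % n
    win = xm[..., idx]                                   # (..., n//2, K)
    lo = win @ np.asarray(h, dtype=float)
    hi = win @ np.asarray(g, dtype=float)
    return np.moveaxis(lo, -1, axis), np.moveaxis(hi, -1, axis)

def dwt3d_one_level(video, h, g):
    """Three-stage separable 1-level 3-D DWT of (5)-(7): stage-1 row-wise,
    stage-2 column-wise, stage-3 inter-frame.  `video` has shape (R, M, N);
    returns the eight (R/2, M/2, N/2) subbands keyed 'lll', ..., 'hhh'
    (letters in row/column/temporal filter order)."""
    ul, uh = analysis_1d(video, h, g, axis=2)            # stage-1, per (7)
    v = {}
    for a, u in (("l", ul), ("h", uh)):                  # stage-2, per (6)
        v[a + "l"], v[a + "h"] = analysis_1d(u, h, g, axis=1)
    z = {}
    for ab, vb in v.items():                             # stage-3, per (5)
        z[ab + "l"], z[ab + "h"] = analysis_1d(vb, h, g, axis=0)
    return z

if __name__ == "__main__":
    # Daub-4 analysis filters (K = 4); high-pass taken as quadrature mirror of low-pass
    h = np.array([0.4829629131, 0.8365163037, 0.2241438680, -0.1294095226])
    g = h[::-1] * np.array([1.0, -1.0, 1.0, -1.0])
    sub = dwt3d_one_level(np.random.rand(8, 8, 8), h, g)
    print(sorted(sub), sub["lll"].shape)                 # eight subbands, each (4, 4, 4)
```

For an 8 × 8 × 8 test volume, each of the eight subbands has size 4 × 4 × 4, consistent with the [M/2 × N/2 × R/2] subband size stated in stage-3 above.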

    III. Proposed Architecture for 1-Level 3-D DWT

Based on the mathematical formulation of the previous section, we derive here a throughput-scalable structure for the implementation of 1-level 3-D DWT, as shown in Fig. 1. It consists of Q/2 regularly arranged processing modules, where Q samples are available to the structure per clock cycle, and N = PQ, such that P clock cycles are involved to feed each of the rows of a frame of the 3-D input. The input data of each frame of size (M × N) is cyclically extended by (K−2) rows and (K−2) columns; the rows of the 3-D input data-matrix are then folded by a factor P, where P is assumed to be a power of 2, and fed block-by-block serially in P successive clock cycles to the input data distribution unit (IDU), such that one complete frame is fed in (M + K − 2)P successive clock cycles. Note that for P = 1, one complete row of data is fed to the IDU in each cycle, which corresponds to the fully parallel structure. Each input data-block (X_{n1,n2,p}) is comprised of (Q + K − 2) consecutive samples of a given row, and the successive data-blocks, for 0 ≤ p ≤ P−1, are overlapped by (K − 2) samples, where 0 ≤ n1 ≤ R−1 and 0 ≤ n2 ≤ M−1. In every cycle, the IDU derives Q/2 input-vectors I(n1, n2, q)p such that each input-vector I(n1, n2, q)p, for 0 ≤ q ≤ Q/2−1, consists of K consecutive values of a particular input block, where I(n1, n2, q)p = [x(n1, n2, pQ+2q+K−1), x(n1, n2, pQ+2q+K−2), ..., x(n1, n2, pQ+2q+1), x(n1, n2, pQ+2q)].


Fig. 2. Structure of a processing module. Output-1 and output-2, respectively, represent the subband components {Z1_lll, Z1_hll, Z1_lhl, Z1_hhl} and {Z1_llh, Z1_hlh, Z1_lhh, Z1_hhh}.

Fig. 3. Internal structure of subcell-1. Output-1 and output-2, respectively, represent the low-pass and the high-pass components u_l and u_h. For K = 4, I(n1, n2, q)p = [x(n1, n2, pQ + 2q + 3), x(n1, n2, pQ + 2q + 2), x(n1, n2, pQ + 2q + 1), x(n1, n2, pQ + 2q)].

Note that, for K = 4, two adjacent input-vectors of a particular input block are overlapped by two samples. The IDU feeds the Q/2 input-vectors in parallel to the Q/2 processing modules such that the (Q/2 − q)th module receives the input-vector I(n1, n2, q)p during the pth cycle of a period of P successive cycles.
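As an illustration of this data arrangement, the following sketch (helper name and zero-based indexing are ours) mimics how the IDU forms the pth data-block of a cyclically extended row and derives from it the Q/2 overlapping K-sample input-vectors defined above.

```python
import numpy as np

def idu_vectors(row, Q, K, p):
    """Illustrative model of the input data distribution unit (IDU): the row
    is cyclically extended by (K-2) samples, the pth data-block of (Q+K-2)
    samples is taken (successive blocks overlap by K-2 samples), and Q/2
    input-vectors I(q) = [x(pQ+2q+K-1), ..., x(pQ+2q+1), x(pQ+2q)] are formed."""
    ext = np.concatenate([row, row[:K - 2]])        # cyclic column extension
    block = ext[p * Q : p * Q + Q + K - 2]          # pth input data-block
    return np.array([block[2 * q : 2 * q + K][::-1] for q in range(Q // 2)])

# e.g. one row of N = 16 samples, Q = 8 samples per cycle (P = 2), Daub-4 (K = 4):
print(idu_vectors(np.arange(16), Q=8, K=4, p=0))    # 4 vectors, adjacent ones overlap by 2 samples
```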

The internal structure of the processing module is shown in Fig. 2. It consists of three subcells working in separate pipeline stages. Subcell-1, subcell-2, and subcell-3, respectively, perform the computations pertaining to stage-1, stage-2, and stage-3. The arithmetic operations pertaining to the low-pass and high-pass filtering of stage-1 [as given by (7)] are mapped into subcell-1. The internal structure of subcell-1 for K = 4 is shown in Fig. 3. It consists of K multiplication units (MUs) and (2K − 2) adders. Each of the MUs stores a pair of filter coefficients of the low-pass and high-pass filters, such that the (k+1)th MU stores the coefficients h1(k) and g1(k). Four input samples (for K = 4) are fed in parallel to the MUs through the input-vector I(n1, n2, q)p during each clock cycle. The four pairs of multiplications required for computing a pair of filter outputs are implemented concurrently by the four MUs; the outputs of the MUs are added concurrently by an adder-tree. The duration of a clock cycle of the structure is equal to (T_M + 2T_A), where T_M and T_A are, respectively, the multiplication-time and the addition-time. During each cycle, subcell-1 produces a low-pass component u_l(n1, n2, q)p and a high-pass component u_h(n1, n2, q)p of an intermediate matrix. Note that P successive pairs of such outputs of subcell-1 correspond to a particular input row, and form a row of the intermediate matrix, which can be used directly as the input for stage-2 to perform the wavelet filtering along the column direction.
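The per-cycle behavior of subcell-1 amounts to a pair of K-tap dot products on the same input-vector; the hypothetical helper below (name ours) makes this explicit.

```python
def subcell1_cycle(I, h1, g1):
    """Behavioral sketch of one subcell-1 clock cycle: MU-(k+1) holds the
    coefficient pair (h1(k), g1(k)) and multiplies it by sample I[k]; the two
    adder-trees built from the (2K-2) adders then reduce the products to the
    low-pass/high-pass pair (u_l, u_h)."""
    u_l = sum(h1[k] * I[k] for k in range(len(h1)))     # low-pass filter output
    u_h = sum(g1[k] * I[k] for k in range(len(g1)))     # high-pass filter output
    return u_l, u_h

# e.g. feed one of the input-vectors produced by the IDU sketch above:
# u_l, u_h = subcell1_cycle(idu_vectors(row, Q=8, K=4, p=0)[0], h, g)
```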

Fig. 4. Structure of subcell-2. Output-1 and output-2 of subcell-2, respectively, represent the subband components {v_ll, v_hl} and {v_lh, v_hh}.

Fig. 5. (a) Function of the LC. (b) Function of an AC.

The computation of stage-2 [given by (6)] is mapped into subcell-2, where the filtering of the intermediate outputs u_l(n1, n2, q)p and u_h(n1, n2, q)p, for 0 ≤ p ≤ P−1, is time-multiplexed to take advantage of the down-sampling of DWT components. The internal structure of subcell-2 for K = 4 is depicted in Fig. 4. It consists of four MUs, three adder cells (ACs), three serial-in serial-out shift-registers (SRs) of size P words, and one line-changer (LC). During every cycle, a pair of outputs from subcell-1 is fed to the MUs of subcell-2 through the LC. The structure and the function of the LC are shown in Fig. 5(a). In each set of P successive clock cycles, the output lines of a pair of outputs of subcell-1 are interchanged by the LC, such that if the low-pass intermediate output is on a particular output line during a particular set of P successive cycles, then the high-pass intermediate output appears on that output line in the next set of P cycles. The sample values on line-1 are fed to the odd-numbered MUs (MU-1 and MU-3), while the sample values on line-2 are fed to the even-numbered MUs (MU-2 and MU-4). This simple technique of sample loading introduces embedded decimation in the filter computation of u_l(n1, n2, q)p and u_h(n1, n2, q)p along the column direction. Each of the MUs stores a pair of coefficients (h2(k) and g2(k)) of the low-pass and high-pass filters of stage-2. The four MUs of subcell-2 perform the multiplications concurrently. The addition operations are implemented by the ACs operated in a systolic pipeline. The function of each AC is depicted in Fig. 5(b). After a latency of (K + 3P − 1) cycles (where 3P cycles are required to fill the SRs), subcell-2 produces two subband components in each cycle, and in P successive cycles it computes the subband components of a particular row. During two consecutive sets of P cycles, it computes four subband components of a given column. All the Q/2 modules of the linear array structure thus produce a block of Q/2 pairs of 2-D DWT subbands, and one complete row of each of the four subbands of a given frame in every set of 2P cycles. The first-level decomposition of the n1th frame of size (M × N), therefore, can be obtained in MP cycles after an initial latency of (K + 3P) cycles.


Fig. 6. Internal structure of subcell-3 and the intermediate buffer. Input-1 and input-2, respectively, represent [v_ll(n1, n2, q)p] or [v_hl(n1, n2, q)p] and [v_lh(n1, n2, q)p] or [v_hh(n1, n2, q)p]. Output-1 and output-2, respectively, represent the 3-D DWT coefficients [z_lll(n1, n2, q)p], [z_hll(n1, n2, q)p], [z_lhl(n1, n2, q)p], [z_hhl(n1, n2, q)p] and [z_llh(n1, n2, q)p], [z_hlh(n1, n2, q)p], [z_lhh(n1, n2, q)p], [z_hhh(n1, n2, q)p] in time-multiplexed form.

Each of the subcell-2 units gives a pair of outputs in parallel in each cycle, such that in MP successive cycles it gives four columns of the subbands [V_ll], [V_lh], [V_hl], and [V_hh] of a particular frame, where the subband components of ([V_ll] and [V_hl]) or ([V_lh] and [V_hh]) are obtained in time-multiplexed form.

The output of subcell-2 is queued in the intermediate buffer (shown in Fig. 6) to be processed in stage-3 of the computation. The intermediate buffer consists of seven SRs. Each of the SRs holds MP words of either the columns of ([V_ll] and [V_hl]) or of ([V_lh] and [V_hh]), such that the even-numbered SRs hold the elements of [V_ll] and [V_hl] alternately, while the odd-numbered SRs hold the elements of [V_lh] and [V_hh] alternately. Note that successive SRs hold the subband components corresponding to successive frames. The buffering of the subband components of successive frames in the SRs is done to perform the inter-frame wavelet filtering of stage-3.

In stage-3 of the computation, low-pass and high-pass filtering are performed across the frames on the four subband components of stage-2 to obtain the eight oriented selective subband components [Z_lll], [Z_llh], [Z_hll], [Z_hlh], [Z_lhl], [Z_lhh], [Z_hhl], and [Z_hhh] of the 3-D DWT. The computation of each subcell-3 is thus performed by a pair of low-pass and high-pass filters, where the filtering of [V_ll], [V_hl], [V_lh], and [V_hh] is time-multiplexed. The structure of subcell-3 (shown in Fig. 6) is similar to the structure of subcell-1, except that it contains four MUXes to select the required input samples from the intermediate buffer. The intermediate buffer provides the required subband components [V_ll], [V_hl], [V_lh], and [V_hh] of a given frame to subcell-3. The extra SR-1 provides one complete column-delay to the input [V_lh]/[V_hh], so that for the first MP cycles the inter-frame wavelet filtering is performed on the components of [V_ll]/[V_hl], and in the second MP cycles such filtering is performed on the components of [V_lh]/[V_hh]. This process is repeated such that the components of [V_ll]/[V_hl] and [V_lh]/[V_hh] are processed in alternate sets of MP cycles.

The MUXes of each subcell-3 select the inputs of ([V_ll] or [V_hl]) and ([V_lh] or [V_hh]) of consecutive frames from the intermediate buffer in alternate sets of MP cycles, and compute the 3-D DWT coefficients such that, during every odd-numbered set of MP successive cycles, one complete column of the DWT components of the four subbands [Z_lll], [Z_llh], [Z_hll], and [Z_hlh] is obtained from subcell-3, where the subband components of ([Z_lll], [Z_llh]) and ([Z_hll], [Z_hlh]) are time-multiplexed. Similarly, during every even-numbered set of MP successive cycles, one column of the other four subband coefficients [Z_lhl], [Z_lhh], [Z_hhl], and [Z_hhh] is obtained from subcell-3. The linear array of Q/2 processing modules thus produces Q/2 coefficients of each of the eight subbands in every computational cycle, such that the 3-D DWT computation of a given frame is completed in MP cycles. The complete 3-D DWT of size (M × N × R) can be obtained in MPR cycles, where R is the frame-rate of the video stream and M and N are, respectively, the height and width of each frame. The entire linear array of Q/2 processing modules can be implemented in a processing unit (PU-1) for the first-level DWT computation. Similar processing units can also be designed for the computation of higher level decompositions, and all those processing units can be integrated into a pipeline structure for concurrent implementation of multilevel 3-D DWT.

IV. Proposed Pipeline Architecture for Multilevel 3-D DWT

In multilevel 3-D DWT, the [Z_lll] subband of the current decomposition level is further processed to calculate the 3-D DWT of the next higher level of decomposition. Since the 3-D DWT structure for each level of decomposition performs decimated filtering, the number of arithmetic operations required to calculate the 3-D DWT of each higher level decreases consistently by a factor of eight. The amount of hardware resources for calculating the DWT coefficients of every higher level of decomposition should, therefore, be reduced by a factor of 8 in order to achieve 100% HUE. Based on this point of view, we have derived a fully pipelined structure for the implementation of J-level 3-D DWT as shown in Fig. 7, where J = log_8(Q/2) + 1. It is comprised of J PUs, where PU-j, for 1 ≤ j ≤ J, performs the computation of the jth decomposition level. The PUs are connected in a linear structure, and work in separate pipeline stages. The structure of each PU is similar to the 1-level 3-D DWT structure shown in Fig. 1.
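The level-wise scaling just described can be summarized in a small sketch (helper name ours): the number of processing modules of PU-j follows from reducing the Q/2 modules of PU-1 by a factor of 8 per level, i.e., Q/2^(3j−2), with J = log_8(Q/2) + 1 levels in total.

```python
from math import log2

def pu_module_counts(Q):
    """Number of decomposition levels J = log8(Q/2) + 1 and the number of
    processing modules of PU-j, which shrinks by 8 per level:
    (Q/2) / 8**(j-1) = Q / 2**(3j-2)."""
    J = int(round(log2(Q // 2) / 3)) + 1
    return J, [Q // 2 ** (3 * j - 2) for j in range(1, J + 1)]

print(pu_module_counts(Q=16))    # (2, [8, 1]): PU-1 has Q/2 = 8 modules, PU-2 has Q/16 = 1
print(pu_module_counts(Q=128))   # (3, [64, 8, 1])
```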

PU-1 consists of Q/2 processing modules to calculate the 3-D DWT of the first-level decomposition. It gives one row (of size N/2) of four out of the eight subbands in 2P successive cycles, such that all the M/2 rows of those subbands are obtained in MP successive cycles. Similarly, all the M/2 rows of the other four subbands of the 3-D DWT are obtained during the next MP cycles.


Fig. 7. Proposed pipeline structure for multilevel 3-D DWT, where J = log_8(Q/2) + 1 and Q/2^(3J−3) ≥ 2.

Note that the low-low subband components of the first level (Z1_lll) of a particular frame are obtained from PU-1 during alternate sets of MP cycles. Each output block (consisting of Q/2 samples) corresponding to Z1_lll is folded by a factor of 4, and fed to PU-2 in four successive cycles, such that one row of Z1_lll is fed in 4P cycles, and a complete frame in 2MP cycles. PU-2, therefore, performs the processing of half of the components of the matrix Z1_lll (of size [M/2 × N/2]) of a particular frame in MP cycles, while PU-1 generates the entire DWT components of Z1_lll of a particular frame in alternate sets of MP cycles. PU-2 uses an input-buffer (IB2) of size MN/8 words to store half of the output values of Z1_lll corresponding to a frame. PU-2 is comprised of Q/16 processing modules arranged in a linear array structure similar to that of PU-1. It receives a block of Q/8 intermediate outputs through IB2 in every cycle, and calculates a block of Q/16 components of a pair of subband matrices in every cycle. The subcell-1 and intermediate buffer of each processing module of PU-2 are the same as those of PU-1, but each SR of subcell-2 (Fig. 4) and subcell-3 (Fig. 6) of PU-1 is replaced by SRs of size 4P and 2MP, respectively, to obtain the corresponding subcells of PU-2. PU-2 calculates one complete row of N/4 coefficients of all the eight subband matrices in 16P cycles, and completes the 2-level decomposition in MPR cycles. Similarly, the low-low subband components of Z^(J−1)_lll are buffered in IB_J of size MN/2^(2J−1) words. The Jth PU is comprised of Q/2^(3J−2) identical modules, and receives a block of Q/2^(3J−3) intermediate outputs from IB_J in every cycle, where Q/2^(3J−3) ≥ 2. The SR sizes of subcell-2 and subcell-3 of each module of the Jth PU are, respectively, 4^(J−1)P and 2^(J−1)MP words. It calculates one row of N/2^J coefficients of a pair of subbands in 4^(J−1)P cycles, and one row of all the eight subband matrices in 4^J P cycles. It takes MPR cycles to complete the calculation of all the eight subbands [each of size (MR/4^J × N/2^J)]. The PUs of the proposed structure work in separate pipeline stages, and compute the J-level 3-D DWT of an input video stream (M × N × R) in MPR cycles. The latency of the proposed structure is estimated to be

\text{Latency} = \sum_{j=1}^{J} (\text{PU-}j)_{\text{latency}} = \sum_{j=1}^{J} \left[ K + 4^{j-1}(K-1)P + 2^{j-1}(K-1)MP \right] = KJ + \frac{(4^{J}-1)(K-1)P}{3} + (2^{J}-1)(K-1)MP.    (8)

Note that a delay of (4^J − 1)(K − 1)P/3 + (2^J − 1)(K − 1)MP cycles is introduced to fill the SRs of subcell-2 and the intermediate buffer of each module, where K is the filter order.
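As a worked check of (8), the sketch below (helper name ours) sums the per-PU latencies and compares them with the closed form, taking the Daub-4 filter (K = 4), J = 2 levels, and M = 176 with P = N/Q = 144/16 = 9, following the (M, N, R) ordering used in Table III.

```python
def pipeline_latency(K, J, M, P):
    """Latency of the J-level pipeline per (8): sum over j of the per-PU
    latencies K + 4**(j-1)*(K-1)*P + 2**(j-1)*(K-1)*M*P, and its closed form."""
    per_pu = [K + 4 ** (j - 1) * (K - 1) * P + 2 ** (j - 1) * (K - 1) * M * P
              for j in range(1, J + 1)]
    closed = K * J + (4 ** J - 1) * (K - 1) * P // 3 + (2 ** J - 1) * (K - 1) * M * P
    assert sum(per_pu) == closed       # (4**J - 1) is always divisible by 3
    return closed

print(pipeline_latency(K=4, J=2, M=176, P=9))   # 14399 cycles for this example
```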

    V. Hardware and Time Complexity

The proposed 3-D DWT structure for J levels of decomposition involves J processing units, where J = log_8(Q/2) + 1. Each of the processing units (PUs) operates in a separate pipeline stage. The jth PU involves Q/2^(3j−2) identical modules to perform the 3-D DWT of the jth level with 100% hardware utilization efficiency. Each module has three subcells (subcell-1, subcell-2, and subcell-3).

Each subcell of the proposed structure requires 2K multipliers and (2K − 2) adders. In addition to this, subcell-2 and the intermediate buffer of a module of the jth PU, respectively, involve 2^(2j−1)(K − 1)P and 2^(j−1)(2K − 1)MP delay registers. Each module, therefore, involves 6K multipliers, 6(K − 1) adders, and 2^(2j−1)(K − 1)P + 2^(j−1)(2K − 1)MP registers. Along with the Q/2^(3j−2) processing modules, the jth PU (except the first PU) involves an input-buffer (IB_j) of size MN/2^(2j−1) words, for 2 ≤ j ≤ J. The size of each subband matrix of the jth-level 3-D DWT is [(MR/4^j) × (N/2^j)] for input size (M × N × R). Eight such subband matrices are computed by the jth PU in MPR cycles. The PUs of the proposed pipeline structure concurrently calculate the 3-D DWT of an input video stream of size (M × N) and frame-rate R in approximately MPR cycles. The hardware complexity of the proposed structure for J-level 3-D DWT is estimated as follows.

Number of multipliers:

3KQ + 3KQ/8 + \cdots + 6KQ/2^{3J-2} = \frac{24}{7} K Q (1 - 2^{-3J})

Number of adders:

3(K-1)Q + 3(K-1)Q/8 + \cdots + 6(K-1)Q/2^{3J-2} = \frac{24}{7}(K-1)Q(1 - 2^{-3J})

Number of pipeline/data registers:

[(K-1)N + (2K-1)MN/2] + [(K-1)N/2 + (2K-1)MN/8] + \cdots + [(K-1)N/2^{J-1} + (2K-1)MN/2^{2J-1}] = 2(K-1)N(1 - 2^{-J}) + \frac{2}{3}(2K-1)(1 - 2^{-2J})MN


Input-buffer size (in words):

MN/8 + MN/32 + \cdots + MN/2^{2J-1} = MN(1 - 2^{-2J+2})/6
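These closed-form counts can be evaluated directly; the sketch below (function name ours) reproduces the "Proposed" rows of Table III for the Daub-4 filter (K = 4) and J = 2, taking (M, N) = (176, 144) in the order used in Table III.

```python
def complexity(K, Q, J, M, N):
    """Multipliers, adders, pipeline/data registers (on-chip storage), and
    input-buffer words of the proposed J-level structure, per the closed-form
    expressions above; the ACT is MNR/Q cycles (R supplied by the caller)."""
    x1 = 1 - 2 ** (-3 * J)
    mult = 24 / 7 * K * Q * x1
    add = 24 / 7 * (K - 1) * Q * x1
    regs = 2 * (K - 1) * N * (1 - 2 ** (-J)) \
         + 2 / 3 * (2 * K - 1) * (1 - 2 ** (-2 * J)) * M * N
    in_buf = M * N * (1 - 2 ** (-2 * J + 2)) / 6
    return round(mult), round(add), round(regs), round(in_buf)

# Daub-4 (K = 4), J = 2, QCIF frames taken as (M, N) = (176, 144):
for Q in (16, 48, 72, 144):
    print(Q, complexity(K=4, Q=Q, J=2, M=176, N=144))
# Q = 16 gives (216, 162, 111528, 3168), matching the corresponding row of Table III
```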

    A. Performance Comparison

The hardware and time complexities of the proposed structure and the existing structures of [16], [19], and [20] are listed in Table I in terms of cycle period, ACT¹ in clock cycles, storage space and frame-buffer size in words, along with the number of multipliers and adders, for comparison. The overall hardware complexity of the structures has two major components: 1) the complexity of the arithmetic units, and 2) the on-chip storage and external frame-buffer. When the frame-size (M × N) is small, the complexity of the arithmetic units is more significant than that of the storage component. But the frame-size for video processing varies from as low as 176 × 144 for 3G phones to 640 × 480 in the standard-definition format and 1280 × 720 in the commonly used high-definition format, while the frame-rate can vary from 15 f/s in mobile phones to 60 f/s for HDTV applications. Therefore, in practice, the on-chip storage and frame-buffer make the dominant contribution to the hardware of the overall structure.

As shown in Table I, the proposed structure requires 4Q/7 times more multipliers and adders than those of [16] and [19]. Compared with [20], the proposed one requires Q/7 times more multipliers and adders. The on-chip storage of the proposed structure is 4/3 times and 2M/3R times that of the structures of [19] and [20], respectively. Compared with the 3DW-I and 3DW-II structures of [16], it involves nearly 2K/3R times less and 4MN/3K² times more on-chip storage, respectively. But, it is interesting to note that, unlike the existing structures, the frame-buffer of the proposed structure is independent of the frame-rate R; it involves (3R/4) times less frame-buffer than the others; and it offers a higher throughput rate. Moreover, the proposed structure provides 4Q/7 and Q/7 times higher throughput per cycle compared with ([16], [19]) and [20], respectively, with a slightly higher clock period. Compared with the block-based 3DW-II structure of [16], the proposed one can offer K²Q/7 times higher throughput rate. Since the frame-buffer as well as the on-chip storage of the proposed structure is independent of the input block size Q, the throughput per clock cycle of the proposed structure can be increased proportionately, by using a larger number of processing modules, without increasing its memory component. When higher processing speed is not required, the structure could be operated by a slower clock, where speed could be traded for power by scaling down the operating voltage [24]. Alternatively, the multiplications and additions of the proposed structure either could be time-multiplexed inside the processing modules, or could be implemented by slower but hardware-efficient multipliers and adders.

¹ACT is the number of cycles required for the computation of all the J levels of 3-D DWT after the initial latency. In the case of the proposed structures, it is calculated by dividing the total number of 3-D DWT coefficients by the throughput per cycle. In the case of the structures of [19] and [20], the ACT is calculated as the sum of the ACTs of the individual levels, because they compute the 3-D DWT of different levels sequentially. For the proposed structures the ACT is MNR/Q cycles, since all the levels of the multilevel DWT computation are performed in separate pipeline stages.

Moreover, the proposed structure involves a small output latency of O(MN/Q), and performs the multilevel 3-D DWT computation with 100% utilization efficiency.

Since the throughput rates of the existing structures and the proposed structures are different, we have compared the area-delay products pertaining to different hardware components, namely the multiplier-delay product (MDP), adder-delay product (ADP), storage-word-delay product (SWDP), and frame-buffer-delay product (FBDP), estimated as the product of the respective hardware component with the computation time, where the computation time is estimated as the product of the cycle time with the ACT in cycles (since the structures are fully pipelined and there is no inter-frame latency). We have estimated the MDP, ADP, SWDP, and FBDP of the proposed structure and those of the existing structures, and listed them in Table II. The proposed structure is found to involve nearly 1.33 times more MDP and ADP than the structures of [19], [20], and 3DW-I of [16] for the first-level DWT computation, if we assume the multiplication time to be twice the addition time.² For each of the subsequent levels, the MDP and ADP of the proposed structure fall by a factor of 8. The SWDPs of the 3DW-I structure of [16] and of the structures of [19] and [20] are, respectively, QR times, Q times, and QR/M times higher than that of the proposed structure. However, the SWDP of the block-based (3DW-II) structure of [16] is (MN/K⁴Q) times less than that of the proposed structure. It is found that the FBDP of the proposed structure is QR times less than that of the existing structures.

The hardware and time complexities pertaining to a frame-size of 176 × 144 and frame-rates of 15, 30, and 60 f/s are estimated for the proposed structure for input block sizes (16, 48, 72, 144) and for J = 2, and are compared with those of the existing structures in Table III. As shown in Table III, the structure of [20] involves four times the multipliers and adders of [19], but offers nearly four times the throughput of the other. The structure of [20] requires nearly 11.6 times less on-chip storage for the Daub-4 filter compared with that of [19] for 15 f/s. The block-based structure of [16] requires nearly 78.2 times less on-chip storage for Daub-4 filters and offers a nearly 28 times lower throughput rate than that of [20] for 15 f/s. As shown in Table III, the proposed structure for the Daub-4 filter requires nearly (2.25, 6.75, 10.125, 20.25) times more multipliers and adders than [20] for input block sizes (Q = 16, 48, 72, 144), respectively, and provides nearly (2.25, 6.75, 10.125, 20.25) times more throughput rate than the latter. For 15 f/s, it requires 12.73 times more on-chip storage than the structure of [20], but for 60 f/s it involves 3.27 times more on-chip storage than [20] for input block sizes (Q = 16, 48, 72, 144). The storage complexity of the proposed structure remains the same for different frame-rates, but in the case of [20] the on-chip storage as well as the frame-buffer size increase linearly with the frame-rate.

To estimate the transistor counts, we have assumed ripple-carry adders (RCAs) and RCA-based multipliers of 8-bit input width for all the structures.

²We have synthesized the multipliers and adders for 8-bit and 12-bit signed as well as unsigned numbers with Synopsys Design Compiler using the DesignWare building block library, and estimated the multiplication time and addition time pertaining to TSMC 90 nm process technology. The average multiplication time is found to be nearly 2.1 times the average addition time.


TABLE I
General Comparison of Hardware and Time Complexities of the Proposed Structure and the Existing Structures

Structures | Multipliers | Adders | On-Chip Storage | Frame-Buffer | Cycle Period | ACT
Weeks et al. [16] (3DW-I) | 6K | 6(K−1) | 2MNR + 2MN + 6K | (1/8)MNR | T_M + T_A | (4/7)MNR·x1
Weeks et al. [16] (3DW-II) | 2K | 2(K−1) | K³ + 2K² + 4K | MNR | T_M + T_A | (1/7)(K² + 2K + 4)MNR·x1
Das et al. [19] | 6K | 6(K−1) | KMN + (K−2)N + 2K | (1/8)MNR | T_M + T_A | (4/7)MNR·x1
Dai et al. [20] | 24K | 24(K−1) | 2R(K−2)(N + 2) + 8K | (1/8)MNR | T_M + T_A | (1/7)MNR·x1
Proposed | (24/7)KQ·x1 | (24/7)(K−1)Q·x1 | 2(K−1)N·x2 + (2/3)(2K−1)MN·x3 | MN·x4/6 | T_M + 2T_A | MNR/Q

J: maximum number of levels of 3-D DWT decomposition; M: image height; N: image width; R: frame-rate; K: order (length) of the low-pass/high-pass filter; N is a power of 2. x1 = (1 − 2^−3J), x2 = (1 − 2^−J), x3 = (1 − 2^−2J), x4 = (1 − 2^−2J+2).

TABLE II
Comparison of Hardware-Component-Delay Products of the Proposed Structure and the Existing Structures

Structures | MDP | ADP | SWDP | FBDP
Weeks et al. [16] (3DW-I) | 10.28K·MNR·x1·T_A | 10.28(K−1)·MNR·x1·T_A | 3.42(MNR + MN + 3K)·MNR·x1·T_A | (3/14)M²N²R²·x1·T_A
Weeks et al. [16] (3DW-II) | 0.85K·MNR·z·x1·T_A | 0.85(K−1)·MNR·z·x1·T_A | 0.42(K³ + 2K² + 4K)·MNR·z·x1·T_A | (3/7)M²N²R²·z·x1·T_A
Das et al. [19] | 10.28K·MNR·x1·T_A | 10.28(K−1)·MNR·x1·T_A | 1.71(KMN + (K−2)N + 2K)·MNR·x1·T_A | (3/14)M²N²R²·x1·T_A
Dai et al. [20] | 10.28K·MNR·x1·T_A | 10.28(K−1)·MNR·x1·T_A | 0.85(R(K−2)(N + 2) + 4K)·MNR·x1·T_A | (3/14)M²N²R²·x1·T_A
Proposed | 13.71K·MNR·x1·T_A | 13.71(K−1)·MNR·x1·T_A | 8[(K−1)·x2 + (2K−1)M·x3/3]·MN²R·T_A/Q | (2/(3Q))M²N²R·T_A

K: filter length; J: number of 3-D DWT levels; M: image height; N: image width; R: frame-rate. MDP: multiplier-delay product; ADP: adder-delay product; SWDP: storage-word-delay product; FBDP: frame-buffer-delay product. x1 = (1 − 2^−3J), x2 = (1 − 2^−J), x3 = (1 − 2^−2J), z = (K² + 2K + 4). We have assumed T_M = 2T_A.

TABLE III
Comparison of Hardware and Time Complexities of the Proposed and the Existing Structures for Different Video Applications Using Four-Tap Daub-4 Wavelet Filters, J = 2

Structures | Frame-Size and Rate (M × N × R) | Multipliers | Adders | On-Chip Storage (Register Words) | Frame-Buffer (RAM Words) | ACT in Cycles
Das et al. [19] | 176 × 144 × 15 | 24 | 18 | 101 672 | 47 520 | 213 840
Dai et al. [20] | 176 × 144 × 15 | 96 | 72 | 8760 | 47 520 | 53 460
Weeks et al. [16] (3DW-II) | 176 × 144 × 15 | 8 | 6 | 112 | 380 160 | 1 496 880
Proposed (Q = 16) | 176 × 144 × 15 | 216 | 162 | 111 528 | 3168 | 23 760
Proposed (Q = 48) | 176 × 144 × 15 | 648 | 486 | 111 528 | 3168 | 7920
Proposed (Q = 72) | 176 × 144 × 15 | 972 | 729 | 111 528 | 3168 | 5280
Proposed (Q = 144) | 176 × 144 × 15 | 1944 | 1458 | 111 528 | 3168 | 2640
Das et al. [19] | 176 × 144 × 30 | 24 | 18 | 101 672 | 95 040 | 427 680
Dai et al. [20] | 176 × 144 × 30 | 96 | 72 | 17 520 | 95 040 | 106 920
Weeks et al. [16] (3DW-II) | 176 × 144 × 30 | 8 | 6 | 112 | 760 320 | 2 993 760
Proposed (Q = 16) | 176 × 144 × 30 | 216 | 162 | 111 528 | 3168 | 47 520
Proposed (Q = 48) | 176 × 144 × 30 | 648 | 486 | 111 528 | 3168 | 15 840
Proposed (Q = 72) | 176 × 144 × 30 | 972 | 729 | 111 528 | 3168 | 10 560
Proposed (Q = 144) | 176 × 144 × 30 | 1944 | 1458 | 111 528 | 3168 | 5280
Das et al. [19] | 176 × 144 × 60 | 24 | 18 | 101 672 | 190 080 | 855 360
Dai et al. [20] | 176 × 144 × 60 | 96 | 72 | 34 042 | 190 080 | 213 840
Weeks et al. [16] (3DW-II) | 176 × 144 × 60 | 8 | 6 | 112 | 1 520 640 | 5 987 520
Proposed (Q = 16) | 176 × 144 × 60 | 216 | 162 | 111 528 | 3168 | 95 040
Proposed (Q = 48) | 176 × 144 × 60 | 648 | 486 | 111 528 | 3168 | 31 680
Proposed (Q = 72) | 176 × 144 × 60 | 972 | 729 | 111 528 | 3168 | 21 120
Proposed (Q = 144) | 176 × 144 × 60 | 1944 | 1458 | 111 528 | 3168 | 10 560

M: image height, N: image width, R: frame-rate.


Fig. 8. Transistor counts of the proposed structure and the existing structures using Daub-4 filters for frame-size (176 × 144) and frame-rate 15 f/s.

Fig. 9. Transistor counts of the proposed structure and the existing structures using Daub-4 filters for frame-size (176 × 144) and frame-rate 60 f/s.

The on-chip storage and the frame-buffer are assumed to be implemented by D flip-flops and SRAM, respectively. The multiplier, adder, 8-bit register, and 8-bit SRAM word are taken to be 1085, 248, 128, and 48 transistors, respectively [25]. The transistor counts of the proposed structure using Daub-4 filters are estimated accordingly for 2-level decompositions for an input frame-size of (176 × 144) and frame-rates of 15 and 60 f/s (shown in Figs. 8 and 9). It can be observed from these figures that, for 15 f/s, the structure of [20] involves considerably fewer transistors than all the other structures. It is shown that the proposed structures for Q = 16, 48, 72, and 144, respectively, involve 4.17, 4.32, 4.44, and 4.79 times more transistors than that of [20], and offer 2.25, 6.75, 10.125, and 20.25 times more throughput rate. For 60 f/s (Fig. 9), the proposed structures involve 1.08, 1.12, 1.15, and 1.24 times more transistors than the corresponding structure of [20], but offer nearly 2.25, 6.75, 10.125, and 20.25 times more throughput than the other. An interesting feature of the proposed structures is that their throughput rate can easily be scaled by simply increasing the number of processing modules without increasing the on-chip storage and frame-buffer size. The transistor counts of the proposed structures therefore do not increase remarkably with the throughput rate.
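Under these per-component weights, the transistor counts behind Figs. 8 and 9 can be re-derived as a rough sketch; the component figures for the structure of [20] are taken from Table III at 15 f/s.

```python
# Weights from [25]: multiplier 1085, adder 248, 8-bit register 128, 8-bit SRAM word 48.
def transistor_count(multipliers, adders, register_words, sram_words):
    return 1085 * multipliers + 248 * adders + 128 * register_words + 48 * sram_words

proposed_q16 = transistor_count(216, 162, 111528, 3168)   # proposed, Q = 16 (Table III)
dai_15fps    = transistor_count(96, 72, 8760, 47520)      # Dai et al. [20], 15 f/s
print(round(proposed_q16 / dai_15fps, 2))                 # ~4.17, as reported for Fig. 8
```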

We have coded the proposed structure (based on Daub-4 filters) for input block sizes Q = 16, 48, 72, and 144 for the first-level decomposition of input size (176 × 144 × 15) in very high speed integrated circuit hardware description language, and synthesized it using Xilinx ISE 11i. The proposed designs are then implemented on a Virtex-5 field-programmable gate array (FPGA) platform using the 5VLX330 device. The results obtained from the synthesis report are listed in Table IV in terms of the number of slices and the maximum clock frequency in MHz. The designs for different input block sizes have the same cycle period, as expected from the theoretical estimation shown in the comparison Table I. It can be observed from Table IV that the proposed designs for Q = 48, 72, and 144, respectively, involve 1.47, 1.79, and 2.75 times more slices than the design for Q = 16, and offer, respectively, 3, 4.5, and 9 times more throughput rate.

TABLE IV
Synthesis Results of the Proposed Structures for FPGA Device 5VLX330

Proposed Designs | Number of Slices | Max. Clock Frequency
Q = 16 | 29 812 | 582.74 MHz
Q = 48 | 43 898 | 582.74 MHz
Q = 72 | 53 532 | 582.74 MHz
Q = 144 | 82 008 | 582.74 MHz

The increase in the number of slices could be due to differences in the way adders and multipliers are implemented in the FPGA, although we find that the transistor count does not increase remarkably with Q.

    VI. Conclusion

A throughput-scalable parallel and pipeline architecture was proposed for high-throughput computation of multilevel 3-D DWT with 100% hardware utilization efficiency. Each level of 3-D DWT computation was split into three distinct stages, and the computations of all three stages are implemented concurrently in a parallel array of pipelined processing modules. For multilevel DWT computation, we have proposed a cascaded structure where each level of decomposition is performed by a processing unit in a separate pipeline stage. An interesting feature of the proposed structure is that it involves a relatively small frame-buffer compared with the existing structures, and the size of the frame-buffer is independent of the frame-rate. Besides, the size of the on-chip storage and frame-buffer is independent of the input block size. The latency of the proposed structure for multilevel 3-D DWT is O(MN/Q), while that of the existing structures is O(MNR), where (M × N) is the frame-size. Compared with the best of the existing structures, we find that for frame-size (176 × 144) and frame-rates of 15 and 60 f/s, the proposed structure for input block sizes (16, 48, 72, 144), respectively, involves (4.17, 4.32, 4.44, 4.79) and (1.08, 1.12, 1.15, 1.24) times more transistors than [20] and offers (2.25, 6.75, 10.125, 20.25) times more throughput rate. The overall area-delay products of the proposed structure are significantly lower than those of the existing structures, although it involves slightly higher multiplier-delay and adder-delay products, since it involves a significantly smaller frame-buffer and storage-word-delay product. It was found that the throughput rate of the proposed structure can easily be scaled, without increasing the on-chip storage and frame-memory, by using a larger number of processing modules; and for higher frame-rates and higher input block sizes it provides greater advantage over the existing designs. It was also found that the fully parallel implementation of the proposed scalable structure provides the best of its performance. But the computational speed of the fully parallel structure, on the one hand, might be too high for the required application and, on the other hand, it may involve too large a hardware unit. Therefore, when higher processing speed is not required, the structure could be operated by a slower clock, where speed could be traded for power by scaling down the operating voltage [24].


To explore other possibilities of using the proposed high-throughput structure, we have shown that it could be implemented on an FPGA for slow and low-cost implementation. Alternatively, the multiplications and additions of the proposed structure either could be time-multiplexed inside the processing modules, and/or could be implemented by slower (possibly bit-serial or digit-serial) but hardware-efficient multipliers and adders, such that the clock period would increase and the hardware requirement could be reduced. Future work still needs to be carried out to determine how the fully parallel structure could be used for slow and low-cost implementation by appropriate algorithm-architecture co-design to match the real-time processing requirement.

    References

[1] A. Wang, Z. Xiong, P. A. Chou, and S. Mehrotra, "3-D wavelet coding of video with global motion compensation," in Proc. IEEE CS Data Compression Conf., May 1999, p. 404.
[2] G. Minami, Z. Xiong, A. Wang, and S. Mehrotra, "3-D wavelet coding of video with arbitrary regions of support," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 9, pp. 1063–1068, Sep. 2001.
[3] A. M. Baskurt, H. Benoit-Cattin, and C. Odet, "3-D medical image coding method using a separable 3-D wavelet transform," in Proc. SPIE Med. Imaging: Image Display, vol. 2431, Apr. 1995, pp. 173–183.
[4] V. Sanchez, P. Nasiopoulos, and R. Abugharbieh, "Lossless compression of 4D medical images using H.264/AVC," in Proc. IEEE ICASSP, vol. II, May 2006, pp. 1116–1119.
[5] J. Wei, P. Saipetch, R. K. Panwar, D. Chen, and B. K. Ho, "Volumetric image compression by 3-D DWT," in Proc. SPIE Med. Imaging: Image Display, vol. 2431, Apr. 1995, pp. 184–194.
[6] L. Anqiang and L. Jing, "A novel scheme for robust video watermark in the 3-D DWT domain," in Proc. ISDPE, Nov. 2007, pp. 514–516.
[7] A. Grzeszczak, M. K. Manal, S. Panchanathan, and T. H. Yeap, "VLSI implementation of discrete wavelet transform," IEEE Trans. VLSI Syst., vol. 4, no. 4, pp. 421–433, Dec. 1996.
[8] J. C. Limpueco and M. A. Bayoumi, "A VLSI architecture for separable 2-D discrete wavelet transform," J. VLSI Signal Process. Syst., vol. 18, pp. 125–140, Feb. 1998.
[9] C. Chrysafis and A. Ortega, "Line-based, reduced memory, wavelet image compression," IEEE Trans. Image Process., vol. 9, no. 3, pp. 378–389, Mar. 2000.
[10] F. Marino, "Efficient high-speed/low-power pipelined architectures for direct 2-D wavelet transform," IEEE Trans. Circuits Syst. II: Analog Digital Signal Process., vol. 47, no. 12, pp. 1476–1491, Dec. 2000.
[11] F. Marino, "Two fast architectures for direct 2-D wavelet transform," IEEE Trans. Signal Process., vol. 49, no. 6, pp. 1248–1259, Jun. 2001.
[12] C. Yu and S. J. Chen, "VLSI implementation of 2-D wavelet transform for real-time video signal processing," IEEE Trans. Consumer Electron., vol. 43, no. 4, pp. 1270–1279, Nov. 1997.
[13] P.-C. Wu and L.-G. Chen, "An efficient architecture for 2-D discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 4, pp. 536–545, Apr. 2001.
[14] S. B. Pan and R. H. Park, "Systolic array architectures for computation of the discrete wavelet transform," J. Visual Commun. Image Representat., vol. 14, no. 3, pp. 217–231, Sep. 2003.
[15] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, "Generic RAM-based architectures for 2-D discrete wavelet transform with line-based method," IEEE Trans. Circuits Syst. Video Technol., vol. 15, no. 7, pp. 910–920, Jul. 2005.
[16] M. Weeks and M. A. Bayoumi, "3-D discrete wavelet transform architecture," IEEE Trans. Signal Process., vol. 50, no. 8, pp. 2050–2063, Aug. 2002.
[17] M. Weeks and M. A. Bayoumi, "Wavelet transform: Architecture, design and performance issues," J. VLSI Signal Process., vol. 35, no. 2, pp. 155–178, Sep. 2003.
[18] W. Badawy, M. Talley, G. Zhang, M. Weeks, and M. A. Bayoumi, "Low power very large scale integration prototype for 3-D discrete wavelet transform processor with medical application," J. Electron. Imaging, vol. 12, no. 2, pp. 270–277, Apr. 2003.
[19] B. Das and S. Banerjee, "A memory efficient 3-D DWT architecture," in Proc. 16th Int. Conf. VLSI Design, Aug. 2003, p. 208.
[20] Q. Dai, X. Chen, and C. Lin, "A novel VLSI architecture for multidimensional discrete wavelet transform," IEEE Trans. Circuits Syst. Video Technol., vol. 14, no. 8, pp. 1105–1110, Aug. 2004.
[21] ITRS. (2005). International Technology Roadmap for Semiconductors [Online]. Available: http://public.itrs.net/
[22] G. E. Moore, "Cramming more components onto integrated circuits," Proc. IEEE, vol. 86, no. 1, pp. 82–85, Jan. 1998.
[23] G. E. Moore, "Lithography and the future of Moore's law," in Proc. 12th SPIE Adv. Resist Technol. Process., vol. 2438, Jun. 1995, pp. 2–17.
[24] K. K. Parhi, VLSI Digital Signal Processing Systems: Design and Implementation. New York: Wiley, 1999.
[25] N. H. E. Weste and D. Harris, CMOS VLSI Design: A Circuits and Systems Perspective. Boston, MA: Pearson/Addison-Wesley, 2005.
[26] A. Vincent, D. Wang, and L. Zhang. (2006, Dec.). Codec CRC-WVC Outperforms H.264 Video with Wavelets [Online]. Available: http://www.byte.com/documents

Basant K. Mohanty (M'06) received the B.Sc. and M.Sc. degrees in physics from Sambalpur University, Orissa, India, in 1987 and 1989, respectively, and the Ph.D. degree in the field of VLSI for digital signal processing from Berhampur University, Orissa, in 2000.

In 1992, he was selected by the Orissa Public Service Commission to become a Faculty Member with the Department of Physics, SKCG College, Paralakhemundi, Orissa. In 2001, he was a Lecturer with the Department of Electrical and Electronic Engineering, BITS Pilani, Rajasthan, India. Then, he was an Assistant Professor with the Department of Electrical and Computer Engineering, Mody Institute of Science and Technology, Deemed University, Rajasthan. In 2003, he joined the Jaypee Institute of Engineering and Technology, Guna, Madhya Pradesh, India, as an Assistant Professor. He was promoted to Associate Professor in 2005 and Professor in 2007. He has published nearly 25 technical papers. His name appeared in Marquis Who's Who in the World in 1999. His current research interests include the design and implementation of reconfigurable VLSI architectures for resource-constrained digital signal processing applications.

Dr. Mohanty is a lifetime member of the Institution of Electronics and Telecommunication Engineering, New Delhi, India.

Pramod Kumar Meher (SM'03) received the B.Sc. and M.Sc. degrees in physics and the Ph.D. degree in science from Sambalpur University, Sambalpur, India, in 1976, 1978, and 1996, respectively.

He has a wide scientific and technical background covering physics, electronics, and computer engineering. Currently, he is a Senior Scientist with the Institute for Infocomm Research, Singapore. Prior to this assignment, he was a Visiting Faculty Member with the School of Computer Engineering, Nanyang Technological University, Singapore. Previously, he was a Professor of computer applications with Utkal University, Bhubaneswar, India, from 1997 to 2002, a Reader in electronics with Berhampur University, Berhampur, India, from 1993 to 1997, and a Lecturer in physics with various government colleges in India from 1981 to 1993. He has published nearly 150 technical papers in various reputed journals and conference proceedings. His current research interests include the design of dedicated and reconfigurable architectures for computation-intensive algorithms pertaining to signal processing, image processing, communication, and intelligent computing.

Dr. Meher is a Fellow of the Institution of Electronics and Telecommunication Engineers, India, and the Institution of Engineering and Technology, U.K. He is currently serving as an Associate Editor for the IEEE Transactions on Circuits and Systems II: Express Briefs, the IEEE Transactions on Very Large Scale Integration Systems, and the Journal of Circuits, Systems, and Signal Processing. He was the recipient of the Samanta Chandrasekhar Award for Excellence in Research in Engineering and Technology for 1999.