The DARPA Data Transposition Benchmark on a Reconfigurable Computer

S. Akella, D. A. Buell, L. E. Cordova
Department of Computer Science and Engineering
University of South Carolina, Columbia, South Carolina 29208
{akella | buell | cordoval}@cse.sc.edu

J. Hammes
SRC Computers, Inc.
4240 North Nevada Avenue, Colorado Springs, Colorado 80907
[email protected]

MAPLD, September 7-9, 2005


Abstract

The Defense Advanced Research Projects Agency has recently released a set of six discrete mathematics benchmarks that can be used to measure the performance of high productivity computing systems. Benchmark 3 requires transposition, bit by bit in blocks, of a long bit-stream. In this paper we present the design and implementation of this benchmark on a high-performance reconfigurable computer, the SRC Computers SRC-6. We evaluate the performance of this machine by benchmarking our implementation against a standard C-based implementation of the same algorithm on a single-processor Pentium PC.

Index Terms— Reconfigurable Architectures, FPGA, Verilog, C, Matrix Transposition.

1. Introduction

The six DARPA Discrete Mathematics Benchmarks can be used to measure the performance of high productivity computing systems [1]. The benchmarks were written for 64-bit machines and coded in Fortran 77 or C; some are also available in MPI, Shmem, and UPC. DARPA is interested in experimenting with all six algorithms and in performance improvements obtained through novel implementation methods. The six benchmarks are described briefly on the University of South Carolina reconfigurable computing research group's website [2].

In this paper we look at DARPA Benchmark 3, the Data Transposition algorithm. The problem definition for this benchmark is as follows. Let {A_i} be a stream of n-bit integers of length L. Consider each successive block of n integers as an n x n matrix of bits. For each such matrix, transpose the bits so that bit b_ij is interchanged with bit b_ji. The resulting stream {A'_i} has the property that:

A'_1, A'_{n+1}, A'_{2n+1}, ... hold, in sequence, the high-order bits of {A_i};

A'_2, A'_{n+2}, A'_{2n+2}, ... hold, in sequence, the next-to-high-order bits of {A_i};

and so on. Output the stream {A'_i}. The parameters for this benchmark are:

Length of Stream    Bit-width (n)    No. of Iterations
10^7                32               400
10^7                64               230
10^7                1024             12

The Data Transposition benchmark thus provides three sets of sub-benchmarks to be implemented. They vary in the bit-width n of the integers and in the number of iterations for which the input stream is processed; the stream length L is set to 10^7 for all three sub-benchmarks.

2. Previous Research

There has been relatively little previous research on this data transposition problem in which the transposition is done at the bit level; most previous work has assumed that each element of the matrix is a word by itself. Several methodologies such as parallel matrix transpose algorithms [3, 4], mesh implementations [5], and hypercube algorithms [6] have been investigated, but none of them is relevant to what we do here. We therefore implement the algorithm in a way that suits the underlying architecture on which it runs. First we look at the software implementation, and then we discuss the SRC-6 implementation.

3. Software implementation

We first look at a software implementation of this benchmark. The DARPA benchmark "rules of the game" are that we should first build and time the original code, making only the changes necessary for correct execution. The original code was written in Fortran, uses a recursive transposition algorithm, and targets a 64-bit architecture, which makes it difficult to run on the available 32-bit Pentium PC. It also assumes that all input words can be accessed at the same time, which is not suitable for direct porting to the SRC-6 architecture. We therefore implemented the benchmark in C using a nested loop construct, for simplicity and for easier portability to the SRC-6 computer. The C code works on one word of data at a time, which matches the SRC-6 hardware architecture. The loop construct is the same for the 32-bit and 64-bit benchmarks; the code for the 1024-bit benchmark is slightly modified to work on an array of 64-bit words instead of an entire 1024-bit word. The methodology used for the software implementation is as follows:

- Read the bit-stream in words of n bits (32, 64, or 1024 bits, depending on the benchmark).

- There are n n-bit inputs and n n-bit outputs; the outputs are the transposed values of the inputs.

- The n bits of the ith input form the ith bit of each of the n output words.

- The inner loop runs n times, once for each output bit generated: in each iteration, the ith bit of the n-bit input just read is picked and placed in the ith position of the corresponding output.

- The outer loop runs the inner loop n times, once for each output to be generated. (A C sketch of this loop structure is given after this list.)

This setup is slightly different for the 1024-bit benchmark, since the natural word length of the PC is restricted to 64 bits and the natural unit of data transfer to the reconfigurable resources on the SRC-6 is a 64-bit word. We use sixteen 64-bit words to represent one 1024-bit word. Thus, while working on one 1024-bit input word, we actually work on sixteen 64-bit words each time through the inner loop. Similarly, when generating each 1024-bit output word, we generate sixteen 64-bit words which, concatenated together, form the 1024-bit word.
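A minimal C sketch of this nested-loop transposition for the 32-bit case follows. The function and variable names are ours, for illustration only, and are not taken from the benchmark code; bits are indexed most-significant first, matching the benchmark definition.

#include <stdint.h>

/* Transpose one 32x32 bit matrix: bit (31-i) of in[j] becomes bit (31-j) of out[i]. */
void transpose_block_32(const uint32_t in[32], uint32_t out[32])
{
    for (int i = 0; i < 32; i++) {        /* outer loop: one output word per pass */
        uint32_t word = 0;
        for (int j = 0; j < 32; j++) {    /* inner loop: one bit taken from each input */
            uint32_t bit = (in[j] >> (31 - i)) & 1u;
            word |= bit << (31 - j);
        }
        out[i] = word;
    }
}

Transposing a block twice returns the original data, which provides a simple correctness check.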

The code for each benchmark was run on a computer with dual Intel Pentium 4 processors; the timing results collected are presented in Table 1.

4. The SRC-6 reconfigurable computer

The SRC-6, by SRC Computers [7], is one of the few reconfigurable computers commercially available. It allows the programmer to overlook the details of the underlying hardware architecture and focus on the function being implemented. This approach helps decrease the time to solution by making development accessible to software programmers and mathematicians.

4.1. The Hardware Architecture [8]

The SRC-6 system architecture includes Intel® microprocessors running the Linux operating system. The reconfigurable logic resource is referred to by SRC as a MAP®; the MAP® boards normally come in pairs attached to these microprocessors. Each MAP® consists of two Xilinx® FPGA chips and a control processor. Code for the Intel processors is written in standard C or Fortran. Code for the MAP® hardware can be written in MAP C or MAP Fortran and compiled by an SRC-proprietary compiler that targets the MAP® components. Calls to execute on the MAP® are function/subroutine calls from the standard C or Fortran modules.

The MAP® consists of both fixed and reconfigurable hardware. The fixed part includes general control logic, the DMA engine in the control FPGA, and six 4 MB banks of on-board memory referred to as OBMs. The reconfigurable hardware consists of two Xilinx® XC2V6000 FPGA chips referred to as User logic. The architectural block diagram is given in Figure 1.

Figure 1. MAP® interface block diagram [8]

Most computations first require a transfer of input data from the microprocessor memory to the OBMs through the control processor interface. The FPGA user logic then performs computations on the input data by reading it from the OBMs, perhaps writing intermediate or final results back to the OBMs; the final results are transferred from the OBMs back to the microprocessor memory through the control processor interface.

An important point to note, especially when comparing 32-bit and 64-bit computations as in this benchmark, is that the reconfigurable hardware is strongly oriented toward 64-bit words. The OBM is organized as six banks of memory, each read and written in 64-bit words, and DMA transfers data to and from the microprocessor in 64-bit quantities. We would expect, therefore, that a 32-bit benchmark would actually be hampered by the 64-bit orientation of the machine, and that the 1024-bit benchmark would function very nearly like an array of sixteen 64-bit-word transpositions. This is something of an oversimplification but may serve to place the implementations in perspective.

4.2. The Programming Environment

The SRC-6 programming environment involves a traditional software compilation process along with a specialized MAP® compilation process used to compile the code that runs on the MAP® hardware. Code that runs on the Intel® processors is compiled separately. The application code has two main source files, one intended to run on the Intel® processors and one intended to run on the MAP® hardware. The two source files can be written in either C or Fortran. The MAP® source file contains functions that can call user hardware macros executed on the MAP® hardware. The hardware macros can be built-in or user-defined in either the VHDL or Verilog hardware description languages.

The object files from the microprocessor C or Fortran compiler and from the MAP® compiler are linked together to form the final application executable.

5. The SRC-6 Implementation

The SRC-6 environment provides us with the options of programming in C or Fortran. We have chosen to implement all the DARPA benchmarks in C. We have implemented the benchmarks in two ways:

- The transposition operation is implemented in pure C code; we refer to this as the C Map implementation.

- The transposition operation is written in Verilog, and a function call is made to this macro from the MAP® C source file; we refer to this as the Verilog Map implementation.

The SRC-6 implementation differs from the Fortran code in that the Fortran implementation assumes the capability to access multiple words at the same time, whereas the SRC-6 hardware would require additional clock cycles to access multiple words from the same bank. We therefore designed implementations, suited to this underlying architecture, that work on one word, two words, and then four words of data at a time. The two methodologies have different performance results.

We initially look at the basic 32-bit and 64-bit C Map and Verilog Map implementations, then the basic 1024-bit C Map and Verilog Map implementations. Later we scale up the architecture and look at multi-unit parallel implementation and implementations with 128-bit data transfers for all three sub-benchmarks.

6. The C Map Implementation

A pure C implementation was done for each of the 32-bit, 64-bit, and 1024-bit benchmarks. The basic code format is similar for all three: a C main program reads the input bit-stream from an input file, calls a MAP function to perform the transposition, and writes the results to an output file. The input bit stream is stored in one long array in the main program; the data arrangement within the input array is the same for the 32-bit and 64-bit input data. Once the input data is stored in the array, it is transferred from the common memory to the OBM banks through DMA calls. The MAP® has six OBM banks of 524288 words each, each word being 64 bits. The input bit stream of 10 million words is transferred in blocks of 262144 words (half a bank) at a time, since the entire input cannot be loaded into the OBM banks at once.

6.1. 32-bit and 64-bit benchmarks.

The main C program and the map C code for the 64-bit benchmark are presented in Figures 2 and 3.

The code is similar for the 32-bit benchmark. The MAP C constructs are mostly similar to standard C constructs; certain constructs are vendor-specific and are used in place of traditional C constructs for compatibility with the current version of the compiler. In the MAP C code, the vendor-specific macro 'selector_64' is used instead of a large switch statement. A sample is presented below:

selector_64 (i==0, temp0, temp, &temp);

The macro selects the variable temp0 and assigns it to temp when the condition i==0 is true. The same construct can be used with different conditions to represent the different cases of a switch statement. Constructs like this are necessary to ensure that efficient code is generated; although the transposition itself is a relatively simple operation, expressing the bit-level selection in a 64-bit word can be textually tedious to the point of obscuring the underlying algorithm. The data transposition is implemented using shift and bit-wise or operations, as shown in the code of Figure 3.
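In plain C terms, each such call behaves like a guarded select; the call above is equivalent to the following restatement (ours, not SRC's; on the hardware the macro presumably maps to a multiplexer rather than a branch):

temp = (i == 0) ? temp0 : temp;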

void main() {
    // declarations ...

    // Assign values
    for (i = 0; i < m; i++) {
        fscanf(in, "%lld", &temp);
        A[i] = temp;
        E[i] = 0;
    }

    for (j = 0; j < 230; j++) {
        for (k = 0; k < nblocks; k++)
            // assign values in blocks of half the bank capacity
            // call map function
            dt (A, E, m, &time, 0);
    }
}

Figure 2. Version 1 main.c code for the 64-bit C Map implementation


void dt (uint64_t A[], uint64_t E[], int m, int64_t *time, int mapnum)
{
    // ... declarations ...
    // ... DMA transfer of data from CM to OBM ...

    // Transposition operation, done in blocks
    nblocks = datasize/m;
    for (block = 0; block < nblocks; block++) {
        for (i = 0; i < m; i++) {
            l = block*m + i;
            EL[l] = 0;
            for (j = 0; j < m; j++) {
                k = block*m + j;
                temp = (AL[k] & (1ULL << (63-i))) << i;   // pick bit 63-i of input word j
                EL[l] |= temp >> j;                       // place it at bit 63-j of output word i
            }
        }
    }
    // ... DMA transfer of output data from OBM to CM ...
    read_timer(&t2);
    *time = t2 - t1;
}

Figure 3. Version 1 map C code for the 64-bit C Map implementation

6.1.1. Code Improvements. The version 1 code presented in Figure 3 works much like the software implementation: its inner loop generates one output word at a time, and the outer loop runs the inner loop until all the outputs are generated. This code does not exploit the hardware and yields little performance benefit, as shown in the timing results of Table 2. We would like to exploit the inherent parallel nature of the FPGA architecture, so we modify the code to operate on all n output values (for each block, one n x n matrix of data) at the same time.

We use a temporary variable for each of the n output values, which requires additional space on the FPGA. Each time one n-bit input word is read, we obtain one bit of each of the output values: the ith input gives us the ith bit of each output and is thus placed in the ith position of all the temporary variables using shift and or operations. Thus, when we are done reading the nth input value, we have all the transposed values, which we then transfer to the OBM. This modification generates the n output values in n cycles instead of n² cycles, as was originally the case.
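The reworked loop can be sketched in portable C as follows for the 32-bit case; in the actual MAP code the temporaries are unrolled as temp0 ... temp31 rather than held in an array, and the inner loop becomes parallel hardware (in[] and out[] are our illustrative names for the block's input and output words):

for (j = 0; j < 32; j++) out[j] = 0;
for (i = 0; i < 32; i++)                    /* one input word per cycle */
    for (j = 0; j < 32; j++)                /* fully unrolled on the FPGA */
        out[j] = (out[j] << 1) | ((in[i] >> (31 - j)) & 1u);

After reading input word i, every output word has accumulated its bit from that word at the correct position.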

This modification works quite well for the 32-bit benchmark, but with the 64-bit benchmark we run into synthesis problems: we run out of FPGA resources. This comes from the increase in the number of temporary variables (32 to 64) and a corresponding increase in the number of shift and or operators. To reduce the resource usage we must modify the shift operations in the code. The original code uses variable-distance shifts and thus generates shift operators of different sizes. It can be modified to perform shifts of constant distances, which take little or no logic to implement. A code sample is provided below:

temp17 |= ((AL[l] & (1ULL<<46)) << 17) >> i;

The statement can be replaced by the following statement that uses constant distance shifts.

temp17 = (temp17 << 1) | (1 & (AL[l]>>46));

The original functionality is retained (over the full inner loop, both forms accumulate bit 46 of the input word read in iteration i at bit position 63-i of temp17), but the resource usage is drastically reduced, especially because we make 64 such modifications for the 64-bit transposition operation. The modified code, version 2, is presented in Figure 4. This code synthesizes using resources well below the total available on the XC2V6000 chip.

Finally, we use parallel sections, via compiler pragmas, to overlap the DMA transfer of input data with computation on data that was transferred to the OBMs in a previous step. This provides great performance benefits, as was shown in previous work on the DARPA Boolean equation benchmark [9], by permitting full overlap of data movement with computation. The code for the 32-bit and 64-bit benchmarks was run on the SRC-6 computer, and the timing results are presented in Table 3.

The 1024-bit benchmark requires either a multi-unit implementation, in which multiple 64-bit transposition units together perform the 1024-bit transposition, or calling one 64-bit unit multiple times to transpose the 1024-bit matrix. We first look at the Verilog Map implementation of the 32-bit and 64-bit benchmarks before discussing the 1024-bit implementation for both the C Map and Verilog Map versions.


void dt (uint64_t A[], uint64_t E[], int m, int icpl, int loops, int64_t *time, int mapnum)
{
    // ... declarations ...
    // ... initial DMA transfer ...
    #pragma src parallel sections
    {
        #pragma src section   // data transposition operation
        {
            // ... declarations ...
            for (block1 = 0; block1 < limit; block1++) {
                for (i = 0; i < mm; i++) {
                    // ...
                    temp0  = (temp0  << 1) | (1 & (AL[l] >> 63));
                    // ...
                    temp63 = (temp63 << 1) | (1 & AL[l]);
                }
                for (i = 0; i < mm; i++) {
                    k = block1*mm + i + ioff;
                    temp = temp63;
                    selector_64 (i==0, temp0, temp, &temp);
                    // ...
                    selector_64 (i==63, temp63, temp, &temp);
                    EL[k] = temp;
                }
            }
        } // end of data transposition parallel section
        #pragma src section   // DMA transfer
        {
            // ... declarations ...
            if (block < loops) {
                inputarraysub += icpl;
                joff = BLOCKSIZE - ioff;
                // ... parallel DMA transfer of data ...
            }
        }
    }
} // end of block for loop

Figure 4. Final version map C code for the 64-bit C Map implementation

6.2. The Verilog Map implementation

As mentioned earlier, the SRC-6 programming environment allows us to incorporate user-defined hardware macros written in either VHDL or Verilog. The C Map implementation provides a good speedup over a standard processor, but we wanted to explore designing the application in a hardware description language, which gives a highly customized implementation. This is similar to the traditional software practice of rewriting critical functions in assembly language for better performance. Hardware description languages let us operate at the bit level instead of at the word level as in C, an advantage that can be exploited in the transposition algorithm, which mainly requires bit manipulation. We therefore model the transposition algorithm in Verilog and implement it on the SRC-6 platform. An additional benefit of this second implementation is that it lets us measure the quality of the code synthesized from C by the SRC MAP C compiler.

The basic idea and setup of the Verilog design are, for all three benchmarks, similar to those of the MAP C code. The basic hardware architecture of a 64-bit transposition module is given in Figure 5.

[Figure 5: 64 parallel dt_op units, one per input bit A[0] through A[63], each holding a 64-bit register temp[63:0].]

Figure 5. Basic architecture of the 64-bit Verilog transposition unit

The module consists of 64 units (dt_op) operating in parallel, each working on one of the 64 output values. Each unit holds a 64-bit shift register, performs a 64-bit or, and executes the following operation:

temp <= (temp << 1) | ({63'b0, A});
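For comparison, the equivalent word-level update in C would be something like the following (our restatement, with the input bit held in the low bit of a 64-bit variable A; an addition would serve equally in place of the or):

temp = (temp << 1) | (A & 1ULL);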

The Verilog form differs from the C version in that we operate on a single bit of the input, held in the one-bit signal A. Also, the concatenation of A with the word 63'b0 (a 63-bit word of value zero) is not possible in C as a single bit-level operation as in the code above; in C it would need shift and addition operations on full words. A setup similar to that of the 64-bit benchmark is used for the 32-bit benchmark, with the inputs and outputs of the macro being 32-bit words. The 32-bit and 64-bit benchmarks using Verilog macros were implemented on the SRC-6, and the timing results we collected are presented in Table 4.

6.3. 1024-bit benchmark

The 1024-bit benchmark differs from the previous two in that it operates on a "word size" that is 16 times the size of the word that can be stored within the OBMs. This requires us to store each 1024-bit word in 16 locations of one OBM. We must also use multiple code units, each performing a 64-bit transposition, in order to perform the entire 1024-bit transposition.

Figure 6. Breaking up of a 4x4 matrix into 2x2 sub-matrices

Transposing a 1024x1024 matrix can be broken down into the transposition of 256 64x64 sub-matrices, each handled by a 64-bit transposition module. The resulting transposed words can then be appropriately concatenated to obtain the transposed 1024-bit words. The idea behind this method is illustrated by the 4x4 example of Figure 6, which is broken down into transpositions of 2x2 matrices.

In Figure 6, the 4x4 matrix has been broken into four 2x2 sub-matrices that we number 1, 2, 3, and 4 going anti-clockwise. The transposed sub-matrices give us 2-bit transposed values that can be appropriately concatenated to give the final 4-bit transposed values. Figure 7 shows the final transposed 4x4 matrix.

Figure 7. The transposed 4x4 matrix

If we number the transposes of sub-matrices 1, 2, 3, 4 as a, b, c, d, we see in Figure 7 that these transposed sub-matrices appear in the order a, c, b, d going anti-clockwise. That is, the values of sub-matrix a (sub-matrix 1 transposed) concatenated with those of sub-matrix c (sub-matrix 3 transposed) form the first two 4-bit transposed values. The next two 4-bit transposed values are obtained by concatenating the values of sub-matrix b (sub-matrix 2 transposed) with those of sub-matrix d (sub-matrix 4 transposed). This is a simple idea that is relatively obvious on examining the transposition operation.
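The same decomposition can be sketched in C for a square bit matrix of n*64 rows, each row stored as n 64-bit words (our illustration, not the MAP code; transpose64 is a 64x64 bit-transpose analogous to the 32-bit sketch given earlier):

#include <stdint.h>

/* Transpose one 64x64 bit sub-block: bit (63-r) of in[c] becomes bit (63-c) of out[r]. */
static void transpose64(const uint64_t in[64], uint64_t out[64])
{
    for (int r = 0; r < 64; r++) {
        uint64_t w = 0;
        for (int c = 0; c < 64; c++)
            w |= ((in[c] >> (63 - r)) & 1ULL) << (63 - c);
        out[r] = w;
    }
}

/* Blocked transpose: sub-block (I,J) is transposed and written to position (J,I). */
void transpose_blocked(const uint64_t *in, uint64_t *out, int n)
{
    uint64_t sub[64], subT[64];
    for (int I = 0; I < n; I++)
        for (int J = 0; J < n; J++) {
            for (int r = 0; r < 64; r++)      /* gather sub-block (I,J) */
                sub[r] = in[(I*64 + r) * n + J];
            transpose64(sub, subT);
            for (int r = 0; r < 64; r++)      /* scatter to block (J,I) */
                out[(J*64 + r) * n + I] = subT[r];
        }
}

With n = 16 this is exactly the 1024x1024 case: 256 sub-block transpositions whose results the index arithmetic concatenates into the transposed 1024-bit words.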

Now that we have decided to break the design into 64-bit units, we face the problem of resource usage and availability on the FPGA. The 64-bit transposition C Map implementation uses about 61% of the FPGA slices, including the communication overhead. The Verilog Map implementation takes 2% of the slices for the transposition unit and about 28% for the infrastructure to read and write the OBMs plus some additional control logic for the MAP® C overhead; the 64-bit Verilog unit also uses around 4096 register bits, about 6% of the register bits available on the FPGA. The entire 1024x1024 matrix transposition would require 256 64-bit units, so it is clearly impossible to implement on one FPGA chip. We considered some options for using multiple 64-bit units, but a four-unit design for the C Map implementation does not synthesize, running out of resources at 91,012 slices, 269% of the available 33,792. Moreover, we realized through experimentation that the four units do not actually run in parallel but execute sequentially one after another: the SRC MAP® compiler pipelines only innermost loops, and only data-independent statements within those loops execute in parallel. The four-unit design has four inner loops within one outer loop; the outer loop is not pipelined, so the four inner loops execute sequentially, eliminating the advantage of having four units. We might as well have one unit called 256 times to transpose the 1024x1024 matrix. That implementation in MAP C synthesizes well, using resources well below those available.

The idea is to use one 64-bit unit and have it work on the 64-bit parts of the 1024-bit input words one part at a time (each 1024-bit word breaks down into sixteen 64-bit words). Since each unit works on a 64x64 sub-matrix, one unit working on the first 64 rows (of the 1024x1024 matrix) sixteen times transposes those 64 rows. Thus, in 64x16 = 1024 cycles (each 64-bit unit takes 64 cycles per 64x64 transposition) we transpose 64 rows of the 1024x1024 matrix, and in 16x1024 = 16384 cycles we transpose all 1024 rows. This scheme produces all the final transposed values only at the end, and it produces them out of order: the first 64 rows give us the first 64 bits of each of the 1024 outputs, the next 64 rows give the second set of 64 bits, and so on. Because the 64-bit parts of each output are not generated in sequence, we would have to transpose the whole 1024x1024-bit matrix before we could transfer the final 1024 transposed values to memory. This requires substantial temporary storage on the FPGA (1024x1024 bits, to be exact).


If we work on columns instead of rows, we can generate 64 1024-bit transposed values every 1024 cycles. This "columnar" scheme works on the first 64 columns of the 1024x1024 matrix: the 64-bit unit generates the transposed values of the 64 rows within the first 64 columns, so in 1024 cycles we have the first 64 transposed values of the matrix. The correctness of this operation is evident from the 4x4 example shown above, where the first transposed values are a concatenation of the transposed values of sub-matrices 1 and 3. Thus, if we have 2-bit transposition units working on the first 2 columns of the 4x4 matrix, we get the sub-parts of the transposed values of the first 2 columns, which we concatenate appropriately to obtain the first 2 transposed values of the 4x4 matrix. Similarly, we generate the first 64 transposed values of our 1024x1024 matrix by working on the first 64 columns. The entire transposition still takes 16384 cycles, but the values are generated and transferred without having to be stored on the FPGA, saving time compared to transferring all 1024 values at the end.

Our analysis has so far not accounted for the fact that the sixteen 64-bit parts of each 1024-bit word cannot all be transferred to the FPGA at once. The 16 values must be stored across the available six banks, so we cannot later read all 16 at the same time; obtaining one 1024-bit value therefore takes multiple cycles and adds latency to the design. If we stored all 16 values in the same bank, we would need 16 cycles to read one full 1024-bit value, and thus 16384 cycles to read the 1024 values. The time required to move the output values from the FPGA onto the OBM is likewise 16384 cycles.

The MAP C code for the 1024-bit Verilog Map design is presented in Figure 8. The C Map implementation's code structure is similar, except that it uses C shift and and operations instead of a Verilog macro to perform the transposition. The computation calls occurring within a loop are pipelined, with one computation taking one clock cycle, and the data transfer is overlapped with the computation, which allows a transfer rate of one word per cycle. Loop-carried scalar dependencies that arise in certain sections of code are avoided by replacing them with vendor-specific constructs such as cg_accum_add_32, which replaces a conditional increment statement within a pipelined for loop.

The C Map and Verilog Map implementations of the 1024-bit benchmark, each employing one 64-bit transposition unit, were run on the SRC-6 reconfigurable computer; the timing results collected are shown in Tables 3 and 4 respectively.

void dt (uint64_t A[], uint64_t E[], ... , int mapnum)
{
    // ... declarations ...
    // ... initial DMA transfer ...
    for (block = 0; block < loops; block++) {
        // parallel sections for transposition and DMA transfer
        #pragma src parallel sections
        {
            #pragma src section
            {
                for (block1 = 0; block1 < limit/2; block1++) {
                    // ...
                    for (col = 0; col < 16; col++) {
                        // ...
                        for (part = 0; part < 16; part++) {
                            // ...
                            for (j0 = 0; j0 < bw*2; j0 += 1) {
                                k0 = block1*16384 + j0*16 + part*1024 + ioff + col;
                                dt_op (AL[k0], i0, j0, j0 == 0, &temp0);
                                cg_accum_add_32 (1, j0>64, 0, j0 == 0, &i0);
                                l0 = block1*16384 + part + col*1024 + i0*16 + ioff;
                                EL[l0] = temp0;
                            }
                        }
                    }
                }
            } // end computation section
            #pragma src section   // DMA transfer
            {
                // ... DMA transfer of data in parallel with the computation ...
            }
        }
    } // end of block for loop
}

Figure 8. Map C code for the 1024-bit Verilog Map implementation

6.4. Parallel 3-unit implementation

Having looked at the basic implementations and obtained better performance with the Verilog Map designs for all three benchmarks, we now scale up the architecture to multi-unit parallel Verilog Map implementations. Multi-unit designs are feasible for Verilog Map because the Verilog macros do not use many resources and fit well in the available FPGA space.

The parallel implementation has 3 transposition units driven by one Verilog macro call; the macro instantiates the three units so that they execute concurrently. The data is transferred to 3 OBMs one after another, and each of the 3 parallel units takes its input from one of the three input memory banks and writes its transposed values to one of the three remaining output memory banks. The code structure is similar for all three benchmarks. The data transfer speed of one word per cycle remains the same, so the performance improvement is obtained only in computation, which is three times faster than in the one-unit implementations. The one-unit implementations had an effective computational throughput of three cycles per word, set by the MAP compiler through the pipeline depth and the cycle time for code-block transitions; with the one-word-per-cycle data transfer, this gave a total throughput of four cycles per word. With three units working in parallel we obtain a computational throughput of one cycle per word while the one-word-per-cycle data transfer is maintained, for a total throughput of two cycles per word and thus theoretically twice the speedup over the one-unit implementations. In practice, the pipeline depth and other factors affect the throughput of the design. The parallel 3-unit SRC-6 implementation results are presented in Table 5.

6.5. 128-bit Data Transfer

The SRC-6 is capable of 128-bit data transfers between the common memory and the OBMs, with one 64-bit word transferred on the positive clock edge and the other on the negative edge to two adjacent memory banks. This lets us transfer two 64-bit words in one cycle; the earlier implementations used 64-bit transfers and so did not exploit the full bandwidth. The 32-bit benchmark can pack two words into one 64-bit word and thus transfer four words per cycle with a 128-bit DMA transfer, while the 64-bit and 1024-bit benchmark designs can transfer two 64-bit words per cycle. Two-word and four-word per cycle transfers require us to modify how the transposition is performed in the Verilog macro. For the 64-bit benchmark, the two input words are read from the two banks concurrently and operated on using two-bit shifts instead of the one-bit shifts of the previous implementations. Similarly, for the 32-bit benchmark, the four 32-bit input words carried in two 64-bit words are read from the two banks concurrently and operated on using four-bit shifts. The 1024-bit benchmark is slightly different, since the two words transferred are inputs from different blocks of data that must be processed separately. In this case we employ two 64-bit transposition units that work on alternate columns of the matrix: each column represents 16 blocks of 64x64 matrices, and since there are 16 such columns in a 1024x1024 matrix, each of the two units works on 8 columns, 16 times, to generate the final transposed values.

The multi-bit shifts for the 32-bit and 64-bit benchmarks and the two-unit processing for the 1024-bit benchmark let us work on multiple words concurrently, providing speedup in both computation and data transfer. We work on four words at a time for the 32-bit benchmark and two words at a time for the 64-bit and 1024-bit benchmarks. Compared to the basic one-unit, one-word-per-cycle implementation, we therefore theoretically expect twice the speedup for the 64-bit design and four times the speedup for the 32-bit benchmark; a sketch of the multi-bit-shift update follows this paragraph. In practice, the multi-bit shifts affect the designs' pipeline depths and thus their overall throughputs. The 128-bit transfer SRC-6 Verilog implementation results are presented in Table 6.
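For the 64-bit case, the two-bit-shift update can be pictured as follows (our notation, not the shipped macro code: a and b are the bits extracted, in the same cycle, from the two words read concurrently from adjacent banks):

temp = (temp << 2) | (a << 1) | b;

The four-bit variant for the 32-bit benchmark shifts by 4 and ors in four extracted bits per cycle.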

6.6. Parallel 2-unit with 128-bit Data Transfer

The 128-bit data transfer implementations are scaled to a parallel implementation of two units working on two different streams of data. These implementations use two DMAs instead of one, to four OBMs, with each DMA transferring 128 bits to two adjacent OBMs. The 32-bit and 64-bit benchmarks have two units working in parallel, each reading 128 bits from two banks; the 32-bit transposition units perform four-bit shifts and the 64-bit units perform two-bit shifts, as in the original 128-bit transfer implementation. The 1024-bit benchmark has to operate on the four words separately and thus employs four 64-bit units that read concurrently from the four banks. The 32-bit design effectively works on eight 32-bit words at a time, and the 64-bit and 1024-bit designs work on four 64-bit words at a time. Since we use four of the MAP®'s six OBMs for reading the input, only two OBMs are left for writing the output. Each unit generates four output words per cycle, which must be written two words per OBM; the two writes to each OBM take two cycles and thus add a two-cycle latency to the pipelined loop instead of one. This affects the overall throughput of the design. The 1024-bit benchmark works on four words at a time and thus computes twice as fast, but the two-cycle output latency leaves it with the same maximum throughput as the original 128-bit transfer implementation. The implementation results are presented in Table 7.

7. Results and Analysis

The timing results for the software and SRC-6 implementations for both C Map and Verilog Map are presented in Tables 1, 2, 3, and 4.

Table 2 presents the timing results for the C Map implementation using a nested loop structure without parallel sections. We obtained no performance benefit; indeed, both the 32-bit and 64-bit benchmarks show degradation compared to the software implementation timings of Table 1. We did not implement the 1024-bit benchmark this way, since it requires multiple 64-bit units and would thus show even lower performance.

The speedup numbers are tabulated in Table 8 below for all three benchmarks and the five implementations: C Map (A), Verilog Map (B), parallel 3-unit Verilog Map (C), 128-bit data transfer Verilog Map (D), and parallel 2-unit 128-bit transfer Verilog Map (E).

Table 1. Timing results for the software C implementation

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 3767           9.41
64-bit      230                 3270           14.21
1024-bit    12                  2004           187.66

Table 2. Timing results for the C Map implementation with a nested loop construct

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 6171           15.42
64-bit      230                 5132           22.31

Table 3. Timing results for the final C Map implementation

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 244            0.61
64-bit      230                 129            0.56
1024-bit    12                  97             8.08

Table 4. Timing results for the Verilog Map implementation

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 179            0.44
64-bit      230                 98             0.42
1024-bit    12                  72.7           6.05

Table 5. Timing results for the parallel 3-unit Verilog Map implementation

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 95             0.23
64-bit      230                 60             0.26
1024-bit    12                  44             3.66

Table 6. Timing results for the 128-bit transfer Verilog Map implementation

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 82             0.20
64-bit      230                 63             0.27
1024-bit    12                  30             2.5

Table 7. Timing results for the parallel 2-unit, 128-bit transfer Verilog Map implementation

Benchmark   No. of iterations   Time in secs   Time per iteration
32-bit      400                 55             0.13
64-bit      230                 54             0.23
1024-bit    12                  30             2.5

Table 8. Speedup of the SRC-6 implementations A-E over the software implementation

Benchmark   A    B    C    D    E
32-bit      15   21   41   46   68
64-bit      25   33   55   52   61
1024-bit    23   31   51   75   75

Table 8 shows that the C Map implementation of the 32-bit benchmark is about 15 times faster than the software implementation, and the C Map implementation of the 64-bit benchmark is 25 times faster. The C Map 1024-bit benchmark implementation shows a speedup of about 23 times over the software implementation.

The speedup over the initial implementation is easily seen for all the benchmarks, indicating the benefit of a parallel design in which the transpositions for all the output words of a block of data are conducted concurrently rather than sequentially.

The Verilog Map implementations provide an even greater improvement and speedup compared to the C Map implementations; the advantage of working at the bit level yields better performance. The results in Table 8 indicate that the Verilog implementations of the 32-bit and 64-bit benchmarks are about 21 and 34 times faster, respectively, than the software implementation. The Verilog implementation of the 1024-bit benchmark shows a 31-times speedup over the software C implementation, indicating the advantage of hardware design units for the 1024-bit transposition.

The Verilog Map implementations outperform the C Map implementations basically because in Verilog the operations are performed at the bit level rather than the word level, making the transposition operations faster.

The gains are significant for the parallel 3-unit implementations and for the 128-bit data transfer implementations. The parallel 3-unit design exploits the available FPGA space, implementing three units that work on input data concurrently, and thereby speeds up the computation. The data transfer, however, does not use the full bandwidth; it is done at one word per cycle. The number of parallel units could be increased, but it is restricted by the number of memory banks from which data can be read concurrently without overlap. The SRC-6 has six OBMs, so we use three for input and three for output; with more parallel units, some would have to read from the same bank, adding cycles to the pipeline. Employing three units, we achieve speedups of 41 times for the 32-bit benchmark, 55 times for the 64-bit, and about 51 times for the 1024-bit.

The 128-bit data transfer utilizes the full bandwidth between the common memory and the OBMs, transferring two words per cycle and obtaining twice the transfer speed of the original implementations; the computational units also perform the transposition on two words concurrently to maintain the two-word-per-cycle throughput. These implementations obtain speedups of 46 times for the 32-bit, 52 times for the 64-bit, and 75 times for the 1024-bit benchmark. Finally, the parallel 2-unit implementations with 128-bit data transfer exploit both the data-transfer speedup and the computational speedup by having two units instead of one, working on four words of input rather than two. The throughput, however, is limited by having to write four words of output into two banks of memory, overlapping two writes per bank. Even so, we still achieve considerable speedup for the 32-bit and 64-bit benchmarks, with the 1024-bit benchmark achieving the same speedup as the one-unit 128-bit transfer implementation: the 32-bit benchmark obtains a 68-times speedup, the 64-bit a 61-times speedup, and the 1024-bit a 75-times speedup. The throughputs are mostly governed by the pipeline depths and the code-block transition timings that the compiler sets based on the design. We have not yet found a direct way of measuring these parameters except by inspecting the dataflow graph or the Verilog code generated during compilation; it is thus mostly a post-compilation process and a bit tedious.

8. Conclusions

We have implemented the Data Transposition benchmark on the SRC-6 platform and compared its performance against a software implementation. We have looked at multiple architectures, with emphasis on exploiting the inherent parallelism that FPGAs provide. With the basic implementations, the SRC-6 reconfigurable computer obtains a 34-times speedup on the 64-bit benchmark and a 31-times speedup on the 1024-bit benchmark compared to the software implementations. The higher-end architectures that exploit parallelism as well as the 128-bit data transfer capability perform better still, with the 32-bit, 64-bit, and 1024-bit benchmarks achieving 68-times, 61-times, and 75-times speedups over the software implementation respectively. The benefit of implementing on a reconfigurable platform with parallel architectures is clear from the improvement in performance obtained. The SRC-6 reconfigurable platform thus provides the means to implement high-data-bandwidth, computationally complex algorithms and obtain considerable performance improvements.

The parallelism has been exploited to a certain degree but is limited by the SRC-6 memory architecture: the six banks constrain how many parallel units can operate on the FPGA concurrently. The other important factor is, of course, how many units we can fit on the FPGA. Table 9 gives the resource usage of the different benchmarks for the different implementations. The numbers are overall slice usage, including the MAP communication overhead; a certain amount of communication overhead from the reads and writes between the FPGA and the OBMs is unavoidable.

Table 9. Resource usage of the SRC-6 implementations

Benchmark   Slices used*
            B        C        D        E
32-bit      7,024    10,155   7,312    8,875
64-bit      9,507    18,007   9,588    13,672
1024-bit    7,599    16,051   10,852   19,946

*The Virtex-II 6000 has a total of 33,792 slices.

The 3-unit parallel implementations for the 32-bit, 64-bit, and 1024-bit benchmarks use about 30%, 53%, and 47% of the slices respectively. The parallel 2-unit 128-bit transfer implementations, which employ two units for the 32-bit and 64-bit benchmarks and four 64-bit units for the 1024-bit benchmark, use 26%, 40%, and 59% of the slices respectively. This implies that more parallel units could be implemented on the chip; however, this would lead to bank conflicts, with multiple units reading from the same OBM.

If we look beyond the SRC-6 system architecture and assume more than six OBMs with smaller capacities, we could have more units reading from memory banks in parallel. The number of memory banks required depends on how many parallel units we can implement in the available FPGA space. The slice counts we obtained suggest that we could implement around ten units for the 32-bit benchmark and around five or six units for the 64-bit and 1024-bit benchmarks; these estimates include the communication overhead of the memory interface. This overhead is compiler-specific: we have observed that a 64-bit transposition unit by itself uses around 3% of the LUTs and about 6% of the register bits on the FPGA, so by rough calculation the FPGA could easily fit around ten 64-bit units. Even if ten units fit, however, we would probably run out of I/O pins for interfacing with ten different memory banks. The Xilinx Virtex-II 6000 chip on the SRC-6 MAP® board has a total of 1,104 pins. The six memory banks use a considerable number of these pins; some pins are used for inter-chip communication between the two FPGA chips on the MAP, some for communication with the control processor, and some for the chain ports between the two MAPs. In this architecture, then, we would probably run out of pins on the FPGA to build an appropriate interface.

The Data Transposition problem allows a high degree of parallelism, in which multiple units work on small blocks of data with no data dependency. This feature pushes us to consider implementing the algorithm on highly parallel, distributed-memory architectures. We believe a system of multiple FPGAs, each connected to a couple of local memory banks, would be an appropriate architecture for this application. The FPGAs need only be large enough to fit one unit, including the communication overhead for memory access; each FPGA would hold one transposition unit, with its local memory banks holding the input and output data. Many such processing elements working in parallel would achieve very high performance. This architecture would, however, be specific to this application, or to applications with a high degree of parallelism in which each processing element is not very large.

We would like to explore such architectures for implementing highly parallel, data-independent algorithms, looking away from traditional von Neumann-style architectures in which memory is one big space rather than several smaller distributed units.

9. References

[1] Defense Advanced Research Projects Agency, High productivity computing systems discrete mathematics benchmarks, 2003.

[2] University of South Carolina reconfigurable computing research group website, http://www.cse.sc.edu/~buell/Public_Data/DARPA_HPCS/DARPA_discrete_math.html.

[3] J. Choi, J. J. Dongarra, D. Walker, “Parallel matrix transpose algorithms on distributed memory concurrent computers,” Oak Ridge National Laboratory, October 1993.

[4] S. Krishnamoorthy, G. Baumgartner, D. Cociorva, C. Lam, P. Sadayappan, “Efficient parallel out-of-core matrix transposition,” Proceedings of the IEEE International Conference on Cluster Computing (Cluster’03), December 2003, IEEE Computer Society Press, pp. 300-307.

[5] M. Kauffmann, U. Meyer, J. F. Sibeyn, “Matrix transpose on meshes: Theory and practice,” 11th International Parallel Processing Symposium, 1997.

[6] A. Edelman, "Optimal matrix transposition and bit reversal on hypercubes: All to all personalized communication," Journal of Parallel and Distributed Computing, vol. 11, pp. 328-331, 1991.

[7] SRC Computers, Inc. www.srccomp.com

[8] The SRC-6 MAP® Hardware Guide, Version 1.8. SRC Computers, Inc.

[9] D. A. Buell, S. Akella, J. P. Davis, G. Quan, "The DARPA Boolean equation benchmark on a reconfigurable computer," Proceedings of the Military and Aerospace Programmable Logic Devices (MAPLD) Conference, 2004.
