## The Fastest Filter in the West

The DSP arena is a technological battleground that is zealously guarded by its incumbents (T.I., Motorola, Analog Devices, AT&T, etc.). Any newcomer will certainly be challenged and, to survive, will have to answer to a higher standard. In most instances that standard is speed, and here distributed arithmetic has something to offer. Our candidate for this speed challenge is the symmetrical FIR filter. Specifically, our champion will be a programmable 8-tap filter with 8 bits of both coefficient and data values. It is programmable in the sense that its gate resources can be configured to do other tasks. Our adversary is the fixed-point "DSP" chip which, in single precision, processes 16-bit words. This may seem like an unfair match, but there are many applications where 8 bits is sufficient. In fact, there are dedicated 8-bit, 8-tap FIR filter chips such as the Harris HSP43881 that offer sample rates up to 30 MHz. Let us see how the Xilinx FPGA can compete with this device.

We start with the section of the Distributed Arithmetic Tutorial entitled "The Ultimate in Speed." The 8-tap FIR filter response is computed from 8 consecutive samples of a single data stream, and these 8 samples serve as the input variables for eqn. 6, which defines the "ultimate-in-speed" processing. Here the equation is reduced to 8 sums:

$$y = -[sum0] + [sum1]2^{-1} + \{[sum2] + [sum3]2^{-1}\}2^{-2} + \{[sum4] + [sum5]2^{-1} + \{[sum6] + [sum7]2^{-1}\}2^{-2}\}2^{-4}$$
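The nesting in this equation is what lets the adder network be built as a tree: each inner group is scaled once, rather than each sum being scaled individually. As a quick check, the following Python sketch compares the nested form against the flat weighting, where each sum b is scaled by 2<sup>-b</sup> and the sign-bit term is negated. The function names are illustrative and do not come from the tutorial.

```python
from fractions import Fraction

def flat_sum(s):
    """Weight s[b] by 2**-b for b = 1..7 and subtract the sign-bit term s[0]."""
    return -s[0] + sum(Fraction(v, 2 ** b) for b, v in enumerate(s) if b > 0)

def nested_sum(s):
    """Same result, grouped as in the equation above (tree-shaped scaling)."""
    half, quarter, sixteenth = Fraction(1, 2), Fraction(1, 4), Fraction(1, 16)
    upper = -s[0] + s[1] * half + (s[2] + s[3] * half) * quarter
    lower = (s[4] + s[5] * half + (s[6] + s[7] * half) * quarter) * sixteenth
    return upper + lower

# Arbitrary DALUT outputs, chosen only for illustration.
sums = [3, -5, 7, 2, -1, 4, 0, 6]
assert nested_sum(sums) == flat_sum(sums)
```

Exact fractions are used so that the comparison is not affected by floating-point rounding.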

It is now possible to translate this equation into functional blocks as was done in fig. 2. Since these are standard blocks, they can be readily mapped into CLBs. Is the design for all intents and purposes complete? Hardly. First, with A=B=K=8 and a DALUT bit count of $2^{K} \times A \times B$, or 16,384 bits, the number of CLBs required is 16384/32, or 512. This is half the capacity of the largest Xilinx FPGA, the 4025. To this must be added the memory address decode circuits. Furthermore, the memory output multiplexing circuits are considerable. Any attempt at relief through the use of tri-state buses may introduce speed-limiting interconnect delays. A very poor start, indeed.
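The sizing arithmetic above can be written as a small helper. This is only a sketch: the function name is made up for illustration, and the 32 bits per CLB is taken from the division used in the text.

```python
def dalut_size(k_inputs, coeff_bits, data_bits, bits_per_clb=32):
    """Return (total DALUT bits, CLBs needed) for 2**K x A x B bits of lookup table."""
    bits = (2 ** k_inputs) * coeff_bits * data_bits
    clbs = -(-bits // bits_per_clb)  # ceiling division
    return bits, clbs

print(dalut_size(8, 8, 8))  # -> (16384, 512)
```

Because the bit count grows as 2<sup>K</sup>, reducing the number of DALUT address inputs K has far more effect than trimming the word widths A or B.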

Fortunately, the symmetry of the FIR filter can be exploited to great advantage. This is a claim that none of the DSP contenders can make. The FIR coefficients, or tap weights, are symmetric (or antisymmetric) about the mid-point. Thus $A_1=A_8$, $A_2=A_7$, $A_3=A_6$, and $A_4=A_5$ for the symmetric filter, and $A_1=-A_8$ etc. for the antisymmetric filter. Pursuing the symmetric filter, the DALUT content for data bit b is:

$$[sumb] = x_{1b} \bullet A_1 + x_{2b} \bullet A_2 + x_{3b} \bullet A_3 + x_{4b} \bullet A_4 + x_{5b} \bullet A_5 + x_{6b} \bullet A_6 + x_{7b} \bullet A_7 + x_{8b} \bullet A_8$$

and we are now able to halve the number of coefficients. Thus

$$[sumb] = (x_{1b} + x_{8b}) \bullet A_1 + (x_{2b} + x_{7b}) \bullet A_2 + (x_{3b} + x_{6b}) \bullet A_3 + (x_{4b} + x_{5b}) \bullet A_4$$

The sums within the parentheses represent output bit b of the parallel adders that sum the data appearing at the symmetric filter taps. The carry resulting from this addition adds one more bit to the data word; thus B = 9. With the sum within each pair of parentheses representing a DALUT address bit, the number of input variables, K, has effectively been halved. Now the number of DALUT bits is $2^4 \times 8 \times 9$, or 1152, and the number of DALUT CLBs reduces to 36! We can now draw the functional blocks of the data path and estimate the total CLB count.
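To make the pre-adder idea concrete, here is a minimal Python sketch of one output computation for the symmetric 8-tap filter. It is a behavioural model, not a description of the CLB mapping. It assumes that samples and coefficients are plain signed 8-bit integers rather than scaled fractions, and all function names are illustrative. Each 9-bit pre-adder sum contributes one address bit per bit position to a 16-entry DALUT, and the sign-bit lookup is subtracted.

```python
import random

def build_dalut(coeffs):
    """Entry `addr` holds the sum of coeffs[k] for every bit k set in addr."""
    return [sum(c for k, c in enumerate(coeffs) if (addr >> k) & 1)
            for addr in range(1 << len(coeffs))]

def symmetric_da_fir(samples, half_coeffs, data_bits=8):
    """One filter output via pre-adders plus distributed arithmetic."""
    n = len(half_coeffs)
    assert len(samples) == 2 * n
    # Pre-adders: sum the samples at symmetric taps (results need data_bits + 1 bits).
    pre = [samples[k] + samples[2 * n - 1 - k] for k in range(n)]
    width = data_bits + 1
    lut = build_dalut(half_coeffs)
    acc = 0
    for b in range(width):
        # Bits of equal weight from the pre-adders form the DALUT address.
        # Python's >> is an arithmetic shift, so this yields two's complement bits.
        addr = sum(((s >> b) & 1) << k for k, s in enumerate(pre))
        term = lut[addr] << b
        acc += -term if b == width - 1 else term  # sign-bit lookup is subtracted
    return acc

def direct_fir(samples, half_coeffs):
    """Reference: ordinary multiply-accumulate over all taps."""
    full = list(half_coeffs) + list(half_coeffs)[::-1]
    return sum(c * x for c, x in zip(full, samples))

half = [3, -7, 12, 25]                         # A1..A4, illustrative values
window = [10, -20, 30, -40, 50, -60, 70, -80]  # 8 consecutive samples
assert symmetric_da_fir(window, half) == direct_fir(window, half)
```

The loop over `b` runs sequentially here only for clarity; in the bit-parallel data path the nine lookups happen at once and are combined by the adder network.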

The data path shown in the figure below is entirely bit parallel. Starting at the left, the 8-bit input samples are loaded into a shift register chain. This input buffer is 8 words long with 8 bits per word and is implemented with 64 flip-flops. The symmetric tap outputs are summed in 4 parallel pre-adders; the carry out adds a ninth bit to the sum. The word growth from fractional 2's complement to a range of bit weights from -2 to 2<sup>-7</sup> is shown. Bits of the same weight from the 4 pre-adders serve as DALUT address bits. With the indicated word growth, 9 DALUTs are required. The 8 lower DALUTs feed a binary-tree-like adder network in which the lower input to each adder (all are parallel) is scaled down by a power of 2. Word growth continues through the adder stages. While many applications require only 8-bit output precision, a double precision data path is shown. An adder with 10 output bits indicated is essentially a 9-bit parallel adder with the carry bit passed along, and so on down the chain. Note that the last stage is a subtracter; however, for the DALUT contents addressed by the sign (-2) bits to be applied at the proper time, 3 stages of pipeline registers are inserted. With each stage triggered by a common system clock, a latency of 7 clocks results.

The system clock is, thus, the sample clock. With combinatorial paths limited to one CLB stage, the longest delay path is the carry propagation of the 13-bit adder. Assuming 1 ns/bit for the 4000E-3 devices, the total delay path is 13 ns and the maximum system clock is, thus, 70 MHz. The data path CLB count is approximately 150. The Xilinx 4025 could easily accommodate 4 filtered data paths. Can this design serve as a replicable module to construct higher order filters? How difficult would it be to extend the filter to 9 taps? The 9-tap FIR filter is more flexible: it can, for example, be used for both high-pass and low-pass applications (the even-tap filter is restricted to low-pass applications). The odd-tap FIR filter is also often more efficient; the half-band FIR filter is an outstanding example. As a decimating filter, the half-band filter can probably process samples at twice the 70 MHz system clock rate.
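The clock estimate can be reproduced with a few lines of Python. This is a rough sketch: the helper name and the `overhead_ns` parameter are assumptions introduced here. The text gives only the 1 ns/bit carry figure and the 70 MHz result. The carry chain alone would allow about 76.9 MHz, so the quoted 70 MHz implies a period of about 14.3 ns, presumably leaving margin for register and routing overhead that the text does not itemize.

```python
def max_clock_mhz(adder_bits, ns_per_bit=1.0, overhead_ns=0.0):
    """Clock ceiling set by a ripple-carry path plus a fixed overhead (assumed model)."""
    period_ns = adder_bits * ns_per_bit + overhead_ns
    return 1000.0 / period_ns

print(round(max_clock_mhz(13), 1))                   # carry chain only -> 76.9
print(round(max_clock_mhz(13, overhead_ns=1.3), 1))  # with 1.3 ns assumed overhead
```

The 1.3 ns value is chosen only to show what overhead would bring the estimate down to roughly 70 MHz; it is not a data sheet figure.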

A gate-efficient serial implementation of the 8-tap filter consumes only 30 CLBs but has a maximum sample rate of 7 MHz. We have an interesting tradeoff: the serial FIR filter uses one fifth the number of gates but delivers one tenth the performance.
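One way to read this tradeoff is as sample rate per CLB, using the approximate figures quoted in the text (150 CLBs at 70 MHz for the parallel data path, 30 CLBs at 7 MHz for the serial one):

$$\frac{70\ \text{MHz}}{150\ \text{CLBs}} \approx 0.47\ \text{MHz/CLB}, \qquad \frac{7\ \text{MHz}}{30\ \text{CLBs}} \approx 0.23\ \text{MHz/CLB}$$

On these numbers the parallel design delivers roughly twice the throughput per CLB, so the serial version is attractive mainly when the sample rate is low and gate count is the binding constraint.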




## Fig. 1. The Fastest Filter in the West

- All stages are registered
- Single system clock
- One CLB stage delay between clocks
- Maximum fanout of 8 for DALUT address lines
- Approximately 160 CLBs for data path