Xilinx Answer #415 : Fast Integer Multipliers in Xilinx FPGAs -- Reprint from XCELL Issue #14

Answers Database

Fast Integer Multipliers in Xilinx FPGAs -- Reprint from XCELL Issue #14

Record #415
Problem Title:
Fast Integer Multipliers in Xilinx FPGAs -- Reprint from XCELL Issue #14

Problem Description:

Solution 1:

The following is a reprint of an article that appeared in XCELL
Issue #14, 3Q1994, and also in an abbreviated form in EDN, March
1993. It was named best 1993 design idea by EDN.

Fast Integer Multipliers Using Xilinx FPGA's
--------------------------------------------

Ken Chapman.
(Applications Engineer Europe)

Introduction.
-------------
Question : Can you implement a 16 by 16-bit multiply in an FPGA, leave room
         for other logic, and perform at 25MHz?

Digital multipliers provide the key to many digital systems including digital filters, correlators, and neural networks. These multipliers are typically required to take up to 16-bit operands and provide results in <50ns (20MHz systems). Figure 1 is a common module found in many signal processing applications. Eg. with the output 'y' applied to an activation function, this would be a simple neuron.
---------- --- x ______| Integer |____________| A | | Multiply | | d | ---------- | d |_________ y
Figure 1 | B ______| e |
-------- | | r | y = Ax + B
A -------- ---

The use of programmable logic, and the increasing size of devices available,
has lead to complete systems being implemented in one component. However,
the digital multiplier is generally considered to be too slow when
implemented this way, or just too large to make effective use of the
programmable part. Consequently, dedicated multiplier devices are connected
to the main system often resulting in a performance degredation. This is
caused not by the speed of the multipliers, but the task of getting the data
to and from the device (normally time multiplexed to reduce the large
I/O requirement).

A technique for implementing a compact and fast digital multiplier in a
programmable part is required.

Finding the Technique
---------------------

There are three main implementations of multipliers:-

a) Shift and Add - One operand is shifted to the left by one bit each
                 cycle and applied to an accumulator when the
                 corresponding bit in second operand is high.

b) Look-up table - the operands are applied as addresses to a
                 pre-programmed memory of which the output data
                 forms the result.

c) Logical tree - Each of the resultant bits are a logic function of the relevant bits of each operand.

Using a 16 by 16-bit multiplier as an example, consider each technique.

Shift and add implementation is compact but very slow. The result is obtained
after 16 clock cycles and the accumulator has to be 32 bits wide. This
accumulator will determine the maximum clock rate dependent on the carry
logic chain. This implementation also precludes any new operands from being
applied until the calculation has been completed, read and finally cleared.

A look-up table solution will be as fast as the memory used, but rapidly
becomes very large with increasing operand size. It requires a
4294967296 x 32-bit memory for the example! Small multipliers work well this
way, such as a 4 by 4 multiply implemented in a byte wide ROM.

Some very complex implementaions of logical trees have been developed
employing product sharing, and are to be found in many of the dedicated
arithmetic devices. The gate count is high and can be considered to be a
reduced version of the ROM table. The logic involved tends to have a high
fan-in requirement of up to 32 inputs.

A Technique for FPGA
--------------------

There is obviously a conflict as to which technique to employ when the
multiplier has to be both fast and compact. Choosing a technique for
use in a specific architecture also adds constraints. The XC4000 series
of Field Programmable Gate Arrays would appear to offer the correct
ingredients for a solution:-

        i) Currently up to 10,000 gates (typical in real designs) which
        indicates room for a multiplier and other logic.

       ii) Fast carry logic for compact and high speed simple arithmetic
        functions such as addition.

      iii) Ability to configure custom blocks of RAM and ROM.

Having all these features available means that any of the described
techniques may be implemented.

The compact shift and add technique, though too slow for most applications
but is very easily implemented in the device architecture. The accumulator
can make excellent use of the fast carry logic with a correspondingly low
cycle time even for a result of 32 bits.

A look-up table built of a custom ROM block would be ideal for small
multipiers. Since any configurable logic block (CLB) can be configured as
either two 16 by 1 memories or a single 32 by 1 memory, it is easy to see
how small look-up tables can be implemented. Unfortunately a larger memory
would rapidly require many of these small elements to be combined.
Operands over 4 bits would soon become difficult to handle since the memory
requirement rapidly escalates for each additional operand bit.

Logic trees can be implemented, but again, as the operand width increases the
fan-in requirement of the logic soon exceeds the 9 inputs available to a
CLB and functions have to be split. This prevents a compact design, and
impacts performance due to the multiple block delays from operands to
result. The device architecture does however offer an easy way to pipeline
any design due to the two registers in each CLB. Once again operands of
more than 4 bits become an escalating problem to handle.

The real strengths of the device in this application are its ability to
handle small look-up tables, provide logic functions of less than 9 inputs,
and form fast adders of any size. It would therefore be ideal to use a
hybrid technique tailored to these properties.

Using small look-up tables for partial products and combining the results by
addition makes best use of all the features.

Anatomy of a Hybrid Multiplier
------------------------------

Partial products are formed by splitting the first operand into sections and
multiplying each section by all of the second operand. These products
contain all the information about the multiplication which must then be
combined by addition, whilst at the same time restoring the bit weighting of
the sections from the first operand.

The design example will illustrate an 8 by 8-bit multiplier, of which the
first operand is split into two nibbles. Both nibbles produce a 12-bit
partial product after multiplication by the second operand. These products
are applied to a 16-bit combining adder to form the final 16 bit result.

The partial products are expanded into 16-bit operands required by the adder,
and to restore the weighting factor of the two nibbles. The product from the
low order nibble, with a weighting of 1, is converted to 16 bits by padding
four additonal MSB's with the value of zero. The product from the high order
nibble, with a weighting of 16, is effectively shifted by 4 bits using 4
additional LSB's, again with value of zero.

It is not possible for the adder to overflow because it is known that an
integer multiply of two operands with 'n' and 'm' bits respectively will
generate an 'n+m' bit result. In this 8 by 8-bit example the maximum
result occurs when multipling:- FF x FF = FE01 (hex).

Implementing the Technique
--------------------------

The recent introduction of the XILINX XBLOX synthesis tool, is a way of
using block diagrams as the method of design entry for the Xilinx XC4000
devices. It greatly simplifies the actual design entry phase. This can be
seen in figure 2 which shows the whole design for an 8 by 8 multiplier which
multiplies the two operands called 'X_OPERAND' and 'A_OPERAND'.

The key factor in making the design compact, fast and easy to implement is
to make the look-up tables for the partial products as small as possible.
This demands that the simple 16 by 1-bit memory elements would only be
combined to expand the width of the memory to that required for the partial
products. So in the example 16 by 12-bit memories are synthesised using just
6 CLB's, each containing two elements.

Obviously a memory with 16 addresses has 4 address lines, which means that
only the nibble section from the 'X_OPERAND' can be applied. The need to
connect the 'A_OPERAND' has been totally eliminated in this design by
ensuring that only those partial products that can be obtained by a single
value of 'A_OPERAND' are available at any one time.

These values are quite simply:-

     0, 1 x 'A_OPERAND', 2 x 'A_OPERAND'........... 15 x 'A_OPERAND',

Making the look-up tables from RAM means that the 16 partial products
(of which the first is always zero) relating to a single value of 'A_OPERAND'
can be modified. This is performed using the accumulator, address counter and
small state machine which accesses each location of the memory and stores a
new product. Both tables can be modified to contain the same data as each
nibble of the 'X_OPERAND' is multiplied by the same value.

Clearly, the multiplier can not be used for the period during which the
table contents are being modified, and in no way can a result be
obtained quickly when changing the value of 'A_OPERAND'. In contrast, changes
in the 'X_OPERAND' will give a result with minimum delay, and the data path
can easily be pipelined for an optimised performance in a clocked system.

Most applications can tolerate the time during which the 'A_OPERAND' is
converted into new contents for the look-up tables. Consider once again the
module of figure 1. In many cases only an ability to modify 'A' is required
in order to tune a system. Another example is video processing, where 'A' is
changed at the end of each page, during the screen fly-back.

For some applications, 'A' will be fixed from the design concept and
throughout the life of the system. Computer simulations may provide the
optimum values for multiplicands and offsets in the system. In these cases,
the look-up tables can be implemented as ROM's in the XC4000 with the
partial products pre-defined. This in turn eliminates the overhead logic to
program the tables and increases the performance by removing the address
multiplexers from the data path. The reconfigurable nature of XC4000 devices
would still permit different values to be loaded using various device
configuration bitstreams.

Expanding the Hybrid Technique
------------------------------

The technique expands easily by units of 4 bits on the 'fast' data input,
where each additional nibble connects to a separate look-up table. The size
and contents of each table are then determined by the second operand which
can be any number of bits. Larger multipliers require no further logic or
time to set the products into the tables.

Further levels of combining adders are required for more than two look-up
tables. These will impact performance, but their effect can be minimised
with a pipelined design (registers at the table ouputs and each adder).

Negative values in Twos complement format can be converted using simple
'invert and add 1' pipeline stages where necessary, handling the sign bits
separately.

A 16 by 16-bit multiplier of this type uses four tables to generate 20-bit
partial products corresponding to each nibble. Two 24-bit combining adders
are used to form partial products for the high and low order bytes of the
input data. A final 32-bit adder combines these products restoring the
weighting of 256 to the high order product (a shift of 8 bits).

Performance
-----------
The figures shown in table 1 are all worst case for an XC4000-5 device. The
designs were each processed using the Xilinx automatic tools without any
user intervention.

The combinatorial speed is taken as a device pin to pin delay of which
10 nano-seconds are the on and off chip time. Since the multiplier is
generally imbedded in the system, this time may be deducted in estimating
in-system performance.

Pipelined speed is determined by the speed of the slowest element, which is
the largest adder in the system.

Table 1
----------------------------------------------------------------------- | Style | 8 x 8-bit | 8 x 8-bit | 16 x 16-bit | 16 x 16-bit | | | RAM Look-up | ROM Look-up | RAM Look-up | ROM Look-up | |-----------------------------------------------------------------------| |CLB Count | 39 | 22 | 117 | 84 | | | | | | |

|----------------------------------------------------------------------- | Combinatorial | 56ns | 46ns | 96ns | 88ns | | Delay | | | | |

|----------------------------------------------------------------------- | Pipelined | 39.9MHz | 41.3MHz | 25MHz | 25.5MHz | | Performance | | | | |
-----------------------------------------------------------------------

Note an XC4010 has 400 CLB's available.

End of Record #415 - Last Modified: 04/01/97 07:46

For the latest news, design tips, and patch information on the Xilinx design environment, check out the Technical Tips!