![]() |
|
![]() |
|
Answers Database
Fast Integer Multipliers in Xilinx FPGAs -- Reprint from XCELL Issue #14
Record #415
Problem Title: Digital multipliers provide the key to many digital systems including digital filters, correlators, and neural networks. These multipliers are typically required to take up to 16-bit operands and provide results in <50ns (20MHz systems). Figure 1 is a common module found in many signal processing applications. Eg. with the output 'y' applied to an activation function, this would be a simple neuron.---------- --- x ______| Integer |____________| A | | Multiply | | d | ---------- | d |_________ yFigure 1 | B ______| e | The use of programmable logic, and the increasing size of devices available, has lead to complete systems being implemented in one component. However, the digital multiplier is generally considered to be too slow when implemented this way, or just too large to make effective use of the programmable part. Consequently, dedicated multiplier devices are connected to the main system often resulting in a performance degredation. This is caused not by the speed of the multipliers, but the task of getting the data to and from the device (normally time multiplexed to reduce the large I/O requirement). A technique for implementing a compact and fast digital multiplier in a programmable part is required. Finding the Technique --------------------- There are three main implementations of multipliers:- a) Shift and Add - One operand is shifted to the left by one bit each cycle and applied to an accumulator when the corresponding bit in second operand is high. b) Look-up table - the operands are applied as addresses to a pre-programmed memory of which the output data forms the result. c) Logical tree - Each of the resultant bits are a logic function of the relevant bits of each operand. Using a 16 by 16-bit multiplier as an example, consider each technique. Shift and add implementation is compact but very slow. The result is obtained after 16 clock cycles and the accumulator has to be 32 bits wide. This accumulator will determine the maximum clock rate dependent on the carry logic chain. This implementation also precludes any new operands from being applied until the calculation has been completed, read and finally cleared. A look-up table solution will be as fast as the memory used, but rapidly becomes very large with increasing operand size. It requires a 4294967296 x 32-bit memory for the example! Small multipliers work well this way, such as a 4 by 4 multiply implemented in a byte wide ROM. Some very complex implementaions of logical trees have been developed employing product sharing, and are to be found in many of the dedicated arithmetic devices. The gate count is high and can be considered to be a reduced version of the ROM table. The logic involved tends to have a high fan-in requirement of up to 32 inputs. A Technique for FPGA -------------------- There is obviously a conflict as to which technique to employ when the multiplier has to be both fast and compact. Choosing a technique for use in a specific architecture also adds constraints. The XC4000 series of Field Programmable Gate Arrays would appear to offer the correct ingredients for a solution:- i) Currently up to 10,000 gates (typical in real designs) which indicates room for a multiplier and other logic. ii) Fast carry logic for compact and high speed simple arithmetic functions such as addition. iii) Ability to configure custom blocks of RAM and ROM. Having all these features available means that any of the described techniques may be implemented. The compact shift and add technique, though too slow for most applications but is very easily implemented in the device architecture. The accumulator can make excellent use of the fast carry logic with a correspondingly low cycle time even for a result of 32 bits. A look-up table built of a custom ROM block would be ideal for small multipiers. Since any configurable logic block (CLB) can be configured as either two 16 by 1 memories or a single 32 by 1 memory, it is easy to see how small look-up tables can be implemented. Unfortunately a larger memory would rapidly require many of these small elements to be combined. Operands over 4 bits would soon become difficult to handle since the memory requirement rapidly escalates for each additional operand bit. Logic trees can be implemented, but again, as the operand width increases the fan-in requirement of the logic soon exceeds the 9 inputs available to a CLB and functions have to be split. This prevents a compact design, and impacts performance due to the multiple block delays from operands to result. The device architecture does however offer an easy way to pipeline any design due to the two registers in each CLB. Once again operands of more than 4 bits become an escalating problem to handle. The real strengths of the device in this application are its ability to handle small look-up tables, provide logic functions of less than 9 inputs, and form fast adders of any size. It would therefore be ideal to use a hybrid technique tailored to these properties. Using small look-up tables for partial products and combining the results by addition makes best use of all the features. Anatomy of a Hybrid Multiplier ------------------------------ Partial products are formed by splitting the first operand into sections and multiplying each section by all of the second operand. These products contain all the information about the multiplication which must then be combined by addition, whilst at the same time restoring the bit weighting of the sections from the first operand. The design example will illustrate an 8 by 8-bit multiplier, of which the first operand is split into two nibbles. Both nibbles produce a 12-bit partial product after multiplication by the second operand. These products are applied to a 16-bit combining adder to form the final 16 bit result. The partial products are expanded into 16-bit operands required by the adder, and to restore the weighting factor of the two nibbles. The product from the low order nibble, with a weighting of 1, is converted to 16 bits by padding four additonal MSB's with the value of zero. The product from the high order nibble, with a weighting of 16, is effectively shifted by 4 bits using 4 additional LSB's, again with value of zero. It is not possible for the adder to overflow because it is known that an integer multiply of two operands with 'n' and 'm' bits respectively will generate an 'n+m' bit result. In this 8 by 8-bit example the maximum result occurs when multipling:- FF x FF = FE01 (hex). Implementing the Technique -------------------------- The recent introduction of the XILINX XBLOX synthesis tool, is a way of using block diagrams as the method of design entry for the Xilinx XC4000 devices. It greatly simplifies the actual design entry phase. This can be seen in figure 2 which shows the whole design for an 8 by 8 multiplier which multiplies the two operands called 'X_OPERAND' and 'A_OPERAND'. The key factor in making the design compact, fast and easy to implement is to make the look-up tables for the partial products as small as possible. This demands that the simple 16 by 1-bit memory elements would only be combined to expand the width of the memory to that required for the partial products. So in the example 16 by 12-bit memories are synthesised using just 6 CLB's, each containing two elements. Obviously a memory with 16 addresses has 4 address lines, which means that only the nibble section from the 'X_OPERAND' can be applied. The need to connect the 'A_OPERAND' has been totally eliminated in this design by ensuring that only those partial products that can be obtained by a single value of 'A_OPERAND' are available at any one time. These values are quite simply:- 0, 1 x 'A_OPERAND', 2 x 'A_OPERAND'........... 15 x 'A_OPERAND', Making the look-up tables from RAM means that the 16 partial products (of which the first is always zero) relating to a single value of 'A_OPERAND' can be modified. This is performed using the accumulator, address counter and small state machine which accesses each location of the memory and stores a new product. Both tables can be modified to contain the same data as each nibble of the 'X_OPERAND' is multiplied by the same value. Clearly, the multiplier can not be used for the period during which the table contents are being modified, and in no way can a result be obtained quickly when changing the value of 'A_OPERAND'. In contrast, changes in the 'X_OPERAND' will give a result with minimum delay, and the data path can easily be pipelined for an optimised performance in a clocked system. Most applications can tolerate the time during which the 'A_OPERAND' is converted into new contents for the look-up tables. Consider once again the module of figure 1. In many cases only an ability to modify 'A' is required in order to tune a system. Another example is video processing, where 'A' is changed at the end of each page, during the screen fly-back. For some applications, 'A' will be fixed from the design concept and throughout the life of the system. Computer simulations may provide the optimum values for multiplicands and offsets in the system. In these cases, the look-up tables can be implemented as ROM's in the XC4000 with the partial products pre-defined. This in turn eliminates the overhead logic to program the tables and increases the performance by removing the address multiplexers from the data path. The reconfigurable nature of XC4000 devices would still permit different values to be loaded using various device configuration bitstreams. Expanding the Hybrid Technique ------------------------------ The technique expands easily by units of 4 bits on the 'fast' data input, where each additional nibble connects to a separate look-up table. The size and contents of each table are then determined by the second operand which can be any number of bits. Larger multipliers require no further logic or time to set the products into the tables. Further levels of combining adders are required for more than two look-up tables. These will impact performance, but their effect can be minimised with a pipelined design (registers at the table ouputs and each adder). Negative values in Twos complement format can be converted using simple 'invert and add 1' pipeline stages where necessary, handling the sign bits separately. A 16 by 16-bit multiplier of this type uses four tables to generate 20-bit partial products corresponding to each nibble. Two 24-bit combining adders are used to form partial products for the high and low order bytes of the input data. A final 32-bit adder combines these products restoring the weighting of 256 to the high order product (a shift of 8 bits). Performance ----------- The figures shown in table 1 are all worst case for an XC4000-5 device. The designs were each processed using the Xilinx automatic tools without any user intervention. The combinatorial speed is taken as a device pin to pin delay of which 10 nano-seconds are the on and off chip time. Since the multiplier is generally imbedded in the system, this time may be deducted in estimating in-system performance. Pipelined speed is determined by the speed of the slowest element, which is the largest adder in the system. Table 1----------------------------------------------------------------------- | Style | 8 x 8-bit | 8 x 8-bit | 16 x 16-bit | 16 x 16-bit | | | RAM Look-up | ROM Look-up | RAM Look-up | ROM Look-up | |-----------------------------------------------------------------------| |CLB Count | 39 | 22 | 117 | 84 | | | | | | ||----------------------------------------------------------------------- | Combinatorial | 56ns | 46ns | 96ns | 88ns | | Delay | | | | ||----------------------------------------------------------------------- | Pipelined | 39.9MHz | 41.3MHz | 25MHz | 25.5MHz | | Performance | | | | |----------------------------------------------------------------------- Note an XC4010 has 400 CLB's available. End of Record #415 - Last Modified: 04/01/97 07:46 |
| For the latest news, design tips, and patch information on the Xilinx design environment, check out the Technical Tips! |