"8x12=96" and "12x8=96"...so what is Ken Chapman
on about this week?
Well I haven’t quite gone mad just yet, it’s just that I’m thinking
about those Xilinx Virtex and Spartan-II devices again and how multipliers
are constructed in the CLBs. I will prove to you that "8x12"
is not equal to "12x8", and hopefully in the process,
provide you with some details of how to get the most out of your
designs that contain multiplication.
To prove this, I need to bend the rules a little, and change my
definition of the maths! In my case I’m not going to consider the
numerical values of "8" and "12," but consider
the multiplication of an 8-bit value and 12-bit value.
Here we see full binary (unsigned) multiplication of the values
3021 (BCD hexadecimal) and 228 (E4 hexadecimal).
The act of multiplication comes down to some very simple procedures.
I have labeled the two inputs "A"
and "B". The multiplication
is performed by taking each bit of "B"
in turn, and using it to multiply the whole of input "A".
Now in binary this is very simple. Since one bit can only represent
the value "0" or "1," then the result of this
one-bit multiplication can only be the value zero or "A".
It then requires that all these partial products are added together
to form the final product of multiplication. Of course, we have
to restore the original bit weighting associated with each bit of
"B", and this is achieved by the ever-increasing partial
product shifting to the left that is provided by the "0"
padding on each line.
Regardless of which value is the multiplicand, and which is the
multiplier, the result is always 688788 (A8294 hexadecimal) and
I’m still losing the game!
The one-it multiplication is logically very simple requiring only
sets of AND gates. These either allow the "A"
value to be passed, or force the partial product completely to zero.
Then, we need to add all the partial products together with appropriate
bit weighting. If we have time (enough clock cycles), we can adopt
the classical "shift and add" technique based on an accumulator,
but in this case I am looking for a maximum performance parallel
multiplier. For this, I will need an "addition tree".
It can be seen that the bit weighting is restored throughout the
addition process. It is always easy to determine the number of bits
required at any stage, as each adder output is in itself a partial
product of the final result. As shown, the first adders provide
the products of the 12-bit input "A"
multiplied by 2 bits of "B",
and hence generate a 14-bit (12+2) product. The second stage adders
require 16-bits to represent a 12-bit by 4-bit product. The final
result is then the full 20-bit product.
Obviously, the addition of zero bits does not require a real addition
process. However, if the adder is pipelined, then the least significant
bits to which zero is added will also require flip-flops to balance
the pipeline delay. These flip-flops occupy the same space as a
full adder (unless that wonderful SRL16E can combine several stages
together).
Since addition is such a key part of this multiplication process,
let’s review how a full adder is so cleverly achieved in Virtex
and Spartan-II. Each function generator (of which there are 4 in
each CLB arranged into 2 "slices") is complemented by
a simple 2-input XOR gate called XORCY and a 2:1 multiplexer called
MUXCY. From the left-hand truth table for a full adder, it is clear
that the sum in the result of A xor B xor Cin.
In Virtex and Spartan-II, this function is broken into two parts
such that the Half_sum = A
xor B is generated first using the function generator, and then
the dedicated XORCY is used to form the full sum of Half_sum
xor Cin. The clever bit can be seen in the right-hand table.
This shows how the Half-sum is used
to control the MUXCY multiplexer which generates the carry output.
Notice how the Half_sum selects either
the original carry input or the "A" input.
Whilst the latter stages of the addition tree are
pure add functions, look at the way in which the first two partial
products are formed and then applied to the first stage adder. This
means that in the majority of cases, the two adder inputs are each
driven by a 2-input AND gate. As these AND gates would each occupy
a function generator, a multiplier suddenly becomes very large in
an FPGA. In my case of a 12-bit by 8-bit multiplier it would require
12x8 = 96 function generators (48 slices) just to implement the
AND gates.
However, we can quickly see an optimisation. The AND
gate associated with one of the adder inputs can be absorbed into
the function generator forming the Half_sum
for addition. This in itself reduces the size of the 12-bit by 8-bit
multiplier by 48 function generators (24 slices).
It would be nice to absorb the other AND gate into
the function generator, but the signal it produces is also required
by the MUXCY part of the addition function. So this would appear
to be the most we can achieve.
However, Virtex and Spartan-II does allow this second
AND gate to be absorbed. Next to each function generator is yet
another component called the MULT_AND. It has the effect of re-creating
the same input to the MUXCY, even though the desired signal is now
buried within the function generator.
So, back to the issue of "8x12" not being
equal to "12x8".
Since all the AND gates associated with multiplication
are essentially free in Virtex and Spartan-II, the structure of
a parallel multiplier is the same as that for an addition tree with
the first stage adders also performing the multiplication by two
bits of "B". As we break the
"B" word into 2-bit sections,
so the resulting addition tree must reflect the number of bits in
the "B" word.
------ --------------------------------------------------------------------
The fact that 8 is a power of 2 means that the "12x8"
breaks down nicely into 4 multiplier adders in the first stage;
hence, it leads to a symmetrical addition tree of 3 levels. In contrast,
the "8x12" is less elegant: the 6 multiplier adders of
the first stage do not sum easily, leading to more adders and 4
levels of logic. For a fully pipelined multiplier, there is even
the requirement for delay compensation.
Admittedly, the multiplier adders and pure adders
of the "8x12" are generally a smaller number of bits than
in the "12x8"; but with Virtex and Spartan-II, this has
a very minimal impact on performance. In any case, both multipliers
have the same largest-size adder at the final stage. Combinatorial
multiplier performance will be set by the number of logic levels,
and in this case, the "12x8" will definitely win.
So, I have eventually proven that 8x12 is not
equal to 12x8 when it comes to the size and performance of multipliers
in Virtex and Spartan-II devices. Consider this as you use full
parallel multipliers in your designs.
The Variable Parallel Multiplier IP contained in the
Core Generator allows you to define the width of the "A"
and "B" ports, so do consider
the impact this will have. The data sheet indicates that latency
is a function of the "B" port;
if you play with the GUI, you will see instant feedback that for
an 8x12 the latency is 4 cycles. For the 12x8, it is just 3 cycles.
I suggest that this information is also used to determine the number
of levels of logic in combinatorial multipliers before actually
selecting the combinatorial option.
If you are an HDL user, try to understand how your
tool maps what you write to the structure. All synthesis tools stating
that they specifically target Virtex or Spartan-II devices should
be making full use of the MULT_AND gate, and hence producing optimum
area solutions. However, you may need to experiment to discover
the "A" and "B"
port-mapping.
A more significant issue with HDL coding is performance.
Although the above VHDL example will generate a multiplier, it will
result in a combinatorial multiplier with a single register at the
output. To increase performance, the addition stages really do need
to be pipelined. This is where you may insert a Xilinx IP Core.
Alternatively, your code could describe the internal
structure of the multiplier in more detail. Synthesis vendors have
also been looking at coding styles combined with "re-timing,"
which can distribute lumped register delays following a multiplier
back into the addition stages. In this case, it is vital that you
understand the particular synthesis tools you use and that you fully
appreciate that...
“8x12 is not equal to 12x8”
Share your comments, questions and ideas with Ken Chapman and
other interested designers at the "8x12 Does NOT Equal 12x8"
FORUM.
|