| 
             "8x12=96" and "12x8=96"...so what is Ken Chapman 
              on about this week?  
            Well I haven’t quite gone mad just yet, it’s just that I’m thinking 
              about those Xilinx Virtex and Spartan-II devices again and how multipliers 
              are constructed in the CLBs. I will prove to you that "8x12" 
              is not equal to "12x8", and hopefully in the process, 
              provide you with some details of how to get the most out of your 
              designs that contain multiplication.  
            To prove this, I need to bend the rules a little, and change my 
              definition of the maths! In my case I’m not going to consider the 
              numerical values of "8" and "12," but consider 
              the multiplication of an 8-bit value and 12-bit value.  
            Here we see full binary (unsigned) multiplication of the values 
              3021 (BCD hexadecimal) and 228 (E4 hexadecimal).  
            The act of multiplication comes down to some very simple procedures. 
              I have labeled the two inputs "A" 
              and "B". The multiplication 
              is performed by taking each bit of "B" 
              in turn, and using it to multiply the whole of input "A". 
              Now in binary this is very simple. Since one bit can only represent 
              the value "0" or "1," then the result of this 
              one-bit multiplication can only be the value zero or "A". 
             
            It then requires that all these partial products are added together 
              to form the final product of multiplication. Of course, we have 
              to restore the original bit weighting associated with each bit of 
              "B", and this is achieved by the ever-increasing partial 
              product shifting to the left that is provided by the "0" 
              padding on each line. 
            Regardless of which value is the multiplicand, and which is the 
              multiplier, the result is always 688788 (A8294 hexadecimal) and 
              I’m still losing the game! 
              
            The one-it multiplication is logically very simple requiring only 
              sets of AND gates. These either allow the "A" 
              value to be passed, or force the partial product completely to zero. 
              
            Then, we need to add all the partial products together with appropriate 
              bit weighting. If we have time (enough clock cycles), we can adopt 
              the classical "shift and add" technique based on an accumulator, 
              but in this case I am looking for a maximum performance parallel 
              multiplier. For this, I will need an "addition tree". 
             
            It can be seen that the bit weighting is restored throughout the 
              addition process. It is always easy to determine the number of bits 
              required at any stage, as each adder output is in itself a partial 
              product of the final result. As shown, the first adders provide 
              the products of the 12-bit input "A" 
              multiplied by 2 bits of "B", 
              and hence generate a 14-bit (12+2) product. The second stage adders 
              require 16-bits to represent a 12-bit by 4-bit product. The final 
              result is then the full 20-bit product. 
              
            Obviously, the addition of zero bits does not require a real addition 
              process. However, if the adder is pipelined, then the least significant 
              bits to which zero is added will also require flip-flops to balance 
              the pipeline delay. These flip-flops occupy the same space as a 
              full adder (unless that wonderful SRL16E can combine several stages 
              together). 
            Since addition is such a key part of this multiplication process, 
              let’s review how a full adder is so cleverly achieved in Virtex 
              and Spartan-II. Each function generator (of which there are 4 in 
              each CLB arranged into 2 "slices") is complemented by 
              a simple 2-input XOR gate called XORCY and a 2:1 multiplexer called 
              MUXCY. From the left-hand truth table for a full adder, it is clear 
              that the sum in the result of A xor B xor Cin. 
            In Virtex and Spartan-II, this function is broken into two parts 
              such that the Half_sum = A 
              xor B is generated first using the function generator, and then 
              the dedicated XORCY is used to form the full sum of Half_sum 
              xor Cin. The clever bit can be seen in the right-hand table. 
              This shows how the Half-sum is used 
              to control the MUXCY multiplexer which generates the carry output. 
              Notice how the Half_sum selects either 
              the original carry input or the "A" input. 
              
              
            Whilst the latter stages of the addition tree are 
              pure add functions, look at the way in which the first two partial 
              products are formed and then applied to the first stage adder. This 
              means that in the majority of cases, the two adder inputs are each 
              driven by a 2-input AND gate. As these AND gates would each occupy 
              a function generator, a multiplier suddenly becomes very large in 
              an FPGA. In my case of a 12-bit by 8-bit multiplier it would require 
              12x8 = 96 function generators (48 slices) just to implement the 
              AND gates. 
              
             
            However, we can quickly see an optimisation. The AND 
              gate associated with one of the adder inputs can be absorbed into 
              the function generator forming the Half_sum 
              for addition. This in itself reduces the size of the 12-bit by 8-bit 
              multiplier by 48 function generators (24 slices).  
              
            It would be nice to absorb the other AND gate into 
              the function generator, but the signal it produces is also required 
              by the MUXCY part of the addition function. So this would appear 
              to be the most we can achieve.  
            However, Virtex and Spartan-II does allow this second 
              AND gate to be absorbed. Next to each function generator is yet 
              another component called the MULT_AND. It has the effect of re-creating 
              the same input to the MUXCY, even though the desired signal is now 
              buried within the function generator. 
              
            So, back to the issue of "8x12" not being 
              equal to "12x8". 
             Since all the AND gates associated with multiplication 
              are essentially free in Virtex and Spartan-II, the structure of 
              a parallel multiplier is the same as that for an addition tree with 
              the first stage adders also performing the multiplication by two 
              bits of "B". As we break the 
              "B" word into 2-bit sections, 
              so the resulting addition tree must reflect the number of bits in 
              the "B" word. 
              
              ------ --------------------------------------------------------------------  
             
            The fact that 8 is a power of 2 means that the "12x8" 
              breaks down nicely into 4 multiplier adders in the first stage; 
              hence, it leads to a symmetrical addition tree of 3 levels. In contrast, 
              the "8x12" is less elegant: the 6 multiplier adders of 
              the first stage do not sum easily, leading to more adders and 4 
              levels of logic. For a fully pipelined multiplier, there is even 
              the requirement for delay compensation. 
            Admittedly, the multiplier adders and pure adders 
              of the "8x12" are generally a smaller number of bits than 
              in the "12x8"; but with Virtex and Spartan-II, this has 
              a very minimal impact on performance. In any case, both multipliers 
              have the same largest-size adder at the final stage. Combinatorial 
              multiplier performance will be set by the number of logic levels, 
              and in this case, the "12x8" will definitely win. 
            So, I have eventually proven that 8x12 is not 
              equal to 12x8 when it comes to the size and performance of multipliers 
              in Virtex and Spartan-II devices. Consider this as you use full 
              parallel multipliers in your designs.  
            The Variable Parallel Multiplier IP contained in the 
              Core Generator allows you to define the width of the "A" 
              and "B" ports, so do consider 
              the impact this will have. The data sheet indicates that latency 
              is a function of the "B" port; 
              if you play with the GUI, you will see instant feedback that for 
              an 8x12 the latency is 4 cycles. For the 12x8, it is just 3 cycles. 
              I suggest that this information is also used to determine the number 
              of levels of logic in combinatorial multipliers before actually 
              selecting the combinatorial option.  
            If you are an HDL user, try to understand how your 
              tool maps what you write to the structure. All synthesis tools stating 
              that they specifically target Virtex or Spartan-II devices should 
              be making full use of the MULT_AND gate, and hence producing optimum 
              area solutions. However, you may need to experiment to discover 
              the "A" and "B" 
              port-mapping. 
              
            A more significant issue with HDL coding is performance. 
              Although the above VHDL example will generate a multiplier, it will 
              result in a combinatorial multiplier with a single register at the 
              output. To increase performance, the addition stages really do need 
              to be pipelined. This is where you may insert a Xilinx IP Core. 
             
            Alternatively, your code could describe the internal 
              structure of the multiplier in more detail. Synthesis vendors have 
              also been looking at coding styles combined with "re-timing," 
              which can distribute lumped register delays following a multiplier 
              back into the addition stages. In this case, it is vital that you 
              understand the particular synthesis tools you use and that you fully 
              appreciate that... 
            “8x12 is not equal to 12x8”  
            Share your comments, questions and ideas with Ken Chapman and 
              other interested designers at the "8x12 Does NOT Equal 12x8" 
              FORUM. 
       |