Introduction

The use of a prescaler is a common technique for improving counter performance. Originally, a small high-performance counter was used to divide an incoming clock, thus providing a slower clock to a larger, lower-performance counter. This technique has since been adapted to synchronous counters.

In a synchronous counter, the first few bits of the counter are decoded to create a parallel Count Enable (CEP). This clock enable is used to reduce the effective clock rate. The carry chain in the more significant bits is, thereby, allowed several clock periods in which to settle. However, using this technique results in a counter that, without further adaptation, is non-loadable.

Typically, in the LCA implementation of such a counter, the critical delay is the generation and distribution of CEP. This delay can be shortened by pipelining CEP and using a high-speed Longline for its distribution. However, where ultimate speed is the objective, even the relatively small Longline delay can be eliminated.

To eliminate this delay, the LSB of the counter is replicated to create an “active Longline.” This involves locating an LSB replica immediately adjacent to each bit in the counter. In counter organizations where one CLB provides the flip-flops for two counter bits, the number of replicas required is approximately half the number of bits in the counter.

In XC3000 designs, direct interconnect can be used between the LSB replicas and the counter bits. This results in an effective distribution delay of zero. In XC4000 designs, the residual routing delay is minimal.

Implementation

XC3000
The XC3000 design for the ultra-fast counter is shown in Figure 1. This design uses two parallel count enable signals, Q0 and CEP2. Q0 acts as a 1-bit prescaler, halving the effective clock rate in the rest of the counter. It is the distribution of Q0 that is critical, and depends upon replication.

Even with the effective clock rate halved, it is necessary to use a second 2-bit prescaler for any significant length of counter. The parallel count enable signal (CEP2), generated by this second prescaler, occurs once every eight clock cycles. Reducing the effective clock rate by a factor of eight permits the use of a simple ripple carry scheme for the remaining bits of the counter. The Q0 prescaler allows two clock cycles for the distribution of CEP2, and a Longline is adequately fast.

Except for the Q1 flip-flop, the column of CLBs on the left consists entirely of replicated LSBs. Only one flip-flop, at the top of the column, is configured to toggle. The remaining flip-flops in the column act as slaves to this one master flip-flop.

These slave flip-flops are organized as a shift register with inverters between stages. At each stage there is a pair of flip-flops (QX0 and QY0) contained within a single CLB. The two flip-flops operate in parallel. This duplication permits both vertical direct interconnect to the next stage, and horizontal direct interconnect to the counter bits.

The first stage toggles by continuously loading the inverse of its current state. Stage two loads the inverse
Figure 1. X3000 Ultra-Fast Counter
Ultra-Fast Synchronous Counters

of stage one, delayed by one clock period. Given that stage one is toggling, this combination of inversion and delay causes stage two to operate in synchronism with stage one, as shown in Figure 2. Similarly, stage three operates in synchronism with stage two, and so on. This slave mode of operation guarantees that all N stages will operate in synchronism after no more than N-1 clocks, regardless of their initial state.

To avoid unnecessary loading on the direct interconnects, the Q0 output is taken from the last stage of the shift register. Otherwise, the additional loading would cause a small increase (~0.1 ns) in the direct interconnect delay, and this would reduce the maximum clock frequency by ~1 MHz.

The second prescaler, Q1 and Q2, is a simple 2-bit counter, enabled by Q0. CEP2 is High for two clock periods while Q1 and Q2 are both High. The CEP2 pipeline flip-flop is also enabled by Q0. In this way, CEP2 changes at the same time as Q1 and Q2, and each has two clock periods in which to set up. CLB input constraints require that Q2 be externally routed to the CEP2 decoder.

The remaining bits of the counter use a ripple-carry scheme. Pairs of bits are implemented together, using two CLBs per pair. One CLB provides the two flip-flops, and is placed adjacent to a Q0 CLB to exploit the direct interconnect. The second CLB implements the carry chain, with each pair of bits adding one TIL0 delay. To minimize the cumulative delay and maximize the counter length, direct interconnect should also be used in the carry path.

With all critical delays reduced to a clock-to-output delay plus a set-up time, with no routing delay, the minimum clock period is 10.5 ns (95 MHz). The ripple-carry delay in the more significant bits in an XC3000-125 counter is approximately 15 ns plus 5.7 ns per bit-pair. With the counter running at its minimum clock period, the carry chain has 84 ns in which to settle. This will permit up to 12 bit-pairs in the ripple carry path. A counter running at the maximum speed can, therefore, have up to 27 bits including the prescalers.

XC4000
The XC4000 design, shown in Figure 3, is very similar to the XC3000 design. The principle difference is that the dedicated carry logic can be used in the more significant bits of the counter.

To maximize the performance, all critical paths are restricted to single-length interconnects, only one of which is driven from any output. This again requires that pairs of flip-flops be used in each stage of the LSB shift register. Using double-length interconnects or driving multiple single-length lines, the number of flip-flops can be reduced, with only a slight loss of performance.

The minimum clock period is the clock-to-output delay plus routing delay and set-up time. With the interconnection strategy described above, this can be kept below 9 ns (111 MHz). The ripple-carry delay in the more significant bits is 13 ns plus 1.5 ns per bit-pair. The 72 ns available permits a theoretical maximum counter length of 87 bits. In practice, the number of bits will be limited by the loading on the Long line distributing CEP2. The available time should allow counters in excess of 20 bits long to be constructed.
Figure 3. XC4000 Ultra-Fast Counter