

# Local Clocking Resources in Virtex-II Devices

Author: Emi Eto and Lyman Lewis

## Summary

This application note describes the different local clocking resources available in the Virtex<sup>™</sup>-II architecture. Along with a reference design, this application note details how to use the local clocking resources in source-synchronous applications.

## Introduction

In high-speed interfaces, a common technique is for the transmitting device to forward the clock along with the data. This type of interface is termed "source-synchronous" because the output data is synchronous with the source (or transmitter's) clock. In source-synchronous systems, it is the responsibility of the receiver to ensure the clock is routed to all data loads while meeting the required input setup and hold timing. Source-synchronous devices often limit the loading of the forwarded clock. There are two main clocking schemes used for source-synchronous systems: free-running clocks and data strobes.

#### Free Running Clock

For many designs, the incoming clock can be phase-shifted by the DCM to place the clock exactly in the center of the data window. This is particularly useful when targeting large data buses or when a training pattern is available at initialization.
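
As a rough illustration of what "center of the data window" means numerically, the following Python sketch estimates the delay needed to center an edge-aligned clock. It rests on assumptions that are not stated in this application note: the data is edge-aligned with the forwarded clock, the ideal capture point is the middle of the bit time, and the DCM fixed phase shift is expressed in 1/256ths of the clock period. The function name `centering_phase_shift` is hypothetical.

```python
def centering_phase_shift(clock_period_ns, ddr=False):
    """Return (delay_ns, phase_shift_steps) that centers an edge-aligned clock.

    Assumptions (not from this application note):
      - data transitions are aligned with the clock edges,
      - the ideal capture point is the middle of the bit time,
      - the phase shift is programmed in steps of 1/256 of the clock period.
    """
    # In DDR, one bit occupies half a clock period; in SDR, a full period.
    bit_time_ns = clock_period_ns / 2 if ddr else clock_period_ns
    delay_ns = bit_time_ns / 2
    steps = round(256 * delay_ns / clock_period_ns)
    return delay_ns, steps


print(centering_phase_shift(8.0))            # SDR, 8 ns period
print(centering_phase_shift(8.0, ddr=True))  # DDR, 8 ns period
```

In practice the best setting also depends on setup/hold asymmetry and board delays, which is why a training pattern is helpful.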

In the case where the data bus width for the source-synchronous device is small (say eight data signals per clock load), but there is a requirement for a wide overall bus, multiple "groups" of these source-synchronous devices are used. For example, eight devices can create a 64-bit data bus. For a Virtex-II receiver, this corresponds to eight global clocks and eight DCMs, each with a specific phase shift, to capture each "group" of data. In many designs, it is not practical to use this many resources. While one DCM can be used to capture the data, finding the exact phase shift value is difficult, especially across the process, voltage, and temperature corners of the Virtex-II and the multiple external devices.

In these designs, utilizing the local clock resources for clock routing is a useful alternate solution.

#### **Data Strobes**

Many DDR memories forward a data strobe rather than a clock to the receiving device. However, unlike a clock, this strobe signal is not free running. If the external device is not transmitting data, the strobe will not transition. Additionally, since this strobe is bi-directional, the strobe is always transmitted by the end data transmitter.

Due to the nature of the data strobe, a DCM cannot be used to phase shift the strobe. In these cases, the local clock resources offer a very effective solution.

<sup>© 2003</sup> Xilinx, Inc. All rights reserved. All Xilinx trademarks, registered trademarks, patents, and further disclaimers are as listed at <a href="http://www.xilinx.com/legal.htm">http://www.xilinx.com/legal.htm</a>. All other trademarks and registered trademarks are the property of their respective owners. All specifications are subject to change without notice.

NOTICE OF DISCLAIMER: Xilinx is providing this design, code, or information "as is." By providing the design, code, or information as one possible implementation of this feature, application, or standard, Xilinx makes no representation that this implementation is free from any claims of infringement. You are responsible for obtaining any rights you may require for your implementation. Xilinx expressly disclaims any warranty whatsoever with respect to the adequacy of the implementation, including but not limited to any warranties or representations that this implementation is free from claims of infringement and any implied warranties of merchantability or fitness for a particular purpose.

## **Local Clocking Resources**

The Virtex-II architecture has several local clocking resources available: Full Hex lines and Long lines. Some basic terminology is defined first, before discussing the required resources. A configurable logic block (CLB) in the Virtex-II architecture includes four slices and two 3-state buffers. An input/output interconnect (IOI) tile is a group of four input/output blocks (IOBs). For the purpose of this application note, the four IOBs within the IOI tile are numbered to match the slice coordinates, where the bottom IOB is PAD0 (Figure 1).



#### **Full Hex Lines**

Full Hex lines are low-skew resources located throughout the device. These resources allow a clock to be received and distributed to a fixed number of loads within a local region (spanning six IOI tile rows, or six CLB columns/rows) with low delay and low skew.

#### Long Lines

Long lines are low-skew resources that span the entire length and width of the device. Each device has a total of 24 horizontal lines per row and 24 vertical lines per column.

#### **Local Clocking Schemes**

For local clocking schemes, the following solutions are provided by the Virtex-II architecture.

1. Data capture only. For non-free-running strobes/clocks the data transfer uses global clock resources.
2. Data capture and transfer to internal logic for free-running strobes/clocks.

## Data Capture Using IOB Registers Only

In many applications, the primary difficulty is in capturing the data at the IOB flip-flops. Once the data is captured, it can be transferred to a system clock domain with relative ease, without requiring access to internal logic. This is often the case in DDR memory where the clock rates are 200 MHz or less, but the data valid window at the IOB is very small.

In this configuration, the designer can connect directly to the vertical Full Hex (VFULLHEX) line. However, the clock pad must be placed in the top IOB (PAD3) within the IOI tile, on the left or the right side of the device only. From the top IOB there is a direct connection to the two vertical Full Hex lines, one running to the upper six IOI tiles (including the IOI tile that contains the clock) and another running to the bottom six IOI tiles. This is shown in Figure 2. The skew on this local clocking structure is called out in the place and route (PAR) report under the clock table section, which reports the minimum skew of the local clock resource. One important factor is the number of data pads that can be driven by a single local clock. In this scheme, up to 36 data pads can be available per local clock. However, because of $V_{REF}$ and other reserved pins, and the effect of package and die size on bond-out availability, the designer will always have fewer pads available.

For differential I/Os on the left or the right side of the device, PAD2 (P side of the differential input signal) and PAD3 (N side of the differential input signal) are used for a differential input clock. PAD3 can directly access two vertical Full Hex lines, one running to the upper six IOI tiles (including the clock tile) and another running to the bottom six IOI tiles. Because PAD3 is the N side of the differential pair, the clock must be internally inverted.



## Data Capture and Transfer

This scheme is applicable to memory interfaces with a free-running strobe or to source-synchronous designs with performance up to 150 MHz in a Virtex-II device. The maximum number of strobes/clocks and the associated data used in a particular configuration depends on both the device and package. Single-ended inputs as well as differential inputs are discussed in this section.

For designs with performance of up to 150 MHz and aligned clock and data edges, the delay element can be enabled to delay the clock to satisfy setup time requirements. This delay is sufficient to center the clock with respect to the data. Since clock and data delay are closely matched, this pattern is also useful for designs with a clock already center-aligned with respect to data when the delay element is not enabled. For differential inputs, PAD2 (the P side of the differential input) should be used for the strobe/clock. If PAD1 (the N side of the differential input) is used as the clock/strobe pin, the clock must be inverted. The reference design uses a minimum of routing resources, thereby minimizing clock skew and reducing the chances of clock distortion.

There is one local clock route (FULLHEX) in any direction out of a tile (left, right, up, or down). Xilinx recommends using only one local clock per tile. Even though a second path is available from this tile, it has a longer clock route delay and is not recommended due to simultaneously switching outputs (SSO) considerations.

On the top or bottom of the device, one clock can be assigned to PAD1 using the HFULLHEX to the left. PAD1 should be used for a right to left flow.

When using PAD2 or PAD1 in a fully bonded package, there are 36 pins available on the left and right side of the device. Most I/O standards using local clocks require the $V_{REF}$ pin. Typically three $V_{REF}$ pins will be used, leaving 32 pads for data and one for the clock. On the top and bottom there are up to six tiles available. When using the top or bottom, it is possible to have two $V_{REF}$ pins within the six tiles, so only 16 pads will be available for data and clock. If the pins are picked so that there is only one $V_{REF}$ in the array, then there can be 16 data pads and one clock.
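
The pad counts above follow a simple budget: total pads minus $V_{REF}$ pins minus the clock pad. The Python sketch below reproduces that arithmetic. The function name `data_pads` is hypothetical, and the total of 18 pads for a six-tile span on the top or bottom is inferred from the figures quoted above rather than stated directly; treat both as assumptions.

```python
def data_pads(total_pads, vref_pins, clock_pins=1):
    """Return the number of pads left for data after VREF and clock pins.

    total_pads: pads reachable by the local clock (36 on the left/right sides
                per the text; 18 for top/bottom is an inferred assumption).
    """
    if min(total_pads, vref_pins, clock_pins) < 0:
        raise ValueError("pin counts must be non-negative")
    remaining = total_pads - vref_pins - clock_pins
    if remaining < 0:
        raise ValueError("more reserved pins than pads")
    return remaining


print(data_pads(36, 3))  # left/right side, three VREF pins
print(data_pads(18, 2))  # top/bottom, two VREF pins (assumed 18-pad total)
print(data_pads(18, 1))  # top/bottom, one VREF pin (assumed 18-pad total)
```

Actual availability also depends on package bond-out and other reserved pins, so the result is an upper bound.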

If the number of data bits per data strobe/clock does not fit within the 1 x 12 IOI tiles, then use the CLB registers instead of the IOB registers for data capture. The data capture registers should be location constrained to the CLBs adjacent to the IOBs. This method is helpful when wide data buses are required.

### Left Side Banks 6 and 7 (PAD2)

Data can be assigned to the same tile as the clock and five tiles above the clock tile and six tiles below.

The local clock array starts in the tile immediately to the right of the clock and connects five columns of CLBs and one column of BRAM/Multiplier. The local clock pad routes to an HFULLHEX. It connects to six vertical Long lines and then connects to a VFULLHEX up six tiles and a VFULLHEX down six tiles. This makes up the 6 x 12 logic array. This array consists of five CLB columns and one BRAM/Multiplier column. The 6 x 12 array connects to three BRAMs/Multipliers in the column. A second HFULLHEX for the clock array connects the IOBs six tiles below and five tiles above the local clock, as well as the tile that the clock is in.

### Right Side Banks 2 and 3 (PAD1)

Data can be assigned to the same tile as the clock and five tiles above the clock tile and six tiles below.

The local clock route starts in the same tile as the local clock pad and connects to the IOB column, four columns of CLBs, and one column of BRAM/Multiplier. With a right-to-left flow, six BRAMs/Multipliers can be accessed on the low-skew resource. The route connects to a vertical Long line and then connects to two VFULLHEX lines for the 12 tiles above and two VFULLHEX lines for the 12 tiles below the clock tile. This makes up the 6 x 24 array of logic. This array consists of one IOB column, four CLB columns, and one BRAM/Multiplier column.

### Top Banks 0 and 1, or Bottom Banks 4 and 5 (PAD1)

To have the clock assigned to the right of the data pins, the data pins are assigned to the five tiles to the left of the local clock tile and to the tile to which the local clock is assigned.

#### **Right-to-Left Flow**

The local clock route starts in the same tile that the clock is assigned to and extends five tiles to the left. Data pins can be assigned to the same tile the clock pad is assigned to and to the five tiles to the left. The top half of the 6 x 24 is not available on the top of the device; therefore, it is just a 6 x 12 array of logic and an additional row of IOB registers. Depending on placement, this 6 x 12 array may or may not include BRAMs/Multipliers. If BRAMs/Multipliers are required in the design, the number of IOI tiles is reduced, which in turn reduces the number of data pads available.

#### **Constraints**

The 6 x 12 and 5 x 24 (IOBs not included) patterns describe the largest possible array size for the data capture and transfer local clock patterns. The IOBs cannot be included in the array definition. The I/O pins should already be location constrained in the UCF. The RANGE constraint is only used for CLBs. Defining the CLB range guarantees that the clock loads all stay within the array required for the local clock pattern. It is recommended to add an additional RANGE constraint or hard LOC for BRAMs/Multipliers so that they stay within the array. The correct clock location for the BRAMs/Multipliers needs to be chosen.

If lower skew is required, the size of the array should be reduced. It is possible to reduce the overall skew of the array by making the array size three columns by 12 rows ($3 \times 12$). The array is composed of two or more $6 \times 6$ patterns. The top and bottom half of the $6 \times 6$ array are mirrors of each other. Reducing the number of rows to one $6 \times 6$ placed above the clock does not reduce the skew. The array can be defined as three CLBs below and three CLBs above the clock. This creates two 3 x 6 arrays and reduces the overall skew. The same is true for the IOBs. To reduce the skew to IOB registers, place half of the IOB data bits in tiles immediately above the clock pin and the other half immediately below the clock pin. An example of a 6 x 12 array user constraint file follows.

#### Example User Constraint File for 6 x 12 Array

    ## Define a TNM_NET to be used for the clock period and the area_group
    NET "clk_rx_I" TNM_NET = "rx_clk";
    ## Add the delay element for the clock pin if appropriate
    INST "clk_rx_I" IOBDELAY = IBUF;
    ## Define the clock period
    TIMESPEC "TS_rx_clk" = PERIOD "rx_clk" 8 ns HIGH 50%;
    ## Attach the TIMEGRP to an AREA_GROUP
    TIMEGRP "rx_clk" AREA_GROUP = "CLKGRP_RX_CLK";
    ## Define a 6x12 range of CLBs for the local clock. If a BRAM/Multiplier is
    ## required, then define a 5 column by 12 row of CLBs and one column of
    ## BRAM/Multiplier for the local clock. This area_group is located in the
    ## bottom left corner of the device. The actual area_group should be defined
    ## to a range adjacent to the local clock pin
    AREA_GROUP "CLKGRP_RX_CLK" RANGE = SLICE_X0Y0 : SLICE_X9Y23;
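
The SLICE range in the example constraint file can be derived from the CLB array dimensions. The Python sketch below shows one way to do that, assuming each CLB holds its four slices in a 2 x 2 grid so that slice coordinates advance by two per CLB column and row. That arrangement, the helper name `clb_area_to_slice_range`, and its origin parameters are assumptions for illustration. The sketch does not account for BRAM/Multiplier columns, which need their own constraint.

```python
def clb_area_to_slice_range(clb_cols, clb_rows, clb_x0=0, clb_y0=0):
    """Return a SLICE range string covering a rectangle of CLBs.

    Assumes a 2 x 2 slice grid per CLB (an assumption, not stated in the
    text), with (clb_x0, clb_y0) as the lower-left CLB of the rectangle.
    """
    if clb_cols < 1 or clb_rows < 1:
        raise ValueError("array must be at least 1 x 1 CLB")
    x_lo, y_lo = 2 * clb_x0, 2 * clb_y0
    x_hi = 2 * (clb_x0 + clb_cols) - 1
    y_hi = 2 * (clb_y0 + clb_rows) - 1
    return f"SLICE_X{x_lo}Y{y_lo}:SLICE_X{x_hi}Y{y_hi}"


# Five CLB columns by 12 rows in the bottom-left corner, as in the example UCF.
print(clb_area_to_slice_range(5, 12))
# A reduced three-column by 12-row array for lower skew.
print(clb_area_to_slice_range(3, 12))
```

Under this assumption, five CLB columns by 12 rows yields the same range as the example constraint file.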

## Array Examples

### Right Side Banks 2 and 3 (PAD1)

A 6 x 24 array using a single HFULLHEX line on the right side of the device (Banks 2 and 3) is shown in Figure 3. The clock pin must be location constrained to PAD1. The single HFULLHEX routes to a 1 x 12 IOB column, a 1 x 24 BRAM/Multiplier column, and a 4 x 24 CLB array, totaling six columns. Figure 3 shows two VFULLHEX lines along the IOB column: one VFULLHEX line spans six rows up, including the IOI tile row with the clock pin, and the second VFULLHEX line spans six rows below the IOI tile row with the clock pin.

Table 1 lists the minimum (shortest distance) and the maximum (longest distance) delays from the clock net to the destination registers. The destination registers are listed as IOB\_I\_REG (IOB input register), IOB\_O\_REG (IOB output register), BRAM, and CLB\_REG (CLB register). These delays do not include the clock pad delay since it is I/O standard dependent. The delay to the multiplier, not listed here, is slightly more than the BRAM delay. The multiplier delay is within the range of the adjacent CLB column. The skew for each path can be determined by subtracting the minimum delay value from the maximum delay value.

| Speed Grade | 1 x 12<br>IOB_I_REG | 1 x 12<br>IOB_O_REG | 1 x 24<br>BRAM | 4 x 24<br>CLB_REG |
|-------------|---------------------|---------------------|----------------|-------------------|
| -4          | 1.101 - 1.140       | 1.159 - 1.213       | 1.185 - 1.290  | 1.329 - 1.477     |
| -5          | 0.958 - 0.991       | 1.008 - 1.107       | 1.030 - 1.121  | 1.155 - 1.284     |
| -6          | 0.885 - 0.915       | 0.933 - 1.026       | 0.953 - 1.039  | 1.067 - 1.186     |

Table 1: Right Side Banks 2 and 3 (PAD1)
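
The skew calculation described above (maximum delay minus minimum delay) can be applied to every cell of Table 1. The Python sketch below is only an illustration; the dictionary layout and the name `path_skews` are hypothetical, and the values are copied from Table 1 in the table's own units.

```python
# (min_delay, max_delay) pairs copied from Table 1, keyed by the table's row label.
TABLE_1 = {
    "-4": {"IOB_I_REG": (1.101, 1.140), "IOB_O_REG": (1.159, 1.213),
           "BRAM": (1.185, 1.290), "CLB_REG": (1.329, 1.477)},
    "-5": {"IOB_I_REG": (0.958, 0.991), "IOB_O_REG": (1.008, 1.107),
           "BRAM": (1.030, 1.121), "CLB_REG": (1.155, 1.284)},
    "-6": {"IOB_I_REG": (0.885, 0.915), "IOB_O_REG": (0.933, 1.026),
           "BRAM": (0.953, 1.039), "CLB_REG": (1.067, 1.186)},
}


def path_skews(table):
    """Return {row_label: {destination: max_delay - min_delay}}."""
    return {
        grade: {dest: round(hi - lo, 3) for dest, (lo, hi) in paths.items()}
        for grade, paths in table.items()
    }


skews = path_skews(TABLE_1)
for grade, paths in skews.items():
    print(grade, paths)
```

The same function can be reused for the other delay tables by swapping in their values.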




Figure 3: Right Banks 2 and 3 (PAD1)

### Left Side Banks 6 and 7 (PAD2)

A 6 x 12 array using two HFULLHEX lines on the left side of the device (Banks 6 and 7) is shown in Figure 4. The clock pin must be location constrained to PAD2. The HFULLHEX on the right side of the clock pad routes to one BRAM/Multiplier column and five CLB columns, totaling six columns. The HFULLHEX on the left side of the clock pad routes to the IOB column. Figure 4 shows two VFULLHEX lines: one VFULLHEX line spans six rows up, including the IOI tile row with the clock pin, and the second VFULLHEX line spans six rows below the IOI tile row with the clock pin.

Table 2 lists the minimum (shortest distance) and the maximum (longest distance) delays from the clock net to the destination registers. The destination registers are listed as IOB\_I\_REG (IOB input register), IOB\_O\_REG (IOB output register), BRAM, and CLB\_REG (CLB register). These delays do not include the clock pad delay since it is I/O standard dependent. The delay to the multiplier, not listed here, is slightly more than the BRAM delay. The multiplier delay is within the range of the adjacent CLB column. The skew for each path can be determined by subtracting the minimum delay value from the maximum delay value.

| Speed Grade | IOB_I_REG     | IOB_O_REG     | BRAM          | CLB_REG       |
|-------------|---------------|---------------|---------------|---------------|
| -4          | 1.344 - 1.394 | 1.423 - 1.506 | 1.133 - 1.167 | 1.257 - 1.363 |
| -5          | 1.190 - 1.212 | 1.237 - 1.309 | 0.985 - 1.014 | 1.093 - 1.185 |
| -6          | 1.077 - 1.117 | 1.142 - 1.210 | 0.901 - 0.937 | 1.008 - 1.092 |

Table 2: Left Side Banks 6 and 7 (PAD2)




Legend: Green: Vertical Long Line; Orange: VFULLHEX; Red: HFULLHEX; Yellow: BRAM/Multiplier, Slices; Blue: Input Data Strobe/Clock

Figure 4: Left Banks 6 and 7 (PAD2)

### Top Banks 0 and 1 (PAD1)

A 6 x 12 array using a single HFULLHEX line on the top of the device (Banks 0 and 1) is shown in Figure 5. The clock pin must be location constrained to PAD1. This example includes a BRAM/Multiplier column. The HFULLHEX on the left side of the clock pad routes to one BRAM/Multiplier column, four CLB columns, and the column including the IOI tile with the clock pad totaling six columns. Figure 5 shows two VFULLHEX lines spanning 12 rows below the IOI tile row with the clock pad.

Table 3 lists the minimum (shortest distance) and the maximum (longest distance) delays from the clock net to the destination registers. The destination registers are listed as IOB\_I\_REG (IOB input register), IOB\_O\_REG (IOB output register), BRAM, and CLB\_REG (CLB register). These delays do not include the clock pad delay since it is I/O standard dependent. The delay to the multiplier, not listed here, is slightly more than the BRAM delay. The multiplier delay is within the range of the adjacent CLB column. The skew for each path can be determined by subtracting the minimum delay value from the maximum delay value. The delay from the clock net to the BRAM/Multiplier will vary depending on the proximity of the clock pad to the BRAM/Multiplier column. However, the skew for the BRAM/Multiplier column stays the same.

| Speed Grade | IOB_I_REG     | IOB_O_REG     | BRAM          | CLB_REG       |
|-------------|---------------|---------------|---------------|---------------|
| -4          | 1.279 - 1.334 | 1.293 - 1.347 | 1.486 - 1.519 | 1.604 - 1.707 |
| -5          | 1.113 - 1.160 | 1.124 - 1.171 | 1.292 - 1.321 | 1.395 - 1.485 |
| -6          | 1.026 - 1.069 | 1.037 - 1.079 | 1.194 - 1.221 | 1.288 - 1.369 |

Table 3: Top Banks 0 and 1 (PAD1)




Legend: Green: Vertical Long Line; Orange: VFULLHEX; Red: HFULLHEX; Yellow: BRAM/Multiplier, Slices; Blue: Input Data Strobe/Clock

Figure 5: Top Banks 0 and 1 (PAD1)

### Bottom Banks 4 and 5 (PAD1)

A 6 x 12 array using a single HFULLHEX line on the bottom of the device (Banks 4 and 5) is shown in Figure 6. The clock pin must be location constrained to PAD1. This example includes a BRAM/Multiplier column. The HFULLHEX on the left side of the clock pad routes to one BRAM/Multiplier column, four CLB columns, and the column including the IOI tile with the clock pad totaling six columns. Figure 6 shows two VFULLHEX lines spanning 12 rows above the IOI tile row with the clock pad.

Table 4 lists the minimum (shortest distance) and the maximum (longest distance) delays from the clock net to the destination registers. The destination registers are listed as IOB\_I\_REG (IOB input register), IOB\_O\_REG (IOB output register), BRAM, and CLB\_REG (CLB register). These delays do not include the clock pad delay since it is I/O standard dependent. The delay to the multiplier, not listed here, is slightly more than the BRAM delay. The multiplier delay is within the range of the adjacent CLB column. The skew for each path can be determined by subtracting the minimum delay value from the maximum delay value. The delay from the clock net to the BRAM/Multiplier will vary depending on the proximity of the clock pad to the BRAM/Multiplier column. However, the skew for the BRAM/Multiplier column stays the same.

| Speed Grade | IOB_I_REG     | IOB_O_REG     | BRAM          | CLB_REG       |
|-------------|---------------|---------------|---------------|---------------|
| -4          | 1.289 - 1.336 | 1.488 - 1.536 | 1.464 - 1.504 | 1.600 - 1.694 |
| -5          | 1.121 - 1.162 | 1.294 - 1.336 | 1.272 - 1.307 | 1.391 - 1.473 |
| -6          | 1.033 - 1.071 | 1.197 - 1.235 | 1.177 - 1.209 | 1.285 - 1.359 |

Table 4: Bottom Banks 4 and 5 (PAD1)




Legend: Green: Vertical Long Line; Orange: VFULLHEX; Red: HFULLHEX; Yellow: BRAM/Multiplier, Slices; Blue: Input Data Strobe/Clock

Figure 6: Bottom Banks 4 and 5 (PAD1)

## Software Support

#### Software Version: ISE 5.1i SP3 - F.26

#### Place and Route (ISE 5.1i SP3 - F.26)

PAR supports the routing templates for the data strobe/clock for the left and right sides (see **Known PAR Issues (ISE 5.1i SP3 - F.26)**). There is no automatic placer support to place the data strobe/clock with respect to the data signals and internal synchronous elements. The data strobe/clock and data signals must be constrained in a user constraints file (UCF). The use of TIMEGRP with AREA\_GROUP in the UCF is mandatory for synchronous elements in CLBs. See the example in the **Constraints** section.

Use of the USELOWSKEWLINES constraint has no effect in controlling the local clocking routing resources for the patterns described in this application note.

#### Timing (ISE 5.1i SP3 - F.26)

MAXSKEW and MAXDELAY constraints should be used conservatively. The MAXSKEW and MAXDELAY constraint values should adhere to the skew and delay listed in Table 1 through Table 4. Specifying skew or delay values lower than those listed can cause adverse behavior and prevent the use of local clock resources. If lower skew is required, see the **Constraints** section. The MAXSKEW and MAXDELAY constraints should be used to report the skew and delay during timing analysis.
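
One way to apply this guidance is to compare candidate MAXSKEW and MAXDELAY values against the tabulated delays before adding them to the UCF. The Python sketch below is a hypothetical pre-check, not part of the ISE tools; the function name `check_skew_delay_constraints` and the example candidate values are assumptions, while the reference delays are the -5 CLB\_REG entry of Table 2.

```python
def check_skew_delay_constraints(maxskew, maxdelay, table_min_delay, table_max_delay):
    """Return warnings for constraint values tighter than the tabulated ones.

    All arguments use the same units as the delay tables. An empty list
    means the candidate values are no more aggressive than the table.
    """
    table_skew = round(table_max_delay - table_min_delay, 3)
    warnings = []
    if maxskew < table_skew:
        warnings.append(
            f"MAXSKEW {maxskew} is tighter than the tabulated skew {table_skew}")
    if maxdelay < table_max_delay:
        warnings.append(
            f"MAXDELAY {maxdelay} is tighter than the tabulated delay {table_max_delay}")
    return warnings


# Table 2, -5 row, CLB_REG column: 1.093 - 1.185
print(check_skew_delay_constraints(0.100, 1.200, 1.093, 1.185))  # conservative values
print(check_skew_delay_constraints(0.050, 1.000, 1.093, 1.185))  # aggressive values
```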

#### Known PAR Issues (ISE 5.1i SP3 - F.26)

The router in ISE 5.1i SP3 - F.26 does not select the local clock routing resource for the data strobe/clock on PAD2 on the left side of the device. This pattern is supported in the next release of the ISE software, version 5.2i - F.28.

The use of MAXSKEW and MAXDELAY constraints with aggressive skew and delay requirements can cause adverse behavior and prevent the use of local clock resources. Adhere to the skew and delay listed in Table 1 through Table 4.

Post optimization in the router phase can cause the data strobe/clock routing to lose the local clock resource. Support by means of a tactical patch is scheduled for version 5.2i - F.28 of the ISE software.

#### Software Version: ISE 5.2i - F.28

#### Place and Route (ISE 5.2i - F.28)

PAR supports the routing templates for the data strobe/clock for the left, right, top, and bottom of the device. No automatic placer support is available to place the data strobe/clock with respect to the data signals and synchronous elements in CLBs. The data strobe/clock and data signals must be constrained in a user constraints file (UCF). The use of TIMEGRP with AREA\_GROUP in the UCF is mandatory for synchronous elements in CLBs. See the example in the **Constraints** section.

Use of the USELOWSKEWLINES constraint has no effect in controlling the local clocking routing resources for the patterns described in this application note.

#### Timing (ISE 5.2i - F.28)

MAXSKEW and MAXDELAY constraints should be used conservatively. The MAXSKEW and MAXDELAY constraint values should adhere to the skew and delay listed in Table 1 through Table 4. Specifying skew or delay values lower than those listed can cause adverse behavior and prevent the use of local clock resources. If lower skew is required, see the **Constraints** section. The MAXSKEW and MAXDELAY constraints should be used to report the skew and delay during timing analysis.

#### Known PAR Issues (ISE 5.2i - F.28)

The use of MAXSKEW and MAXDELAY constraints with aggressive skew and delay requirements can cause adverse behavior and prevent the use of local clock resources. Adhere to the skew and delay listed in Table 1 through Table 4.

Post optimization in the router phase can cause the data strobe/clock routing to lose the local clock resource. Request a tactical patch to prevent post optimization in the router phase.

## Timing Analysis

The following generic equation is used in all source-synchronous data valid window analysis.

Data Valid Window = Clock\_Period - (Setup + Hold) - Uncertainties

Where setup/hold is the setup/hold time of the receiver data capture flip-flop. The clock\_period in a DDR application is actually the clock\_period divided by two.

Uncertainties = duty cycle distortion + clock jitter + package skew + local clock resource skew

Where duty cycle distortion is only used in DDR applications. The values for these parameters are available in the Virtex-II data sheet.

## Reference Design

The reference design file is available on the Xilinx FTP site at: <u>ftp://ftp.xilinx.com/pub/applications/xapp/xapp609.zip</u>

## Conclusion

The different local clocking resources available in the Virtex-II architecture are described in this application note. This reference design covers data capture and transfer. For further information on data capture using IOB registers only, consult <u>XAPP253</u>: Synthesizable 400 Mb/s DDR SDRAM Controller or <u>XAPP266</u>: Synthesizable FCRAM Controller.

## Revision History

The following table shows the revision history for this document.

| Date     | Version | Revision                |
|----------|---------|-------------------------|
| 01/23/03 | 1.0     | Initial Xilinx release. |
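
As a worked illustration of the data valid window equation from the Timing Analysis material in this note, the Python sketch below evaluates it for single data rate and DDR cases. The function name `data_valid_window` and every numeric input are hypothetical placeholders chosen for the example (only the 8 ns period echoes the example constraint file); real values come from the Virtex-II data sheet and the delay tables.

```python
def data_valid_window(clock_period, setup, hold, clock_jitter, package_skew,
                      local_clock_skew, duty_cycle_distortion=0.0, ddr=False):
    """Data Valid Window = Clock_Period - (Setup + Hold) - Uncertainties.

    All arguments share one time unit. For DDR, the effective clock period
    is half the clock period and duty cycle distortion is included in the
    uncertainties; for single data rate it is ignored.
    """
    effective_period = clock_period / 2 if ddr else clock_period
    uncertainties = clock_jitter + package_skew + local_clock_skew
    if ddr:
        uncertainties += duty_cycle_distortion
    return effective_period - (setup + hold) - uncertainties


# Hypothetical inputs, all in ns, for illustration only.
params = dict(clock_period=8.0, setup=0.5, hold=0.3, clock_jitter=0.1,
              package_skew=0.1, local_clock_skew=0.05,
              duty_cycle_distortion=0.2)

sdr_window = data_valid_window(**params)
ddr_window = data_valid_window(**params, ddr=True)
print(f"SDR window: {sdr_window:.2f} ns, DDR window: {ddr_window:.2f} ns")
```

A positive result indicates remaining margin; a result near zero or negative suggests the interface cannot be captured reliably with those parameters.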