Fundamental limitations separate synthesis from place and route - linked by wire load models - in
180-nm technology and 300-MHz designs. In the
past year, conventional synthesis, even with extensions such as timing-driven place and route,
has given way to physical synthesis, which offers
better performance and improved timing predictability. But physical synthesis alone does not
present a complete answer to the challenge of advanced chip design; current physical synthesis
tools cannot swallow a multimillion-gate design
in one gulp. Instead, the design must be subdivided into manageable blocks.
This is of particular concern to Agilent, which competes in the high-end ASIC market, where designs often
exceed 5 million gates and operate at frequencies
greater than 250 MHz. Those ASICs are used in a variety
of high-performance applications, such as advanced
networking products and computer workstation chip
sets, and our challenge is to maintain performance that
can keep up with cutting-edge CPUs while keeping cost
down and meeting aggressive schedules.
The answer for us is a blended technique that we call
"structured custom." It has evolved from Agilent's
legacy as part of Hewlett-Packard, when we designed instrument chips and portions of CPU cores. The chips
were full custom, and automation was used to help design them. Those early techniques formed the basis of
our thinking today.
The design of a chip typically begins with its partition
into macro functions, each created by an individual designer. A designer at the next level of hierarchy then
places these blocks (macro functions) into a new design
and the process continues until the chip is built. A typical chip today has approximately seven levels of hierarchy and is composed of a hundred individual designs, all
of which must be managed by a limited number of
designers.
In the early days, many, if not all, of our blocks were
custom. Today, Agilent uses data path and custom analog
design for selected blocks in many of our chips. However,
in the interest of greater productivity for both customers
and physical designers, we strive to use an RTL-based
standard-cell approach for most of the blocks.
In our divide-and-conquer approach, it is typical to
split a design so that any macro function can be modified and rebuilt rapidly. Thus, when the engineering
change orders arrive -- and they always do -- we can rebuild the chip quickly because each block is independent and only the affected blocks must be modified. Completing a change order is largely a matter of rerunning the top-level route with the modified blocks.
A key aspect of our desire to turn blocks around rapidly is
achieving one-pass timing closure across all levels of hierarchy. This requires understanding the sources of timing variance and compensating for them. An excellent
way to reduce timing discrepancies is to move from statistical wire load models, known as WLMs, to the location-based RC estimates used in physical synthesis.
The physical synthesis edge
In the past few years, ASIC designers have seen traditional synthesis techniques break down. Starting at 0.35
micron, wire load delays, caused when wire capacitance
slows the driver, become a significant portion of overall
delay. At 0.25 micron, wire delays due to propagation
delay in the wire itself also become significant. And at
0.18 micron, delays arising from wires often exceed gate
delays on critical paths.
The WLM has been the traditional statistical method
of coupling synthesis timing with post-artwork timing. As process technologies shrink, interconnect exerts a
greater influence on overall delay, and WLM-based timing correlates less well with post-artwork timing.
Synthesis tool writers have looked into the wire parasitic problem and realized that it is quite complicated. A
wire can be described by several factors, including
length, width, neighboring wires, gate loading and fan-out. Gate loading and fan-out are the "knobs" that are
directly controlled by synthesis. WLMs make the assumption that fan-out can predict a wire's parasitics. Therefore, in a single integer -- that is, fan-out -- synthesis tools have tried to wrap up an extremely complicated
problem.
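As a rough illustration of what a WLM does, the sketch below estimates a net's wire capacitance from fan-out alone. The table values, the extrapolation slope and the per-micron capacitance are invented for illustration; they are not taken from any real library or process.

```python
# Hypothetical wire load model: fan-out -> estimated wire length (microns).
# All numbers are illustrative assumptions, not data from a real library.
WLM_LENGTH_UM = {1: 20.0, 2: 45.0, 3: 75.0, 4: 110.0}
SLOPE_UM_PER_FANOUT = 40.0  # assumed extrapolation beyond the table
CAP_FF_PER_UM = 0.2         # assumed wire capacitance per micron


def wlm_length_um(fanout: int) -> float:
    """Estimate wire length from fan-out alone, as a WLM does."""
    if fanout < 1:
        raise ValueError("fan-out must be at least 1")
    max_fanout = max(WLM_LENGTH_UM)
    if fanout <= max_fanout:
        return WLM_LENGTH_UM[fanout]
    # Beyond the table, extend linearly from the last entry.
    return WLM_LENGTH_UM[max_fanout] + SLOPE_UM_PER_FANOUT * (fanout - max_fanout)


def wlm_cap_ff(fanout: int) -> float:
    """Estimated wire capacitance in femtofarads for a net with this fan-out."""
    return wlm_length_um(fanout) * CAP_FF_PER_UM


if __name__ == "__main__":
    # Two nets with the same fan-out get the same estimate,
    # no matter how far apart their pins end up after placement.
    print(wlm_cap_ff(2), wlm_cap_ff(6))
```

The function takes a single integer as input, so it cannot tell a short local net from a long cross-block net with the same fan-out.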
What differentiates physical synthesis tools, like Synopsys' Physical Compiler, is their ability to use knowledge of placement to make more accurate estimates of
wire delay. Unlike WLMs, which are based on a statistical
distribution of wires with a common fan-out, Physical
Compiler estimates wire resistance and capacitance
wire by wire.
To work quickly, the physical synthesis tool uses
Steiner or half-perimeter estimates to calculate wire
length for each net based on the locations of the pins to
which it is attached. Currently, Physical Compiler uses a
lumped RC model based on horizontal and vertical resistance and capacitance parameters for the wires. These
location-based estimates provide a much more accurate
prediction of post-route timing.
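A minimal sketch of a location-based estimate follows, assuming a half-perimeter length and a single lumped R and C per net. It is not Physical Compiler's actual algorithm; the function names and the per-micron parameters in the usage note are hypothetical.

```python
from typing import Sequence, Tuple

Pin = Tuple[float, float]  # (x, y) pin location in microns


def bounding_box_spans(pins: Sequence[Pin]) -> Tuple[float, float]:
    """Return the horizontal and vertical spans of the pins' bounding box."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    return max(xs) - min(xs), max(ys) - min(ys)


def half_perimeter_um(pins: Sequence[Pin]) -> float:
    """Half-perimeter wire length estimate: bounding-box width plus height."""
    dx, dy = bounding_box_spans(pins)
    return dx + dy


def lumped_rc_delay_ps(pins: Sequence[Pin],
                       r_h: float, r_v: float,   # ohms per micron
                       c_h: float, c_v: float,   # fF per micron
                       load_ff: float) -> float:
    """Lumped RC delay: total wire R times (total wire C + pin load)."""
    dx, dy = bounding_box_spans(pins)
    r_wire = r_h * dx + r_v * dy        # ohms
    c_wire = c_h * dx + c_v * dy        # femtofarads
    # 1 ohm * 1 fF = 1e-15 s = 1e-3 ps
    return r_wire * (c_wire + load_ff) * 1e-3
```

With made-up parameters, `lumped_rc_delay_ps([(0, 0), (100, 50), (40, 200)], 0.1, 0.2, 0.2, 0.1, 10.0)` uses a 100-micron horizontal span and a 200-micron vertical span. Unlike a fan-out lookup, the result changes whenever placement moves a pin outside the current bounding box.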
The impact of block size
The longest possible wire in a block, barring a meandering route, runs from corner to corner, horizontally across
the width of the block and vertically across its height.
Thus, a block's half-perimeter bounds the worst-case, direct, point-to-point route. Similarly, it bounds direct multiple fan-out nets.
As blocks grow, so do their half-perimeters. So for any
process, larger blocks tend to have longer wires. Although some wires in a large block are short, there are
generally several quite long ones, which bring large
wire capacitance and resistance and long wire delays. Consequently, larger blocks are increasingly susceptible
to larger wire delays.
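To put a number on this scaling, the sketch below assumes a square block with a constant area per gate. Both are simplifying assumptions introduced here, not figures from the article. Under them, the half-perimeter grows with the square root of the gate count.

```python
import math


def block_half_perimeter_um(gates: int, area_per_gate_um2: float) -> float:
    """Half-perimeter (width + height) of a square block, in microns.

    Assumes a square outline and a uniform, hypothetical area per gate.
    """
    side = math.sqrt(gates * area_per_gate_um2)
    return 2.0 * side


def half_perimeter_ratio(small_gates: int, large_gates: int) -> float:
    """How much longer the worst-case direct route gets as a block grows.

    The area per gate cancels out, so only the gate counts matter.
    """
    return math.sqrt(large_gates / small_gates)


if __name__ == "__main__":
    # Growing a block from 75,000 to 200,000 gates stretches the
    # worst-case direct route by a factor of about 1.63.
    print(round(half_perimeter_ratio(75_000, 200_000), 2))
```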
Furthermore, not all routes are direct. As they meander to avoid localized congested hot spots, WLM and
even physical synthesis estimates become less accurate. Just as growing block size increases variance from
WLM estimates, larger blocks are more susceptible to
meander-induced error. Although physical synthesis
greatly enhances the accuracy of timing prediction, it is
still sensitive to increasing block size.
Fig. 1 shows how timing variance relates to block size
in a 0.18-micron process. Each curve shows typical error
for a synthesis prediction vs. actual extracted timing. The first curve highlights the error of a WLM-based prediction. The second one illustrates the decreased error
of a physical synthesis timing estimate.
By introducing a little pessimism into our WLMs, Agilent can tolerate small timing errors and still achieve
one-pass timing closure. As Fig. 1 shows, WLM-based
estimates are valid for blocks up to approximately
75,000 gates. Similarly, by using interconnect estimates
that are equally pessimistic, it is possible to achieve
one-pass timing closure using physical synthesis for up
to 200,000 gates.
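The two limits above can be read as a simple decision rule. The sketch below encodes them; the thresholds are the approximate 0.18-micron figures quoted in the text, while the function name and return labels are hypothetical.

```python
# Approximate one-pass timing closure limits quoted for a 0.18-micron process.
WLM_LIMIT_GATES = 75_000
PHYSICAL_SYNTHESIS_LIMIT_GATES = 200_000


def one_pass_flow(gates: int) -> str:
    """Return the simplest flow expected to close timing in one pass."""
    if gates <= 0:
        raise ValueError("gate count must be positive")
    if gates <= WLM_LIMIT_GATES:
        return "wlm-synthesis"
    if gates <= PHYSICAL_SYNTHESIS_LIMIT_GATES:
        return "physical-synthesis"
    # Larger than either limit: the block should be subdivided further.
    return "partition"
```

For example, a 150,000-gate block falls in the "physical-synthesis" range, and a 250,000-gate block is flagged for further partitioning.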
To get the most out of hierarchical design, we try to
determine the optimal block size. Our main objective is
choosing a block size that gives us a good chance of
one-pass timing closure on each block. Another goal is
keeping the block size large in order to hold the number
of standard-cell blocks to a minimum (50 blocks is a typical target). As the number of gates continues to grow
exponentially with each generation of chips, achieving
one-pass timing closure on larger blocks is a key to productivity -- and physical synthesis is the latest tool to
help us achieve aggressive productivity objectives.
The merits of hierarchy
Using an average block size of 150,000 gates, a 10 million gate design results in 67 blocks. Physical Compiler
does a good job of achieving one-pass timing closure
with blocks of this size. This is a big improvement over
the 174 blocks that result from using conventional place
and route techniques with the largest WLM-based blocks
that still achieve one-pass timing closure.
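The block counts follow from simple division, as the short sketch below shows. The helper names are hypothetical; the gate counts are the ones quoted in the text.

```python
import math


def blocks_needed(total_gates: int, avg_block_gates: int) -> int:
    """Number of blocks required, rounding up to cover every gate."""
    return math.ceil(total_gates / avg_block_gates)


def implied_average_gates(total_gates: int, block_count: int) -> float:
    """Average block size implied by a given block count."""
    return total_gates / block_count


if __name__ == "__main__":
    print(blocks_needed(10_000_000, 150_000))              # 67
    print(round(implied_average_gates(10_000_000, 174)))   # 57471
```

The second figure shows that 174 blocks corresponds to an average of roughly 57,000 gates per block, below the approximately 75,000-gate limit for WLM-based estimates.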
Once we identify the standard-cell blocks, early floor
planning may begin. It enables exploration of trade-offs in the
physical architecture of the chip well before the blocks
are completed. Our early floor planner requires only block size and shape estimates and a top-level netlist that connects them.
There are several advantages to such floor planning. The most important is that any major architectural obstacles that affect timing are identified early in the design cycle. Once we are aware of the timing concerns, we can choose whether to address them from the RTL level
or the physical level. Another key advantage offered by
a timing-aware floor planner is that it can generate major
timing constraints, which are needed to enable budgeted block synthesis, physical synthesis or both.
When the floor plan begins to solidify, the design
shifts from the top-down "divide" stage to the bottom-up "conquer" stage. At this point, there are clear specifications for block size, shape, timing and port
locations. These specifications allow the chip design
tasks to be subdivided as required. Generally, each designer is responsible for several blocks. Because the design is hierarchically split and the blocks can be
processed with a great deal of independence, many designers are typically put on the design for a short time
to help accelerate it. Theoretically, for a 10 million gate design there is nothing stopping us from using
67 independent designers to manage the 67 standard-cell blocks if a schedule required us to accelerate the
design.
Hierarchical design enables this divide-and-conquer
approach, making it possible to perform different parts of
the design in parallel. Simple blocks in the chip are implemented with one level of hierarchy and consist simply
of standard cells. More complex blocks are partitioned
into multiple levels of hierarchy. The submodules of complex blocks can be individually implemented and reassembled bottom-up.
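Bottom-up reassembly amounts to a post-order walk of the design hierarchy: every submodule is built before the block that contains it. The sketch below illustrates the idea. The block names are invented, and the dictionary representation is an assumption rather than a description of any actual tool flow.

```python
from typing import Dict, List


def bottom_up_order(children: Dict[str, List[str]], top: str) -> List[str]:
    """Return blocks in an order where each block follows all its submodules."""
    order: List[str] = []
    done = set()

    def visit(block: str) -> None:
        if block in done:          # a shared submodule is built only once
            return
        for child in children.get(block, []):
            visit(child)
        done.add(block)
        order.append(block)

    visit(top)
    return order


if __name__ == "__main__":
    # Hypothetical hierarchy: one simple block and one complex block.
    hierarchy = {
        "chip": ["cpu_if", "dma"],
        "dma": ["dma_ctrl", "dma_fifo"],
    }
    print(bottom_up_order(hierarchy, "chip"))
```

Blocks with no entry in the dictionary, such as `cpu_if` here, are simple one-level blocks made only of standard cells. Blocks that do not contain one another are independent and can be built in parallel.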
At the top level of the chip, simple and complex
blocks are assembled. Each block, regardless of how
many sublevels of hierarchy it contains, is treated as a
hard macro at the top level. Similarly, hard IP macros are
handled just like any other piece of the chip. Thus, our
hierarchical design framework mirrors the SoC approach
that is advocated by many in the IC design community.
Floor planning
Floor planning is an iterative process that can start before the RTL is finished and continues until final integration of the hierarchical pieces begins. Although it is a
continuous process, it can be roughly divided into two
phases: early and malleable (see Fig. 2).
The early phase involves rapid exploration of physical
design alternatives; it also highlights possible timing obstacles in the logical design and
involves many quick iterations. We want to be able to run several trials in a day, with each iteration preferably taking less
than an hour. This is possible with top-level netlists of about 50,000 nets and
simplified models for delay and congestion. The estimates are typically within 15
percent of the actual congestion and timing results.
As the early floor plan
changes, so do many of the
block timing budgets. Block size
tends to be a strong function of
the block timing budget. As
blocks shrink and grow with changes in the budget, the
next floor plan iteration accommodates the new block
size estimates. After a few iterations this process converges and block size stabilizes.
As the early floor plan is refined, it quickly becomes
apparent which paths are going to present a timing challenge. The 80-20 rule typically applies: 20 percent of the
paths represent 80 percent of the difficulty, so identifying the troublesome 20 percent early gives us more time
to deal with them. Among these paths there are usually
a handful of particularly tough ones. Catching those
early, while the RTL is being developed, allows some fine
tuning of the RTL and avoids time-consuming custom
artwork solutions late in the design cycle. The remaining
paths are addressed with timing-aware floor planning.
There is no clear line dividing the early floor planning
phase from the malleable phase. (We use the term
"malleable" because it implies a degree of solidity, but
also a degree of adjustability.) The malleable floor plan
is beginning to firm up, but it is by no means frozen --
the malleable phase's primary feature is incremental refinement of the floor plan. We move and adjust blocks,
pushing them together or apart as needed, keeping
their relative positions fairly constant. Similarly, block
size remains reasonably consistent in the malleable
phase, generally changing by no more than 10 percent.
Malleable floor planning involves incremental improvement of the general floor plan created in the early
phase. With the block size estimates firming up, we
begin getting meaningful congestion data from trial
routes of the top-level netlist. We run trial routes periodically throughout the floor planning process (early
and malleable) for two reasons. First, they ensure that
the evolving floor plan can be routed and they give early
warning of possible routing hot spots. Second, trial
route data provides more accurate timing estimates because it includes detailed route information.
The malleable phase tends to include a significant number of placed-and-routed blocks, rather than simple rectangular estimates. These blocks function as hard
macros that enable the trial routes to accurately reflect
over-the-block routing. Similarly, the block timing models for placed and routed blocks reflect actual extracted
timing rather than estimates based on synthesis runs,
physical synthesis runs or both.
As the dummy block rectangles are replaced with the
real Library Exchange Format models, the malleable floor plan
models timing and congestion with increased accuracy.
Fewer, subtler refinements are made until timing is met
and the ability to route is assured. Then this final floor
plan is frozen and becomes the blueprint for assembling
the chip.
The designer's perspective
Physical designers begin with the design in the early floor
planning state. It is important at this stage to estimate
the size of the block from synthesis runs and to guess
how much logic will be added to the block. Normally, customers give only block diagrams of what they believe the
chip will look like; from these diagrams, the major
buses and critical timing signals are estimated. Clocking
and power needs are also identified at this stage. These
estimates are then put into a specification that is used by
the designer assigned to floor planning.
Early estimates are important because the best way
to have a successful chip is to catch problems early. If
the RTL has not been written or is still incomplete, there
is a greater chance that the specifications and protocols
of the chip can be influenced to produce a chip that is easier to design.
The floor planner will generate timing estimates, which will be used in
first-pass synthesis. As the RTL is delivered, the blocks are built and rebuilt in
an iterative fashion. Blocks shrink and
grow (usually grow) from the initial estimates and the floor plan evolves accordingly. The floor plan results in
updated interblock delay estimates that
are budgeted back to block constraints.
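The article does not say how the interblock estimates are divided among blocks, so the sketch below shows just one simple, hypothetical policy. It subtracts the estimated top-level wire delay from the clock period and splits the remainder between the launching and capturing blocks. The function name, the parameters and the default 50/50 split are all assumptions.

```python
from typing import Tuple


def budget_block_constraints(clock_period_ns: float,
                             interblock_delay_ns: float,
                             source_share: float = 0.5) -> Tuple[float, float]:
    """Split the time left after top-level routing between two blocks.

    Returns (source_block_budget_ns, destination_block_budget_ns).
    """
    if not 0.0 <= source_share <= 1.0:
        raise ValueError("source_share must be between 0 and 1")
    remaining = clock_period_ns - interblock_delay_ns
    if remaining <= 0.0:
        raise ValueError("interblock delay alone exceeds the clock period")
    source_budget = remaining * source_share
    return source_budget, remaining - source_budget


if __name__ == "__main__":
    # A 250-MHz clock has a 4-ns period; 0.8 ns is a made-up wire estimate.
    print(budget_block_constraints(4.0, 0.8))
```

When a floor plan change lengthens the interblock route, rerunning the calculation tightens both block budgets. That is the feedback loop the paragraph describes.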
Two factors
When the blocks are complete, they are
routed together. This is where two factors
are put to the test: Were the floor
plan predictions of route congestion
and timing accurate, and did the blocks
meet their requirements? Fortunately,
both concerns can be addressed early. First, the blocks are tested for consistency with their requirements as they
are completed. Second, as the blocks
evolve, several trial routes of the chip
are performed (using conservatively
dense Library Exchange Format models
for the unfinished blocks). These trial
routes provide an ongoing double check
to verify that the floor plan is on-target.
Though it is not possible for current
physical synthesis tools to tackle a flat,
high-speed 10 million gate design, with
a hierarchical approach, early floor
planning and a solid final assembly solution in place, physical synthesis
greatly enhances productivity by making one-pass timing closure an option
on larger blocks.