VLSI and Embedded Systems: June 2015

Tuesday, June 23, 2015

Need For Testing Testbench

Need for testing:

1. To verify the functionality of the design according to specifications.

2. To perform random testing and hence to reduce the test time of the design.

3. To detect failures so that bugs (in software code) can be identified and corrected before it gets shipped to costumer. (There are possibilities that error may in the specification itself. Sometimes miscommunications between teams may lead to wrong design.)

A test bench or testing workbench is a virtual environment used to verify the correctness or soundness of a design or model, for example, a software product.
In hardware testing of electronic devices, where an engineer would sit at a lab bench with tools for measurement and manipulation, such as oscilloscopes, multimeters, soldering irons, wire cutters and manually verify the correctness of the device under test (DUT).
In software or firmware or hardware engineering, a test bench refers to an environment in which the product under development is tested with the aid of software and hardware tools.
The software may need to be modified slightly in some cases to work with the test bench but careful coding can ensure that the changes can be undone easily and without introducing bugs.

Components of a test bench

A test bench has four components:

Input: The entrance criteria or deliverables needed to perform work
Procedures to do: The tasks or processes that will transform the input into the output
Procedures to check: The processes that determine that the output meets the standards
Output: The exit criteria or deliverables produced from the workbench

Kinds of test benches

The following types of test bench are the most common: Test Benches

Stimulus only — contains only the stimulus driver and DUT; does not contain any results verification.
Full test bench — Contains stimulus driver, known good results, and results comparison.
Simulator specific — the test bench is written in a simulator-specific format.
Hybrid test bench — Combines techniques from more than one test bench style.
Fast test bench — Test bench written to get ultimate speed from simulation.

Test Benches

To simulate your design, you need both the design under test (DUT) or unit under test (UUT) and the stimulus provided by the test bench.

A test bench is HDL code that allows you to provide a documented, repeatable set of stimuli that is portable across different simulators.

A test bench can be as simple as a file with clock and input data or a more complicated file that includes error checking, file input and output, and conditional testing.

Methods to create the test bench:

Text editor

This is used to verify complex designs. It allows you to use all the features available in the HDL language and gives you flexibility in verifying the design.

Advantages: It produces more precise and accurate results than using the Test Bench Waveform Editor.

Disadvantages: It is more challenging to create the code.

To assist in creating the test bench, you can create a template that lays out the initial framework, including the instantiation of the UUT and the initializing stimulus for your design.

Test Bench Waveform Editor

This is used to verify less complicated simulation tasks and is recommended if you are new to HDL simulation. It allows you to graphically enter the test bench to drive the stimulus to your design.

Advantages: Easy to work.

Disadvantages: implementation and analysis of complex designs.

Test Bench Strategies

Because the test bench becomes a part of the hierarchy in your code, the following has to considered:

Make the test bench the top level of the code.

The test bench instantiates the unit under test (UUT), and stimulus is applied from the top-level test bench to the lower-level design or portion of the design being tested.

Use the instance name UUT for the instantiated unit under test.

This is the default instance name that Project Navigator expects.

You can use the same test bench for both functional and timing simulation. Following are general recommendations that apply to both types of simulation:

Initialize all input ports at simulation time zero, but do not drive expected stimulus until after 100 nanoseconds (ns) simulation time.

During timing simulation, a global set/reset signal is automatically pulsed for the first 100 ns of simulation. To keep the test bench consistent for both timing and functional simulation, it is recommended that you hold off input stimulus until the global set/reset has completed.

Do not provide data to the input ports at the same time as the clock.

For non-timing simulation, this can cause some signals to be applied before the clock and some after the clock. Apply data only after the clock is applied to the input ports. This makes it easier to keep track of which clock edge data changes are being applied.

If output checking is performed in the test bench, apply data just before the next clock cycle.

For timing simulation, it could take up to an entire clock cycle for the data to propagate through the logic and settle to a known value. Checking data too early in the clock cycle may yield incorrect results.

Example:

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;

entity test is
port (clk : in std_logic;
count : out std_logic_vector(3 downto 0);
reset :in std_logic
);
end test;

architecture Behavioral of test is
signal c : std_logic_vector(3 downto 0) :=(others => '0'); --initializing count to zero.
begin
count <= c;
process(clk,reset)
begin
if(clk'event and clk='1') then
-- when count reaches its maximum(that is 15) reset it to 0
if(c = "1111") then
c <="0000";
end if;
c <= c+'1'; --increment count at every positive edge of clk.
end if;
if(reset='1') then --when reset equal to '1' make count equal to 0.
c <=(others => '0'); -- c ="0000"
end if;
end process;

end Behavioral;

LIBRARY ieee;
USE ieee.std_logic_1164.ALL;
USE ieee.std_logic_unsigned.all;

-- entity declaration for your testbench.Dont declare any ports here
ENTITY test_tb IS
END test_tb;

ARCHITECTURE behavior OF test_tb IS
-- Component Declaration for the Unit Under Test (UUT)
COMPONENT test --'test' is the name of the module needed to be tested.
--just copy and paste the input and output ports of your module as such.
PORT(
clk : IN std_logic;
count : OUT std_logic_vector(3 downto 0);
reset : IN std_logic
);
END COMPONENT;
--declare inputs and initialize them
signal clk : std_logic := '0';
signal reset : std_logic := '0';
--declare outputs and initialize them
signal count : std_logic_vector(3 downto 0);
-- Clock period definitions
constant clk_period : time := 1 ns;
BEGIN
-- Instantiate the Unit Under Test (UUT)
uut: test PORT MAP (
clk => clk,
count => count,
reset => reset
);

-- Clock process definitions( clock with 50% duty cycle is generated here.
clk_process :process
begin
clk <= '0';
wait for clk_period/2; --for 0.5 ns signal is '0'.
clk <= '1';
wait for clk_period/2; --for next 0.5 ns signal is '1'.
end process;
-- Stimulus process
stim_proc: process
begin
wait for 7 ns;
reset <='1';
wait for 3 ns;
reset <='0';
wait for 17 ns;
reset <= '1';
wait for 1 ns;
reset <= '0';
wait;
end process;

END;

2. If multiple clocks are to be generated with different frequencies, then clock generation can be simplified if a procedure is called as concurrent procedure call.

library ieee;

use ieee.std_logic_1164.all;

entity tb is

end entity;

architecture sim of tb is

-- Procedure for clock generation

procedure clk_gen(signal clk : out std_logic; constant FREQ : real) is

constant PERIOD : time := 1 sec / FREQ; -- Full period

constant HIGH_TIME : time := PERIOD / 2; -- High time

constant LOW_TIME : time := PERIOD - HIGH_TIME; -- Low time; always >= HIGH_TIME

begin

-- Check the arguments

assert (HIGH_TIME /= 0 fs) report "clk_plain: High time is zero; time resolution to large for frequency" severity FAILURE;

-- Generate a clock cycle

loop

clk <= '1';

wait for HIGH_TIME;

clk <= '0';

wait for LOW_TIME;

end loop;

end procedure;

-- Clock frequency and signal

signal clk_166 : std_logic;

signal clk_125 : std_logic;

begin

-- Clock generation with concurrent procedure call

clk_gen(clk_166, 166.667E6); -- 166.667 MHz clock

clk_gen(clk_125, 125.000E6); -- 125.000 MHz clock

-- Time resolution show

assert FALSE report "Time resolution: " & time'image(time'succ(0 fs)) severity NOTE;

end architecture;

Simulation Overview

During HDL simulation, the simulator software verifies the functionality and timing of your design or portion of your design. The simulator interprets VHDL or Verilog code into circuit functionality and displays logical results of the described HDL to determine correct circuit operation. Simulation allows you to create and verify complex functions in a relatively small amount of time.

Simulation takes place at several points in the design flow. It is one of the first steps after design entry and one of the last steps after implementation, as part of verifying the end functionality and performance of the design. Simulation is an iterative process, which may require repeating until both design functionality and timing is met. For a typical design, simulation comprises the following high-level steps:

Compilation of the simulation libraries
Creation of the design and test bench
Functional simulation
Implementation of the design and creation of the timing simulation netlist
Timing simulation

Functional Simulation

After you compile the simulation libraries and create the test bench and design code, you can perform functional simulation on the design.

Functional simulation is an iterative process, which may require multiple simulations to achieve the desired end functionality of the design.

Once the desired functionality is achieved, use the output data to create a self-checking testbench. ISim has the capability to generate a self-checking testbench. It is important to set up the proper infrastructure for this type of simulation; spending time up front may save time in back end design debugging.

The general recommendations are:

Spend time up front to create a good test bench.

Creating a test bench that you can use for both functional and timing simulation can save time.

Ensure that your libraries are properly compiled and mapped.

If your design includes Xilinx CORE Generator cores, make sure that your libraries are properly compiled and mapped to the simulator. If you are using the ISim or the ModelSim Xilinx Edition simulator, this is automatically done for you.

Automate the compilation and elaboration simulation steps.

When you invoke the simulator from within Project Navigator, the ISE® tools automatically run these steps. If you are running the simulator outside of Project Navigator, it is recommended that you create a script or use another method to automate these steps.

Customize the simulator interface to present the information needed to debug the design.

You may want to include the information console, the structure or hierarchy view, and the waveform viewer as well as other facilities to evaluate the simulation. Customization can improve the simulation experience and can be tied into the automation of the compilation and elaboration steps.

If you are using a waveform viewer as a part of simulation debugging, organize the waveform view to display the proper signals in the proper order with the proper radices. This saves time and helps prevent confusion in interpreting simulation results.

CFAA UNIT 8 complete topics

CHAPTER 4

DESIGN PROCESS FLOWS AND SOFTWARE TOOLS

The purpose of this chapter is to:

Describe the functionality of design tools, both generic and those appropriate to a field-programmable gate-array (FPGA)
Describe some typical examples, from start to finish, of design process flows, using manufacturer-supplied cases. For simplicity only single-chip designs will be used at this stage. We include schematic-based, behavioral-and floorplan-based examples.

4.1 THE SOFTWARE TOOLBOX

Designing is a very exciting part of engineering. Leaving aside formal definitions, it is the phase of engineering where creativity and problem solving play the most important part. These activities allow explorations of design options by trial and error, engineering judgment, “guesstimation,” analysis, measurement, simulation, breadboarding, and so forth. There is normally an overall strategy, however, to guide the transition from specification to implementation, as discussed in Chapter 1. What the engineer must do is flesh out a complete behavioral, structural, and physical description so that a design can be built and tested. With these often onerous tasks and informal procedures, we can expect tools to “get in the way” of designing. The purpose of this chapter is to be informative about the applicability of tools for FPGA design and suggest fairly loose design process flows. The treatment is based on taking a simple design example, the 7-segment display driver, through a behavioral and structural design process. This example is conceptually simple, but it is a nontrivial 4-input, 7-output function. It is also a standard transistor-transistor logic (TTL) catalog part, the TTL7446. This serves to emphasize alternatives in design process flows as well as highlighting the relevance, or irrelevance, of specific tools.

This treatment is in distinction to some more conventional approaches, mostly as found in manufacturers’ literature, which often portray the design process as a sequential walk through a small number of design steps, yielding a perfect result every time.

Another source of confusion is the plethora of tools available today. This is because the electronic CAD industry has matured over the last 10 years into a number of sustainable businesses (see Chapter 7). The design bottlenecks of the mid-1970s have been eliminated by structured design techniques embodied in new tools and often the products of start-up companies. Today such companies are mature and concentrating on marketing their products. The noise level is rising, and it is important to be able to identify the functionality of a tool to assess its relevance to FPGA design. Table 4–1 enumerates tool types against design description and level of design abstraction.

As a starting point, let us examine the primary activities in designing. These may be termed design capture, design validation, and physical design. Figure 4–1 shows possible flows between activities as well as tools that can assist in the processes within activities. Note that the starting point for designing is the existence of a specification and that the end point is the production of manufacturing data. Transitions in the flow are driven by design refinement, yielding acceptable or infeasible designs. Transitions to revise the specification are allowed.

Figure 4–1. Design process flows.

Now in the case of FPGAs and our example, to get into the flow we need a specification. The specification of a 7-segment display driver is intuitively obvious, given a labeling scheme for the segments (see Figure 4–2) and no performance or electrical requirements. It could be, “drive the segments of the display to show the decimal digits 0–9 from their 4-bit binary representation as input.” Well, how do tools in a contemporary toolbox help, given this simple specification? The following sections illustrate process steps.

4.1.1 Design Capture

All design capture tools provide functions for capturing design intent. They may capture structural intent, behavioral intent, or both. They may be graphically based, for example, schematics, or language based, for example, hardware description language (HDL).

Figure 4–2. Labeling of 7-segment display.

Behavioral Tools Behavioral tools address the question “what do I want this circuit to do?” Since the behavior of a circuit is a mapping between a set of input states and a set of output states, dependent on internal state (for sequential circuits), then representations, such as truth tables, logic equations, and state descriptions, are particularly useful. Truth tables are a perfectly simple and universal behavioral description. The initial truth table for the display decoder is shown in Table 4–2. With the availability of programmable array logic (PALs), several commercial behavioral HDL notations have emerged as de facto standards: ABEL, CUPL, PALASM. The PALASM2 form of the same design is shown in Figure 4–3. For the established PAL families there is even a standard file format for specifying fuse maps. In some cases these can be interpreted as truth tables, but they are really a physical description of features of a chip.

Synthesis If behavioral tools only provided methods of describing circuit functions, then they would be at least useful in refining specifications, assuming the behavioral description could be animated. To take a first step to a real manufacturable entity, however, a set of elements with defined behaviors must be hooked up to make a circuit. Now if the behavioral semantics of these elements are well understood, for example, in the case of logic gates, 2-level logic blocks like PLAs, then it is possible to transform a behavioral description into a structural composition of known parts. This is called synthesis. Trivial examples would include interpreting a logic diagram as a behavioral and structural design; mapping a set of logic equations into a PLA; decomposing an HDL into a set of register-transfer-level interconnected elements. Optimization is associated with synthesis and is applied to the automatic manipulation of logic equations to yield an optimal, in some sense, logic-level design. Today optimization and synthesis products are available from a range of companies.

Structural Tools Structural tools address capturing the design statement, “I think this interconnection of units will exhibit the required behavior.” This is the more traditional design metaphor, and it assumes the engineer is using knowledge of the behavior of units, experience, judgment, and so forth, in making a plausible design statement. Historically, this metaphor has always been supported by schematic tools. In the case of the display driver, a schematic is shown in Figure 4–4. This nonobvious design is structured across three levels of logic. The top level drawing uses a left-to-right signal flow convention.

Hardware Description Languages In recent times HDLs like VHDL [Mazor93] and Verilog have emerged as alternative textual tools for design capture. Their present existence is based on work over a long period of time to discover how to exploit the power of programming notations in hardware design. In designs with repeated structures or conditional structures, there are obvious advantages in using HDL descriptions, assuming HDL compilers exist that can sensibly instantiate the design to a low-level netlist form.

This type of HDL, compared to the purely behavioral type, can be used for structural design and/or behavioral design. The most common use is as a structural design tool to allow more rapid generation of netlist data, that is, descriptions of interconnections of gate level elements. Figure 4–5 shows a VHDL description of the schematics.

4.1.2 Tools for Design Validation

Validation tools address the question “what will this (structural) design do and how fast will it do it?” A broad spectrum of simulator products, from circuit-level simulators, which solve the underlying differential equations by numerical methods, to ad hoc functional simulators in a high-level programming language, can be used. However, with the emergence of component families at mediumscale integration (MSI), large-scale integration (LSI), and very-large-scale integration (VLSI) levels, logic level simulators have become the key design validation tools. Since FPGAs by and large deliver gate level functions, logic level simulators are particularly relevant.

Logic simulators are good at answering the “what” question, at a cost in computer time. The “how fast” question has been addressed by simulation tools in which time delays are computed more accurately, such as by using RC circuit models in switch level simulators. These have proved essential tools in getting full custom VLSI designs to operate correctly the first time they are fabricated. Now with FPGAs the computation of wire path delays can be complicated because the paths are active in the sense that routing uses transistor switches and multiplexors. It is infeasible to compute the behavior of FPGAs at switch level. Delays through FPGA cells can, however, be modeled by a finite set of constants and path-delay computations built into timing estimators. Since the performance of a circuit is usually determined by a single critical path, estimator tools can replace timing analysis.

Figure 4–4. Display driver schematic.

Figure 4–5. VHDL for display driver.

Our example is a purely combinational circuit, and the design can be validated by applying the ten legal input vectors and observing that the output vectors are consistent with Table 4–2.

FPGA technology raises some interesting questions on the relevance of validation tools. Simulation tools have only gained users in recent times due to rising complexity in systems design and difficulties in designing application-specific integrated circuits (ASICs). Before simulation, systems were built on breadboards with standard components and a rapid wiring technology (wrapping, Multiwire, etc). But FPGAs can be used as instantaneous breadboards and synthesis programs create valid designs algorithmically. The FPGA technology push is already creating more interest in design methodologies. One can predict that areas like design debugging will come under review now that equivalents to software debugging techniques are available in hardware.

4.1.3 Tools for Physical Design

Physical tools address the question “how do I make this (structural) design fit physical reality while meeting performance requirements?” All electronics technologies deliver finite spatial resources for building functions and communications (wires). Resources are especially tight with FPGAs, as the convenience of programmability is bought at the cost of chip area for control store.

Conceptually the simplest tool, and one that provides the essential handle on reality, is the graphical editor. For FPGAs it is most likely to be a symbolic editor since it is desirable to abstract away the underlying details of the silicon. The figures in this book mostly show symbolic artwork of two FPGA families.

The physical design process can be greatly assisted by automating the generation of random and regular logic. Block generators for regular structures such as read-only memories and programmable logic arrays are normally part of a software supplier’s package. Generators can also be custom built by engineers using ordinary programming languages. For random logic, placement-and-routing routines can be used to build both random logic blocks and to assemble blocks at higher levels.

The total spatial demand of a design includes a component for function units and a component for wires. It is important to know early on if the spatial demand of a design can be met by the spatial capacity of a particular FPGA. Estimator and fitter programs assist in this task by computing function and wire demand for a particular design for comparison with members of a standard part family.

The result of running standard physical design software on the display driver example is shown later in Figure 4–13. It shows a logic cell array (LCA) implementation and two configurable array logic (CAL) implementations: one using a ROM generator and synthesis program and the other handcrafted logic. It exposes the eternal verity that humans can outperform algorithms, in this case by a factor of 2, provided the design is tractable and there is sufficient (human) design time.

It is likely the small example we have been using is only part of a bigger design. In fact it was taken from a VLSI class exercise in which the design specification is as given in Section 4.4. In this case a key physical design activity is floor planning, in which the gross disposition of units in the design and a wiring strategy for the design are defined. Editors, generators, and place-and-route programs can all assist the floor planning phase.

Precooked designs are also a useful source of tested designs with proven layouts. These are usually found in supplier- or user-distributed libraries [Newk83].

4.2 THE FPGA DESIGN DICHOTOMY

One of the interesting properties of FPGAs is that they merge what up until now have been two diverging families of electronic products: programmable logic devices, notably PALs, and ASICs, notably gate arrays and cell-based ICs. Each strand imposed a different implementation style on the engineer, each came with its preferred design process flow, and each came with its preferred toolset. The following sections set out to show explicitly how FPGAs can be designed either from an ASIC or PLD perspective.

Parenthetically, it is too simple to suggest that only two design methodologies and toolsets are important. In reality methodologies should be designed and toolsets chosen to match the end FPGA application. For example, one could classify the chief applications areas of programmable logic as: ASIC, ASIC emulation, and algorithm implementation/coprocessing, in which case the following observations would be relevant to method and toolset.

ASIC

Simple ASIC: A PAL design methodology that places emphasis on tools for capturing a behavioral description, optimizing logic equations and fitting the design to a member of a standard part catalog.

Complex ASIC: ASIC design places emphasis on both high-functionality tools for tasks like synthesis and automatic place-and-route, and low-level tools for complete design control and low-level optimization. More details are contained in the following sections.

ASIC Prototyping: The assumption here is that the real implementation is probably a gate array or cell-base IC (CBIC). So emphasis is in transforming a schematic to a netlist in the most painless way possible. Automatic tools for physical design are key. There is a desire for technology independence, but the preservation of gate-level equivalent circuits may be a better design goal.

Algorithm Implementation: Algorithm implementations usually exhibit some “natural” structure: cellular automata, systolic circuits, self-timed micropipelines, pipelined coprocessors, dedicated datapaths, and so forth. Emphasis is on the very efficient design of a basic unit using low-level tools like symbolic editors and structure compilers.

4.3 DESIGN PROCESS FLOW: THE PROGRAMMABLE LOGIC DEVICE ROUTE

A typical programmable logic device (PLD) design process flow is shown in Figure 4–6. The starting specification might be informal and hand drafted in anything from an HDL to a collection of Boolean equations. Design capture moves the design into the CAD environment and requires the engineer to restate the creation for machine consumption. This might be preparing a file in a preferred notation or marking entries in a table. Synthesis and optimization uses algorithms to manipulate the logic design to improve it by exploiting don’t care states to minimize the number of gates used. Function verification may be a unit-delay simulator to generate circuit outputs, or may amount to a manual inspection of truth table data. Device fitting matches logic/gate demand and input/output (I/O) demand to part family members. Finally, physical design produces manufacturing data, a binary stream, for programming specific devices. This must be achieved by generating a layout of the required circuit in the chosen FPGA cell and wire scheme. It may amount to running a simple logic block generator or it may be placing and routing random logic using manual and automatic software tools.

Figure 4–6. PLD design process flow.

4.3.1 Conceptual Design of Display Driver

In the case of a CAL design the specification is given in Table 4–2. The minimization of this table using ESPRESSO [Brayton84] is shown in Table 4–3. Now this is a standard sum-of-products form that we can expect to implement within an array of cells that allows a two-level logic structure to be built.

In the case of the LCA implementation, because there is no “natural” form for implementing a two-level logic structure, designs must be created as fully instantiated (flat) netlists. For the display driver design specified in the PALASM2 of Figure 4–3, a netlist must be optimized by proprietary tools to accommodate implementation in configurable logic blocks (CLBs) and the result merged with a top-level schematic to create the complete design.

4.3.2 Design Verification of Display Driver

In the case of the CAL design, since the logic was synthesized there is no formal verification step other than informal inspection of the truth tables.

In the case of the LCA design, the flat netlist of logic primitives can be exercised with a logic simulator. The result of this for a test pattern of the ten digits is shown in Figure 4–7. Readers can manually construct the digits from the output to confirm the correct functions are being computed.

4.3.3 Physical Design of Display Driver

In the case of the LCA implementation, the result of applying placement and routing algorithms to the merged netlist is shown in Figure 4–23.

In the case of the CAL implementation, physical design can be automatic, as in the synthesized ROM logic block of Figure 4–21, or physical design can be place and route (manual or automatic) from a suitable gate-level netlist. Using the gate-level schematic of Figure 4–4 yields the manual design shown in Figure 4–22. The synthesized ROM is further described in Section 4.5.1. It is interesting to compare the quality of the logic synthesis with that of a manual implementation generated with software design tools. The configurable logic ROM implementation requires an 11 (4 inputs plus 7 outputs)-by-8 (product terms) array of cells compared with an ll-by-4 array when implemented by hand. The following points are worth noting:

Figure 4–7. Result of simulating display driver.

Flexibility. The greatest advantage of the human-generated implementation is the way the layout of the logic unit has been optimized to fit in with the chip pin-out requirements and the counter design. Although there is a structure to the layout (note that input variables are routed horizontally across the array and outputs are collected using vertical wires), it is specific to this particular example.
Variable Ordering. In the automatically-generated array the order of the input variables in the cascade of gates is fixed. Only the function performed by gates in the cascade can be changed. The human implementation achieves a much better factorization of the logic equations by changing the order of variables in the gate cascades and using trees rather than simple cascades of gates. This can only be achieved by having a special routing plan for each function implemented.
Design Time. Although the automatic implementation is much less efficient, it only took about 10 minutes. The human-generated implementation took several days.
Design Size. This example approaches the limit of the size of circuit which may be designed efficiently by a person, and at this level of complexity, human design may surpass automatic design. But for the most part circuit optimization can be left to algorithmic methods.

4.4 DESIGN PROCESS FLOW: THE APPLICATION-SPECIFIC INTEGRATED CIRCUIT ROUTE

There are many sources of ASIC design process flows. VLSI design methodologies [Mead80], [Weste93] and ASIC manufacturers [Naish88] are good sources. The former place greater emphasis on layout issues, such as wiring management, to achieve optimal physical designs, while the latter largely rely on structural design to manage the whole process flow. A generic model for ASIC process flow is shown in Figure 4–8. Both the Gajski-Kuhn and more traditional forms are shown. This indicates the refinement of a specification to a structural design before entering design validation and physical design processes. In reality the behavioral and structural descriptions are built by stepwise refinement of the design. A typical ASIC design process flow, represented by paths in the Gajski-Kuhn diagram, shows the three design descriptions being built and the grain of the design unit decreasing. Thus a starting behavioral specification posits a structural design. This could be a composition of interconnected function units. This, in turn, could be stepwise refined by expanding to a structural design of simple elements, a gate-level netlist, or synthesizing a two-level logic block.

Figure 4–8. ASIC design process flow.

To illustrate the flow let us return to the display driver example. In fact, the 7-segment display driver was part of a larger timer design exercise for a VLSI course. It had the specification: Design a simple stopwatch to count in tenths of a second to one minute and display the current time on three 7-segment displays (tenths, seconds, and tens of seconds). The watch is controlled by three signals: INIT clears the watch and puts it in the “stop” state, SS start/stop is a high-going edge to start the watch from the stop state (a high-going edge in the start state puts the watch in the stop state), CLOCK, a 10-Hz signal.

4.4.1 Conceptual Design of Stopwatch

The 7-segment decoders provide a good example of “random” combinational logic, while the counters provide a good example of classic sequential logic. This design uses three separate units for the three displays although the design complexity and area could be reduced by implementing it as a single-state machine with multiplexed displays (since the large 7-segment decoder logic would not need to be duplicated). This is necessary to comply with the design specification given.

The Counter The counter is built from four toggle flip-flops (Figure 4–9). The basic 4-bit ripple counter is converted to a decade counter by a gate that sets the clear line when the counter gets to ten. The output of this gate is also used as the clock for the succeeding counter. The counter can also be cleared by the user’s INIT signal. Providing the clear capability is the only real problem in the design of the toggle flip-flop (the basic D latches have only clock and D inputs). The clear is provided by extra logic gates that force Os onto the D inputs and 1s onto the clock inputs. A basic master-slave flip-flop can be built with only two D latch cells, whereas this clearable implementation requires six cells. The layout is shown in Figure 4–10.

Figure 4–9. Toggle flip-flop design.

The Decoder The decoder takes advantage of the ability to produce any function of two Boolean variables within one cell and uses three levels of logic rather than the two-level logic normally used to implement such functions. Its schematic description is given in Figure 4–4 and physical design in Figure 4–22 later.

The Control Logic The controller function (at the bottom left of Figure 4–11) is implemented using a toggle flip-flop. The design is the same as those within the counters. The toggle flip-flop is clocked by the start-stop input and its output determines whether the counter should be stopped or run freely. The initialize signal clears the control flip-flop and stops the counter. The counters are stopped by ANDing the 10-Hz clock with the output of the control flipflop: thus when the output of the control flipflop is 0, the counters clock inputs are held low, and when it is 1 the counters receive the 10-Hz clock input.

Figure 4–10. Counter design.

Figure 4–11. Full Stopwatch design.

Figure 4–12. Stopwatch floorplan.

4.4.2 Design Verification of Stopwatch

See Problem 3.

4.4.3 Physical Design of Stopwatch

The floorplan of the full watch circuit is given below as Figure 4–12 and the layout in Figure 4–11. The floorplan is based on using three sides of the design to make connections to each of the displays accessible and the fourth side to get the control signals into the chip. Note that the floorplan requires simple transformations, rotations, and mirroring of the basic unit.

4.5 LIBRARIES AND DESIGN IDIOMS

In a sense design is idiomatic. Engineers can express designs in modes particular to a set of components at a specific level of design abstraction, for example, register transfer elements, or components particular to an implementation style for example, small-scale integration (SSI), MSI, and LSI TTL [TTL88]. If one adopts a bottom-up view of the world, then large designs are viewed as made up from aggregations of elementary components, usually gates. Since electronic logic by and large evolved from TTL elements, designs are perceived as compositions of TTL primitives. History has created a data base of collective design experience. This data base may be mined to create new designs from old experience, provided a set of primitive components exists in the new implementation style. This view has led the ASIC vendors to reproduce TTL libraries for customers’ use in designing (see Table 4–4). Alternatively, if one adopts a top-down view of the world, large designs are composed from generic high-level elements, such as counters, ALUs, and register files, whose specific features are defined in the design refinement process. This view of designing has led the ASIC suppliers to provide parameterized register-transfer (RT) and processor-memory switch (PMS) level component libraries, Table 4–4.

These hard and soft libraries are equally valid ways of collecting design experience and making it available to practicing engineers in tried and tested forms. Hard libraries try to maintain historical certainty in the encapsulation and functions of their elements. In a sense, they are fossilized design fragments that may limit the optimality of complete designs. Soft libraries can provide “made to measure” components for optimal designs at the expense of variations in performance for particular instantiations.

4.5.1 Parameterized Libraries

It is often very useful to think of library components as elements “sized” to a set of parameters: an n input gate, an n word by m bit stack, and so forth. In building a circuit the parameters are replaced by integer constants to suit the application. A simple way of satisfying this requirement would be to fill a library with all useful instantiations of a parameterized module and select a particular one by indexing from the parameter. This approach allows for more precise characterization of library elements at the expense of more storage. An alternative approach is to use the parameters in a program to generate the precise instance required. Two examples of this approach are described in the following subsections.

Parameterized Modules The requirement is to build a count “up-to-n” circuit. A generic counter and comparator, and constant generator, could be used to fit the requirement. The alternative approach would be to compose a counter from stages, each of which could test for a 0 or 1 depending on the position of these binary digits in the constant. By adding comparator gates, see Figure 4–13, to a basic toggle register we have the elements for building a count up-to-n circuit. The power of a programming language can be harnessed to assemble such a part. We need to build a vector of counter stages, each of which is conditionally defined. The iteration and conditional selection features of a programming language are well-suited to this task, provided there is a library of routines for describing and instantiating cells. The fragment of C code shown in Figure 4–14 shows this for assembling the “onestage” and “zerostage” of Figure 4–13. With upto = 5, the count-to-five circuit of Figure 4–15 would be generated.

Block Generators: The Configurable Logic ROM Generator With cellular FPGAs there is likely to be a simple way of implementing a two-level structure. In the case of CAL it is a configurable logic ROM. We will describe how it works by first discussing a simpler but less efficient scheme.

Simplified Version This version consists of an n-bit decoder, an “AND” plane, producing 2ⁿ^-1 minterms feeding a special “OR” plane which uses two gate functions. As shown in Figure 4–16, the Algotronix CAL cell has four inputs, one per side. Any two can be selected as XI, X2 for the evaluation of an arbitrary Boolean function. The simplified CAL-ROM layout is shown in Figure 4–17. If the row is selected, the cell function depends on whether we require 0 or 1 output for this particular bit. On the other hand, if the row is not selected, we simply pass through the input received from the cell below, to take care of the case where the selected row is below. If the selected row is above, the output of this cell is of no consequence. Figure 4–18 gives the CAL cell configurations.

Compact Version The height of the array can be cut in half, and the decoder width reduced by one cell. The last input variable is introduced at the bottom of each OR-plane column, as shown in Figure 4–19, and each cell in the OR-plane computes a function of this variable and one of the minterms. If the row is selected, the input from below must be AO. If the row is not selected, it may either be below or above the selected row. In either case, it simply passes on the input received from below—either A0 or the output of the selected row. The cell function chosen depends on the value desired for A0 = 0 and A0 = 1, as shown in Table 4–5 and Figure 4–20. The program lisa automatically generates this arrangement for a multiple-output function given an arbitrary truth table.

Figure 4–13. Stages for count-to-n circuit.

Figure 4–14. CAL operation.

Figure 4–21 shows the resulting layout of the CAL cell array, in which the 4 inputs appear at the lower left hand edge, q3, q2, q1, q0 and the outputs a,b,..., g at the upper edge. Figure 4–22 is a manually-composed alternative, with the inputs at the left and outputs at the right. Figure 4–23 is a layout for a Xilinx XC3020 design using the logic schematic given in Figure 4–4 as input.

Silicon Compilation The examples in this section of parameterized modules and generated blocks have shown physical synthesis from high level descriptions. The term “silicon compilation” [Gray79], has been applied to this process. In the case of the counter modules, the start point was a number of schematics for count cells, and a C program was designed to compose the cells to form the function. Because the start point for this process was a structural description, the C program and (software) library functions represent a structure compiler.

In the case of ROM block generator, the start point was a truth table, so the process was behavior compilation and the synthesis software represents a behavior compiler.

Figure 4–15. Count-to-five circuit.

Figure 4–16. CAL cell. Note: the output is a function of the West (× 1) and South (×2) inputs.

Figure 4–17. Simplified CAL-ROM layout.

Figure 4–18. CAL cell configuration.

Figure 4–19. Compact CAL-ROM layout.

Figure 4–20. CAL cell options.

Both examples show the power of silicon compilation techniques in generating substantial design fragments quickly and with guaranteed structural integrity.

Figure 4–21. Physical designs for the display driver: ROM-based CAL design.

4.6 PLACEMENT, ROUTING, AND WIREABILITY

An eternal requirement in CAD tool provision is for automatic place and route, preferably with 100% guaranteed success and with a short run time. Unfortunately, placement and routing takes place within the context of a real design and physical reality imposed by one or more FPGAs, the implementation medium. Any design contains a demand for wires with its point-to-point connections and bus structures. Since the FPGA contains a finite number of wires, all FPGA architectures have a defined wiring capacity. When there are gross discrepancies between the wiring demand of a particular design and the wiring capacity of the target FPGA, it is likely to be difficult to discover an implementation by any method including advanced placement and routing algorithms. It is worth trying to understand the relationship between wiring capacity and demand so that better engineering decisions can be made about the suitability of a particular FPGA for a particular design.

Figure 4–22. Physical designs for the display driver: manual CAL design.

4.6.1 Rent’s Rule and Package Pin-outs

Rent’s rule was originally “discovered" from examining packages of logic modules [Russo71]. It attempts to answer the question, “How many connections (n) are required to a structure holding a number of components (N) each of which has a number of connections (m)?” (see Figure 4–24). The relationship can be stated as n ≈ m * N^(l-p), where p is Rent’s coefficient (≈0.5). It has an intuitive explanation if one imagines a square array of components on a board. Cutting the board in half, creating two edge connectors, would expose m wires. The rule can be applied as a “sanity check” on components that deliver random logic, like FPGAs, which are made of N logic cells with m pins contained in packages of n pins. To make the most use of an FPGA while implementing random logic, the ratio of user I/O package pins to connections to the logic cell should be according to Rent. This exercise is left to the reader to be applied to the commercial offerings at any point.

Figure 4–23. Physical designs for the display driver: LCA design.

4.6.2 Rent’s Rule and Wireability

Heller et al. [Heller78] were the first to attempt an analysis of wiring demand, in this case for logic designs in the context of gate array “images.” They addressed questions such as: What characterizes useful gate array architectures? How big should wiring channels be? What is the average interconnection length of wires? What is the distribution and number of interconnections? Simple one- and two-dimensional models (see Figure 4–25) can be used. This shows three terminal logic elements communicating with a wiring channel of fixed capacity (the capacity in the two-dimensional case is the sum of the horizontal and vertical wiring channels). By using Rent’s rule to compute average wire length and making an assumption on the distribution of wires, it is possible to estimate a channel capacity that will allow the interconnection of a given number of logic elements. This is shown in Figure 4–26. It shows that even for relatively low gate counts, the ideal channel capacity is high, but as the gate count grows logarithmically, the channel capacity grows linearly. The analysis also covers variation for m terminal logic elements. The result indicates that to help the wireability of a design, the FPGA architecture, or structures built from it in the case of channelless arrays, should support channel capacities corresponding to gate count as in Figure 4–26.

Figure 4–24. Two levels of physical hierarchy.

Figure 4–25. Idealized wiring models.

Figure 4–26. Variation in wire demand with gate count.

4.6.3 Partitioning Designs within and across Chips

Partitioning is important in three areas:

Partitioning the resources within a single FPGA to support multiple designs
Partitioning a design into subsystems for placement onto regions of an FPGA
Partitioning a very large design across more than a single FPGA part

An example of the first case is shown in Figure 4–27. In this case, each of the subsystems is self-contained and can be designed in isolation. There are at least two important design decisions: can the subsystems be fitted onto the chip? and does the pin-out make sense at the board level? It is likely that there is more than one choice of implementation of each subsystem, and that in the parcel packing problem, in laying out the chip, the modules may have alternative shapes. This degree of freedom can be important in making feasible designs. The second question can only be answered in the context of the board design and involves matching the wire demand of the chip pin-out to the wire capacity of the board (defined from the copper dimension design rules and number of interconnect layers).

In the second case the need for partitioning comes from a requirement to divide a large design into subunits for implementation, the traditional divide and conquer approach to engineering problems.

Figure 4–27. Multiple designs on a chip.

PROBLEMS

1. Find a simulator and simulate the display driver design from the logic.

2. Verify Table 4–2 as a correct behavioral description.

3. Verify the same logic block by developing expressions for the outputs evaluating these expressions for the ten binary constant inputs.

4. Simulate the whole design of the stopwatch.

5. Comment on the overpinning or underpinning of some commercial FPGA parts.

6. Describe a wiring strategy for building random logic within a cellular FPGA consistent with wireability rules.

CHAPTER 5

CASE STUDIES

The purpose of this chapter is to:

Illustrate the stages in transforming a problem definition into a working solution
Provide illustrations of typical design tools
Show the usefulness of hierarchy

We have deliberately chosen elementary examples, which for the most part only use a small portion of a field-programmable gate array (FPGA) chip, apart from the final example of the videostore controller. The problems have been carried out with Xilinx and Algotronix FPGAs, using CAD tools obtained from these companies and ViewLogic Inc. Since CAD software is regularly improved, and command sequences change, any detailed information on particular CAD tools would probably be out of date by the time this book is published. We recommend that you obtain the latest editions of data books, user guides, and so forth, from the companies or their local distributors. At the end of the chapter there are suggestions for projects you might like to try for yourself.

The Xilinx examples have been developed for a generally available demonstration board, which is shown in Figure 5–1. It has eight switches for input, eight individual light-emitting diode (LED) displays for output, and a 7-segment numerical LED display. Some of the examples have been devised to fit this restricted set of circumstances.

5.1 COMBINATIONAL CIRCUITS

5.1.1 Parallel Adder Cell

The simplest form of parallel adder is a ripple adder, which consists of a cascade of 1-bit full adder cells. Each cell has inputs A, B along with carry input C. It produces outputs SUM and CARRY, determined by the Boolean equations for a full adder, corresponding to the truth table shown in Table 5–1.

SUM = A... + Ā.B. + Ā..C + A.B.C

CARRY = A.B + A.C + B.C

Xilinx Implementation The equations may be entered directly as the F and G outputs of a Xilinx 3000 series Configurable Logic Block (CLB), since each output can be an arbitrary function of up to four variables chosen from a total of five inputs. Since the functions are implemented by look-up tables, there is no advantage in manipulating the equations further. For such a simple case it is easy to enter the details directly on the CLB array with the editor XACT. Figure 5–2 shows the block concerned, while Figure 5–3 shows the routing to input and output pads. The routing was determined by XACT after each net had been given as a list of pins. The configuration of each input or output block can also be specified with XACT, in this case direct inputs and outputs. The design details were saved to a file to be used as input in generating a configuration bit-stream file for downloading to the chip itself.

Alternatively, a schematic diagram for the adder may be entered as a logic circuit, as shown in Figure 5–4, which was prepared with the editor ViewDraw. This was used to generate a netlist, that is, a file specifying the logic elements used and how they are interconnected. The netlist can be used for functional simulation with a simulator such as ViewSim, as shown in Figure 5–5. Since the design has not at this stage been mapped to an actual FPGA, this is only a functional simulation, and there is no useful timing information. Translation to an actual Xilinx configuration will be deferred until we have a more substantial example.

Algotronix Implementation Since the configurable array logic (CAL) block functions have two inputs, the equations need to be reexpressed:

SUM = (A ⊕ B) ⊕ C

CARRY = (A.B) + ((A + B).C)

The 1-bit adder can be laid out in a block of 3 × 3 CAL cells as shown in Figure 5–6. The graphical editor clare was used for this purpose, and allows the user to specify the function of each cell and routing to and through it. The lower left cell holds the datum for the block. Note that the cell is arranged so it can be cascaded vertically, with the least-significant bit at the bottom.

5.1.2 Parallel Adder

We can assemble a ripple-carry adder by simply cascading the full adder. We use a 4-bit example.

Figure 5–2.Configurable Logic Block for full adder cell.

Figure 5–3. CLB array for full adder cell (XACT).

Xilinx Implementation Figure 5–7 shows a schematic prepared with the schematic capture program ViewDraw. There are four instances of the full adder cell referred to in Section 5.1.1. To use it with the demonstration board, we need a top-level schematic, which is shown in Figure 5–8. This shows the specific input and output pads, which are connected to the dual-in-line switches and LEDs, respectively. Again, the design may be simulated at this stage, that is, at a functional level, and Figure 5–9 shows results of applying a set of input vectors VA, VB resulting in the output vector VSUM. The process of converting the design into an FPGA implementation requires the following steps:

Figure 5–4. Schematic for full adder cell (ViewDraw).

Derive a netlist from the schematic editor. This specifies the circuit in logic detail, still in terms of idealized elements.
Map the logic circuit to the particular technology of the Xilinx 3000 series CLBs, and input/output blocks (IOBs). This is a complex process, since each CLB can generate two output signals, and the aim is to reduce the total number of CLBs required. At the same time, some simplifications can be made, for example, redundant logic can be removed.
Placement, which starts from a random assignment of CLBs to positions in the logic array. For this design, the IOB positions are predetermined by the demonstration board. The initial placement is evaluated in terms of estimated wire length, based on the connection points of each net. The placement is improved with an iterative process known as simulated annealing, which moves CLBs to reduce the estimated wire length.
Routing. Each net is connected according to the netlist, and uses the variety of available routing resources: direct connections from a CLB to its immediate neighbor, connections via switch-matrices, and long horizontal or vertical connections. This process results in a file containing details of the placement and routing.

Figure 5–5. Simulation of full adder cell (ViewSim)

Figure 5–6. Configurable array logic for full adder

Bit-stream generation. The bit stream specifies the “program” for the FPGA, that is, the memory bits which control the function and internal connections of each CLB, the settings of all interconnect switches, and the mode of operation of each IOB.

The bit stream can be downloaded to the demonstration board, and the design checked with switches and lights. Later we will discuss a more systematic way of testing using random patterns.

Algotronix Implementation The graphical editor clare was used to create four instances, as shown in Figure 5–10. Note that the cells cascade by abutment. The design was placed on the Algotronix CHS2×4 board described in Chapter 6, and was positioned in the northwest comer so that signals could be applied and monitored easily.

5.2 SEQUENTIAL CIRCUITS

5.2.1 Decade Counter

This design was implemented for a Xilinx 3020 FPGA with the aid of the ViewDraw schematic entry program, the ViewSim simulator, and Xilinx software for CLB mapping, placement, routing, and so on. The design illustrates alternative ways of creating a decade counter:

Figure 5–7. Schematic for ripple adder.

Figure 5–8. Top-level schematic for ripple adder.

Figure 5–9. Simulation for ripple adder.

Figure 5–10. CAL layout for ripple adder.

Using a macro element for a synchronous decade counter functionally equivalent to the 74160 TTL component
As a collection of gates and flip-flops
By direct entry to CLBs

Figure 5–11 shows the top-level schematic. The demonstration board has an RC controlled oscillator based on a single configurable logic block. The GOSC macro, shown in the lower left of the figure, produces a clock that runs at approximately 100 Hz when connected to external R and C components through pins 12 and 14. Another pin is connected to a dual in-line package switch that determines if counting takes place, or otherwise holds the counters clear, depending on the signal COUNT_EN. The oscillator is connected to the global clock GCLOCK, which in turn increments the least-significant decade counter. Figure 5–12 shows the schematic diagram for the 74160 synchronous counter. Figure 5–13 shows an encoded state machine design. The next state (N) variables are determined from the present state (P) variables. Figure 5–14 shows the state machine for the decade counter. Figures 5–15 and 5–16 show the basis for the design, which takes advantage of unassigned states that will of course never be entered, that is, decimal 10-15, and assigns 0 or 1 depending on the resulting simplification in the Boolean equations. For example, in deriving the least-significant bit N₀ of the next state, the box corresponding to present state , that is, “12” (decimal) is assigned to 1 to allow the whole column to be taken, and later combined with the right-hand column as ₀. The fifth D flip-flop is used to delay the carry-out following state “9” until the next count pulse arrives. Figure 5–17 shows how the CLBs may be set up directly by the user, that is, assigning their mode and Boolean equations to be implemented. Note that the three alternative methods are shown for illustrative purposes, and that in practice the 74160 macro method would be preferred by most users.

The schematic information was used to generate a netlist for use by Xilinx software, which first mapped the logic to CLBs, and then determined a good placement on the 8 × 8 CLB array. The routing was completed automatically, and the CLB array layout is shown in Figure 5–18. The RC oscillator CLB is at site BA in the upper left. Note that after buffering by the clock buffer it drives two vertical clock lines in the first two columns. Some connections between adjacent CLBs are made by direct connection, which is fast, while others pass through the much slower switch matrices, revealed by the sloping routing. Each decade uses four CLBs. Figure 5–19 shows the settings of a particular CLB at location FE—this corresponds to the two least-significant flip-flops at the left-hand side. Interestingly, the mapping process has swapped the roles of the two outputs, presumably to reduce interconnection delay, as can be seen by comparing the figures. The configuration detail shows that both of the edge-triggered D flip-flops are clocked on a rising clock edge (K), provided the clock is enabled (EC). The reset-direct (RD) signal is used to hold the counters clear.

Figure 5–11. Top-level schematic for 3-decade counter.

Figure 5–12. Macro cell for 74160 decade counter.

Figure 5–13. State machine for decade counter.

Figure 5–14. State sequence for decade counter.

5.3 PSEUDORANDOM NUMBER GENERATION

Random numbers can be very useful in simulation and testing digital circuits. Here we will use these numbers in testing the parallel adder described earlier. The term pseudorandom is used because the process is in fact completely predictable. The linear feedback shift register (LFSR) method is a particularly simple one, which can generate a sequence of length 2ⁿ – 1 for a register n bits long, provided it has the correct feedback connections. Figure 5–20 shows a 4-bit LFSR that can generate a sequence of length 15, not including the all-Is state, which is a lock-up state—once entered, the register will stay in this state indefinitely. Table 5–2 shows the feedback connections to produce a maximal length sequence for various register lengths.

5.4 RANDOM TESTING

For large systems, random testing is often a practical alternative to exhaustive, or case-by-case testing, and may be built into integrated circuits and systems.

Figure 5–15. Equations for state machine: state encoding for decade counter, less-significant bits.

Figure 5–16. Equations for state machine: state encoding for decade counter more-significant bits.

Figure 5–17. CLB entries for decade counter.

Figure 5–18. Configurable logic block array for 3-decade counter.

Input patterns may be readily generated with the LFSR method just described. Signature analysis is a technique for compressing the resulting output patterns into a short word whose final value indicates success or failure. While compression inevitably could allow some circuit failures to produce an identical signature, the risk is usually small.

We will take the 4-bit ripple adder described in Figure 5–4. Figure 5–21 shows the top-level schematic. Since there are eight input bits, along with carry-in, we will use a 9-bit pseudorandom number generator. To compress the four sum bits and the carry-out into 4 bits to be displayed on the demonstration board LEDs, we use a 4-bit signature analyzer.

Figure 5–19. CLB configuration.

Figure 5–20. Linear feedback shift register.

5.4.1 Signature Analyzer

The signature analyzer is an adaptation of the LFSR. At every clock cycle, a set of outputs from the ripple adder is loaded in parallel to an LFSR, with the feedback connections given earlier. These are XORed with the previous register contents as shown in Figure 5–22.

TABLE 5-2. LFSR Connections for Pseudorandom Sequence Generation

Register Length	Feedback Connections
2	0,1
3	0,2
4	0,3
5	1,4
6	0,5
7	2,6
8	1,2,3,7
9	3,8

Adapted from J. Mavor, M. Jack. and P. Denyer, Introduction to MOS LSI Design, Addison-Wesley. 1983, p. 118. © 1983, Addison-Wesley Publishing Co. Inc., reproduced with permission. Note: For a register length of 8, form the input by two EXCLUSIVE-OR gates feeding a third.

Figure 5–21. Top-level schematic of a parallel adder.

Figure 5–22. Signature analyzer.

Figure 5–23. State diagram

5.4.2 State Machine

As discussed in Chapter 2, state machines can be implemented in different ways. Encoded state machines may minimize the storage required, but at the expense of decoding logic, reducing speed. Here we use the one-hot method, that is, one bit per state. The four states are as follows:

Initial state: wait for switch to be pressed
Accumulate signature from four output bits in signature analyzer
Accumulate carry-out bit in signature analyzer
Display 4-bit signature on LEDs

Figure 5–23 shows the state diagram and Figure 5–24 shows the schematic. Note that State0 has inverting logic on both input and output, so that it automatically goes high when the FPGA is powered-up. Also, the State3 output is used to reset states 1 and 2 in the unlikely event of a logic state error, for example, if either bit was simultaneously asserted with State3.

5.5 SYSTOLIC SORTER*

This example is included to suggest that FPGAs may provide novel ways of solving well-known problems. Sorting is required in many computer applications, particularly commercial ones. Here we look at a high-speed sorter, but one confined to miniscule amounts of data. It is described as systolic because it uses a repetitive block structure, and has a regular data flow. In general, each item to be sorted consists of a number of fields, of which one is selected as a key, on which to sort. For simplicity we regard the key as a nonnegative integer, and assume that we wish to sort items in ascending order of their keys. For simplicity, we concentrate on sorting the keys.

Figure 5–24. Schematic for state machine.

Knuth [Knuth73] distinguishes some basic sorting methods:

Sort by insertion: Find the correct place to insert an item in an already-ordered list

Sort by exchange: Consider pairs of keys, and swap them if out of order

Sort by enumeration: Find the smallest (or largest) key and remove the item to a new list. Repeat until the original list is empty.

The scheme described here is a combination of the first two methods. We assume that the items to be sorted arrive individually, so it makes sense to keep previous arrivals in order—we choose ascending. Borrowing a technique from full-custom VLSI design, we make up blocks of cells, and arrange for simple connections of data and control signals. The overall floorplan is shown in Figure 5–25. Each key is stored in a vertical column of one-bit blocks, with the least-significant bit at the top, and the most-significant bit adjacent to the control block at the bottom. New keys arrive at the left-hand side, and will “ripple” to the right, as they are compared with stored keys until they fall into the correct column. Each control block, of which there is one per key, simply uses the result of the arithmetic comparison made above:

if the stored key (S) is smaller
than the one to its left (L), key L is passed to the cell on the right, i.e.
bypassing S. otherwise, L must be stored in place of S, and S passed to the right.

Figure 5–26 shows the data flow for the comparator cell, as implemented in Algotronix CAL logic. When all the keys have been submitted, the sorted list is returned by a series of left shifts, that is, so the smallest key emerges first.

Figure 5–25. Systolic sorter.

Figure 5–26. Sorter comparator cell—data flow.

In every column an arithmetic comparison is made, commencing with the less–significant bit at the top of the key. Each bit stage generates a signal “less,” which indicates that, based on the series of bits so far, the stored key is less than the one to its left. This result may change as the more significant bits are considered. The comparator logic is illustrated in Figure 5-27, and is described as follows:

Figure 5–27. Sorter comparator cell. Comparison logic.

if the left-item bit and the
stored bit are equivalent, pass on the “less” result from the Is comparator
above. otherwise, if the left-item bit is a 1, set “less” to 1 or if a 0, then
set “less” to 0.

It is implemented in the 1-bit comparator shown in Figure 5–28.

The control block is shown in Figure 5–29. Before sorting starts, the CLRb signal sets all the D flip-flops to 1, and as keys appear, Os are propagated through the control blocks. When the set of keys to be sorted has been entered, the results are retrieved in an ascending order by issuing a set of “read” instructions, which shift the contents of the control and key blocks left. Figure 5–30 shows an overall view of the sorter.

5.6 MULTIPLIERS

Multiplier designs are a simple way of studying different varieties of digital logic designs. It is possible to develop different multiplier examples to study the varieties of design approaches and implement these designs using FPGAs. In the following sections Xilinx 3000 implementations are assumed, but the approaches vary from fully automated to handcrafted with corresponding gains in performance.

5.6.1 Parallel Multipliers

The simplest of the multiplier designs is a parallel-combinational logic multiplier. This multiplier produces a result by adding the partial products of equal binary weight with the carries of the lower binary weight. The design consists simply of AND, OR, XOR, and inverter gates at the top-level schematic.

Schematic Entry Figure 5-31 is an example of a 4- × 4-bit combinational logic multiplier. The design was created using ViewLogic WorkView. Using CAD tools, a logic design can be entered directly as a top-level schematic, manually connecting gates together. To obtain more information about creating a schematic design, refer to the current Xilinx and/or Viewlogic WorkView manuals.

Functional Simulation It is always a good idea to check the design functionality before actually continuing with the creation of the layout file. The software used to create the schematic design provided support for a functional simulation of the schematic circuit. Refer to Figure 5–32.

Figure 5–28. Comparator. CAL block.

Figure 5–29. Control: CAL block.

Figure 5–30. Overall layout of sorter.

Figure 5–31. 4- × 4-bit combinational multiplier.

Figure 5–32. Waveform for functional simulation.

File-conversion Process In order to be able to create a Xilinx layout for the parallel multiplier, it is necessary to follow a series of steps to convert the file to the correct format. The file-conversion process consists of the following steps:

Generate a “wirelist” (net) file from the schematic level file
Generate a netlist file from the wirelist file
Generate a layout file using the automatic router

Speed Most of the technology used today requires the use of high-speed circuits. The speed performance of the Xilinx design will be affected by several factors: choice of part being used, the arrangement of the programmed logic within the circuit, and the efficiency of the layout itself.

XACT Design Entry It is possible to enter a design directly by using XACT (the Xilinx design editor). This can be done by configuring the blocks directly, using logic expressions instead of entering a schematic. Figure 5–33 gives an example of the multiplier as entered using XACT.

5.6.2 Serial Multiplier with Parallel Addition

Another example of a multiplier circuit is a serial multiplier. The main difference between the parallel logic multiplier and the serial multiplier is the introduction of memory and timing in the design.

Schematic Entry Figure 5–34 is a schematic for a serial parallel multiplier created using ViewLogic WorkView.

The serial multiplier can be processed as before to generate the Xilinx layout file. Figure 5–35 shows the serial multiplier layout configuration for a 3000 family Xilinx chip.

5.7 A PARALLEL CONTROLLER DESIGN*

An outline of the design and implementation of a video framestore design is presented that used the Petri net method described in Chapter 2, and is implemented on a Xilinx LCA.

The function of the framestore is to receive or transmit digital video data from a memory in a raster scan format. Region of interest (ROI) is supported, allowing the reception and transmission of frames of any size. Video data arrive on a 16-bit-wide bus every 100 ns. A frame synchronization signal defines the start of a field of data of 260 lines by 760 pixels. Each line starts with a horizontal synchronization signal. The video controller has to deposit the data in a video framestore in the correct order (interleaving two fields to create a full frame) and to manage accesses to the memory. A second independent video channel operates at the same time, but it must access a different block of memory. A transputer is able to access both memory blocks while video operations are being performed. At the end of a frame, the video channels may be required to switch memory banks, and this must be handled by the controller.

Figure 5–33. XACT view of the parallel multiplier.

The controller has a number of registers containing its operating parameters, such as horizontal start and end count, vertical start and end count, and start address in memory. Figure 5–36 shows the datapath structure of the controller design. It contains all the datapaths and a large Petri net controller and has been implemented in a Xilinx 3090 FPGA.

Figure 5–34. Schematic for the serial multiplier.

Figure 5–35. XACT configuration for the serial multiplier.

5.7.1 Operation of Part of the Petri Net Controller

The controller Petri net uses inhibiting and enabling arcs and is made up of many small linked Petri nets. To illustrate both the Petri net technique and its use in hardware design, a small fragment of the controller is described. Figure 5–37 shows the parts of the controller associated with generating the horizontal synchronous output signal, ‘HsyncOut’, and with determining the position of active pixels within a line. Consider that the circuit is initially idle, and is to be operated as a timing master, with ‘HsyncOut’ connected back to the controller’s ‘Hsync’ input. Although ‘Hsync’ is not directly used in this part of the circuit, its assertion will cause the external pixel counter, ‘Hctr’ in Figure 5–36, to be reset and thereafter to count toward the register values ‘Hstart’ and ‘Hend’, so that the Petri net inputs ‘HctrAtHstart’ and ‘HctrAtHend’ are asserted at the appropriate times.

Figure 5–36. Simplified block diagram of the MAXbus controller.

An image transfer is initiated when an “arming” transition fires in another part of the net, causing a token to arrive in P3. This token is removed on the next clock cycle, so that the arming signal is only one cycle wide. The arming signal enables transition T4, so that on the next clock cycle T4 fires, removing the token in P4 and thereby releasing the “done” output to indicate that the controller is running. The arming signal also enables transition T1 so that on the next clock cycle the token in P2 moves to place PI, asserting ‘HsyncOut’. Having no other associated conditions, transition T2 will immediately fire on the following clock cycle, returning the token to P2. ‘HsyncOut’ is thus one cycle wide.

As mentioned earlier, the generation of the ‘HsyncOut’ signal causes the pixel counter to be cleared, and to count toward ‘Hstart’ and ‘Hend’. When ‘Hctr’ reaches ‘Hstart’, ‘HctrAtHstart’ is asserted, T5 fires, and ‘Hblank’ becomes inactive. External circuitry (the serial clock in Figure 5–36) uses this signal, together with a ‘Vblank’ signal generated elsewhere, to decide whether to clock active data. Later, ‘HctrAtEnd’ is asserted and T6 fires, reenabling ‘Hblank’. The assertion of ‘HctrAtHend’ also has the effect of firing transition T3, thereby generating a further ‘Hsync’ output from place PI, to indicate the start of a new line. The process continues until a token arrives at P4 indicating that the sequence is done and this inhibits T3 from firing.

Figure 5–37. A small part of the controller.

5.7.2 Characteristics of the Design Technique

It can be seen that the small ‘Hblank’ circuit lies in a separate Petri net running in parallel with the rest of the circuit. The introduction of parallelism in this conventional way is over and above the parallelism that can be expressed within a Petri net. Parallelism is used in the circuit both to have more than one task running at a time and simply to express the design more easily. In the case of the ‘Hblank’ circuit, this convenience allows the implementation to be a natural expression of the definition of horizontal blanking, taking place from the end of one line to the start of the next.

5.7.3 Implementation on XC3090

The design was entered in schematic form to the ViewLogic CAD system. Four designs have been produced: an 8-bit design and a 16-bit design, each with and without ROI capability. These have all been simulated using nominal delays for the logic cell array (LCA) circuit components.

The full 16-bit design uses 83% of the 320 CLBs and 86% of the 144 IOBs available. This design could not be routed automatically on the tools available at the time (1991). An early effort to cut down the design to 68% CLB usage also proved to be unroutable using automatic tools. For comparison, an inoperable version with 50% CLB usage was routed automatically. With smaller array parts, automatic routing was successful at 80% CLB usage. However, it was clear that the place-and-route strategies in the software could not cope satisfactorily with the larger arrays.

These problems were overcome using informal tools from Xilinx for routing a functional block at a time. This system allows the designer to carry out floor planning, leaving space for routing channels, which the automatic tools cannot do. The automatic router is then run on the small blocks, where it manages to achieve a higher packing density. Each iteration adds one functional block to the work already routed; earlier routing is copied into the current iteration using the place-and-route program’s (APR) guide option. Manual intervention, using the XACT LCA editor program, is needed to tidy up the design at each stage. The newer design tools from the manufacturer more adequately deal with these problems, as will be seen shortly.

As a result of this experience the following hints are offered for future users:

Supplement long lines by routing out on one side of the chip and back in the other side.
Bring signals used in a number of places into the chip using multiple pins.
Minimize three state buffer usage as they reduce access to nearby CLBs.
Minimize use of the RD and CKEN pins on CLBs, as these cannot be swapped to other locations in the way that, say, X and Y can, to make routing easier. The exception is when driving a number of these pins from the same long line.
When using outer long lines to drive, the three state input to IOBs indicate this to the router with a small piece of manual prerouting.

Two further problems were encountered. The first arises when routing a block that has a pin that is to be connected to another block that has yet to be routed. The pin can become buried with no access path to it. To overcome this a dummy circuit is added, using XACT, connecting to such pins, thereby forcing a route to be found before the current iteration can be declared complete.

The second problem arises when copying routing information from a previous iteration. During the routing process CLB pins are swapped about to make routing easier. When copying routing information from the previous iteration, the router attempts to repeat the pin swaps already decided on. However, it occasionally fails, believing the pin swaps to be illegal (which they sometimes are until other pin swaps have been done first). This is tiresome, especially when it involves the loss of routing done manually. When the problem is observed, the solution is to return to the circuit diagram and lock the correct pin designations in place.

During the time of preparing this book, there have been substantial improvements in computer-aided design support. Following the successful completion of the original design at Bristol University in 1991, the design files were made available to Direct Insight Ltd. (U.K.). In 1993 the design was completed using NeoCAD software for Xilinx FPGAs, and with the aid of a 50 MHz PC486 computer. The standard placement-and-routing package was run with 100 cost tables. No timing constraints were specified, and so the system ran without timing optimization. After 108 hours, it had tried 64 cost tables, and one of these resulted in a fully-routed chip, taking 1 hour and 25 minutes. Twenty-eight had less than 10 nets unrouted, and human intervention could probably have quickly resulted in other fully-routed options. The fully-routed version was run through five cost-based clean-ups and five delay-based clean-ups, taking a further 18 minutes. The design achieved a 6.3 MHz clock rate. Following further routing iterations over 24 hours, the clock rate had increased to 7 MHz. This is comparable with the existing design, which operates reliably at 6.75 MHz. A layout for the final NeoCAD design is shown in Figure 5–38.

In 1994 the current production release of Xilinx software XACT 5.0.0 was run with XACT-Performance enabled, so that the resulting circuit would run at its highest speed. After 82 hours on Sun Microsystems SPARC 10 workstations, the system had generated 11 full-routed designs out of 16 attempts. The first fully-routed result was available after 1 hour and 49 minutes. The varied results were produced by selecting different placement parameters, including the placement “seed” and “placer effort.” The clock rates for fully-routed chips varied from 5.4 MHz to 6.9 MHz. Six of the 11 fully-routed designs performed at 6.0 MHz or faster. The best two results gave speeds of 6.8 and 6.9 MHz, which compare favorably with the existing design (6.75 MHz). A layout for the fastest Xilinx design is shown in Figure 5–39.

Figure 5–38 Final NeoCAD design for the parallel controller.

5.7.4 Conclusions

The full 16-bit version of the controller was eventually routed. While in 1991 this took over four months of continuous effort, more recent CAD software allows rapid completion of quite complex designs to specified performance criteria. An 8-layer printed-circuit board has been designed and manufactured, and there are now more than 25 copies of the design in routine use.

SUGGESTIONS FOR PROJECTS

Faster adder circuits based on carry-lookahead logic [Wakerly90]. Note that the Xilinx 4000 series incorporates carry-lookahead logic in its CLBs.
Arithmetic-logic unit, (see [Wakerly90]).
Design an elementary processor, (see [Segee91]).
Adapt the sorter design for another FPGA family.
Boundary Scan. Boundary Scan testing has recently become standard, and considerably simplifies testing both chips and their printed circuit board environment. See [Maunder90] for details.
Dynamic memory controller. Dynamic memory requires periodic refresh to avoid loss of contents. For example, the Intel 4116 16K DRAM is organized as 128 rows and 128 columns. Each row can be read and rewritten in a single cycle, and each must be visited every 2 ms.
Asynchronous serial data transmission and reception. The RS232 standard defines start and stop bits for the 8-bit characters. Using a clock running at 4 times the nominal baud rate, it is possible to sample the incoming bits correctly. The transmitter design is simpler than the receiver.
Manchester-coded data transmission. Manchester coding carries both data and timing information on the same channel, and the edge transitions in each bit time allow a receiver to adjust its clock by speeding up or slowing down from the nominal bit rate. The receiver design calls for a phaselocked loop, which can automatically adjust the sampling rate.

Figure 5–39. Final Xilinx design for the parallel controller.