## CMU Fall’01 18-760 VLSI CAD

[120 pts] (V2) Homework 2. Out Thu Sep 13, Due Thu Sep 27 '01.

## 1. BDD ordering [10 pts]

We saw that variable order is highly significant for something as simple as a multiplexor. How about something like a comparator? A simple comparator takes two 2-bit unsigned binary numbers a 1 a 0 and b 1 b 0 and compares their magnitude, and sets the output $\mathrm{z}=1$ just if ala0 is less than or equal to b1b0. Do this:

- Is there a particularly good variable ordering for this function? Show it--show the BDD. Is there a particularly bad variable ordering for this function? Show it--show the BDD.
- Draw a gate-level netlist using any AND, OR, NOT, EXOR gates you want, to implement the simple comparator from the previous problem. Apply Minato's ordering heuristic (and where you need to break ties or make any arbitrary ordering decicion--just tell us what you did and show the work). Show what variable ordering it produces.


## 2. ITE for Gates [10 pts]

What is the fewest number of calls you need to make to ITE to implement $a \oplus b$ ? In other words, you don't want to use ITE several times to build AND, OR, and NOT, to do $a ' b+a b$ ' - that's too easy. You can do it much more simply if you think about it Draw the multiplexor-hardware picture of the ITE and label clearly what's going in and out of each ITE.

## 3. ITE Decomposition [ 10 pts ]

Using what you know of Boolean algebra, the definition of ITE, and the properties of cofactors, show that the ITE decomposition below (from class) actually works:

$$
\begin{equation*}
\operatorname{ITE}(I, T, E)=x \bullet \operatorname{ITE}\left(I_{x}, T_{x}, E_{x}\right)+\bar{x} \bullet \operatorname{ITE}\left(I_{\bar{x}}, T_{\bar{x}}, E_{\bar{x}}\right) \tag{EQ1}
\end{equation*}
$$

## 4. ITE Recursion [10 pts]

Let $f(x, y)=x \bullet y$ and $g(x, y)=x \bar{\oplus} y$ (this is exclusive-nor). Assume we have a multi-rooted DAG for the BDDs representing these two functions (you need to draw them.) We want to EXOR these functions, and compute a new function $Q(x, y)=(f \oplus g)(x, y)$. Using the ITE operator as discussed in the notes, show how you would implement this EXOR operation. As in the slides, show how the recursive computation proceeds as ITE calls itself. At each node of the recursive call tree, tell what ITE is computing (label the nodes in your BDD in some sensible way) and show clearly when each recursive call terminates. Draw the final recursive call tree.

Draw the final result, i.e., show the final form of the multi-rooted DAG that now represents functions $f, g$, and $Q$.

## 5. Derived Operators [20 pts]

Suppose we have a software package that has data structures representing variables and Boolean functions as BDDs, and that the following operations are available as subroutines in this software. (We will write these in a simplified sort of C language notation):
$b d d$ var2func(var x$) \quad$ Generate the BDD corresponding to a single variable x . Input is a single variable (of type var), and the returned output is a BDD.
$b d d$ ITE ( $b d d \mathrm{I}, b d d \mathrm{~T}, b d d \mathrm{E}$ )
Compute the if-then-else operation. Inputs are 3 BDDs, called I (if part), T (then part) E (else part), and the returned output is another BDD.
int iszero( $b d d$ func ) Returns integer 1 just if func is the always-zero function. Input is a BDD, output is integer 1 or 0 (it's not a BDD)
bdd cofactor(bdd func, var x , int val)
Computes cofactor of func with respect to var x , setting $\mathrm{x}=$ val. func is a BDD, var is a variable, val is integer 0 or 1
$b d d \mathbf{A N D}(b d d \mathrm{f}, b d d \mathrm{~g})$
$b d d \mathbf{O R}(b d d \mathrm{f}, b d d \mathrm{~g})$
$b d d \operatorname{EXOR}(b d d \mathrm{f}, b d d \mathrm{~g})$
$b d d$ NOT( $b d d \mathrm{f}$ )
Compute the basic logical (gate type) operations on BDDs. AND, OR and EXOR create new BDDs representing the logical AND, OR and exclusive-or of their inputs. NOT creates the BDD for the complement of its input.

## bdd CONST1, CONST0

You can assume these two BDDs are already defined. These are just the constant 1 function and the constant 0 function. Note that iszero(CONST0) $==($ int $) 1$.

No other operations are implemented, and there is no way for you to examine the BDD data structure directly.

Describe (in C-like pseudo-code notation-we don't need real code here!) how you would implement the following operations:

- int depends(bdd $\mathrm{f}, \operatorname{var} \mathrm{x})$ Determine whether function f depends on the specified variable $x$. This means if you change $x$, at least sometime the output of $f$ will also change. f is a BDD, x is a variable, depends returns integer 0 or 1 .
- $b d d$ univquant $(b d d \mathrm{f}, v a r \mathrm{x})$

Compute universal quantification of function $f$ with respect to variable x . f is a BDD , x is a variable, univquant returns a BDD.

- int opposite ( $b d d \mathrm{f}, b d d \mathrm{~g}$ ) Determine whether two functions are complementary, i.e., if one of them is the complement of the other. $f$ and $g$ are BDDs, opposite returns integer 0 or 1 .
- $b d d$ exchange( $b d d \mathrm{f}, v a r \mathrm{a}, v a r \mathrm{~b}$ )

Exchange roles of specified variables in function $f$.
For example, exchange $(a \cdot b+d, a, d) \longrightarrow d \bullet b+a$.
$f$ is a BDD, $a$ and $b$ are variables, exchange returns a BDD.

- $b d d$ compose ( $b d d \mathrm{f}, b d d \mathrm{~g}, v a r \mathrm{x}$ )

Creates a new function (composition function) with variable x in f set to the output of g . The picture below clarifies what we are computing. f and g are BDDs, x is a variable, and compose returns a BDD.


HINTS: lots of things that look difficult are easier when you cofactor them and look at the cofactors. Play around with ITE of various combinations of the cofactors. The first 3 operators are pretty straightforward, the last 2 are rather tricky. None of these things requires some sophisticated recursive algorithm, just a few lines of calls to the right operators with the right inputs. To emphasize this, here is the answer for the first part, for depends (bdd f, var x ):
int depends (bdd f, var x$)$ \{
return(1-iszero( EXOR( cofactor(f, x, 0), cofactor(f, x, 1)) );
\}

Notice how this works. If function $f()$ depends on variable $x$, then if $I$ change $x$ from $x=0$ to $\mathrm{x}=1$, there ought to be at least some pattern for the remaining inputs that makes the output of the function change. But this is exactly what the Boolean Difference tries to compute. So, we compute the BDD for a new function $f(\ldots x=0 \ldots) \oplus f(\ldots x=1 \ldots)$ using calls to cofactor and to EXOR. What does this new function tell us? If the new function is zero always, for all inputs, then you cannot affect the output of function $f$ by changing variable x . In other words, f does not depend on variable x . So if the iszero( ) function returns integer 1, it means the original f function does not depend on $x$. To get the true/ false return for depends () correct, we have to invert this, which is what the $(1-\ldots)$ does.

## 6. Combinational Verification in KBDD [30 pts]

$k b d d$ is a BDD calculator done by Prof. Randy Bryant's research group that has all the operators you'd want to use to manipulate Boolean functions, and a simple command line interface to type in functions, etc. You will use this to try to verify the correctness of a logic gate network whose BDDs are much too big to do by hand.
$k b d d$ lives in /afs/ece/class/ee760/bin/kbdd, and it works on IBM AIX boxes and on SUN Solaris boxes.

As a starting point, the following page shows a complete trace of a session with kbdd, using it to do the network repair problem on the last homework assignment. Inputs are in normal font, outputs italics, kbdd's prompts for input shown in bold as KBDD:
\% /afs/ece/class/ee760/bin/kbdd
KBDD: \# input variables
KBDD: boolean abcin d0 d1 d2 d3
KBDD: \#
KBDD: \# define the correct equation for the adder's carry out
KBDD: eval cout $\mathrm{a} \& \mathrm{~b}+(\mathrm{a}+\mathrm{b}) \& \mathrm{cin}$
cout: $a \& b+(a+b) \& c i n$
KBDD: \#
KBDD: \# define the incorrect version of this equation (just for fun)
KBDD: eval wrong a\&b + (!(a\&b))\&cin
wrong: $a \& b+(!(a \& b)) \& c i n$
KBDD: \#
KBDD: \# define the to-be-repaired version with the MUX
KBDD: eval repair $\mathrm{a} \& \mathrm{~b}+(\mathrm{d} 0 \&!\mathrm{a} \&!\mathrm{b}+\mathrm{d} 1 \&!\mathrm{a} \& \mathrm{~b}+\mathrm{d} 2 \& \mathrm{a} \&!\mathrm{b}+\mathrm{d} 3 \& \mathrm{a} \& \mathrm{~b}) \& \mathrm{cin}$
repair: $a \& b+(d 0 \&!a \&!b+d 1 \&!a \& b+d 2 \& a \&!b+d 3 \& a \& b) \& c i n$
KBDD: \#
KBDD: \# make the Z function that compares the right version of
KBDD: \# the network and the version with the MUX replacing the
KBDD: \# suspect gate (this is EXNOR of cout and repair functions)
KBDD: eval Z repair\&cout + !repair\&!cout
Z: repair\&cout + !repair\&!cout
KBDD: \# universally quantify away the non-mux vars: a b cin
KBDD: quantify u ForallZ Z abcin
KBDD: \#
KBDD: \# let's ask kbdd to show an equation for this quantified function
KBDD: sop ForallZ
$!d 0 \& d 1 \& d 2$
KBDD: \#
KBDD: \# what values of the d's make this function $==1$ ?
KBDD: satisfy ForallZ
Variables: d0 d1 d2
011
KBDD: \#
KBDD: \# that's it!
KBDD: quit
\%

## KBDD Quick Reference

| boolean | Declare variables and variable ordering |
| :---: | :---: |
| Extended naming |  |
| $\operatorname{var}[m . . n]$ | Numeric range (ascending or descending) |
| $\{s 1, s 2, \ldots\}$ | Enumeration |
| evaluate dest expr | dest $:=$ bdd for boolean expression expr |
| Operations | (decreasing precedence) |
| ! | Complement |
| $\wedge$ | Exclusive-Or |
| \& | And |
| + | Or |
| bdd funct | Print BDD DAG as lisp-like representation |
| sop funct | Print sum-of-products representation of funct |
| satisfy funct | Print all satisfying variable assignments of funct |
| verify $f 1 f 2$ | Verify that two functions $f 1 f 2$ are equivalent |
| size funct | Compute total BDD nodes for set of functions |
| replace dest funct var replace | Functional composition: dest $:=$ funct with variable var replaced by replace function output |
| quantify [ $\mathbf{u} \mid \mathbf{e}$ ] dest funct var ... | dest $:=$ Quantification of function funct over variables var ... |
| e | Existential quantification is done |
| u | Universal quantification is done |
| adder $n$ Cout Sums As Bs Cin | Compute functions for n -bit adder |
| $n$ | Word size |
| Cout | Carry output |
| Sums | Destinations for sum outputs: Sum.n ... Sum. 0 |
| As | A inputs: A.n-1 ... A. 2 A.1 A. 0 |
| Bs | B inputs: B.n-1 ... B. 2 B. 1 B.0 |
| Cin | Carry input |
| alu181 Cout Fs M Ss Cin As Bs | Compute functions for ' 181 TTL ALU |
| Cout | Destination for carry output |
| Fs | Destinations for function outputs: F. 3 F. 2 F. 1 F. 0 |
| M | Mode input |
| Ss | Operation inputs: S. 3 S. 2 S. 1 S. 0 |
| Cin | Carry input function |
| As | A inputs: A. 3 A. 2 A. 1 A. 0 |
| Bs | B inputs: B. 3 B. 2 B. 1 B. 0 |
| mux $n$ Out Sels Ins | Compute functions for 2 n -bit multiplexor |
| $n$ | Word size |
| Out | Destination for output function |
| Sels | Control inputs: Sel.n-1 ... Sel. 1 Sel. 0 |
| Ins | Data inputs: In. $2 n-1$... In. 1 In. 0 |
| quit | Exit KBDD |

So, what shall we use $k b d d$ to verify? It turns out that adders are a very good choice here, because they come in so many different styles, and we know that the BDD for a basic adder is very simple and small. Look at the appendix to this assignment for a description of a very interesting industrial-strength adder, from Grant McFarland's webpage at Stanford (http://umunhum.stanford.edu/~farland/notes.html). This adder has these features:

- 32-bit carry lookahead structure (CLA)
- Faster carry propagation through the use a so-called Ling adder structure
- Conditional carry propagation using a carry select multiplexor

This is a pretty scary looking adder-- but these are the kinds of tricks people really play to implement very fast adders for things like microprocessors. And, this is exactly what a "real world" industrial design description looks like.

Question: Does this thing really work? The description in the appendix is pretty detailed--but, hey, mistakes happen. To check this, but to keep the complexity down, we want you to verify 2 outputs from this complicated adder: the final carry out cout and the most significant bit of the sum $\mathbf{s 3 1}$. You don't need to verify any of the other bits.

This means, we want you to build the BDDs for all these two functions of the input bits ( $\mathbf{a 0}$ a31, b0-b31) and compare them to an "ordinary" adder. Are they identically the same Boolean functions? Use $k b d d$ to formally verify these 2 parts of this aggressive, custom 32-bit design. Include a listing of your session with $k b d d$ showing how you did this. Note especially that $k b d d$ has basic n-bit adders built in, so to create the "baseline" sum-bit equations is a simple operation. For example, to do a basic 4-bit adder, this will suffice:
kbdd: boolean a[3..0] b[3..0] s[3..0] cout cin
kbdd: adder 4 cout s[3..0] a[3..0] b[3..0] cin
All the work is in creating the description adder in $k b d d$, and then comparing it appropriately.
Note: It's probably easiest to edit a script file that you then run through kbdd using the source kbdd command. In UNIX, if you type the following italics stuff:

## \% script

\% /afs/ece/class/ee760/bin/kbdd
kbdd: source myfilename
kbdd: quit
<you hit control-D>
then it will (1) start saving everything in a file called typescript, (2) run $k b d d$, (3) tell kbdd to run your commands in your file myfilename. You type control-D to stop the script saving. You can then print this typescript file and hand it in, and include some comments so when we read it, we understand what you did.

## 7. Multi-Terminal BDDs [10 pts]

BDDs come in a number of specialized variants, one of which is the Multi-Terminal BDD, or MTBDD.

MTBDDs are a generalization of BDDs that allow an arbitrary number of real-valued terminals instead of just the basic binary terminals, 0 and 1 . While a BDD represents function that returns a Boolean value for any assignment to its variables, an MTBDD returns a real number. More formally:

$$
\begin{array}{ll}
\text { BDDs } & \mathrm{F}: \mathrm{B}^{\mathrm{n}} \rightarrow \mathrm{~B} \\
\text { MTBDDs } & \mathrm{F}: \mathrm{B}^{\mathrm{n}} \rightarrow \mathrm{R}
\end{array}
$$

Suppose you wanted to represent the following function:

$$
\begin{aligned}
& F(a, b)=1.2 \text { for } a=1, b=0 \\
& F(a, b)=2.5 \text { for } a=0 \\
& F(a, b)=4.7 \text { for } a=1, b=1
\end{aligned}
$$

These numbers could have any number of meanings. One example would be the power in mW consumed by a particular circuit when it's input changes from "a" to "b". Or it could be the delay in ns, or the rise time of the output. The MTBDD for this function would be the following:


Furthermore, MTBDDs can be used to represent real-valued matrices fairly efficiently. To do so, we introduce $\log (\#$ rows $)+\log (\#$ columns) variables to encode the row position, and column position. For a simple 2 by 2 matrix, we get the following:


For very large matrices with lots of repeated numbers, the MTBDD representation can be considerably smaller than a simple listing of the values, and operations like matrix-addition or matrix-multiplication are correspondingly faster.

It turns out that working with MTBDDs is just as easy as with BDDs. To apply an arbitrary binary operator (function with two operands, like " $a+b$ "), we can use the same expansion and cofactoring rules as with BDDs. The cofactor of a function $F$ with respect to variable x , or $\mathrm{F}_{\mathrm{x}}$, is obtained by redrawing the MTBDD without all of the " x " nodes, and redirecting their incoming edges to the "hi" son. For example:


Using cofactors, we can decompose operations on MTBDDs in exactly the same manner as for BDDs:

$$
\begin{aligned}
& \mathrm{F}=\mathrm{x} \mathrm{~F}_{\mathrm{x}}+\mathrm{x}^{\prime} \mathrm{F}_{\mathrm{x}}, \\
& \mathrm{~F}+\mathrm{G}=\mathrm{x}\left(\mathrm{~F}_{\mathrm{x}}+\mathrm{G}_{\mathrm{x}}\right)+\mathrm{x}^{\prime}\left(\mathrm{F}_{\mathrm{x}^{\prime}}+\mathrm{G}_{\mathrm{x}^{\prime}}\right) \\
& \mathrm{F} * \mathrm{G}=\mathrm{x}\left(\mathrm{~F}_{\mathrm{x}} * \mathrm{G}_{\mathrm{x}}\right)+\mathrm{x}^{\prime}\left(\mathrm{F}_{\mathrm{x}^{\prime}} * \mathrm{G}_{\mathrm{x}^{\prime}}\right)
\end{aligned}
$$

This results in a nice recursive algorithm for performing arbitrary operations on two input MTBDDs that looks remarkably similar to algorithms for BDDs.

## Do this:

1.) Draw the MTBDDs for the two following functions:

$$
\left.\begin{array}{ll}
F(a, b)= & 3 \text { for } a=0 \\
0 \text { for } a=1, b=0 \\
4 \text { for } a=1, b=1
\end{array}\right] \begin{aligned}
& \\
& G(a, b)=\begin{array}{l}
\text { for } b=1 \\
4 \text { for } b=0
\end{array}
\end{aligned}
$$

2.) Compute and draw the resulting MTBDDs for $\mathrm{H}=\mathrm{F}+\mathrm{G}$ and $\mathrm{M}=\mathrm{F} * \mathrm{G}$
(Hint: This will be easier if you think about what the results should be for each of the assignments to a and b first, and then try drawing the MTBDD. )

## 8. Metaproducts [20 pts]

A wonderful property of BDDs is that they only represent the "abstract" Boolean function, not the way you chose to implement it with logic gates. This is why BDDs are so useful for verification: you can implement $\mathbf{x}+\mathbf{x}^{\prime}$, or $\mathbf{x}+\mathbf{y}+\mathbf{x}^{\prime} \mathbf{y}$ ', or just plain " 1 ", and in all these cases, you get the identical BDD.

Of course, sometimes we actually want to represent the way we have implemented something as logic gates. Suppose we really want to represent--for whatever odd reason--the SOP expression $=\mathbf{x}+\mathbf{x}^{\prime}$. Is there any way we can represent this directly, without immediately resolving it to the " 1 " function? In other words, can we preserve the SOP product structure of this expression? The answer is "yes" -- sort of.

We need a new representation of the function that "records" the SOP structure, but which also behaves as much like a BDD as possible. It turns out there is a very elegant trick for recasting the original function in a new set of variables, and then just representing this new function as a BDD, that does much of what we want it to do. This new "SOP preserving" structure is called metaproduct notation. (The idea is due to Olivier Coudert, originally of Bull Research Center in France.) Here's the trick:

- For each variable $\mathbf{x}$ in your SOP form, the metaproduct formula has 2 different variables: $\mathbf{r}_{\boldsymbol{x}}$ and $\mathbf{s}_{\boldsymbol{x}} . \mathbf{r}_{\boldsymbol{x}}$ is the occurence variable for $\mathbf{x} ; \mathbf{s}_{\boldsymbol{x}}$ is the sign variable for $\mathbf{x}$.
- Suppose your function is $\mathbf{f}(\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{w})$. For each product term in your SOP form, for example $\mathbf{x y} \mathbf{y}^{\prime}$ ', you get a corresponding metaproduct term.
If your literal is in positive form, like $\mathbf{x}$, you get $\left(\mathbf{r}_{x} \mathbf{s}_{x}\right)$ in the metaproduct.
If your literal is in negative form, like $\mathbf{x}^{\prime}$, you get ( $\mathbf{r}_{\boldsymbol{x}} \mathbf{s}_{\boldsymbol{x}}{ }^{\prime}$ ) in the metaproduct.
If a variable is missing from the product, like $\mathbf{w}$, you get $\left(\mathbf{r}_{w}{ }^{\prime}\right)$ in the metaproduct. (It turns out the $\mathbf{s}_{w}$ variable doesn't matter in this case, since $\mathbf{w}$ is not present, sign doesn't matter)
- So, $\mathbf{x y}{ }^{\prime} \mathbf{z}^{\prime}$ would get transformed into ( $\mathbf{r}_{x} \mathbf{s}_{x} \mathbf{r}_{y} \mathbf{s}_{\boldsymbol{y}}{ }^{\prime} \mathbf{r}_{z} \mathbf{s}^{\mathbf{s}}{ }_{z} \mathbf{r}_{w}{ }^{\prime}$ ) in metaproduct form.

Read $\mathbf{r}_{\boldsymbol{x}}=\mathbf{1}$ as meaning "the variable $\mathbf{x}$ occurs in the product." Read $\mathbf{r}_{\boldsymbol{x}}=\mathbf{0}$ as meaning "the variable $\mathbf{x}$ does not occur in the product." Similarly, read $\mathbf{s}_{\boldsymbol{x}}=0$ as "the polarity of x is positive" and $\mathbf{s}_{\boldsymbol{x}}=1$ as "the polarity of x is negative". For example:


So, for example, if we actually tried to represent $\mathbf{f}(\mathbf{x})=\left(\mathbf{x}+\mathbf{x}^{\prime}\right)$ we would get ( $\left.\mathbf{r}_{\boldsymbol{x}} \mathbf{s}_{\boldsymbol{x}}+\mathbf{r}_{x} \mathbf{s}_{\boldsymbol{x}}{ }^{\prime}\right)$ for the metaproduct form. To manipulate this, we represent it as a BDD. we get one more rule here:

- Interpret the paths from the BDD root to the " 1 " leaf as specifying the individual product terms in the metaproduct form. If a variable is omitted on a path, you need to include it in both polarities in the final metaproduct.

This sounds more complicated than it really is. Let's try it. Do this:

- Draw the BDD for the metaproduct for the single-variable function $\mathbf{f}(\mathbf{x})=\mathbf{x}+\mathbf{x}$. Show how walking the paths from root to " 1 " leaf generates the correct metaproduct for this SOP form.
- Using what you know about BDDs, complement this BDD for this metaproduct. Again, walk the paths from root to " 1 " and generate the new metaproduct for $f^{\prime}(x)$. Interpret the result -- does this makes sense? Why?
- Draw the BDD for the metaproduct of the 4 variable function:

$$
\mathbf{f}(\mathbf{x}, \mathbf{y}, \mathbf{z}, \mathbf{w})=\mathbf{y w} \mathbf{w}^{\prime}+\mathbf{x z} \mathbf{w}^{\prime}+\mathbf{x y} \mathbf{y}^{\prime} \mathbf{z} \mathbf{w}^{\prime}
$$

Again, show how the paths from root to leaf in this BDD generate the correct metaproduct. (It's OK to use kbdd for this one, as long as you can figure out the BDD structure yourself.)

- Again, complement this BDD , and show that the result makes sense as a metaproduct. (For this one, you might want to use kbdd -- if it's too messy to do by hand.)


## CLA and Ling Adders

## 1 Introduction

One of the most popular designs for fast integer adders are Carry-Look-Ahead adders. Rather than waiting for carry signals to ripple from the least significant bit to the most significant bit, CLA adders divided the inputs into groups of $r$ bits and implement the logic equations to determine if each group will generate or propagate a carry. By combining the generate and propagate signals of $r$ groups at with each successive stage of logic, a CLA adder can derive the carrys into each bit in order $\log _{r} n$ gates instead of order $n$ for a ripple carry adder. This paper discusses the design of a very simple 32 bit CLA adder, some improvements that can be made to that adder, and a variation of CLA adders known as Ling adders.

## 2 A Simple CLA Adder

An overview of the adder's 4 stages is shown in figure 1 with stage 1 and the top and stage 4 at the bottom. In stage 1 the local generate and propagate signals for each bit are created. In stage 2 these signals are combined to create generate and propagate signals out of each group of 3 bits. In stage 3 the group signals are combined into 9 bit block signals. In stage 4 the carry into each block is calculated and these signals begin traveling back up the adder tree. In stage 3 the carry into each group is created, and in stage 2 the carry into each bit is created. Finally, stage 1 uses the local carry signals to calculate the final sum bits.


Figure 1: CLA Adder

### 2.1 Generate and Propagate Signals

In the first stage of logic the adder must calculate the local generate and propagate signals ( $g_{i}$ and $p_{i}$ ) which tell if each bit will generate a carry into the next bit or propagate a carry from the previous bit.

$$
\begin{align*}
g_{i} & =a_{i} b_{i}  \tag{1}\\
p_{i} & =a_{i}+b_{i} \tag{2}
\end{align*}
$$

In stage 2 these signals are then combined into group generates and propagates ( $G G_{i}$ and $P G_{i}$ ) for each of the ten groups as follows:

$$
\begin{align*}
& G G_{0}=g_{1}+p_{1}\left(g_{0}+p_{0} c_{I N}\right)  \tag{3}\\
& G G_{1}=g_{4}+p_{1}\left(g_{3}+p_{3} g_{2}\right)  \tag{4}\\
& \vdots  \tag{5}\\
& G G_{10}=g_{31}+p_{31}\left(g_{30}+p_{30} g_{29}\right)  \tag{6}\\
& P G_{1}=p_{4} p_{3} p_{2}  \tag{7}\\
& P G_{2}=p_{7} p_{6} p_{5}  \tag{8}\\
& \vdots \\
& P G_{10}=p_{31} p_{30} p_{29}
\end{align*}
$$

where $c_{I N}$ is the carry in signal to the least significant bit. Since $c_{I N}$ is included in $G G_{0}$, no group propagate signal from group 0 is needed. The group propagate signals are formed with a simple 3 input AND gate. The group generate signals are formed with the fanin- 3 generate gate shown in figure 2. In stage 3 these signals are used to create the block generate and propagate signals ( $G B_{i}$ and $P B_{i}$ ).

$$
\begin{align*}
G B_{0}= & G G_{2}+P G_{2}\left(G G_{1}+P G_{1} G G_{0}\right)  \tag{9}\\
G B_{1}= & G G_{5}+P G_{5}\left(G G_{4}+P G_{4} G G_{3}\right)  \tag{10}\\
G B_{2}= & G G_{8}+P G_{8}\left(G G_{7}+P G_{7} G G_{6}\right)  \tag{11}\\
G B_{3}= & G G_{10}+P G_{10} G G_{9}  \tag{12}\\
& P B_{1}=P G_{5} P G_{4} P G_{3}  \tag{13}\\
& P B_{2}=P G_{8} P G_{7} P G_{6}  \tag{14}\\
& P B_{3}=P G_{10} P G_{9} \tag{15}
\end{align*}
$$

All the blocks can use the same fanin- 3 generate gate and 3 input AND gate used in the previous stage except for block 3 which contains only two groups. Its propagate signal requires only a 2 input OR, and its generate is create using a fanin-2 generate gate shown in figure 3. Having created the block generate and propagate signals, the adder begins to finally create the true carry signals.


Figure 2: FanIn-3 Generate Gate


Figure 3: FanIn-2 Generate Gate

### 2.2 Carry Signals

In stage 4 the block generate and propagate signals are used to create the carry signals into each block ( $C B_{i}$ ).

$$
\begin{align*}
C B_{1} & =G B_{0}  \tag{16}\\
C B_{2} & =G B_{1}+P B_{1} G B_{0}  \tag{17}\\
C B_{3} & =G B_{2}+P B_{2}\left(G B_{1}+P B_{1} G B_{0}\right)  \tag{18}\\
C_{\text {OUT }} & =G B_{3}+P B_{3} C B_{3} \tag{19}
\end{align*}
$$

where $C_{\text {OUT }}$ is the overflow carry out of the entire adder. These signals then begin to travel back up the adder stages, first forming the carry into each group $\left(C G_{i}\right)$ in stage 3 . For block 2 these equations are as follows:

$$
\begin{align*}
C G_{6} & =C B_{2}  \tag{20}\\
C G_{7} & =G G_{6}+P G_{6} C B_{2}  \tag{21}\\
C G_{8} & =G G_{7}+P G_{7}\left(G G_{6}+P G_{6} C B_{2}\right) \tag{22}
\end{align*}
$$

In stage 2 these group carry signals are used to form the local carry into each bit ( $c_{i}$ ). For group 8 these equations are as follows:

$$
\begin{align*}
c_{23} & =C G_{8}  \tag{23}\\
c_{24} & =g_{23}+p_{23} C G_{8}  \tag{24}\\
c_{25} & =g_{24}+p_{24}\left(g_{23}+p_{23} C G_{8}\right) \tag{25}
\end{align*}
$$

All of these signals can be created using the fanin- 3 and fanin- 2 generate gates shown in figures 2 and 3. This means each center group and block will use one fanin-3 gate and one OR gate to create generate and propagate signals for the stage below, and one fanin- 3 gate and one fanin- 2 gate to create carry signals for the stage above. The wiring of these groups and blocks is shown in figure 4 for group 1. hen the local carry signals reach stage 1 , they are used to create the final sum bits $\left(s_{i}\right)$ according to the equations:

$$
\begin{align*}
t_{i} & =a_{i} \oplus b_{i}  \tag{26}\\
s_{i} & =t_{i} \oplus c_{i} \tag{27}
\end{align*}
$$

### 2.3 Critical Path

The worst case inputs for this adder are when $a_{i} \oplus b_{i}=1$ for all the input bits and then $c_{I N}$ is toggled. The local generate signals require 3 series transistors to form. For an $N$ bit CLA adder combining $r$ groups at each level, the generate signals must travel up $\left\lceil\log _{r} N\right\rceil-1$ levels of $r+1$ series transistors each. Then the signal travels down $\left\lceil\log _{r} N\right\rceil-2$ levels of no more than $r+1$


Figure 4: Group 1
series transistors. Final, the XOR to form the local sums takes 2 series transistors. Therefore, the maximum number of series transistors in the critical path can be written as:

$$
\begin{align*}
& T_{d}=3+\left(\left\lceil\log _{r} N\right\rceil-1\right)(r+1)+\left(\left\lceil\log _{r} N\right\rceil-2\right)(r+1)+2  \tag{28}\\
& T_{d}=\left(2\left\lceil\log _{r} N\right\rceil-3\right)(r+1)+5 \tag{29}
\end{align*}
$$

For a 32 bit adder with $r=3$ as described in this paper this equation gives a maximum of 25 transistors. The true critical path is 24 transistors since block 3 contains only 2 groups instead of 3. The critical path is shown in table 1. Although faster designs are possible, this adder has the

| Operation | Signal | Delay | Total |
| :---: | :---: | :---: | :---: |
| Local Generate | $g_{i}$ | 3 | 3 |
| Group Generate | $G G_{i}$ | 4 | 7 |
| Block Generate | $G B_{i}$ | 4 | 11 |
| Block Carry | $C B_{3}$ | 4 | 15 |
| Group Carry | $G G_{10}$ | 3 | 18 |
| Local Carry | $c_{31}$ | 4 | 22 |
| Local Sum | $s_{31}$ | 2 | 24 |

Table 1: Simple CLA Critical Path
advantage of a relatively simple layout and wiring. The next section discusses changes which can
be made in this design to improve performance.

## 3 An Improved CLA Adder

The critical path delay of the simple CLA adder design presented in the previous section can be reduced significantly at the price of making the layout and wiring more complex.

### 3.1 Single Stage Group Generate

The first improvement to be made is using a single complex gate to create the group generate and propagate signals in a single stage directly from the adder inputs. In the simple design the expression used for the group 1 generate signal was as follows:

$$
\begin{equation*}
G G_{1}=g_{4}+p_{4}\left(g_{3}+p_{3} g_{2}\right) \tag{30}
\end{equation*}
$$

Expanding this in terms of the adder inputs gives:

$$
\begin{equation*}
G G_{1}=a_{4} b_{4}+\left(a_{4}+b_{4}\right)\left[a_{3} b_{3}+\left(a_{3}+b_{3}\right) a_{2} b_{2}\right] \tag{31}
\end{equation*}
$$

This equation can be implemented by an NMOS network containing 4 series transistors followed by an inverter. The PMOS network must implement the complement of this function, which normally would also require 4 series transistors. However, the relation $\overline{g_{i}} \overline{p_{i}}=\overline{p_{i}}$ can be used to simply the expression for $\overline{G G_{1}}$ as follows:

$$
\begin{align*}
& \overline{G G_{1}}=\overline{g_{4}}\left[\overline{p_{4}}+\overline{g_{3}}\left(\overline{p_{3}}+\overline{g_{2}}\right)\right]  \tag{32}\\
& \left.\left.\overline{G G_{1}}=\overline{p_{4}}+\overline{g_{4}} \overline{p_{3}}+\overline{g_{3}} \overline{g_{2}}\right)\right]  \tag{33}\\
& \overline{G G_{1}}=\overline{a_{4}} \overline{b_{4}}+\left(\overline{a_{4}}+\overline{b_{4}}\right)\left[\overline{a_{3}} \overline{b_{3}}+\left(\overline{a_{3}}+\overline{b_{3}}\right)\left(\overline{a_{2}}+\overline{b_{2}}\right)\right] \tag{34}
\end{align*}
$$

This simplified expression can be implemented by a PMOS network with 3 series transistors followed by an inverter. The gate implementing the group generate for group 1 is shown in figure 5 . The gate implementing the group propagate is shown in figure 6. This change reduces the total number of series transistors used in forming the group generate signals from 7 to 5 .

### 3.2 Carry Select Mux

The second improvement eliminates the need to travel back up the adder tree after the block carrys have been formed. This is done by generating two sets of sum bits. One set assumes the carry into each block will be 0 , and the other set assumes it will be 1 . This can occur in parallel with the generation of the block carrys which are then used to control a mux which selects the proper set of sum bits. This is the same method used in carry select adders.


Figure 5: CLA Group Generate


Figure 6: CLA Group Propagate

In the simple CLA adder the equations implemented by group carry, local carry, and final sum stages for bit 23 are as follows:

$$
\begin{align*}
& s_{23}=t_{23} \oplus c_{23}  \tag{35}\\
& s_{23}=t_{23} \oplus C G_{8}  \tag{36}\\
& s_{23}=t_{23} \oplus\left[G G_{7}+P G_{7}\left(G G_{6}+P G_{6} C B_{2}\right)\right] \tag{37}
\end{align*}
$$

This expression is converted to a mux controlled by $C B_{2}$ by defining the signals $C G F_{8}$ and $C G T_{8}$ :

$$
\begin{align*}
& C G F_{8}=G G_{7}+P G_{7} G G_{6}  \tag{38}\\
& C G T_{8}=G G_{7}+P G_{7}\left(G G_{6}+P G_{6}\right) \tag{39}
\end{align*}
$$

The signal $C G F_{8}$ is the carry into group 8 assuming the block carry is zero, and $C G T_{8}$ assumes the block carry is one. The final sum bit is then written as:

$$
\begin{equation*}
s_{23}=\overline{C B_{2}}\left[C G F_{8} \oplus t_{23}\right]+C B_{2}\left[C G T_{8} \oplus t_{23}\right] \tag{40}
\end{equation*}
$$

Using these signals, the other sum bits of the group are written in similar fashion.

$$
\begin{align*}
s_{24} & =\overline{C B_{2}}\left[\left(g_{23}+p_{23} C G F_{8}\right) \oplus t_{24}\right]+C B_{2}\left[\left(g_{23}+p_{23} C G T_{8}\right) \oplus t_{24}\right]  \tag{41}\\
s_{25} & =\overline{C B_{2}}\left[\left(g_{24}+p_{24}\left(g_{23}+p_{23} C G F_{8}\right)\right) \oplus t_{24}\right]+C B_{2}\left[\left(g_{24}+p_{24}\left(g_{23}+p_{23} C G T_{8}\right)\right) \oplus t_{24}\right] \tag{42}
\end{align*}
$$

Because the signals $C G F_{8}$ and $C G T_{8}$ will appear after the local generate and propagate signals, the critical path delay can be further reduced by applying the same principal to make the inputs to the mux controlled by the block carry muxes controlled by $C G F_{8}$ and $C G T_{8}$. This also allows the simplification of $g_{i}+p_{i}=p_{i}$ to be applied.

$$
\begin{align*}
s_{23}= & \overline{C B_{2}}\left[\overline{C G F_{8}} t_{23}+C G F_{8} \overline{t_{23}}\right]+ \\
& C B_{2}\left[\overline{C G T_{8}} t_{23}+C G T_{8} \overline{t_{23}}\right]  \tag{43}\\
s_{24}= & \overline{C B_{2}}\left[\overline{C G F_{8}}\left(g_{23} \oplus t_{24}\right)+C G F_{8}\left(p_{23} \oplus t_{24}\right)\right]+ \\
& C B_{2}\left[\overline{C G T_{8}}\left(g_{23} \oplus t_{24}\right)+C G T_{8}\left(p_{23} \oplus t_{24}\right)\right]  \tag{44}\\
s_{25}= & \overline{C B_{2}}\left\{\overline{C G F_{8}}\left[\left(g_{24}+p_{24} g_{23}\right) \oplus t_{25}\right]+C G F_{8}\left[\left(g_{24}+p_{24} p_{23}\right) \oplus t_{25}\right]\right\}+ \\
& C B_{2}\left\{\overline{C G T_{8}}\left[\left(g_{24}+p_{24} g_{23}\right) \oplus t_{25}\right]+C G T_{8}\left[\left(g_{24}+p_{24} p_{23} \oplus t_{25}\right]\right\}\right. \tag{45}
\end{align*}
$$

The 3 bit slice which implements these functions is shown for group 8 in figure 7 . sing the bit slice eliminates the need to go back up the adder tree after forming the block carrys, and reduces the critical path after the block carrys to a single mux delay.

Because of the reduced delay from the formation of the block carrys to the final sum output, CoUT can no longer be implemented as a function of $C B_{2}$ as shown in equation 19 without becoming the critical path. To avoid this a fanin-4 generate gate is used to form CoUT directly from the block generates and propagates.

$$
\begin{equation*}
C_{O U T}=G B_{3}+P B_{3}\left[G B_{2}+P B_{2}\left(G B_{1}+P B_{1} G B_{0}\right)\right] \tag{46}
\end{equation*}
$$

This gate is shown in figure 8 and removes COUT from the critical path.


Figure 7: Sum Selection Slice


Figure 8: FanIn-4 Generate Gate

### 3.3 Critical Path

With a single stage group generate the critical path must still pass up $\left\lceil\log _{r} N\right\rceil-1$ levels. Of these the first level will contain $r+2$ series transistors and the others $r+1$. The carry select mux eliminates the need to travel back up the levels of the adder to form the local carries. The mux delay from the arrival of the control signal is counted as one series transistor to form the complement of the control signal and one transistor to pass the input to the output. The number of series transistors in the critical path is therefore:

$$
\begin{equation*}
T_{d}=\left(\left\lceil\log _{r} N\right\rceil-1\right)(r+1)+3 \tag{47}
\end{equation*}
$$

For the 32 bit adder shown here with $r=3$ this gives 15 series transistors. Using the single stage group generate eliminates 2 series transistors, and the carry select mux reduces the delay from the formation of the block carries from 9 series transistors to 2 . The total critical path is reduced by 9 series transistors from a total of 24 to 15 . The new critical path is shown in table 2.

| Operation | Signal | Delay | Total |
| :---: | :---: | :---: | :---: |
| Group Generate | $G G_{i}$ | 5 | 5 |
| Block Generate | $G B_{2}$ | 4 | 9 |
| Block Carry | $C B_{3}$ | 4 | 13 |
| Result Mux | $s_{31}$ | 2 | 15 |

Table 2: Improved CLA Critical Path

## 4 A Ling Adder

One final improvement that can be made to CLA design is the use of a pseudo-carry as proposed by Ling[1, 2]. This method allows a single local propagate signal to be removed from the critical path. To show how this is done the group generate signal for group 1 is shown below:

$$
\begin{equation*}
G G_{1}=g_{4}+p_{4} g_{3}+p_{4} p_{3} g_{2} \tag{48}
\end{equation*}
$$

Ling observed that each term in $G G_{1}$ contains $p_{4}$ except for the very first term which is simply $g_{4}$. However, $p_{4}$ can still be factored out of this expression by noting that $g_{i}=p_{i} g_{i}$.

$$
\begin{align*}
& G G_{1}=p_{4} G G_{1}^{*}  \tag{49}\\
& G G_{1}^{*}=g_{4}+g_{3}+p_{3} p_{2} \tag{50}
\end{align*}
$$

The Ling group generate signal $\left(G G_{1}^{*}\right)$ is simpler and can be calculated more quickly than the CLA group generate signal. When expanded out the CLA and Ling group generates are as follows:

$$
\begin{align*}
G G_{1} & =a_{4} b_{4}+\left(a_{4}+b_{4}\right)\left[a_{3} b_{3}+\left(a_{3}+b_{3}\right) a_{2} b_{2}\right]  \tag{51}\\
G G_{1}^{*} & =a_{4} b_{4}+a_{3} b_{3}+\left(a_{3}+b_{3}\right) a_{2} b_{2} \tag{52}
\end{align*}
$$

The gate used to implement the group generate signal is shown in figure 9 and has one less series transistor than the equivalent CLA gate shown in figure 5 . he Ling group propagate signals ( $P G_{i}^{*}$ )


Figure 9: Ling Group Generate
are formed using the same gates as in the CLA design, but they are shifted one bit to the right. The CLA and Ling group propagate signals for group one are shown below.

$$
\begin{align*}
P G_{1} & =p_{4} p_{3} p_{2}  \tag{53}\\
P G_{1}^{*} & =p_{3} p_{2} p_{1} \tag{54}
\end{align*}
$$

These Ling group generate and propagate signals are then combined in the same manner as before to create block carry signals.

$$
\begin{align*}
C B_{1}^{*} & =G B_{0}^{*}  \tag{55}\\
C B_{2}^{*} & =G B_{1}^{*}+P B_{1}^{*} G B_{0}^{*}  \tag{56}\\
C B_{3}^{*} & =G B_{2}^{*}+P B_{2}^{*}\left(G B_{1}^{*}+P B_{1}^{*} G B_{0}^{*}\right)  \tag{57}\\
C_{O U T}^{*} & =G B_{3}^{*}+P B_{3}^{*}\left[G B_{2}^{*}+P B_{2}^{*}\left(G B_{1}^{*}+P B_{1}^{*} G B_{0}^{*}\right)\right] \tag{58}
\end{align*}
$$

The true COUT is simply $p_{31} C_{O U T}^{*}$ which could be formed with a simple AND gate, but this would make it the critical path. Instead, the final group generate signal $\left(G G_{10}\right)$ is formed using the CLA expression rather than the Ling group generate. Also the final group propagate ( $P G_{10}^{*}$ ) is formed with a 4 input AND instead of a 3 input AND to include $p_{31}$. These changes allow the true Cout to be formed from the block generate and propagate signals as shown above without making it the critical path.

The final change that must be implemented to complete the Ling adder is to insert into the sum logic the local propagate signal which was factored out of each group generate. This is done simply by ANDing the $C G F_{i}^{*}$ and $C G T_{i}^{*}$ signals formed from the Ling group generate and propagates with the local propagate signal of the most significant bit of the previous group. This change is shown in figure 10 which depicts the sum selection logic for group 8 of the Ling adder.

### 4.1 Critical Path

The only difference in the critical path of the improved CLA and the Ling adder is the use of the Ling group generate is the first stage as shown in table 3. This allows the group generate signals to be formed in $r+1$ series transistors instead of $r+2$. The changes in the sum selection logic are off the critical path and have no effect on the total delay. Therefore, the series transistors in the critical path can be written as:

$$
\begin{equation*}
T_{d}=\left(\left\lceil\log _{r} N\right\rceil-1\right)(r+1)+2 \tag{59}
\end{equation*}
$$

For a 32 bit adder with $r=3$ the net improvement of a Ling adder over the improved CLA adder is a total delay of 14 series transistors instead of 15 .

| Operation | Signal | Delay | Total |
| :---: | :---: | :---: | :---: |
| Group Generate | $G G_{i}$ | 4 | 4 |
| Block Generate | $G B_{2}$ | 4 | 8 |
| Block Carry | $C B_{3}$ | 4 | 12 |
| Result Mux | $s_{31}$ | 2 | 14 |

Table 3: Ling Critical Path


Figure 10: Ling Sum Selection Slice

## References

[1] H. Ling. High speed binary parallel adder. IEEE Transactions on Computers, EC-15(5):799802, October 1966.
[2] H. Ling. High speed binary adder. IBM Journal of Research and Developement, 25(3):156-166, May 1981.
[3] R. Brent and H. Kung. A regular layout for parallel adders. IEEE Transactions on Computers, C-31(3):260-264, March 1982.
[4] G. Bewick, P. Song, G. DeMicheli, and M. Flynn. Approaching a nanosecond: A 32-bit adder. In Proceedings of the International Conference on Computer Design, pages 221-224, 1988.
[5] I. Hwang and A. Fisher. A 3.1 ns 32 b CMOS adder in multiple output domino logic. In International Solid State Circuits Conference, pages 140-141, 1988.
[6] A. Omondi. Computer Arithmetic Systems: Algorithms, Architecture and Implementations. Prentice Hall, 1994.
[7] N. Quach and M. Flynn. High-speed addition in CMOS. Technical Report CSL-TR-90-415, Stanford University, February 1990.
[8] S. Waser and M. Flynn. Introduction to Arithmetic for Digital Systems Designers. Holts, Rinehart and Winston, 1982.

