# CMOS CIRCUIT SPEED AND POWER OPTIMIZATION 

 USING SIMPLIFIED RC DELAY MODELBy<br>SUNIL KUMAR LAKKAKULA<br>Bachelor of Technology<br>Electrical \& Electronics Engineering<br>Acharya Nagarjuna University<br>Guntur, Andhra Pradesh, India 2007<br>Master of Science<br>Electrical Engineering<br>Oklahoma State University<br>Stillwater, Oklahoma 2009

Submitted to the Faculty of the Graduate College of the
Oklahoma State University in partial fulfillment of the requirements for the Degree of
DOCTOR OF PHILOSOPHY
May, 2015

# CMOS CIRCUIT SPEED AND POWER OPTIMIZATION USING SIMPLIFIED RC DELAY MODEL 

Dissertation Approved:

| Dr. Louis G. Johnson |
| :---: |
| Dissertation Adviser |
| Dr. Gary Yen |
| Dr. Rama Ramakumar |

Dr. Blayne Mayfield

## ACKNOWLEDGEMENTS

First, I would like to thank my adviser Dr. Louis G. Johnson for his wonderful support, guidance and encouragement throughout my PhD program. I am honored to be his PhD student. I also thank him for providing me with financial assistance through Graduate Research Assistantship during most of my PhD program.

I would like to thank my committee members Dr. Gary Yen, Dr. Rama Ramakumar and Dr. Blayne Mayfield for serving on my committee and providing their valuable comments and feedback on my work.

I am also thankful to the Department of Campus Life and the Service-Learning Volunteer Center at Oklahoma State University for allowing me to work as a Graduate Teaching Assistant during which I have learning great community skills and volunteer skills.

I am also very grateful and thankful to the following people who played a major role in my life during my stay in the United States. Firstly my friends Rajashekhar Yaramasu, Siddarth Kota, Brian Joseph, Dr. Satyanarayana Achanta, Satish Bhattiprolu, Satyashil Shinde, Dr. Anand Govindarajan, Dr. Upasana Manimegalai Sridhar, Dr. Kumar Singarapu, Jeet Turakhia, Suresh Kumar Jayaraman, Vijayalakshmi Sethuraman, Samyukta Koteeswaran, Ram Kumar Isakki for their support and cooperation. Also, Tim Huff, Regina Henry, Ruthie Loffi, Kent Sampson, Joyce Montgomery, Marie Basler from the Department of Campus Life for their valuable support and guidance for me in iii
Acknowledgements reflect the views of the author and are not endorsed by committee members or Oklahoma State University.
learning leadership and service skills. My labmates Dr. Julius Marpaung, Wira Mulia for their assistance in learning new technical skills. My American friends Alice Sharrock, Russ Sharrock, Janina Graves, Mason Williams, Grice's Family for their valuable friendship. My relatives in the United States for welcoming me and my wife into their homes with great love and affection.

Finally, I thank my parents Nageswara Rao Lakkakula and Lakshmi Lakkakula for their constant support and trust in me without which I have not come this far and be able to finish my PhD. I thank my brother Anil Kumar Lakkakula for his blessings from heaven. I also thank my friend and sister Surekha Bathula for her support. Last but not the least I thank my wife Mounika Tiruamalsetty for being with me and supporting me in finishing my PhD successfully.

Date of Degree: May, 2015

## Title of Study: CMOS CIRCUIT SPEED AND POWER OPTIMIZATION USING SIMPLIFIED RC DELAY MODEL

Major Field: ELECTRICAL ENGINEERING


#### Abstract

A simplified RC delay model which is expressed explicitly in terms of transistor widths is presented to easily perform circuit analysis and quickly optimize the transistor widths for delay and power without having to do tedious layout or schematic simulation. A novel heuristic gradient descent method is used to effectively solve the optimization problem. Popular parallel prefix adders such as Brent-Kung, Skylansky and Kogge-Stone adders are modeled using the proposed simplified RC delay model and a power-delay performance comparison is done after optimization to show the ability of the model. The optimization results suggest that the Brent-Kung adder is more efficient in terms of both delay and power by having the lowest power-delay product followed by the Skylansky adder and then the Kogge-Stone adder.


## TABLE OF CONTENTS

Chapter ..... Page
I. INTRODUCTION ..... 1
1.1 Background .....  2
1.1.1 MOSFET Operation and Characteristics ..... 2
1.1.2 Circuit Delay and Power ..... 4
II. REVIEW OF LITERATURE ..... 6
2.1 Review of device models ..... 6
2.2 Review of the Method of Logical Effort ..... 9
2.3 Review of the gradient descent based optimization ..... 13
2.4 Review of optimization techniques based on transistor sizing ..... 15
2.5 Review of parallel prefix adders and their performance ..... 18
III. METHODOLOGY ..... 21
3.1 Simplified RC delay Model ..... 21
3.2 Power Model ..... 31
3.3 Optimization Methodology ..... 33
3.4 Model Validation ..... 31
3.4.1 Lightly loaded case ..... 36
3.4.2 Heavily loaded case ..... 37
3.5 Convexity of the objective function. ..... 38
IV. PERFORMANCE EVALUATION \& OPTIMIZATION RESULTS ..... 40
4.1 Parallel prefix adders ..... 40
4.2 Optimization results ..... 43
V. CONCLUSION ..... 46
REFERENCES ..... 47
APPENDICES ..... 51
Appendix A ..... 51
Appendix B ..... 56
Appendix C ..... 61

## LIST OF TABLES

Table ..... Page
Table 2.1 Logical effort for inputs of static CMOS gates, assuming $\gamma=2$ ..... 10
Table 2.2 Parasitic delay estimates of different logic gates ..... 11
Table 3.1 Parasitic delay expressions for 2-input NAND gate ..... 22
Table 3.2 Capacitance values for various processes ..... 25
Table 3.3 CPU time comparison ..... 38

## LIST OF FIGURES

Figure ..... Page
Figure 1.1: Cross section of nMOS transistor .....  2
Figure 1.2: I-V characteristics of nMOS transistor .....  3
Figure 1.3: Plot defining circuit delay ..... 4
Figure 1.4: Example RC Circuit for Elmore Delay ..... 5
Figure 2.1: Switched-Resistor Transistor Model ..... 8
Figure 2.2: Piecewise Linear Model ..... 8
Figure 2.3: FO4 - Inverter driving four identical inverters ..... 11
Figure 2.4: Logic network consisting of three two-input NAND gates ..... 12
Figure 3.1: 2-input NAND gate ..... 22
Figure 3.2: Channel Propagation Capacitance ..... 24
Figure 3.3: Diffusion Capacitance with diffusion contact ..... 26
Figure 3.4: Diffusion Capacitance with no diffusion contact ..... 26
Figure 3.5: Diffusion Capacitance with shared diffusion contact ..... 27
Figure 3.6: $2 \times 4$ Decoder circuit with enable signal ..... 29
Figure 3.7: 3-input NAND gate and inverter ..... 29
Figure 3.8: OF Surface when minimizing delay ..... 34
Figure 3.9: Four inverter chain with a light load and no $\mathrm{C}_{\text {wire }}$ ..... 36
Figure 3.10: Model Vs SPICE power-delay plot for lightly loaded case ..... 36
Figure 3.11: Four inverter chain with a heavy load and large $\mathrm{C}_{\text {wire }}$ ..... 37
Figure 3.12: Model Vs SPICE power-delay plot for heavily loaded case ..... 37
Figure 4.1: 16-bit Brent-Kung adder ..... 42
Figure 4.2: 16-bit Skylansky adder ..... 42
Figure 4.3: 16-bit Kogge-Stone adder ..... 42
Figure 4.4: Power-delay comparison plot for 32-bit adder case ..... 44
Figure 4.5: Power-delay comparison plot for 64-bit adder case ..... 44
Figure 4.6: Power-delay comparison plot for 128-bit adder case ..... 45

## CHAPTER I

## INTRODUCTION

A major concern in VLSI circuit design is the delay and power dissipation of the circuit. Circuit optimization based on transistor sizing is one very useful method to optimize the circuit for speed and power dissipation to achieve better needed circuit performance. Hence a model which can be expressed explicitly in terms of transistor widths is required to effectively analyze the effects of changing the transistor widths on circuit speed and power to allow flexible tradeoff. Using RC transistor models to model and simulate a CMOS circuit is popular because of its simplicity and small computational burden compared to complex and computationally expensive BSIM models that are used in SPICE [1] simulators. This dissertation primarily presents the implementation of a simplified RC delay model to quickly estimate the CMOS circuit delay and its application in the circuit speed and power optimization by correctly sizing the transistor widths. In many cases there are several possible critical paths for the circuit to be optimized. There is no simple solution for optimum transistor sizes when multiple paths are optimized simultaneously. For better accuracy numerical techniques are needed to solve the optimization problem. It is much easier to use the numerical techniques with our simplified RC delay equations rather than SPICE for the following reasons: the delays are explicit functions of transistor widths making it obvious what happens when any width is changed; the width dependence of the delays is smooth so that simple numerical methods can be used to quickly find the optimum transistor widths; if the optimum widths found do not meet the specification, then there is probably no feasible solution and the specification must be changed.

Since addition forms the basis for many processing operations and adder circuits are of great interest to digital system designers, popular prefix adders such as Brent-Kung[2], Skylansky[3] and KoggeStone[4] adders are modeled using the proposed simplified RC delay model in this dissertation and a power-delay performance comparison was done to show the ability of the method in quickly optimizing the complex circuits like adders.

### 1.1 Background

Now, let's look at some back ground to understand how the MOS transistors work and how the circuit delay and power are calculated.

### 1.1.1 MOSFET Operation and Characteristics

A Complementary Metal Oxide Semiconductor (CMOS) circuit consists of two types of transistors, nMOS transistor and a pMOS transistor. nMOS is called n-channel MOSFET and the majority charge carriers are electrons. pMOS is called p-channel MOSFET and the majority charge carriers are holes. The design and operation of an nMOS transistor is discussed below.

In nMOS, the source and drain are n-type regions and the body is p-type region. With sufficient positive voltage at the gate, holes from the p-type body are driven away from the gate, forming an n-type channel between the p-type body and the oxide. This channel extends between the source and the drain, when a voltage is applied between the drain and the source current is conducted through it. Figure 1.1 shows the cross section of an nMOS transistor without and with a channel formation.

The operation of a MOSFET is categorized into three different regions or modes. They are Cut-off, Ohmic or linear and Saturation. Let us discuss the operation regions of an nMOS transistor.

Cut-off region $\left(\mathrm{V}_{\mathrm{GS}}<\mathrm{V}_{\text {th }}\right)$ : In this region the gate to source bias $\mathrm{V}_{\mathrm{GS}}$ is less than the threshold voltage $\mathrm{V}_{\mathrm{th}}$ and there is no significant conduction between drain and source. Hence the transistor can be considered as turned off in this region and the drain current is equal to zero.


Figure 1.1: Cross section of nMOS transistor (a) without channel (b) with channel Ohmic region $\left(\mathrm{V}_{\mathrm{GS}}>\mathrm{V}_{\mathrm{th}}\right.$ and $\left.\mathrm{V}_{\mathrm{DS}}<\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{th}}\right)$ : In this region the gate to source bias is greater than the threshold voltage and the drain to source bias is less than the gate voltage $\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{th}}$. The transistor is turned on and the current conduction takes place between drain and source. In this region as the drain to source voltage increases the drain current $\left(\mathrm{I}_{\mathrm{D}}\right)$ also increases.

Saturation region $\left(\mathrm{V}_{\mathrm{GS}}>\mathrm{V}_{\mathrm{th}}\right.$ and $\left.\mathrm{V}_{\mathrm{DS}}>\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{th}}\right)$ : In this region the gate to source bias is greater than the threshold voltage and the drain to source bias is greater than the gate voltage $\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{th}}$. The transistor is turned on and the current conduction takes place between drain and source. In this region as the drain to source bias voltage increase there is almost no significant increase in the value of the drain current.

The current vs voltage (I-V) characteristics of an nMOS transistor are shown in Figure. 1.2.


Figure 1.2: I-V characteristics of nMOS transistor

In pMOS, the source and drain are p-type regions and body is n-type. The working of a pMOS transistor is similar to nMOS but with a negative voltage applied at its gate terminal.

### 1.1.2 Circuit Delay and Power

Delay: The usual definition of the circuit delay is the time difference between the half- $\mathrm{V}_{\mathrm{dd}}$ point of the circuit's input voltage waveform and the half- $\mathrm{V}_{\mathrm{dd}}$ point of the circuit's output voltage waveform.


Figure 1.3: Plot defining circuit delay
From the above figure, delay $=t-t_{1}$. To determine the delay of a circuit accurately the exact waveforms of the input and output are required. The process of determining the exact waveforms of input and output is computationally expensive. Hence Elmore[5] defined the delay at node $e, \mathrm{t}_{\mathrm{d} e}$, as the centroid of the output impulse response $e^{\prime}(t)$ curve which can be found independently of the exact waveforms. It is given by the expression shown in Eq. (1.1). It is also called the first moment of the impulse response. Elmore delay is a good approximation for delay in tree RC networks.

$$
\begin{equation*}
\operatorname{Delay}\left(t_{d e}\right)=\int_{0}^{\infty} t e^{\prime}(t) d t \approx \sum_{k} R_{e k} C_{k} \frac{\Delta V_{k}}{\Delta V_{e}} \tag{1.1}
\end{equation*}
$$

where $\mathrm{R}_{\mathrm{ck}}$ is defined as the resistance of the path to the $\mathrm{V}_{\mathrm{dd}} / \mathrm{Gnd}$ node shared by node $e$ and node $k . \mathrm{C}_{\mathrm{k}}$ is the $\mathrm{k}^{\text {th }}$ node capacitance. $\Delta \mathrm{V}_{\mathrm{k}}$ is the voltage difference between $\mathrm{V}_{\mathrm{dd}} / \mathrm{Gnd}$ node and the initial voltage at the node $k . \Delta \mathrm{V}_{\mathrm{k}}=\mathrm{V}_{\mathrm{k}}(0)$ for falling voltages and $\Delta \mathrm{V}_{\mathrm{k}}=\mathrm{V}_{\mathrm{dd}}-\mathrm{V}_{\mathrm{k}}(0)$ for rising voltages. Similarly $\Delta \mathrm{V}_{\mathrm{e}}$ is the voltage difference between $\mathrm{V}_{\mathrm{dd}} /$ Gnd node and the initial voltage at the node $e . \Delta \mathrm{V}_{\mathrm{e}}=\mathrm{V}_{\mathrm{e}}(0)$ for falling voltages and $\Delta V_{e}=V_{d d}-V_{e}(0)$ for rising voltages. When all of the node voltages start at the same voltage then the last voltage fraction drops out. For example consider the following RC network shown in Figure 1.3 and assume that all of the node voltages start at the same voltage.


Figure 1.3: Example RC Circuit for Elmore Delay
The Elmore delay at node 3 is given as $t_{d 3}=R_{1} C_{1}+\left(R_{1}+R_{2}\right) C_{2}+\left(R_{1}+R_{2}+R_{3}\right) C_{3}$

Power: The usual definition of power dissipation is the amount of heat energy dissipated in unit time. The average power dissipation $\mathrm{P}_{\text {avg }}$ is defined as shown in Eq. (1.2)

$$
\begin{equation*}
P_{a v g}=\frac{1}{T} \int_{0}^{T} I_{V d d}(t) V_{d d} d t \tag{1.2}
\end{equation*}
$$

where $\mathrm{I}_{\mathrm{Vdd}}(\mathrm{t})$ is the power supply current and $\mathrm{V}_{\mathrm{dd}}$ is the supply voltage.
The power dissipative components in CMOS circuits consist of off-state leakage power, dynamic power due to charging and discharging of node capacitances, short-circuit power, switching power due to parasitic capacitances and glitch power due to unequal arrival of signals. Estimating the power dissipation with accurate models is difficult and computationally expensive.

## CHAPTER II

## REVIEW OF LITERATURE

### 2.1 Review of device models

Prior to late 1960s, performance estimation techniques were computationally simple as the number of transistors on a single chip were small and used physical models based upon the approximate modeling of the physical phenomena within a transistor. As the number of transistors increases on a single chip, the complexity of the circuit also increases making the physical models inadequate for quantitative analysis as they are computationally expensive. A circuit simulator, like SPICE [1], handles the complex nonlinear physical models of the transistor and solves the whole circuit as a big matrix to get the node outputs. Usually no more than a few thousand transistors may be simulated in a reasonable amount of computation time. Initial attempts to create transistor models for the large circuit simulation used empirical models to estimate the delay and power. These models are entirely based upon curve fitting, using whatever functions and parameters that best fit the measured data. The use of empirical models limits the type of circuits that can be modeled. The disadvantages of using empirical models are lack of error control in the resulting models and difficulty in relating the performance values back to the circuit elements. To solve this problem, a combination of physical and empirical models is used to analyze the circuit's performance. In the Shockley square law model [6] the drain current $\mathrm{I}_{\mathrm{D}}$ is expressed as shown in Eq. 2.1.

$$
I_{D}=\left(\begin{array}{ccc}
0, & V_{G S} \leq V_{T H} & \text { Cutoff-region }  \tag{2.1}\\
K\left\{\left(V_{G S}-V_{T H}\right) V_{D S}-0.5 V_{D S}{ }^{2}\right\}, & V_{G S}>V_{T H}, V_{D S}<V_{D S \text { sat }} & \text { Ohmic-region } \\
0.5 K\left(V_{G S}-V_{T H}\right)^{2}, & V_{G S}>V_{T H}, V_{D S} \geq V_{D S \text { sat }} & \text { Saturation-region }
\end{array}\right.
$$

Where $\mathrm{V}_{\mathrm{DSsat}}=\mathrm{V}_{\mathrm{GS}}-\mathrm{V}_{\mathrm{TH}}$ is drain saturation voltage and $\mathrm{V}_{\mathrm{TH}}$ is threshold voltage. K is a drivability factor and is given by $\mu\left(\frac{\epsilon_{o x}}{t_{o x}}\right)\left(\frac{W}{L}\right)$, where $\mu$ denotes an effective mobility, $\epsilon_{o x}$ a dielectric constant of a gate oxide, $t_{o x}$ a gate oxide thickness, $W$ a channel width and $L$ channel length. In the alpha power law model [7], which is an extension of Shockley's square law model in the saturation region the drain current in the saturation region is modeled as shown in Eq. 2.2.

$$
\begin{equation*}
I_{D} \alpha\left(V_{G S}-V_{T H}\right)^{\alpha}, \quad V_{G S}>V_{T H}, V_{D S} \geq V_{D S s a t} \quad \text { Saturation-region } \tag{2.2}
\end{equation*}
$$

It includes the velocity saturation effects to predict the circuit behavior in the sub-micrometer regime which the original Shockley square law model does not take into account in case of deep submicron processes. The value of $\alpha$ is calculated directly from the measured data and usually lies between 2 and 1 . For deep submicron processes the value of $\alpha$ is close to 1 . By using this nonlinear device model the circuit simulation is computationally expensive and there is no closed form expression for determining the circuit delay which requires the waveforms for the input and output.

To further simplify the analysis problem and to improve the simulation speed, all non-linear elements were approximated by appropriate linear elements by performing small signal analysis. In the Switched-Resistor model used in the IRSIM circuit simulator [8], the transistor is modeled as a voltage controlled switch connected to a series resistance as shown in Figure 2.1.


Figure 2.1: Switched-Resistor Transistor Model

Hence a MOS circuit can be considered as an RC network and the delay through a circuit path can be computed by using the simple expression given by Elmore in Eq. (1.1). By modeling the transistor as a switched-resistor the delay and power analysis of a circuit can be done with much less computational burden and circuits with several hundreds of thousands of transistors can be simulated in a reasonable amount of time. Though this model is less accurate, as it approximates a non-linear transistor as a linear resistor but is very helpful in doing the circuit analysis with less computational burden and to quickly estimate the delay without needing to determine the input and output waveforms.

Reference [9] discusses a piecewise linear device model which is a modified switched resistor model with better accuracy. In this model the transistor is approximated as an open switch when the transistor is in the cutoff region, as a linear resistor when operating in the ohmic region and as a current source that is a function of input voltage when operating in the saturation region.


Figure 2.2: Piecewise Linear Model

The drain current $I_{D}$ using this model is given by Eq. 2.3.

$$
I_{D}=\left(\begin{array}{ccc}
0, & V_{G S}<V_{T H} & \text { Cutoff }- \text { region }  \tag{2.3}\\
a \cdot G \cdot V_{D S}, & V_{G S}>V_{T H}, V_{D S}<V_{D S s a t} & \text { Ohmic }- \text { region } \\
G_{m} \cdot\left(V_{G S}-V_{T H}\right), & V_{G S}>V_{T H}, V_{D S}>V_{D S s a t} & \text { Saturation - region }
\end{array}\right.
$$

This model also includes the input slope, effects due to short circuit current and velocity saturation. It also uses a linearized BSIM3 capacitance model. As the above includes all the major physical phenomena that are important in submicron and deep submicron regimes it is more accurate than the above discussed models but using this model is still computationally expensive and it also cannot produce a closed form expression to determine the circuit delay and hence requires input and output waveforms.

Hence using RC device models will be more useful for fast circuit analysis and efficient circuit optimization as the circuit delay can be defined as a closed form expression without needing the exact input and output waveforms.

### 2.2 Review of the Method of Logical Effort

Estimating the circuit delay early in the design process is very useful and can save a lot of time in designing a circuit meeting the required design specifications. As discussed earlier in section 2.1, using complex device models would make the delay estimation very difficult as there is no closed form solution for estimating the delay easily and quickly hence requiring the circuit layout simulation to be done to achieve the input and output waveforms to be able to estimate the delay. With this approach the time to successfully design a specific circuit would take very long to meet the required design specifications. The method of logical effort [10] is an easy way to estimate delay in CMOS circuits without requiring to do a tedious layout simulation and also allows early modifications to the circuit design to achieve the greatest speed or to meet any delay constraints by comparing delay estimates of different logic structures.

The delay incurred by a logic gate is comprised of two components, a fixed part called the parasitic delay $p$ and a part that is proportional to the load on the gate's output, called the effort delay or stage effort $f$. The total delay $d$ measured in units of $\tau$, is the sum of the effort and parasitic delays and is given by Eq. 2.4

$$
\begin{equation*}
d=f+p \tag{2.4}
\end{equation*}
$$

where $\tau$ is the delay of an inverter driving an identical inverter with no parasitics. The effort delay is given as $f=g h$ where $g$ is the logical effort and $h$ is the electrical effort given by $C_{o u} / C_{i n}$. Hence the delay through a single logic gate is

$$
\begin{equation*}
d=g h+p \tag{2.5}
\end{equation*}
$$

Logical effort is defined so that an inverter has a logical effort of 1 . The logical effort of any other logic gate tells how much worse it is at producing output current than is an inverter, given that each of its inputs may present only the same input capacitance as the inverter. The table 2.1 below shows the logical effort values for different logic gates with different inputs.

|  | Number of inputs |  |  |  |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Gate Type | $\mathbf{1}$ | $\mathbf{2}$ | $\mathbf{3}$ | $\mathbf{4}$ | $\mathbf{5}$ | $\mathbf{n}$ |  |
| Inverter | 1 |  |  |  |  |  |  |
| NAND |  | $4 / 3$ | $5 / 3$ | $6 / 3$ | $7 / 3$ | $(\mathrm{n}+2) / 3$ |  |
| NOR |  | $5 / 3$ | $7 / 3$ | $9 / 3$ | $11 / 3$ | $(2 \mathrm{n}+1) / 3$ |  |
| Multiplexer |  | 2 | 2 | 2 | 2 | 2 |  |

Table 2.1: Logical effort for inputs of static CMOS gates, assuming $\gamma=2$ [6]
where $\gamma$ is the ratio of an inverter's pull-up transistor width to pull-down transistor width.

The parasitic delay of a logic gate is fixed and is given as multiples of the parasitic delay of an inverter denoted as $p_{\text {inv }}$ and is typically 1.0 delay units. The table 2.2 shows crude estimates of
parasitic delay for a few logic gates. Though these not very accurate but is convenient for hand analysis.

| Gate Type | Parasitic Delay |
| :---: | :---: |
| Inverter | $\mathrm{p}_{\text {inv }}$ |
| n -input NAND | $\mathrm{np}_{\text {inv }}$ |
| n -input NOR | $\mathrm{np}_{\text {inv }}$ |
| n -way multiplexer | $2 \mathrm{np}_{\text {inv }}$ |

Table 2.2: Parasitic delay estimates of different logic gates

Let's look at how the delay of the logic gate can be estimated using the method of logical effort. For example consider a fanout-of-4 (FO4) inverter as shown in figure 2.3


Figure 2.3: FO4 - An inverter driving four identical inverters

Because each inverter is identical, $C_{\text {out }}=4 C_{\text {in }}$, so $h=4$. Since the logical effort $g$ of an inverter is 1 , the effort delay $f=g h=1 \mathrm{x} 4=4$. The parasitic delay for an inverter is $p=p_{\text {inv }}=1$. Hence the delay of an inverter is given by $d=f+p=4+1=5.0$ delay units.

The method of logical effort can be extended to multistage logical networks and it also can help reveal the best number of stages in a multistage network to obtain the least overall delay. The path logical effort $G$ is given by $G=\Pi g_{i}$ and the path electrical effort $H$ is given by $H=C_{\text {oul }} / C_{\text {in }}$ where $\mathrm{C}_{\text {in }}$ and $\mathrm{C}_{\text {out }}$ refer to the input and output capacitances of the path. In the case of estimating the path delay, the branching effort $b$ at the output of a logic gate needs to be taken into consideration to account for the fanout within a network. Hence the path branching effort $B$ is
given as $B=\Pi b_{i}$. Now the path effort delay $F$ can be defined as $F=G B H$ and the path parasitic delay $P$ is defined as $P=\sum p_{i}$. Having known all the terms, the path delay is given as

$$
\begin{equation*}
D=F+P \tag{2.6}
\end{equation*}
$$

In case of an N -stage logic network, the path delay is found to be the least when each stage in the path bears the same stage effort. i.e $f_{\text {opt }}=g_{i} h_{i}=F^{1 / N}$. Hence the minimum delay achievable along an N -stage logic network path is

$$
\begin{equation*}
D_{o p t}=N F^{l / N}+P \tag{2.7}
\end{equation*}
$$

Let's look at how the path delay in a multistage network can be estimated using the method of logical effort. For example consider the circuit as shown in figure 2.4


Figure 2.4: Logic network consisting of three two-input NAND gates

From the above discussion on multistage logic networks, to computer the path delay from A to B for the above circuit shown in figure 2.4
$G=\Pi g_{i}=(4 / 3)^{3} ; H=8 C / C=8 ; B=\Pi b_{i}=1 ; F=G B H=18.96 ; P=\sum p_{i}=3\left(2 p_{\text {inv }}\right)=6$

Hence least path delay is $D_{o p t}=N F^{1 / N}+P=3(18.96)^{1 / 3}+6=14.0$ delay units. Now let's look at how the transistors to be sized along the path to achieve this least delay. Since $f_{\text {opt }}=18.96^{1 / 3}=$ $8 / 3$, now $g_{3} h_{3}=8 / 3 ; h 3=8 / 3 g_{3} ; 8 C / z=8 / 3 g_{3} ; z=8 C\left(g_{3}\right)(8 / 3) ; z=8 C(4 / 3)(8 / 3)=4 C$. Similarly, $y=z(4 / 3)(8 / 3)=2 C$.

Though the method of logical effort is simple and effective in quickly estimating the circuit delay, it allows only fixed p transistor width to the n transistor width ratios $\gamma$ when determining the
transistor sizes while a more efficient design is possible when the p and n transistor widths can be chosen irrespective of the other. Also, when using the method of logical effort, there is no easy way to include the interconnect wire capacitance in estimating the circuit delay and hence makes it a less accurate method.

In this dissertation we present a simplified RC delay model that can easily take into account the interconnect wire capacitance in estimating the path delays and also allows the choice of p and n transistor widths irrespective of the other to be able to design a more efficient circuit.

### 2.3 Review of the gradient descent based optimization

Optimization is useful in achieving the minimum or maximum of an objective function (OF) by determining optimum values for the decision variables. In most cases it is required to minimize the function. When the OF is continuous w.r.t its decision variables, gradient based methods are very useful as the method tells clearly which direction to go in order to find a better OF value faster. In case of gradient descent method, one should take steps proportional to the negative of the gradient of the function at the current point to obtain a minimum OF value. Gradient descent method is useful in finding the local minimum of the function but does not guarantee that the solution found is a global optimum except when the OF is a convex function. When the OF is not continuous, the gradient of the function cannot be determined at all the values of the decision variables and hence direct search methods like particle swarm, leapfrogging can be used to optimize the function without requiring to evaluate the gradient of the function. Since the objective function in our analysis is continuous, gradient descent method can be used for optimization.

Let's look into the details of how the gradient descent method is implemented in general. Let $F(x)$ be a multivariable objective function that is differentiable in the neighborhood of any given point
$x_{n}$ in the decision variable space then the new point $x_{n+1}$ satisfying $F\left(x_{n}\right)>=F\left(x_{n+1}\right)$ in the direction of negative gradient of $F(x)$ for a small value of $\delta_{n}$ is defined as

$$
\begin{equation*}
x_{n+1}=x_{n}-\delta_{n}\left(F^{\prime}\left(x_{n}\right)\right) \tag{2.8}
\end{equation*}
$$

It takes several iterations to obtain the best possible minimum value for $F(x)$, and the value of $\delta_{n}$ can change in each iteration. The best value of $\delta_{n}$ can be chosen via a line search which is again an iterative process using the gradient descent method to obtain the optimum value for $\delta_{n}=\delta_{n}{ }^{*}$ that minimizes $F\left(x_{n}\right)$ in the negative direction of gradient of $F\left(x_{n}\right)$. Hence the Eq. 2.8 is now defined as

$$
\begin{equation*}
x_{n+1}=x_{n}-\delta_{n}{ }^{*}\left(F^{\prime}\left(x_{n}\right)\right) \tag{2.9}
\end{equation*}
$$

The Eq. 2.9 is the Cauchy method [11] popularly known as the steepest descent method. The convergence of the algorithm is decided using a small value $\varepsilon$ usually in the order of $10^{-5}$ such that $F^{\prime}\left(x_{n}\right)<=\varepsilon$ and/or $\Delta x=x_{n+1}-x_{n}<=\varepsilon$. Most often a fixed small value of $\delta_{n}=\delta$ is assumed in each iteration to reduce the convergence time. Though this method is simple and has the optimal property of finding the best minimum OF value it performs poorly in terms of convergence and may not converge in a reasonable amount of time when the OF is bumpy and has v-shaped minima, which is the case when minimizing the circuit delay in our analysis.

Reference [12] presented formulae to calculate the step-size $\delta_{n}$ in each iteration instead of doing a line search that helped in much faster convergence of the conventional steepest descent algorithm.

$$
\begin{equation*}
\delta_{n}=\frac{s_{n-1}^{T} y_{n-1}}{\left\|y_{n-1}\right\|_{2}^{2}} \tag{2.10}
\end{equation*}
$$

or

$$
\begin{equation*}
\delta_{n}=\frac{\left\|s_{n-1}\right\|_{2}^{2}}{s_{n-1}^{T} y_{n-1}} \tag{2.11}
\end{equation*}
$$

where $s_{n-1}=x_{n}-x_{n-1}$ and $y_{n-1}=F^{\prime}\left(x_{n}\right)-F^{\prime}\left(x_{n-1}\right)$. Though this method is better than Cauchy method, it still takes significant amount of time to converge when the OF surface is bumpy and has $v$-shaped minima. Hence there is a need for a better and faster way of implementing the gradient descent algorithm to achieve faster convergence when the OF surface is bumpy and has v -shaped minima with satisfactory results compared to the above discussed methods.

### 2.4 Review of optimization techniques based on transistor sizing

Literature suggests that there are several techniques that are used in improving circuit performance. One such technique is circuit optimization. Again circuit optimization can be done using various methods such as transistor sizing, transistor reordering, transistor tapering etc. However, transistor sizing is considered to be the simplest and most effective method in CMOS circuit optimization. Previous research shows many such attempts were made to optimize the circuit for speed and power using transistor sizing. Some have used RC transistor models to approximate the circuit delay/power, while some have used non-RC models to approximate the circuit delay/power. Let's review some of them below.

One of the early tools used for circuit speed optimization based on transistor sizing is discussed in [13] and is called SLOP (Switch Level Optimization) which uses an RC tree approximation for the circuit to estimate the circuit path delay. The total delay of path i including the transistor is given by the sum of all the delay contributions from each transistor in that path.

$$
\begin{equation*}
\mathrm{d}_{\mathrm{i}}=\mathrm{t}_{1}+\mathrm{t}_{2}+\ldots \ldots . .+\mathrm{t}_{\mathrm{j}}+\ldots \ldots .+\mathrm{t}_{\mathrm{n}} \tag{2.8}
\end{equation*}
$$

To minimize diw.r.t the transistor width $\mathrm{W}_{\mathrm{j}}$, evaluate $\frac{\partial d_{i}}{\partial W_{j}}=0$ and solve for $\mathrm{W}_{\mathrm{j}^{*}}$ which is the minimum delay contribution width of transistor $j$ in path i . The critical path delay $\mathrm{d}_{\mathrm{c}}=\left(\mathrm{d}_{\mathrm{i}}\right)_{\max }$ determines the speed of the circuit. The optimization is done in three main stages:
a) Initialization - to record the initial delays of $d_{i}$ and find the critical delay by the first simulation
b) Global test - to establish delay, gain and device matrices. Delay matrix is established by each time changing one transistor width with one step, simulating and recording delays of each path. The gain matrix is established by calculating the partial derivatives of $\mathrm{d}_{\mathrm{i}}$ with respect to the transistor width $\mathrm{W}_{\mathrm{j}}$. The device matrix is established by recording device number with non-zero gain.
c) Critical path test - to only test devices in the critical delay column.

Optimization done by this method is complicated and time consuming as it requires to change only one transistor width at a time and determine the delays of each path. Proper path balancing may not be achieved with this method hence there will be a problem of glitching.
[14] also uses an RC model to estimate the delay but the transistor sizing is done to minimize the area subject to a delay constraint. In this the optimization problem is considered as a convex programming problem. Convex optimization is considered to be efficient because any local minimum solution will also be the global minimum solution. In this case the minimization is done on one path at a time but there will be a need to minimize multiple path delays at a time in most of the cases which is when the optimization problem becomes non-convex and there will be no simple solution for the optimization problem.
[15] uses a different approach in using the RC model for the circuit area/power optimization. It tries to place large size transistors and route to meet the delay specification of the circuit and then the interconnect wire length is extracted. The circuit is optimized for area/power subject to the delay constraint and as a result the transistor sizes decrease creating spaces in the layout and the interconnect wires can be re-routed utilizing these spaces in such a way to reduce the overall wire
length which in turn reduces the wire capacitance. In this the optimization is achieved majorly by minimizing the interconnect wire length.
[16] also uses an RC model and the transistor sizing is done for optimizing the power-delay product rather than only delay or power of a CMOS circuit. The delay calculations are done using Elmore's RC delay model. A loading coefficient $\alpha=C_{L} / C_{d}$ is considered as the decision variable for the power-delay product (PDP) objective function optimization. CL is the load capacitance at a node and Cd is the circuit internal capacitance at that node. The PDP is expressed in terms of $\alpha$ and the optimum $\alpha_{\text {opt }}$ can be determined from solving the expression $\frac{\partial P D P}{\partial \alpha}=0$. As the model is not explicitly expressed in terms of transistor widths it is difficult to analyze the effects of changing the transistor widths on the power delay product.
[17] also uses RC Elmore delay model. In this both the transistor widths and lengths are used as decision variables for the optimization problem. The optimization problem was solved as a multiobjective optimization. The transistor widths and lengths were modified to match all the circuit path delays with the critical path delay to achieve path balancing to avoid glitching and at the same time minimize circuit power consumption. In achieving optimum results the transistor lengths have to be increased, which results in both increased gate capacitances and area. To reduce this negative influence of the increased transistor lengths, two alternate ways were proposed: twin transistors and merged transistors. In order to equalize all the path delays w.r.t the critical path, every path requires individual optimization. In this the delay is not essentially minimized but is made equal to the critical path delay for all the circuit paths to achieve path balancing.
[18],[19],[20],[21] uses complex and non-linear models to minimize delay and power. Though these models are accurate compared to simple RC models, circuit optimization for speed and
power using these models is difficult and is not very efficient because of the complexity of the models.
[22],[23] discusses different techniques like transistor reordering and transistor tapering to optimize the circuit for delay and power, however we prefer to stick with transistor sizing techniques as it is the simple and effective to optimize the circuit for speed and power.

### 2.5 Review of parallel prefix adders and their performance

Binary addition is one of the most often used arithmetic operations on microprocessors. A large variety of algorithms and implementations have been proposed for binary addition. When high operation speed is of great importance, parallel-prefix adders such as Brent-Kung[2], Skylansky[3] and Kogge-Stone[4] adders are commonly used. Any decrease in the delay will directly relate to an increase in the adder throughput. The primary requirements for any adder are that it should be fast and efficient in terms of power consumption and chip area. When the adder size is small, the above mentioned parallel prefix adders show similar performance in terms of delay and power but as the adder size gets large ( $\mathrm{N}>16$ bit), the difference in their performance becomes significant due to the difference in their implementation.

The Skylansky adder presents a least depth prefix network at the cost of increased fan-out at certain nodes. The fan-out increases exponentially as the adder size increases. The Kogge-Stone adder has optimal depth and low fan-out but produces massively complex circuit and also accounts for a large number of interconnects when the adder size is big. The Brent-Kung adder has the advantage of minimal number of circuit nodes, which yields in reduced area but this implementation requires maximum depth that causes an increase in the latency when compared to the other structures. The above analysis is mostly true when fixed transistor sizes or minimum sized transistors are used. Hence it is useful to learn the performance of these adders for different adder sizes with optimum transistor widths to allow a better choice between these adders in the
design of high speed microprocessors. Since the process of designing the adders and their optimization is a time taking process, it would be very useful to use a simple device model like RC transistor model and estimate the delay and power without requiring to draw the layout.
[24] presents the design and performance comparison of various high-speed adders using CMOS and Transmission gate technology. This work showed that the adders designed with transmission gate has much less power and delay than those with CMOS gates. But the focus of this dissertation is the design with CMOS gates. This work requires the adder layout/schematic to be done to do the analysis. Moreover, there is no information on optimum transistor widths being used in their study of adder comparison.
[25] presents the design and performance of any parallel prefix adders by using alternating odd and even cells by eliminating unwanted buffers between the cells from two stages in the conventional design. Similar approach will be used in this dissertation to eliminate the unwanted buffers in the adder design. Even this approach requires the adder layout/schematic to be done to do the analysis and does not discuss the optimization of the adders for optimum delay or power.
[26] also presents a performance comparison of various high-speed adders using Xylinx ISE for simulation and synthesis without discussing the optimization of the adders for improved performance.
[27] presents a comparison of adder performance with radix-4 and radix-2 in case of a 32 bit parallel prefix adder. This work shows that the adder implementation with radix-4, Sparse-4 has reduced delay with a minor increase in power than with radix-2 implementation. Also, this method requires the adder analysis using the layout/schematic and does not discuss the use optimum transistor widths in the design.

This dissertation presents the adder analysis without having to do the tedious layout/schematic simulation by using a simplified RC delay model and also presents a performance comparison of
different high-speed adders with optimum transistor widths for different delay and power constraints.

## CHAPTER III

## METHODOLOGY

### 3.1 Simplified RC Delay Model

The simplified RC delay model [28] consists of parasitic delay $t_{d P}$ that arises due the circuit's internal parasitic resistances and capacitances and is estimated using Elmore delay technique, and also consists of effort delay $t_{d F}$ that arises due to the capacitive load on the circuit's output node. Hence the delay $\left(t_{d}\right)$ of any circuit path is given by

$$
\begin{align*}
& t_{d}=t_{d P}+t_{d F}  \tag{3.1}\\
& t_{d F}=R_{\text {out }} C_{\text {load }} \tag{3.2}
\end{align*}
$$

When there is no load on the output node i.e $C_{\text {load }}=0$ then $t_{d}=t_{d P}$ and when there is a load on the output node then $t_{d}=t_{d P}+t_{d F}$. The parasitic delay and output resistance $R_{o u t}$ are determined by the topology of the circuit.

To discuss the implementation of the simplified RC delay model in quickly estimating the circuit or gate delay, consider a 2-input NAND gate as shown in Figure 3.1 as an example circuit. The transistor level diagram is shown on the left and the corresponding stick diagram is shown on the right in Figure 3.1. The circuit has four transistors and two parasitic node capacitances $C_{Y}$ and $C_{l}$. $C_{Y}$ is same as the $C_{\text {out }}$ which is the parasitic capacitance on the output node Y .


Figure 3.1: 2-input NAND gate a) Transistor diagram b) Stick diagram
The parasitic delay and the output resistance of this circuit are determined for different delay paths based on Elmore delay calculations as shown in Table 3.1.

| $A$ | $B$ | $\boldsymbol{Y}$ | $t_{d P}$ | $R_{\text {out }}$ |
| :--- | :--- | :--- | :--- | :--- |
| $\uparrow$ | 1 | $\downarrow$ | $R_{n A} C_{1}+\left(R_{n A}+R_{n B}\right) C_{\text {out }}$ | $R_{n A}+R_{n B}$ |
| $\downarrow$ | 1 | $\uparrow$ | $R_{p A} C_{1}+R_{p A} C_{\text {out }}$ | $R_{p A}$ |
| 1 | $\uparrow$ | $\downarrow$ | $\left(R_{n A}+R_{n B}\right) C_{\text {out }}$ | $R_{n A}+R_{n B}$ |
| 1 | $\downarrow$ | $\uparrow$ | $R_{p B} C_{\text {out }}$ | $R_{p B}$ |

Table 3.1: Parasitic delay expressions for 2-input NAND gate

Where $R_{p A}, R_{n A}, R_{p B}, R_{n B}$ are the channel resistances of the transistors. From the above table the delay paths are determined assuming only one input is changing at a time and other inputs are not switching. In the case of the 2-input NAND gate when one input is switching the other input has to be ' 1 ' to be able to determine the delay of the path from the changing input to the output because if the non-switching input is ' 0 ' there will be no change in the output even when the switching input is falling or rising, hence there is no question of delay in that case.

## Estimation of channel resistance in terms of Unit R:

The major goal of the simplified RC delay model is to see the effect of changing transistor widths without having to do detailed layout and simulation. The channel resistances of individual transistors are inversely proportional to the transistor channel widths and are proportional to the transistor channel lengths and is given by

$$
\begin{equation*}
R_{X}=R_{S} \frac{L_{X}}{W_{X}} \tag{3.3}
\end{equation*}
$$

Where $R_{X}$ is the channel resistance of transistor $\mathrm{X}, R_{s}$ is the transistor sheet resistance which is a process constant, $L_{X}$ is the transistor channel length and $W_{X}$ is the transistor channel width. In our design the transistor channel length is chosen to be the minimum feature size of the process which is $L_{\text {min }}$. Hence the channel resistances for the nFET and pFET transistors are given as shown below. The subscripts ' $n$ ' and ' $p$ ' refers to the $n$-channel and $p$-channel respectively.

$$
\begin{align*}
& R_{n X}=R_{s n} \frac{L_{\min }}{\boldsymbol{W}_{n X}}  \tag{3.4}\\
& R_{p X}=R_{s p} \frac{L_{\min }}{\boldsymbol{W}_{p X}} \tag{3.5}
\end{align*}
$$

To simplify calculations of channel resistance and make them process independent, lets define channel resistance as a multiple of $R$, the channel resistance of a minimum size nFET.

$$
\begin{align*}
& R \equiv R_{s n} \frac{L_{\min }}{W_{\min }}  \tag{3.6}\\
& R_{n X}=R_{S n} \frac{L_{\min }}{W_{n X}}=\frac{W_{\min }}{W_{n X}} R  \tag{3.7}\\
& R_{p X}=R_{s p} \frac{L_{\min }}{W_{p X}}=\frac{W_{\min }}{W_{p X}} 2 R \tag{3.8}
\end{align*}
$$

Note that there is an extra factor of 2 for pFET channel resistance to account for the significant difference between the n-channel and p-channel sheet resistance. For 0.18um technology $R=$ $5 K \Omega$ approximately.

## Estimation of parasitic capacitances in terms of Unit C:

The transistor gate capacitance $C_{i n}$ per unit width $W$ has stayed relatively constant if all the three dimensions of the gate scale by the same factor and is equal to $2 \mathrm{fF} / \mathrm{\mu m}$ approximately. To simplify calculations of parasitic capacitance and make them process independent, lets define Unit Capacitance $C$ given as

$$
\begin{equation*}
C \equiv\left(\frac{C_{i n}}{W}\right) W_{\min } \tag{3.9}
\end{equation*}
$$

For TSMC 0.18 um technology $C=0.89 f F$ approximately.

Hence capacitance at node i in terms of Unit $C$ is given as

$$
\begin{equation*}
C_{i} \equiv \frac{C_{i}}{\left(\frac{\left(\text { cin }^{W}\right)}{W}\right) W_{\text {min }}} C \tag{3.10}
\end{equation*}
$$

The internal parasitic capacitance at a node is separated into propagation channel capacitance ( $C_{\text {prop }}$ ) and diffusion capacitance ( $C_{d i f f}$ ).

The propagation channel capacitance $C_{\text {prop }}$ for the n-channel transistor and in terms of transistor channel width is given as


Figure 3.2: Channel Propagation Capacitance

$$
C_{\text {prop }}=\left\{\begin{array}{c}
\left(\frac{C_{n g d o}}{P}\right) W_{n X}, \text { off }  \tag{3.11}\\
{\left[\frac{1}{2}\left(\frac{C_{g}}{A}\right) L_{\text {min }}+\left(\frac{C_{n g d o}}{P}\right)\right] W_{n X}, \text { on }}
\end{array}\right.
$$

Normalizing to the unit capacitance

$$
C_{\text {prop }}=\left\{\begin{array}{c}
\frac{\left(\frac{C_{n g d o}}{P}\right) W_{n X}}{\left(\frac{C_{i n}}{W}\right) W_{\text {min }}} C, \text { off }  \tag{3.12}\\
\frac{\left[\frac{1}{2}\left(\frac{C_{g}}{A}\right) L_{\text {min }}+\left(\frac{c_{n g d o}}{P}\right)\right] W_{n X}}{\left(\frac{c_{\text {in }}}{W}\right) W_{\text {min }}} C, \text { on }
\end{array}\right.
$$

The numbers for various processes are shown in Table 3.2.

| Co. | line width | $C_{\text {ngdo }} / C_{i n}$ | $C_{\text {pgdo } o} / C_{i n}$ |
| :--- | :--- | :--- | :--- |
| AMIS | $0.6 \mu \mathrm{~m}$ (SUBM) | 0.10 | 0.16 |
| IBM | $0.5 \mu \mathrm{~m}$ (SUBM) | 0.16 | 0.22 |
| AMIS | $0.35 \mu \mathrm{~m}$ (SUBM) | 0.14 | 0.14 |
| IBM | $0.35 \mu \mathrm{~m}$ (SUBM) | 0.25 | 0.20 |
| TSMC | $0.35 \mu \mathrm{~m}$ (SUBM) | 0.15 | 0.23 |
| IBM | 0.25 (DEEP) | 0.27 | 0.29 |
| TSMC | 0.25 (DEEP) | 0.24 | 0.25 |
| IBM | 0.18 (DEEP) | 0.20 | 0.27 |
| TSMC | 0.18 (DEEP) | 0.26 | 0.30 |
| IBM | 0.13 (DEEP) | 0.20 | 0.19 |
|  |  |  |  |

Table 3.2: Capacitance values for various processes

From the above table the average value of $\mathrm{C}_{\text {ngdo }} / \mathrm{C}_{\mathrm{in}}$ for various processes is about 0.25 . Hence

$$
C_{\text {prop }} \approx\left\{\begin{array}{l}
\frac{1}{4} \frac{W_{n X}}{W_{\min }} C, \text { off }  \tag{3.13}\\
\frac{1}{2} \frac{W_{n X}}{W_{\min }} C, \text { on }
\end{array}\right.
$$

The above expression makes the node capacitance different depending on whether the transistors are ON or OFF which greatly complicates analysis without significantly improving accuracy. To simplify the model $C_{\text {prop }}$ is approximated as $C_{\text {prop }}=\frac{1}{2} \frac{W_{n X}}{W_{\min }} C$. Similarly for the p-channel $C_{\text {prop }}=\frac{1}{2} \frac{W_{p X}}{W_{\text {min }}} C$.

The diffusion capacitance ( $C_{\text {diff }}$ ) varies depending on a) if there is a diffusion contact b) if there is no diffusion contact between two transistors c ) if there is a diffusion contact but is shared between two transistors.
a) The diffusion capacitance of transistor A in terms of unit capacitance C when there is a diffusion contact for n -diffusion and p -diffusion is modeled as


Figure 3.3: Diffusion Capacitance with diffusion contact

$$
\begin{align*}
& C_{n d c X}=\frac{1}{2}\left(\frac{K W_{n A}}{W_{\min }}+K_{1}\right) C  \tag{3.14}\\
& C_{p d c X}=\frac{1}{2}\left(\frac{K W_{p A}}{W_{\min }}+K_{1}\right) C \tag{3.15}
\end{align*}
$$

Where K and $\mathrm{K}_{1}$ are process constants.
b) The diffusion capacitance when there is no diffusion contact between two transistor A and B for n -diffusion and p -diffusion is modeled as


Figure 3.4: Diffusion Capacitance with no diffusion contact

$$
\begin{align*}
& C_{n d n c X Y}=\frac{1}{2}\left(\frac{K\left(W_{n A}+W_{n B}\right)}{W_{\min }}+K_{1}\right) C  \tag{3.16}\\
& C_{p d n c X Y}=\frac{1}{2}\left(\frac{K\left(W_{p A}+W_{p B}\right)}{W_{\min }}+K_{1}\right) C \tag{3.17}
\end{align*}
$$

c) The diffusion capacitance when there is a diffusion contact but shared between two transistors A and B for n-diffusion and p-diffusion is modeled as


Figure 3.5: Diffusion Capacitance with shared diffusion contact

$$
\begin{align*}
& C_{n d s c X Y}=\frac{1}{2}\left(\frac{K\left(W_{n A}+W_{n B}\right)}{W_{\min }}+K_{1}\right) C  \tag{3.18}\\
& C_{p d s c X Y}=\frac{1}{2}\left(\frac{K\left(W_{p A}+W_{p B}\right)}{W_{\min }}+K_{1}\right) C \tag{3.19}
\end{align*}
$$

The diffusion capacitance when there is no diffusion contact and when there is a sharing contact between two transistors is modeled similarly as there is no significant difference within the accuracy of the model.

In case of 2-input NAND gate, the parasitic node capacitance $\mathrm{C}_{\text {out }}$ and $\mathrm{C}_{1}$ is calculated as shown below by using the above derived expressions for the $\mathrm{C}_{\text {prop }}$ and $\mathrm{C}_{\text {diff }}$

$$
\begin{align*}
C_{\text {out }} & =\frac{1}{2} \frac{W_{n B}}{W_{\min }} C+\frac{1}{2} \frac{W_{p A}}{W_{\min }} C+\frac{1}{2} \frac{W_{p B}}{W_{\min }} C+\frac{1}{2}\left(\frac{K W_{n B}}{W_{\min }}+K_{1}\right) C+\frac{1}{2}\left(\frac{K\left(W_{p A}+W_{p B}\right)}{W_{\min }}+K_{1}\right) C \\
& =\frac{1}{2}\left(\frac{(1+K)\left(W_{n B}+W_{p A}+W_{p B}\right)}{W_{\min }}+2 K_{1}\right) C \tag{3.20}
\end{align*}
$$

$$
\begin{align*}
C_{1} & =\frac{1}{2} \frac{W_{n A}}{W_{\min }} C+\frac{1}{2} \frac{W_{n B}}{W_{\min }} C+\frac{1}{2}\left(\frac{K\left(W_{n A}+W_{n B}\right)}{W_{\min }}+K_{1}\right) C \\
& =\frac{1}{2}\left(\frac{(1+K)\left(W_{n A}+W_{n B}\right)}{W_{\min }}+K_{1}\right) C \tag{3.21}
\end{align*}
$$

Similarly the channel resistances in case of 2-input NAND gate are given below

$$
\begin{array}{ll}
R_{n A}=\frac{W_{\min }}{W_{n A}} R & R_{n B}=\frac{W_{\min }}{W_{n B}} R \\
R_{p A}=\frac{W_{\min }}{W_{p A}} R & R_{p B}=\frac{W_{\min }}{W_{p B}} 2 R \tag{3.23}
\end{array}
$$

By substituting the channel resistances and the parasitic node capacitances in the parasitic delay equations for the 2 -input NAND gate in table 3.1 the path delays can be expressed explicitly in terms of transistor widths. Hence the model is greatly useful for faster and easy estimation of circuit delay and can be effectively used for circuit speed and power optimization using transistor sizing.

Similarly the input gate capacitance ( $C_{\text {in }}$ ) and the interconnect wire capacitance ( $C_{\text {wire }}$ ) can be modeled as shown below.

$$
\begin{equation*}
C_{i n}=\left(\frac{W_{n X}+W_{p X}}{W_{\min }}\right) C \tag{3.24}
\end{equation*}
$$

Modern deep submicron processes have a capacitance of about 0.2 fF per micrometer of wire length.

$$
\begin{equation*}
C_{\text {wire }}=0.2 \mathrm{fF} / \mu \mathrm{m} . L_{\text {wire }} \tag{3.25}
\end{equation*}
$$

Normalizing to the unit capacitance

$$
\begin{equation*}
C_{\text {wire }}=\frac{0.2 f F / \mu m \cdot L_{\text {wire }}}{\left(\frac{C_{\text {in }}}{W}\right) \cdot W_{\text {min }}} C=\frac{0.2 f F / \mu m \cdot L_{\text {wire }}}{2 f F / \mu m \cdot W_{\text {min }}} C=\frac{1}{10} \frac{L_{\text {wire }}}{W_{\min }} C \tag{3.26}
\end{equation*}
$$

In our analysis the wire resistance is not considered. Designers should add inverter repeaters to make the wire resistance negligible compared to transistor channel resistance. For more detailed discussion on how the expressions for the channel resistances and parasitic capacitances are derived please refer to [28].

This delay model can be easily extended to larger circuits where multiple gates are connected together by including interconnect wire capacitance and the input gate capacitance. For example consider a $2 \times 4$ Decoder circuit (DEC2) with enable signal using 3-input NAND gates and inverters as shown in Figure 3.6.


Figure 3.6: $2 \times 4$ Decoder circuit with enable signal

It is assumed that each of the NAND gates is identical and each of the inverters is identical.
Hence we only have to design one NAND gate and one inverter and use the same design for the others.


Figure 3.7: a) 3-input NAND

b) Inverter

Similar to the 2-input NAND gate we discussed earlier, determine the parasitic path delays to the output from each input for the 3-input NAND gate and the inverter circuit shown in Figure 3.7.

Now the possible path parasitic delays in case of $2 \times 4$ Decoder circuit shown in Figure 3.6 are

$$
\begin{align*}
& t_{d P(A O Y O r) D E C 2}=t_{d P(A Y f) N A N D 3}+R_{\text {out(AYfNAND3 }}\left[C_{\text {wire }}+C_{\text {in }(A Y r) I N V}\right]+t_{d P(A Y r) I N V}  \tag{3.27}\\
& t_{d P(A O Y O f) D E C 2}=t_{d P(A Y r) N A N D 3}+R_{\text {out(AYr)NAND3 }}\left[C_{\text {wire }}+C_{i n(A Y f I N V}\right]+t_{d P(A Y f I N V}  \tag{3.28}\\
& t_{d P(A Y Y O r) D E C 2}=t_{d P(B Y f N A N D 3}+R_{\text {out }(B Y f) N A N D 3}\left[C_{\text {wire }}+C_{\text {in(AYr)INV }}\right]+t_{d P(A Y r) / N V}  \tag{3.29}\\
& t_{d P(A I Y O f) D E C 2}=t_{d P(B Y r) N A N D 3}+R_{\text {out }(B Y r) N A N D 3}\left[C_{\text {wire }}+C_{i n(A Y f I N V}\right]+t_{d P(A Y f f I N V}  \tag{3.30}\\
& t_{d P(E N Y O r) D E C 2}=t_{d P(C Y f) N A N D 3}+R_{\text {out(CYffNAND3 }}\left[C_{\text {wire }}+C_{i n(A Y r) I N V}\right]+t_{d P(A Y r) I N V}  \tag{3.31}\\
& t_{d P(E N Y O f) D E C 2}=t_{d P(C Y r) N A N D 3}+R_{\text {out }(C Y r) N A N D 3}\left[C_{\text {wire }}+C_{i n(A Y Y f I N V}\right]+t_{d P(A Y f I N V} \tag{3.32}
\end{align*}
$$

To get the total path delay, add the corresponding effort delay $R_{\text {out }} \cdot C_{\text {load }}$ to each of the above expressions in Eq. 3.27 to Eq. 3.32.

Since the simplified RC delay model is explicitly expressed in terms of transistor channel widths, it helps in faster estimation of the circuit delay of any CMOS circuit and also helps analyze the effects of changing transistor channel widths on circuit delay easily. Unlike the logical effort model discussed in Chapter 2, this model allows p and n channel widths selection to be independent of each other in designing an effective circuit in terms of delay and power and is also more accurate than the logical effort model as it takes into consideration the interconnect wire capacitance which plays a significant role in determining the circuit delay when the wire lengths are long.

Minimizing just the delay using the above discussed delay model would result in large transistors which will consume more power and may not use the silicon area efficiently. Hence, there should be a constraint on power when minimizing the delay. Before using the power as a constraint to find the optimum transistor sizes, we need a model for the power changes when changing transistor widths [28]. Fortunately almost all power consumption is proportional to the width of the transistor channels. The power dissipative components in CMOS circuits consist of off-state leakage power, dynamic power due to charging and discharging of node capacitances, shortcircuit power, switching power due to parasitic capacitances and glitch power due to unequal arrival of signals. But the major components are the leakage power, dynamic power and the shortcircuit power. Let us look at how these components are modeled as proportional to transistor channel widths.

## Off-state leakage power

When transistors are turned off, there can be a small drain current ( $\mathrm{I}_{\text {Doff }}$ ). Though this current is too small to have a considerable impact on the delay, it can have a significant impact on the power consumption.
$P_{\text {leak }}=V_{d d} I_{D o f f}=V_{d d}\left(\frac{I_{\text {Doff }}}{I_{\text {Don }}}\right) I_{\text {Don }}$

Where $\mathrm{I}_{\text {Don }}$ is the drain current when the transistor is on. If we approximate $\mathrm{I}_{\text {Don }}$ with channel impedance
$I_{D o n}=\frac{V_{d d}}{R_{o n}}=\frac{V_{d d}}{R} \frac{W}{W_{\text {min }}}$

Where R is the unit resistance, W is the channel width, and $\mathrm{W}_{\text {min }}$ is the minimum transistor width. Hence the leakage power is expressed in terms of transistor width as
$P_{l e a k}=V_{d d}\left(\frac{I_{\text {Doff }}}{I_{\text {Don }}}\right) I_{\text {Don }}=\frac{V_{d d}{ }^{2}}{R}\left(\frac{I_{\text {Doff }}}{I_{\text {Don }}}\right) \frac{W}{W_{\text {min }}}$

Dynamic power

The dynamic power consumption for a single logic gate (neglecting $\mathrm{C}_{\text {wire }}$ ) is given as
$P_{d y n \_ \text {gate }}=\left(C_{\text {out }}+C_{\text {load }}\right) V_{d d}^{2} f$

Where $\mathrm{C}_{\text {out }}$ comes from the parasitic delay and $\mathrm{C}_{\text {load }}$ from the effort delay. Since each transistor appears as part of the parasitic delay in one logic gate and the effort delay in another logic gate, it follows that each transistor contributes capacitance both through $\mathrm{C}_{\text {out }}$ and $\mathrm{C}_{\text {load }}$. Therefore, the dynamic power per transistor is approximately modeled as
$P_{d y n_{-} t r a n}=\left[\left[(1+K) \frac{W}{W_{\min }}+K_{1}\right] C+C\right] V_{d d}{ }^{2} f=[(2+K) C] V_{d d}{ }^{2} f \frac{W}{W_{\min }}+K_{1} C V_{d d}{ }^{2} f$

Where C is the unit capacitance, f is the frequency of charging and discharging the node capacitance.

## Short circuit power

Short circuit power is the only power dissipation component that cannot be modeled as proportional to transistor width. Fortunately, it is almost always negligible compared with other power dissipative components. Therefore, we do not need to include that in our optimization calculations.

Since most of the power consumption per transistor is proportional to the transistor channel width, the total power consumption of a CMOS circuit can be approximated as proportional to the sum of all the transistor widths in the circuit.

### 3.3 Optimization Methodology

In order to effectively optimize the circuit for speed and power by determining the optimum transistor channel width sizes, a heuristic approach based on gradient descent method is used. As discussed in Chapter 2, the gradient descent, also known as steepest descent, is a first order optimization algorithm used to find a local minimum by taking steps proportional to the negative of the gradient of the objective function (OF) at the current point in the decision variable (DV) space.

$$
\begin{equation*}
X_{i+1}=X_{i}-\delta[\nabla O F /|\nabla O F|] \tag{3.38}
\end{equation*}
$$

Where $\delta$ is the step size.

In our optimization analysis, each transistor width $\left(\mathrm{W}_{\mathrm{i}}\right)$ is bounded between a minimum $\left(\mathrm{W}_{\text {min }}\right)$ and a maximum ( $\mathrm{W}_{\text {max }}$ ) to avoid very large transistor widths and the decision variable space is normalized so that each transistor width after normalization varies from 0 to 1 .

$$
\begin{equation*}
W_{\text {norm }}=\frac{W_{i}-W_{\min }}{W_{\max }-W_{\min }} \tag{3.39}
\end{equation*}
$$

When minimizing power, the OF is considered as the sum of the transistor widths in the circuit as discussed in the power model.

$$
\begin{equation*}
O F=\sum W_{i} \tag{3.40}
\end{equation*}
$$

When minimizing delay, the OF is the maximum of all possible critical path delays.

$$
\begin{equation*}
O F=\max \{\text { path_delays }\} \tag{3.41}
\end{equation*}
$$

In this case, the minimum of the OF lies at the intersection of two delay paths and looks like a surface as shown in the figure below.


Figure 3.8: OF surface when minimizing delay

Since the minimum of the OF is V-shaped, it is very difficult to find a point at which the gradient is zero or very small to determine the convergence. Hence the convergence of the optimization algorithm is achieved when the step size falls below a small value $\varepsilon$ usually in the order of $10^{-5}$ or when the gradient of the OF w.r.t the decision variables is less than or equal to $\varepsilon$. The heuristic approach used in order to achieve faster and effective convergence is discussed below in steps.

1. Choose an initial step size $\delta=\delta_{\text {int }}$ and start the optimization using gradient descent method
2. While tracking the best OF point in the DV space found in each iteration, if the algorithm did not find a better OF value in $n$ iterations from the current iteration start the optimization from the previously found best point with a reduced value of $\delta$ i.e. $\delta=\delta / 2$
3. To speed up the convergence of the optimization problem, when the value of $\delta$ is less than $\delta_{1}$ but greater than $\delta_{2}$ such that $\delta_{\text {int }}>\delta_{1}>\delta>\delta_{2}$ reduce the number of iterations $n$ to $n_{1}$ required to achieve the better OF value. When the $\delta$ value is less than $\delta_{2}$ such that $\delta_{1}>$ $\delta_{2}>\delta>\varepsilon$ further reduce the number of iterations $n_{1}$ to $n_{2}$ required to achieve a better OF value.

To achieve satisfactory results without effecting the optimization results much, but to significantly reduce the convergence time choose $n_{1}=n / 2, n_{2}=n / 10 . \delta_{1}, \delta_{2}$ can be chosen depending on the value of $\delta_{\text {int }}$. For example, when $n=50$ then $n_{l}=25, n_{2}=5$ and when $\delta_{\text {int }}=0.5$ then $\delta_{1}=0.1, \delta_{2}=0.01$.

The circuit optimization is done to minimize the circuit delay with power constraint and to minimize the circuit power with delay constraint. The optimization algorithm is defined in such a way, when minimizing the delay with power constraint, at the current point in DV space if the power constraint is met then the OF is circuit delay otherwise the OF will be the circuit power. Similarly when minimizing the power with delay constraint, at the current point in DV space if the delay constraint is met then the OF is the circuit power otherwise the OF will be the circuit delay.

This heuristic approach helps to optimize the circuit much faster than the conventional methods discussed in [11] and [12] while converging to a satisfactory solution to obtain the optimum transistor widths for the circuit optimization involving the multiple delay paths and hundreds of transistor widths.

### 3.4 Model Validation

To validate the simplified RC delay model a simple four inverter chain circuit is considered. The model is tested against SPICE and Logical Effort(LE) in the case of the circuit with lightly loaded condition, i.e., with no interconnect wire capacitance, $\mathrm{C}_{\text {wire }}$ and small load capacitance, $\mathrm{C}_{\text {load }}$ and also in case of the circuit with heavily loaded condition, i.e., with large $\mathrm{C}_{\text {wire }}$ and $\mathrm{C}_{\text {load }}$. The analysis is done for 0.18 micrometer process technology. In case of the model for 0.18 micrometer process the values for $K$ and $K_{1}$ in equations (3.14) - (3.19) is equal to one i.e. $K=K_{1}=1$. The model is equivalent to the logical effort method when $K_{1}=0$ and $C_{\text {wire }}$ is ignored.

### 3.4.1 Lightly loaded case

Consider the circuit of four inverter chain as shown in the Figure 3.9 with no interconnect wire capacitance and a small load capacitance on the output node.


Figure 3.9: Four inverter chain with a light load and no $\mathrm{C}_{\text {wire. }} \mathrm{C}=0.89 \mathrm{fF}$ for $0.18 \mu \mathrm{~m}$ process It is assumed that the first inverter gate transistor widths are kept fixed. The p-transistor width is fixed at $1.414 \mathrm{~W}_{\text {min }}$ and the n -transistor width is fixed at $\mathrm{W}_{\text {min }}$. For the remaining inverter gates all the transistor widths are set at $\mathrm{W}_{\text {min }}$ initially, where $\mathrm{W}_{\text {min }}=0.36$ micrometers for a 0.18 micrometer process. The circuit is then optimized with different power constraints using the simplified RC delay model and the results are shown in the plot below.


Figure 3.10: Power-delay plot for lightly loaded case

From the above plot, in case of lightly loaded case the optimum transistor widths found by the model agrees well with SPICE results in the region with tight power constraints but did not agree well when the power constraint is kept loose. Also, it is obvious that the simplified RC delay model predicts delays much more accurately than the logical effort method.

The above four inverter chain can be solved in closed form for minimum delay to determine the optimum transistor sizes as discussed in [28] and the optimum value of transistor size ratio is found to be $Z_{i}=\sqrt{2}$; where $Z_{i}=W_{p i} / W_{n i}$ and $i$ represents the number of the inverter in the
inverter chain. The optimum widths found by the simplified RC delay model for minimum circuit delay in the lightly loaded case also agrees with the closed form solution i.e $Z_{i}=\sqrt{2}$

### 3.4.2 Heavily loaded case

Consider the circuit of four inverter chain as shown in Figure 3.11 with large interconnect wire capacitance and a large load capacitance on the output node.


Figure 3.11 Four inverter chain with a heavy load and large $C_{\text {wire. }}$. $=0.89 \mathrm{fF}$ for $0.18 \mu \mathrm{~m}$ process

Similar to the lightly loaded case it is assumed that the first inverter gate transistor widths are kept fixed, p-transistor width is fixed at $1.414 \mathrm{~W}_{\min }$ and the n -transistor width is fixed at $\mathrm{W}_{\text {min }}$. For the remaining inverter gates all the transistor widths are set at $\mathrm{W}_{\text {min }}$ initially, where $\mathrm{W}_{\min }=0.36$ micrometers for 0.18 micrometer process. The circuit is then optimized with different power constraints using the simplified RC delay model and the results are shown in the plot below.


Figure 3.12 Power-delay plot for heavily loaded case

From the above plot, in case of heavily loaded case the optimum transistor widths found by the model agrees well with SPICE results. Again, it is obvious that the simplified RC delay model predicts delay much accurately than the logical effort method.

The above SPICE results are the values when the optimum transistor widths found from optimizing the inverter chain using the model and the logical effort method are used in SPICE simulation to justify the widths found are actually reducing the SPICE delay.

Also, a comparison of CPU computation time taken to estimate the delay in case of cascaded inverter gates is done and is shown in Table 3.3.

|  | CPU Time (seconds) |  |  |
| :--- | :--- | :--- | :--- |
|  | SPICE | IRSIM | Model |
| INV | 0.022 | 0.001 | $<10^{\wedge}-3$ |
| INV $_{2}$ | 0.022 | 0.001 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{4}$ | 0.024 | 0.001 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{8}$ | 0.029 | 0.001 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{16}$ | 0.102 | 0.001 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{32}$ | 0.188 | 0.002 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{64}$ | 0.411 | 0.003 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{128}$ | 0.953 | 0.005 | $<10^{\wedge}-3$ |
| $\mathrm{INV}_{256}$ | 1.582 | 0.008 | $<10^{\wedge}-3$ |

Table 3.3: CPU time comparison

In the above table the subscript beside the term INV refers to the number of cascaded inverter gates. As you can see from the table 3.3, SPICE takes significant amount of computation time for circuits with just couple of hundreds of transistors, while the IRSIM and the Model did not take much computation time compared to SPICE. Overall the model performed well in quickly estimating the delay of the circuit.

### 3.5 Convexity of the objective function

Reference [29] says a real function $f$ is convex on an interval [ $a, b$ ] if for any two points $x_{1}$ and $x_{2}$ in $[\mathrm{a}, \mathrm{b}]$ and any $\lambda$ where $0<\lambda<1$,

$$
\begin{equation*}
f\left[\lambda x_{1}+(1-\lambda) x_{2}\right] \leq \lambda f\left(x_{1}\right)+(1-\lambda) f\left(x_{2}\right) \tag{3.42}
\end{equation*}
$$

In our optimization using transistor sizing, each transistor width is bounded between a minimum and a maximum i.e. $W_{i} \in\left[W_{\min }, W_{\max }\right]$. The parasitic channel resistance, parasitic channel and diffusion capacitances, wire capacitance terms discussed in Section II are all convex on the interval $\left[W_{\min }, W_{\max }\right]$ as they all obey the above condition for the convex function. It is also observed that the product of the channel resistance and the parasitic capacitance is also convex on the interval $\left[W_{\min }, W_{\max }\right]$. From the properties of the convex functions, if two functions are convex then the sum of two convex functions is also convex. Hence the path delay expressions which are just the sum of the convex functions is also convex over the interval [ $\left.W_{\min }, W_{\max }\right]$.

When minimizing the delay, the objective function is the maximum of the possible critical path delays. Again, from the properties of convex functions, if two functions are convex then the maximum of the two convex functions is also convex. Since each path delay is convex on the interval $\left[W_{\min }, W_{\max }\right]$, the maximum of the path delays is also convex on the interval $\left[W_{\min }, W_{\max }\right]$.

When minimizing the power, the objective function is the sum of the transistor widths which is also convex on the interval $\left[W_{\min }, W_{\max }\right]$ as per the definition of the convex function.

## CHAPTER IV

## PERFORMANCE EVALUATION \& OPTIMIZATION RESULTS

### 4.1 Parallel prefix adders

Addition forms the basis for many processing operations. As a result, adder circuits are of great interest to digital system designers. Many adder architectures serve different speed and area requirements. The focus of this dissertation is to model the popular high speed parallel prefix adders such as Brent-Kung, Skylanksy and Kogge-Stone using the simplified RC delay model and compare their performance w.r.t speed and power by optimizing with different performance constraints.

These adders perform the addition operation based on carry generation and propagation logic.
The expressions to describe whether a group spanning bits i...j, inclusive, generate or propagate a carry are given as shown

$$
\begin{align*}
& G_{i: j}=G_{i: k}+P_{i: k} \cdot G_{k-1: j}  \tag{4.1}\\
& P_{i: j}=P_{i: k} \cdot P_{k-1: j} \tag{4.2}
\end{align*}
$$

With the base case

$$
\begin{align*}
& G_{i: i}=A_{i} \cdot B_{i}  \tag{4.3}\\
& P_{i: i}=P_{i}=A_{i} \oplus B_{i} \tag{4.4}
\end{align*}
$$

For an N bit adder size, the Brent-Kung adder computes the carry generate and propagate prefixes for 2-bit groups. These are used to find prefixes for 4-bit groups, which in turn are used to find prefixes for 8 -bit groups, and so forth. The prefixes then fan back down to compute the carries-in to each bit. The adder requires $2\left(\log _{2} \mathrm{~N}\right)-1$ stages. The fanout is limited to 2 at each stage. The Skylansky adder reduces the delay $\log _{2} \mathrm{~N}$ stages by computing intermediate prefixes along with the large group prefixes. This comes at the expense of fanouts that double at each level. These high fanouts cause poor performance on wide adders unless the gates are appropriately sized. The Kogge-Stone adder achieves both $\log 2 \mathrm{~N}$ stages and fanout of 2 at each stage. This comes at the cost of many long wires that must be routed between stages. The adder also contains more PG cells; while this may not impact the area if the adder layout is on a regular grid, it will increase the power consumption. Despite these costs, the Kogge-Stone adder is widely used in highperformance 32 -bit and 64-bit adders. This dissertation presents power-delay performance comparison for these adders in case of 32 bit, 64 bit and 128 bit adder sizes. For simplicity 16 -bit adder structures are presented as shown in the figures Figure 4.1, Figure 4.2 and Figure 4.3 to discuss the design methodology used for 32-bit, 64 -bit and 128 -bit adders. $\square$-refers to bitwise PG cell, $\square$ - refers to gray cell designed as AOI gate, $\square$ - refers to black cell designed as combination of AOI gate and NAND gate, ᄃ। - refers to gray cell designed as OAI gate, ㄷ refers to black cell designed combination of OAI gate and NOR gate, 8 - refers to an inverter gate, $\quad 8$ - refers to a pair of inverter gates and $\boldsymbol{\Delta}$-refers to sum bit generating gate. [30] discusses the design of bitwise PG cell, gray cell, black cell and sum bit cells in detail. It is important to note that AOI gate takes in un-inverted inputs and generates an inverted output. Similarly the OAI gate takes in inverted inputs and generates an un-inverted output. Hence inverter gates are used as needed to provide the right input signal to the corresponding gates. Also, in order to simplify the adder design and analysis problem it is assumed that gates are identical in size to other gates if the gates are driving a similar load. Since many gates drive similar loads, in each stage of the adder structure there will be groups of identical gates.


[^0]Figure 4.1: 16-bit Brent-Kung adder

$\begin{array}{lllllllllll}\mathrm{C}_{\text {out }} \mathrm{S}_{15} & \mathrm{~S}_{14} \mathrm{~S}_{13} & \mathrm{~S}_{12} & \mathrm{~S}_{11} \mathrm{~S}_{10} & \mathrm{~S}_{9} & \mathrm{~S}_{8} & \mathrm{~S}_{7} & \mathrm{~S}_{6} & \mathrm{~S}_{5} & \mathrm{~S}_{4} & \mathrm{~S}_{3} \\ \mathrm{~S}_{2} & \mathrm{~S}_{1} & \mathrm{~S}_{0}\end{array}$
Figure 4.2: 16-bit Skylansky adder


Figure 4.3: 16-bit Kogge-Stone adder

In the above figures of the adder structures, each identical group is enclosed in curly braces ' $\}$ ' and each logic gate in a stage is represented by the input bit position for quick reference to the identical groups.

The path delay expressions are determined using the simplified RC delay model as discussed earlier for those paths in order to define all the possible critical path delays and also to include all the circuit's non-identical transistor widths in the decision variable space. The wire capacitance as modeled in (3.26) is determined by the length of the interconnect wire. It is assumed that the wire capacitance for small wires is zero and only the wire capacitances of those wires that are significantly long were included as non-zero capacitance in calculating the circuit delay.

### 4.2 Optimization Results

The circuit analysis is done in the case of TSMC 0.18 micrometer process technology i.e. $\mathrm{K}=1$, $\mathrm{R}=5 \mathrm{~K} \Omega, \mathrm{C}=0.89 \mathrm{fF}$ and $\mathrm{C}_{\text {load }}=2 \mathrm{C}$. Using the optimization methodology discussed in Section III, the three adder structures are optimized to determine the optimum transistor widths while meeting the performance constraints and the results are shown in Figure 4.4, Figure 4.5 and Figure 4.6 for 32-bit, 64 -bit and 128 -bit adder sizes respectively. During the optimization, it is assumed that all the transistors widths in the first stage of the adders are kept fixed at the minimum transistor width. The first data point on the left-side of each power-delay curve refers to the minimum delay point found with no power constraint, the last data point on the right-side of each power-delay curve refers to the data point when minimum transistor widths are used for all the circuit transistors and the rest of the data points were determined by minimizing power with different delay constraints. The small solid circle on each curve refers to the minimum power-delay product data point.


Figure 4.4: Power-delay comparison plot for 32-bit adder case

In the case of a 32-bit adder after circuit optimization the Skylansky adder is the fastest of all three adders, followed by Kogge-Stone and then Brent-Kung. However, the Brent-Kung has the minimum power-delay product followed by Skylansky and then Kogge-Stone.


Figure 4.5: Power-delay comparison plot for 64-bit adder case
In the case of a 64-bit adder after circuit optimization the Kogge-Stone adder is the fastest of all three adders, but its power consumption is excessive. It is followed by Skylansky and then BrentKung in terms of speed, whereas Brent-Kung still has the minimum power-delay product followed by Skylansky and then the Kogge-Stone.


Figure 4.6: Power-delay comparison plot for 128-bit adder case
In the case of a 128-bit adder after circuit optimization the Kogge-Stone adder is still the fastest of all three adders, but again with excessive power consumption. Kogge-Stone is now followed by Brent-Kung and then the Skylansky in terms of speed. Whereas Brent-Kung still has the minimum power-delay product followed by Skylansky and then the Kogge-Stone.

From the above three plots it can be observed that Kogge-Stone is the fastest when power consumption is not a problem. The Brent-Kung adder has the minimum power-delay product when optimum transistor widths are used in all the three adder sizes. Appendix A shows more details on the optimum transistor widths found and the histograms of the transistor widths for all the three adders at various constraints.

## CHAPTER V

## CONCLUSION

The proposed simplified RC delay model can quickly estimate the circuit delay for any complex CMOS circuit. It is more accurate and allows more efficient circuit design than the existing logical effort method in that it includes the effects of interconnect wire capacitance in determining the circuit delay and determines p and n transistor widths independently. It is much easier to use the numerical optimization techniques with our simplified RC delay equations rather than SPICE for the following reasons:

1. The delays are explicit functions of transistor widths making it obvious what happened when any width is changed.
2. The width dependence of the delays is smooth so that simple numerical methods can be used to quickly find the optimum transistor widths.

Using the model and a heuristic gradient based optimization technique discussed in Chapter III the three popular parallel prefix adders are optimized and the power-delay performance comparison was done without having to do tedious layout and simulation. The results suggest that the Brent-Kung adder has the smallest power-delay product when optimum transistor widths are used, followed by the Skylansky adder and then the Kogge-Stone adder. While Kogge-Stone is the fastest, it comes at the expense of excessive power consumption.

## REFERENCES

[1] L. W. Nagel, and D. O. Pederson, "SPICE (Simulation Program with Integrated Circuit Emphasis)", Memorandum No. ERL-M382, University of California, Berkeley, Apr. 1973.
[2] R. P. Brent and H. T. Kung, "A Regular Layout for Parallel Adders", IEEE Transactions on Computers, vol-31, no.3, pp. 260-264, Mar 1982.
[3] J. Skylansky, "Conditional-Sum Addition Logic", IRE Transactions, EC-9, pp. 226-231, Jun2 1960.
[4] P. M. Kogge and H. S. Stone, "A Parallel Algorithm for the Efficient Solution of a General Class of Recurrence Equations", IEEE Transactions on Computers, vol. 22, no. 8, pp. 786-792, Aug 1973.
[5] W.C. Elmore, "The transient response of damped linear networks with particular regard to wideband amplifiers", Journal on Applied Physics, vol. 19, pp. 55-63, Jan 1948.
[6] W. Shockley, "A unipolar field effect transistor", Proc. IRE, vol. 40, pp. 1365-1376, Nov. 1952.
[7] T. Sakurai and A. R. Newton, "Alpha-Power Law MOSFET Model and its Applications to CMOS Inverter Delay and Other Formulas", IEEE Journal of SolidState Circuits, vol. 25, no. 2, April 1990.
[8] A. Slaz and M. Horowitz, "IRSIM: An Incremental MOS Switch-Level Simulator", Proc. $26^{\text {th }}$ Design Automation Conference, pp. 173-178, June 1989.
[9] J. Chang, "A Piecewise Linear Delay Modeling of CMOS circuits", Ph.D. dissertation, ECEN, OSU, Stillwater, OK, 2006.
[10] I. Sutherland, B. Sproull and D. Harris, "Logical Effort: Designing Fast CMOS Circuits", CA: Morgan Kaufmann Publishers, 1999, pp. 1-83
[11] A. Cauchy, "Méthode générale pour la resolution des systéms d'equations simulanées", Comp. Rend. Sci. Paris, 25, pp. 46-89, 1847.
[12] J. Barzilai and J. M. Bowrwein, "Two point step size gradient methods", IMA Journal of Numerical Analysis, 8, pp. 141-148, 1988.
[13] Jiren Yuan, Christer Svensson, "CMOS Circuit Speed Optimization Based on Switch Level Simulation", Circuits and Systems, IEEE International Symposium, vol.3, pp. 2109 2112, 1988.
[14] Sachin S. Sapatnekar, Vasant B. Rao, Pravin M. Vaidya, Sung-Mo Kang, "An Exact Solution to the Transistor Sizing Problem for CMOS Circuits Using Convex Optimization", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.12, no.11, pp. 1621-1634, November 1993.
[15] Masaaki Yamada, Sachiko Kurosawa, Reiko Nojima, Naohito Kojima, "Synergistic Power/Area Optimization with Transistor Sizing and Wire Length Minimization", IEEE Symposium on Low Power Electronics, pp. 50-51, 1994.
[16] Jiren Yuan, Christer Svensson, "Principle of CMOS Circuit Power-Delay Optimization with Transistor Sizing", IEEE International Symposium on Circuits and Systems, vol.1, pp.637-640, 1996.
[17] A. Wroblewski, O. Schumacher, C. V. Schimpfle, J. A. Nossek, "Minimizing Gate Capciatance with Transistor Sizing", IEEE International Symposium on Circuits and Systems, vol. 4, pp.186-189, 2001.
[18] H.Y. Chen, S.M Kang, "A New Circuit Optimization Technique for High Performance CMOS Circuits", IEEE Transactions on Computer-Aided Design, vol.10, no.5, May 1991.
[19] Manjit Borah, Robert Michael Owens, Mary Jane Irwin, "Transistor Sizing for Low Power CMOS Circuits", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.15, no.6, June 1996.
[20] Robert Rogenmoser, Hubert Kaeslin, "The Impact of Transistor Sizing on Power Efficiency in Submicron CMOS Circuits", IEEE Journal of Solid-State Circuits, vol.32, no.7, July 1997.
[21] Maitham Shams, Mohamed I. Elmasry, "Delay Optimization of CMOS Logic Circuits using Closed-Form Expressions", International Conference on Computer Design, pp. 563-568, 1999.
[22] Bradley S. Carlson, Suh-Juch Lee, "Delay Optimization of Digital CMOS VLSI Circuits by Transistor Reordering", IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol.14, no.10, October 1995.
[23] Li Ding, Pinaki Mazumder, "Optimal Transistor Tapering for High-Speed CMOS Circuits", Proceedings of the Design, Automation and Test in Europe Conference and Exhibition, pp. 708-713, 2002.
[24] A. Baliga, D. Yagain, "Design of High speed adders using CMOS and Transmission gates in Submicron Technology: A Comparative Study", IEEE Fourth International Conference on Emerging Trends in Engineering and Technology, pp.284-289, 2011.
[25] K. Nehru, A. Shanmugam, S. Vadivel, "Design of 64-Bit Low Power Parallel Prefix VLSI Adder for High Speed Arithmetic Circuits", IEEE International Conference on Computing, Communication and Applications, pp.1-4, 2012.
[26] A. N. Jayanthi, C. S. Ravichandran, "Comparison of Performance of High Speed VLSI Adders", IEEE International Conference on Current Trends in Engineering and Technology, pp.99-104, 2013.
[27] N. Poornima, V. S. Kanchana Bhaaskaran, "Power-Delay Optimized 32 Bit Radix-4, Sparse-4 Prefix Adder", IEEE Fifth International Conference on Signal and Image Processing, pp.201-205, 2014.
[28] L. G. Johnson, "Advanced Digital VLSI Design", Lecture notes for ECEN 6263, OSU, Stillwater, OK, 2010. Available http://lgjohn.okstate.edu/6263/index.html
[29] W. Rudin, "Principles of Mathematical Analysis", p. 101, 1976
[30] N. H. E. Weste and D. M. Harris, "CMOS VLSI DESIGN: A Circuits and Systems Perspective", $4^{\text {th }}$ ed., ch. 11, pp. 429-461, 2011.

## APPENDICES

## APPENDIX A

Histograms for 32-bit adder size for minimum delay case; where $\Delta t=\mathbf{R} * \mathbf{C}$


Histograms for 32-bit adder size with delay constraint of $\operatorname{tmax}=\mathbf{2 5 0 *} \Delta t$, where $\Delta t=\mathbf{R}^{*} \mathbf{C}$


Histograms for 32-bit adder size with delay constraint of tmax $=\mathbf{3 0 0} * \Delta t$, where $\Delta t=\mathbf{R}^{*} \mathbf{C}$


Histograms for 32-bit adder size with delay constraint of $\operatorname{tmax}=\mathbf{3 5 0} * \Delta t$, where $\Delta t=\mathbf{R}^{*} \mathbf{C}$


## Histograms for 32-bit adder size for minimum power-delay product



## APPENDIX B

## Histograms for 64-bit adder size for minimum delay case; $\Delta t=R * \mathbf{C}$


$\underline{\text { Histograms for 64-bit adder size with delay constraint of tmax }=\mathbf{3 0 0} * \Delta t \text {, where } \Delta t=R * C}$

$\underline{\text { Histograms for 64-bit adder size with delay constraint of tmax }=350 * \Delta t \text {, where } \Delta t=R^{*} \mathbf{C}}$

$\underline{\text { Histograms for 64-bit adder size with delay constraint of } \operatorname{tmax}=400^{*} \Delta t \text {, where } \Delta t=R^{*} C}$


## Histograms for 64-bit adder size for minimum power-delay product



## APPENDIX C

## Histograms for 128-bit adder size for minimum delay case; $\Delta \mathbf{t}=\mathbf{R} * \mathbf{C}$



Histograms for 128-bit adder size with delay constraint of $\operatorname{tmax}=400 * \Delta t$, where $\Delta t=R^{*} \mathbf{C}$


Histograms for 128-bit adder size with delay constraint of $\operatorname{tmax}=550 * \Delta t$, where $\Delta t=R * C$


## Histograms for 128-bit adder size for minimum power-delay product



## VITA

Sunil Kumar Lakkakula
Candidate for the Degree of
Doctor of Philosophy

## Thesis: CMOS CIRCUIT SPEED AND POWER OPTIMIZATION USING SIMPLIFIED RC DELAY MODEL

Major Field: Electrical Engineering
Biographical:
Education:
Completed the requirements for the Doctor of Philosophy in Electrical Engineering at Oklahoma State University, Stillwater, Oklahoma in May, 2015.

Completed the requirements for the Master of Science in Electrical Engineering at Oklahoma State University, Stillwater, Oklahoma in December, 2009.

Completed the requirements for the Bachelor of Technology in Electrical and Electronics Engineering at Acharya Nagarjuna University, Guntur, Andhra Pradesh, India in April, 2007.

Experience:
Graduate Research Associate, Oklahoma State University.
Professional Memberships:
Member, International Society of Automation
Member, Golden Key


[^0]:    $\mathrm{C}_{\text {out }} \mathrm{S}_{15} \mathrm{~S}_{14} \mathrm{~S}_{13} \mathrm{~S}_{12} \mathrm{~S}_{11} \mathrm{~S}_{10} \mathrm{~S}_{9} \mathrm{~S}_{8} \mathrm{~S}_{7} \mathrm{~S}_{6} \mathrm{~S}_{5} \mathrm{~S}_{4} \mathrm{~S}_{3} \mathrm{~S}_{2} \mathrm{~S}_{1} \mathrm{~S}_{0}$

