# Data Statistics and Low-Power Digital VLSI 

## A dissertation submitted to the SWISS FEDERAL INSTITUTE OF TECHNOLOGY ZURICH

for the degree of Doctor of Technical Sciences

presented by JÜRGEN WASSNER
Dipl.-Ing. Elektrotechnik born April ${ }^{\text {th }}, 1967$
citizen of Germany
accepted on the recommendation of
Prof. Dr. W. Fichtner, examiner
Prof. Dr. L. Thiele, co-examiner

## Acknowledgments

First of all, I wish to thank my supervisor, Prof. Wolfgang Fichtner. His faith in me and my work has been an essential source of motivation, and the excellent professional environment he provides at the Integrated Systems Laboratory (IIS) is invaluable. I would also like to thank Prof. Lothar Thiele for reading and co-examining the thesis.

I am greatly indebted in Dr. Hubert Kaeslin and Dr. Norbert Felber for their continuous support and encouragement during the work, myriad consultations, and their overall kindness and generosity. Without them, this thesis would not have been possible. I also want to express my gratitude to all colleagues at the IIS who contributed to the perfect working environment.

Part of this work has been carried out in collaboration with Bernafon Inc., Switzerland. In this regard I am thankful to Dr. Paul Zbinden and Arthur Schaub for their kind support. Furthermore, I acknowledge the financial sponsorship of KTI, the Commission for Technology and Innovation of the Swiss Government.

Last but not least, very special thanks go to all of my family. In particular, I gratefully think of my parents for all they have done for me. And, I would like to express my heartfelt gratitude to my wife, Dana, for her loving support and generously making many sacrifices during the past four years I was working on this project. Special thanks also go to our two daughters, Clara and Olivia, for keeping alive in me the child's perspective.

## Contents

Acknowledgments ..... vii
Abstract ..... xv
Zusammenfassung ..... xvii
1 Introduction ..... 1
1.1 Motivation ..... 1
1.2 Overview ..... 2
1.2.1 Goal of the Thesis ..... 2
1.2.2 Outline of the Thesis ..... 2
1.2.3 Contribution of the Thesis ..... 3
2 Sources of Power Dissipation ..... 5
2.1 Dynamic Power ..... 5
2.1.1 Capacitor Charging ..... 6
2.1.2 Crossover Current ..... 8
2.2 Static Power ..... 8
2.2.1 Leakage Current ..... 9
2.2.2 Resistive Loads ..... 9
2.3 Implications ..... 9
3 Probabilistic Circuit Analysis for Power Estimation ..... 11
3.1 Introduction ..... 11
3.1.1 Motivation ..... 11
3.1.2 Previous Work ..... 12
3.1.3 Outline ..... 12
3.2 Gate-level Power Estimation ..... 13
3.2.1 Circuit and Power Model ..... 13
3.2.2 Power Measures ..... 14
3.2.3 Switching Activity Estimation ..... 15
3.3 Probabilistic Measures for Binary Signals ..... 19
3.3.1 One-dimensional Binary Signals ..... 19
3.3.2 $n$-dimensional Binary Signals ..... 23
3.3.3 Complexity Considerations ..... 25
3.4 Probabilistic Analysis of Combinational Circuits ..... 27
3.4.1 Spectral Transform of Boolean Functions ..... 27
3.4.2 Exact Calculation of Switching Activities ..... 31
3.4.3 Approximation of Signal Correlation ..... 35
3.4.4 Experimental Results and Discussion ..... 38
3.4.5 Extension to Multi-Delay Model ..... 45
3.5 Probabilistic Analysis of Sequential Circuits ..... 50
3.5.1 Feedback Circuits ..... 50
3.5.2 Analysis via Spectral Transformation ..... 52
3.5.3 Markovian Analysis ..... 53
3.5.4 Implications ..... 55
3.6 Summary ..... 56
4 Data Statistics and Minimum Power Dissipation ..... 59
4.1 Introduction ..... 59
4.1.1 Motivation ..... 59
4.1.2 Previous Work ..... 60
4.1.3 Outline ..... 61
4.2 The Limits of VLSI Technology ..... 62
4.2.1 The Ideal MOSFET ..... 62
4.2.2 Technology Trends ..... 63
4.2.3 VLSI and Thermodynamics ..... 64
4.2.4 Implications ..... 66
4.3 Taxonomy for Lower Bound Problem ..... 67
4.3.1 Problem Formulation ..... 67
4.3.2 Representation of Digital Operations ..... 69
4.3.3 Representation of Data Statistics ..... 73
4.3.4 Problem Hierarchy ..... 74
4.4 The General Lower Bound Problem ..... 77
4.4.1 Tightness of the Lower Bound ..... 77
4.4.2 Energy Efficiency Rating ..... 78
4.4.3 Axiomatic Requirements ..... 79
4.4.4 Implications ..... 81
4.5 Lower Bound for Data Transmission ..... 81
4.5.1 Power Dissipation Bound ..... 82
4.5.2 Switching Activity Bounds ..... 86
4.5.3 Variable-Length Coding ..... 89
4.5.4 Fixed-Length Coding ..... 90
4.5.5 Implications ..... 94
4.6 Lower Bound in Boolean Optimization ..... 96
4.6.1 Problem Description ..... 96
4.6.2 Related Efforts ..... 97
4.6.3 Implications ..... 98
4.7 Lower Bound on Supply Voltage ..... 99
4.7.1 Information-theoretic Results ..... 99
4.7.2 Application to VLSI Processing ..... 101
4.7.3 Limitations ..... 103
4.8 Information Theory vs. Low Power VLSI ..... 106
4.8.1 Digest of Information-theoretic Results ..... 106
4.8.2 Problem Comparison ..... 107
4.8.3 Implications ..... 108
4.9 Summary ..... 110
5 Energy-Efficient Processing of Speech ..... 113
5.1 Introduction ..... 113
5.1.1 Motivation ..... 113
5.1.2 Previous Work ..... 114
5.1.3 Outline ..... 115
5.2 Speech Features and Coding ..... 115
5.2.1 Statistical Properties of Speech Data ..... 115
5.2.2 Speech Coding Techniques ..... 117
5.3 Activity Analysis for Speech Data ..... 118
5.3.1 Fixed-length Lossless Encodings ..... 118
5.3.2 Signal Parameters ..... 119
5.3.3 Switching Activity vs. Signal Parameters ..... 120
5.4 FIR Filtering of Speech Data ..... 125
5.4.1 Target Application ..... 125
5.4.2 Reference Architecture with 2sC Data ..... 125
5.4.3 Sign-Magnitude Representation ..... 129
5.4.4 Differential Encoding ..... 132
5.4.5 Adaptive Encoding ..... 135
5.4.6 Logarithmic Encoding ..... 137
5.5 Evaluation of Coding Schemes ..... 141
5.5.1 Application Parameters ..... 141
5.5.2 Power Estimation Model ..... 141
5.5.3 Experimental Results and Discussion ..... 143
5.5.4 Derivation of General Coding Guidelines ..... 147
5.6 Summary ..... 150
6 Concluding Remarks ..... 151
6.1 Prospects for Practical Application ..... 151
6.2 Future Work ..... 153
A Basics of Probability Theory ..... 155
B Basics of Information Theory ..... 159
C FSM Analysis Example ..... 163
D Memory Structures for FIR Filters ..... 167
Bibliography ..... 173

## Abstract

Ongoing advances of semiconductor technology enable the integration of more and more complex circuits, thus making feasible devices of ever increasing functionality for ever new applications. Naturally, power dissipation has evolved into a key constraint for the design of VLSI circuits, particularly for battery-powered applications.

This thesis deals specifically with power dissipation of digital VLSI circuits subject to the statistical properties of the data being processed. The dependence of power consumption on data statistics is explored with regard to three major aspects: power estimation, power minimization, and practical low-power design.

1. In the context of gate-level power estimation, a probabilistic approach to determine the switching activity in logic circuits from the given input statistics is presented. This approach can take into account any kind of signal correlation and calculates the correct switching activity for every node in the circuit. A novel approximation technique for the efficient control of the estimation accuracy is proposed.
2. The question of fundamental lower bounds on power dissipation in consideration of data statistics is pursued. A systematic view of minimum power dissipation is developed, which leads to a classification of the lower bound problem. The resulting subproblems are set in context to known information-theoretic bounds on data coding and transmission.
3. By means of speech filtering it is exemplified how designers of application-specific circuits may avail themselves of data statis-
tics to cut down the energy use. Various number representations and data encodings are examined, and potential power savings are quantified subject to statistical signal properties and operating conditions. This permits the derivation of general coding guidelines for application-specific processing of speech data.

## Zusammenfassung

Die anhaltende Weiterentwicklung der Halbleitertechnologie gestattet die Integration von immer komplexeren Schaltkreisen. Dies ermöglicht die Realisierung hochfunktioneller Geräte und eröffnet immer neue Einsatzfelder. Es überrascht daher nicht, dass die Leistungsaufnahme von VLSI Schaltkreisen mittlerweile eine Schlüsselposition im Entwurfsprozess einnimmt, insbesondere bei batteriebetriebenen Anwendungen.

In der vorliegenden Arbeit wird speziell die Leistungsaufnahme von digitalen VLSI Schaltkreisen im Zusammenhang mit den statistischen Eigenschaften der zu verarbeitenden Daten behandelt. Die Abhängigkeit der Leistungsaufnahme von der Datenstatistik wird im Hinblick auf drei wesentliche Aspekte untersucht: Leistungsabschätzung, Leistungsminimierung und praktischer Entwurf stromsparender Schaltungen.

1. Für die Leistungsabschätzung auf Gatterniveau wird ein wahrscheinlichkeitstheoretischer Zugang zur Bestimmung der Schaltaktivität in logischen Netzwerken vorgestellt. Dieser Zugang ermöglicht die Berücksichtigung jeglicher Art von Signalkorrelation, so dass eine genaue Berechnung der Knotenaktivitäten möglich ist. Ein neuartiges Näherungsverfahren erlaubt die elegante Steuerung der Schätzgenauigkeit.
2. Es wird der Frage nach einer grundsätzlichen unteren Schranke für die Leistungsaufnahme nachgegangen. Eine systematische Darstellung der Fragestellung führt zu einer Klassifizierung des Problems. Die daraus resultierenden Teilprobleme werden im

Zusammenhang mit bekannten informationstheoretischen Aussagen über Codier- und Übertragungsgrenzen diskutiert.
3. Anhand der Filterung von Sprachsignalen wird beispielhaft demonstriert, wie beim Entwurf anwendungsspezifischer Schaltungen die Statistik der Daten zur Senkung des Stromverbrauchs ausgenutzt werden kann. Das Sparpotential verschiedener Zahlendarstellungen und Codierungen wird in Abhängigkeit von statistischen Signalparametern und Betriebsbedingungen gemessen. Das erlaubt, Codierrichtlinien für die anwendungsspezifische Verarbeitung von Sprachsignalen aufzustellen.

## Chapter 1

## Introduction

### 1.1 Motivation

There is one formula that governs low-power digital VLSI design. It appears in virtually any text related to the subject matter and, presumably, is known to any VLSI designer who faces power dissipation issues. This formula goes

$$
\begin{aligned}
\text { Power }= & \text { Voltage }^{2} \times \text { Capacitance } \\
& \times \text { Clock frequency } \times \text { SWitching activity } .
\end{aligned}
$$

The present thesis will be about this enchanting formula. More specifically, our investigations will revolve mainly around the rearmost term in this equation - switching activity. Generally speaking, switching activity designates the time behavior, or liveliness, of logic signals. This time behavior, in one form or the other, must reflect the characteristics of the application data at hand. The resulting dependence of power dissipation on data statistics is the leitmotif for this thesis, as it combines challenges at the conceptual level with questions of practical significance.

With ongoing advances of semiconductor technology and the consequential boost in the complexity of integrated circuits (IC), power
dissipation in general has been trending higher on the list of VLSI design constraints. Today it is at or near the top of this list, notably for ICs in portable devices where battery lifetime is a major concern. A considerable portion of these devices processes data with statistical properties that are a priori known. In this case, data statistics may be seen as an additional parameter for the design of applicationspecific ICs (ASIC) and instruction set processors (ASIP). Fields of application include wireless communication, portable audio and video devices, and digital hearing instruments.

### 1.2 Overview

### 1.2.1 Goal of the Thesis

In this thesis we focus on three major aspects of low-power digital VLSI design in relation to the statistical properties of the application data. The corresponding goals can be formulated as follows:

1. Power estimation: Develop an analytical model suitable for explicit calculation of switching activity in logic circuits with arbitrary input statistics. Find means such that this model can deal with the state-explosion problem associated with bit-level analysis.
2. Minimum power dissipation: Explore the question of fundamental lower bounds on power dissipation by appending data statistics to the list of low-power VLSI design parameters. Illuminate the relation between the power dissipation limit and known information-theoretic bounds on data coding and transmission.
3. Practical low-power design: Show in the occurrence of speech processing that data statistics can be employed for the design of application-specific circuits with reduced power consumption.

### 1.2.2 Outline of the Thesis

The thesis is organized as follows: In chapter 2 the basic mechanisms of power dissipation in CMOS circuits are reviewed. The three subsequent chapters then are devoted to the three major aspects described
above: Chapter 3 deals with probabilistic methods for switching activity estimation in logic circuits. Chapter 4 pursues the question of minimum power dissipation in consideration of data statistics. Chapter 5 shows by means of speech filtering that data statistics can serve as vehicle for reduction of switching activity and energy use. Each of these three chapters is largely self-contained, but they all share one common theme - power dissipation of VLSI circuits in the context of data statistics. Finally, the thesis closes in chapter 6 with some general remarks and suggestions for future work.

### 1.2.3 Contribution of the Thesis

The main contributions of the thesis to the three aspects of low-power digital VLSI mentioned above can be summarized as follows.

Probabilistic circuit analysis: In chapter 3, a theoretical basis for exact calculation of switching activity in logic circuits is developed by bridging the concept of moments of a random variable known from probability theory and the concept of spectral transformation known from the theory of switching functions. Based on this theoretical foundation, a novel heuristic for the approximation of signal correlation in combinational circuits is suggested. This heuristic improves the state of the art in that it allows to control the degree of approximation for any kind of signal correlation by means of a single parameter [WKFF99].

Lower bound on power dissipation: In chapter 4, a systematic view of minimum power dissipation in consideration of data statistics is developed, which culminates in a classification of the problem of lower bounds on power dissipation. Solutions are provided for two of the resulting subproblems, i.e. "minimum supply voltage" and "minimum switching activity". An existing information-theoretic approach to minimum power dissipation [Sha97] is put into perspective of the new taxonomy. It is shown that the most general form of the lower bound problem is different in nature from the transmission bound problem of information theory.

Energy-efficient processing of speech: In chapter 5, waveform coding techniques known from low bit-rate communication are examined for energy-efficient filtering of speech signals. Arithmetic datapath units corresponding to the various coding schemes are presented, including a new circuit structure for the approximation of binary logarithm. Potential power savings are quantified subject to statistical signal properties and operating conditions. This permits the derivation of general coding guidelines for application-specific processing of speech data [WKFF00, WKFF01].

## Chapter 2

## Sources of Power Dissipation

This chapter reviews the basic mechanisms of power dissipation in digital VLSI circuits in order to reveal its relation to data statistics. A static CMOS logic style is presumed as this is the predominant choice for the implementation of VLSI circuits.

There are two fundamentally different sources of power dissipation in VLSI systems: dynamic and static power. Dynamic power is associated with the variation of signal voltages over time. This variation represents the progression of the information processing task at hand. Therefore, dynamic power dissipation is an inherent property of information representation and processing in VLSI circuits. Static power, on the other hand, gets dissipated even if all signals remain stable. In fully complementary-MOS circuits, static power is neither essential to information representation nor information processing.

### 2.1 Dynamic Power

Dynamic power dissipation has two sources: Current flowing for the charging and discharging of parasitic capacitors, and current resulting from conductive paths crossing over PMOS and NMOS devices. In figure 2.1 these currents are indicated for a CMOS inverter circuit.


Figure 2.1: CMOS inverter circuit with crossover current $I_{c o}$ and capacitor charge ( $I_{0 \rightarrow 1}$ ) and discharge ( $I_{1 \rightarrow 0}$ ) current.

### 2.1.1 Capacitor Charging

## Capacitor components

For the inverter circuit in figure 2.1 the voltage across the capacitor $C_{L}$ represents the logical value of the output signal. Assuming that the inverter drives another identical stage, $C_{L}$ represents the following physical capacitance components [CB95, Kae00]:

- The output-to-ground capacitance of the driving inverter circuit, made up of the junction capacitance between the drain regions and the local substrate.
- The capacitance between the drain regions and the gate electrode of the driving inverter, which must get mapped to $C_{L}$ with almost its fourfold value due to the Miller effect [ALES98].
- The input-to-ground capacitance of the driven circuit dominated by the thinoxide below the gate electrodes, and also containing the source-to-gate capacitances there.
- The total wiring capacitance of the interconnect between output of driving and input of driven inverter. This capacitance is made up of the wire-to-ground capacitance, which itself has
a parallel-plate and fringing-field component, and the wire-towire capacitance, which is significant for modern technologies with their reduced line spacing.

Furthermore, if the driven circuit has dependent inner nodes due to series-connected NMOS or PMOS transistors, as is the case for any multi-input logic gate, extra capacitance appears due to the drain/source regions at these inner nodes (junction and overlap capacitance, see above). Formally, the inner-node capacitance is attributed to the load capacitance $C_{L}$ of the driving cell, but under a simplified circuit model sometimes also is ascribed to the driven cell.

## Charging/Discharging power dissipation

Assuming rail-to-rail logic, i.e. signal voltages that swing back and forth between $V_{d d}$ and $V_{s s}$, the energy injected from the power supply in order to fully charge the load capacitor $C_{L}$ during a ' 0 ' to ' 1 ' transition of $V_{\text {out }}$ is

$$
\begin{equation*}
E_{s u p p l y}=V_{d d}^{2} C_{L} \tag{2.1}
\end{equation*}
$$

Half of this energy is dissipated in the p-channel device for the other half to be delivered to the capacitor $C_{L}$. During the subsequent ' 1 ' to ' 0 ' transition, the other half of the injected energy is wasted by discharging $C_{L}$ to ground via the n-channel device, see figure 2.1. Thus, the invested energy $E_{\text {supply }}$ is used for one charge-discharge cycle of $C_{L}$, corresponding to one switching cycle of the signal voltage $V_{\text {out }}$.

The average power dissipated during one charge-discharge cycle is obtained from dividing the invested energy $E_{\text {supply }}$ in (2.1) by the time required to complete one such cycle. Let $T_{c i}$ denote the time span of a computation interval and $f_{c i}$ be its inverse. For instance, for a single-edge-triggered clocking strategy $f_{c l k}=f_{c i}$, and in case of dual-edge-triggered clocking $f_{c l k}=f_{c i} / 2$. Then, the average power dissipation during one charge-discharge cycle may be written as

$$
\begin{equation*}
P_{s w}=V_{d d}^{2} C_{L} \frac{f_{c i}}{2} \alpha \tag{2.2}
\end{equation*}
$$

with $\alpha$ denoting the expected number of times $V_{\text {out }}$ switches from one
logical state to the other within one computation interval ${ }^{1}$.
The expected number of switching events of $V_{\text {out }}$ is a statistical quantity, which will be formally defined and investigated in chapter 3. For an isolated logic gate, $\alpha$ depends on the statistical properties of the gate's input signals and on the logic function of that very gate. In an overall circuit, $\alpha$ is a function of the data being processed, circuit structure, and the logic function of the gate at hand.

### 2.1.2 Crossover Current

The charge/discharge power dissipation in (2.2) is independent of the actual voltage waveform $V_{\text {out }}$. However, finite rise and fall times of input voltage $V_{i n}$ will result in a direct conducting path between $V_{d d}$ and $V_{s s}$ for short periods of time during switching, see figure 2.1. Specifically, a crossover current $I_{c o}$ will flow as long as $V_{t h, n}<V_{i n}<V_{d d}-\left|V_{t h, p}\right|$, where $V_{t h, n}$ and $V_{t h, p}$ is the threshold voltage of the NMOS and PMOS device, respectively. Besides input rise/fall times, the crossover current is an intricate function of various other parameters, e.g. capacitive load $C_{L}$, device geometries and threshold voltages.

Just as the switching power $P_{s w}$, the crossover component of power also varies approximately proportional with the number of signal transitions $\alpha$, simply because the conductive path between $V_{d d}$ and $V_{s s}$ exists only during switching events. However, the power consumed due to crossover currents contributes much less to the total dynamic power, typically about $10 \ldots .20 \%$ for well-designed circuits [Yea98].

### 2.2 Static Power

Static power dissipation basically may have two different sources: Leakage currents flowing due to the employed MOS transistor-based implementation technology, and currents flowing from $V_{d d}$ to $V_{s s}$ over functional resistive loads.

[^0]
### 2.2.1 Leakage Current

There are two main components of leakage currents:

- sub-threshold leakage due to carrier diffusion between source and drain regions when the transistor is turned off, and
- diode leakage at reverse-biased $\mathrm{p} / \mathrm{n}$ junctions (drain/well and well/well).

The magnitude of these currents is set predominantly by process parameters, and the contribution to total power usually is about $1 \%$ or less, although with advanced technologies and reduced supply and threshold voltages leakage power becomes more significant.

Strictly speaking, leakage power is also subject to data statistics, since sub-threshold leakage and diode leakage at drain/well junctions depend on the logical state of the cell.

### 2.2.2 Resistive Loads

In truly complementary-MOS circuits there exists no DC path between $V_{d d}$ and $V_{s s}$ if all signal values remain stable. However, sinks of static power may hide in units that do not adhere to a fully complementary-MOS design style. Such units include passive pullup/downs, clock generation units, macro-cell memories, LVDS receivers/transmitters, and pseudo-NMOS/PMOS subcircuits [Kae00].

Power dissipation due to resistive loads in general depends on the data statistics, because the voltage level of at least one terminal of the resistor will be controlled by some signal amplitude, thus determining the average proportion of time during which current is conducted.

### 2.3 Implications

Virtually any current causing CMOS circuits to dissipate power is subject to the statistical properties of the data that stimulate the circuit. However, there is one single source that predominates overall dissipation in static CMOS circuits, i.e. the current flowing for charging parasitic capacitances when signals change their logical state. The
dependence of the resulting power dissipation $P_{s w}$ in (2.2) on the temporal switching behavior of data signals motivates the investigation of data statistics in the context of low-power VLSI.

## Chapter 3

## Probabilistic Circuit Analysis for Power Estimation

### 3.1 Introduction

### 3.1.1 Motivation

For the design of low-power circuits, power estimation is indispensable as it supplies feedback on the current value of the power cost function during the optimization process. Power estimation techniques and tools exist for every abstraction level in the design process:

- system/algorithm level
- architectural/register-transfer (RT-) level
- logic/gate level
- transistor/physical level .

Among the levels of abstraction, the influence of data statistics on the produced power estimate is most evident at the logic/gate level. This is because the power estimation model (2.2) that applies at this
level is sufficiently specific to incorporate a data-related statistical measure, i.e. switching activity $\alpha$, but yet is simple enough to be applied even to the largest circuit within reasonable time. Although power models at lower levels of abstraction do, in principle, reflect data dependency, they are infeasible even for medium-sized circuits and have lost relevance in the light of standard-cell based design and high-level synthesis. On the other hand, power models at higher levels of abstraction do not, or only partially, reflect the data dependency involved in the power consuming charge/discharge processes, since physical design information is not available at this level.

### 3.1.2 Previous Work

Due to its significance for low-power design, a huge number of researchers have worked in the field of power estimation. Overviews of this work are provided for instance in [ $\left.\mathrm{S}^{+} 95, \mathrm{MPS} 98\right]$. Previous work within the scope of this chapter will be cited and classified in section 3.2.3.

### 3.1.3 Outline

This chapter is devoted to power estimation at the gate level, since the influence of data statistics is most evident on this level of abstraction. More specifically, the primary focus will be on probabilistic techniques for switching activity estimation. Despite the large number of publications on this topic, a comprehensive treatment that includes all types of signal correlation and thus allows an analytical calculation of switching activity can hardly be found.

Section 3.2 introduces basic terms and concepts in gate-level power estimation. In section 3.3 statistical measures for binary signals are defined, which subsequently will be employed for probabilistic analysis of logic circuits. Switching activity estimation in combinational circuits is the subject of section 3.4 , where a new technique for the approximation of signal correlation is introduced and experimentally verified. Section 3.5 then deals with probabilistic analysis of sequential circuits.

### 3.2 Gate-level Power Estimation

### 3.2.1 Circuit and Power Model

A combinational logic circuit is modeled as cycle-free directed graph with two types of nodes: primary inputs (PI) and logic gates, see figure 3.1. PIs have none, logic gates at least one incoming edge. All nodes exactly have one outgoing edge, which then may have arbitrary fan-out ${ }^{1}$. The edges of the graph represent electrical connections between the nodes. The orientation of the edges is given by the inputs and outputs of the logic gates. Nodes without any successor node are called primary outputs (PO).

If there are $n$ PIs and $m$ POs, the circuit implements a mapping $\{0,1\}^{n} \rightarrow\{0,1\}^{m}$ by means of the $g \geq m$ logic gates. These gates are memory-free and either implement basic Boolean functions such as ( N )AND, ( N )OR, $\mathrm{X}(\mathrm{N}$ )OR and NOT, or are themselves subcircuits composed of those basic functions. For now, it is assumed that the gates have zero propagation delay. A non-zero delay model will then be considered in section 3.4.5.

Each node of the circuit is associated with a binary signal. The PIs are associated with signals $x_{1}, \ldots, x_{n}$. The logic gates are associated with binary signals $y_{1}, \ldots, y_{m-1}, y_{m} \ldots, y_{g}$. These signals (logic gates) are topologically sorted such that signal $y_{j}(j=1, \ldots, g)$ does not depend on any signal $y_{l}$ with $l>j$.

For power estimation, the power dissipation model (2.2) is applied to every logic gate in the circuit. There are two parameters in this model that need to be identified:

Load capacitance. The load capacitance $C_{L_{j}}$ associated with gate $j(j=1, \ldots, g)$ is estimated from technology or physical layout information. It is assumed that all relevant capacitance components, such as the output-to-ground capacitance of gate $j$, the total capacitance for wiring gate $j$ to its successor gates as well as the input-to-ground and inner-node capacitance of these successor gates, are lumped into a single capacitor at the output of gate $j$, see section 2.1.1.

[^1]

Figure 3.1: Model of logic circuit for gate-level power estimation.

For a given circuit, the load capacitances are independent of data statistics and hence we will not further discuss how to obtain them.

Switching activity. The activity $\alpha_{y_{j}},(j=1, \ldots, g)$ refers to the switching behavior of the output of gate $j$, i.e. binary signal $y_{j}$. The precise implication of the node activities and their derivation depends on the objective of power estimation, as will be discussed next.

### 3.2.2 Power Measures

The following power measures may be discerned:
Average power. This is the most common objective of gate-level power estimation, where switching activity $\alpha_{y}$ refers to the average number of times logic gate $y$ changes its binary state within one computation interval. This number strongly depends on the statistical properties of the input data applied to the circuit. Summation over all gates yields the average power dissipation
of the circuit:

$$
\begin{equation*}
P_{a v g}=V_{d d}^{2} f_{c i} \sum_{j=1}^{g} C_{L_{j}} \frac{\alpha_{y_{j}}}{2} . \tag{3.1}
\end{equation*}
$$

Average power dissipation is of central importance for battery life-time considerations in mobile applications. Furthermore, this measure is of interest in reliability analysis for the overall system, where high power dissipation poses challenges on packaging and heat removal technology.

Instantaneous power. Given the present binary state of each gate, the instantaneous or cycle power of some circuit refers to the dissipation induced by one particular input pattern. Switching activity in this case is not a meaningful concept.

Maximum power. A measure of practical interest results when instantaneous power is maximized over all combinations of present circuit state and possible input pattern. This peak power is used to stay clear of switching noise and electromigration. It is also of interest for electro-magnetic compatibility considerations. Since in practice not all gates change their state at the same time during the computation interval, signal delays must be considered in order to obtain a tight upper bound on peak power.

Since only average power depends on data statistics, all subsequent discussion will be devoted to this measure. The provision of average switching activities $\alpha_{y_{j}}$ for evaluation of (3.1) will be discussed next.

### 3.2.3 Switching Activity Estimation

## Signal correlation

Average circuit power can be seen as a weighted sum of gate activities. Therefore, a random error afflicted with the estimated $\alpha_{y_{j}}$ can be tolerated, since such an error tends to average out due to summation over all nodes. However, there are also situations where activity must be estimated accurately for every node, e.g. for gate-level power optimization that performs activity-driven re-structuring of the circuit.

Thereby, the goal is to hide high-activity nodes within complex gates where they drive smaller load capacitances [IP97].

For accurate estimation of gate activities, signal correlation is crucial [SK96]. In combinational logic circuits, two basic types of correlation matter:

Spatio-temporal correlation. This type of correlation refers to statistical dependencies which are present in the application data. Spatial correlation refers to dependencies between different PI signals at the same time instance. Temporal correlation means dependencies between different time instances of the same or different PI signals. Spatio-temporal correlation of input signals causes correlation of internal signals and thus impacts their switching activity.

Structural correlation. Even if the input data is completely uncorrelated, correlation may be introduced by the circuit structure itself. Structural correlation takes effect through node fanouts which reconverge at subsequent gates in the circuit network. For instance, the inputs to gate $y_{3}$ in figure 3.1 are structurally correlated, because they all depend on PI $x_{2}$. This shall be taken into account for accurate calculation of $\alpha_{y_{3}}$.

Methods to obtain information on the gate switching activities $\alpha_{y_{j}}$ can be divided into two categories: simulation-based (dynamic) and probabilistic (static) approaches, see figure 3.2. The two approaches fundamentally differ in how they handle signal correlation. There also exist hybrid methods, which use aspects from both categories in order to combine their respective advantages while avoiding their drawbacks [ $\left.\mathrm{C}^{+} 98\right]$.

## Simulation-based activity estimation

In simulation-based approaches activity information is obtained from monitoring the gate toggling during explicit logic simulation of the circuit. In order to obtain representative activity information, the binary stimuli for this simulation shall resemble the statistical properties of the functional input data as close as possible. After a sufficiently large number of stimuli vectors is simulated, the average switching activity at every node will converge towards a narrow interval around the


Figure 3.2: Simulation-based (top) and probabilistic (bottom) activity estimation.
true value. The number of vectors necessary for convergence strongly depends on the circuit itself as well as on the desired accuracy. A priori information about the number of vectors required, is in general not available. This issue may be addressed by statistical estimation techniques that run the simulation until a stopping criterion for the overall circuit power [BNYT93, YTK98] or individual node activities [XN94] is met.

A great advantage of simulation-based methods is that they implicitly take into account all kind of signal correlations. From a practical point of view another plus is that logic simulation is a standard task in digital VLSI and a simulator is available virtually anywhere.

## Probabilistic activity estimation

As opposed to simulation, probabilistic techniques use statistical parameters to describe the properties of the circuit's input data. These parameters are employed to derive the switching activity of the gates. The advantage of such an approach is that no lengthy binary vectors are required and convergence of activity values is not an issue.

Due to their incremental nature, probabilistic methods are better suited to activity-driven power optimization, where the network structure is being modified as part of the process. Simulation-based methods in this case would require storage of long traces for every gate or complete re-simulation of the circuit.

The largest challenge for probabilistic techniques is to efficiently handle signal correlation effects. While for small circuits the activity of every gate may be calculated accurately, approximation of signal correlation is indispensable for large circuits. Several methods for probabilistic activity estimation have been proposed in the literature, which differ in how and to what extent they consider signal correlations. These methods can be divided in two classes:

Global approach. Methods of this class describe the logic function of any node globally, i.e. with respect to the PIs. To do so, either binary decision diagrams (BDD) [SSW96, MDG ${ }^{+} 97$, DTP98] or symbolic polynomials [CMD97] are employed. In both cases, the effect of reconvergent fanouts is accounted for, but spatial correlation between input signals was ignored. For large circuits, concepts, such as reconvergence regions [SSW96] and super-gates [CMD97], are introduced in order to cope with complexity.

Incremental approach. These methods describe the logic function of each gate with respect to its immediate input signals. Thus, structural correlation is not captured. In [MMP95, MMP98] correlation coefficients are used to describe signal dependencies. The complexity of this approach has been reduced in [TTSG97]. Methods employing correlation coefficients have the advantage that dependencies among input signals can be modeled by means of these coefficients. In order to handle large circuits efficiently, only pairwise signal correlation is considered. Relevant signal pairs, i.e. signals that converge anywhere in the circuit, must be identified prior to propagation of statistical parameters.

None of the above techniques offers efficient control over the approximation of all types of signal correlation. Section 3.4 describes such a method that governs the complexity-accuracy tradeoff by means of a single user-defined parameter. This approach shares features of either
of the two classes above: It is global because the circuit is described by polynomials of PI variables. However, these expressions are being constructed incrementally, with approximations being attained by employing immanent information of the Boolean function at hand.

### 3.3 Probabilistic Measures for Binary Signals

### 3.3.1 One-dimensional Binary Signals

## Signal model

In the framework of probabilistic circuit analysis, a binary discretetime signal $x(k)$ is modeled as random sequence $\left\{X_{k}\right\}$ with sample space $\Omega=\{0,1\}$ and $k=0,1,2, \ldots$ being the time index, see appendix A. In general it is assumed that $\left\{X_{k}\right\}$ is stationary in the strict sense ${ }^{2}$. Therefore, time index $k$ is suppressed when appropriate. Furthermore, the properties of the random sequence $\left\{X_{k}\right\}$ are referred to as the properties of signal $x(k)$.

## Signal probabilities and temporal correlation

The static probability $p_{x}^{1}$ of a binary signal $x$ is defined as the probability of finding $x$ in logic state ' 1 '. Since $x$ is stationary, this probability equals the expected value, i.e. the first moment ${ }^{3}$ of the associated binary random variable $X$ :

$$
\begin{equation*}
E[X]=\sum_{x \in\{0,1\}} x \cdot p(x)=\operatorname{Pr}\{x=1\} \stackrel{!}{=} p_{x}^{1} \tag{3.2}
\end{equation*}
$$

Thus, in case of a binary random variable the expected value itself represents a probability. The same holds true for the product of any number of binary random variables. This observation is key to the activity estimation algorithm to be presented in section 3.4.

[^2]In particular, $E\left[X^{2}\right]=E[X]=p_{x}^{1}$, and thus from (A.5) follows the variance or signal power of $x$ as

$$
\begin{equation*}
\sigma_{x}^{2}=E\left[X^{2}\right]-(E[X])^{2}=p_{x}^{1}\left(1-p_{x}^{1}\right) . \tag{3.3}
\end{equation*}
$$

As expected, for $p_{x}^{1}=0$ or $p_{x}^{1}=1$, i.e. for $x$ being a constant, the signal power is zero.

Of specific interest is the expected value of the product of two successive samples in a binary signal $x$ :

$$
\begin{equation*}
E\left[X_{k} \cdot X_{k+1}\right]=\operatorname{Pr}\{x(k)=1, x(k+1)=1\} \stackrel{!}{=} p_{x-x}^{1-1} . \tag{3.4}
\end{equation*}
$$

Accordingly, all first-order joint probabilities will be denoted as ${ }^{4}$

$$
\begin{equation*}
p_{x-x}^{i-j}=\operatorname{Pr}\{x(k)=i, x(k+1)=j\}, \quad i, j \in\{0,1\} \tag{3.5}
\end{equation*}
$$

with

$$
\begin{equation*}
p_{x-x}^{0-0}+p_{x-x}^{0-1}+p_{x-x}^{1-0}+p_{x-x}^{1-1}=1 \tag{3.6}
\end{equation*}
$$

In general, signal values at time $k$ and $k+1$ will not be independent. The correlation between successive samples in signal $x$ can be measured by the first-order correlation coefficient $\varrho_{x}$, which follows from (A.8) and the stationarity of $x$ as

$$
\begin{align*}
\varrho_{x} & =\frac{E\left[X_{k} \cdot X_{k+1}\right]-E\left[X_{k}\right] \cdot E\left[X_{k+1}\right]}{\sqrt{E\left[\left(X_{k}-E\left[X_{k}\right]\right)^{2}\right] \cdot E\left[\left(X_{k+1}-E\left[X_{k+1}\right]\right)^{2}\right]}} \\
& =\frac{p_{x-x}^{1-1}-\left(p_{x}^{1}\right)^{2}}{\sigma_{x}^{2}}=\frac{p_{x-x}^{1-1}-\left(p_{x}^{1}\right)^{2}}{p_{x}^{1}\left(1-p_{x}^{1}\right)} . \tag{3.7}
\end{align*}
$$

In general $-1 \leq \varrho_{x} \leq 1$. If $p_{x-x}^{1-1}=\left(p_{x}^{1}\right)^{2}$ then $\varrho_{x}=0$ and signal $x$ is said to be temporally uncorrelated. If $\varrho_{x}>0\left(\varrho_{x}<0\right)$ signal $x$ is temporally correlated (anti-correlated), indicating a higher probability of finding two successive samples of $x$ in the same (different) logical state than another signal $y$ with the same static probability $p_{y}^{1}=p_{x}^{1}$. Note that $\varrho_{x}$ is undefined for $\sigma_{x}^{2}=0$, i.e. for constant signals. However, employing L'Hospital's rule yields $\left.\varrho_{x}\right|_{p_{x}^{1}=0}=\left.\varrho_{x}\right|_{p_{x}^{1}=1}=1$, i.e. maximum correlation for constant signals, which is intuitive.

[^3]\[

$$
\begin{aligned}
& \varrho_{x}=0.5: \underset{\rightarrow}{\stackrel{k}{l}} \\
& \varrho_{y}=0 \text { : . O } 0110011001100110 \text {... } \\
& \varrho_{z}=-0.5: \ldots 011010101011010101 .
\end{aligned}
$$
\]

Figure 3.3: Correlated ( $\varrho_{x}=\frac{1}{2}$ ), uncorrelated ( $\varrho_{y}=0$ ), and anticorrelated ( $\varrho_{z}=-\frac{1}{2}$ ) periodic signal with $p_{x}^{1}=p_{y}^{1}=p_{z}^{1}=\frac{1}{2}$ and $\sigma_{x}^{2}=\sigma_{y}^{2}=\sigma_{z}^{2}=\frac{1}{4}$.

## Example:

Figure 3.3 shows three periodic signals ${ }^{5} x, y$, and $z$ with $p_{x}^{1}=p_{y}^{1}=$ $p_{z}^{1}=\frac{1}{2}$ but different temporal correlation. Although the three signals have equal signal power $\sigma_{x}^{2}=\sigma_{y}^{2}=\sigma_{z}^{2}=\frac{1}{4}$, the activity in terms of logical state changes is quite different.

## Switching activity

Formally, the switching activity of a binary signal $x$ is defined as

$$
\begin{equation*}
\alpha_{x}=p_{x-x}^{0-1}+p_{x-x}^{1-0} . \tag{3.8}
\end{equation*}
$$

Thus $\alpha_{x}$ gives the average number of times, a binary signal changes its logical state in a certain number of computational intervals. $\alpha_{x}$ is also referred to as activity factor, transition activity or transition probability. The latter name, however, is to be used with precaution, since for non-zero delay models a binary signal may change its logical state more than once per computational interval, resulting in $\alpha_{x}$ larger than one, see section 3.4.5.

The following theorem provides the basis for the activity estimation algorithm to be presented in section 3.4, by stating that the switching activity of any (stationary) binary signal $x$ can be computed from just two statistical measures of $x$.

[^4]
## Theorem 3.1

The switching activity $\alpha_{x}$ of a binary signal $x$ is given by its static probability $p_{x}^{1}$ and the joint probability $p_{x-x}^{1-1}$ as

$$
\begin{equation*}
\alpha_{x}=2\left(p_{x}^{1}-p_{x-x}^{1-1}\right) \tag{3.9}
\end{equation*}
$$

Proof: Since signal $x$ is stationary, the probability of finding $x$ at logic state ' 1 ' is given by $p_{x}^{1}=p_{x-x}^{1-0}+p_{x-x}^{1-1}=p_{x-x}^{0-1}+p_{x-x}^{1-1}$. Hence, the probability of $x$ changing its state from ' 1 ' to ' 0 ' is the same as the probability of changing from ' 0 ' to ' 1 ', i.e. $p_{x-x}^{1-0}=p_{x-x}^{0-1}=p_{x}^{1}-p_{x-x}^{1-1}$. Substituting this in (3.8) gives the proposition.

Applying (3.9) to the signals in figure 3.3 yields $\alpha_{x}=\frac{1}{4}, \alpha_{y}=\frac{1}{2}$, and $\alpha_{z}=\frac{3}{4}$. Thus, for a given static probability $p_{x}^{1}$ the switching activity decreases with increasing correlation coefficient $\varrho_{x}$. This relation can explicitly be expressed by substituting (3.3) into (3.7) and combining this with (3.9):

$$
\begin{equation*}
\alpha_{x}=2 p_{x}^{1}\left(1-p_{x}^{1}\right)\left(1-\varrho_{x}\right) . \tag{3.10}
\end{equation*}
$$

## Corollary 3.1

The switching activity $\alpha_{x}$ of a temporally uncorrelated signal $x$ is twice the signal power:

$$
\begin{equation*}
\alpha_{x}=2 \sigma_{x}^{2} \tag{3.11}
\end{equation*}
$$

Proof: With $\varrho_{x}=0$ this follows from (3.10) and (3.3).
Hence, to compute the switching activity $\alpha_{x}$ of a temporally uncorrelated signal, the knowledge of static probability $p_{x}^{1}$ is sufficient. On the other hand, applying (3.11) to a temporally correlated signal, introduces an error for $\alpha_{x}$. The following theorem provides a bound on this approximation error.

## Theorem 3.2

The switching activity $\alpha_{x}$ of a binary signal $x$ is bounded by

$$
\begin{equation*}
\alpha_{x} \leq 1-2 \cdot\left|p_{x}^{1}-0.5\right| . \tag{3.12}
\end{equation*}
$$

Proof: There can be at most twice as many signal transitions as there are ' 1 's (or ' 0 's) in $x$ (achieved by the alternating sequence


Figure 3.4: Switching activity $\alpha_{x}$ vs. static probability $p_{x}^{1}$ for a binary signal $x$. Any valid pair $\left(p_{x}^{1}, \alpha_{x}\right)$ falls into the unshaded region.
"01010..."). Thus, $\alpha_{x} \leq 2 p_{x}^{1}$ and $\alpha_{x} \leq 2\left(1-p_{x}^{1}\right)$. Together with $\alpha_{x} \leq 1$ this can be written as (3.12).

Figure 3.4 elucidates the relation between switching activity $\alpha_{x}$ and static probability $p_{x}^{1}$ for a stationary binary signal. Any valid pair $\left(p_{x}^{1}, \alpha_{x}\right)$ falls into the unshaded region in figure 3.4. The maximum switching activity of $\alpha_{x}=1$ can only be achieved if $p_{x}^{1}=\frac{1}{2}$. For temporally uncorrelated signals $\left(\varrho_{x}=0\right)$ holds $\alpha_{x} \leq \frac{1}{2}$, with equality if and only if $p_{x}^{1}=\frac{1}{2}$. From figure 3.4 one may also conclude that overestimation of switching activity is more likely than underestimation when erroneously applying (3.11) to a temporally correlated signal. Except for $p_{x}^{1}=\frac{1}{2}$, the maximum absolute error in $\alpha_{x}$ is always smaller for underestimation than for overestimation.

### 3.3.2 $n$-dimensional Binary Signals

The notations introduced previously naturally extend to $n$ dimensional binary signals $\boldsymbol{x}(k)=\left(x_{1}(k), \ldots, x_{n}(k)\right)$ with $x_{i}(k)(i=1, \ldots, n)$ being one-dimensional binary signals as introduced in section 3.3.1. A sample of $\boldsymbol{x}$ at time $k$ is called a
binary word. Since all $x_{i}(k)$ are stationary, also the binary words are stationary.

Definition (3.8) may be generalized to the switching activity of a $n$ dimensional binary signal as the sum of the switching activities of its constituting components:

$$
\begin{equation*}
\alpha_{x}=\alpha_{x_{1}}+\alpha_{x_{2}}+\ldots+\alpha_{x_{n}} . \tag{3.13}
\end{equation*}
$$

Clearly, it is not reasonable to speak of $\alpha_{x}$ as transition "probability" in this case.

## Moments of binary signals

Of peculiar importance for the activity estimation algorithm to be presented in section 3.4 are the moments of $\boldsymbol{x}$. Based on (A.6) and in analogy to (3.2) the $i$-th moment of a binary word is defined as

$$
\begin{equation*}
p_{x_{j_{1} \ldots x_{i}}}^{1 \ldots 1}=\operatorname{Pr}\left\{x_{j_{1}}=1, \ldots, x_{j_{i}}=1\right\}=E\left[X_{j_{1}} \cdot \ldots \cdot X_{j_{i}}\right] \tag{3.14}
\end{equation*}
$$

with $j_{1}, \ldots, j_{i}$ being a combination of $i$ disjoint elements from the set $\{1, \ldots, n\}$. For instance let $\boldsymbol{x}=\left(x_{1}, x_{2}, x_{3}\right)$, then the second moment $p_{x_{1} x_{3}}^{11}$ gives the probability of finding the first and third component of $\boldsymbol{x}$ simultaneously at logic ' 1 '. The first moments of $\boldsymbol{x}$ are just the static probabilities (3.2) of its component signals.

The $i$-th moments $p_{x_{j_{1}} \ldots x_{j_{i}}}^{1 \ldots 1}$ give evidence of spatial correlation between the components of a binary word, just the same way as $p_{x-x}^{1-1}$ expresses temporal correlation within a binary signal, see (3.7). In order to handle first-order spatio-temporal correlation within a $n$-dimensional binary signal $\boldsymbol{x}(k)$, the notation of lag-one $i$-th moment is introduced. These moments are joint probabilities of finding $i$ samples of $\boldsymbol{x}$, each of which either drawn at time $k$ or $k+1$, simultaneously at logic ' 1 '. Formally, this can be written as

$$
\begin{array}{r}
p_{x_{j_{1}^{1}}^{1} \ldots x_{j_{m}^{1}}-x_{j_{1}^{2}} \ldots x_{j_{r}^{2}}}^{1 \ldots 1-1}=\operatorname{Pr}\left\{x_{j_{1}^{1}}(k)=1, \ldots, x_{j_{m}^{1}}(k)=1,\right. \\
\left.x_{j_{1}^{2}}(k+1)=1, \ldots, x_{j_{r}^{2}}(k+1)=1\right\} \tag{3.15}
\end{array}
$$

with $2 \leq i=m+r \leq 2 n$.


Figure 3.5: Example of 3-dimensional periodic binary signal.

## Example:

For $n=3$ the lag-one third moment $p_{x_{1} x_{3}-x_{2}}^{11-1}$ gives the probability of simultaneously finding the first and third component of $\boldsymbol{x}$ at time $k$, and the second component at the following time step, at logic ' 1 '. In case of the periodic 3 -dimensional signal shown in figure 3.5 this yields $p_{x_{1} x_{3}-x_{2}}^{11-1}=\frac{2}{8}$.

### 3.3.3 Complexity Considerations

## Zero-order statistics

The zero-order statistics of a $n$-dimensional binary signal $\boldsymbol{x}$ is uniquely defined by its probability mass function (pmf), see appendix A. Since $\boldsymbol{x}$ is binary, the pmf $p(\boldsymbol{x}(k))$ comprises $2^{n}$ probability values. Only $2^{n}-1$ of these values are independent, because the sum over all values must yield one.

It can be shown, that $p(\boldsymbol{x}(k))$ can be constructed given all $i$-th moments $(i=1, \ldots, n)$ of $\boldsymbol{x}$ as defined in (3.14). To see this, note that there are $\binom{n}{i} i$-th moments of a $n$-dimensional binary word, all of which being independent. Furthermore, it is easy to verify that

$$
\begin{equation*}
\sum_{i=1}^{n}\binom{n}{i}=2^{n}-1 \tag{3.16}
\end{equation*}
$$

Thus, knowing all $i$-th moments $(i=1, \ldots, n)$ provides exactly the number of independent statistical measures necessary to completely specify the zero-order statistics of $\boldsymbol{x}$.

## Example:

If for $n=2$ the three moments $p_{x_{1}}^{1}, p_{x_{2}}^{1}$, and $p_{x_{1} x_{2}}^{11}$ are given, the remaining joint probabilities follow from

$$
\begin{aligned}
p_{x_{1} x_{2}}^{10} & =p_{x_{1}}^{1}-p_{x_{1} x_{2}}^{11} \\
p_{x_{1} x_{2}}^{01} & =p_{x_{2}}^{1}-p_{x_{1} x_{2}}^{11} \\
p_{x_{1} x_{2}}^{00} & =1-p_{x_{1} x_{2}}^{11}-p_{x_{1} x_{2}}^{10}-p_{x_{1} x_{2}}^{01}
\end{aligned}
$$

which is the $\operatorname{pmf} p\left(x_{1}, x_{2}\right)$. For any $n>2$ the pmf can be constructed by recursively applying this procedure.

## First-order statistics

The first-order statistics of a $n$-dimensional binary signal $\boldsymbol{x}$ is uniquely defined by the pmf $p(\boldsymbol{x}(k), \boldsymbol{x}(k+1))$, which comprises $2^{2 n}$ probability values. Given all $i$-th moments as in (3.14) and all lag-one $i$-th moments as in (3.15), $p(\boldsymbol{x}(k), \boldsymbol{x}(k+1))$ can be constructed in analogy to the example above.

To determine the total number of moments $M(n)$ that fully specify the first-order statistics of $\boldsymbol{x}$, the following is observed: First, according to (3.16), $2^{n}-1 i$-th moments are required to describe $\boldsymbol{x}(k)$. Since the signal is stationary, the same moments also apply to $\boldsymbol{x}(k+1)$. Second, according to (3.16) there are $2^{n}-1$ possibilities of selecting at least one out of $n$ signals at time $k$. Hence, the total number of lag-one moments is $\left(2^{n}-1\right)^{2}$, sine at least one signal at time points $k$ and $k+1$ must be selected, see (3.15). Thus, altogether

$$
\begin{equation*}
M(n)=2^{n}-1+\left(2^{n}-1\right)^{2}=2^{n}\left(2^{n}-1\right) . \tag{3.17}
\end{equation*}
$$

$M(n)$ also is the minimum number of values required to fully specify the first-order statistics of a binary signal. Although $M(n)=2^{2 n}-2^{n}<2^{2 n}$, the complexity of first-order statistical analysis remains exponential.

## Example:

For $n=2$ the $\operatorname{pmf} p(x(k), \boldsymbol{x}(k+1))$ is constituted of $2^{2 \cdot 2}=16$ joint probabilities. Given the $2^{2}-1=3$ moments $p_{x_{1}}^{1}, p_{x_{2}}^{1}$, and $p_{x_{1} x_{2}}^{11}$, as well as the $\left(2^{2}-1\right)^{2}=9$ lag-one moments $p_{x_{1}-x_{1}}^{1-1}, p_{x_{1}-x_{2}}^{1-1}, p_{x_{2}-x_{1}}^{1-1}$,
$p_{x_{2}-x_{2}}^{1-1}, p_{x_{1}-x_{1} x_{2}}^{1-11}, p_{x_{2}-x_{1} x_{2}}^{1-11}, p_{x_{1} x_{2}-x_{1}}^{11}, p_{x_{1} x_{2}-x_{2}}^{11-1}$, and $p_{x_{1} x_{2}-x_{1} x_{2}}^{11-11}$ the pmf $p\left(x_{1}(k), x_{2}(k), x_{1}(k+1), x_{2}(k+1)\right)$ can be derived as follows:

$$
\begin{aligned}
p(1111) & =p_{x_{1} x_{2}-x_{1} x_{2}}^{11-11} \\
p(1110) & =p_{x_{1} x_{2}-x_{1}}^{11-p(1111)} \\
p(1101) & =p_{x_{1} x_{2}-x_{2}}^{11-1}-p(1111) \\
& \vdots \\
p(1100) & =p_{x_{1} x_{2}}^{11}-p(1101)-p(1110)-p(1111) \\
& \vdots \\
p(0000) & =1-p(0001)-p(0010)-\ldots-p(1111) .
\end{aligned}
$$

### 3.4 Probabilistic Analysis of Combinational Circuits

This section introduces an exact and an approximate method for switching activity calculation in combinational logic circuits based on spectral transformation of Boolean functions. It is assumed that the logical state switching for all signals in the circuit happens instantaneously, i.e. the logic gates have zero propagation delay. The extension to non-zero delay models is possible and will be explained in section 3.4.5.

### 3.4.1 Spectral Transform of Boolean Functions

Why use spectral transforms?
Although other approaches to probabilistic activity estimation exist, see section 3.2.3, the spectral transform approach offers three major advantages:

1. The arithmetic notation used in some transforms greatly facilitates the application of the expectation operator $E[$.$] compared$
to an analysis based on ordinary logic equations. In particular, the arithmetic notation enables the application of the wellknown relation $E[X+Y]=E[X]+E[Y]$ for any two random variables $X$ and $Y$. This feature is at the heart of our approach.
2. Spectral transformation of Boolean functions results in a canonical representation. This representation comprises the minimum number of statistical measures (moments) that fully specify the relevant statistics of the input variables, see section 3.3.3.
3. As will be seen in section 3.4.3, the spectral transform provides a means for the efficient approximation of all types of signal correlation, where a single parameter controls the degree of approximation.

## Basics of spectral transformation

In general, a discrete function $f$ denotes a mapping $f: X \rightarrow Y$, where $X$ and $Y$ are finite non-empty sets. A switching function is a discrete function with $X=\{0,1\}^{n}$ and $Y=\{0,1\}$. Commonly, the term Boolean function is used as synonym for switching function. For some fixed variable ordering any Boolean function $y=f\left(x_{1}, \ldots, x_{n}\right)$ is specified by a binary $\left(2^{n}, 1\right)$-vector $\tilde{\boldsymbol{f}}=\left(f(0), \ldots, f\left(2^{n}-1\right)\right)$, which corresponds to the output column of the truth table associated with $f$. The spectral transform of a Boolean function $y=f\left(x_{1}, \ldots, x_{n}\right)$ with associated binary vector $\tilde{\boldsymbol{f}}$ is defined as

$$
\begin{equation*}
B(n) \cdot \tilde{\boldsymbol{f}}=\boldsymbol{a} \tag{3.18}
\end{equation*}
$$

with the $\left(2^{n}, 2^{n}\right)$ transformation matrix $B(n)$ being independent of $f$. The ( $2^{n}, 1$ )-vector $\boldsymbol{a}=\left(a_{0 \ldots 00}, a_{0 \ldots 01}, \ldots, a_{1 \ldots 11}\right)$ is called the spectrum of $f$. Given its spectrum, each Boolean function can be expressed as multivariate linear polynomial:

$$
\begin{equation*}
y=\sum_{i_{n} \ldots i_{1}} a_{i_{n} \ldots i_{1}} \cdot x_{n}^{i_{n}} \cdot \ldots \cdot x_{1}^{i_{1}} \tag{3.19}
\end{equation*}
$$

where the summation is over all binary vectors $i_{n} \ldots i_{1} \in\{0,1\}^{n}$. Depending on the definition of $B(n)$ and the interpretation of (3.19), the transforms have different names, such as Reed-Muller transform,

Walsh transform, and arithmetic transform [SF96]. For the activity estimation algorithm to be presented, only the latter will be of interest.

## Arithmetic transform

In case of arithmetic transform, the transformation matrix $B(n)$ is recursively defined as

$$
B(n)=\left[\begin{array}{cc}
B(n-1) & 0  \tag{3.20}\\
-B(n-1) & B(n-1)
\end{array}\right]
$$

with $B(0)=1$. Thus, in case of arithmetic transform, $B(n)$ in (3.18) maps the binary vector $\tilde{\boldsymbol{f}}$ to an integer vector $\boldsymbol{a}$. Addition and multiplication in (3.19) then denote ordinary arithmetic operations. For convenience, the polynomial (3.19) obtained as arithmetic transform of $y=f\left(x_{1}, \ldots, x_{n}\right)$ will be denoted $\mathcal{A}(y)$.

Depending on the domain of $f$, the arithmetic transform may be further classified: If $x_{i} \in \mathbb{Z}(i=1, \ldots, n)$ then $y \in \mathbb{Z}$. If $x_{i} \in[0,1](i=1, \ldots, n)$ then $y \in[0,1]$ and (3.19) is also referred to as probabilistic transform. The vector of coefficients $a \in \mathbb{Z}^{2^{n}}$ is identical in both cases and, depending on the domain, is called arithmetic spectrum or probability spectrum of $f$. Naturally, probabilistic transform and probability spectrum are the relevant forms in the context of probabilistic analysis of logic circuits.

To show that the probability spectrum $\boldsymbol{a}$ of any Boolean function $f$ is unique, it is observed that $B(n)(n=0,1,2, \ldots)$ is regular, and hence the inverse matrix $B^{-1}(n)$ exists. Then, the inverse arithmetic transform

$$
\begin{equation*}
B^{-1}(n) \cdot \boldsymbol{a}=\tilde{\boldsymbol{f}} \tag{3.21}
\end{equation*}
$$

yields a set of $2^{n}$ linear equations for the elements of $\boldsymbol{a}$, which can be obtained by substituting the $2^{n}$ distinct input assignments of $f$ and the corresponding output value into (3.19). Since with every input assignment a new component of $\boldsymbol{a}$ is selected, the $2^{n}$ equations are linearly independent. Consequently, there always exists a unique solution for $\boldsymbol{a}=\left(a_{0 \ldots 00}, a_{0 \ldots 01}, \ldots, a_{1 \ldots 11}\right)$ in (3.21) and thus $\mathcal{A}(y)$ is a canonical representation of $f$.

## Example:

For $n=2$, (3.19) can be written as $y=a_{00}+a_{01} x_{1}+a_{10} x_{2}+a_{11} x_{2} x_{1}$. In case of the exclusive-OR function (XOR)

$$
y=f\left(x_{1}, x_{2}\right)=\left(x_{1} \wedge \bar{x}_{2}\right) \vee\left(\bar{x}_{1} \wedge x_{2}\right)=x_{1} \oplus x_{2}
$$

the set of linear equations (3.21) becomes

$$
\begin{aligned}
a_{00} & =0 \\
a_{00}+a_{01} & =1 \\
a_{00}+a_{10} & =1 \\
a_{00}+a_{01}+a_{10}+a_{11} & =0
\end{aligned}
$$

which has the unique solution $\boldsymbol{a}=\left(a_{00}, a_{01}, a_{10}, a_{11}\right)=(0,1,1,-2)$. Hence, for the two-input xor holds $\mathcal{A}(y)=x_{1}+x_{2}-2 x_{2} x_{1}$.

In general, $\mathcal{A}(y)$ contains $2^{n}$ terms. Depending on the specific mapping $f$, however, some of the components of $\boldsymbol{a}$ are zero. Table 3.1 shows the arithmetic transform for basic Boolean functions.

It should be noted that any arithmetic transform permits two different interpretations:

1. $\mathcal{A}(y)$ represents the logic equation of the Boolean function $y=$ $f\left(x_{1}, \ldots, x_{n}\right)$ in an arithmetic notation, i.e. substitution of logic values ' 0 ' and ' 1 ' for the $x_{i}(i=1, \ldots, n)$ in $\mathcal{A}(y)$ yields the logic value of $y$ corresponding to $f$.

| Name | $y=f\left(x_{1}, x_{2}\right)$ | $\mathcal{A}(y)$ |
| :---: | :---: | :---: |
| NOT | $y=\overline{x_{1}}$ | $y=1-x_{1}$ |
| AND | $y=x_{1} \wedge x_{2}$ | $y=x_{1} x_{2}$ |
| NAND | $y=\overline{x_{1} \wedge x_{2}}$ | $y=1-x_{1} x_{2}$ |
| OR | $y=x_{1} \vee x_{2}$ | $y=x_{1}+x_{2}-x_{1} x_{2}$ |
| NOR | $y=\overline{x_{1} \vee x_{2}}$ | $y=1-x_{1}-x_{2}+x_{1} x_{2}$ |
| XOR | $y=x_{1} \oplus x_{2}$ | $y=x_{1}+x_{2}-2 x_{1} x_{2}$ |
| XNOR | $y=\overline{x_{1} \oplus x_{2}}$ | $y=1-x_{1}-x_{2}+2 x_{1} x_{2}$ |

Table 3.1: Arithmetic transforms for basic Boolean functions.
2. $\mathcal{A}(y)$ expresses the probability that the output of the corresponding Boolean function assumes the logic value ' 1 ' given the probabilities for the individual inputs being ' 1 ', i.e. substitution of $p_{x_{i}}^{1}(i=1, \ldots, n)$ in $\mathcal{A}(y)$ yields $p_{y}^{1}$.

### 3.4.2 Exact Calculation of Switching Activities

Given a topologically sorted logic circuit together with the first-order statistics of its primary inputs, the goal is to compute the switching activity $\alpha_{y_{j}}(j=1, \ldots, g)$, see figure 3.1. The following algorithm performs this computation in three major steps, where one of the three kinds of signal correlation (see section 3.2.3) matters in each step:
(I) Construct the arithmetic transforms $\mathcal{A}\left(y_{j}\right)(j=1, \ldots, g)$ in consideration of structural correlation;
(II) From $\mathcal{A}\left(y_{j}\right)$ compute the static signal probabilities $p_{y_{j}}^{1}(j=$ $1, \ldots, g$ ) in consideration of spatial correlation;
(III) From $\mathcal{A}\left(y_{j}\right)$ compute the joint probabilities $p_{y_{j}-y_{j}}^{1-1}$ in consideration of temporal correlation, and combine $p_{y_{j}}^{1}$ and $p_{y_{j}-y_{j}}^{1-1}$ to obtain the transition activities $\alpha_{y_{j}}(j=1, \ldots, g)$.

## Example:

The subsequent description of steps (I) through (III) shall be elucidated by means of a small example circuit with periodic input signals as shown in figure 3.6. The goal is to find the switching activity of gate $y_{3}$ by means of the above algorithm. From analyzing the binary output signal the expected result is inferred as $\alpha_{y_{3}}=1$.

## (I) Construction of $\mathcal{A}\left(y_{j}\right)$

Starting at the PI and proceeding in topological order to the PO, $\mathcal{A}\left(y_{j}\right)$ for each gate in the circuit is incrementally constructed using the basic arithmetic transforms in table 3.1. Compound logic gates


Figure 3.6: Example circuit with periodic input signals.
are decomposed. In the course of this procedure, structural correlation of signals arising from reconvergent fanouts are accounted for by applying the rule of exponent suppression [PM75]:

$$
\begin{equation*}
\left(x_{i}\right)^{\beta} \rightarrow x_{i} \text { for } \beta \geq 1 \tag{3.22}
\end{equation*}
$$

In other words, signal reconvergence is taken into account by the idempotence of multiplication in the Boolean algebra.

At the end of step (I) each node of the circuit is described as linear polynomial of the PI variables. This circuit model is by no means specific to activity estimation but has also proven useful in the context of VLSI verification and test [SF96].

## Example:

From table 3.1 the arithmetic transforms for gates $y_{1}$ and $y_{2}$ in the example of figure 3.6 follow as $\mathcal{A}\left(y_{1}\right)=x_{1}\left(1-x_{2}\right)=x_{1}-x_{2} x_{1}$ and $\mathcal{A}\left(y_{2}\right)=x_{3} x_{2}$. For construction of $\mathcal{A}\left(y_{3}\right)$ the exponent suppression rule must be applied, since $x_{2}$ reconverges in $y_{3}$. Thus, $\mathcal{A}\left(y_{3}\right)=$ $\mathcal{A}\left(y_{1}\right)+\mathcal{A}\left(y_{2}\right)-\mathcal{A}\left(y_{1}\right) \mathcal{A}\left(y_{2}\right)=x_{1}-x_{2} x_{1}+x_{3} x_{2}$.

## (II) Computation of $p_{y_{j}}^{1}$

In this step the statistical properties of the input signals get involved. Given the arithmetic transform $\mathcal{A}\left(y_{j}\right)$ the associated static signal probability $p_{y_{j}}^{1}$ can be obtained as follows: From the definition of static signal probability (3.2) one has

$$
\begin{equation*}
p_{y_{j}}^{1}=E\left[Y_{j}\right]=E\left[\mathcal{A}\left(Y_{j}\right)\right] . \tag{3.23}
\end{equation*}
$$

This enables the application of the well-known relation $E\left[X_{1}+X_{2}\right]=$ $E\left[X_{1}\right]+E\left[X_{2}\right]$ for any two random variables $X_{1}$ and $X_{2}$ to decompose $\mathcal{A}\left(y_{j}\right)$ as follows:

$$
\begin{align*}
p_{y_{j}}^{1}= & E\left[\sum_{i_{n} \ldots i_{1}} a_{i_{n} \ldots i_{1}} x_{n}^{i_{n}} \ldots x_{1}^{i_{1}}\right] \\
= & a_{0} \ldots 00 \\
& +a_{0 \ldots 01} E\left[X_{1}\right]+\ldots+a_{10 \ldots 0} E\left[X_{n}\right] \\
& +a_{0 \ldots 011} E\left[X_{2} X_{1}\right]+\ldots+a_{110 \ldots 0} E\left[X_{n} X_{n-1}\right]+\ldots \\
& +a_{1 \ldots 11} E\left[X_{n} \ldots X_{1}\right] \\
= & a_{0 \ldots 00} \\
& +a_{0 \ldots 01} p_{x_{1}}^{1}+\ldots+a_{10 \ldots 0} p_{x_{n}}^{1} \\
& +a_{0 \ldots 011} p_{x_{2} x_{1}}^{11}+\ldots+a_{110 \ldots 0} p_{x_{n} x_{n-1}}^{11}+\ldots \\
& +a_{1 \ldots 11} p_{x_{n} \ldots x_{1}}^{1 \ldots 1} \tag{3.24}
\end{align*}
$$

The probability values $p_{x_{1}}^{1}, p_{x_{2}}^{1}, \ldots, p_{x_{n} \ldots x_{1}}^{1 \ldots 1}$ in (3.24) are the $m$-th moments ( $m=1, \ldots, n$ ) of the $n$-dimensional signal at the PI of the circuit. Substituting these moments into terms with non-vanishing spectral components yields the static probability $p_{y_{j}}^{1}$ in consideration of spatial correlation.

The decomposition process in (3.24) corresponds to finding a disjoint cover for $y_{j}=f_{j}\left(x_{1}, \ldots, x_{n}\right)$ in disjunctive form [ $\left.\mathrm{MDG}^{+} 97\right]$. The arithmetic transform provides an elegant solution to this problem by employing the equivalence of expectation and the probability of being ' 1 ' of a binary random variable.

At this point, the switching activities $\alpha_{y_{j}}(j=1, \ldots, g)$ could be computed from $p_{y_{j}}^{1}$ by virtue of (3.11) which, however, does not account for temporal correlation of signals.

## Example:

Evaluation of $\mathcal{A}\left(y_{3}\right)=x_{1}-x_{2} x_{1}+x_{3} x_{2}$ with $p_{x_{1}}^{1}=\frac{3}{4}, p_{x_{2} x_{1}}^{11}=$ $\frac{2}{4}$ and $p_{x_{3} x_{2}}^{11}=\frac{1}{4}$ yields $p_{y_{3}}^{1}=\frac{1}{2}$. The switching activity without consideration of temporal correlation follows as $\hat{\alpha}_{y_{3}}=2 p_{y_{3}}^{1}\left(1-p_{y_{3}}^{1}\right)=$ $\frac{1}{2}$, which is only half the true value.

Now consider the disjunctive form $y_{3}=x_{1} \bar{x}_{2} \vee x_{2} x_{3}$. In order to apply the expectation operator as in (3.24), i.e. to obtain $p_{y_{3}}^{1}$, the terms of the right-hand side must be independent. Thus, the terms must form a disjoint cover for $y_{3}$, e.g. $y_{3}=x_{1} \bar{x}_{2} x_{3} \vee x_{1} \bar{x}_{2} x_{3} \vee x_{1} x_{2} x_{3} \vee$ $\bar{x}_{1} x_{2} x_{3}$. This then yields the correct value $p_{y_{3}}^{1}=\frac{1}{2}$. However, this procedure would in general require all $2^{n} \mathrm{pmf}$ values of the PI signals to be known.

## (III) Computation of $p_{y_{j}-y_{j}}^{1-1}$ and $\alpha_{y_{j}}$

From theorem 3.1 follows that, in order to account for temporal correlation at node $j$, the joint probability $p_{y_{j}-y_{j}}^{1-1}$ must be known. Contrary to [CMD97], where polynomial expressions for all four fist-order joint probabilities (3.5) were incrementally constructed, it suffices to know $p_{y_{j}-y_{j}}^{1-1}$. Moreover, the polynomial expression associated with $p_{y_{j}-y_{j}}^{1-1}$ can be obtained directly from $\mathcal{A}\left(y_{j}\right)$ and does not have to be incrementally constructed from PI variables. This greatly reduces the computational burden.

Since $p_{y_{j}-y_{j}}^{1-1}$ is the probability of signal $y_{j}$ being ' 1 ' at two consecutive time steps, it can be obtained from evaluating $\mathcal{A}\left(y_{j}(k) \wedge y_{j}(k+1)\right)$. This arithmetic transfrom can be written as

$$
\begin{align*}
& \mathcal{A}\left(y_{j}^{k} y_{j}^{k+1}\right)=\mathcal{A}\left(y_{j}^{k}\right) \cdot \mathcal{A}\left(y_{j}^{k+1}\right) \\
& =\sum_{i_{n} \ldots i_{1}} \sum_{j_{n} \ldots j_{1}} a_{i_{n} \ldots i_{1}} a_{j_{n} \ldots j_{1}}\left(x_{n}^{k}\right)^{i_{n}} \ldots\left(x_{1}^{k}\right)^{i_{1}}\left(x_{n}^{k+1}\right)^{j_{n}} \ldots\left(x_{1}^{k+1}\right)^{j_{1}} \\
& \quad i_{r}, j_{r} \in\{0,1\}(r=1, \ldots, n) \tag{3.25}
\end{align*}
$$

where for brevity time index $k$ is denoted as superscript. Following the same procedure as in (3.24), one gets

$$
\begin{align*}
p_{y_{j}-y_{j}}^{1-1}= & a_{0 \ldots 00}\left(a_{0 \ldots 00}+a_{0 \ldots 01} p_{x_{1}}^{1}+\ldots+a_{10 \ldots 0} p_{x_{n}}^{1}\right)+\ldots \\
& +a_{0 \ldots 01}\left(a_{0 \ldots 00}+a_{0 \ldots .01} p_{x_{1}-x_{1}}^{1-1}+\ldots+a_{10 \ldots} \ldots p_{x_{1}-x_{n}}^{1-1}\right)+\ldots \\
& +\left(a_{1 \ldots 11}\right)^{2} p_{x_{n} \ldots x_{1}-x_{n} \ldots x_{1}}^{1 \ldots 1} . \tag{3.26}
\end{align*}
$$

Thus, the expression for $p_{y_{j}-y_{j}}^{1-1}$ is comprised of the probability spectrum corresponding to $y_{j} \stackrel{y_{j}-y_{j}}{=} f_{j}\left(x_{1}, \ldots, x_{n}\right)$, the $m$-th moments ( $m=1, \ldots, n$ ), and the lag-one $t$-th moments $(t=2, \ldots, 2 n)$ of the

PI signal. Evaluation of each non-vanishing term with these moments yields the numerical value for $p_{y_{j}-y_{j}}^{1-1}$.

Finally, the switching activity in consideration of all three types of correlation can be obtained from $p_{y_{j}-y_{j}}^{1-1}$ and $p_{y_{j}}^{1}$ computed in step (II) by means of theorem 3.1.

## Example:

The arithmetic transform of $y_{3}$ at two consecutive time steps follows from $\mathcal{A}\left(y_{3}\right)=x_{1}-x_{2} x_{1}+x_{3} x_{2}$ as

$$
\begin{aligned}
\mathcal{A}\left(y_{3}^{k} y_{3}^{k+1}\right)= & x_{1}^{k} x_{1}^{k+1}-x_{1}^{k} x_{2}^{k+1} x_{1}^{k+1}+x_{1}^{k} x_{3}^{k+1} x_{2}^{k+1}-x_{2}^{k} x_{1}^{k} x_{1}^{k+1} \\
& +x_{2}^{k} x_{1}^{k} x_{2}^{k+1} x_{1}^{k+1}-x_{2}^{k} x_{1}^{k} x_{3}^{k+1} x_{2}^{k+1}+x_{3}^{k} x_{2}^{k} x_{1}^{k+1} \\
& -x_{3}^{k} x_{2}^{k} x_{2}^{k+1} x_{1}^{k+1}+x_{3}^{k} x_{2}^{k} x_{3}^{k+1} x_{2}^{k+1} .
\end{aligned}
$$

Evaluation with the corresponding moments yields ${ }^{6}$

$$
p_{y_{3}-y_{3}}^{1-1}=\frac{1}{4}(2-1+0-2+1-0+1-1+0)=0 .
$$

Together with $p_{y_{3}}^{1}=\frac{1}{2}$ from step (II), this gives the correct switching activity $\alpha_{y_{3}}=2\left(p_{y_{3}}^{1}-p_{y_{3}-y_{3}}^{1-1}\right)=1$.

### 3.4.3 Approximation of Signal Correlation

## Complexity considerations

The above procedure exactly models all relevant signal correlation. Its computational complexity with respect to the number of gates in the circuit is $O(g)$. However, the complexity with respect to the number of PI is $O\left(2^{2 n}\right)$. This is due to a maximum of $2^{n}$ non-vanishing terms in $\mathcal{A}\left(y_{j}\right)$ and the operation in (3.25). Also, as discussed in section 3.3.3, the number of moments for the PI signals to be stored is $M(n)=2^{n}\left(2^{n}-1\right)$. In order to handle large circuits, the following heuristic for approximation of signal correlation is proposed by the author [WKFF99]. This heuristic trades accuracy for computational cost.

[^5]
## Essential variables

The approximation is applied in step (I) of the exact algorithm. Assume that the inputs to gate $j$ are named $j^{\prime}$ and $j^{\prime \prime}$. Then, prior to construction of $\mathcal{A}\left(y_{j}\right)$ the number of PI variables in $\mathcal{A}\left(y_{j^{\prime \prime}}\right)$ and $\mathcal{A}\left(y_{j^{\prime}}\right)$ is limited to some value $d$. These $d$ essential variables are kept symbolically, while all other variables are evaluated to their respective static probability $p_{x_{i}}^{1}$. Such partial evaluation yields real-valued probability spectra $\boldsymbol{a}$.

The evaluation of some variable $x_{i}$ potentially introduces an error at subsequent nodes because
a) the exponent suppression rule is violated if paths from PI $x_{i}$ reconverge,
b) spatial correlation of $x_{i}$ can not be considered when evaluating $\mathcal{A}\left(y_{j}\right)$ or $\mathcal{A}\left(y_{j}^{k} y_{j}^{k+1}\right)$, and
b) temporal correlation of $x_{i}$ can not be considered when evaluating $\mathcal{A}\left(y_{j}^{k} y_{j}^{k+1}\right)$.

Thus, the choice of $d$ represents a tradeoff between computation/storage complexity and accuracy. With $d=0$, the approximation is equivalent to propagating static probability values, because all PI variables $x_{i}$ are evaluated to $p_{x_{i}}^{1}(i=1, \ldots, n)$ at the start of the algorithm. In this case any signal correlation is ignored. The other extremum, $d=\hat{n} \leq n$, produces exact switching activities, where $\hat{n}$ denotes the maximum number of PIs that any gate function depends on.

## Selecting essential variables

The key question is how to select the $d$ essential variables in the two polynomials $\mathcal{A}\left(y_{j^{\prime \prime}}\right)$ and $\mathcal{A}\left(y_{j^{\prime}}\right)$ at the input to gate $j$, without giving up the incremental nature of the algorithm. First, variables common to $\mathcal{A}\left(y_{j^{\prime \prime}}\right)$ and $\mathcal{A}\left(y_{j^{\prime}}\right)$ are chosen such that exponent suppression can take place correctly. However, reconvergence occurs only at a fraction of all nodes. In particular, variables that are actually due to exponent suppression at the current node, might have been evaluated
already at earlier stages of the algorithm. Thus, a secondary selection criteria must be found. Experiments showed that random selection is not a good choice, because estimation accuracy can widely vary between runs and monotonic behavior with respect to $d$ is not justified.

The generic structure of (3.19) however provides a means for guiding the selection in consideration of monotonic behavior as follows: The contribution of term $i_{n} \ldots i_{1}$ to $\mathcal{A}\left(y_{j}\right)$ depends on $a_{i_{n} \ldots i_{1}}$ and the order of that term. By assuming $x_{i}=\frac{1}{2}(i=1, \ldots, n)$, a normalized probability spectrum $\tilde{\boldsymbol{a}}$ may be defined:

$$
\begin{equation*}
\tilde{a}_{i_{n} \ldots i_{1}}=\left|a_{i_{n} \ldots i_{1}}\right| \cdot 2^{-\sum_{r=1}^{n} i_{r}} \tag{3.27}
\end{equation*}
$$

$i_{r} \in\{0,1\}(r=1, \ldots, n)$. Thus, $\tilde{a}_{i_{n} \ldots i_{1}}$ measures the relevance of term $i_{n} \ldots i_{1}$ in $\mathcal{A}\left(y_{j}\right)$. Evaluation of variables that appear only in terms of low relevance will introduce a small error if the exponent suppression rule for those variables is violated at subsequent nodes. Similarly, neglecting correlation by evaluating variables with their static probabilities instead of the required moment is less critical for terms of low relevance.

Based on the above observations, the essential variables in the two polynomials $\mathcal{A}\left(y_{j^{\prime \prime}}\right)$ and $\mathcal{A}\left(y_{j^{\prime}}\right)$ are chosen with the following priorities:

1. variables that appear in both polynomials;
2. variables from terms with largest $\tilde{a}_{i_{n} \ldots i_{1}}$.

Note that the selection process still contains an element of choice if terms of equal relevance exist. This non-determinism asks for experimental verification of the average performance of the proposed heuristic.

## Example:

Consider the arithmetic transform $\mathcal{A}\left(y_{3}\right)=x_{1}-x_{2} x_{1}+x_{3} x_{2}$ of gate $y_{3}$ in figure 3.6. The non-zero terms of the normalized probability spectrum follow from (3.27): $\tilde{a}_{001}=\frac{1}{2}, \tilde{a}_{011}=\frac{1}{4}, \tilde{a}_{110}=\frac{1}{4}$. If $d=1$ essential variables are to be identified, the choice will be $x_{1}$ which
yields the approximated transform $\mathcal{A}\left(y_{3}\right) \approx x_{1}-p_{x_{2}}^{1} x_{1}+p_{x_{3}}^{1} p_{x_{2}}^{1}=$ $\frac{1}{4}+\frac{1}{2} x_{1}$.

For $d=2$, however, either $x_{2}$ or $x_{3}$ can be selected as additional essential variable, since $\tilde{a}_{011}=\tilde{a}_{110}$. The corresponding approximation would be either $\mathcal{A}\left(y_{3}\right) \approx x_{1}+\frac{1}{2} x_{2}-x_{2} x_{1}$ or $\mathcal{A}\left(y_{3}\right) \approx \frac{1}{2} x_{1}+\frac{1}{2} x_{3}$, yielding different results for subsequent gates.

### 3.4.4 Experimental Results and Discussion

The proposed activity calculation algorithm and correlation approximation scheme have been implemented using a standard symbolic computation package, in order to validate the following two requirements:

- estimation accuracy increases monotonically with increasing number of essential variables $d$, and
- significant accuracy improvements are achieved for feasible values of $d$.

The first requirement shall ensure consistency between runs with different parameters. The second requirement is necessary because the number of terms in $\mathcal{A}\left(y_{j}\right)$ is $2^{2 d}$ in the worst case, if $d$ essential variables are kept symbolically in $\mathcal{A}\left(y_{j^{\prime \prime}}\right)$ and $\mathcal{A}\left(y_{j^{\prime}}\right)$.

## Experimental results

For the experiments the complete ISCAS' 85 benchmark set of combinational circuits and a 32 -bit carry-lookahead adder were used. To express estimation accuracy for individual nodes by means of a single quantity, the following $R M S$ switching activity error is being used:

$$
\begin{equation*}
\mathrm{RMS}_{\alpha}=\sqrt{\frac{1}{g} \sum_{j=1}^{g}\left(\hat{\alpha}_{y_{j}}-\alpha_{y_{j}}\right)^{2}} \tag{3.28}
\end{equation*}
$$

where $\hat{\alpha}_{y_{j}}$ is the estimated switching activity of node $j$, and $\alpha_{y_{j}}$ is the true value which was obtained from logic simulation of $10^{5}$ input vectors.

Two sets of experiments with different input stimuli were conducted. The first set assumes random, uncorrelated PI signals with $p_{x_{i}}^{1}=\frac{1}{2}(i=1, \ldots, n)$. In this case the approximation involves only structural correlation. For the second set of experiments correlated input vectors were used. These vectors were generated by passing the output of binary counters (temporal correlation) through a logic network with high fanout of PI signals (spatial correlation).

Table 3.2 shows the RMS error for different values of $d$. It can be seen, that with two exceptions (c1908 correlated, cla32 random) the RMS error decreases monotonically for increasing $d$. On average, the RMS error decreases by around $50 \%$ if $d=4$ essential variables are identified. Also, substantial accuracy improvements are achieved even for small values of $d$. The highest relative accuracy improvement is observed between $d=0$ and $d=1$.

For three selected circuits, figures $3.7-3.9$ show the percentage of nodes with absolute error of switching activity $\Delta \alpha_{y_{j}}=\left|\hat{\alpha}_{y_{j}}-\alpha_{y_{j}}\right|$ within certain bounds. Each of these plots combines five histograms over six error intervals. The trend of these histograms again indicates nearly monotonic behavior with respect to $d$. Even for circuit c6288 in figure 3.9 , which has been qualified as notoriously difficult by other researchers due to a immense number of reconvergent fanouts [HYH99], the proposed approximation scheme does not break down. The monotonic behavior is retained for this circuit, although at a much higher level of error.

|  | number of |  |  |  | random inputs |  |  |  |  | correlated inputs |  |  |  |  |
| :---: | :---: | :---: | :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | :---: |
| circuit | gates | PI | PO | $\mathrm{d}=0$ | $\mathrm{~d}=1$ | $\mathrm{~d}=2$ | $\mathrm{~d}=3$ | $\mathrm{~d}=4$ | $\mathrm{~d}=0$ | $\mathrm{~d}=1$ | $\mathrm{~d}=2$ | $\mathrm{~d}=3$ | $\mathrm{~d}=4$ |  |
| c 432 | 160 | 36 | 7 | 75 | 44 | 42 | 42 | 40 | 211 | 157 | 104 | 101 | 98 |  |
| c499 | 202 | 41 | 32 | 4 | 4 | 4 | 4 | 4 | 130 | 92 | 83 | 83 | 81 |  |
| c880 | 383 | 60 | 26 | 20 | 9 | 6 | 4 | 4 | 179 | 151 | 67 | 64 | 46 |  |
| c1355 | 546 | 41 | 32 | 56 | 40 | 29 | 29 | 12 | 160 | 100 | 74 | 71 | 45 |  |
| c1908 | 880 | 33 | 25 | 19 | 6 | 5 | 3 | 3 | 136 | 76 | 64 | 44 | 49 |  |
| c2670 | 1193 | 157 | 64 | 64 | 47 | 42 | 40 | 39 | 201 | 143 | 117 | 87 | 68 |  |
| c3540 | 1669 | 50 | 22 | 56 | 44 | 43 | 40 | 34 | 176 | 163 | 134 | 97 | 91 |  |
| c5315 | 2307 | 178 | 123 | 38 | 37 | 26 | 15 | 10 | 207 | 165 | 118 | 93 | 76 |  |
| c6288 | 2416 | 32 | 32 | 130 | 117 | 109 | 103 | 97 | 213 | 188 | 170 | 154 | 147 |  |
| c7552 | 3512 | 206 | 107 | 55 | 34 | 24 | 17 | 12 | 189 | 169 | 149 | 116 | 104 |  |
| cla32 | 578 | 64 | 32 | 16 | 14 | 9 | 6 | 8 | 222 | 152 | 116 | 109 | 87 |  |
| avg. | - | - | - | 52 | 38 | 33 | 30 | 26 | 182 | 142 | 109 | 92 | 81 |  |

Table 3.2: Statistics of benchmark circuits (column 2 through 4), and RMS switching activity error [ $10^{-3}$ ] for random (column 5 through 9 ) and correlated (column 10 through 14) inputs and different numbers $d$ of essential variables.

[^6]

Figure 3.7: Percentage of nodes vs. switching activity error interval and number of essential variables (d) for circuit c7552 with random (top) and correlated (bottom) inputs.


Figure 3.8: Percentage of nodes vs. switching activity error interval and number of essential variables (d) for 32 -bit carry-lookahead adder with random (top) and correlated (bottom) inputs.


Figure 3.9: Percentage of nodes vs. switching activity error interval and number of essential variables (d) for high-reconvergent fanout circuit c6288 with random (top) and correlated (bottom) inputs.

## Discussion

Compared to existing methods the presented approach has the following advantages:

+ Potentially, all three types of signal correlation, i.e. spatial, temporal, and structural correlation are accurately modeled. Moreover, accurate switching activity can be calculated for every node from the theoretical minimum number of statistical measures (moments).
+ For large circuits, the approximation of all types of signal correlation can be controlled by means of a single parameter. Approximation accuracy with respect to this control parameter behaves nearly monotonically.
+ When signal correlations are approximated, incremental processing of the network is still possible and no pre-processing is required. Thus, the complexity with respect to the number of nodes $g$ remains $O(g)$.

These advantages come at the expense of the following critical points:

- The complexity of the algorithm with respect to the approximation parameter $d$ is $O\left(2^{2 d}\right)$. It has been shown, however, that even for small values of $d$ the estimation accuracy improves significantly compared to mere propagation of signal probabilities $p_{x_{i}}^{1}$ and neglecting all correlations ( $d=0$ ).
- The number of statistical measures (moments) to be extracted from the application data and stored, grows exponentially with the approximation parameter $d$. A possible workaround is to extract fewer moments than the maximum allowed value of $d$ would require. In this case, different degrees of approximation would apply to structural correlation and spatio-temporal correlation of input data.


### 3.4.5 Extension to Multi-Delay Model

## Discrete-time switching

In practice, signals do not propagate instantaneously through the circuit but are afflicted with time lag due to gate and interconnect delays. This time lag provokes spurious signal transitions (glitches) in the circuit that are not necessary from a purely functional point of view. Typically, activity due to glitches is about $10-25 \%$ of the total switching activity [GDKW92]. However, for particular circuits such as multipliers, glitching might contribute considerably more.

For consideration of glitches in gate-level activity estimation, a multidelay model is commonly assumed [DTP98, MDG ${ }^{+} 97$ ]. Under this model, the delay time associated with each gate is a multiple integer of some fixed time unit $\Delta$. This implies that signal switching is modeled to occur only at discrete points $t_{k}^{l}$ during the computation interval $T_{c i}$ :

$$
\begin{equation*}
t_{k}^{l}=k T_{c i}+l \cdot \Delta \quad\left(l=0,1, \ldots, l_{\max }\right) \tag{3.29}
\end{equation*}
$$

At time points $k T_{c i}(k=0,1 \ldots)$ corresponding to the active clockedge, new data samples arrive at the PIs. $l_{m a x} \cdot \Delta$ is the maximum delay. A condition to ensure correct functionality of the circuit is: $l_{\max } \Delta<T_{c i}$.


Let $\alpha_{y_{j}}^{l}$ be the switching activity of signal $y_{j}$ at time points $t_{k}^{l}$. The total switching activity of gate $j$ during one computation interval is then given by

$$
\begin{equation*}
\alpha_{y_{j}}=\sum_{l=0}^{l_{\max }} \alpha_{y_{j}}^{l} \quad(j=1, \ldots, g) . \tag{3.30}
\end{equation*}
$$

The switching activities $\alpha_{y_{j}}^{l}\left(l=0,1, \ldots, l_{\max } ; j=1, \ldots, g\right)$ can be calculated in a straightforward manner by extending the zero-delay procedure described in section 3.4.2. Moreover, provided that the PIs $x_{i}(i=1, \ldots, n)$ change their logical state no more than once during any computation interval, the switching activities $\alpha_{y_{j}}^{l}$ are explicitly defined by the same $M(n)=2^{n}\left(2^{n}-1\right)$ moments of the PI signals that were required in the zero-delay case.

The necessary adjustments for the activity calculation procedure from section 3.4.2 shall be described next. Without loss of generality it is assumed that PI signal switching occurs at times $k T_{c i}(k=0,1, \ldots)$.

## Extension of zero-delay procedure

Step (I)
Gate $j$ potentially switches at times $t_{k}^{l}$ if the circuit contains a delay path of length $l$ from any of the PIs to the output of $j$. This set of potential switching points $L_{j} \subset\left\{0,1, \ldots, l_{\max }\right\}$ is given as union of the set of switching points of the precursor nodes of gate $j$, offset by the delay time of gate $j$. For $l \notin L_{j}$ holds $\alpha_{y_{j}}^{l} \equiv 0$.

For each $l \in L_{j}$, an arithmetic transform $\mathcal{A}\left(y_{j}^{l}\right)$ is constructed from those precursor node transforms that are relevant for the transition of gate $j$ at time $l$. Unlike under the zero-delay model, the PI variables now must be time-labeled, since PI samples from the previous ( $k-$ 1 ) and current ( $k$ ) computation intervals may appear in the same arithmetic transform.

When constructing $\mathcal{A}\left(y_{j}^{l}\right)$, variable $x_{i}$ in the polynomial at some input of gate $j$ has label $k-1$, if the delay path from $x_{i}$ via this gate input to the output of gate $j$ is longer than $l$. Otherwise, $x_{i}$ has label $k$. The exponent suppression rule (3.22) applies only to variables that refer to the same time point. The arithmetic transform that refers to the last switching point of gate $j$ is identical to the one under the zero delay model, i.e. all variables in this transform are consistently labeled with either $k-1$ or $k$ depending on whether $y_{j}$ refers to the final value of the previous or current computation interval.

Step (II)
For each $l \in L_{j}$, the static probability $p_{y_{j}^{\prime}}^{1}$ of signal $y_{j}$ at time point $l$ is obtained from evaluation of $\mathcal{A}\left(y_{j}^{l}\right)$. As opposed to the zero-delay model, the static probabilities $p_{y_{j}^{l}}^{1}$ this time may depend on lag-one moments of the PI signals, since time labels $k-1$ and $k$ had to be assigned to PI variables in step (I).

Step (III)
For each $l \in L_{j}$ the arithmetic transform $\mathcal{A}\left(y_{j}^{l-1} y_{j}^{l}\right)=\mathcal{A}\left(y_{j}^{l-1}\right) \mathcal{A}\left(y_{j}^{l}\right)$ is constructed such as to reflect the correlation between adjacent switching time points of gate $j$. For the first switching time point $l, \mathcal{A}\left(y_{j}^{l-1}\right)$ corresponds to the last switching time point of the previous computation interval, i.e. in $\mathcal{A}\left(y_{j}^{l-1}\right)$ all variables are labeled with $k-1$.

Evaluation of $\mathcal{A}\left(y_{j}^{l-1} y_{j}^{l}\right)$ with the corresponding PI moments yields $p_{y_{j}^{l-1}-y_{j}^{l}}^{1-1}$. From this, and the static probabilities $p_{y_{j}^{l-1}}^{1}$ and $p_{y_{j}^{l}}^{1}$ corresponding to time points $l-1$ and $l$, the switching activity of gate $j$ at time point $l \in L_{j}$ follows as ${ }^{7}$

$$
\begin{align*}
\alpha_{y_{j}}^{l} & =p_{y_{j}^{l-1}-y_{j}^{l}}^{1-0}+p_{y_{j}^{l-1}-y_{j}^{l}}^{0-1} \\
& =p_{y_{j}^{l-1}}^{1-1}-p_{y_{j}^{l-1}-y_{j}^{l}}^{1-1}+p_{y_{j}^{l}}^{1}-p_{y_{j}^{l-1}-y_{j}^{l}}^{1-1} \\
& =p_{y_{j}^{l-1}}^{1}+p_{y_{j}^{l}}^{1}-2 \cdot p_{y_{j}^{l-1}-y_{j}^{l}}^{1-1} . \tag{3.31}
\end{align*}
$$

[^7]The diagram below elucidates (3.31) for $L_{j}=\{2,3,5\}$.


If gate $j$ can switch only at a single time point, $p_{y_{j}^{l-1}}^{1}=p_{y_{j}^{l}}^{1}=p_{y_{j}}^{1}$ and $p_{y_{j}^{l-1}-y_{j}^{l}}^{1-1}=p_{y_{j}-y_{j}}^{1-1}$. In this case, (3.31) is identical to the switching activity (3.9) under the zero-delay model.

## Example:

Consider again the example circuit in figure 3.10 and assume gate $y_{1}$ has a propagation delay of $2 \Delta$, and gates $y_{2}$ and $y_{3}$ a delay $\Delta$. Thus, the set of potential switching time points of gate $y_{3}$ is $L_{3}=\{2,3\}$. The arithmetic transform for $y_{3}$ at time point $l=2$ follows as

$$
\begin{aligned}
\mathcal{A}\left(y_{3}^{2}\right) & =\mathcal{A}\left(y_{1}^{k-1}\right)+\mathcal{A}\left(y_{2}^{k}\right)-\mathcal{A}\left(y_{1}^{k-1}\right) \mathcal{A}\left(y_{2}^{k}\right) \\
& =x_{1}^{k-1}-x_{2}^{k-1} x_{1}^{k-1}+x_{3}^{k} x_{2}^{k}-\left(x_{1}^{k-1}-x_{2}^{k-1} x_{1}^{k-1}\right) x_{3}^{k} x_{2}^{k}
\end{aligned}
$$

and the corresponding static signal probability is

$$
p_{y_{3}^{2}}^{1}=p_{x_{1}}^{1}-p_{x_{2} x_{1}}^{11}+p_{x_{3} x_{2}}^{11}-p_{x_{1}-x_{3} x_{2}}^{1-11}+p_{x_{2} x_{1}-x_{3} x_{2}}^{11-11}
$$

Note how the static probability depends on the first-order statistics of the PIs.


Figure 3.10: Example circuit with gate delays.

The arithmetic transform at the last switching point $l=3$ is equivalent to the zero-delay transform for $y_{3}$ :

$$
\mathcal{A}\left(y_{3}^{3}\right)=x_{1}-x_{2} x_{1}+x_{3} x_{2} .
$$

In step (III), depending on whether $y_{3}^{3}$ refers to the previous or current computation interval, the variables in $\mathcal{A}\left(y_{3}^{3}\right)$ are labeled with $k-1$ or $k$ as follows: $p_{y_{3}^{3}-y_{3}^{2}}^{1-1}$ is obtained from evaluation of

$$
\begin{aligned}
\mathcal{A}\left(y_{3}^{3} y_{3}^{2}\right) & =\left.\mathcal{A}\left(y_{3}^{3}\right)\right|_{k-1} \cdot \mathcal{A}\left(y_{3}^{2}\right) \\
& =x_{1}^{k-1}-x_{2}^{k-1} x_{1}^{k-1}+x_{3}^{k-1} x_{2}^{k-1} x_{3}^{k} x_{2}^{k} .
\end{aligned}
$$

$p_{y_{3}^{2}-y_{3}^{3}}^{1-1}$ is obtained from evaluation of

$$
\begin{aligned}
\mathcal{A}\left(y_{3}^{2} y_{3}^{3}\right)= & \left.\mathcal{A}\left(y_{3}^{2}\right) \cdot \mathcal{A}\left(y_{3}^{3}\right)\right|_{k} \\
= & x_{1}^{k-1} x_{1}^{k}-x_{2}^{k-1} x_{1}^{k-1} x_{1}^{k}-x_{1}^{k-1} x_{2}^{k} x_{1}^{k} \\
& +x_{2}^{k-1} x_{1}^{k-1} x_{2}^{k} x_{1}^{k}+x_{3}^{k} x_{2}^{k} .
\end{aligned}
$$

Note that in general $p_{y_{3}^{3}-y_{3}^{2}}^{1-1} \neq p_{y_{3}^{2}-y_{3}^{3}}^{1-1}$ as well as $p_{y_{3}^{2}}^{1} \neq p_{y_{3}^{3}}^{1}$.
Finally, the switching activity of $y_{3}$ under a multi-delay model follows from (3.30) and (3.31) as

$$
\begin{aligned}
\alpha_{y_{3}} & =\alpha_{y_{3}}^{2}+\alpha_{y_{3}}^{3} \\
& =2\left(p_{y_{3}^{2}}^{1}+p_{y_{3}^{3}}^{1}-p_{y_{3}^{3}-y_{3}^{2}}^{1-1}-p_{y_{3}^{2}-y_{3}^{3}}^{1-1}\right) .
\end{aligned}
$$

For the periodic input sequence shown in figure 3.10 this evaluates to $\alpha_{y_{3}}=1$, which is easily verified with logic simulation. Interestingly, in the present case zero- and multi-delay model result in the same switching activity of gate $y_{3}$ although there is one additional switching point under the multi-delay model.

The correlation approximation scheme proposed in section 3.4.3 may be applied equivalently under the multi-delay model. However, for the selection of essential variables in $\mathcal{A}\left(y_{j}^{l}\right)$ time labels shall be ignored, i.e. $x_{i}^{k-1}$ and $x_{i}^{k}$ represent the same essential variable.

## Glitch filtering

For the given timing model, the above procedure exactly calculates the switching activity for every gate in a combinational circuit. However, discrepancies compared to gate- or transistor-level simulation with detailed timing capabilities can occur due to the abstract model employed. In particular, the multi-delay model ignores the effect of inertial delay on the switching behavior of logic gates, resulting in overestimation of switching activity [MDG ${ }^{+} 97$, DTP98]. Due to nonzero charging/discharging time of internal capacitances, a signal transition at the gate input will only cause a transition at the gate output, if the new input values remain stable for a sufficiently long period of time. Short-pulse glitches will not pass through the gate.

In order to account for this effect, a glitch filtering scheme has been proposed in [DTP98], which suppresses output pulses that are shorter than the delay time of the gate. A similar approximation could be incorporated into the above procedure by weighting a switching time point $l_{1}$ of gate $j$ that is subject to suppression through the next time point $l_{2}$ with some probability of occurrence. This occurrence probability is determined by the Boolean function of gate $j$ and $p_{y_{j}^{l_{1}}}^{1}$, $p_{y_{j}^{l_{2}}}^{1}, p_{y_{j}^{1-1}-y_{j}^{l_{2}}}^{1-1}$.

### 3.5 Probabilistic Analysis of Sequential Circuits

### 3.5.1 Feedback Circuits

Sequential circuits contain memory elements and either implement recursive or nonrecursive computations depending on whether the memory elements are located in feed-back or feed-forward signal paths. For sequential circuits without feedback, switching activity information for the combinational blocks can be obtained as explained in section 3.4. However, activity calculation for circuits with feedback requires the statistical properties of the feedback signal to be known, see figure 3.11 (a). Such feedback circuits are usually viewed as finitestate machines (FSM), by regarding the current feedback value $s(k)$ as the state of the system.


Figure 3.11: General logic circuit with sequential feedback (a), and example schematic with associated state-transition graph (b).

In case of feedback, the combinational circuit has an extended set of input signals $\tilde{\boldsymbol{x}}$ formed of primary inputs $\boldsymbol{x}$ and state signals $\boldsymbol{s}$, i.e. ${ }^{8}$

$$
\tilde{\boldsymbol{x}}=\left(x_{1}, \ldots, x_{n}, s_{1}, \ldots, s_{m}\right) .
$$

If there are $m$ state signals, the system has a total of $2^{m}$ states. The sequential feedback

$$
\begin{equation*}
\boldsymbol{s}(k+1)=\boldsymbol{y}(k)=\boldsymbol{f}(\boldsymbol{x}(k), \boldsymbol{s}(k)) \tag{3.32}
\end{equation*}
$$

introduces a new form of signal dependency, which manifests itself in spatio-temporal correlation of the extended input signal $\tilde{\boldsymbol{x}}$.
In analogy to (3.14) and (3.15), the $i$-th moments

$$
\begin{equation*}
p_{\tilde{x}_{j_{1} \ldots}^{\ldots} \tilde{x}_{j_{i}}}^{1 \ldots} \quad(i=1, \ldots, n+m) \tag{3.33}
\end{equation*}
$$

and the lag-one $i$-th moments

[^8]for the extended input $\tilde{\boldsymbol{x}}$ are required for exact calculation of gate activities. Moments that exclusively comprise primary inputs $x_{i}(i=$ $1, \ldots, n$ ) are assumed to be known a priori. All other moments, which refer to any of the state signals $s_{i}(i=1, \ldots, m)$ need to be determined prior to activity computation.

### 3.5.2 Analysis via Spectral Transformation

Explicit calculation of the unknown moments in (3.33) and (3.34), and hence switching activity, is a non-trivial task. The problem can be attacked by using the same technique that was utilized for switching activity calculation for combinational circuits in section 3.4, i.e. spectral transformation of Boolean functions. Since the general mathematical notation for the derivation of the unknown moments in (3.33) and (3.34) becomes very bulky, the basic concept shall be indicated by means of the example shown in figure 3.11(b).

First, the arithmetic transform of the next-state function $s(k+1)=$ $\overline{x(k) \vee s(k)}$ is constructed. For brevity, time index $k$ will be denoted as superscript. This yields (3.35), see table 3.1, which involves a second unknown term $x^{k} s^{k}$. Equation (3.36) corresponding to this term is constructed as arithmetic transform of $x(k) \wedge s(k)$. This in turn introduces two other unknown terms $s^{k} x^{k+1}$ and $x^{k} s^{k} x^{k+1}$, for which equations are formed in (3.37) and (3.38). This process would need to be continued indefinitely.

$$
\begin{align*}
s^{k+1}= & 1-x^{k}-s^{k}+x^{k} s^{k}  \tag{3.35}\\
x^{k+1} s^{k+1}= & x^{k+1}-x^{k} x^{k+1}-s^{k} x^{k+1}+x^{k} s^{k} x^{k+1}  \tag{3.36}\\
s^{k+1} x^{k+2}= & x^{k+2}-x^{k} x^{k+2}-s^{k} x^{k+2}+x^{k} s^{k} x^{k+2}  \tag{3.37}\\
x^{k+1} s^{k+1} x^{k+2}= & x^{k+1} x^{k+2}-x^{k} x^{k+1} x^{k+2} \\
& -s^{k} x^{k+1} x^{k+2}+x^{k} s^{k} x^{k+1} x^{k+2} \tag{3.38}
\end{align*}
$$

Applying the expectation operator in the same manner as in section 3.4.2 and assuming stationary primary inputs and hence stationary state signals, the above set of arithmetic transform equations
corresponds to the following equations for moments:

$$
\begin{align*}
p_{s}^{1} & =1-p_{x}^{1}-p_{s}^{1}+p_{x s}^{11}  \tag{3.39}\\
p_{x s}^{11} & =p_{x}^{1}-p_{x-x}^{1-1}-p_{s-x}^{1-1}+p_{x s-x}^{11-1}  \tag{3.40}\\
p_{s-x}^{1-1} & =p_{x}^{1}-p_{x-x}^{1-1}-p_{s--x}^{1-.-1}+p_{x s--x}^{11-.1}  \tag{3.41}\\
p_{x s-x}^{11-1} & =p_{x-x}^{1-1}-p_{x-x-x}^{1-1-1}-p_{s-x-x}^{1-1-1}+p_{x s-x-x}^{11-1-1} \tag{3.42}
\end{align*}
$$

As can be seen, moments referring to more than two consecutive time points get involved, e.g. $p_{x-x-x}^{1-1-1}$. In general, the above set of equations may be solved, if temporal correlation of the primary input $x$ is confined to a given number of time steps. In appendix C, this solution is worked out for first-order correlation, i.e. only $p_{x}^{1}$ and $p_{x-x}^{1-1}$ are given.

To obtain information on the temporal correlation of $s$, e.g. the first-order moment $p_{s-s}^{1-1}$, a set of equations similar to the one above can be constructed, see the example in appendix C. Note that even if the primary input $x$ is temporally uncorrelated, state signal $s$ in general is not, i.e. $p_{s-s}^{1-1} \neq p_{s}^{1} p_{s}^{1}$.

In the above example, the circuit has only one primary input and one state signal. The extension to circuits with multiple inputs and state signals is possible in a straightforward manner, by replacing $x$ and $s$ in the above analysis by vectors $\boldsymbol{x}$ and $\boldsymbol{s}$. In this case, spatial correlation between primary inputs can be acknowledged with the corresponding exponential increase in complexity, see section 3.3.3.

### 3.5.3 Markovian Analysis

Traditionally, a feedback circuit, or FSM, has been modeled as Markov chain [HMPS96, $\mathrm{T}^{+} 95$, CSY96, GDKW92], such that

$$
\begin{equation*}
\operatorname{Pr}\left\{s^{k+1} \mid s^{k}, \ldots, s^{0}\right\}=\operatorname{Pr}\left\{s^{k+1} \mid s^{k}\right\} \tag{3.43}
\end{equation*}
$$

for all possible state values $s$ of the system. In this case, the circuit's behavior is determined by specification of all $\left(2^{m}\right)^{2}$ state transition probabilities $\operatorname{Pr}\left\{s^{k+1} \mid s^{k}\right\}$, see appendix A. These transition probabilities result from the static probabilities of the primary input signals $p_{x_{i}}^{1}(i=1, \ldots, n)$ (or their joint probabilities $p\left(x_{1}, \ldots, x_{n}\right)$ if spatial
correlation is acknowledged), and are associated with the edges of the state-transition graph of the system. For the example circuit in figure 3.11 (b) this model is shown below.


Since with the above model the memory of the system reaches back only one time step, (3.43) also is called lag-one Markov chain. Such a model, however, totally neglects temporal correlation in the primary input $\boldsymbol{x}$. Thus, in the context of data statistics (3.43) only represents an approximation, although this is often ignored [ $\mathrm{T}^{+} 95$, GDKW92]. To correctly model feedback circuits for arbitrary input data statistics, the concept of higher-order Markov chains must be employed [MMP99]. Formally, the following result is presented.

## Theorem 3.3

Assume a circuit with sequential feedback as in (3.32). If temporal correlation in the primary input $x$ extends over $\tau$ time steps, then state signal $s$ forms a lag- $(\tau+1)$ Markov chain, i.e.

$$
\begin{equation*}
\operatorname{Pr}\left\{s^{k+1} \mid s^{k}, s^{k-1}, \ldots, s^{0}\right\}=\operatorname{Pr}\left\{s^{k+1} \mid s^{k}, \ldots, s^{k-\tau}\right\} \tag{3.44}
\end{equation*}
$$

Proof: The proof is outlined for $\tau=1$ (first-order temporal correlation of primary input) and the example circuit in figure $3.11(\mathrm{~b})$, since spatial correlation is immaterial here. Assume that state $s$ advances as indicated by dashed lines in the trellis diagram below.


Under a lag-one Markov chain model the probability for the next state only depends on the static signal probability of $x$ if the present state
is fixed, e.g. $\operatorname{Pr}\left\{s^{k+1}=1 \mid s^{k}=0\right\}=\operatorname{Pr}\left\{x^{k}=0\right\}$. Under a lag-two Markov chain model the probability for the next state depends on the one-step conditional probability of $x$ if the present and previous states are fixed, e.g. $\operatorname{Pr}\left\{s^{k+1}=1 \mid s^{k}=0, s^{k-1}=0\right\}=\operatorname{Pr}\left\{x^{k}=0 \mid x^{k-1}=\right.$ $1\}$. Since $\tau=1$, in general $\operatorname{Pr}\left\{x^{k}=0\right\} \neq \operatorname{Pr}\left\{x^{k}=0 \mid x^{k-1}=1\right\}$ and hence $\operatorname{Pr}\left\{s^{k+1} \mid s^{k}\right\} \neq \operatorname{Pr}\left\{s^{k+1} \mid s^{k}, s^{k-1}\right\}$, which contradicts the lag-one Markov chain model (3.43).

On the other hand, modeling $s$ as lag-three Markov chain is not necessary, because for $\tau=1$ follows that $\operatorname{Pr}\left\{x^{k} \mid x^{k-1}\right\}=\operatorname{Pr}\left\{x^{k} \mid x^{k-1}, x^{k-2}\right\}$ and hence $\operatorname{Pr}\left\{s^{k+1} \mid s^{k}, s^{k-1}\right\}=$ $\operatorname{Pr}\left\{s^{k+1} \mid s^{k}, s^{k-1}, s^{k-2}\right\}$.

Thus, in the context of switching activity calculation, modeling a circuit as lag-one Markov chain will result in erroneous moments $p_{s}^{1}, p_{x s}^{11}, p_{s-s}^{1-1}, \ldots$ if the primary inputs are temporally correlated. This in turn induces an error on the switching activity of the gates in the combinational part of the circuit. It has been found that this error can exceed $100 \%$ depending on the specific circuit structure and input correlation [SK96]. In appendix C the discrepancy between a lag-one and a lag-two Markov chain model is exemplified for the circuit in figure 3.11(b).

### 3.5.4 Implications

At this point the following implications can be formulated:

- The conventional Markov chain model for feedback circuits is applicable only in case of uncorrelated primary input signals. For correct modeling of correlated input sequences, higher-order Markov models are required.
- The arithmetic transform approach provides a means for explicit calculation of switching activities in feedback circuits under a higher-order Markov model. Unlike in combinational circuits, where moments of the primary inputs up to lag-one are sufficient for exact analysis, feedback circuits require the knowledge of moments up to lag- $\tau$ if temporal correlation in the input signal stretches over $\tau$ time steps.
- An approximation technique similar to the one examined for combinational circuits in section 3.4.3 is conceivable for the estimation of state signal statistics. This would require an appropriate selection procedure for relevant terms in sets of equations such as (3.35)-(3.38). In practice, one resorts to logic simulation in order to infer the statistical properties of the state signals as well as gate switching activities. In this case, compact input stimuli are desired that closely resemble the relevant statistical properties of the target input signal [MMP97].


### 3.6 Summary

This chapter was concerned with the probabilistic analysis of switching activity for average power estimation. We have provided the theoretical basis for exact calculation of switching activity in logic circuits by bridging known concepts from two different disciplines: The notation of moments of a random variable known from probability theory, and the concept of spectral transformation known from the theory of switching functions. Based on this theoretical foundation, a novel heuristic for the approximation of signal correlation in combinational circuits has been proposed.

Our work improves the state of the art for probabilistic activity estimation in combinational circuits [MDG ${ }^{+} 97$, CMD97, MMP98] as follows :

- The proposed exact technique accurately models structural, spatial, and temporal correlation, where each step of the procedure is dedicated to one of these three types of correlation. Such exact technique is not only of theoretical interest but is also useful for calibrating approximation techniques.
- Contrary to [CMD97], where polynomial expressions for all four fist-order joint probabilities were incrementally constructed, we have shown that it suffices to incrementally construct one polynomial for each node. From this polynomial the correct joint probabilities, and hence switching activity for every node can be derived.
- Under a zero- and multi-delay model, the gate switching activities in a combinational circuit can be calculated exactly from the minimum number of statistical measures that fully specify the first-order statistics of the primary inputs.
- The proposed estimation scheme provides control over the approximation of all three types of correlation by means of a single parameter. The new estimation scheme requires no preprocessing of the network and retains incremental processing. The monotonic behavior and effectiveness of this approximation method have been verified experimentally.

Furthermore, it was demonstrated how the arithmetic transform technique also can be used for explicit analysis of circuits with sequential feedback. In this case, higher-order correlation of inputs can be taken into account.

## Seite Leer / Blank leaf

## Chapter 4

## Data Statistics and Minimum Power Dissipation

### 4.1 Introduction

### 4.1.1 Motivation

As has been seen in the previous chapter, power consumption in digital VLSI is subject to the statistical properties of the application data at hand. Given a gate-level circuit and sufficient information on the data characteristics, switching activity and power dissipation can be predicted. Another question arising in this context is the existence of an absolute lower bound on switching activity and power dissipation subject to statistical data properties. Although related to power estimation, this lower bound problem differs in that it asks for the amount of energy indispensable for the completion of some given computational task. Unlike in power estimation, the lower bound shall account for data statistics without assuming any specific network structure.

Besides its theoretical significance such a lower bound would also provide useful guidance for practical low-power design. Knowing the
minimum dissipation achievable for certain operations would allow to rate the energy efficiency of state-of-the-art implementations for these operations. Implementations with poor energy efficiency should then be prime candidates for future optimization efforts.

### 4.1.2 Previous Work

## Thermodynamics of computation

The relation between entropy and the second law of thermodynamics on one hand, and the physical limits of computation on the other hand, has challenged researchers for many decades [LR90, Hey99]. Despite its long record in the history of science, the topic has remained a controversial subject among physicists. Its pervasive consequences even fuel philosophical debates and speculations [KS99]. Section 4.2.3 shall discuss the relation between the thermodynamics of computation and minimum power consumption in VLSI.

## Information-theoretic approaches

Next to thermodynamic entropy, the concept of information-theoretic entropy [CT91] has also been employed to investigate the limits of computing. One such approach is the notion of computational work, which equals the entropy of the computation's output sequence, multiplied by the number of possible input symbols [Hel72]. However, computational work is invariant with respect to data statistics, since it assumes uniformly distributed input data.

Given the information-theoretic entropy of a sequence of symbols, lower and upper bounds on the switching activity of a binary representation of this sequence have been derived in [RSH99b]. In section 4.5, this result will be employed for the analysis of minimum power consumption in the context of data transmission.

In [Sha97], the problem of minimum power consumption in digital VLSI has been investigated in analogy to Shannon's joint sourcechannel coding theorem [Sha48]. This approach will be discussed in section 4.7, and classified according to the problem hierarchy to be developed.

## Technology limits

Despite ongoing research for alternatives, e.g. quantum information processing [SR00] or low-temperature superconducting ICs [BTR00]), to date MOS-transistor based VLSI, and here predominantly CMOS ${ }^{1}$, has remained the implementation technology of choice for virtually any application manipulating time-discrete digital data. For volume production, this is not expected to change for at least the next 10 to 20 or so years.

Opportunities for future gigascale CMOS integration were thoroughly analyzed in [Mei95] by considering theoretical and practical limits at different levels of hierarchy: Fundamental limits are inferred from the basic laws of thermodynamics, quantum mechanics, and electromagnetics. Material limits are dictated by microscopic properties of Si , such as carrier mobility and saturation velocity. Device limits for MOS-transistors are given by the minimal channel length. Apart from technological feasibility, channel length is confined by transistorinternal short channel effects. Circuit limits refer to requirements that must be met by logic gates and interconnects in CMOS designs, and are linked to the power-delay product of a single switching event. On the system level, switching energy limits are imposed by heat removal and cycle time requirements.

For a given delay, the limits on power consumption were found to be more restrictive the higher one moves up in the hierarchy [Mei95]. However, statistical data properties were not taken into account in this analysis. This shall be done in this chapter.

### 4.1.3 Outline

Since the choice of implementation medium determines the role of data statistics for power minimization, this issue is discussed first in section 4.2 for CMOS. The problem of lower bounds on power dissipation in the context of data statistics will be formulated and classified in section 4.3. Subsequent sections then deal with the four subproblems resulting from this classification. Section 4.8 compares

[^9]the lower bound problem with the information-theoretic transmission bound problem.

### 4.2 The Limits of VLSI Technology

### 4.2.1 The Ideal MOSFET

For minimum power consumption in the context of data statistics, it is of interest to explore the minimum dissipation associated with a binary switching event. This shall be done at the device level, since MOS transistors are the basic devices in CMOS technology.

The energy stored on the gate capacitance of a single MOSFET in a CMOS inverter is

$$
E_{g}=Q_{g} V_{d d} / 2 .
$$

This amount of energy is dissipated for every switching of logic state, represented by the gate charge. The minimum possible gate charge is one electron: $Q_{g}=q$. Approximating the minimum allowable supply voltage for a simple inverter circuit as [MD00]

$$
V_{d d}=2(\ln 2) k T / q \approx 36 \mathrm{mV}
$$

the ideal MOSFET dissipates a minimum energy of

$$
\begin{equation*}
E_{0}=(\ln 2) k T \tag{4.1}
\end{equation*}
$$

for every binary switching transition ${ }^{2}$. At room temperature this amounts to

$$
\begin{equation*}
E_{0} \approx 3 \cdot 10^{-21} \mathrm{~J} \quad @ 300 \mathrm{~K} \tag{4.2}
\end{equation*}
$$

Such a device that only stores a single electron in its gate capacitance when being charged has been named SETE (single electron, thermal excitation) MOSFET [MD00]. It should be noted that this is a hypothetical device that ignores many practical aspects such as layout parasitics and technology imperfections.

[^10]
### 4.2.2 Technology Trends

Table 4.1 compares several CMOS standard-cell libraries available today to the SETE MOSFET. $L_{\text {min }}$ denotes the minimum feature size ( $=$ MOS transistor channel length), and $V_{d d}$ is the supply voltage. The gate capacitance $C_{g}$ is derived from the input capacitance of a standard inverter with 1 x -drive, assuming a ratio of two for the $\mathrm{p} / \mathrm{n}$-channel widths. Gate charge $Q_{g}=V_{d d} C_{g}$ is given as number of electrons. The energy dissipated per switching event $E_{s w}=C_{g} V_{d d}^{2} / 2$ is given in units of fundamental energy limits $E_{0}$ in (4.2).

As can be seen, even the most advanced technology dissipates four to five orders of magnitude more energy than the theoretical limit would imply. This is in compliance with experimental data provided in [Key01]. On the other hand, the number of electrons stored in a transistor gate is surprisingly small already. However, extrapolation of figures from table 4.1 to speculate about the future trend and when the theoretical limit will be reached is notoriously difficult.

| Library |  | $\begin{gathered} \hline L_{\min } \\ {[\mathrm{nm}]} \\ \hline \end{gathered}$ | $\begin{aligned} & \hline V_{d d} \\ & {[\mathrm{~V}]} \\ & \hline \end{aligned}$ | $\begin{array}{r} C_{g} \\ {[\mathrm{aF}]} \\ \hline \end{array}$ | $\begin{gathered} Q_{g} \\ {\left[q_{e}\right]} \\ \hline \end{gathered}$ | $\begin{aligned} & E_{s w} \\ & {\left[E_{0}\right]} \\ & \hline \end{aligned}$ |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Name | Vendor |  |  |  |  |  |
| CXB | AMS | 800 | 5.0 | 12'000 | 375 '000 | $5.2 \cdot 10^{7}$ |
| CUB | AMS | 600 | 5.0 | 10'000 | 312 '500 | $4.3 \cdot 10^{7}$ |
| C075 | Philips | 350 | 3.3 | 3'000 | 62'000 | $5.7 \cdot 10^{6}$ |
| C050 | Philips | 250 | 2.5 | 2'000 | 31'250 | $2.2 \cdot 10^{6}$ |
| CMOS18 | Philips | 180 | 1.8 | 1'500 | $17 \times 00$ | $8.5 \cdot 10^{5}$ |
| SA-27 ${ }^{\text {a }}$ | IBM | 150 | 1.3 | 1'075 | $8 ’ 750$ | $3.2 \cdot 10^{5}$ |
| $\mathrm{Cu}-11^{\text {b }}$ | IBM | 110 | 1.0 | 330 | 2'000 | $5.8 \cdot 10^{4}$ |
| SETE | [MD00] | 14 | 0.04 | 4.47 | 1 | 1 |

${ }^{a}$ nominal $V_{d d}=1.8 \mathrm{~V},{ }^{b}$ nominal $V_{d d}=1.2 \mathrm{~V}$

Table 4.1: CMOS technology trend in terms of number of elementary charges stored on a single gate capacitance, and energy dissipation per switching event.

### 4.2.3 VLSI and Thermodynamics

## Thermodynamic results

In his 1929 paper [Szi29] Szilard introduced his famous one-molecule model of a thermodynamic system. The so-called Szilard engine for the first time linked the concepts of entropy, information and memory and led to the notion of a "bit" of information. In this way, Szilard's work provided the foundation for Shannon's theory of information [Sha48], which will be the starting point for the discussion in section 4.7.

Another important milestone in the development of today's understanding of the physics of computing is the work of Landauer and Bennett. Before their work it was believed that any computer operating at temperature $T$ must dissipate at least $(\ln 2) k T$ Joule for every "elementary act of information". This is exactly the energy limit (4.1) previously derived for CMOS.

In [Lan61] Landauer introduced the concept of logical reversibility and argued that logical irreversibility implies physical irreversibility, which is accompanied by heat dissipation. This follows from the fact that a logically irreversible operation would otherwise be able to decrease the thermodynamic entropy of the computer's memory without a compensating entropy increase elsewhere in the universe, thereby violating the second law of thermodynamics. Subsequently, any logically irreversible operation, such as the mapping of an unknown, randomly chosen state to a known successor state, must dissipate a minimum amount of energy. On the other hand, computation steps that do not discard information can be done reversible in principle. Today, this fact is known as Landauer's principle [BGL $\left.{ }^{+} 98\right]$.

Bennett [Ben73] expanded Landauer's idea, arguing that every computation step can be made in a logically reversible manner. Thus, erasure of information is not essential to computing. Adopting this conclusion, many thought experiments to construct computers with arbitrary little dissipation have since been performed, including reversible logic (Fredkin) gates, the frictionless billiard-ball computer, Brownian and enzymatic Turing machines [BL85]. These models give
evidence to the fact, that the laws of physics do not preclude the invention of a technology that allows dissipationless computing. However, they do not provide a practical design procedure for such a machine.

## Adiabatic logic

One approach to "dissipationless" computing based on MOS transistors, called adiabatic logic, emerged in the early 1990's [CB95]. This approach dispenses with constant-voltage charging of capacitances as done in ordinary CMOS. In analogy to thermodynamics, the defining property of "adiabatic" logic is the dissipation decrease with increased time allowed for the completion of the charge transfer process. This requires time-varying supply voltages as well as special circuit styles.

However, fully-adiabatic processing can only be achieved by using gates that implement invertible functions exclusively, i.e. by restriction to reversible logic in the sense of Bennet [Ben73]. The overhead induced by reversible logic can be avoided by resorting to partiallyadiabatic designs. In this case, dissipation involves a non-adiabatic component, which can not be decreased by slowing down the charge transfer process.

Although feasible in principle, the complicated circuitry and the resonant power supplies required, have largely prevented adiabatic logic from being commercially used to date.

## Hierarchy of dissipation limits

As minimum feature size of VLSI technology continues to shrink, energy dissipation per binary switching event diminishes. If todays CMOS technology could be advanced to the SETE level, every switching event, whether associated to a logically irreversible operation or not, would dissipate $E_{0}$ Joule independently of the time allowed to complete the switching. Further reduction is impossible because of the basic concept of information representation and manipulation, i.e. constant-voltage charging and draining of capacitances.

If one is to break the $E_{0}$-per-switching barrier at room temperature, one will be forced to use an implementation technology that dissipates no energy for logically reversible operations (Landauer's


Figure 4.1: Evolution of energy dissipation limits.

Principle) ${ }^{3}$. If one then is to reduce energy dissipation beyond $E_{0}$ per bit of information discarded, one will be forced to compute in a logically reversible manner using an appropriate implementation medium. However, zero energy dissipation can only be reached asymptotically if the time for completing a binary switching event is allowed to go to infinity. Figure 4.1 illustrates the evolution of minimum energy dissipation.

### 4.2.4 Implications

At this point, the following statements can be made:

- CMOS technology in conjunction with constant-voltage charging and draining of capacitances imposes a $E_{0}$-per-switching barrier on minimum energy dissipation.

[^11]- Consequently, from the energy efficiency point of view, CMOS with constant supply voltage is a suboptimal implementation technology since it does not obey Landauer's principle.
- Since with CMOS technology every switching event is accompanied by energy dissipation, the number of those events shall be reduced to the absolute minimum necessary. This motivates the question of a lower bound on switching activity for a given processing task. In section 4.3 this question will be formalized in the context of data statistics.


### 4.3 Taxonomy for Lower Bound Problem

In this section, the power dissipation bound problem will be formulated as abstract mathematical optimization problem in order to introduce a consistent classification scheme. This taxonomy shall help to comprehend the issue, and allow for categorization of existing approaches to solve the problem.

### 4.3.1 Problem Formulation

Let the implementation technology be CMOS with fixed process parameters. Then, in the light of data statistics, the lower bound problem (LBP) for low-power digital VLSI is stated as follows:
> "A digital operation is to be performed on a set of input data items with specified statistical properties in a given amount of time. What is the minimum energy any implementation of this processing task must dissipate?"

Since the processing time is fixed, minimum energy dissipation is equivalent to minimum average power. Therefore, the terms energy and power may be used interchangeably.

In order to arrive at a mathematical formulation of LBP all parameters that affect power dissipation need to be identified. These parameters are:

1. the digital operation ${ }^{4}$, denoted $\Gamma$
2. input data statistics, denoted $\operatorname{Pr}\{$.
3. the processing time ${ }^{5}$, denoted $T_{P}$
4. the RT-level architecture ${ }^{6}$, denoted $A_{|\oplus|}$
5. the binary data representation, denoted $\mathcal{B}\{$.
6. the gate-level architecture, denoted $A_{-\infty}$
7. the supply voltage level $V_{d d}$.

Hence, the power dissipation function $P$ may be written as:

$$
\begin{equation*}
P=P\left(\Gamma, \operatorname{Pr}\{.\}, T_{P}, A_{\oplus \mid}, \mathcal{B}\{.\}, A_{-\infty}, V_{d d}\right) . \tag{4.3}
\end{equation*}
$$

The seven variables in this function are mutually dependent and construct an entangled relation for power dissipation.

For the lower bound problem the variables in (4.3) may be divided into two groups:

$$
\begin{array}{ll}
\text { given input parameters } & : \Gamma, \operatorname{Pr}\{.\}, T_{P} \\
\text { free optimization variables } & : \\
A_{\notin \mid}, \mathcal{B}\{.\}, A_{-\infty}, V_{d d} .
\end{array}
$$

The three input parameters jointly define a processing task, which shall be performed on a digital circuit. Note that input data statistics are an integral part of the design specification for such a circuit.

For a specific problem instance of LBP the input parameters are given. The goal is to find the minimum power dissipation for any combination of the optimization variables:

$$
\begin{equation*}
P\left(A_{\nmid \in}, \mathcal{B}\{.\}, A_{-\infty}, V_{d d}\right) \longrightarrow \operatorname{Min} . \tag{4.4}
\end{equation*}
$$

[^12]Solving this optimization problem provides the lower bound on power dissipation for specific $\Gamma, \operatorname{Pr}\{$.$\} , and T_{P}$, and also the settings of the optimization variables that achieve this bound.

The problem is, that in order to solve (4.4), an analytical expression for $P$ is required. For voltage level $V_{d d}$ and processing time $T_{P}$ the relation to power consumption may be stated explicitly. $\Gamma, \operatorname{Pr}\{$.$\} ,$ $A_{\notin+}, \mathcal{B}\{$.$\} , and A_{\mapsto \infty}$ are, however, difficult to cast in mathematical formulas. Such formulas not only would need to describe the relation to power dissipation, but also must reflect the interdependence of variables. This makes the analytical treatment of LBP a rather unpromising undertaking and simplification of LBP seems indispensable.

Such simplifications can be attained by regarding some of the optimization variables as input parameters. Fixing different combinations of optimization variables to given values results in a classification of LBP. As will be seen, this classification naturally follows from making certain assumptions on the operation $\Gamma$ and data statistics $\operatorname{Pr}\{$.$\} .$

### 4.3.2 Representation of Digital Operations

Any digital operation $\Gamma$ may be written as

$$
\begin{equation*}
o(k)=\Gamma(i(k), i(k-1), \ldots, i(k-\infty), o(k-1), \ldots, o(k-\infty)) \tag{4.5}
\end{equation*}
$$

where $i(k)$ and $o(k)$ denote the present input and output value, respectively. In order to tackle LBP, a more precise specification of the operation $\Gamma$ is required. This specification will bear the first simplification of the problem.

## FSM representation

The processing functionality of any digital VLSI system, including those that implement DSP algorithms, may be modeled as a finitestate machine (FSM). The most general form of an FSM, i.e. a Mealy automaton, is depicted in figure 4.2(a). $I, O$, and $S$ denote the set of input, output, and state symbols, respectively. The current state of the system $s(k)$ is stored in some memory element, which in practice may be a set of register cells or a RAM macro-cell.


Figure 4.2: General finite-state machine (a), digital operation $\Gamma$ modeled as FSM (b), and memoryless mapping (c).

Let operation $\Gamma$ in (4.5) be associated with the general FSM representation as shown in figure $4.2(\mathrm{~b})$. Then, the mathematical specification of $\Gamma$ involves:

1. a special element $s(0) \in S$, referred to as initial state,
2. a next-state function $f: I \times S \rightarrow S$, and
3. an output function $g: I \times S \rightarrow O$.

There are two serious difficulties with this representation of $\Gamma$ that point to the problem of finding an appropriate generic description for a general digital operation:

- The specification of $\Gamma$ involves the initial state $s(0)$ of the system. This is highly undesirable, since it implies that two otherwise identical processing tasks will correspond to different instances of LBP and in general will have different lower bounds due to deviating initial states. To avoid this peculiarity, $s(0)$ would need to be appended to the list of optimization variables, which complicates the power dissipation function $P$, and thus the optimization problem (4.4) even further.
- Even for a given initial state $s(0),(f, g)$ defines no canonic representation of $\Gamma$ because different pairs of next-state and output function can yield the same $I / O$-behavior. Every pair $(f, g)$ describing the same operation $\Gamma$ will represent different instances
of LBP. This is because $f$ and $g$ imply a certain state encoding, and vice versa. In other words, specifying next-state and output function constrains the optimization variable RT-level architecture $A_{|\oplus|}$.
Consequently, a generic model for $\Gamma$ needs to detach from FSM related concepts. This suggests to separate the memoryless mapping from the memory.


## Memoryless mapping

Let the RT-level architecture $A_{\dagger \mid 1}$ for the implementation of $\Gamma$ be defined. By discarding the feedback of information as shown in figure $4.2(\mathrm{c}), \Gamma$ can then be modeled as memoryless mapping

$$
\begin{equation*}
\Gamma: X \rightarrow Y \tag{4.6}
\end{equation*}
$$

where $X=I \times S, Y=O \times S$. This implies that the input (output) of the data memory is considered part of the output (input) of $\Gamma$. Hence, (4.6) is a generic model for any digital operation $\Gamma$ excluding all sequential actions involved. Utilizing (4.6) for the solution of LBP, the lower bound on power dissipation will refer to a particular RTlevel architecture rather than to the operation itself. In other words, describing $\Gamma$ as memoryless mapping moves $A_{|\in|}$ from the list of optimization variables to the list of input parameters. Obviously, this also circumvents the problem of initial state as described above.
$\Gamma$ in figure 4.2(c) comprises two memoryless mappings $g$ and $f$, that share the same domain $I \times S$ but differ in their ranges. The lower bounds of $g$ and $f$ are independent only, if the two functions are to be implemented separately. In this case, the lower bound for $\Gamma$ can be obtained from a summation over the lower bounds for $g$ and $f$. Information regarding the statistical properties of the input data, i.e. $\operatorname{Pr}\{I \times S\}$, may be obtained from simulating a RT-level description of $\Gamma$ or from theoretical analysis of the associated FSM representation, see section 3.5.

## Example:

Assume the operation $\Gamma$ to be the linear time-invariant difference equation

$$
\begin{equation*}
o(k)=c_{0} \cdot i(k)+c_{1} \cdot i(k-1)+c_{2} \cdot i(k-2) \tag{4.7}
\end{equation*}
$$

with integer operands. In order to arrive at a memoryless mapping as in (4.6), the RT-level architecture $A_{|\oplus|}$ needs to be defined. Figure 4.3(a) and (b) show two alternative architectures that implement (4.7). The two architectures can be translated to their corresponding FSM representation, see figure $4.3(\mathrm{c})$ and (d). The resulting sets of state symbols $S_{1}$ and $S_{2}$, and thus the memoryless mappings $\Gamma_{1}$ and $\Gamma_{2}$, are quite different. This in turn will yield different lower bounds for the two architectures.


Figure 4.3: Two different RTL architectures implementing the same DSP operation (a,b) and the corresponding FSM representations (c,d).

### 4.3.3 Representation of Data Statistics

Next to operation $\Gamma$, data statistics $\operatorname{Pr}\{$.$\} is the other important input$ parameter to LBP. Data statistics may be given at two different levels of abstraction. Each of these levels implies a specific interpretation of the memoryless mapping (4.6), thus giving rise to further classification of the lower bound problem.

## Word-level data statistics

One way to specify the statistical properties of the application data is at the word level. In this case, $X$ and $Y$ in $\Gamma: X \rightarrow Y$ must be seen as abstract sets of words, or symbols. Without loss of generality it can be assumed that the symbols are integer numbers, i.e. $X, Y \in \mathbb{Z}$. However, nothing is said about the binary representation of those symbols. Hence, binary representation $\mathcal{B}\{X\}$ remains an optimization variable. The basic question then is, up to which extent can the statistical properties of the application data $\operatorname{Pr}\{X\}$ be taken advantage of by means of appropriate coding.

A particularly compact representation of the statistical properties of the input data is enabled via the information-theoretic notion of entropy, see appendix B. In section 4.5 this will be elaborated further.

## Bit-level data statistics

The alternative way to specify data statistics is at the bit level by means of probability measures as introduced in section 3.3. This implies, that the binary representation for each input symbol is known. In this case, $\mathcal{B}\{X\}$ is an input parameter rather than an optimization variable.

Assuming that the binary representation of input symbols implies the binary representation of output symbols, and that input and output symbols are represented with a fixed number of $L$ and $M$ bits, the memoryless mapping (4.6) becomes

$$
\begin{equation*}
\Gamma: \mathcal{B}^{L} \rightarrow \mathcal{B}^{M} \tag{4.8}
\end{equation*}
$$

The lower bound problem then is equivalent to finding the minimum power dissipation of any combinational circuit implementing the

Boolean function (4.8). At this point, two optimization variables remain, namely gate-level architecture $A_{\rightarrow \infty}$ and supply voltage $V_{d d}$. If either of these is defined as well, the respective other variable is the only degree of freedom left for minimizing power consumption and is constraint by processing time $T_{P}$.

### 4.3.4 Problem Hierarchy

Based on the discussion above a categorization of LBP into four main problem classes is suggested. Each of these problem classes is associated with a special optimization problem as in (4.4). Figure 4.4 shows this classification of the lower bound problem. The four subproblems are denoted according to the number of variables in the optimization problem associated:

LBP-4: This is the most general form of the lower bound problem, where all four optimization variables are available for minimizing power dissipation. No answer to this problem is known. Section 4.4 discusses the requirements a potential solution should comply with.

LBP-3: Here, $\Gamma$ is given as a memoryless mapping, which defines the RT-level architecture. Since data statistics is specified at the word-level, binary representation of data $\mathcal{B}\{$.$\} remains subject$ to optimization. LBP-3 corresponds to the problem of minimum power dissipation for data transmission, which can be solved by an information-theoretic approach, see section 4.5.

LBP-2: In this case, the binary data representation $\mathcal{B}\{$.$\} is defined,$ and the lower bound is subject to gate-level architecture $A_{-\infty}$ and supply voltage $V_{d d}$. No solution to LBP-2 is known. Section 4.6 reviews related results and discusses their qualification for LBP-2.

LBP-1: This is the most basic version of the lower bound problem, where power dissipation is to be minimized solely due to the supply voltage level. As shown in figure 4.4 there are two paths to LBP-1:

Figure 4.4: Classification of the lower bound problem LBP (input data statistics $\operatorname{Pr}\{$.$\} , RT-level architecture$ $A_{|\oplus|}$, binary data representation $\mathcal{B}\{$.$\} , gate-level architecture A_{-\infty-}$, supply voltage $V_{d d}$; see pp. 68).

1. The path via bit-level statistics can be thought of as the traditional voltage-scaling problem with given RT- and gate-level architecture. If the delay of the critical path in the system is known for some nominal supply voltage, then the minimum $V_{d d}$ can be calculated such as to meet the processing time requirement [CB95].
2. The path via word-level statistics is mostly of academic interest, since for given binary representation $\mathcal{B}\{$.$\} and$ gate-level architecture $A_{\triangleright \infty}$ the minimum supply voltage does not depend on data statistics and can be calculated as above. However, an information-theoretic approach which applies to LBP-1 has been suggested in [Sha97], and will be discussed in section 4.7.

Two general remarks regarding the proposed classification of LBP are due:

- The four optimization variables $A_{\text {¢f }}, \mathcal{B}\{\},. A_{-\infty}, V_{d d}$, in that order, correspond to decisions at different stages in the design flow. Violating this order when determining these variables is expected to overly limit the design space, and often is even impractical. For instance, fixing the gate-level architecture $A_{-\infty}$ first, and then optimizing with respect to binary data representation $\mathcal{B}\{$.$\} , makes no sense. Therefore, not every combination$ of optimization variables corresponds to a useful subproblem of LBP. The four subproblems LBP-1, $\ldots$, LBP-4 are regarded as the essential ones. In particular, further subclassification within LBP-4 seems not fruitful because the RT-level architecture $A_{|\oplus|}$ is not defined.
- The hardness of the lower bound problem does not vary monotonically with the number of optimization variables. The reason is that the four optimization variables are related in an intricate fashion. For instance, LBP-3 is "easier" to solve than LBP-2. The additional optimization variable $\mathcal{B}\{$.$\} in this case simpli-$ fies the problem, see section 4.5. On the other hand, LBP-2 is harder than LBP-1, obviously. In this case the additional degree of freedom provided by $A_{-\infty}$ can only be employed in a
subspace of the LBP-3 solution space, and the minimum of the power dissipation function $P$ may in general be more difficult to find in this subspace.


### 4.4 The General Lower Bound Problem

As pointed out above, LBP-4 is the most general and, presumably, most difficult lower bound problem, with all four optimization variables being subject to variation. The difficulty consists in finding an appropriate power dissipation function that comprises all optimization variables and input parameters, see (4.3). No such power function is currently known. This section provides general guidelines which may be helpful for directing future attempts to tackle the problem.

### 4.4.1 Tightness of the Lower Bound

Let the processing task be written as vector of input parameters

$$
t=\left(\Gamma, \operatorname{Pr}\{.\}, T_{P}\right),
$$

and the optimization variables be denoted

$$
z=\left(A_{|\oplus|}, \mathcal{B}\{.\}, A_{\triangleright \infty}, V_{d d}\right) .
$$

Knowing the analytical expression for the power dissipation function $P(t, z)$, the optimization problem (4.4) can be solved. This not only provides the lower bound on power dissipation for a specific processing task $t$, but also yields the optimum $z$, denoted as $z^{*}$. With this information, the most energy-efficient implementation of $t$ can be designed.

Forgoing the information on optimum values, a different approach to LBP may be taken. In this case it suffices to find some power dissipation function $\hat{P}=\hat{P}(t)$ that only depends on the three input parameters. Since $\hat{P}$ shall provide a lower bound, there must hold

$$
\begin{equation*}
\hat{P}(t) \leq P(t, z), \quad \forall t, z . \tag{4.9}
\end{equation*}
$$

Intuitively, it seems less difficult to construct $\hat{P}$ than $P$. However, in order to be a practical lower bound, there are certain requirements
that $\hat{P}$ has to meet. For instance, the naive lower bound $\hat{P} \equiv 0$ complies with (4.9) but is useless because it is not tight for most combinations of input parameters, i.e. not close to the true lower bound provided by minimizing $P$. In general, a lower bound on power dissipation for some processing task $t$ is of limited practical use, if nothing is known about its quality, i.e. its tightness.

The requirement of tightness can be made more specific in the context of energy efficiency, as discussed next.

### 4.4.2 Energy Efficiency Rating

Let the power dissipation of some implementation of processing task $t$ be $P_{\text {impl }}(t)$. Then, a natural definition for the energy efficiency of this implementation of $t$ is

$$
\begin{equation*}
\eta(t)=\frac{P\left(t, z^{*}\right)}{P_{i m p l}(t)} . \tag{4.10}
\end{equation*}
$$

$\eta \leq 1$ with equality only if the implementation at hand is optimum.
In order to rate the energy efficiency according to (4.10) the opti$\operatorname{mum} A_{\notin+1}^{*}, \mathcal{B}\{.\}^{*}, A_{\rightarrow \infty}^{*}$, and $V_{d d}^{*}$ must be known. This can be circumvented by employing the lower bound $\hat{P}(t)$ instead of the true power dissipation function $P(z, t)$, in order to obtain an estimated efficiency:

$$
\begin{equation*}
\hat{\eta}(t)=\frac{\hat{P}(t)}{P_{\text {impl }}(t)} . \tag{4.11}
\end{equation*}
$$

As pointed out above, the usefulness of the efficiency estimated in this way hinges on the quality of the lower bound $\hat{P}(t)$. In particular, the call for reliable comparison of efficiency between two implementations of different processing tasks $t_{1}$ and $t_{2}$, poses the following requirement $^{7}$ :

$$
\begin{equation*}
\eta\left(t_{1}\right)-\eta\left(t_{2}\right)=\hat{\eta}\left(t_{1}\right)-\hat{\eta}\left(t_{2}\right) . \tag{4.12}
\end{equation*}
$$

[^13]Substitution of (4.10) and (4.11) yields

$$
\begin{equation*}
P_{i m p l}\left(t_{2}\right)\left[P\left(t_{1}, z^{*}\right)-\hat{P}\left(t_{1}\right)\right]=P_{i m p l}\left(t_{1}\right)\left[P\left(t_{2}, z^{*}\right)-\hat{P}\left(t_{2}\right)\right] . \tag{4.13}
\end{equation*}
$$

For arbitrary $P_{\text {impl }}\left(t_{1}\right)$ and $P_{\text {impl }}\left(t_{2}\right)$ this can only be satisfied if

$$
\begin{equation*}
\hat{P}(t) \equiv P\left(t, z^{*}\right), \quad \forall t \tag{4.14}
\end{equation*}
$$

Hence, for a lower bound function $\hat{P}(t)$ to be beneficial for general rating of energy efficiency, it must evaluate to the true minimum power dissipation, i.e. it has to be tight for any processing task $t$. In this light it is questionable whether a lower bound $\hat{P}(t)$ that satisfies (4.14) is easier to find than the true power dissipation function $P(t, z)$. Since $\hat{P}(t)$ evaluates to the true minimum power, it is expected that knowing such a lower bound also hints at how to design the most energy-efficient implementation of $t$, even though there is no explicit dependence on the optimization variables $A_{|\oplus|}, \mathcal{B}\{\},. A_{\rightarrow \infty}$, and $V_{d d}$.

### 4.4.3 Axiomatic Requirements

Besides the tightness demand discussed above, there are other, common-sense requirements a model $\hat{P}(t)$ for minimum power dissipation shall comply with. Following, some of these axiomatic requirements are stated without claim for completeness. Each of these postulates refers to one of the three input parameters $\Gamma, \operatorname{Pr}\{$.$\} , or T_{P}$. For the sake of clarity, only the relevant input parameter is included in the mathematical formulation of the axioms.

A-1 If the operation to be performed is the constant function, the minimum power dissipation is zero:

$$
\begin{equation*}
\hat{P}(\Gamma(x)=\mathrm{c}) \equiv 0 . \tag{4.15}
\end{equation*}
$$

This postulate is in contradiction with the thermodynamic standpoint of Landauer's principle, see section 4.2.3, because an operation $\Gamma(x)=\mathrm{c}$ discards information and should therefore be dissipative. However, from a CMOS implementation point of view, the realization of a known constant value implies no switching activity. This, by virtue of (2.2), suggests zero
power consumption. The discrepancy between the two viewpoints stems from the fact, that for a practical realization of $\Gamma(x)=\mathrm{c}$ the circumstances of generating the input are disregarded, while in thermodynamics they are part of the system under consideration.

A-2 The minimum power dissipation for the identity mapping is smaller than for any other operation on the same input:

$$
\begin{equation*}
\hat{P}(\Gamma(x)=x) \leq \hat{P}(\Gamma) \tag{4.16}
\end{equation*}
$$

This is because implementing the identity mapping requires only wires for the transmission of data, see section 4.5. Any other operation will need these wires plus appropriate logic to discern between different input symbols.

A-3 The minimum power dissipation for some processing task is not larger than that of its sequential decomposition (chain rule):

$$
\begin{equation*}
\hat{P}(\Gamma) \leq \hat{P}\left(\Gamma_{1}\right)+\hat{P}\left(\Gamma_{2}\right) \quad \text { if } \quad \Gamma(x)=\Gamma_{2}\left(\Gamma_{1}(x)\right) . \tag{4.17}
\end{equation*}
$$

If the most energy-efficient way to perform operation $\Gamma$ is by means of two sequential operations, then equality will hold in (4.17). Otherwise, $\hat{P}(\Gamma)$ must be smaller than the sum, because it shall provide a lower bound. Stated differently, the lower bound model should account for sequential decomposition. Similarly, the lower bound model must be conscious of any other architectural transformation that could be employed for power reduction.

A-4 Uncorrelated input symbols with uniform distribution $\mathcal{U}$ imply the largest minimum power consumption for any data statistic:

$$
\begin{equation*}
\hat{P}(\operatorname{Pr}\{.\}) \leq \hat{P}(\mathcal{U}) \tag{4.18}
\end{equation*}
$$

This intuitively follows from the fact, that random data feature no redundancy that could be employed by clever encoding, i.e. optimization of $\mathcal{B}\{$.$\} , to reduce switching activity or to simplify$ circuitry, see section 4.5.

A-5 Increasing the time allowed to process a certain number of input symbols decreases minimum power dissipation:

$$
\begin{equation*}
\hat{P}\left(T_{P_{1}}\right)<\hat{P}\left(T_{P_{2}}\right) \quad \text { if } \quad T_{P_{1}}>T_{P_{2}} . \tag{4.19}
\end{equation*}
$$

This is justified by the basic relation for energy dissipation $E=$ $C_{L} V_{d d} / 2$, because the relaxed processing speed can be utilized to reduce the supply voltage $V_{d d}$ or to design slower circuits with smaller capacitive load $C_{L}$.
The above axioms shall hold true for the general lower bound problem LBP-4. For the most part, they also apply to the other subproblems, with the exception of A-4. This axiom does not apply if the binary data encoding is fixed, as in LBP-2 and LBP-1, because then redundancy can not be utilized for power reduction. In this case, unfavorable bit-level correlation can imply higher switching activity than a random bit stream, see figure 3.4.

### 4.4.4 Implications

At this point the following can be concluded:

- For the reliable rating of energy efficiency of known implementations, an estimated lower bound must be identical to the true lower bound on power dissipation, see (4.14). This requirement is valid for the most general lower bound problem LBP-4 and for any of the other subproblems alike.
- It is unclear how requirement (4.14) can be satisfied without knowledge of the optimum implementation. Hence, we conjecture that there is no implementation-independent lower bound on power dissipation.
- Since no explicit power dissipation function is known, the solution of the general lower bound problem LBP-4 is unknown.


### 4.5 Lower Bound for Data Transmission

This section investigates the lower bound problem LBP-3, where minimum power dissipation is subject to optimization of binary data representation $\mathcal{B}\{$.$\} , gate-level architecture A_{-\infty}$, and supply voltage $V_{d d}$,
see section 4.3.4. Because data statistics $\operatorname{Pr}\{$.$\} are specified at the$ word-level and binary data representation $\mathcal{B}\{$.$\} can be chosen arbi-$ trarily, LBP-3 may be viewed as the problem of finding the minimum power dissipation for transmitting data over an information channel. This will be elucidated in section 4.5.1. Subsequent sections then will deal with minimum switching activity in the context of data transmission.

### 4.5.1 Power Dissipation Bound

## Deterministic information channel

As mentioned in section 4.3.3, $X$ and $Y$ in $\Gamma: X \rightarrow Y$ are interpreted as abstract sets of symbols, because data statistics $\operatorname{Pr}\{$.$\} are specified$ at the word-level. This suggests an information-theoretic model for LBP-3 as indicated in figure 4.5(a).

An information source sends out symbols at a rate equal to the inverse of the processing time $T_{P}$. The sequence of symbols $\left\{X_{i}\right\}$ has a certain information content, which is specified as entropy $H(X)$, see appendix B. Since $\Gamma$ is ultimately to be implemented with a twolevel sensing technology, the abstract input symbols have to be given a binary representation $\mathcal{B}\{X\}$ by means of encoding. The memoryless mapping $\Gamma$ partitions the domain of input symbols in $M$ disjoint subsets, where $M$ is the number of possible output symbols, see figure $4.5(\mathrm{~b})$. $\Gamma$ may therefore be seen as memoryless information channel, because every input symbol is mapped to a specific output symbol with probability one [Abr63].

Intuitively, for a given number of input symbols, the implementation of $\Gamma$ seems more costly the more domains $M$ must be discerned. For instance, if $M=1$ the output is constant, and according to axiom (4.15) no power needs to be dissipated. However, if $M>1$, the output diversity will also depend on the probability of input symbols. Obviously, the problem is to find an analytical measure that describes the complexity of $\Gamma$ subject to input and output statistics. On one hand, this measure shall relate the three input parameters $\Gamma, \operatorname{Pr}\{$.$\} , and T_{P}$ to the lower bound on power dissipation $\hat{P}$. On the other hand, this measure must be implementation independent, because $\mathcal{B}\{\},. A_{\rightarrow \infty}$, and $V_{d d}$ are subject to optimization.

(a)

(b)

Figure 4.5: Operation $\Gamma$ as deterministic communication channel: (a) block diagram, (b) partition of input domain by memoryless mapping.

## Specification of processing task

One such measure that incorporates the statistics of two sequences is the mutual information $I(X ; Y)$, see appendix B. Mutual information $I(X ; Y)$ can be viewed as the reduction in uncertainty in $X$ due to the knowledge of $Y$. In the context of operation $\Gamma$, this reduction in uncertainty is due to the information transfered from the input $X$ to the output $Y$. The information transfer rate $R$ associated with $\Gamma$ may therefore be defined as in [Sha97]:

$$
\begin{equation*}
R=\frac{I(X ; Y)}{T_{P}} \tag{4.20}
\end{equation*}
$$

Since $\Gamma$ is a memoryless mapping, it follows that the conditional entropy $H(Y \mid X)=0$ and subsequently $I(X ; Y)=H(Y)$, see (B.8).

Hence, any transformation $\Gamma$ requires an information transfer rate of

$$
\begin{equation*}
R=\frac{H(Y)}{T_{P}} \tag{4.21}
\end{equation*}
$$

bits per second [bps]. Since $H(Y)$ depends on the input data statistics and on the operation $\Gamma$, information transfer rate $R$ is a function of all three input parameters and thus specifies a processing task $t=$ ( $\Gamma, \operatorname{Pr}\{\},. T_{P}$ ). However, $R$ is independent of the implementation of the processing task.

## Lower bound function $\hat{P}$

Next, in order to be useful for LBP-3, information transfer rate $R$ has to be associated with power dissipation. One way to do so is to specify the lower bound function $\hat{P}(t)$ depending on $R$. Intuitively,

$$
\begin{equation*}
\hat{P}(t) \propto R=\frac{H(Y)}{T_{P}} \tag{4.22}
\end{equation*}
$$

For a specific problem instance, $\hat{P}(t)$ shall be smaller than, or ideally equal to the minimum power dissipation achievable with optimum $\mathcal{B}\{.\}^{*}, A_{-\infty-\infty}^{*}$, and $V_{d d}^{*}$. However, the problem is how to justify this condition without knowledge of the optimum implementation and the corresponding minimum power consumption. In case of LBP-3 this perplexity may be resolved, because data statistics is specified at the word-level, and the binary data representation $\mathcal{B}\{$.$\} can be chosen$ arbitrarily for any processing task $t$.

## Optimal encoding

By virtue of (4.21), optimal binary encoding $\mathcal{B}\{.\}^{*}$ of input symbols $X$ shall imply the lowest information transfer rate $R$ such that the output symbols $Y$ can be uniquely identified by the decoder (not shown in figure $4.5(\mathrm{a}))$. Therefore, $\mathcal{B}\{.\}^{*}$ should assign equal codewords to input symbols within the same partition of the input domain, see figure 4.5 (b). In this case, no logic to assign appropriate output symbols to different input symbols is required, and the gate-level architecture will be least dissipative. Thus, $A_{\rightarrow \infty}^{*}$ is a set of wires, used to transmit
the required $R \mathrm{bps}$. The optimum supply voltage $V_{d d}^{*}$ in this case depends on the transmission scheme employed.

From these manifestations of the optimal implementation of processing task $t$, the true minimum power dissipation can be deduced.

## Minimum power dissipation

Assume that every input domain is encoded with a fixed number of $r \geq H(Y)$ bits. This limit on the number of bits is imposed by Shannon's first theorem, i.e. the source coding theorem [Abr63]. Then, the minimum power dissipation of a processing task $t$ follows from (2.2) as:

$$
\begin{equation*}
P\left(t, z^{*}\right)=\frac{1}{2} C_{L} \frac{\alpha_{S}^{*}}{T_{P}}\left(V_{d d}^{*}\right)^{2} . \tag{4.23}
\end{equation*}
$$

$C_{L}$ denotes the load capacitance of a transmission wire, and is assumed to be determined by the CMOS technology. $\alpha_{S}^{*}$ is the minimum number of signal transitions per symbol transmitted. The following sections will explore $\alpha_{S}^{*}$ as function of the entropy of the sequence of symbols and the number of bits per symbol $r$. In general, the number of transitions per symbol is related to the switching activity of a binary signal $\alpha_{x}$ as defined in (3.8) by means of

$$
\begin{equation*}
\alpha_{x}=\frac{\alpha_{S}}{r} \tag{4.24}
\end{equation*}
$$

If more than one signal (wire) is used for transmission, (4.24) gives the average switching activity for these signals.

The optimum supply voltage $V_{d d}^{*}$ in (4.23) depends on the time schedule used for the transmission of the $r$ bits. For serial transmission by means of a single binary signal (wire), the optimum operating voltage $V_{d d}^{*}$ corresponds to a propagation delay of the transmission wire of $T_{P} / r$ seconds. For parallel transmission over $r$ wires, $V_{d d}^{*}$ corresponds to a propagation delay of $T_{P}$ seconds. Assuming that minimum admissible supply voltage varies proportionally with the inverse of the propagation delay, parallel transmission ideally permits a diminution of minimum power consumption by a factor of $1 / r^{2}$ compared to serial transmission, simply because power varies quadratically with $V_{d d}$. This, however, is only true within certain limits, because the propagation delay of a transmission line drastically increases as $V_{d d}$ approaches the threshold voltage of the MOS devices involved [CB95].

## Practical applicability

Minimum power dissipation $P\left(t, z^{*}\right)$ as given in (4.23) may be seen as the general solution for LBP-3. However, this solution is mainly of academic interest. Its practical implications are limited, because the power consumption inferred does not include the cost of binary encoding. This is due to the abstract specification of the operation $\Gamma$, which, together with word-level data statistics, give rise to the information-theoretic model (4.21) for the processing task. Such a model in turn permits to move the burden of computation from the processing unit under consideration to the preceding unit. In order to pinpoint the computational burden, means must be provided to bridge the information-theoretic viewpoint and the implementation that ultimately must use two-level sensing. Abstract specification of data statistics at the word-level, without any relation to the binary implementation, is inappropriate for this matter.

Basically there are two ways to go round the problem above: One is to specify data statistics at the bit-level as in LBP-2, see section 4.6. The other way is to abide word-level data statistics and close the gap to binary implementation by assuming a given gate-level architecture. This is the basic concept of the information-theoretic approach to LBP-1, see section 4.7.

### 4.5.2 Switching Activity Bounds

This section provides a general result for the minimum number of transitions per symbol $\alpha_{S}^{*}$ as it appears in the minimum power dissipation function (4.23). Subsequent sections then will deal with two special encoding schemes.

A general encoding scheme not only will consider symbol probabilities, but may also employ temporal correlation between symbols. Therefore, the information content of some sequence $\left\{X_{i}\right\}$ to be coded must be given as entropy rate $\mathcal{H}\left(\left\{X_{i}\right\}\right)$, rather than as entropy $H(X)$, see (B.9). Given the entropy rate, there exist an upper and lower bound on the switching activity of the binary signal that represents the encoded symbols, see theorem 4.1. For brevity, $\mathcal{H}\left(\left\{X_{i}\right\}\right)$ will be written as $\mathcal{H}$. If $\left\{X_{i}\right\}$ is an i.i.d. process, entropy rate $\mathcal{H}$ is equal to the entropy $H(X)$, see (B.10).

## Theorem 4.1 ([RSH99b])

If a discrete random process of entropy rate $\mathcal{H}$ is encoded with $r \geq \mathcal{H}$ bits per symbol on average, the expected number of signal transitions per symbol $\alpha_{S}$ is bounded by

$$
\begin{equation*}
h^{-1}\left(\frac{\mathcal{H}}{r}\right) r \leq \alpha_{S} \leq\left(1-h^{-1}\left(\frac{\mathcal{H}}{r}\right)\right) r \tag{4.25}
\end{equation*}
$$

where $h^{-1}(\cdot) \in[0,1]$ is the inverse of the binary entropy function defined in (B.4) with $h^{-1}(y)=x$ and $x \in[0,0.5]$.

The above result is universal in a sense that it applies to any encoding scheme, parallel and serial transmission of bits, and to any stationary signal source with arbitrary statistics. Figure 4.6 elucidates (4.25) by showing the lower and upper bound on $\alpha_{S}$ for different values of $r$. Both $\alpha_{S}$ and $r$ are depicted in units of $\mathcal{H}$. If the source is coded with $r=\mathcal{H}$ bits, which corresponds to optimal entropy coding, the binary signal representing the codeword will be completely random and uncorrelated, i.e. $\alpha_{x}=\alpha_{S} / r=\alpha_{S} / \mathcal{H}=0.5$. By increasing $r$ one obtains more freedom to choose the code, which can be employed to code the source using fewer and fewer, or, if desired, using more and more signal transitions. However, any valid coding scheme must reside within the unshaded region in figure 4.6.


Figure 4.6: Lower and upper bound on the number of transitions per symbol $\alpha_{S}$ (adapted from [RSH99b]).

## Transition signaling

The proof of theorem 4.1 is given in [RSH99b] and runs along the following line: Given the entropy rate of a binary sequence, the number of ' 1 's in this sequence is bounded by means of the binary entropy function (B.4). Then, interpreting logic ' 1 ' as binary transition and logic ' 0 ' as no transition, the number of transitions is bounded. Such an interpretation is justified because entropy only depends on the probability distribution, but is independent of the actual symbol value.

The explanation of logic values ' 1 ' and ' 0 ' in the codeword as 'transition' and 'no transition' on the corresponding data bus line is called transition signaling. It is a convenient approach to the problem of minimizing the number of transitions in a binary sequence, which then is identical to minimizing the number of ' 1 's in that sequence.

In case of transition signaling, the encoder must generate codewords that indicate a transition by logic ' 1 ', and no transition by logic ' 0 '. This requires extra circuitry at the encoder and decoder side in order to convert transition domain codewords to the level domain of the physical data bus, and vice versa. Figure 4.7 compares the principle of ordinary transmission (level signaling) with transition signaling. In both cases, the physical data lines are identical.


Figure 4.7: Principle of ordinary level signaling (top) and transition signaling (bottom).

### 4.5.3 Variable-Length Coding

One way to approach the bounds in (4.25) is to use variable-length codes by assigning shorter codewords to more probable symbols and using longer ones for less frequent symbols. An example of such a coding approach is the well-known Huffman code [CT91].

## Example:

Assume an i.i.d. process $X$ with five symbols and probability distribution as shown in table 4.2. The entropy of the source is computed as $H(X)=2.122$ bits. When coding the five symbols with the conventional binary code, the number of transitions per symbol follows from (3.10) for uncorrelated signals as $\alpha_{S}($ bin $)=\sum_{i=0}^{2} 2 \mu_{x_{i}}\left(1-\mu_{x_{i}}\right)=$ $0.48+0.48+0.18=1.14$. This is within the theoretical bounds of $0.578 \leq \alpha_{S} \leq 2.422$ computed from (4.25).

Using transition signaling with $r=3$ bits per symbol reduces the activity to $\alpha_{S}($ trans $)=0.7$. This is within $18 \%$ of the theoretical lower bound. The corresponding code is shown in table 4.2 (column four), where '|' stands for 'transition' and '-' for 'no transition'.

Constructing a variable length code and using transition signaling as shown in column five (vlc), yields $\alpha_{S}(v l c)=0.9>\alpha_{S}($ trans $)$. However, since the variable length code on average only uses $r=2.2<3$ bits per symbol, it only misses the theoretical lower bound by $5 \%$, and hence must be considered more efficient.


Table 4.2: Probability distribution, conventional binary code, optimal ( $K=1$ ) transition domain code, and variable-length code for example i.i.d. source with 5 symbols.

Variable codeword lengths as used in the preceding example are opposed to today's predominant synchronous, word-parallel design style. It is always possible to convert a variable-length code to a fixedlength code without increasing switching activity by padding with '-' (no transition). The codewords can then be transmitted over a parallel data bus that has as many wires as the longest codeword requires. However, the effectiveness of this new code in terms of reaching the lower activity bound is much worse, since the many redundant bits are not used efficiently. Therefore, it is of interest to investigate the theoretical bounds on switching activity when fixed-length codes are to be used exclusively.

### 4.5.4 Fixed-Length Coding

In a fixed-length code every codeword consists of $r$ bits when coding each symbol individually. Grouping $K$ symbols into one block and then coding these blocks instead of individual symbols, yields a codeword length of $r K$ bits. As $K$ increases, the bounds in (4.25) may be asymptotically achieved. However, the number of codewords to be stored in the codebook is $|X|^{K}$. Therefore, the required coding hardware grows exponentially with block length $K$. Furthermore, real-time encoding requires extra storage capacity for joint evaluation of $K$ symbols.

For fixed-length coding the activity bounds in (4.25) are related to three parameters: the entropy rate $\mathcal{H}$ of the symbol source, the number of bits per symbol $r$, equivalently to the data bus width, and the coding block length $K$. These relations are discussed next.

## Activity bounds depending on entropy rate

Assume a fixed-length code of $r$ bits to be transmitted over a given physical data link, e.g. an 8 -bit bus. Also, let the block length be fixed, e.g. $K=1$. Then, as $\mathcal{H}$ increases, the difference between lower and upper bound decreases. If $\mathcal{H}=r$, then $\alpha_{S}=r / 2$, and from (4.24) follows the switching activity for each line of the data bus as $\alpha_{x_{i}}=$ $1 / 2(i=1, \ldots, r)$. Conversely, as $\mathcal{H}$ decreases, the gap between lower and upper bound in (4.25) grows, meaning that there exist binary
representations that are highly inefficient from a power consumption point of view.

Consequently, for sources with high redundancy, i.e. highly unequal symbol probabilities and/or memory over many time steps, appropriate encoding promises larger power savings than for lowredundancy sources. This is in accordance with the results of chapter 5 , where the standard 2 's-complement encoding is shown to waste power in case of speech data, which indeed features a high degree of redundancy.

## Activity bounds depending on the number of bits

In practice, the coding block length $K$ will be limited due to systemlevel constraints such as area or power consumption. Since the coding hardware grows exponentially with block length, $K$ in general must be rather small. In fact, for most practical systems only $K \leq 2$ is feasible.

It is possible to compute the number of bits per symbol necessary to minimize $\alpha_{S}$ for a given block length. However, for arbitrary sources and distributions the notation for explicit expressions on this number becomes very cumbersome. The following result for uniformly distributed symbols demonstrates the general reasoning.

## Corollary 4.1

Let $X$ be an i.i.d. source with uniform distribution. The minimum number of bits per symbol $r^{*}$ that minimizes the switching activity for block length $K$ is given by

$$
\begin{equation*}
r^{*}(K)=\left\lceil\frac{|X|^{K}-1}{K}\right\rceil . \tag{4.26}
\end{equation*}
$$

Coding with $r>r^{*}$ bits per symbol can not further reduce $\alpha_{S}$.
Proof: The minimum number of transitions for fixed-length coding is achieved if there is exactly one codeword with zero transition, and all other codewords have one transition. If the codewords are $n$ bits long, there are at most $n$ codewords with one transition. Hence, at most $n+1$ symbols can be coded with the minimum number of transitions.

On the other hand, for block length $K$, there are $|X|^{K}$ symbols to be coded with $n=r K$ bits. Since each of the corresponding $|X|^{K}$ codewords shall cause at most one transition, we have

$$
|X|^{K} \leq n+1=r K+1
$$

Since for fixed-length encoding $r$ must be an integer number, (4.26) follows. Employing $r>r *$ bits does not reduce $\alpha_{S}$ because the codewords must differ in at least one bit position.

## Example:

Assume an i.i.d. source $X$ with five symbols, all being equally likely. From (4.26) one has $r^{*}(K=1)=4, r^{*}(K=2)=12$, and $r^{*}(K=$ $3)=42$. Figure 4.8 shows the minimum switching activity as function of $r$ for block lengths $K=1,2,3$, together with the absolute lower bound any coding scheme could achieve. As can be seen, for a given $K$ it makes no sense to increase $r$ beyond $r^{*}$, since the number of transitions per symbol does not further decrease. For $r=r^{*}$, the deviation from the absolute lower bound is minimum, indicating that for a given block length $K$ coding with $r^{*}(K)$ bits per symbol is most efficient in terms of switching activity.

## Activity bounds depending on block length

In figure 4.8 it can be seen that for a given number $r$ of bits the switching activity $\alpha_{S}$ does not monotonically decrease with increasing block length $K$. However, for $K \rightarrow \infty$ the absolute lower bound is approached. Therefore, the following question is naturally to ask: For some given $r$ and $K$, what is the optimal block length $K^{*}$ such that $K^{*} \leq K$ and $\alpha_{S}$ is minimum.

The derivation of a general explicit expression for $K^{*}$ is impracticable. However, determination of $K^{*}$ via a computer program is rather straightforward. Such a program, that calculates the minimum switching activity achievable for an i.i.d. source with arbitrary statistics for some given $r$ and $K$, has been implemented. The following result provides the general reasoning on which this program is based.


Figure 4.8: Minimum number of transitions per symbol vs. number of bits per symbol for different block length $K(|X|=5$, uniform distribution).

## Corollary 4.2

Let $X$ be an i.i.d. source with uniform distribution. For a given number of $r$ bits per symbol, the minimum number of transitions per symbol $\alpha_{S}^{*}$ as function of block length $K$ satisfies

$$
\begin{equation*}
\frac{\sum_{t=1}^{t_{K}^{*}-1} t\binom{r K}{t}}{K \cdot|X|^{K}}<\alpha_{S}^{*}(K) \leq \frac{\sum_{t=1}^{t_{K}^{*}} t\binom{r K}{t}}{K \cdot|X|^{K}} \tag{4.27}
\end{equation*}
$$

$t_{K}^{*}$ is the maximum number of transitions that must be used to code any block of $K$ symbols, and is uniquely given by

$$
\begin{equation*}
\sum_{t=0}^{t_{K}^{*}-1}\binom{r K}{t}<|X|^{K} \leq \sum_{t=0}^{t_{K}^{*}}\binom{r K}{t} . \tag{4.28}
\end{equation*}
$$

Proof: For an $n$-bit fixed-length code there are exactly $\binom{n}{t}$ codewords with $t$ ' 1 's $(t=1, \ldots, n)$, corresponding to $t$ transitions when using transition signaling. For block length $K$ the codewords
associated to blocks of symbols are $n=r K$ bits long. For minimum switching activity codewords are assigned in increasing order with respect to the number of transitions $t$. Since $|X|^{K}$ codewords are required, (4.28) follows. Furthermore, the $\binom{n}{t}$ codewords with $t$ transitions each, cause together $t\binom{n}{t}$ transitions. Since all codewords are equally likely by proposition, and $\alpha_{S}^{*}$ is the number of transitions per individual symbol, (4.27) follows.

The bounds provided in (4.27) are rather loose. However, the expression for tighter bounds, and the relation of minimum switching activity to block length $K$ for sources with arbitrary statistics become even more entangled. The legal ranges of $\alpha_{S}^{*}(K)$ and $\alpha_{S}^{*}(K+1)$ imposed by (4.27) in general overlap significantly. If $\alpha_{S}^{*}(K)$ comes very close to its lower bound and $\alpha_{S}^{*}(K+1)$ does not, then non-monotonic behavior of minimum switching activity with respect to block length $K$ is observed.

## Example:

Assume again an i.i.d. source $X$ with five symbols, all being equally likely. For this source, figure 4.9 shows the minimum switching activity as function of block length $K$ for $r=3, \ldots, 7$ bits per symbol. As can be seen, for a given $r$ the minimum activity achievable does not always decrease with increasing block length $K$. For instance, let the data bus width be restricted to $r=5$ bits (two redundant bit lines are used). If the current optimal coding employs a block length of $K=3$, then it is useless to increase the block length by one only, since with $K=4$ switching activity can not be reduced further.

### 4.5.5 Implications

As pointed out previously, the solution of LBP-3 provided in section 4.5.1 is of limited practical utility, because it applies to data transmission only. Moreover, the cost of encoding, which is applied in order to achieve minimum power consumption on the transmitting wires, is not included in the analysis. The application of complex coding schemes is therefore confined to high-capacitance data links, e.g. off-chip buses or highly loaded memory interfaces. In this case switching activity savings attained by appropriate encoding can


Figure 4.9: Minimum number of transitions per symbol vs. block length $K$ for different number of bits per symbol $r(|X|=5$, uniform distribution).
justify the coding overhead [ $\left.\mathrm{BMM}^{+} 00\right]$.
From the examination of the theoretical activity bounds (4.25) the following general conclusions regarding data encoding for low power dissipation may be drawn:

- For sources with high redundancy, i.e. highly unequal symbol probabilities and/or memory over many time steps, appropriate encoding promises larger power savings than for low-redundancy sources.
- For a fixed number of bits per symbol $r$, minimum switching activity does not monotonically decrease with ascending block length $K$.
- In practice, one will often be restricted to fixed-length codes and a coding block length of $K \lesssim 2$. In this case there exists a limit on the number of bits per symbol $r$, beyond which no activity reduction can be achieved.


### 4.6 Lower Bound in Boolean Optimization

### 4.6.1 Problem Description

According to figure 4.4, in LBP-2 gate-level architecture $A_{-\infty-}$ and supply voltage $V_{d d}$ are optimization variables, while RT-level architecture $A_{|\in|}$ and bit-level data statistics $\mathcal{B}\{$.$\} are known input parameters. No$ general solution for this problem is known. However, some comments on the problem's disposition can be formulated.

Assuming fixed-length encoded inputs and outputs, LBP-2 corresponds to finding the most efficient combinational circuit that implements a specific Boolean function $\Gamma: \mathcal{B}^{L} \rightarrow \mathcal{B}^{M}$. The condition for maximum delay of the critical path in this circuit is given by the processing time $T_{P}$. In this sense, LBP-2 is reminiscent of the ordinary synthesis problem, where a gate-level netlist is to be constructed from an abstract functional description. The peculiarities of the lower bound problem may be elucidated by opposing LBP-2 and the synthesis problem.

- In ordinary synthesis, the supply voltage $V_{d d}$ is fixed and every gate has a known delay according to this nominal voltage. For LBP-2 $V_{d d}$ is an optimization variable and can not be chosen independently of the gate-level architecture $A_{-\infty}$, because the two are linked by the delay of the critical path. However, even if $V_{d d}$ is treated as given input parameter, the lower bound problem differs from the synthesis problem in the following points.
- Unconstrained synthesis may be performed in two consecutive steps. First, the Boolean function is minimized with respect to some generic representation. Subsequently, this representation is mapped to a set of technology-dependent logic gates. Such successive execution simplifies the synthesis problem, but is inadequate for matters of optimum implementation. In LBP-2 Boolean minimization and technology mapping must be considered jointly with respect to power dissipation, as is done in timing-driven synthesis where power cost is replaced by the delay of the critical path. However, for LBP-2 there is an additional parameter, namely data statistics.

The problem may be simplified if mapping is confined to a small subset of generic gates that form a complete basis for the realization of Boolean functions, e.g. inverter and two-input and and OR. Even so, LBP-2 differs from the synthesis problem in the following respect.

- For synthesis, the gate-equivalent count typically is a fair measure for circuit area. Minimizing this number in general is assumed to also reduce power consumption. For minimum power dissipation in regard of data statistics this conclusion is not justified. If very few input patterns have a very high probability of occurrence, a larger circuit with a high degree of redundancy could be more efficient by means of precomputation $\left[\mathrm{AMD}^{+} 94\right]$. In this case the redundant logic is used to identify and evaluate the most frequent input patterns by a small, energy-efficient subcircuit. The larger, more dissipative main circuit then would only be enabled upon a miss of the subcircuit.

The lower bound problem LBP-2 becomes even more intricate if variable-length codes are permitted for the binary representation $\mathcal{B}\{$.$\} .$ Thus, a universal power dissipation model should include this potential as well, even if it is rarely used in practice because it impedes synchronous word-parallel processing. From a theoretical point of view such coding may indeed seem reasonable as it naturally commences circuit activity only as needed. This has also been noted in [Ger96], where a computational coding theory in generalization of Boolean minimization algorithms was postulated.

### 4.6.2 Related Efforts

Although LBP-2 has, to all appearances, formally not been attacked by the VLSI research community, there is previous work on RTlevel power estimation that relates to the lower bound problem. Information-theoretic concepts have been employed for RT-level power estimation in two ways.

Complexity analysis of Boolean functions. The entropy of the output of some Boolean function, assuming that all input combinations are equally likely, is considered a measure for the
area-complexity of this function. Under this hypothesis an area model which is exponential in the number of inputs was derived in [CA90]. For most practical circuits this model is not very accurate and even for reasonable number of inputs dramatically overestimates the gate count. This model was improved for single-output Boolean functions [NN96a] and multi-output functions [NN99] based on the notion of cube complexity .

Estimation of switching activity. The other use of entropy for power estimation is to predict the average switching activity in a circuit that implements the Boolean function at hand. In [MMP96, NN96b] formulas for the average entropy per net $\bar{H}$ as a function of average input and output entropy were developed. This requires simplifying assumptions on the logic structure of the circuit, as well as on the correlation of signals. Average switching activity in the circuit may then be approximated as

$$
\begin{equation*}
\alpha_{x} \approx \bar{H} / 2 \tag{4.29}
\end{equation*}
$$

This assumes temporally uncorrelated signals, see figure B.1.
Combining the above results, relative power dissipation can be estimated as the product of circuit area and average switching activity.

Direct application of the above approach to LBP-2 is not possible for two reasons. First, for both complexity analysis and activity estimation strong assumptions on input data statistics have to be made. However, the specific input statistics $\operatorname{Pr}\{$.$\} are a crucial input pa-$ rameter, subject to which minimum dissipation should vary. Second, isolated analysis of Boolean function complexity and switching activity might be acceptable for relative power estimation, but it is not when searching for the absolute lower limit of dissipation. In this case the gate-level structure must account for the given signal statistics.

### 4.6.3 Implications

In the absence of an analytical power dissipation model $P\left(\Gamma, \operatorname{Pr}\{\},. T_{P}, A_{\nmid \boldsymbol{|}}, \mathcal{B}\{\},. A_{-\infty}, V_{d d}\right)$ it is not clear how to tackle LBP-2 nor the simplified version with fixed $V_{d d}$. Furthermore, the lower bound function $\hat{P}\left(\Gamma, \operatorname{Pr}\{\},. T_{P}, A_{\nmid \oplus}, \mathcal{B}\{\}.\right)$ again should obey
the tightness requirement (4.14). It is not apparent how such $\hat{P}$ could be formulated without reference to the optimal solution $A_{-\infty \infty}^{*}, V_{d d}^{*}$.

### 4.7 Lower Bound on Supply Voltage

In section 4.5 operation $\Gamma$ has been associated with a deterministic information channel, such as to infer the minimum power dissipation for data transmission. The actual cost of data processing was not included, however. There has also been an attempt to associate operation $\Gamma$ with a noisy information channel [Sha97], in order to include data processing power. The lower bound on dissipation so established refers to one particular implementation of the processing task at hand. That is, $A_{\nmid \mathcal{\prime}}, A_{-\infty}$, and $\mathcal{B}\{$.$\} are given input parameters and V_{d d}$ remains the sole optimization variable, as in LBP-1.

This section first will present the information-theoretic approach from [Sha97] to LBP-1, and then discuss its limitations.

### 4.7.1 Information-theoretic Results

## Noisy communication channel

Let $\Gamma: X \rightarrow Y$ denote a deterministic channel as in section 4.5.1, but this time with noise $N$ imposed on the channel output such that

$$
\begin{equation*}
Y^{\prime}=Y+N=\Gamma(X)+N \tag{4.30}
\end{equation*}
$$

In this case $H\left(Y^{\prime} \mid X\right)>0$ and is a function of the probability distributions of $X$ and $N$. The noise imposed on the channel manifests itself in the fact, that for a specific channel input symbol different output symbols can appear, each of which has its own probability.

The channel capacity $C$ is defined as the maximum mutual information $I\left(X ; Y^{\prime}\right)$ or, equivalently, the minimum conditional entropy $H\left(Y^{\prime} \mid Y\right)$ over all possible distributions of $X$ [Abr63]. For a given noise power spectrum, capacity $C$ characterizes the channel's ability of transferring information and is independent of the input statistics. In [Sha48] it was shown that the capacity of a channel bandlimited to frequency $W$ is

$$
\begin{equation*}
C=\int_{0}^{W} \log [1+\operatorname{SNR}(f)] d f \tag{4.31}
\end{equation*}
$$

where $C$ is in bps. $\operatorname{SNR}(f)$ denotes the signal-to-noise ratio which, for flat signal and noise power spectra, is given by the ratio of signal and noise variance:

$$
\begin{equation*}
\operatorname{SNR}(f)=\frac{\sigma_{X}^{2}}{\sigma_{N}^{2}} \tag{4.32}
\end{equation*}
$$

## Power dissipation bound

The joint source-channel coding theorem [Sha48] states, that an information transfer rate $R$ over a noisy channel is achievable, with the probability of error approaching zero, as long as

$$
\begin{equation*}
R<C . \tag{4.33}
\end{equation*}
$$

Substitution of (4.31) and (4.32) into (4.33) yields the following constraint on signal power for reliable transmission:

$$
\begin{equation*}
\sigma_{X}^{2}>\left(2^{R / W}-1\right) \sigma_{N}^{2} \tag{4.34}
\end{equation*}
$$

From (4.34) immediately follows a lower bound on the power dissipation of the channel.

## Theorem 4.2 ([Sha97])

For a channel bandlimited to $W \mathrm{~Hz}$, with flat signal and noise power spectrum, and a desired information transfer rate of $R \mathrm{bps}$, the lower bound on power dissipation is given by

$$
\begin{equation*}
P_{m i n}>P\left(\left(2^{R / W}-1\right) \sigma_{N}^{2}\right) \tag{4.35}
\end{equation*}
$$

where $P\left(\sigma_{X}^{2}\right)$ is a linear monotonically increasing function relating the signal power $\sigma_{X}^{2}$ to power dissipation, and $\sigma_{N}^{2}$ is the noise power.

At first, the lower bound (4.35) is independent of the implementation technology and does not refer to the problem of minimum power consumption for digital VLSI processing. In [Sha97] an attempt was made to apply theorem 4.2 to CMOS VLSI by making a number of simplifying assumptions, which will be discussed next.

### 4.7.2 Application to VLSI Processing

Figure 4.10(a) shows the information-theoretic model of a processing task realized in a noisy implementation medium. The architecture implementing operation $\Gamma$ is viewed as noisy bandlimited channel for which theorem 4.2 must apply. By identifying all parameters in theorem 4.2 with characteristics from LBP, a relation between (4.35) and minimum dissipation of some processing task can be established based on the assumptions made in [Sha97].

Information transfer rate. Let $R$ be defined as in section 4.5.1, $R=H(Y) / T_{P}$. Since operation $\Gamma$, input data statistics $\operatorname{Pr}\{$.$\} ,$ and processing time $T_{P}$ determine $R$, the processing task under consideration is implicitly included in the power dissipation bound (4.35).

Noise power. A specific implementation of $\Gamma$ is associated with a certain noise power $\sigma_{N}^{2}$. It is assumed that a noise model is known, from which $\sigma_{N}^{2}$ can be deduced for a given gate-level architecture $A_{-\infty}$ [Sha97]. This model shall lump all noise components of the circuit into a single noise source at its output, see figure 4.10(a).

Bandwidth. According to the sampling theorem, a channel bandlimited to $W \mathrm{~Hz}$ can be used a maximum of $2 W$ times per second without interference of the transmit pulses. In [Sha97] it was shown, that this maximum possible transmission rate indeed achieves the lowest power dissipation for a given information transfer rate $R$. Thus, for the delay $T_{d}$ of the critical path of the given implementation $A_{\triangleright \infty}$ must hold

$$
\begin{equation*}
T_{d}=\frac{1}{2 W} . \tag{4.36}
\end{equation*}
$$

This is reminiscent of a dual-edge-triggered clocking strategy for the implementation with $W=f_{c l k}=f_{c i} / 2$, see section 2.1.

Power function. Let the transmit pulse have ideal square-wave shape of amplitude equal to the supply voltage $V_{d d}$ such that

$$
\begin{equation*}
\sigma_{X}^{2}=V_{d d}^{2} / 4 \tag{4.37}
\end{equation*}
$$

Assuming that every channel use induces a signal transition, the power dissipation function $P\left(\sigma_{X}^{2}\right)$ for CMOS follows from (2.2) as

$$
\begin{equation*}
P\left(\sigma_{X}^{2}\right)=4 C_{L} W \sigma_{X}^{2} \tag{4.38}
\end{equation*}
$$

Here $C_{L}$ denotes the average load capacitance associated with the gate-level architecture $A_{\triangleright \infty}$ at hand. (2.2) obeys the requirement of linearity and monotony of $P\left(\sigma_{X}^{2}\right)$ posed in theorem 4.2.

## Relation to LBP-1

The assumed knowledge of $A_{\triangleright \infty}$ for derivation of $\sigma_{N}^{2}$ implies that the binary encoding of input data $\mathcal{B}\{$.$\} is known. Furthermore, the$

(a)

(b)

Figure 4.10: Processing task in a noisy implementation media: (a) block diagram, (b) memoryless mapping $\Gamma$ blurred by noise (adapted from [Sha97]).

RT-level architecture $A_{\mid \in 1}$ is given, since $\Gamma$ is a memoryless mapping. Hence, the only optimization variable left for achieving minimum power consumption is the supply voltage $V_{d d}$, which complies with LBP-1. The lower bound function $\hat{P}(t)$, see section 4.4.1, follows from application of (4.38) to theorem 4.2 as

$$
\begin{equation*}
\hat{P}(t)>4 C_{L} W\left(2^{R / W}-1\right) \sigma_{N}^{2} \tag{4.39}
\end{equation*}
$$

with $t=\left(\Gamma, \operatorname{Pr}\{\},. T_{P}, A_{\notin \mid}, \mathcal{B}\{\},. A_{-\infty}\right)$. The derivation of $\hat{P}(t)$ also provides the true minimum power dissipation $P\left(t, z^{*}\right)$, because the optimal supply voltage $V_{d d}^{*}$ follows directly from substituting (4.37) into (4.34). This fact reinforces the conjecture made in the context of energy efficiency rating, that there is no implementation-independent lower bound $\hat{P}(t)$ which is tight to the true lower bound $P\left(t, z^{*}\right)$.

Figure 4.11 summarizes the information-theoretic approach in the context of LBP-1. In the theory by [Sha97], $V_{d d}^{*}$ corresponds to the minimum signal power that permits unique decoding of symbols from the noisy output $Y^{\prime}$. This is symbolized in figure $4.10(\mathrm{~b})$, where the dashed fields around the true output symbols may not overlap to allow unambiguous decoding. Thus, operation with optimal $V_{d d}^{*}$ changes the channel characteristic to noiseless transmission.

### 4.7.3 Limitations

In order to arrive at the lower bound (4.39) a number of simplifications were introduced by [Sha97]. Critical examination of these simplifications yields the following limitations of the theory.

- The channel capacity $C$, associated with an implementation, depends on the noise power $\sigma_{N}^{2}$. However, there exists no consistent approach to derive $\sigma_{N}^{2}$ for a given gate-level netlist. Such an approach would need to refer any noise to a single source at the output of the circuit, see figure 4.10(a).
- In CMOS circuits, noise is mainly due to signal switching that induces voltage fluctuations (ground and power bounce) by means of series impedance in common power supply lines. Waveform and amplitude of this noise voltage depend on the supply voltage $V_{d d}$ [KAK99]. One deficiency of the informationtheoretic approach is the assumed independence of noise power
$\sigma_{N}^{2}$ and signal power $\sigma_{X}^{2}$, which is equivalent to $V_{d d}$. Furthermore, switching noise is highly periodical, whereas a flat signal power spectrum (white noise) was assumed.
Similarly, the theory neglects the dependence of bandwidth $W$ on supply voltage $V_{d d}$, which in practice are linked via the delay of the critical path, see (4.36).
Taking the dependencies of $\sigma_{N}^{2}$ and $W$ on $V_{d d}$ together yields an entangled relation between channel capacity $C$ and supply voltage $V_{d d}$. Thus, (4.34) is an implicit relation for signal power $\sigma_{X}^{2}$ and the inference of theorem 4.2 is not justified for CMOS implementation.
- With the assumption that every channel use induces a signal transition, i.e. $\alpha=1$, (4.38) actually sets an upper limit on power dissipation as function of signal power, and thus, does not qualify for the application to theorem 4.2. Since no useful lower bound on the number of transitions per channel use exists, average switching activity in the circuit under consideration could act as a makeshift.
- Even if all of the above issues could be resolved, the informationtheoretic approach proposed has limited practical significance because it predicts minimum power dissipation and supply voltage for a given gate-level implementation of some processing task. However, the same information can readily be obtained by means of gate-level power estimation and an ordinary voltagescaling approach.
- The cost of coding is not included in the analysis. Therefore, by means of optimal encoding as discussed in section 4.5.1, the implementation architecture also could be a set of mere wires. In this case LBP-1 is equivalent to LBP-3.

This rather long list of limitations indicates that the solution for LBP-1 can not be derived from straightforward application of Shannon's theory of information, as was attempted in [Sha97]. This fact alone suggests that the general lower bound problem LBP-4 can not be solved this way either. The following section provides a formal reasoning for this conclusion.

Figure 4.11: Information-theoretic approach [Sha97] to minimum power dissipation (dashed boxes) in the framework of LBP-1 (solid boxes). For a list of debatable assumptions made in [Sha97] see section 4.7.3.

### 4.8 Information Theory vs. Low Power VLSI

Similarly to [Sha97], also the work presented in this chapter initially was inspired of an alleged analogy between the information-theoretic transmission bound problem and LBP as formulated in section 4.3.1. However, our point of view has changed in the course of the research performed. This section provides the rationale behind this change in our viewpoint.

### 4.8.1 Digest of Information-theoretic Results

## Source coding

Let the information rate $R$ of some code be defined as the number of bits of information per code symbol, where $n$ code symbols form one codeword. Shannon's first theorem, i.e. the source coding theorem states that a source emitting $H$ bits of information per source symbol can be represented by a code of information rate $R>H$.

## Channel coding

Shannon's second theorem, i.e. the channel coding theorem states that reliable transmission over a channel with capacity $C$ is possible, as long as the information rate $R$ of the code is below the channel capacity: $R<C$. Reliable transmission means that the codewords can be decoded at the receiver side of the channel with an arbitrary small probability of error.

## Joint source-channel coding

The joint source-channel coding theorem then combines the above two theorems and shows that reliable transmission is possible as long as the entropy of the source is below the channel capacity: $H<C$. This means, that a two-stage coding process for separately designing most efficient source and channel codes, is as efficient as considering the two problems together [CT91].

### 4.8.2 Problem Comparison

The joint source-channel coding theorem may also be viewed as answer to the question of the minimum channel capacity $C$ required for reliable transmission of source symbols with entropy $H$. This in fact is exactly the approach of [Sha97], where it was noted that "the design of a digital system is akin to the selection or development of an appropriate communication network topology with sufficient capacity $C$ ".

In the following, we want to make this comparison more explicit. This requires an approach that does not comply with the usual interpretation of information theory [Sha48, Abr63, CT91]. Table 4.3 summarizes the comparison of the transmission bound problem of information theory (IT) and the dissipation bound problem of low-power VLSI (LP).

Given problem instance. In either case, a specific task is to be performed for a given set of parameters $t$.

IT: Symbols emitted by a source with information content $H$ are to be transmitted over a noisy communication channel.
LP: A digital operation $\Gamma$ is to be performed on a set of data items with statistics $\operatorname{Pr}\{$.$\} in a given amount of time T_{P}$.

Minimum cost. Both the transmission and processing task are to be performed with regard to a specific cost measure, which shall be minimized.

IT: The mathematical model for data transmission includes no explicit cost measure. Here, cost is assumed to be the channel capacity which must be provided for reliable transmission ${ }^{8}$.
LP: The processing task shall be performed using the least amount of energy. Since processing time $T_{P}$ is fixed, this is synonymous to minimum power dissipation $P$.

[^14]Optimization variables. The minimum cost requirement shall be met by means of a set of optimization variables.

IT: The required channel capacity can be minimized through appropriate encoding of blocks of source symbols. This includes the choice of codeword length $n$.
LP: Power dissipation is minimized by choosing an appropriate RT-level architecture $A_{\nmid \mathcal{|}}$, binary data representation $\mathcal{B}\{$.$\} ,$ gate-level architecture $A_{\triangleright \infty}$, and supply voltage $V_{d d}$.

Cost function. A certain relation between the cost measure and the other terms involved is assumed in either theory.

IT: Channel capacity $C$ by definition is independent on how the channel is used, i.e. is independent of the data encoding. With optimum encoding, the required capacity $C$ solely depends on the entropy $H$ of the source.

LP: The indispensable power dissipation depends both on the processing task to be performed as well as on the actual implementation of this task.

Lower Bound. The aim is to derive a lower bound on the respective cost measure.

IT: The joint source-channel theorem gives a lower bound on channel capacity $C$ necessary for reliable transmission. The proof employs a random-coding argument and does not provide the minimizing encoding.
LP: The general lower bound on power dissipation, i.e. the solution to LBP-4 is unknown.

### 4.8.3 Implications

In information theory the cost measure (channel capacity) is independent of the optimization variable (encoding) by definition. This is justified because in practice the channel and the encoder/decoder units can be implemented separately. Hence, the lower bound for

| information theory | low-power VLSI |
| :---: | :---: |
| problem instance |  |
| $t=(H)$ | $t=\left(\Gamma, \operatorname{Pr}\{\},. T_{P}\right)$ |
| minimum cost |  |
| $C \rightarrow$ Min | $P \rightarrow$ Min |
| optimization variables |  |
| $z=(\mathcal{B}\{\})$. | $z=\left(A_{\nmid \oplus \mid}, \mathcal{B}\{\},. A_{\bullet \infty}, V_{d d}\right)$ |
| cost function |  |
| $C=C(t)$ | $P=P(t, z)$ |
| lower bound |  |
| $C>H$ | $?$ |

Table 4.3: Comparison of information-theoretic transmission and VLSI power dissipation bound problem.
the cost measure could be derived without knowledge of the optimum encoding.

In low-power VLSI, on the other hand, the cost measure (power) depends on the optimization variable (hardware implementation). This dependency of power dissipation on $A_{|\oplus|}, \mathcal{B}\{\},. A_{-\infty}$, and $V_{d d}$ certainly agrees with the physical reality that we wish to model. Using a simplified cost function $\hat{P}(t)$ for stating the lower bound does of course not decouple power dissipation and implementation in practice. Nevertheless, we want this simplified model to agree with the physical world, and hence are faced with requirement (4.14) for useful rating of energy efficiency:

$$
\hat{P}(t) \equiv P\left(t, z^{*}\right), \quad \forall t
$$

Consequently, we conclude that the power dissipation bound problem is different in nature from the information-theoretic transmission bound problem. Only the special case LBP-3 can be solved in analogy because power dissipation associated with optimum encoding $\mathcal{B}\{.\}^{*}$ is not included in the cost function, see section 4.5.1.

### 4.9 Summary

First, the energy dissipation limits of todays MOS transistor based VLSI technology have been discussed in the light of existing thermodynamic bounds. It was concluded that, from the energy efficiency point of view, CMOS technology in conjunction with constant-voltage charging and draining of capacitances is a suboptimal implementation medium.

Then, a new view of minimum power dissipation in the context of data statistics has been established by systematically combining the various parameters involved in the low-power VLSI design process. This approach for the first time enables a classification of the lower bound problem. Two of the resulting subproblems, i.e. minimum power dissipation for data transmission (LBP-3) and lower bound on supply voltage (LBP-1), can be solved in principle. No solution for the most general lower bound problem LBP-4 and the Boolean optimization problem LBP-2 exist due to the lack of appropriate analytical models for power dissipation.

However, the finding of an analytical power dissipation function $P=P\left(\Gamma, \operatorname{Pr}\{\},. T_{P}, A_{\text {ศf }}, \mathcal{B}\{\},. A_{\triangleright \infty}, V_{d d}\right)$ that universally applies to any processing task, indeed would revolutionize low-power digital design. Depending on which subproblem of LBP this model applies to, low-power design at the corresponding level of abstraction would become an automated process. On the other hand, we have shown that an implementation-independent lower bound must be tight to the true minimum dissipation, because a reliable rating of energy efficiency for known implementations is otherwise impossible. Therefore, we conclude that there is no useful implementation-independent lower bound on power dissipation.

An existing information-theoretic approach to minimum power dissipation [Sha97] has been investigated in detail, and several limitations of this theory have been uncovered. Furthermore, the approach in [Sha97] has been identified to target only the most simple subproblem within the proposed classification of the lower bound problem, i.e. the lower bound on supply voltage (LBP-1).

By means of a formal comparison it has been shown in this thesis that the general lower bound problem on power dissipation is different in nature from the information-theoretic transmission bound problem. While information theory neglects the cost of optimizing the transmission, this is not appropriate for low-power VLSI processing. Only in case of plain data links (LBP-3), the power dissipation bound arises in direct analogy to information theory.

## Seite Leer / Blank leaf

## Chapter 5

## Energy-Efficient Processing of Speech

### 5.1 Introduction

### 5.1.1 Motivation

Traditionally, hardware designers have tried to find economic arithmetic units by minimizing the total gate count. Low power dissipation was assumed to follow. However, from chapter 3 it is clear that minimum power dissipation hinges on switching activity which, in turn, can be minimized by appropriate encoding as was shown in chapter 4. The present chapter now shall combine the hardware designer's perspective with the idea of minimizing power dissipation in consideration of data statistics and switching activity.

Unlike $\left[\mathrm{BMM}^{+} 00\right]$, where data statistics has been employed to reduce the switching activity on data links, this chapter exemplifies data statistics as vehicle for low-power implementation of digital signal processing (DSP) algorithms by means of application-specific design [WKFF00, WKFF01]. Finite impulse response (FIR) filtering of speech data serves as an example. FIR filtering is, for instance, a standard task for the realization of digital hearing instruments [KB98]. Currently, results of this chapter are being employed for the energy-
efficient implementation of a noise-reduction algorithm in a commercial hearing instrument [Sch99]. In general, the results to be presented shall support the designer of application-specific digital signal processors in making decisions on data representation and arithmetic units.

### 5.1.2 Previous Work

A large number of approaches have been suggested to minimize the power consumption of digital filters. Only those related to data statistics will be cited here.

A technique, called approximate processing, has been suggested in [LNC96] and was generalized in $\left[\mathrm{NOC}^{+} 97\right]$. This technique dynamically adapts the filter order to changing input signal characteristics such as to keep the stopband energy of the filtered signal below a predefined threshold. Recently, an audio chip using this technique in decimation and interpolation filters has been reported to achieve power saving between $20 \%$ and $70 \%$ depending on signal statistics [Pan00].

In [SRB97] the differential coefficients method (DCM) has been proposed. The fundamental idea is to trade a long-coefficient multiplication for a short one at the cost of some overhead due to additional storage requirements. The authors examined this tradeoff by means of an analytical energy cost model. The bottom line is that DCM is effective only if the coefficient bit-width is appreciably reduced by differentiation. The above idea has been generalized to the decorrelating transformation (DECOR) in [RSH99a], such as to extend the principle of differential encoding from the filter coefficients to the filter input signal. However, the power savings that resulted from applying DECOR to the filter input were actually due to data word width reduction. In practice, such a reduction often will be impossible since clipping of data samples cannot be tolerated. Moreover, in [RSH99a] the authors employed the analytical energy cost model from [SRB97] to evaluate DECOR, thereby disregarding the effect of differential encoding on data statistics and thus, switching activity. This chapter shall provide a more realistic appraisal of differential encoding for speech filtering.

In [NS99] the authors employed data statistics to optimize the design of a low-power FIR filter bank for digital hearing instruments by means of asynchronous control and datapath logic. The present work rests on the same idea of exploiting the statistical properties of speech for low-power data processing at the example of FIR filtering. However, the present work differs as it provides a comparative study of alternative data encodings under various application parameters.

### 5.1.3 Outline

In section 5.2 the statistical features of speech are reviewed and coding schemes suited for data processing are identified. Section 5.3 presents an experimental study of the relation between word-level statistical properties of speech and bit-level switching activity. In section 5.4 the target application and the associated reference architecture are introduced. Subsequently, implementation details for alternative data encodings are discussed. Finally, the encodings are experimentally evaluated with respect to their energy efficiency in section 5.5.

### 5.2 Speech Features and Coding

### 5.2.1 Statistical Properties of Speech Data

Speech signals are inherently non-stationary with distinct short-term and long-term statistical properties. Short-term properties refer to speech segments of about $20 \ldots 200 \mathrm{~ms}$ duration, while long-term characteristics refer to speech segments that are several orders of magnitude longer [JN84].

## Short-term characteristics of speech

Speech in general comprises two types of signals: quasi-periodic signal components resulting from voiced speech (mostly vowels), and noise-like components resulting from unvoiced speech (fricatives and plosives). Quasi-periodic signals contain energy from about 100 Hz up to 4.5 kHz . Noise-like signals contains frequency components up to about 8 kHz but at a much lower energy level.


Figure 5.1: Long-term averaged probability density function (left) and power spectral density (right) for benchmark speech sequences.

## Long-term characteristics of speech

In the present context, the long-term behavior of speech is of interest as this determines average switching activity and hence average power. This means that sufficiently long speech samples must be used such as to have a typical share of quasi-periodic and noise-like signal components. By averaging all statistical measures over a sufficiently long period of time, speech is approximated as stationary signal.

Typical for the long-term behavior of speech is the so-called Gamma probability density function [JN84], because small signal values around zero are orders of magnitude more likely than values close to the maximum amplitude. Figure 5.1 shows the probability density function for two benchmark speech samples of twenty seconds duration each [SQA]. The sequences represent a female and a male speaker, articulating several sentences at natural speed in English. Despite the similarity of their probability distributions, the sequences have different spectral properties, as can be seen in their power spectrum in figure 5.1. The male voice has higher signal power for frequencies below $1 / 10$ the sampling frequency $f_{S}=16 \mathrm{kHz}$, while the female voice contains considerably more energy at higher frequencies. Nevertheless, correlation of adjacent samples, which is of primary importance for switching activity, differs only marginally for female and male voice ( $\varrho_{x}=0.99$ for the male vs. $\varrho_{x}=0.95$ for the female speaker) .


Figure 5.2: Classification of low bit-rate speech coding techniques. Only time-domain waveform coding is applicable to data processing.

### 5.2.2 Speech Coding Techniques

The term speech coding traditionally refers to employing the statistical properties of speech signals to reduce the required transmission bit rate in communication systems by means of data compression. Such coding techniques may be broadly divided into waveform coding and source coding (vocoders), see figure 5.2. Hybrid codecs combine methods from waveform and source coding to produce the best speech quality at low bit rates [Kon94].

As will be seen, proper encoding of speech can also be useful for low-power implementation of signal processing algorithms. In this case, however, the choice of coding is limited to time-domain waveform coding, because for the implementation of all codecs some timedomain processing is required. Therefore, all codecs potentially benefit from efficient implementation of PCM systems. The following coding schemes will be considered for low-power FIR filtering: linearly quantized PCM, PCM with adaptive (APCM) and logarithmic (log-PCM) quantization, and differential PCM (DPCM).

### 5.3 Activity Analysis for Speech Data

In section 4.5.2 theoretical lower bounds for the average number of transitions per symbol for given word-level data characteristics were discussed. However, minimum switching activity in general implies complex and highly dissipative coding hardware. Furthermore, minimum activity codes are not appropriate for efficient implementation of arithmetic operations. For the word-level activity analysis in this section, only fixed-length lossless encodings suitable for implementation of processing algorithms will be considered. For analysis and evaluation of lossy encodings, a meaningful signal quality measure is indispensable. Such a quality measure will be inferred from the target application in section 5.4.2.

### 5.3.1 Fixed-length Lossless Encodings

Assume a digital discrete-time speech signal

$$
\begin{equation*}
X(k)=\left[X\left(k T_{c i}\right)\right]_{\langle x\rangle}, \quad k=0,1, \ldots \tag{5.1}
\end{equation*}
$$

with $-1<X(k)<1$, where $[.]_{\langle x\rangle}$ denotes quantization to $\langle x\rangle$ bits by rounding to the nearest binary value. The codeword for $X(k)$ is the binary vector $\boldsymbol{x}^{k}=\left(x_{0}^{k}, x_{1}^{k}, \ldots, x_{\langle x\rangle-1}^{k}\right)$. The following four encodings are considered:

PCM with 2's-complement representation (2sC). For this code, the codeword $\boldsymbol{x}^{k}$ is formed such that

$$
\begin{equation*}
X(k)=(-1) \cdot x_{0}^{k}+\sum_{i=1}^{\langle x\rangle-1} x_{i}^{k} \cdot 2^{-i} \tag{5.2}
\end{equation*}
$$

2 sC is the most widely used representation for signed numbers because it facilitates efficient implementation of addition/subtraction. However, for speech signals where small values around zero are orders of magnitude more likely than very large values, the sign-extension of 2 sC greatly inflates switching activity.

PCM with sign-magnitude representation (S\&M). In this case, the codeword $\boldsymbol{x}^{k}$ is formed such that

$$
\begin{equation*}
X(k)=(-1)^{x_{0}^{k}} \cdot \sum_{i=1}^{\langle x\rangle-1} x_{i}^{k} \cdot 2^{-i} \tag{5.3}
\end{equation*}
$$

S\&M representation conforms better to the characteristics of speech, but implies more costly adders/subtracters.

Differential PCM with $\mathbf{2 s C}$ representation (D-2sC). The codeword $\boldsymbol{x}^{k}$ is recursively constructed such that

$$
\begin{equation*}
X(k)-X(k-1)=(-1) \cdot x_{0}^{k}+\sum_{i=1}^{\langle x\rangle-1} x_{i}^{k} \cdot 2^{-i} \tag{5.4}
\end{equation*}
$$

Since the amplitude of speech signals changes slowly over time, the differentiation in (5.4) potentially reduces the dynamic range. For the target application, differential encoding is associated with the aforementioned DECOR transform, to which we will return in section 5.4.4.

Differential PCM with S\&M representation (D-S\&M). In this case, for the codeword $\boldsymbol{x}^{k}$ holds

$$
\begin{equation*}
X(k)-X(k-1)=(-1)^{x_{0}^{k}} \cdot \sum_{i=1}^{\langle x\rangle-1} x_{i}^{k} \cdot 2^{-i} \tag{5.5}
\end{equation*}
$$

### 5.3.2 Signal Parameters

Switching activities $\alpha_{x_{i}}(i=0, \ldots,\langle x\rangle-1)$ and the total number of transitions per symbol

$$
\begin{equation*}
\alpha_{x}=\sum_{i=0}^{\langle x\rangle-1} \alpha_{x_{i}} \tag{5.6}
\end{equation*}
$$

shall be investigated experimentally subject to four signal parameters:

Signal variance $\boldsymbol{\sigma}_{\boldsymbol{X}}^{\mathbf{2}}$ [dB]. This parameter corresponds to perceived loudness and is defined as

$$
\sigma_{X}^{2}[\mathrm{~dB}]=10 \cdot \log _{10}\left(\frac{1}{n} \sum_{k=1}^{n}\left(X(k)-\mu_{X}\right)^{2}\right)
$$

where $\mu_{X}$ is the expected value.
Signal-to-noise ratio SNR [dB]. If $\tilde{X}(k)=X(k)+N(k)$ with $N(k)$ being noise samples, this parameter is defined as

$$
\mathrm{SNR}[\mathrm{~dB}]=10 \cdot \log _{10} \frac{\sigma_{\tilde{X}}^{2}}{\sigma_{N}^{2}} .
$$

Sampling frequency $\boldsymbol{f}_{\boldsymbol{S}}[\mathbf{k H z}]$. According to (5.1) this parameter is defined as

$$
f_{S}=1 / T_{c i}
$$

Quantization resolution $\langle\boldsymbol{x}\rangle$ [bit]. For fixed-length codes, number of bits per symbol, data word width, and quantization resolution $\langle x\rangle$ are synonyms.

The two speech sequences investigated in section 5.2 .1 served as benchmark signals for the experiments. For sake of brevity, only the main results for the male speaker are reproduced here. Complete results are given in [WKF00]. It was found that female voice in general induces slightly higher switching activity due to lower temporal correlation resulting from increased spectral power at higher frequencies. However, the qualitative behavior with respect to the above parameters is the same for male and female speakers.

### 5.3.3 Switching Activity vs. Signal Parameters

## Signal variance

The influence of signal power on switching activity is investigated by fixing the sampling frequency to 16 kHz and the quantization resolution to 16 bits. A sinusoidal signal sampled at its minima and maxima defines the reference of $\sigma_{X}^{2}=0 \mathrm{~dB}$.

Figure 5.3 shows the switching activity profiles for two grossly different loudness levels. In these plots, the left-most bit position denotes the sign bit $x_{0}$. The right-most position is the LSB. The table at the bottom of figure 5.3 reports the absolute and relative number of transitions per symbol, see (5.6). From the results in figure 5.3 the following may be concluded:

- Word-level switching activity $\alpha_{x}$ grows with increasing signal power for all encodings considered.
- Absolute and relative activity savings ${ }^{1}$ of S\&M over 2 sC increase with decreasing signal power, meaning 2 sC being particularly inefficient for low voice.
- Differential PCM with 2 sC considerably amplifies switching activity not only for redundant but also for information bearing bits.
- Differential PCM with S\&M considerably reduces switching activity compared to PCM with S\&M for all information bearing bits. Activity savings of D-S\&M over S\&M therefore grow with signal variance.


## Signal-to-noise ratio

In order to explore different SNR, noise has been added to the speech sequence. $\sigma_{\tilde{X}}^{2}$ of the target signal was kept constant such as to exclude the influence of signal variance on switching activity for this set of experiments.

Figure 5.4 shows the activity profiles for colored noise, which was obtained by passing white noise through a filter with magnitude response similar to the power spectrum of the original speech signal. Results for other noise types including multiple-speaker modeling are given in [WKF00]. From figure 5.4 the following implications follow:

- As signal quality deteriorates, switching activity grows for all considered encodings.

[^15]- Absolute activity savings of S\&M over 2 sC are largely unaffected by the SNR. Relative savings marginally decline.
- Differential encoding becomes less effective with decreasing SNR.


## Sampling frequency

For some completely uncorrelated data source, the number of transitions per symbol $\alpha_{x}$ is constant for any sampling frequency $f_{S}$. Hence, by virtue of (2.2), power dissipation linearly grows with $f_{S}$. However, this does not hold true for speech signals. Figure 5.5 typifies the relation between switching activity and sampling frequency for speech:

- As a consequence of increasing correlation, switching activity $\alpha_{x}$ reduces for higher sampling frequencies and hence, power dissipation, as a function of $f_{S}$, grows with smaller rate than for uncorrelated signals.
- The advantage of S\&M over 2sC diminishes for increasing sampling frequency. This can be explained by a smaller proportion of sign changes taking place, which lessens the adverse effect of sign-extension in 2 sC representation.
- Because of increasing correlation between samples, differential PCM becomes more efficient with raised sampling frequency.


## Quantization resolution

Finally, by definition (5.6), switching activity $\alpha_{x}$ will also be affected by the number of bits per symbol used for encoding. Figure 5.6 depicts the activity profiles and total activities for quantization resolutions of 12 and 24 bits. The experimental data reveal the following:

- Resolutions in excess of about sixteen bits will add random-like high-activity bits to the data word and hence disproportionately increase overall activity.
- Relative activity savings of S\&M and D-S\&M over 2 sC decline as the quantization resolution enhances.





Figure 5.3: Switching activity profiles and total switching activity $\alpha_{x}$ for two different signal power levels $\left(\langle x\rangle=16 \mathrm{bit}, f_{S}=16 \mathrm{kHz}\right)$.


$-2 \mathrm{sC}$
.... S\&M

- D-2sC
$\ldots \infty$ D-S\&M

| $\alpha_{x}$ |  | $[\%]$ |  <br> 5.5 <br> 4.9 |  | 100 | 6.2 | 100 |
| :---: | ---: | ---: | ---: | :---: | :---: | :---: | :---: |
| 6.5 | 89 | 5.6 | 90 |  |  |  |  |
| 4.5 | 118 | 7.5 | 121 |  |  |  |  |

Figure 5.4: Switching activity profiles and total switching activity $\alpha_{x}$ for two different signal-to-noise ratios $\left(\langle x\rangle=16 \mathrm{bit}, \sigma_{X}^{2}=-20 \mathrm{~dB}\right.$, $f_{S}=16 \mathrm{kHz}$ ).


$-2 \mathrm{sC}$
---- S\&M

- D-2sC
$\ldots . . . \mathrm{D}-\mathrm{S} \& \mathrm{M}$

|  |  |  |  |
| :---: | ---: | ---: | ---: |
| $\alpha_{x}$ | $[\%]$ | $\alpha_{x}$ | $[\%]$ |
| 5.6 | 100 | 4.6 | 100 |
| 5.0 | 89 | 4.3 | 93 |
| 6.9 | 123 | 5.4 | 117 |
| 4.7 | 84 | 3.6 | 78 |

Figure 5.5: Switching activity profiles and total switching activity $\alpha_{x}$ for two different sampling frequencies $\left(\langle x\rangle=16 \mathrm{bit}, \sigma_{X}^{2}=-20 \mathrm{~dB}\right)$.




| $\alpha_{x}$ | $[\%]$ | [\%] position 2 |  |  |
| :---: | ---: | ---: | ---: | :---: |
| 3.3 | 100 | $\alpha_{x}$ | 100 |  |
| 2.8 | 85 | 8.1 | 94 |  |
| 4.1 | 124 | 10.1 | 111 |  |
| 2.4 | 73 | 8.1 | 89 |  |

Figure 5.6: Switching activity profiles and total switching activity $\alpha_{x}$ for two different quantization resolutions ( $\left.\sigma_{x}^{2}=-20 \mathrm{~dB}, f_{S}=16 \mathrm{kHz}\right)$.

### 5.4 FIR Filtering of Speech Data

### 5.4.1 Target Application

The FIR filter algorithm is a prime candidate for data encoding, because every data item is used multiple times in a generic operation, i.e. multiply-accumulate (MAC). It is assumed, that an $N$-tap FIR filter operation

$$
\begin{equation*}
Y(k)=\sum_{n=0}^{N-1} c_{n} X(k-n) \tag{5.7}
\end{equation*}
$$

is to be performed on an application-specific signal processor with fixed-point arithmetic. Again, $\langle x\rangle$ is used to denote the number of bits in the binary representation of signal $X$. The filter coefficients $c_{n}$ are freely programmable but invariant during the period of operation.

Figure 5.7 shows the model employed for evaluating different data codes. Power analysis is carried out for encoder/decoder, functional data memory, and MAC unit. Since the focus is on exploiting data statistics, the coefficient memory is not included in the power analysis, although different data codes may infer specific coefficient codings. For the same reason any potential symmetry in the coefficient vector is not exploited.

### 5.4.2 Reference Architecture with 2sC Data

The reference architecture operates with data in linear PCM and 2 sC representation. Neither encoding nor decoding is necessary in this case. Thus, in the evaluation model in figure 5.7, power is dissipated only by the data memory and the MAC unit.

In [BG99] it was found that for a fully time-shared architecture the direct form (DF) filter is more energy-efficient than the transposed form due to increased state-storage requirements in the latter. Furthermore, for a time-shared architecture the transposed form poses problems with the coefficient update in adaptive filtering. Hence, the DF filter structure has been selected for the present investigations.


Figure 5.7: Block diagram of evaluation model for data coding.

## Data memory

Assume the data memory is to be built with standard-cell flip-flops. There are basically three different schemes to realize the necessary data storage and access for DF FIR filtering in a fully time-shared architecture:
(1) circular shift register with fixed read/write positions,
(2) linear shift register with fixed write position and random read access,
(3) register file with random read/write access.

Obviously, architecture (1) is the most area-efficient configuration since it works without address generation overhead. However, with regard to power consumption, it is the least desirable version since the shift register must be clocked at $N$ times the sampling frequency. We experimentally found, see appendix D , that configurations (2) and (3) are comparable in terms of area and power consumption for a wide range of filter orders. However, architecture (3) with direct read/write addressing permits to gate the clock signal. With this feature, architecture (3) outperformed all other structures in view of power dissipation. Hence, a structure as shown in figure 5.8 has been selected for the data memory of the reference architecture. All alternative architectures shall use the same memory structure but possibly differ in data word width $\langle x\rangle$.


Figure 5.8: Reference architecture of the data memory.


Figure 5.9: Reference architecture of the MAC unit.

## MAC unit

The MAC unit performs standard 2 sC multiplication by means of a parallel multiplier. As shown in figure 5.9 the coefficient word width is presumed equal to the data word width, resulting in a product width of $2\langle x\rangle-1$ bits $^{2}$. The product is truncated such as to retain $\langle a\rangle$ most-significant bits for the accumulator $A$ with $2\langle x\rangle-1 \geq\langle a\rangle \geq\langle x\rangle$. Finally, the filter output signal $Y(k)$ is truncated to the input data word width of $\langle x\rangle$ bits.

Theoretically, for an arbitrary $N$-tap filter $\left\lceil\log _{2}(N)\right\rceil$ guard bits have to be provided for the accumulator in order to avoid arithmetic overflow. However, given filter coefficients which correspond to practical impulse responses and the particular input data characteristic at hand, it is extremely unlikely that the full theoretical range will be exercised. In fact, for all benchmark signals from section 5.3 and over a wide range of filter specifications the accumulator can do without any guard bit. Therefore, zero guard bits were used for the reference as well as for any alternative architecture.

Formally, the filter operation performed by the reference implementation (PCM-2sC) can be summarized as

$$
\begin{align*}
X^{\prime}(k) & =X(k) \\
A(n+1) & =\left\lfloor c_{n} \cdot X^{\prime}(k-n)\right\rfloor_{\langle a\rangle}+A(n) \\
Y^{\prime}(k) & =\lfloor A(N)\rfloor_{\langle x\rangle}  \tag{5.8}\\
Y(k) & =Y^{\prime}(k)
\end{align*}
$$

( $n=0, \ldots, N-1 ; a(0)=0$ ), where $\lfloor x\rfloor_{w}$ denotes the truncation of signal $x$ in 2 sC representation to $w$ bits (rounding towards $-\infty$ ).

## Quality measure

The accumulator width $\langle a\rangle$ of the reference architecture determines the target quality of the filtered signal, which is assessed as the signal-to-quantization-noise ratio

[^16]\[

$$
\begin{equation*}
\operatorname{SQNR}[\mathrm{dB}]=10 \cdot \log _{10} \frac{\frac{1}{n} \sum_{k=1}^{n}\left(\bar{Y}_{r}[k]\right)^{2}}{\frac{1}{n} \sum_{k=1}^{n}\left(\bar{Y}_{r}[k]-\bar{Y}[k]\right)^{2}} \tag{5.9}
\end{equation*}
$$

\]

where $\bar{Y}$ and $\bar{Y}_{r}$ denote the mean-compensated output signal associated with full and reduced precision accumulation, respectively. SQNR is the previously mentioned quality measure that permits meaningful evaluation of lossy data encodings. For $\langle a\rangle<2\langle x\rangle-1$ such codings in principle allow to shift part of the approximation from within the MAC unit to the preceding encoder unit, which potentially simplifies multiplication.

### 5.4.3 Sign-Magnitude Representation

As was seen in section 5.3 , S\&M representation substantially reduces the average switching activity of binary encoded speech data streams compared to 2 sC . $\mathrm{S} \& \mathrm{M}$ offers the advantage of easy implementation of many arithmetic operations. Particularly multiplication of S\&M numbers is straightforward, since it simply requires an unsigned multiplication of magnitudes and xOR-ing the input sign bits. Unfortunately, the most basic arithmetic operation, i.e. addition of two numbers, is considerably more complicated with S\&M than with 2 sC . It is this fact that has prevented the common use of S\&M representation for DSP implementations.

However, since multiplication is more costly than addition, S\&M is an attractive alternative to 2 sC for DSP applications, even when the statistical properties of the application data are left aside. Two alternative S\&M MAC units based on S\&M and 2sC addition have been investigated. In either case the same unsigned multiplier is used.

## S\&M accumulation

Addition of two S\&M numbers can be implemented by a 1'scomplement adder with end-around carry signal [Hwa79]. This requires conditional inversion of one input operand as well as of the sum output, see figure 5.11. For the application at hand this has the
advantage of operating the accumulator with the preferred S\&M representation. Furthermore, when the S\&M product is truncated prior to accumulation, the error introduced will in general be lower than in the reference implementation. This is because truncation of 2 sC numbers corresponds to rounding towards minus infinity, see figure 5.10, which will cause the error to build up during accumulation of individual filter tap results. On the other hand, pruning S\&M numbers complies with rounding towards zero and the error tends to average out over several filter taps.


Figure 5.10: Truncation for 2 sC and $\mathrm{S} \& \mathrm{M}$ numbers.

The above benefits of S\&M accumulation come at the expense of increased activity in the adder circuitry due to the end-around carry signal and pre- and post-conversion overhead. Besides, although a 1's-complement adder with end-around carry is algorithmically stable, oscillation may occur in the hardware implementation due to skewed input signals. To eliminate oscillation, the end-around carry signal has been gated with a clock-derived enable signal as shown in figure 5.11.

The filter operation with S\&M accumulation can be formally described as follows:

$$
\begin{align*}
X^{\prime}(k) & =\{X(k)\}_{S \& M} \\
A(n+1) & =\left[c_{n} \cdot X^{\prime}(k-n)\right]_{\langle a\rangle}+A(n) \\
Y^{\prime}(k) & =[A(N)]_{\langle x\rangle}  \tag{5.10}\\
Y(k) & =\left\{Y^{\prime}(k)\right\}_{2 s C}
\end{align*}
$$

where $[X]_{w}$ denotes the truncation to $w$ bits of signal $X$ in S\&M representation.


Figure 5.11: S\&M addition with gated end-around carry signal.

## 2sC accumulation

To facilitate a more simple addition, the S\&M product may be converted to 2 sC . Precise conversion would imply the circuitry shown in figure $5.12(\mathrm{a})$, where the extra row of $2\langle x\rangle-2$ half-adder cells is undesirable for energy efficiency reasons. Since truncation is performed post conversion, the biased 2 sC truncation error is associated with this architecture. Approximate conversion from S\&M to 2 sC as shown in figure 5.12 (b) is not only less costly, but at the same time accomplishes the unbiased rounding of $\mathrm{S} \& \mathrm{M}$ truncation. In this case, the sign-bit of the product is fed as carry input to the accumulating adder, which implies almost no overhead.

With the conversion as shown in figure 5.12(b), the formal filter description for 2 sC accumulation is

$$
\begin{align*}
X^{\prime}(k) & =\{X(k)\}_{S \& M} \\
A(n+1) & =\left\{\left[c_{n} \cdot X^{\prime}(k-n)\right]_{\langle a\rangle}\right\}_{2 s C}+A(n)  \tag{5.11}\\
Y^{\prime}(k) & =\lfloor A(N)\rfloor_{\langle x\rangle} \\
Y(k) & =Y^{\prime}(k) .
\end{align*}
$$

Note that in the occurrence of 2 sC accumulation, no explicit decoder unit is required.

(a)

(b)

Figure 5.12: Conversion of the $S \& M$ product for $2 s C$ accumulation with (a) truncation post-conversion and (b) truncation pre-conversion.

### 5.4.4 Differential Encoding

Differential encoding employs the high correlation between consecutive samples to remove short-term redundancy in the speech signal. DPCM corresponds to the decorrelating transform DECOR [RSH99a], when applied to the input data stream. In DECOR the filter transfer function is multiplied and divided by the same polynomial.

$$
\begin{equation*}
Y(z)=X(z) \cdot\left(1+\alpha z^{-\beta}\right)^{m} \cdot H(z) \cdot \frac{1}{\left(1+\alpha z^{-\beta}\right)^{m}} \tag{5.12}
\end{equation*}
$$

Since correlation is highest between adjacent speech samples the obvious choice for $\alpha$ and $\beta$ in (5.12) is $\alpha=-1$ and $\beta=1$. Parameter $m$ determines the number of recursive difference operations that are applied to the input data stream. Figure 5.13 depicts the system architecture for $\alpha=-1, \beta=1, m=1$.

With this choice of parameters, en- and decoder units are particularly simple in that they require no multiplication. However, a


Figure 5.13: DECOR transform for $\alpha=-1, \beta=m=1$.
key obstacle in a practical realization of this transform is the infiniteimpulse response (IIR) operation $1 /\left(1-z^{-1}\right)$ to be performed in the decoder. The pole on the unit circle is only canceled by the preceding feed-forward section if finite-precision effects are avoided. This has two important consequences which will be discussed next.

## Error propagation

Most of the time the difference between adjacent speech samples will have a smaller dynamic range than the original signal. Thus, in [RSH99a], power savings were based on word width reduction by allowing for deterioration of the SQNR. However, the decoding operation $1 /\left(1-z^{-1}\right)$ will propagate the smallest error in any of the filter output samples to all subsequent samples. Thus, there is a risk of not achieving the target SQNR of the reference implementation. To exclude error propagation, the differential input is represented with $\langle x\rangle+1$ bits (for $m=1$ ). Although this additional bit was found to be redundant for all speech signals investigated in section 5.3 , it seems to be inevitable from a practical point of view, whenever the input sequence is not a priori known.

Hence, due to error propagation, the IIR operation of the decoder restricts the choice of word width for the differential input.

## Error accumulation

Assume a certain number of bits being truncated from the multiplier output prior to accumulation and let

$$
\begin{equation*}
\epsilon_{\Delta}(k)=\Delta Y_{r}(k)-\Delta Y(k) \tag{5.13}
\end{equation*}
$$

be the corresponding error associated with the differential filter output. Then, by recursively applying $Y(k)=Y(k-1)+\Delta Y(k)$ and
using (5.13), the error in the decoded output signal becomes

$$
\begin{align*}
\epsilon(k) & =Y_{r}(k)-Y(k) \\
& =Y_{r}(0)+\sum_{i=1}^{k} \Delta Y_{r}(i)-Y(0)-\sum_{i=1}^{k} \Delta Y(i) \\
& =\epsilon(0)+\sum_{i=1}^{k} \epsilon_{\Delta}(i) . \tag{5.14}
\end{align*}
$$

From (5.14) it is clear, that any non-zero mean error induced during the filter operation will accumulate in the IIR decoder and consequently spoil the SQNR. To avoid this, the authors in [RSH99a] suggested to perform true rounding instead of truncation for reducing the product word width. However, rounding only accomplishes a zero mean error if the original signal itself has zero mean, see figure 5.10. This, in general, can not be justified. And even if assumed for the filter input signal, a DC bias may result from multiplication with finite-precision filter coefficients.

Thus, the only way to exclude error accumulation in practice, is to refrain from word size reduction in the MAC unit.

## Practical implementation of DECOR

The preceding discussion suggests that the statistical properties of speech cannot be utilized for low-power computation by differential encoding when targeting synchronous bit-parallel processing. However, as shown in section 5.3, differential encoding reduces average switching activity if used in conjunction with S\&M representation even for the nominal word width $\left\langle x^{\prime}\right\rangle=\langle x\rangle+1$. Therefore, the following differential filter implementation has been chosen:

$$
\begin{align*}
X^{\prime}(k) & =\{X(k)-X(k-1)\}_{S \& M} \\
A(n+1) & =c_{n} \cdot X^{\prime}(k-n)+A(n) \\
Y^{\prime}(k) & =\{A(N)\}_{2 s C}+Y^{\prime}(k-1)  \tag{5.15}\\
Y(k) & =\left\{Y^{\prime}(k)\right\rfloor_{\langle x\rangle} .
\end{align*}
$$

To avoid error accumulation, truncation is only performed at the decoder output following the IIR operation.

With this implementation, the experimental evaluation in section 5.5 exclusively marks the low-power potential of DPCM in view of switching activity reduction, permitting a more realistic assessment than in [RSH99a], where power savings were bound to word size diminution.

### 5.4.5 Adaptive Encoding

Adaptive PCM (APCM) for low-rate communication models speech as non-stationary signal to accommodate the dynamic range with a fewer number of bits. For the target application, this translates into a reduction in the data path width.

## Adaptation scheme

Of the various adaptation schemes proposed in the communication framework, feed-forward instantaneous gain adaption best suites the application at hand. Instantaneous adaption means that each individual data sample is approximated as

$$
\begin{equation*}
X(k) \simeq 2^{-G(k)} \cdot M(k) \tag{5.16}
\end{equation*}
$$

Feed-forward adaption is akin to sending the gain exponent $G(k)$ together with the mantissa $M(k)$ through the data path in order to control the computation. The data path thus must be capable of floating point arithmetic.

The mantissa word width $\langle m\rangle$ is determined by SQNR requirements and the word width of the gain exponent is given as

$$
\begin{equation*}
\langle g\rangle=\left\lceil\log _{2}(\langle x\rangle-\langle m\rangle+1)\right\rceil \tag{5.17}
\end{equation*}
$$

if original data samples were represented using $\langle x\rangle$ bits. Thus, approximation (5.16) provides a reduction in the overall data word width, i.e.

$$
\begin{equation*}
\langle m\rangle+\langle g\rangle<\langle x\rangle \tag{5.18}
\end{equation*}
$$

as long as $0<\langle m\rangle<\langle x\rangle-2$.
Figure 5.14 illustrates the formation of the mantissa codeword $\boldsymbol{m}^{k}$ and gain exponent codeword $\boldsymbol{g}^{k}$ from the original 2 sC codeword for


Figure 5.14: Formation of mantissa codeword $\boldsymbol{m}^{k}$ and gain exponent codeword $\boldsymbol{g}^{k}$ from the 2 sC codeword $\boldsymbol{x}^{k}$ in APCM.
$\langle x\rangle=16$ and $\langle m\rangle=10$. The neglected bits at the left-hand side of the mantissa window correspond to redundant information, while truncating bits at the right-hand side of the window will introduce noise. Adaptation of S\&M data can be performed accordingly.

## Implementation of target application

Based on (5.16) the following implementation has been chosen for evaluating the energy efficiency of APCM:

$$
\begin{align*}
X^{\prime}(k) & =\left(\{M(k)\}_{S \& M}, G(k)\right) \\
A(n+1) & =\left\{\left[c_{n} \cdot M(k-n) \ll G(k-n)\right]_{\langle a\rangle}\right\}_{2 s C}+A(n) \\
Y^{\prime}(k) & =\lfloor A(N)\rfloor_{\langle x\rangle} \\
Y(k) & =Y^{\prime}(k) \tag{5.19}
\end{align*}
$$

The MAC unit utilizes floating point arithmetic rather than fixed point arithmetic as in the reference implementation. Using gain factor $G(k)$, the product is rescaled to the accumulator word width, which is governed by the target SQNR. With this scheme, potential power savings of APCM are based on appropriate rounding of individual data samples prior to multiplication.

Alternative implementations of APCM are possible. First, a unique scale factor may be used for $N$ consecutive input samples. This scale
factor is updated at the beginning of each filter cycle. This scheme corresponds to block floating point arithmetic [Opp70, RB97] and requires storage of the original data samples in the functional memory, but has the advantage that rescaling has to be performed only for filter output values. The effectiveness of this method strongly depends on the filter order and is especially appropriate when non-overlapping blocks of input data are to be processed.

Second, in addition to data samples, filter coefficients may be adaptively scaled if so permitted by the impulse response of the filter at hand. Since the focus here is on employing the data statistics for lowpower computation independently of specific filter characteristics the above two alternatives are not further considered.

### 5.4.6 Logarithmic Encoding

As a standard technique for speech compression, logarithmic quantization employs the long-term statistical properties of speech to reduce the transmission bit rate. For data processing, the main attraction of logarithmic encoding is the substitution of costly multiplication in the linear domain by simple addition in the logarithmic domain. On the other hand, addition in the logarithmic domain calls for lookup tables (LUT) which grow exponentially with increasing precision [SA75, Lew90].

With the approximation technique explained below, LUTs for the conversion between logarithmic and linear domains are smaller than those required for addition. Since in the target application the final filter output must be in linear PCM format, anti-logarithm is performed within the MAC unit prior to accumulation.

## Approximation of logarithm

To convert between logarithmic and linear domains one could either use direct table-lookup or some approximation technique. Depending on the target SQNR, the data word width may be significantly reduced if direct table-lookup is employed, i.e. $\left\langle x^{\prime}\right\rangle\langle\langle x\rangle$. However, the conversion overhead quickly becomes unacceptable even for mediumscale $\langle x\rangle$.

On the other hand, the linear approximation $\log _{2}(1+x) \approx$ $x(0 \leq x \leq 1)$ proposed in [Mit62] permits efficient implementation, but is too imprecise to attain any practical SQNR for the target application. The piecewise linear approximation strategy from [CZV65] improves conversion accuracy but involves multiplications. The method proposed in [SBG99] enhances precision by means of a corrective addition. This approach will be modified, such that the conversion accuracy can be adjusted to meet the SQNR constraint.

Let $X \geq 1$, then the logarithm can be written as ${ }^{3}$

$$
\begin{align*}
\log (X) & =\log \left(2^{I}+F 2^{I}\right) \\
& =I+\log (1+F) \\
& =I+F+C(F) \tag{5.20}
\end{align*}
$$

Integer part $I$ of the logarithm is formed by means of a simple leadingone detector and a small LUT. The fractional part $\log (1+F)$ of the logarithm is formed by adding a correction term $C(F)$ to $F$, the linear approximation of $\log (1+F)$. There holds

$$
\begin{equation*}
C(F)=\log (1+F)-F>0, \quad(0<F<1) \tag{5.21}
\end{equation*}
$$

and

$$
\frac{\mathrm{d} C(F)}{\mathrm{d} F}=\left\{\begin{array}{lll}
>0 & & 0 \leq F<F^{*}  \tag{5.22}\\
=0 & \text { for } & F=F^{*} \\
<0 & & F^{*}<F \leq 1
\end{array}\right.
$$

$F^{*} \approx 0.4427$, and $C\left(F^{*}\right) \approx 0.0861$ being the maximum of the correction term, see figure 5.15.

In practice, the correction term $C(F)$ must be realized with finite precision, and (5.20) becomes an approximation. The monotonic behavior of $C(F)$ can be employed for implementing $C(F)$ as LUT whose number of entries may be minimized in regard of the target SQNR by appropriate nonuniform quantization of $F$. Next to the number of entries, the number of bits to represent $C(F)$ is the second parameter for the optimization of this LUT. Figure 5.16 shows the principle and a numerical example for conversion from the linear to

[^17]

Figure 5.15: Linear approximation $\log (1+F) \approx F$ for $0 \leq F \leq 1$ with maximum error $C\left(F^{*}\right)$.
the logarithmic domain.
A similar approach can be used for the implementation of antilogarithm by employing the approximation

$$
\begin{equation*}
F \approx \log (1+F-C(1-F)) \quad(0 \leq F \leq 1) \tag{5.23}
\end{equation*}
$$

With this, the anti-logarithm for some given integer part $I$ and fractional part $F$ becomes

$$
\begin{align*}
X & =2^{I+F} \\
& \approx 2^{I} 2^{\log (1+F-C(1-F))} \\
& \approx 2^{I}(1+F-C(1-F)) . \tag{5.24}
\end{align*}
$$

This is easily implemented by putting a ' 1 ' in the appropriate position according to $I$ and appending the corrected fractional part. The symmetry in the correction terms for logarithm and anti-logarithm is an interesting feature, which could be employed such as to use only one common LUT. In the present case this was not done because of the required modification of the overall filter architecture compared to the reference implementation.


Figure 5.16: Architecture for logarithmic encoding with linear approximation and additive correction (left). Numerical example $\log _{2}(235)=0.7877$ for LUT with five entries and three bits resolution for the correction terms (right).

## Implementation of target application

With (5.20) and (5.24) the FIR filter based on logarithmic encoding (log-PCM) can hence be given as

$$
\begin{align*}
X^{\prime}(k) & =I_{X}(k)+F_{X}(k)+C\left(F_{X}(k)\right) \\
P(n) & =\log \left(\left|c_{n}\right|\right)+X^{\prime}(k-n) \\
A(n+1) & =\left\{S _ { X } ( k - n ) S _ { c _ { n } } \left[2 ^ { I _ { P } ( n ) } \left(1+F_{P}(n)-\right.\right.\right. \\
& \left.\left.\left.C\left(1-F_{P}(n)\right)\right)\right]_{\langle a\rangle-1}\right\}_{2 s C}+A(n)  \tag{5.25}\\
Y^{\prime}(k) & =\lfloor A(N)\rfloor_{\langle x\rangle} \\
Y(k) & =Y^{\prime}(k)
\end{align*}
$$

with $I_{X}$ and $F_{X}$ being integer and fractional part of $X$ in the logarithmic domain as above, and $S_{X}$ denoting the sign of $X$.

### 5.5 Evaluation of Coding Schemes

### 5.5.1 Application Parameters

In order to cover a wide range of typical situations in speech filtering, the different codings have been evaluated at extreme points of the following parameters:

A/D-conversion bit rate $I[\mathrm{bit} / \mathrm{s}]$. This parameter comprises sampling frequency $f_{S}$ and quantization resolution $\langle x\rangle$, that were dealt with separately in section 5.3 in order to reveal their influence on switching activity. The two considered extreme points are $I_{L}=96 \mathrm{kbit} / \mathrm{s}$ and $I_{H}=705.6 \mathrm{kbit} / \mathrm{s}$, corresponding to telephony (toll) and CD-quality (see table 5.1).

Signal quality. This embraces the remaining free parameters from section 5.3, i.e. signal power $\sigma_{x}^{2}$ and signal-to-noise ratio SNR. Based on the results from section 5.3, the following signals were chosen: male speaker with high SNR to represent high signal quality [SQA], and faint colored noise to emulate low signal quality, see table 5.1. The colored noise was formed by passing Gaussian white noise through a filter with a magnitude response that matches the long-term power spectral density of speech.

Processing accuracy. For the target application processing accuracy is associated with the internal accumulator word width and rounding scheme used for summation over all product terms, and is assessed by the SQNR as defined in (5.9). The two extreme points to be investigated coincide with full accumulator width and accumulator width equal to input data width in the reference architecture. The corresponding SQNR of the high quality signal sets the target on processing accuracy for any alternative coding scheme.

### 5.5.2 Power Estimation Model

Since encoding for data processing has a joint impact on switching activity, circuit structure, and type of arithmetic units, analytical power cost models seem inappropriate in this case. Particularly

| Parameter | Logo | Specification |
| :---: | :---: | :---: |
| Bit rate | $I_{L}$ | $\langle x\rangle=12 \mathrm{bit}, f_{S}=8 \mathrm{kHz}$ |
|  | $I_{H}$ | $\langle x\rangle=16 \mathrm{bit}, f_{S}=44.1 \mathrm{kHz}$ |
| Signal <br> quality | $Q_{L}$ | $\sigma_{x}^{2}=-50 \mathrm{~dB}$, SNR $=0 \mathrm{~dB}$ |
|  | $Q_{H}$ | $\sigma_{x}^{2}=-20 \mathrm{~dB}, \mathrm{SNR}=40 \mathrm{~dB}$ |
| Processing <br> accuracy | $A_{L}$ | $\langle a\rangle=\langle x\rangle$ |
|  | $A_{H}$ | $\langle a\rangle=2\langle x\rangle-1$ |

Table 5.1: Specification of application parameters.
simple counting of basic operations as being performed in [SRB97] and [RSH99a] is hard to justify. Instead, a simulation approach to power estimation has been taken here. Although it does not guarantee exact results in absolute terms, it is considered sufficient for relative comparisons between alternatives.

For each encoding scheme corresponding gate-level circuits were designed for encoder/decoder, MAC unit and data memory. All units were described in VHDL and then synthesized and mapped to a $0.25 \mu \mathrm{~m}$ CMOS standard cell library. Synthesis and technology mapping were performed using Synopsys DesignCompiler with optimization directed towards minimum area implementation. Consistent lowlevel architectures were selected for adders (ripple-carry) and multipliers (signed or unsigned array multiplication with carry-save addition of partial products) to reduce the influence of gate-level implementation details.

Subsequently, the gate-level netlist of each unit has been exercised with corresponding stimuli. Activities resulting from spurious transitions have been captured by utilizing gate-delay models provided by the technology library. The functionality of each circuit has been verified by comparing gate-level responses with expected responses as obtained from bit-true behavioral models written in Matlab.

The node toggle information so acquired was annotated on the netlist employing Synopsys DesignPower. Finally, average power consumption has been computed using estimated capacitive loads.

Note that the terms power and energy dissipation are synonyms in the present case of real-time operation and fixed data rate.

### 5.5.3 Experimental Results and Discussion

## Experimental results

A $N=17$-tap low-pass filter with cutoff frequency of $0.65 \cdot f_{S} / 2$ was applied to the benchmark signals with statistics as given in table 5.1. The high-amplitude signals $Q_{H}$ had a duration of eight seconds, i.e. $64^{\prime} 000\left(352^{\prime} 800\right)$ samples at 8 (44.1) kHz , and were used for SQNR computations. For the noise signals $Q_{L} 10$ ' 000 samples were simulated. The above choice of $N$ and number of data samples is a compromise between sufficient insensitivity of the results on the particular filter/data characteristic and simulation run time.

Table 5.2 shows the estimated energy consumption per processed data sample across implementation alternatives and application parameters. For each filter the total energy use is split into encoder/decoder (Cod), data memory (Mem) and MAC unit.

## Discussion

Three general conclusions can be drawn from the results in table 5.2:

- The total dissipation in the reference implementation (5.8) is clearly dominated by the MAC unit, which implies that provisions to reduce total power must attack here.
- Energy use of the reference implementation is relatively insensitive to signal amplitude, i.e. processing of $Q_{L}$ takes the same (or even more) energy as processing $Q_{H}$.
- Any of the alternatives investigated outperforms PCM-2sC. Although there is no single optimal implementation for all parameter sets, differences between the alternative encodings are relatively minor. Therefore, other criteria such as area cost may be considered for rating the coding schemes.

|  |  | $I_{L}$ |  |  |  | $I_{H}$ |  |  |  |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :---: |
|  |  | $A_{L}$ |  | $A_{H}$ |  | $A_{L}$ |  | $A_{H}$ |  |
| Coding | Unit | $Q_{L}$ | $Q_{H}$ | $Q_{L}$ | $Q_{H}$ | $Q_{L}$ | $Q_{H}$ | $Q_{L}$ | $Q_{H}$ |
| PCM-2sC as in (5.8) | Cod | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
|  | Mem | 15.9 | 15.1 | 15.9 | 15.1 | 20.0 | 18.3 | 20.0 | 18.3 |
|  | Mac | 118.7 | 119.5 | 147.5 | 149.8 | 256.8 | 242.6 | 281.3 | 266.3 |
|  | $\Sigma$ | 134.6 | 134.6 | 163.4 | 164.9 | 276.8 | 260.9 | 301.3 | 284.6 |
| PCM-S\&M as in (5.10) | Cod | 0.2 | 0.1 | 0.3 | 0.1 | 0.3 | 0.2 | 0.4 | 0.2 |
|  | Mem | 12.0 | 13.8 | 12.0 | 13.8 | 15.5 | 17.1 | 15.5 | 17.1 |
|  | Mac | 17.2 | 46.9 | 41.4 | 74.5 | 71.9 | 135.2 | 116.6 | 178.7 |
|  | $\Sigma$ | 29.4 | 60.8 | 53.7 | 88.4 | 87.7 | 152.5 | 132.5 | 196.0 |
| PCM-S\&M as in (5.11) | Cod | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.2 | 0.1 |
|  | Mem | 12.0 | 13.8 | 12.0 | 13.8 | 15.5 | 17.1 | 15.5 | 17.1 |
|  | Mac | 17.5 | 44.4 | 36.3 | 67.0 | 68.7 | 124.8 | 104.9 | 161.4 |
|  | $\Sigma$ | 29.6 | 58.3 | 48.4 | 80.9 | 84.4 | 142.0 | 120.6 | 178.6 |
| DPCM-S\&M as in (5.15) | Cod | n.a. | n.a. | 2.2 | 2.3 | n.a. | n.a. | 3.5 | 3.3 |
|  | Mem | n.a. | n.a. | 12.6 | 13.8 | n.a. | n.a. | 16.2 | 16.8 |
|  | Mac | n.a. | n.a. | 39.2 | 58.6 | n.a. | n.a. | 111.4 | 131.8 |
|  | $\Sigma$ | n.a. | n.a. | 54.0 | 74.7 | n.a. | n.a. | 131.1 | 151.9 |
| APCM-S\&M as in (5.19) | Cod | 0.5 | 0.3 | 0.3 | 0.2 | 0.8 | 0.3 | 0.6 | 0.3 |
|  | Mem | 11.6 | 13.3 | 12.0 | 13.8 | 15.1 | 16.6 | 15.5 | 17.1 |
|  | Mac | 18.7 | 47.5 | 37.5 | 69.2 | 78.2 | 132.7 | 107.9 | 164.3 |
|  | $\sum$ | 30.8 | 61.1 | 49.8 | 83.2 | 94.1 | 149.6 | 124.0 | 181.7 |
| Log-PCM as in (5.25) | Cod | 1.0 | 0.7 | n.a. | n.a. | $4.5 *$ | $2.4 *$ | n.a. | n.a. |
|  | Mem | 12.5 | 14.7 | n.a. | n.a. | 16.6* | 17.9* | n.a. | n.a. |
|  | Mac | 25.0 | 40.1 | n.a. | n.a. | 131.4* | 168.5* | n.a. | n.a. |
|  | $\Sigma$ | 38.5 | 55.5 | n.a. | n.a. | 152.5* | 188.8* | n.a. | n.a. |

*SQNR 10 dB below target
Table 5.2: Average energy use $\left[10^{-7} \mathrm{~J}\right]$ per input data sample for 17-tap low-pass filter.

|  |  | $I_{L}$ |  | $I_{H}$ |  |
| :---: | :---: | :---: | :---: | :---: | :---: |
| Coding | Unit | $A_{L}$ | $A_{H}$ | $A_{L}$ | $A_{H}$ |
| PCM-2sC as in (5.8) | Cod | 0.0 | 0.0 | 0.0 | 0.0 |
|  | Mem | 4.9 | 4.9 | 6.3 | 6.3 |
|  | Mac | 3.4 | 4.0 | 6.0 | 6.6 |
|  | $\Sigma$ | 8.3 | 8.9 | 12.3 | 12.9 |
| PCM-S\&M as in (5.10) | Cod | 0.2 | 0.2 | 0.3 | 0.3 |
|  | Mem | 4.9 | 4.9 | 6.3 | 6.3 |
|  | Mac | 2.9 | 3.5 | 5.4 | 6.1 |
|  | $\Sigma$ | 8.0 | 8.6 | 12.0 | 12.7 |
| PCM-S\&M as in (5.11) | Cod | 0.1 | 0.1 | 0.2 | 0.2 |
|  | Mem | 4.9 | 4.9 | 6.3 | 6.3 |
|  | Mac | 2.8 | 3.3 | 5.3 | 6.0 |
|  | $\Sigma$ | 7.8 | 8.3 | 11.8 | 12.5 |
| DPCM-S\&M as in (5.15) | Cod | n.a. | 1.3 | n.a. | 1.8 |
|  | Mem | n.a. | 5.2 | n.a. | 6.6 |
|  | Mac | n.a. | 3.6 | n.a. | 6.3 |
|  | $\Sigma$ | n.a. | 10.1 | n.a. | 14.7 |
| APCM-S\&M as in (5.19) | Cod | 0.3 | 0.2 | 0.5 | 0.4 |
|  | Mem | 4.5 | 4.8 | 5.9 | 6.2 |
|  | Mac | 2.3 | 3.2 | 4.3 | 5.4 |
|  | $\Sigma$ | 7.1 | 8.2 | 10.7 | 12.0 |
| Log-PCM as in (5.25) | Cod | 0.6 | n.a. | 2.2 * | n.a. |
|  | Mem | 4.9 | n.a. | 6.3* | n.a. |
|  | Mac | 1.4 | n.a. | 2.9* | n.a. |
|  | $\Sigma$ | 6.9 | n.a. | 11.4* | n.a. |

*SQNR 10 dB below target
Table 5.3: Cell area $\left[10^{-2} \mathrm{~mm}^{2}\right]$ of FIR filter processor with different encodings.

Table 5.3 shows cell area figures corresponding to the energy evaluations from table 5.2. Since the same hardware units are used for processing of high- and low-quality signals, parameters $Q_{L}$ and $Q_{H}$ do not appear in table 5.3.

The results in table 5.2 and 5.3 permit the following statements for the alternative data encodings:

PCM-S\&M. The S\&M implementations (5.10) and (5.11) differ only marginally in their energy and area usage, with (5.11), which is based on 2 sC accumulation, being slightly more efficient. Com-
pared to PCM-2sC, substantial energy savings are achieved. Reduction is especially high for the low-amplitude signal $Q_{L}$ which is in accordance with observations from section 5.3. Furthermore, energy savings of PCM-S\&M are accompanied by a slightly enhanced area economy of about $5 \%$.

DPCM-S\&M. Because of error accumulation and propagation, filtering of differentially encoded data is always to be performed with accurate multiply-accumulate (see section 5.4.4). Therefore, energy and area results are identical for $A_{L}$ and $A_{H}$. Compared with PCM-S\&M, DPCM further decreases energy use in case of high signal amplitude $Q_{H}$ and high processing accuracy $A_{H}$. This decrease comes at the expense of an area penalty of about $15 \%$, however.

APCM-S\&M. Adaptive quantization (5.19) does not reduce overall dissipation relative to PCM-S\&M. Even in the data memory, energy consumption is only slightly better because no significant reduction of the overall data word width was achieved. For example, for low processing accuracy $A_{L}$, the data word width could be reduced by four bits in case of both bit rates $I_{L}$ and $I_{H}$. However, since three bits are required to encode the corresponding gain exponent, the net input word width reduction is only one bit. On the other hand, area cost in case of low processing accuracy $A_{L}$ is about $10 \%$ below that of PCM-S\&M.

Log-PCM. Logarithmic quantization is only applicable to low- accuracy processing, because unrealistic conversion precision between linear and logarithmic domain is required otherwise. For low bit rate $I_{L}$ and high signal amplitude $Q_{H}$, Log-PCM does with less energy than PCM-S\&M. In this case $C(F)$ in (5.20) was manually optimized such that the required SQNR could be attained with a small ( $4 \times 3$ )-bit LUT containing five distinct correction terms, see figure 5.16. Besides, the area cost was reduced by about $12 \%$ compared to PCM-S\&M.
To meet the target processing accuracy for the high-amplitude signal $Q_{H}$, a considerably larger LUT is required. Optimizing a $7 \times 7$ bit LUT for eighty distinct correction terms, processing accuracy of 10 dB below that of the reference implementation
> was achieved. Since in this case energy consumption and area cost already exceed that of other encodings, further expansion of the LUT's complexity is not beneficial.

### 5.5.4 Derivation of General Coding Guidelines

General guidelines for the choice of coding in FIR filtering of speech data shall be derived from the results in table 5.2. Since the respective coding overhead scales with the number of filter $\operatorname{taps} N$, the energy efficiency of the encoding schemes depends on $N$.

In order to extrapolate from filter order, the following assumption is made with respect to data memory and MAC unit: Energy use per data sample grows linearly with the number of filter taps $N$. For the data memory this fact has been experimentally verified in the range $8 \leq N \leq 128$, see figure D.2. For the MAC unit this assumption is justified because no particular filter characteristic is assumed here.

The overall energy consumption per data sample $E_{\text {sample }}^{\sum}$ can thus be written as function of the number of filter taps:

$$
\begin{align*}
E_{\text {sample }}^{\sum(N)} & =E_{\text {sample }}^{\text {Cod }}+E_{\text {sample }}^{\text {Mem }}+E_{\text {sample }}^{\text {Mac }} \\
& =E_{\text {sample }}^{\text {Cod }}+N \cdot\left(E_{\text {tap }}^{\text {Mem }}+E_{\text {tap }}^{\text {Mac }}\right) \tag{5.26}
\end{align*}
$$

Coding energy per sample, $E_{\text {sample }}^{\text {Cod }}$, is directly given by table 5.2. Energy use per filter tap for data memory and Mac unit $E_{\text {tap }}^{M e m}$ and $E_{\text {tap }}^{M a c}$, respectively, are obtained by dividing the corresponding values in table 5.2 by $N=17$, i.e. the number of taps the results were obtained for.

Figure 5.17 visualizes (5.26) by showing relative energy dissipation when using alternative encodings at the chosen points of the parameter field. From these curves it is clear that the relative merits of the various coding schemes do not change beyond eight filter taps or so. Furthermore, it can be concluded that suitable data recoding is beneficial in terms of energy dissipation even for a single MAC operation.


Figure 5.17: Overall energy dissipation relative to PCM-2sC [\%] as function of number of filter taps $N$ for low (high) bit rate $I_{L}\left(I_{H}\right)$ and processing accuracy $A_{L}\left(A_{H}\right)$. Solid (dashed) lines correspond to high (low) signal amplitude $Q_{H}\left(Q_{L}\right)$. Encoding schemes that are not available for certain parameter settings have been dropped.


Figure 5.18: Guideline for the choice of energy-efficient encoding in the three-dimensional application space of speech filtering.

Since in most applications signal amplitude will vary, the decision as to which encoding is preferable in a particular situation shall be based on appropriate weighting of power savings associated to $Q_{L}$ and $Q_{H}$. Because high-amplitude signals imply higher absolute energy dissipation than noise-like signals, $Q_{H}$ should in general be weighted stronger.

According to this policy, figure 5.18 gives a coarse guideline for the choice of energy-efficient encoding in the three-dimensional application space of speech filtering, spanned by parameters bit rate, processing accuracy, and number of MAC operations. In this figure, DPCM and log-PCM relate to (5.15) and (5.25) as before. (A)PCM indicates that, depending on actual word size reduction that can be realized, it might be beneficial to combine S\&M with adaptation as in (5.19).

### 5.6 Summary

Waveform coding techniques such as PCM with sign-magnitude representation, differential PCM, adaptive and logarithmic quantization have been investigated for low-power digital filtering of speech data. Depending on bit rate, processing accuracy, and number of filter taps, guidelines for the choice of coding in application-specific digital signal processors have been established.

Experimental results confirmed linear PCM with sign-magnitude representation as universal coding for energy-efficient implementation of algorithms dominated by multiply-accumulate operations. Depending on signal characteristics and processing accuracy, S\&M can save $25-75 \%$ of the energy in a single-tap filter with 2 's-complement arithmetic. With increasing filter order, other encodings become even more efficient. Differential encoding in conjunction with S\&M is particularly useful in case of high processing accuracy and high bit rate. With the proposed circuit structure for linear approximation of logarithms, log-PCM proved very energy-efficient for low processing accuracy and low bit rates.

Furthermore, experimental results imply that the coding overhead for the encoding techniques considered is small enough to justify recoding of speech data even for a single multiply-accumulate operation. Although the results were obtained for FIR filters, the presented encoding techniques and corresponding arithmetic units may as well be adapted to IIR filters and other processing algorithms dominated by multiply-accumulate operations.

## Chapter 6

## Concluding Remarks

This thesis has explored three major aspects of low-power digital VLSI design in the context of data statistics: estimation of switching activity in logic circuits using probabilistic techniques, lower bounds on power dissipation for a given processing task, and filtering of speech data as example for practical low-power design. The essence of the achieved results and conclusions has been given in the introductory section 1.2.3. More detailed accounts have been provided in the summary sections of chapter 3 (pp.56), chapter 4 (pp. 110), and chapter 5 (p. 150).

This concluding chapter shall reflect on the prospects for practical application of the work presented in this thesis, and provide suggestions for future work.

### 6.1 Prospects for Practical Application

Switching activity estimation. Probabilistic activity estimation is a fascinating topic that has stimulated an enormous amount of work within the VLSI research community. On the other hand, it is barely used in practice. This paradox is easily explained by the existence of a more simple alternative, i.e. logic simulation. Even if one could show any probabilistic approach to handle the runtime-accuracy tradeoff more efficiently, which on a fair basis
is notoriously difficult to do, we would not expect this situation to change drastically.
There is, however, one application which might represent an exception: Activity-driven gate-level power optimization. The need to cope with a circuit structure that changes during the optimization process could render the shift from a well-tried to an unfamiliar technique necessary. A shift which otherwise seems to go against the force of habit.

Lower bound on power dissipation. Compared to the overall amount of references on low-power digital VLSI available, relatively little work has been reported on the conceptual level, which almost inevitably would lead one to the question of a lower bound on power dissipation. Since practical implications are profound, this fact must be ascribed to the difficulty of the problem and the restricted prospect of success to solve it.
One existing attempt to tackle the lower bound problem is [Sha97]. However, in section 4.7 this approach was shown to suffer from a number of limitations although, according to our taxonomy, it only attacks the most simple subproblem of the lower bound issue. Rightfully, we have to admit not to have found the general solution either, even though at the very outset we fancied this possibility. Now, with the conjecture that there can be no useful implementation-independent lower bound, we are much less euphoric concerning a general solution. As there is little hope that an analytical power dissipation model which comprises all design parameters can be found, a general lowpower design theory is out of sight. Therefore, ingenuity and experience will remain desirable features for low-power digital VLSI designers.

Low-power design in consideration of data statistics. In the absence of a general low-power design theory, the practical utility of data statistics for power reduction has been demonstrated by means of an example, i.e. FIR filtering of speech data. To be able to compare the energy efficiency of various data coding schemes, several coding and arithmetic units were designed and optimized manually at the gate level. Since much of this labori-
ous work is specific to the application at hand, the corresponding optimization process is hard to automate, even though a hardware description language and synthesis environment which supports more than the ordinary 2's-complement number representation would be helpful.
However, any time-consuming search in the design space is in deep contrast to today's shorter product development cycles imposed by competitive pressure. In the light of system-level design languages such as SystemC, it appears that this pressure sometimes leads to ranking power dissipation below time-to-market, even for typical low-power applications.

### 6.2 Future Work

The following suggestions for future work can be given:
Switching activity estimation The extension of the proposed correlation approximation technique to circuits with sequential feedback would be valuable. This approximation then could be incorporated into a tool for the analysis of feedback circuits. Of course, the integration of the proposed activity estimation scheme in a gate-level power estimation/optimization tool would be a desirable continuation of this thesis.

Lower bound on power dissipation Potential future work on the lower bound problem seems most realistic for the subproblem "Boolean Optimization" (LBP-2) with the simplifying assumptions of a fixed supply voltage and a restricted set of logic gates.

Practical low-power design The waveform coding techniques investigated for FIR filtering of speech could be evaluated for other DSP operations that are frequently encountered in audio applications, e.g. discrete Fourier or cosine transform. Furthermore, a similar investigation for typical DSP data other than speech, such as music, video or multimedia data, would be interesting.

# Seite Leer / <br> Blank leaf 

## Appendix A

## Basics of Probability Theory

This appendix reviews basic concepts from probability theory. In the context of digital systems the discussion is limited to probability measures for discrete events. The treatment is neither complete nor rigorous, but adapted to the needs in the context of probabilistic activity calculation in chapter 3. More information on the subject can be found for instance in [SW86, BS91].

## Random Variables

Let sample space $\Omega$ be a finite set of events $\omega$. Then, a discrete random variable $X$ formally is a mapping $X: \Omega \rightarrow \mathbb{R}$. For simplicity we assume $\Omega \subset \mathbb{Z}$ and $X(\omega)=\omega, \omega \in \Omega$. The probability mass function (pmf) $p(x)$ of $X$ is defined as

$$
\begin{equation*}
p(x)=\operatorname{Pr}\{X=x\} \tag{A.1}
\end{equation*}
$$

for all $x \in \Omega$. Thus, the pmf fully specifies the statistical properties of $X$.

The above notation naturally extends to random vectors $\boldsymbol{X}=$ $\left(X_{1}, \ldots, X_{n}\right)$ with $X_{i}(i=1, \ldots, n)$ being random variables. The
joint pmf is defined as

$$
\begin{equation*}
p\left(x_{1}, \ldots, x_{n}\right)=\operatorname{Pr}\left\{X_{1}=x_{1}, \ldots, X_{n}=x_{n}\right\} . \tag{A.2}
\end{equation*}
$$

The marginal pmf $p\left(x_{i}\right)$ is obtained by summing over all evaluations of $p\left(x_{1}, \ldots, x_{n}\right)$ at $x_{j}=\omega$ for all $j \neq i$ and $\omega \in \Omega$.

The $i$-th moment of a discrete random variable $X$ is defined as

$$
\begin{equation*}
\mu_{i}=\sum_{x \in \Omega} x^{i} p(x) \tag{A.3}
\end{equation*}
$$

$\mu_{1}$ is the expected value of $X$, also denoted as $E[X]$. The $i$-th central moment of $X$ is

$$
\begin{equation*}
\lambda_{i}=\sum_{x \in \Omega}(x-E[X])^{i} \cdot p(x) \tag{A.4}
\end{equation*}
$$

The second central moment $\lambda_{2}$ is called the variance of $X$ and is denoted $\sigma_{X}^{2}$ :

$$
\begin{align*}
\sigma_{X}^{2} & =\sum_{x \in \Omega}(x-E[X])^{2} \cdot p(x) \\
& =E\left[(X-E[X])^{2}\right] \\
& =E\left[X^{2}\right]-(E[X])^{2} \tag{A.5}
\end{align*}
$$

$\sqrt{\sigma_{X}^{2}}=\sigma_{X}$ is called the standard deviation of $X$.
The above definitions extend to random vectors. In chapter 3 the following $i$-th moment of a random vector $\boldsymbol{X}=\left(X_{1}, \ldots, X_{n}\right)$ is of interest:

$$
\begin{equation*}
\mu_{j_{1} \ldots j_{i}}=\sum_{\left(x_{1}, \ldots, x_{n}\right) \in \Omega^{n}} x_{j_{1}} \ldots x_{j_{i}} \cdot p\left(x_{1}, \ldots, x_{n}\right) \tag{A.6}
\end{equation*}
$$

The first moments $\mu_{j}(j=1, \ldots, n)$ are exactly the expected values $E\left[X_{j}\right]$ of the components of $\boldsymbol{X}$.
The second central moments of $\boldsymbol{X}$ follow in analogy to (A.5) as

$$
\begin{align*}
\lambda_{j_{1} j_{2}} & =E\left[\left(X_{j_{1}}-E\left[X_{j_{1}}\right]\right)\left(X_{j_{2}}-E\left[X_{j_{2}}\right]\right)\right] \\
& =E\left[X_{j_{1}} X_{j_{2}}\right]-E\left[X_{j_{1}}\right] \cdot E\left[X_{j_{2}}\right] \tag{A.7}
\end{align*}
$$

$\lambda_{j_{1} j_{2}}\left(j_{1}, j_{2} \in\{1, \ldots, n\}\right)$ is called the covariance of the random variables $X_{j_{1}}$ and $X_{j_{2}}$ and is a measure for their correlation. If the correlation coefficient $\varrho_{X_{j_{1}} X_{j_{2}}}$, being defined as

$$
\begin{equation*}
\varrho_{X_{j_{1}} X_{j_{2}}}=\frac{\lambda_{j_{1} j_{2}}}{\sqrt{\sigma_{X_{j_{1}}}^{2} \sigma_{X_{j_{2}}}^{2}}}, \tag{A.8}
\end{equation*}
$$

is equal to zero, $X_{j_{1}}$ and $X_{j_{2}}$ are said to be uncorrelated. For $j_{1}=$ $j_{2}=j$ the covariance is identical to the variance and $\varrho_{X_{j} X_{j}}=1$.

## Random Sequences

A random process $\left\{X_{k}\right\}$ is a collection of random variables indexed by $k$. If the randam variables have a discrete sample space $\Omega$ as above and $k$ assumes values from the set of positive integers $0,1,2, \ldots$, then $\left\{X_{k}\right\}$ is called a random sequence. Index $k$ can be thought of as the discrete time. As in the context of probabilistic power estimation in chapter 3 , the random variables $X_{k}$ may itself be multi-dimensional, i.e. the $X_{k}$ may be random vectors.

A random sequence is said to be strict-sense stationary if the pmf's associated to any two sequences $X_{k}, X_{k+1}, \ldots, X_{k+n}$ and $X_{k+\tau}, X_{k+\tau+1}, \ldots, X_{k+\tau+n}$ are identical. This requires that all moments of $X_{k}, X_{k+1}, \ldots, X_{k+n}$ are independent of time $k$. As opposed to this, wide-sense stationarity only requires $E\left[X_{k}\right]$ and $E\left[X_{k} X_{k+\tau}\right]$, i.e. the first two moments to be independent of $k$. In general, the term stationary is used to indicate time-independence for all statistical properties.

A random sequence $\left\{X_{k}\right\}$ is said to be an i.i.d. sequence if all its random variables are independent and identically distributed, i.e. have the same probability mass function $p(x)$.

A random sequence $\left\{X_{k}\right\}$ forms a Markov chain if
$\operatorname{Pr}\left\{X_{k+1}=x_{k+1} \mid X_{k}=x_{k}, \ldots, X_{0}=x_{0}\right\}=\operatorname{Pr}\left\{X_{k+1}=x_{k+1} \mid X_{k}=x_{k}\right\}$
for all states $x_{i} \in \Omega$, where $\operatorname{Pr}\{X=x \mid Y=y\}$ is the conditional probability of $X$ assuming the value $x$ given that $Y$ has assumed $y$. Thus, in a Markov chain the next state $k+1$ depends only on the current state $k$ but is conditionally independent of the past. Therefore, a Markov chain is completely specified by the transition probability matrix $Q$ with elements $q_{i j}(i, j \in 1, \ldots,|\Omega|)$, denoting the probability of changing from state $i$ to state $j$ in the next time step, i.e. $q_{i j}=$ $\operatorname{Pr}\left\{X_{k+1}=j \mid X_{k}=i\right\}$.

If the probabilities $\pi_{i}=\operatorname{Pr}\left\{X_{k}=i\right\},(i=1, \ldots,|\Omega|)$ converge for $k \rightarrow \infty$, the row vector $\pi=\left[\pi_{1}, \ldots, \pi_{|\Omega|}\right]$ is called stationary distribution of the Markov chain and can be obtained from solving the set of linear equations (Chapman-Kolmogorov equation)

$$
\begin{align*}
\pi Q & =\boldsymbol{\pi} \\
\pi_{1}+\ldots+\pi_{|\Omega|} & =1 . \tag{A.10}
\end{align*}
$$

## Appendix B

## Basics of Information Theory

This section provides an overview of basic quantities and their properties as being used in digital information theory. It builds upon elementary notations from probability theory as given in appendix A. The material widely follows [CT91] in style and naming convention.

## Entropy

Let $X$ be a discrete random variable $X$ with alphabet $\mathcal{X}$ and probability mass function (pmf) $p(x)$ as defined in appendix A . The entropy or information content, of $X$, or uncertainty in $X$, is defined as

$$
\begin{equation*}
H(X)=-\sum_{x \in \mathcal{X}} p(x) \log p(x) \tag{B.1}
\end{equation*}
$$

with the convention $0 \log 0=0$. If the $\log$ in (B.1) is to the base 2 , then entropy is measured in bits, which is assumed throughout this text. Alternatively for $H(X)$ one may write $H\left(p_{1}, \ldots, p_{\mid \mathcal{X |}}\right)$ for $p_{i}=p\left(x_{i}\right), i=1, \ldots,|\mathcal{X}|$. With the above definition, entropy solely depends on the probability distribution of $X$, but not on the actual values it can take. Furthermore (B.1) implies

$$
\begin{equation*}
0 \leq H(X) \leq \log |\mathcal{X}| \tag{B.2}
\end{equation*}
$$



Figure B.1: Binary entropy function $h(p)$ and switching activity function $4 p(1-p)$.

The right-side equality holds if and only if $p(x)$ is the uniform distribution, i.e. $p(x) \sim \mathcal{U}$. The redundancy in $X$ is defined as the difference between the maximum and the actual entropy:

$$
\begin{equation*}
\mathcal{R}(X)=\log |\mathcal{X}|-H(X) . \tag{B.3}
\end{equation*}
$$

Of particular interest in digital VLSI is the binary entropy function $h(p)$ for a binary random variable $X$ with $\mathcal{X}=\{0,1\}, p(1)=p$, and $p(0)=1-p$ :

$$
\begin{equation*}
h(p)=H(X)=-p \log (p)-(1-p) \log (1-p) \tag{B.4}
\end{equation*}
$$

The graph of $h(p)$ depicted in figure B. 1 illustrates that the entropy of a binary random variable is maximized for $p=0.5$. Furthermore, $h(p)$ is monotonically increasing in the interval ( $0,1 / 2$ ] and decreasing in $[1 / 2,1)$. The redundancy of a binary random variable $X$ follows from (B.3) as $\mathcal{R}(X)=1-h(p)$. Also note the similarity of the graph for $h(p)$ with the switching activity graph for $\rho=-1$ in figure 3.4. As can be seen

$$
\begin{equation*}
h(p) \geq 4 p(1-p), \quad(0 \leq p \leq 1) \tag{B.5}
\end{equation*}
$$

The concept of entropy of a random variable $X$ naturally extends to the joint entropy of a collection of random
variables $\left(X_{1}, X_{2}, \ldots, X_{n}\right)$ with joint probability mass function $p\left(x_{1}, x_{2}, \ldots, x_{n}\right)$ :

$$
H\left(X_{1}, \ldots, X_{n}\right)=-\sum_{x_{1} \in \mathcal{X}_{1}} \ldots \sum_{x_{n} \in \mathcal{X}_{n}} p\left(x_{1}, \ldots, x_{n}\right) \log p\left(x_{1}, \ldots, x_{n}\right) .
$$

## Mutual Information

The conditional entropy of a pair of random variables $(X, Y)$ with probability mass function $p(x, y)$ is defined as

$$
\begin{equation*}
H(X \mid Y)=-\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x, y) \log p(x \mid y) . \tag{B.6}
\end{equation*}
$$

Conditional entropy $H(X \mid Y)$ can be interpreted as the remaining uncertainty in $X$ given the knowledge of $Y$. Definition (B.6) ensures correspondence between conditional entropy and conditional probability by virtue of the following chain rule:

$$
\begin{equation*}
H(X, Y)=H(Y)+H(X \mid Y) \tag{B.7}
\end{equation*}
$$

Based on conditional entropy, the mutual information between two random variables is defined as

$$
\begin{equation*}
I(X ; Y)=H(X)-H(X \mid Y)=H(Y)-H(Y \mid X) \tag{B.8}
\end{equation*}
$$

Thus, $I(X ; Y)$ can be viewed as the reduction in uncertainty in $X$ due to the knowledge of $Y$. Note that mutual information is symmetric, i.e. $X$ contains as much information about $Y$ as $Y$ contains about $X$. Mutual information can also be thought of as the "distance" between the joint distribution and the product of the marginal distributions of $X$ and $Y$ : The higher the "distance" between $p(x, y)$ and $p(x) p(y)$, the higher is the correlation between $X$ and $Y$, and the higher is the mutual information $I(X ; Y)$. Figure B. 2 explains the relation between entropy and mutual information by means of a Venn diagram.

## Entropy Rate

Let $\left\{X_{k}\right\}$ be a random sequence as defined in appendix A. The entropy rate $\mathcal{H}$ of $\left\{X_{k}\right\}$ is defined as the average entropy per symbol of a


Figure B.2: Relation between entropy and mutual information.
sequence of $n$ random variables as $n$ goes to infinity:

$$
\begin{equation*}
\mathcal{H}\left(\left\{X_{k}\right\}\right)=\lim _{n \rightarrow \infty} \frac{1}{n} H\left(X_{1}, X_{2}, \ldots, X_{n}\right) . \tag{B.9}
\end{equation*}
$$

For stationary processes the limit in (B.9) always exists and equals the conditional entropy of the last random variable given the past, i.e. $\mathcal{H}\left(\left\{X_{k}\right\}\right)=\lim _{n \rightarrow \infty} H\left(X_{n} \mid X_{n-1}, \ldots, X_{1}\right)$.

For two types of random processes the entropy rate becomes particularly simple. First, for an independent and identically distributed (i.i.d.) sequence the entropy rate equals the entropy of its random variables:

$$
\begin{equation*}
\mathcal{H}\left(\left\{X_{k}\right\}\right)=H(X) \text { when }\left\{X_{k}\right\} \text { i.i.d. } \tag{B.10}
\end{equation*}
$$

Second, if $\left\{X_{k}\right\}$ forms a stationary Markov chain the entropy rate is given by

$$
\begin{align*}
\mathcal{H}\left(\left\{X_{k}\right\}\right) & =H\left(X_{n} \mid X_{n-1}\right)  \tag{B.11}\\
& =-\sum_{i, j} \mu_{i} Q_{i j} \log Q_{i j} \tag{B.12}
\end{align*}
$$

where $\mu$ is the stationary distribution and $Q$ is the transition matrix as defined in appendix A .

## Appendix C

## FSM Analysis Example

By means of the example shown below, this appendix demonstrates how arithmetic transformation ${ }^{1}$ can be utilized for the exact analysis of feedback circuits. Assume input $x$ is highly (first-order) anticorrelated ( $\varrho_{x}=-0.6$ ). Then, state signal $s$ must be modeled as lag-two Markov chain, see theorem 3.3. This is equivalent to using the given moments $p_{x}^{1}$ and $p_{x-x}^{1-1}$ for the analysis. On the other hand, a lag-one Markov chain disregards the temporal correlation of $x$ by specifying the state transition probabilities in terms of $p_{x}^{1}$ only. As will be seen, this can result in large errors for the switching activity $\alpha_{y}$.


[^18]
## Lag-two Markov chain model

Construction of the arithmetic transform for the next-state function ${ }^{2}$ $s^{k+1}=\overline{x^{k} \vee s^{k}}$ and the extended next-state function obtained from AND-ing both sides with $x^{k+1}$, yields

$$
\begin{aligned}
s^{k+1} & =1-x^{k}-s^{k}+x^{k} s^{k} \\
x^{k+1} s^{k+1} & =x^{k+1}-x^{k} x^{k+1}-s^{k} x^{k+1}+x^{k} s^{k} x^{k+1}
\end{aligned}
$$

which translates into equations for two of the unknown moments:

$$
\begin{align*}
p_{s}^{1} & =1-p_{x}^{1}-p_{s}^{1}+p_{x s}^{11}  \tag{C.1}\\
p_{x s}^{11} & =p_{x}^{1}-p_{x-x}^{1-1}-p_{s-x}^{1-1}+p_{x s-x}^{11-1} \tag{C.2}
\end{align*}
$$

This brings two other unknowns $p_{s-x}^{1-1}$ and $p_{x s-x}^{11-1}$ into operation. Under a lag-two Markov chain model $p_{x s-x}^{11-1}$ can be expressed in terms of $p_{x}^{1}$, $p_{x-x}^{1-1}$ and $p_{x s}^{11}$ as follows:

$$
\begin{align*}
p_{x s-x}^{11-1} & =\operatorname{Pr}\left\{x^{k}=1, s^{k}=1, x^{k+1}=1\right\} \\
& =\operatorname{Pr}\left\{s^{k}=1, x^{k+1}=1 \mid x^{k}=1\right\} \operatorname{Pr}\left\{x^{k}=1\right\} \\
& =\operatorname{Pr}\left\{s^{k}=1 \mid x^{k}=1\right\} \operatorname{Pr}\left\{x^{k+1}=1 \mid x^{k}=1\right\} \operatorname{Pr}\left\{x^{k}=1\right\} \\
& =p_{x s}^{11} p_{x-x}^{1-1} / p_{x}^{1} . \tag{C.3}
\end{align*}
$$

In (C.3) the third equality holds, because $x^{k+1}$ only depends on $x^{k}$ but not on previous input values (first-order temporal correlation). On the other hand, $s^{k}$ depends on $x^{k-1}$ and $s^{k-1}$. Thus, if $x^{k}$ is fixed, $x^{k+1}$ and $s^{k}$ are independent. The derivation of $p_{s-x}^{1-1}$ follows a similar reasoning:

$$
\begin{align*}
p_{s-x}^{1-1} & =p_{x s-x}^{11-1}+p_{x s-x}^{01-1} \\
& =p_{x s-x}^{11-1}+p_{x s}^{01} p_{x-x}^{0-1} / p_{x}^{0} \\
& =p_{x s-x}^{11-1}+p_{x s}^{01}\left(p_{x}^{1}-p_{x-x}^{1-1}\right) /\left(1-p_{x}^{1}\right) \tag{C.4}
\end{align*}
$$

In (C.4) the second term of the right-hand side is derived in analogy to $p_{x s-x}^{11-1}$ in (C.3), with $p_{x-x}^{0-1}=p_{x}^{1}-p_{x-x}^{1-1}$, see (3.9). Finally, $p_{x s}^{01}$ can

[^19]be expressed in terms of moments by constructing the corresponding arithmetic transform $\overline{x^{k+1}} s^{k+1}$ :
$$
\left(1-x^{k+1}\right) s^{k+1}=1-x^{k+1}-x^{k}\left(1-x^{k+1}\right)-s^{k}\left(1-x^{k+1}\right)+x^{k} s^{k}\left(1-x^{k+1}\right)
$$
which translates into
\[

$$
\begin{equation*}
p_{x s}^{01}=1-2 p_{x}^{1}+p_{x-x}^{1-1}-p_{s}^{1}+p_{s-x}^{1-1}+p_{x s}^{11}-p_{x s-x}^{11-1} \tag{C.5}
\end{equation*}
$$

\]

With this expression, (C.1)-(C.4) yield a set of linear equations for the unknown moments that refer to the static behavior of state signal $s$ :

$$
\left(\begin{array}{cccc}
2 & -1 & 0 & 0 \\
0 & 1 & 1 & -1 \\
\beta & -\beta & 1-\beta & \beta-1 \\
0 & -p_{x-x}^{1-1} / p_{x}^{1} & 0 & 1
\end{array}\right) \cdot\left(\begin{array}{c}
p_{s}^{1} \\
p_{x s}^{11} \\
p_{s-x}^{11} \\
p_{x s-x}^{11-1}
\end{array}\right)=\left(\begin{array}{c}
1-p_{x}^{1} \\
p_{x}^{1}-p_{x-x}^{1-1} \\
\beta \cdot \delta \\
0
\end{array}\right)
$$

with $\beta=\left(p_{x}^{1}-p_{x-x}^{1-1}\right) /\left(1-p_{x}^{1}\right)$ and $\delta=1-2 p_{x}^{1}+p_{x-x}^{1-1}$. Solving this set of equations ${ }^{3}$ with the given $p_{x}^{1}$ and $p_{x-x}^{1-1}$ gives

$$
\begin{equation*}
p_{s}^{1}=0.417, \quad p_{x s}^{11}=0.333, \quad p_{s-x}^{1-1}=0.133, \quad p_{x s-x}^{11-1}=0.067 . \tag{C.6}
\end{equation*}
$$

Information on the temporal correlation of the state signal $s$ can be inferred from AND-ing the next-state function with $s^{k}$ in consideration of the exponent suppression rule, see (3.22). In general, this would eventually result in another set of equations. However, for the present example we have

$$
s^{k} s^{k+1}=s^{k}-x^{k} s^{k}-s^{k}+x^{k} s^{k} \equiv 0 .
$$

From this, the following moments are determined:

$$
\begin{equation*}
\underline{p}_{s-s}^{1-1}=p_{x s-s}^{11-1}=p_{s-x s}^{1-11}=p_{x s-x s}^{11-11}=0 . \tag{C.7}
\end{equation*}
$$

This result is in accordance with the state transition graph of the example circuit, where there is no self-edge on state $s=1$. Similarly, from AND-ing the next-state function with $x^{k}$, one obtains

$$
\begin{equation*}
\underline{p}_{x-s}^{1-1}=p_{x-x s}^{1-11}=0 . \tag{C.8}
\end{equation*}
$$

[^20]With (C.6)-(C.8) all necessary moments of the circuit input are determined and the switching activity of all gates in the combinational part of the circuit may be calculated by means of the procedure introduced in section 3.4.2. In the present case, the true ${ }^{4}$ switching activity $\alpha_{y}$ is simply given as

$$
\begin{equation*}
\underline{\underline{\alpha_{y}}=2\left(p_{s}^{1}-p_{s-s}^{1-1}\right)=0.833} . \tag{C.9}
\end{equation*}
$$

## Lag-one Markov chain model

Next, the above result shall be compared with the ones available from applying the conventional lag-one Markov chain model to the state signal $s$, see appendix A. In case of the present example circuit, the one-step state transition probability matrix $Q$ is given as

$$
Q=\left[\begin{array}{cc}
p_{x}^{1} & 1-p_{x}^{1}  \tag{C.10}\\
1 & 0
\end{array}\right]
$$

Solving the corresponding Chapman-Kolmogorov equation, see (A.10), yields the static probability of the state signal:

$$
\begin{equation*}
\hat{p}_{s}^{1}=\left(1-p_{x}^{1}\right) /\left(2-p_{x}^{1}\right)=\frac{1}{3} . \tag{C.11}
\end{equation*}
$$

This obviously deviates from the true result in (C.6).
Using this incorrect result and furthermore erroneously assuming that $s$ is temporally uncorrelated, the switching activity of $y$ amounts to

$$
\begin{equation*}
\hat{\alpha}_{y}=2 \hat{p}_{s}^{1}\left(1-\hat{p}_{s}^{1}\right)=0.444 \tag{C.12}
\end{equation*}
$$

which is only half the true value in (C.9). Even exploiting the knowledge of the state-transition probabilities, which is helpful in the present example only because $y=s$, yields

$$
\begin{equation*}
\hat{\alpha}_{y}=2 \hat{p}_{s}^{1}=0.667 \tag{C.13}
\end{equation*}
$$

which is still $20 \%$ short of the true value.
Thus, the conventional (lag-one) Markov chain model for finite-state machines only represents a raw approximation for feedback circuits with temporally correlated primary inputs.

[^21]
## Appendix D

## Memory Structures for FIR Filters

This appendix provides an overview on possible realizations of the functional data memory for finite impulse response (FIR) filters, and presents experimental results on power dissipation and area usage for such memory structures. In general, a similar data memory organization is required for the realization of any inner product operation, e.g. in correlator units.

## Memory Structures

In an $N$-tap FIR filter

$$
y(k)=\sum_{n=0}^{N-1} c_{n} x(k-n)
$$

each input data sample contributes to $N$ consecutive output samples. Therefore, storage capacity for $N$ input samples is required.

It is assumed, that the FIR filter algorithm is to be performed on a single multiply-accumulate unit in a fully time-shared manner as in section 5.4. The filter coefficients $c_{n}(n=0, \ldots, N-1)$ are not known a priori and could be time-varying as in adaptive filters.

Furthermore, the functional data memory shall be built from bistable devices (flip-flops). Then, there are basically three different schemes to realize data storage and access for FIR filtering:

FSR : feedback shift register with fixed read/write position,
SRR : shift register with fixed write and random read access,
RRW : register file with random read/write access.
The register file with random read/write access (RRW) corresponds to a solution with a macro-cell RAM which, however, is not further considered here. Instead, two alternative implementations for RRW are examined:

RRW $_{M X}$ : register file with multiplexed inputs,
$\mathrm{RRW}_{\mathrm{CG}}$ : register file with gated clock signals.
Figure D. 1 shows the the alternative memory structures for any of which the following shall hold:

- New input data samples arrive at a frequency $f_{S}$ (data sampling frequency).
- Data samples at the output of the memory (input to data path) are registered at frequency $N \cdot f_{S}$.
- The data path processes the data samples in the following order: $x(k), x(k-1), \ldots, x(k-N+1)$.
- Two in-phase clock signals of frequencies $f_{S}$ and $N \cdot f_{S}$ are available at no cost.

Note that the four memory structures in figure D. 1 not only differ in their hardware complexity, but also map the same input signal statistics to node activities in distinct ways.


Figure D.1: Alternative memory structures for FIR filtering.

## Experimental Evaluation

For each memory structure a VHDL model has been generated. Using Synopsys DesignCompiler, netlists were synthesized $(0.25 \mu \mathrm{~m}$ CMOS standard cell library) for $N=8,16,32,64,128$ and a data word width of 12 bits. Each netlist was optimized for minimum area. Then, the gate-level netlists have been exercised with speech stimuli ( $f_{S}=8 \mathrm{kHz}$ ) in 2's-complement representation. Activities including glitches have been captured by utilizing gate-delay models provided by the technology library. Finally, average power consumption

| \# of <br> taps $N$ | Power |  |  |  |  | Area |  |  |  |
| :---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | ---: | :---: |
|  | FSR | SRR | RRW $_{\text {MX }}$ | RRW $_{\mathrm{CG}}$ | FSR | SRR | RRW $_{\mathrm{MX}}$ | RRW $_{\mathrm{CG}}$ |  |
| 8 | 3.9 | 1.3 | 1.4 | 1.0 | 19 | 22 | 31 | 24 |  |
| 16 | 12.7 | 2.8 | 2.9 | 2.4 | 33 | 39 | 56 | 43 |  |
| 32 | 46.5 | 5.8 | 6.2 | 5.2 | 61 | 74 | 106 | 80 |  |
| 64 | 178.8 | 13.0 | 14.0 | 11.7 | 117 | 142 | 204 | 151 |  |
| 128 | 704.8 | 31.4 | 29.5 | 26.3 | 227 | 279 | 399 | 294 |  |

Table D.1: Power dissipation $[\mu \mathrm{W}]$ and area usage $\left[10^{3} \mu \mathrm{~m}^{2}\right]$ for different memory structures of FIR filters (data word width $=12 \mathrm{bits}$ ).


Figure D.2: Interpolated data from table D. 1 on double-logarithmic scale.
has been computed for estimated capacitive loads using Synopsys DesignPower. Table D. 1 summarizes the numerical results. In figure D. 2 these results have been interpolated and plotted on a double-logarithmic scale.

The experimental results permit the following conclusions:

- While memory structures $\mathrm{SRR}^{( } \mathrm{RRW}_{\mathrm{MX}}$, and $\mathrm{RRW}_{\mathrm{CG}}$ have comparable power dissipation, the minimum hardware solution FSR dissipates exorbitantly more power. This is due to the clocking of the registers at a frequency of $N \cdot f_{S}$ for FSR.
- Memory structure RRW $_{\mathrm{CG}}$, which uses clock-gating, achieves the minimum power dissipation of all alternatives. With in-
creasing data word width, the advantage of RRW $_{\text {CG }}$ is expected to grow further because the relative overhead due to the clockgating circuitry will diminish.
- $\mathrm{RRW}_{\mathrm{MX}}$ with multiplexed flip-flop inputs is the least desirable memory structure, since it implies the largest area cost while being less energy-efficient than RRW $_{\text {CG }}$.
- Power dissipation and area cost grow nearly linear with respect to the number of filter taps $N$ for all considered memory structures.


# Seite Leer / <br> Blank leaf 

## Bibliography

[Abr63] N. Abramson. Information Theory and Coding. Electronic Science Series. McGraw-Hill, 1963.
[ALES98] A. Alvandpour, P. Larsson-Edefors, and C. Svensson. Impact of Miller capacitances on power consumption. In Proc. 8th Int. Workshop Power and Timing Modeling, Optimization and Simulation (PATMOS'98), Lyngby DK, Oct. 1998.
[AMD $\left.{ }^{+} 94\right]$ M. Alidina, J. Monteiro, S. Devadas, A. Ghosh, and M. Papaefthymiou. Precomputation-based sequential logic optimization for low power. IEEE Trans. VLSI Syst., 2(4):426-436, Dec. 1994.
[Ben73] C.H. Bennett. Logical reversibility of computation. IBM J. Res. Develop., 17:525-532, 1973.
[BG99] V.A. Bartlett and E. Grass. A low-power FIR filter exploiting data-dependent operation. In Proc. 9th Int. Workshop Power and Timing Modeling, Optimization and Simulation (PATMOS'99), 1999.
$\left[\mathrm{BGL}^{+} 98\right]$ C.H. Bennett, P. Gács, M. Li, P.M.B. Vitányi, and W.H. Zurek. Information distance. IEEE Trans. Inform. Theory, 44(4):1407-1423, 1998.
[BL85] C.H. Bennett and R. Landauer. The fundamental physical limits of computation. Scientific American, pages 38-46, 1985.
$\left[\mathrm{BMM}^{+} 00\right]$ L Benini, A. Macii, E. Macii, M. Poncino, and R. Scarsi. Architectures and synthesis algorithms for power-efficient bus interfaces. IEEE Trans. Computer-Aided Design, 19(9):969-980, Nov. 2000.
[BNYT93] R. Burch, F.N. Najm, P. Yang, and T.N. Trick. A Monte Carlo approach for power estimation. IEEE Trans. VLSI Syst., 1(1):63-71, March 1993.
[BS91] I.N. Bronstein and K.A. Semendjajew. Taschenbuch der Mathematik. B.G. Teubner Verlagsgesellschaft, 25 edition, 1991. (in German).
[BTR00] D.K. Brock, E.K. Track, and J.M. Rowell. Superconductor ICs: The $100-\mathrm{GHz}$ second generation. IEEE Spectrum, Dec. 2000.
[ $\left.\mathrm{C}^{+} 98\right] \quad$ D.I. Cheng et al. A hybrid methodology for switching activities estimation. IEEE Trans. Computer-Aided Design, 17(4):357-366, Apr. 1998.
[CA90] K.-T. Cheng and V. Agrawal. An entropy measure for the complexity of multi-output Boolean functions. In Proc. Design Automation Conf., pages 302-305, 1990.
[CB95] A. P. Chandrakasan and R. W. Brodersen. Low Power Digital CMOS Design. Kluwer, Norwell, MA, 1995.
[CMD97] J.C. Costa, J.C. Monteiro, and S. Devadas. Switching activity estimation using limited depth reconvergent path analysis. In Proc. Int. Symp. Low Power Electronics and Design, pages 184-189, 1997.
[CSY96] D.-S. Chen, M. Sarrafzadeh, and G.K.H. Yeap. State encoding of finite state machines for low power design. J. Circuits, Systems and Computers, 6(6):649-661, Dec. 1996.
[CT91] T.M. Cover and J. A. Thomas. Elements of Information Theory. John Wiley and Sons, 1991.
[CZV65] M Combet, H. Zonneveld, and L. Verbeek. Computation of the base two logarithm of binary numbers. IEEE Trans. Electron. Comput., EC-14:863-867, Dec. 1965.
[DTP98] C.-S. Ding, C.-Y. Tsui, and M. Pedram. Gate-level power estimation using tagged probabilistic simulation. IEEE Trans. Computer-Aided Design, 17(11):1099-1107, Nov. 1998.
[GDKW92] A. Ghosh, S. Devadas, K. Keutzer, and J. White. Estimation of average switching activity in combinational and sequential circuits. In Proc. Design Automation Conf., pages 253-259, 1992.
[Ger96] N. Gershenfeld. Signal entropy and the thermodynamics of computation. IBM Systems Journal, 35(3\&4):577-586, 1996.
[Hel72] L. Hellermann. A measure of computational work. IEEE Trans. Comput., C-21:439-446, May 1972.
[Hey99] A.J.G. Hey, editor. Feynman and Computation: Exploring the Limits of Computers. Perseus Books, 1999.
[HMPS96] G.D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Markovian analysis of large finite state machines. IEEE Trans. Computer-Aided Design, 15(12):1479-1493, Dec. 1996.
[Hwa79] K. Hwang. Computer Arithmetic: Principles, Architecture and Design. Wiley \& Sons, 1979.
[HYH99] M.C. Hansen, H. Yalcin, and J.P. Hayes. Unveiling the ISCAS-85 benchmarks: A case study in reverse engineering. IEEE Design $\mathcal{B}$ Test of Comput., pages 72-80, JulySept. 1999.
[IP97] S. Iman and M. Pedram. Low Power Design in Deep Submicron Electronics, chapter Combinational Circuit Optimization. Kluwer, 1997.
[JN84] N.S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice-Hall, 1984.
[Kae00] H. Kaeslin. VLSI II: Design of Very Large Scale Integration Circuits. Lecture notes, Integrated Systems Laboratoty, ETH Zürich, 2000.
[KAK99] A. Kabbani and A.J. Al-Khalili. Estimation of ground bounce effects on CMOS circuits. IEEE Trans. Components and Packaging Technology, 22(2):316-325, June 1999.
[KB98] M. Kahrs and K. Brandenburg, editors. Applications of Digital Signal Processing to Audio and Acoustics. Kluwer Academics, 1998.
[Key01] R.W. Keyes. Fundamental limits of silicon technology. Proc. of the IEEE, pages 227-239, March 2001.
[Kon94] A.M. Kondoz. Digital Speech: Coding for Low Bit Rate Communications Systems. John Wiley \& Sons, 1994.
[KS99] L.M. Krauss and G.D. Starkman. The fate of life in the universe. Scientific American, pages 36-43, Nov. 1999.
[Lan61] R. Landauer. Irreversibility and heat generation in the computing process. IBM J. Res. Develop., 5:183-191, 1961.
[Lew90] D.M. Lewis. An architecture for addition and subtraction of long word length numbers in the logarithmic number system. IEEE Trans. Comput., 39(11):1325-1336, Nov. 1990.
[LNC96] J.T. Ludwig, S.H. Nawab, and A.P. Chandrakasan. Low-power digital filtering using approximate processing. IEEE J. Solid-State Circuits, 31(3):395-400, 1996.
[LR90] H. Leff and A. Rex, editors. Maxwell's Demon: Entropy, Information, Computing. Princeton University Press, 1990.
[MD00] J.D. Meindl and J.A. Davis. The fundamental limit on binary switching energy for terascale integration (TSI). IEEE J. Solid-State Circuits, 35(10):1515-1516, Oct. 2000.
[MDG $\left.{ }^{+} 97\right]$ J. Monteiro, S. Devadas, A. Ghosh, K. Keutzer, and J. White. Estimation of average switching activity in combinational logic circuits using symbolic simulation. IEEE Trans. Computer-Aided Design, 16(1):121127, Jan. 1997.
[Mei95] J.D. Meindl. Low power microelectronics: Retrospect and prospect. Proc. of the IEEE, 83(4):619-635, Apr. 1995.
[Mit62] J.N. Mitchell. Computer multiplication and division using binary logarithms. IRE Trans. Electron. Comput., EC-11:512-517, 1962.
[MMP95] R. Marculescu, Diana Marculescu, and M. Pedram. Efficient power estimation for highly correlated input streams. In Proc. 32nd Design Automation Conf., pages 628-634, San Francisco, CA, 1995.
[MMP96] D. Marculescu, R. Marculescu, and M. Pedram. Information theoretic measures for power analysis. IEEE Trans. Computer-Aided Design, 15(6):599-610, June 1996.
[MMP97] R. Marculescu, D. Marculescu, and M. Pedram. Composite sequence compaction for fine-state machines using block entropy and high-order Markov models. In Proc. Int. Symp. Low Power Electronics and Design, pages 190195, 1997.
[MMP98] R. Marculescu, D. Marculescu, and M. Pedram. Probabilistic modeling of dependencies during switching activity analysis. IEEE Trans. Computer-Aided Design, 17(2):73-83, Feb. 1998.
[MMP99] R. Marculescu, D. Marculescu, and M. Pedram. Sequence compaction for power estimation: Theory and practice. IEEE Trans. Computer-Aided Design, 18(7):973993, July 1999.
[MPS98] E. Macii, M. Pedram, and F. Somenzi. High-level power modeling, estimation, and optimization. In IEEE Trans. Computer-Aided Design, volume 17, pages 1061-1079, Nov. 1998.
[NN96a] M. Nemani and F.N. Najm. High-level power estimation and the area complexity of Boolean functions. In Proc. Int. Symp. Low Power Electronics and Design, pages 329334, 1996.
[NN96b] M. Nemani and F.N. Najm. Towards a high-level power estimation capability. IEEE Trans. Computer-Aided Design, 15(6):588-598, June 1996.
[NN99] M. Nemani and F.N. Najm. High-level area and power estimation for VLSI circuits. IEEE Trans. ComputerAided Design, 18(6):697-713, 1999.
[ $\mathrm{NOC}^{+} 97$ ] S.H. Nawab, A.V. Oppenheim, A.P. Chandrakasan, J.M. Winograd, and J.T. Ludwig. Approximate signal processing. J. VLSI Signal Processing Systems, 15:177-200, Jan. 1997.
[NS99] L.S. Nielsen and J. Sparso. Designing asynchronous circuits for low power: An IFIR filter bank for a digital hearing aid. Proc. of the IEEE, 87(2):268-281, 1999.
[Opp70] A.V. Oppenheim. Realization of digital filters using block floating point arithmetic. IEEE Trans. Audio and Electroacoustics, 18:130-136, 1970.
[Pan00] C.J. Pan. A stereo audio chip using approximate processing for decimation and interpolation filters. IEEE Trans. Solid-State Circuits, 35(1):45-55, Jan. 2000.
[PM75] K. Parker and E. McClusky. Probabilistic treatment of general combinational networks. IEEE Trans. Electron. Comput., C-24(6):668-670, 1975.
[RB97] K. Ralev and P. Bauer. Implementation options for block floating point digital filters. In Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP-97), volume 3, pages 2197-2200, 1997.
[RSH99a] S. Ramprasad, N.R. Shanbhag, and I.N. Hajj. Decorrelating (DECOR) transformations for low-power digital filters. IEEE Trans. Circuits and Syst., 46(6):776-787, June 1999.
[RSH99b] S. Ramprasad, N.R. Shanbhag, and I.N. Hajj. Signal coding for low power: Fundamental limits and practical realizations. IEEE Trans. Circuits and Syst.-II, 46(7):923929, July 1999.
[ $\left.\mathrm{S}^{+} 95\right]$ D. Singh et al. Power conscious CAD tools and methodologies: A perspective. Proc. of the IEEE, 83(4):570-594, Apr. 1995.
[SA75] E.E. Swartzlander and A.G. Alexopoulos. The sign/logarithm number system. IEEE Trans. Comput., C-24:1238-1242, Dec. 1975.
[SBG99] S.L. SanGregory, C. Brothers, and D. Gallagher. A fast, low-power logarithm approximation with CMOS VLSI implementation. In Proc. 42nd Midwest Symposium on Circuits and Systems, 1999.
[Sch99] A. Schaub. Hearing aid. Patent P1573, Bernafon AG, Switzerland, 1999.
[SF96] T. Sasao and M. Fujita, editors. Representation of Discrete Functions. IFIP WG 10.5 Workshop on Application of the Reed-Muller Expansion in Circuit Design. Kluwer, 1996.
[Sha48] C.E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379-423,623-656, July, Oct. 1948.
[Sha97] N.R. Shanbhag. A mathematical basis for powerreduction in digital VLSI systems. IEEE Trans. Circuits and Syst., 44(11):935-951, Nov. 1997.
[SK96] P.H. Schneider and S. Krishnamoorthy. Effects of correlations on accuracy of power analysis - an experimental study. In Proc. Int. Symp. Low Power Electronics and Design, pages 113-116, Monterey, CA, 1996.
[SP00] J. Satyanarayana and K. Parhi. Theoretical analysis of word-level switching activity in the presence of glitching and correlation. IEEE Trans. VLSI Syst., 8(2):148-159, Apr. 2000.
[SQA] Sound quality assessment material. European Broadcasting Union. CDROM (Cat.No 422 204-2).
[SR00] A.M. Steane and E.G. Rieffel. Beyond bits: The future of quantum information processing. IEEE Computer, pages 38-45, Jan. 2000.
[SRB97] N. Sankarayya, K. Roy, and D. Bhattacharya. Algorithms for low power FIR filter realization using differential coefficients. In Proc. 10th Int. Conf. on VLSI Design, pages 174-178, Jan. 1997.
[SSW96] P.H. Schneider, U. Schlichtmann, and B. Wurth. Fast power estimation of large circuits. IEEE Design $\xi$ Test of Comput., pages 70-78, Spring 1996.
[SW86] H. S. Stark and J.W. Woods. Probability, Random Processes, and Estimation Theory for Engineers. PrenticeHall, 1986.
[Szi29] L. Szilard. On the decrease of entropy in a thermodynamic system by the intervention of intelligent beings. $Z$. f. Physik, 53:840-856, 1929. (in German).
[ $\left.\mathrm{T}^{+} 95\right] \quad \mathrm{C}-\mathrm{Y}$. Tsui et al. Power estimation methods for sequential logic circuits. IEEE Trans. VLSI Syst., 3(3):404-416, 1995.
[TTSG97] S. Theoharis, G. Theodoridis, D. Soudris, and C. Goutis. A new method for switching activity estimation of logic level networks. In Proc. 7th Int. Workshop Power and Timing Modeling, Optimization and Simulation (PATMOS '97), pages 131-140, September 8-11 1997.
[WKF00] J. Wassner, H. Kaeslin, and N. Felber. Experimental speech data analysis with application to low-power DSP. Technical report, ETH Zurich, Integrated Systems Laboratory, 2000.
[WKFF99] J. Wassner, H. Kaeslin, N. Felber, and W. Fichtner. Spectral transformation of Boolean functions for probabilistic power estimation. In Proc. 9th Int. Workshop Power and Timing Modeling, Optimization and Simulation (PATMOS‘g9), pages 169-178, Greece, Oct. 1999.
[WKFF00] J. Wassner, H. Kaeslin, N. Felber, and W. Fichtner. Speech coding for energy-efficient digital signal processing. In CDROM-Proc. 43rd IEEE Midwest Symposium on Circuits and Systems (paper 05/001), Lansing, MI, USA, Aug. 2000.
[WKFF01] J. Wassner, H. Kaeslin, N. Felber, and W. Fichtner. Waveform coding for low-power digital filtering of speech data. submitted to IEEE Trans. Signal Processing, 2001.
[XN94] M.G. Xakellis and F.N. Najm. Statistical estimation of the switching activity in digital circuits. In Proc. 31rd Design Automation Conf., pages 728-733, San Diego, CA, 1994.
[Yea98] G.K. Yeap. Practical Low Power Digital VLSI Design. Kluwer Academic Publishers, 1998.
[YTK98] L.-P. Yuan, C.-C. Teng, and S.-M. Kang. Statistical estimation of average power dissipation using nonparametric techniques. IEEE Trans. VLSI Syst., 6(1):65-73, March 1998.

## Seite Leer / <br> Blank leaf

## Curriculum Vitae

Jürgen Wassner was born in Lauchhammer, Germany, on April 9, 1967. He received a technical degree in electronics before enrolling in Electrical Engineering at the Technical University of Dresden, Germany, in 1990. From the same institution he received the Dipl.-Ing. (M.Sc.) degree (with distinction) in 1995 for his diploma thesis on graph-theoretical methods in control system analysis.

During 1994-95 he spent six months as research scholar at the Department of Electrical and Computer Engineering at the University of Wisconsin, USA. There he contributed to the design of new algorithms for the modeling of electric power systems.

In 1995 he joined the European subsidiary of Silicon Valley Group (SVG) Inc. as System Engineer, working on lithographic equipment for the semiconductor industry. In this position he was promoted to supervise the technical support and advancement of SVG supplies at the semiconductor fab MOS4YOU of Philips Inc., Netherlands.

Since 1997 he has been working as research and teaching assistant at the Integrated Systems Laboratory of the Swiss Federal Institute of Technology (ETH) Zürich, Switzerland, towards his Doctoral degree. His research interest embraces all aspects of digital low-power VLSI design, specifically those related to data statistics, digital signal processing, and computer arithmetic.


[^0]:    ${ }^{1}$ From a functional point of view, there exists an upper bound of $\alpha \leq 1$ for all data signals. However, spurious signal transitions may inflate $\alpha$ beyond that limit, see section 3.4.5.

[^1]:    ${ }^{1}$ Multiple-driven nets are not considered here.

[^2]:    ${ }^{2}$ For combinational circuits it is sufficient to assume wide-sense stationarity because only first-order statistics are relevant in this case.
    ${ }^{3}$ From (A.3) follows that for a binary random variable the $i$-th moment ( $\mathrm{i}=2,3, \ldots$ ) is identical to the first moment.

[^3]:    ${ }^{4}$ The token "-" is used to indicate the reference to consecutive time points, in contrast to joint probabilities of several random sequences sampled at the same point in time.

[^4]:    ${ }^{5}$ Periodic signals are chosen solely for convenience, such that to be able to represent average statistical properties with a finite number of samples.

[^5]:    ${ }^{6}$ Note the direction of time index $k$ in figure 3.6 when evaluating the moments.

[^6]:    For comparison, if the absolute error in switching activity is uniformly distributed between $+/-0.25(+/-0.1,+/-0.05)$ then $\mathrm{RMS}_{\alpha} \approx 145 \cdot 10^{-3}\left(60 \cdot 10^{-3}, 30 \cdot 10^{-3}\right)$.

[^7]:    ${ }^{7}$ Under the same timing model (3.29), in [SP00] a different formula for the probability of switching from ' 0 ' at time point $l-1$ to ' 1 ' at time point $l$ has been derived in analogy to (3.10). In our notation this formula goes

    $$
    p_{y_{j}^{l-1}-y_{j}^{l}}^{0-1}=p_{y_{j}^{l-1}}^{1}\left(1-p_{y_{j}^{l}}^{1}\right)\left(1-\varrho_{y_{j}^{l-1}-y_{j}^{l}}\right)
    $$

    where $\varrho_{y_{j}^{l-1}-y_{j}^{l}}$ is the correlation coefficient of the random variables associated with time points $l-1$ and $l$, see (A.8). The above formula bases on the incorrect assumption that $p_{y_{j}^{l-1}-y_{j}^{l}}^{0-1}=p_{y_{j}^{l-1}-y_{j}^{l}}^{1-0}$. Although this holds if $y_{j}^{l-1}$ and $y_{j}^{l}$ refer to consecutive samples of a stationary binary signal, see proof of theorem 3.1, this assumption is fallacious under a multi-delay model. In this case, $y_{j}^{l-1}$ and $y_{j}^{l}$ do not refer to a stationary signal, because in general $p_{y_{j}^{l-1}}^{1} \neq p_{y_{j}^{l}}^{1}$.

[^8]:    ${ }^{8}$ Without loss of generality it may be assumed that all $m$ primary outputs of the combinational circuit $y_{g-m+1}, \ldots, y_{g}$ are state signals, see figure 3.1.

[^9]:    ${ }^{1}$ Different logic styles are being used in conjunction with CMOS technology. In what follows, the term CMOS will imply static CMOS logic.

[^10]:    ${ }^{2}$ Here, $k \approx 1.38 \cdot 10^{-23} \mathrm{~J} / \mathrm{K}$ is Boltzmann's constant and $T$ is the absolute temperature.

[^11]:    ${ }^{3}$ Decreasing $T$ below room temperature is another potential which, however, is not feasible in most practical situations.

[^12]:    ${ }^{4}$ In the present context, the term operation will always refer to a time-discrete digital-valued algorithm.
    ${ }^{5}$ Without loss of generality it is assumed that $T_{P}$ is the processing time for one input data item
    ${ }^{6}$ It is assumed that with the RT-level architecture all other high-level implementation options are defined. These options include the data acquisition scheme (synchronous/asynchronous), the data processing scheme (serial/parallel), and the operation frequency which is chosen such as to comply with the required processing time $T_{P}$.

[^13]:    ${ }^{7}$ Requirement (4.12) results from the question: How much closer does $P_{\text {impl }}\left(t_{1}\right)$ come to its lower bound than $P_{\text {impl }}\left(t_{2}\right)$ ? A relaxed requirement would be: $\hat{\eta}\left(t_{1}\right)>\hat{\eta}\left(t_{2}\right)$ if $\eta\left(t_{1}\right)>\eta\left(t_{2}\right)$, and $\hat{\eta}\left(t_{1}\right)<\hat{\eta}\left(t_{2}\right)$ if $\eta\left(t_{1}\right)<\eta\left(t_{2}\right)$.
    This can be satisfied by $\hat{P}(t)=a \cdot P\left(t, z^{*}\right)$ with $a \leq 1$. In fact, however, this is not easier to ensure than (4.14), since it shall hold true for any processing task $t$.

[^14]:    ${ }^{8}$ In practice, the cost of transmission usually is regarded as the number of channel uses per symbol necessary for reliable communication, since this determines the bandwidth needed. The number of channel uses per symbol corresponds to the codeword length $n$, which in the present interpretation is part of the optimization variable.

[^15]:    ${ }^{1}$ Increased absolute savings correspond to a larger area enclosed by the activity profiles under consideration. Relative savings may be inferred from the position of this area: The higher up in the diagram the larger the relative savings.

[^16]:    ${ }^{2}$ Since both operands are confined to $-1<X(k), c_{n}<1$ the product will fall in the same range. Therefore, $2(\langle x\rangle-1)$ bits are required for the product magnitude, plus one bit for the product sign.

[^17]:    ${ }^{3}$ In the present context, speech samples $X(k)$ are interpreted as integer numbers, and logarithm always is to the base two.

[^18]:    ${ }^{1}$ The arithmetic transform technique for probabilistic activity calculation in combinational circuits is described in section 3.4. For the meaning of mathematical notations used in the present example, see section 3.3.

[^19]:    ${ }^{2}$ Time index $k$ is denoted a superscript.

[^20]:    ${ }^{3}$ Note that there always exists a unique solution because the matrix of coefficients has full rank for any valid combination of values for $p_{x}^{1}$ and $p_{x-x}^{1-1}$.

[^21]:    ${ }^{4}$ The result has been verified by logic simulation.

