#### 18-344: Computer Systems and the Hardware-Software Interface Fall 2025



**Lecture 11: Design Space Exploration**

#### **Course Description**

This course covers the design and implementation of computer systems from the perspective of the hardware-software interface. The purpose of this course is for students to understand the relationship between the operating system, software, and computer architecture. Students who complete the course will have learned operating system fundamentals, computer architecture fundamentals, compilation to hardware abstractions, and how software actually executes from the perspective of the hardware-software boundary. The course will focus especially on understanding the relationships between software and hardware, and how those relationships influence the design of a computer system's software and hardware. The course will convey these topics through a series of practical, implementation-oriented lab assignments.

Credit: Brandon Lucia

#### Today: Design Space Exploration

- Defining the design space of a hardware or software system
- Pareto Frontiers and optimizing within a design space
- Applied Performance Evaluation
  - Finding the best performing design under constraints

### Defining a design space

- A design space is a set of possible incarnations of a system
- A design space is defined over a set of parameters
- A point in the design space is a concrete system with a concrete value for each of the design space's parameters
- Design spaces exist to allow systematic exploration of a collection of possible designs, like architectures.
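
As a minimal sketch of these definitions, a design space can be written as a mapping from parameter names to candidate values, and each point is one concrete binding of every parameter. The parameter names and values below are hypothetical, loosely modeled on a branch predictor, and are not taken from the lecture.

```python
from itertools import product

# Hypothetical parameters and candidate values, chosen only for illustration.
PARAMETERS = {
    "bht_entries": [1024, 16384],
    "bht_entry_bits": [1, 2],
    "index_hash": ["xor", "hash"],
    "btb_assoc": [1, 2, 4],
}

def enumerate_design_space(parameters):
    """Yield every point: one concrete value bound to each parameter."""
    names = list(parameters)
    for values in product(*(parameters[name] for name in names)):
        yield dict(zip(names, values))

points = list(enumerate_design_space(PARAMETERS))
print(len(points))  # 2 * 2 * 2 * 3 = 24 points
print(points[0])
```

The size of the space is the product of the number of values per parameter, which is why exhaustive enumeration stops being practical as parameters are added.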

#### Example: Branch Predictor Design Space



What are the parameters / dimensions of this branch predictor's design?

These parameters are the dimensions of a design space vector:

| Parameter        | Symbol |
|------------------|--------|
| GHT size         | Sg     |
| BHT # entries    | Nb     |
| GHT/PC hash func | Hp     |
| BHT entry size   | Sb     |
| BranchID hash    | Hb     |
| BTB # entries    | Nt     |
| BTB assoc        | At     |

A point in this design space is a vector of concrete values for (Sg, Nb, Hp, Sb, Hb, Nt, At).


## Can find a good design by measuring points in the design space



#### Is one of these better?





#### Plotting the design space: Geometric view of design dimensions

(Figure: designs plotted along design-parameter axes such as Sg, Nb, Hp, Sb, Hb, Nt, and At, with example values like 16, 1k, 16k, xor(), and hash().)

- Limited medium: too many dimensions to render visually
- Limited interpretability: what does position mean?
- Can be helpful for clustering designs if non-obvious

## Plotting the design space: Geometric view of figures of merit

- Simple medium: can easily render multiple FoMs & designs
- Limited view of designs: points do not show design info
- Benefit: allows comparing designs in multiple dimensions

FoM = "Figure of Merit", i.e. an attribute we care about.

#### Plotting many designs to study a tradeoff



### Which points in this plot are optimal?





#### Pareto Optimality of Design Alternatives



#### **Pareto Optimality:**

A design is Pareto optimal if no change to it leads to improvement in one dimension without a loss in at least one other dimension.

Vilfredo Pareto



#### Design Consequence of Pareto Optimality

Never select a design that is off the frontier, at least not without a motivation from outside the plot. For any design off the frontier, some design on the frontier achieves the same or better performance at the same or lower cost w.r.t. the plotted dimensions, and is strictly better in at least one of them.
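
The frontier can be computed directly from the definition of dominance. The sketch below assumes two plotted dimensions, a cost to minimize and a performance to maximize. The `(cost, performance)` values are made up for illustration.

```python
def dominates(a, b):
    """True if design a is at least as good as b in both dimensions
    (lower cost, higher performance) and strictly better in at least one."""
    cost_a, perf_a = a
    cost_b, perf_b = b
    no_worse = cost_a <= cost_b and perf_a >= perf_b
    strictly_better = cost_a < cost_b or perf_a > perf_b
    return no_worse and strictly_better

def pareto_frontier(designs):
    """Return the designs that no other design dominates."""
    return [d for d in designs if not any(dominates(o, d) for o in designs)]

# Hypothetical (cost, performance) measurements.
designs = [(1.0, 10.0), (2.0, 30.0), (2.5, 25.0), (3.0, 40.0), (4.0, 35.0), (4.0, 40.0)]
print(pareto_frontier(designs))  # [(1.0, 10.0), (2.0, 30.0), (3.0, 40.0)]
```

Note that `(4.0, 40.0)` is excluded even though it ties for the best performance, because `(3.0, 40.0)` delivers the same performance at lower cost. This pairwise check is quadratic in the number of designs, which is fine for plots with a few thousand points.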



### Worthwhile Options Are Along the Pareto Frontier



### Design Space Exploration

- Applied Performance Evaluation to find the best feasible system
  - Define a system's important design parameters
  - Define a system's figure(s) of merit
  - Define a set of constraints on the feasibility of a binding of design parameters
  - Choose a feasible parameter setting and measure its merit
  - Iterate until satisfied:
    - If this system is better than the last one, keep it. If worse, discard it.
    - Choose a parameter and change it
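
The loop above can be sketched as a simple greedy search. Everything concrete here is an assumption for illustration: the parameter choices, the feasibility rule, and especially `merit`, which stands in for a real measurement such as a simulator run.

```python
import random

# Hypothetical parameters and candidate values.
CHOICES = {
    "bht_entries": [256, 1024, 4096, 16384],
    "btb_entries": [64, 256, 1024],
    "btb_assoc": [1, 2, 4],
}

def feasible(design):
    """Made-up constraint: cap BTB associativity at 2."""
    return design["btb_assoc"] <= 2

def merit(design):
    """Stand-in for a real measurement. Higher is better."""
    benefit = design["bht_entries"] ** 0.5 + design["btb_entries"] ** 0.5 * design["btb_assoc"]
    cost = 0.002 * (design["bht_entries"] + design["btb_entries"])
    return benefit - cost

def explore(start, steps=200, seed=0):
    rng = random.Random(seed)
    best, best_merit = dict(start), merit(start)
    for _ in range(steps):
        candidate = dict(best)
        name = rng.choice(list(CHOICES))             # choose a parameter...
        candidate[name] = rng.choice(CHOICES[name])  # ...and change it
        if not feasible(candidate):
            continue                                 # skip infeasible bindings
        m = merit(candidate)
        if m > best_merit:                           # keep if better, else discard
            best, best_merit = candidate, m
    return best, best_merit

start = {"bht_entries": 256, "btb_entries": 64, "btb_assoc": 1}
best, score = explore(start)
print(best, round(score, 2))
```

Because only one parameter changes per step and worse designs are always discarded, this search can settle on a design that is only locally best.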

#### Constraining your design space



#### Physical design constraints

Max BP power = 4mW

Max BTB associativity = 2

Max memory (BTB+BHT) = 20kB

Design candidates are often described as needing to "make PPA":

- Power
- Performance
- Area
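
A feasibility check for the three constraints above might look like the following sketch. The limits come from the slide. The `Design` fields, the 8-byte BTB entry size, and the assumption that power arrives from a separate estimate are all hypothetical.

```python
from dataclasses import dataclass

# Limits taken from the slide.
MAX_POWER_MW = 4.0
MAX_BTB_ASSOC = 2
MAX_MEMORY_BYTES = 20 * 1024  # assumes 1 kB = 1024 bytes

BTB_ENTRY_BYTES = 8  # hypothetical: tag plus target per BTB entry

@dataclass(frozen=True)
class Design:
    bht_entries: int     # Nb
    bht_entry_bits: int  # Sb
    btb_entries: int     # Nt
    btb_assoc: int       # At
    power_mw: float      # assumed to come from a separate power estimate

def memory_bytes(d):
    """Total BHT + BTB storage in bytes."""
    bht = d.bht_entries * d.bht_entry_bits / 8
    btb = d.btb_entries * BTB_ENTRY_BYTES
    return bht + btb

def is_feasible(d):
    return (d.power_mw <= MAX_POWER_MW
            and d.btb_assoc <= MAX_BTB_ASSOC
            and memory_bytes(d) <= MAX_MEMORY_BYTES)

small = Design(bht_entries=16384, bht_entry_bits=2, btb_entries=1024, btb_assoc=2, power_mw=3.1)
big = Design(bht_entries=65536, bht_entry_bits=2, btb_entries=4096, btb_assoc=4, power_mw=5.0)
print(is_feasible(small), is_feasible(big))  # True False
```

Checking feasibility before measuring merit avoids spending simulation time on designs that could never be built.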

### Design Space Exploration

- Applied Performance Evaluation to find the best feasible system
  - Define a system's important design parameters
  - Define a system's figure(s) of merit
  - Define a set of constraints on the feasibility of a binding of design parameters
  - Choose a feasible parameter setting and measure its merit
  - Iterate until satisfied:
    - If this system is better than the last one, keep it. If worse, discard it.
    - Choose a parameter and change it
    - Random restarts to search different sub-spaces
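
To see why random restarts help, consider a toy landscape with two peaks. The merit values below are invented for illustration, and the single integer parameter stands in for a full design vector.

```python
import random

# Hypothetical merit for each of 12 settings of one parameter.
# Index 2 is a local peak; index 9 is the global peak.
MERIT = [1, 4, 6, 3, 2, 1, 2, 5, 8, 9, 7, 4]

def hill_climb(start):
    """Move to the better neighbor until no neighbor improves."""
    current = start
    while True:
        neighbors = [i for i in (current - 1, current + 1) if 0 <= i < len(MERIT)]
        best = max(neighbors, key=lambda i: MERIT[i])
        if MERIT[best] <= MERIT[current]:
            return current
        current = best

def search_with_restarts(restarts, seed=0):
    """Hill-climb from several random starting points and keep the best result."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(MERIT)) for _ in range(restarts)]
    return max((hill_climb(s) for s in starts), key=lambda i: MERIT[i])

print(hill_climb(0))  # 2: climbs 0 -> 1 -> 2 and stops at the local peak
print(hill_climb(6))  # 9: climbs 6 -> 7 -> 8 -> 9, the global peak
print(search_with_restarts(20))
```

A single climb is only as good as its starting point. Each restart samples a different sub-space, so the chance of missing the global peak shrinks as restarts are added.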

#### Systematically fill out your design space



# Example of Design Space Optimization: The Q100 Database Acceleration Architecture

#### Q100: The Architecture and Design of a Database Processing Unit

Lisa Wu, Andrea Lottarini, Timothy K. Paine, Martha A. Kim, Kenneth A. Ross

Columbia University, New York, NY



#### **Cutting edge database query hardware accelerator**

- "GPU for SQL & Database operations"
- Architecture built up of a collection of special computing tiles in hardware
- Each tile runs a particular kind of database operation
- Tiles connected by configurable wires that can be set up to make circuits to do a database query
- (Includes one of the best design space explorations I've encountered in a research paper)

| Type       | Tile        | Area ($mm^2$) | Area (% Xeon) | Power (mW) | Power (% Xeon) | Critical Path (ns) | Record Width (bits) | Column Width (bits) | Comparator Width (bits) | Other Constraint       |
|------------|-------------|---------------|---------------|------------|----------------|--------------------|---------------------|---------------------|-------------------------|------------------------|
| Functional | Aggregator  | 0.029         | 0.07%         | 7.1        | 0.14%          | 1.95               |                     | 256                 | 256                     |                        |
| Functional | ALU         | 0.091         | 0.21%         | 12.0       | 0.24%          | 0.29               |                     | 64                  | 64                      |                        |
| Functional | BoolGen     | 0.003         | 0.01%         | 0.2        | < 0.01%        | 0.41               |                     | 256                 | 256                     |                        |
| Functional | ColFilter   | 0.001         | < 0.01%       | 0.1        | < 0.01%        | 0.23               |                     | 256                 |                         |                        |
| Functional | Joiner      | 0.016         | 0.04%         | 2.6        | 0.05%          | 0.51               | 1024                | 256                 | 64                      |                        |
| Functional | Partitioner | 0.942         | 2.20%         | 28.8       | 0.58%          | ***3.17            | 1024                | 256                 | 64                      |                        |
| Functional | Sorter      | 0.188         | 0.44%         | 39.4       | 0.79%          | 2.48               | 1024                | 256                 | 64                      | 1024 entries at a time |
| Auxiliary  | Append      | 0.011         | 0.03%         | 5.4        | 0.11%          | 0.37               | 1024                | 256                 |                         |                        |
| Auxiliary  | ColSelect   | 0.049         | 0.11%         | 8.0        | 0.16%          | 0.35               | 1024                | 256                 |                         |                        |
| Auxiliary  | Concat      | 0.003         | 0.01%         | 1.2        | 0.02%          | 0.28               |                     | 256                 |                         |                        |
| Auxiliary  | Stitch      | 0.011         | 0.03%         | 5.4        | 0.11%          | 0.37               |                     | 256                 |                         |                        |

Considerations that feed into the design space optimization problem statement:

- Choose a number of each tile. In the original work, they used a simulation to decide how many tiles yield no more performance benefit and used that number to bound how many of each tile they consider.

| Tile        | Maximum Useful Count | "Tiny" Tile | Tile Counts Explored |
|-------------|----------------------|-------------|----------------------|
| Aggregator  | 4                    | X           | 4                    |
| ALU         | 5                    |             | 1 to 5               |
| BoolGen     | 6                    | X           | 6                    |
| ColFilter   | 6                    | X           | 6                    |
| Joiner      | 4                    | X           | 4                    |
| Partitioner | 5                    |             | 1 to 5               |
| Sorter      | 6                    |             | 1 to 6               |
| Append      | 8                    | X           | 8                    |
| ColSelect   | 7                    | X           | 7                    |
| Concat      | 2                    | X           | 2                    |
| Stitch      | 3                    | X           | 3                    |

- Estimate area w.r.t. a reasonable baseline. In the original work, they compared to a server processor, the Intel E5620 Xeon. The comparison included all of the connecting wires, buffers, etc.
- Want to minimize power. In the original work, they limited the number of high-power (10s or 100s of mW) tiles to 0, 1, or 2, and allowed arbitrary counts of "tiny" functional units that have <10 mW.
- Frequency is limited by tile latency. An aggressively pipelined design means that the critical path delay defines the maximum frequency of the design (the Partitioner always defines the frequency for Q100).
- Simulate the design on a standard DB benchmark. Collect run time measurements for the Transaction-Processing benchmark (TPC-H), which stresses a database system without being bottlenecked by fetching from memory.

Design space optimization problem statement:

Choose the right mixture of tiles to have the best performance and power without using too much area or limiting frequency
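
As a rough sketch of how one candidate tile mixture could be scored, the snippet below sums per-tile area and power from the tile table and derives a frequency bound from the slowest tile. It assumes costs add linearly per tile and ignores the interconnect and memory, so it is an idealized, tiles-only estimate rather than the paper's evaluation method.

```python
# Per-tile (area mm^2, power mW, critical path ns), copied from the tile table.
TILES = {
    "Aggregator":  (0.029,  7.1, 1.95),
    "ALU":         (0.091, 12.0, 0.29),
    "BoolGen":     (0.003,  0.2, 0.41),
    "ColFilter":   (0.001,  0.1, 0.23),
    "Joiner":      (0.016,  2.6, 0.51),
    "Partitioner": (0.942, 28.8, 3.17),
    "Sorter":      (0.188, 39.4, 2.48),
    "Append":      (0.011,  5.4, 0.37),
    "ColSelect":   (0.049,  8.0, 0.35),
    "Concat":      (0.003,  1.2, 0.28),
    "Stitch":      (0.011,  5.4, 0.37),
}

def evaluate(mix):
    """Return (area mm^2, power mW, max frequency MHz) for a tile mix.
    Tiles only: ignores interconnect and memory costs."""
    used = {tile: count for tile, count in mix.items() if count > 0}
    area = sum(TILES[tile][0] * count for tile, count in used.items())
    power = sum(TILES[tile][1] * count for tile, count in used.items())
    critical_path_ns = max(TILES[tile][2] for tile in used)
    return area, power, 1000.0 / critical_path_ns

one_of_each = {tile: 1 for tile in TILES}
area, power, fmax = evaluate(one_of_each)
print(f"{area:.3f} mm^2, {power:.1f} mW, {fmax:.0f} MHz")
```

Any mix that includes the Partitioner is bounded by its 3.17 ns critical path, which works out to roughly 315 MHz.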

#### Q100 Pareto Frontier



Pareto plot from a research paper on the Q100 Database accelerator by Wu et al, ASPLOS 2014

- How did they select magenta points?
- What other points might they have selected?
- What is the value in seeing all these points?

- "Pareto Design" as used in the paper means the design that maximizes (runtime) performance per watt.
- Although there were designs with nominally better runtime, the goal of the paper was to select three options for further study. The two options with a nominally better runtime were only negligibly better but at a much higher cost in terms of energy, rendering them less interesting to the authors.
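
If performance is taken to be the inverse of runtime, then performance per watt equals 1 / (runtime × power), so maximizing it is the same as minimizing energy. The sketch below uses invented measurements, not values from the paper, to show a nominally fastest design losing on this metric.

```python
# Hypothetical (runtime in ms, power in W) measurements; not from the paper.
designs = {
    "A": (90.0, 0.4),
    "B": (50.0, 0.6),
    "C": (40.0, 1.0),
    "D": (39.0, 1.6),
}

def perf_per_watt(runtime_ms, power_w):
    """Performance is 1/runtime, so perf/W = 1 / (runtime * power) = 1 / energy."""
    return 1.0 / (runtime_ms * power_w)

best = max(designs, key=lambda name: perf_per_watt(*designs[name]))
fastest = min(designs, key=lambda name: designs[name][0])

for name, (runtime_ms, power_w) in designs.items():
    print(f"{name}: {runtime_ms * power_w:.1f} mJ")
print("best perf/W:", best, "| fastest:", fastest)
```

Here D is about 2.5% faster than C but draws 60% more power, the kind of negligible runtime gain at a much higher energy cost that the bullet above describes.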

#### Results of Design Space Exploration

| Design   | Tiles Area ($mm^2$) | NoC Area ($mm^2$) | SBs Area ($mm^2$) | Total Area ($mm^2$) | Total Area (% Xeon) | Tiles Power (W) | NoC Power (W) | SBs Power (W) | Total Power (W) | Total Power (% Xeon) |
|----------|---------------------|-------------------|-------------------|---------------------|---------------------|-----------------|---------------|---------------|-----------------|----------------------|
| LowPower | 1.890               | 0.567             | 0.520             | 2.978               | 7.0%                | 0.238           | 0.071         | 0.400         | 0.710           | 14.2%                |
| Pareto   |                     |                   |                   |                     |                     |                 |               |               |                 |                      |
| HighPerf | 5.080               | 1.524             | 0.780             | 7.384               | 17.3%               | 0.541           | 0.162         | 0.600         | 1.303           | 26.1%                |



The final results show the idealized design alongside results that add in the costs of the on-chip network and memory access bandwidth.

#### Heat Plots Can Be Used to Explore 2D Space



Figure 10. Even with a LowPower design, the communication bandwidth for most connections exceed the provisioned 6.3 GB/s NoC bandwidth, marked as X's in the figures.

Figure 11. Similar to connection count heat map, Pareto design maximum intra-connection bandwidth exhibit almost identical behavior as HighPerf design.

Figure 12. Heat map of HighPerf design max bandwidth per connection.

Here heat plots are used to show the communication bandwidth needed between tiles and which design elements exceed a reference threshold.

#### Q100 Takeaways / What did we just learn

- Practical application of design space exploration
- Defined design space based on tiles and connections between tiles
- Defined constraints and optimization goals based on power, area, frequency
- Ran experiments to produce a Pareto frontier with performance and power as the main design dimensions
- Final designs come from the Pareto frontier: fast, balanced, low-power
- Compared the design to characteristics of a known baseline (Xeon)

#### What to think about next?

- Miscellaneous (micro)architectural tricks & optimizations (future)
  - Super-scalar Out-of-Order
  - VLIW
  - Vector processors / SIMD
  - SIMT/GPU