# 15-740/18-740 Computer Architecture Lecture 3: Performance

Prof. Onur Mutlu
Carnegie Mellon University

### Last Time ...

- Some microarchitecture ideas
  - Part of microarchitecture vs. ISA
- Some ISA level tradeoffs
  - Semantic gap
  - Simple vs. complex instructions -- RISC vs. CISC
  - Instruction length
  - Uniform decode
  - Number of registers

## Review: ISA-level Tradeoffs: Number of Registers

#### Affects:

- Number of bits used for encoding register address
- Number of values kept in fast storage (register file)
- (uarch) Size, access time, power consumption of register file

#### Large number of registers:

- + Enables better register allocation (and optimizations) by compiler → fewer saves/restores
- -- Larger instruction size
- -- Larger register file size
- -- (Superscalar processors) More complex dependency check logic
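The encoding cost in the list above can be made concrete with a minimal sketch. It assumes a fixed-width register specifier and a three-operand instruction format; the function names `register_field_bits` and `operand_bits` are hypothetical and not from the lecture.

```python
def register_field_bits(num_registers: int) -> int:
    """Bits needed to name one register out of num_registers."""
    if num_registers < 1:
        raise ValueError("need at least one register")
    # (n - 1).bit_length() equals ceil(log2(n)) for n >= 1, without floats.
    return (num_registers - 1).bit_length()


def operand_bits(num_registers: int, operands_per_instruction: int = 3) -> int:
    """Instruction bits spent on register specifiers (assumed 3-operand format)."""
    return operands_per_instruction * register_field_bits(num_registers)


for n in (8, 32, 128):
    print(f"{n} registers: {register_field_bits(n)} bits per specifier, "
          f"{operand_bits(n)} bits for three operands")
```

Going from 32 to 128 registers in this model costs two extra bits per specifier, which is six bits of a three-operand instruction.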

# ISA-level Tradeoffs: Addressing Modes

- Addressing mode specifies how to obtain an operand of an instruction
  - Register
  - Immediate
  - Memory (displacement, register indirect, indexed, absolute, memory indirect, autoincrement, autodecrement, ...)

#### More modes:

- + help better support programming constructs (arrays, pointer-based accesses)
- -- make it harder for the architect to design
- -- too many choices for the compiler?
  - Many ways to do the same thing complicates compiler design
  - Read Wulf, "Compilers and Computer Architecture"
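To illustrate how several of the modes listed above obtain an operand, here is a small sketch over a toy register file and memory. The dictionaries `regs` and `mem`, the mode names, and the function `operand` are all hypothetical and do not model any particular ISA.

```python
def operand(mode, regs, mem, reg=None, imm=None, disp=0, index=None, addr=None):
    """Return the operand value selected by a (toy) addressing mode."""
    if mode == "register":           # operand is the register itself
        return regs[reg]
    if mode == "immediate":          # operand is encoded in the instruction
        return imm
    if mode == "register_indirect":  # register holds the address
        return mem[regs[reg]]
    if mode == "displacement":       # register + constant offset
        return mem[regs[reg] + disp]
    if mode == "indexed":            # base register + index register
        return mem[regs[reg] + regs[index]]
    if mode == "absolute":           # address encoded in the instruction
        return mem[addr]
    if mode == "memory_indirect":    # memory holds the address of the operand
        return mem[mem[regs[reg]]]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"r1": 100, "r2": 8}
mem = {100: 500, 108: 42, 500: 7}

print(operand("displacement", regs, mem, reg="r1", disp=8))
print(operand("indexed", regs, mem, reg="r1", index="r2"))
```

The displacement and indexed calls reach the same memory word here, which is one small instance of "many ways to do the same thing".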

# x86 vs. Alpha Instruction Formats

#### **x86**:



#### Alpha:



Table 2-2. 32-Bit Addressing Forms with the ModR/M Byte

| Addressing mode | r8(/r)<br>r16(/r)<br>r32(/r)<br>mm(/r)<br>xmm(/r)<br>(In decimal) /digit (Opcode)<br>(In binary) REG = | | | AL<br>AX<br>EAX<br>MM0<br>XMM0<br>0<br>000 | CL<br>CX<br>ECX<br>MM1<br>XMM1<br>1<br>001 | DL<br>DX<br>EDX<br>MM2<br>XMM2<br>2<br>010 | BL<br>BX<br>EBX<br>MM3<br>XMM3<br>3<br>011 | AH<br>SP<br>ESP<br>MM4<br>XMM4<br>4<br>100 | CH<br>BP<br>EBP<br>MM5<br>XMM5<br>5<br>101 | DH<br>SI<br>ESI<br>MM6<br>XMM6<br>6<br>110 | BH<br>DI<br>EDI<br>MM7<br>XMM7<br>7<br>111 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Effective Address | Mod | R/M | Value of ModR/M Byte (in Hexadecimal) | | | | | | | |
| register indirect | [EAX]<br>[ECX]<br>[EDX]<br>[EBX]<br>[--][--]<sup>1</sup><br>disp32<sup>2</sup><br>[ESI]<br>[EDI] | 00 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | 00<br>01<br>02<br>03<br>04<br>05<br>06<br>07 | 08<br>09<br>0A<br>0B<br>0C<br>0D<br>0E<br>0F | 10<br>11<br>12<br>13<br>14<br>15<br>16<br>17 | 18<br>19<br>1A<br>1B<br>1C<br>1D<br>1E<br>1F | 20<br>21<br>22<br>23<br>24<br>25<br>26<br>27 | 28<br>29<br>2A<br>2B<br>2C<br>2D<br>2E<br>2F | 30<br>31<br>32<br>33<br>34<br>35<br>36<br>37 | 38<br>39<br>3A<br>3B<br>3C<br>3D<br>3E<br>3F |
| register + displacement | [EAX]+disp8<sup>3</sup><br>[ECX]+disp8<br>[EDX]+disp8<br>[EBX]+disp8<br>[--][--]+disp8<br>[EBP]+disp8<br>[ESI]+disp8<br>[EDI]+disp8 | 01 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | 40<br>41<br>42<br>43<br>44<br>45<br>46<br>47 | 48<br>49<br>4A<br>4B<br>4C<br>4D<br>4E<br>4F | 50<br>51<br>52<br>53<br>54<br>55<br>56<br>57 | 58<br>59<br>5A<br>5B<br>5C<br>5D<br>5E<br>5F | 60<br>61<br>62<br>63<br>64<br>65<br>66<br>67 | 68<br>69<br>6A<br>6B<br>6C<br>6D<br>6E<br>6F | 70<br>71<br>72<br>73<br>74<br>75<br>76<br>77 | 78<br>79<br>7A<br>7B<br>7C<br>7D<br>7E<br>7F |
| register + displacement | [EAX]+disp32<br>[ECX]+disp32<br>[EDX]+disp32<br>[EBX]+disp32<br>[--][--]+disp32<br>[EBP]+disp32<br>[ESI]+disp32<br>[EDI]+disp32 | 10 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | 80<br>81<br>82<br>83<br>84<br>85<br>86<br>87 | 88<br>89<br>8A<br>8B<br>8C<br>8D<br>8E<br>8F | 90<br>91<br>92<br>93<br>94<br>95<br>96<br>97 | 98<br>99<br>9A<br>9B<br>9C<br>9D<br>9E<br>9F | A0<br>A1<br>A2<br>A3<br>A4<br>A5<br>A6<br>A7 | A8<br>A9<br>AA<br>AB<br>AC<br>AD<br>AE<br>AF | B0<br>B1<br>B2<br>B3<br>B4<br>B5<br>B6<br>B7 | B8<br>B9<br>BA<br>BB<br>BC<br>BD<br>BE<br>BF |
| register | EAX/AX/AL/MM0/XMM0<br>ECX/CX/CL/MM1/XMM1<br>EDX/DX/DL/MM2/XMM2<br>EBX/BX/BL/MM3/XMM3<br>ESP/SP/AH/MM4/XMM4<br>EBP/BP/CH/MM5/XMM5<br>ESI/SI/DH/MM6/XMM6<br>EDI/DI/BH/MM7/XMM7 | 11 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | C0<br>C1<br>C2<br>C3<br>C4<br>C5<br>C6<br>C7 | C8<br>C9<br>CA<br>CB<br>CC<br>CD<br>CE<br>CF | D0<br>D1<br>D2<br>D3<br>D4<br>D5<br>D6<br>D7 | D8<br>D9<br>DA<br>DB<br>DC<br>DD<br>DE<br>DF | E0<br>E1<br>E2<br>E3<br>E4<br>E5<br>E6<br>E7 | E8<br>E9<br>EA<br>EB<br>EC<br>ED<br>EE<br>EF | F0<br>F1<br>F2<br>F3<br>F4<br>F5<br>F6<br>F7 | F8<br>F9<br>FA<br>FB<br>FC<br>FD<br>FE<br>FF |

#### NOTES:

- 1. The [--][--] nomenclature means a SIB follows the ModR/M byte.
- 2. The disp32 nomenclature denotes a 32-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is added to the index.
- 3. The disp8 nomenclature denotes an 8-bit displacement that follows the ModR/M byte (or the SIB byte if one is present) and that is sign-extended and added to the index.
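Every entry in the ModR/M table follows from the byte's three fields: Mod in bits 7-6, REG in bits 5-3, and R/M in bits 2-0. The sketch below packs and unpacks those fields; the helper names `modrm_byte` and `split_modrm` are hypothetical.

```python
def modrm_byte(mod: int, reg: int, rm: int) -> int:
    """Pack the Mod (2 bits), REG (3 bits) and R/M (3 bits) fields into one byte."""
    if not (0 <= mod <= 0b11 and 0 <= reg <= 0b111 and 0 <= rm <= 0b111):
        raise ValueError("field out of range")
    return (mod << 6) | (reg << 3) | rm


def split_modrm(byte: int):
    """Unpack a ModR/M byte into its (mod, reg, rm) fields."""
    return (byte >> 6) & 0b11, (byte >> 3) & 0b111, byte & 0b111


# Row [EBP]+disp8 (Mod=01, R/M=101), column REG=101 (CH/BP/EBP).
print(f"{modrm_byte(0b01, 0b101, 0b101):02X}")
# Register-direct form (Mod=11) with REG=101 and R/M=000 (EAX).
print(f"{modrm_byte(0b11, 0b101, 0b000):02X}")
```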

Table 2-3 is organized to give 256 possible values of the SIB byte (in hexadecimal). General purpose registers used as a base are indicated across the top of the table.

Table 2-3. 32-Bit Addressing Forms with the SIB Byte

| Addressing mode | r32<br>(In decimal) Base =<br>(In binary) Base = | | | EAX<br>0<br>000 | ECX<br>1<br>001 | EDX<br>2<br>010 | EBX<br>3<br>011 | ESP<br>4<br>100 | [\*]<br>5<br>101 | ESI<br>6<br>110 | EDI<br>7<br>111 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| | Scaled Index | SS | Index | Value of SIB Byte (in Hexadecimal) | | | | | | | |
| indexed (base + index) | [EAX]<br>[ECX]<br>[EDX]<br>[EBX]<br>none<br>[EBP]<br>[ESI]<br>[EDI] | 00 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | 00<br>08<br>10<br>18<br>20<br>28<br>30<br>38 | 01<br>09<br>11<br>19<br>21<br>29<br>31<br>39 | 02<br>0A<br>12<br>1A<br>22<br>2A<br>32<br>3A | 03<br>0B<br>13<br>1B<br>23<br>2B<br>33<br>3B | 04<br>0C<br>14<br>1C<br>24<br>2C<br>34<br>3C | 05<br>0D<br>15<br>1D<br>25<br>2D<br>35<br>3D | 06<br>0E<br>16<br>1E<br>26<br>2E<br>36<br>3E | 07<br>0F<br>17<br>1F<br>27<br>2F<br>37<br>3F |
| | [EAX\*2]<br>[ECX\*2]<br>[EDX\*2]<br>[EBX\*2]<br>none<br>[EBP\*2]<br>[ESI\*2]<br>[EDI\*2] | 01 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | 40<br>48<br>50<br>58<br>60<br>68<br>70<br>78 | 41<br>49<br>51<br>59<br>61<br>69<br>71<br>79 | 42<br>4A<br>52<br>5A<br>62<br>6A<br>72<br>7A | 43<br>4B<br>53<br>5B<br>63<br>6B<br>73<br>7B | 44<br>4C<br>54<br>5C<br>64<br>6C<br>74<br>7C | 45<br>4D<br>55<br>5D<br>65<br>6D<br>75<br>7D | 46<br>4E<br>56<br>5E<br>66<br>6E<br>76<br>7E | 47<br>4F<br>57<br>5F<br>67<br>6F<br>77<br>7F |
| scaled (base + index\*4) | [EAX\*4]<br>[ECX\*4]<br>[EDX\*4]<br>[EBX\*4]<br>none<br>[EBP\*4]<br>[ESI\*4]<br>[EDI\*4] | 10 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | 80<br>88<br>90<br>98<br>A0<br>A8<br>B0<br>B8 | 81<br>89<br>91<br>99<br>A1<br>A9<br>B1<br>B9 | 82<br>8A<br>92<br>9A<br>A2<br>AA<br>B2<br>BA | 83<br>8B<br>93<br>9B<br>A3<br>AB<br>B3<br>BB | 84<br>8C<br>94<br>9C<br>A4<br>AC<br>B4<br>BC | 85<br>8D<br>95<br>9D<br>A5<br>AD<br>B5<br>BD | 86<br>8E<br>96<br>9E<br>A6<br>AE<br>B6<br>BE | 87<br>8F<br>97<br>9F<br>A7<br>AF<br>B7<br>BF |
| | [EAX\*8]<br>[ECX\*8]<br>[EDX\*8]<br>[EBX\*8]<br>none<br>[EBP\*8]<br>[ESI\*8]<br>[EDI\*8] | 11 | 000<br>001<br>010<br>011<br>100<br>101<br>110<br>111 | C0<br>C8<br>D0<br>D8<br>E0<br>E8<br>F0<br>F8 | C1<br>C9<br>D1<br>D9<br>E1<br>E9<br>F1<br>F9 | C2<br>CA<br>D2<br>DA<br>E2<br>EA<br>F2<br>FA | C3<br>CB<br>D3<br>DB<br>E3<br>EB<br>F3<br>FB | C4<br>CC<br>D4<br>DC<br>E4<br>EC<br>F4<br>FC | C5<br>CD<br>D5<br>DD<br>E5<br>ED<br>F5<br>FD | C6<br>CE<br>D6<br>DE<br>E6<br>EE<br>F6<br>FE | C7<br>CF<br>D7<br>DF<br>E7<br>EF<br>F7<br>FF |

#### NOTES:

The [\*] nomenclature means a disp32 with no base if the MOD is 00B. Otherwise, [\*] means disp8
or disp32 + [EBP]. This provides the following address modes:

| MOD bits | Effective Address |
|----------|-------------------|
| 00 | [scaled index] + disp32 |
| 01 | [scaled index] + disp8 + [EBP] |
| 10 | [scaled index] + disp32 + [EBP] |
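The SIB byte packs the scale (SS, bits 7-6), index (bits 5-3), and base (bits 2-0). The sketch below encodes a SIB byte and evaluates the resulting effective address, including the [\*] special case described in the notes. The names `sib_byte`, `sib_effective_address`, and `REG_NAMES` are hypothetical, and the model assumes 32-bit wraparound and a caller-supplied displacement.

```python
REG_NAMES = ["EAX", "ECX", "EDX", "EBX", "ESP", "EBP", "ESI", "EDI"]
SCALE_TO_SS = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}


def sib_byte(scale: int, index: int, base: int) -> int:
    """Pack scale (1, 2, 4 or 8), index and base register numbers into a SIB byte."""
    if not (0 <= index <= 0b111 and 0 <= base <= 0b111):
        raise ValueError("register number out of range")
    return (SCALE_TO_SS[scale] << 6) | (index << 3) | base


def sib_effective_address(regs, mod, scale, index, base, disp=0):
    """Effective address for a SIB form; regs maps register names to values."""
    # Index encoding 100 means "none": no scaled index is added.
    scaled = 0 if index == 0b100 else regs[REG_NAMES[index]] * scale
    if base == 0b101 and mod == 0b00:
        base_value = 0  # [*] with MOD=00: disp32 with no base register
    else:
        base_value = regs[REG_NAMES[base]]  # base 101 otherwise means EBP
    return (base_value + scaled + disp) & 0xFFFFFFFF


regs = dict.fromkeys(REG_NAMES, 0)
regs.update(EAX=0x1000, ECX=3, EBP=0x3000)

print(f"{sib_byte(4, 1, 0):02X}")                           # [EAX + ECX*4]
print(hex(sib_effective_address(regs, 0b00, 4, 1, 0)))      # base + index*4
```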

### Other ISA-level Tradeoffs

- Load/store vs. Memory/Memory
- Condition codes vs. condition registers vs. compare&test
- Hardware interlocks vs. software-guaranteed interlocking
- VLIW vs. single instruction
- 0, 1, 2, 3 address machines
- Precise vs. imprecise exceptions
- Virtual memory vs. not
- Aligned vs. unaligned access
- Supported data types
- Software vs. hardware managed page fault handling
- Granularity of atomicity
- Cache coherence (hardware vs. software)


# Programmer vs. (Micro)architect

- Many ISA features designed to aid programmers
- But they complicate the hardware designer's job
- Virtual memory
  - vs. overlay programming
  - Should the programmer be concerned about the size of code blocks?
- Unaligned memory access
  - Compiler/programmer needs to align data
- Transactional memory?
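For the unaligned-access item above, this is roughly the bookkeeping a compiler or programmer takes on when the ISA requires aligned accesses. It is a minimal sketch; the helper names `is_aligned` and `align_up` are hypothetical.

```python
def is_aligned(address: int, size: int) -> bool:
    """True if an access of `size` bytes at `address` is naturally aligned."""
    return address % size == 0


def align_up(address: int, size: int) -> int:
    """Smallest multiple of `size` that is >= address (padding a data layout)."""
    return (address + size - 1) // size * size


print(is_aligned(0x1002, 4), hex(align_up(0x1002, 4)))
```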

## Transactional Memory

#### THREAD 1

```
// Lock-based version
enqueue (Q, v) {
  Node_t *node = malloc(...);
  node->val = v;
  node->next = NULL;
  acquire(lock);
  if (Q->tail)
    Q->tail->next = node;
  else
    Q->head = node;
  Q->tail = node;
  release(lock);
}

// Transactional version
begin-transaction
enqueue (Q, v); // no locks
end-transaction
```

#### THREAD 2

```
// Lock-based version
enqueue (Q, v) {
  Node_t *node = malloc(...);
  node->val = v;
  node->next = NULL;
  acquire(lock);
  if (Q->tail)
    Q->tail->next = node;
  else
    Q->head = node;
  Q->tail = node;
  release(lock);
}

// Transactional version
begin-transaction
enqueue (Q, v); // no locks
end-transaction
```

## Transactional Memory

- A transaction is executed atomically: ALL or NONE
- If there is a data conflict between two transactions, only one of them completes; the other is rolled back
  - Both write to the same location
  - One reads from the location another writes
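The two conflict cases above can be sketched with read and write sets. This is a toy model, not a description of any real TM implementation: the dictionary layout, the names `conflicts` and `commit_in_order`, and the in-order commit policy are all assumptions, and a real system would typically re-execute a rolled-back transaction.

```python
def conflicts(tx_a, tx_b) -> bool:
    """True if the two transactions have a write-write or read-write conflict."""
    return bool(
        tx_a["writes"] & tx_b["writes"]    # both write the same location
        or tx_a["writes"] & tx_b["reads"]  # one reads what the other writes
        or tx_a["reads"] & tx_b["writes"]
    )


def commit_in_order(transactions):
    """Commit transactions in list order; roll back any that conflict with a committed one."""
    committed, rolled_back = [], []
    for tx in transactions:
        if any(conflicts(tx, done) for done in committed):
            rolled_back.append(tx["name"])
        else:
            committed.append(tx)
    return [tx["name"] for tx in committed], rolled_back


# Two concurrent enqueues touch the same queue fields; t3 is unrelated.
t1 = {"name": "t1", "reads": {"Q.tail"}, "writes": {"Q.tail", "Q.head"}}
t2 = {"name": "t2", "reads": {"Q.tail"}, "writes": {"Q.tail", "Q.head"}}
t3 = {"name": "t3", "reads": {"x"}, "writes": {"y"}}

print(commit_in_order([t1, t2, t3]))
```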

# ISA-level Tradeoff: Supporting TM

- Still under research
- Pros:
  - Could make programming with threads easier
  - Could improve parallel program performance vs. locks. Why?

#### Cons:

- What if it does not pan out?
- All future microarchitectures might have to support the new instructions (for backward compatibility reasons)
- Complexity?
- How does the architect decide whether or not to support TM in the ISA? (How to evaluate the whole stack)

### ISA-level Tradeoffs: Instruction Pointer

- Do we need an instruction pointer in the ISA?
  - Yes: Control-driven, sequential execution
    - An instruction is executed when the IP points to it
    - IP automatically changes sequentially (except control flow instructions)
  - No: Data-driven, parallel execution
    - An instruction is executed when all its operand values are available (data flow)
- Tradeoffs: MANY high-level ones
  - Ease of programming (for average programmers)?
  - Ease of compilation?
  - Performance: Extraction of parallelism?
  - Hardware complexity?
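The data-driven alternative can be sketched as a firing rule: an instruction executes as soon as both operand values exist, with no instruction pointer ordering the work. The graph encoding and the name `run_dataflow` below are hypothetical, chosen only for illustration.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub, "mul": operator.mul}


def run_dataflow(nodes, inputs):
    """Fire every node whose operands are available; repeat until none remain.

    nodes maps a result name to (op, source_a, source_b).
    Returns the final values and the list of "waves" of nodes fired together.
    """
    values = dict(inputs)
    pending = dict(nodes)
    waves = []
    while pending:
        ready = sorted(name for name, (_, a, b) in pending.items()
                       if a in values and b in values)
        if not ready:
            raise ValueError("some operands never become available")
        for name in ready:
            op, a, b = pending.pop(name)
            values[name] = OPS[op](values[a], values[b])
        waves.append(ready)
    return values, waves


nodes = {
    "t3": ("sub", "t1", "t2"),  # listed first, but must wait for t1 and t2
    "t1": ("add", "a", "b"),
    "t2": ("mul", "c", "d"),
}
values, waves = run_dataflow(nodes, {"a": 1, "b": 2, "c": 3, "d": 4})
print(values["t3"], waves)
```

In this example `t1` and `t2` fire in the same wave, which is the parallelism the data-driven model exposes without any analysis of a sequential instruction stream.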

## The Von-Neumann Model



- Stored program computer (instructions in memory)
- One instruction at a time
- Sequential execution
- Unified memory
  - The interpretation of a stored value depends on the control signals
- All major ISAs today use this model
- Underneath (at uarch level), the execution model is very different
  - Multiple instructions at a time
  - Out-of-order execution
  - Separate instruction and data caches
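The stored-program and unified-memory points can be illustrated with a toy accumulator machine in which instructions and data share one memory array. The instruction encoding (`opcode * 100 + address`) and the four opcodes are invented for this sketch and do not correspond to any real ISA.

```python
def run(memory, pc=0, max_steps=100):
    """Execute one instruction at a time from a unified memory; return the accumulator."""
    acc = 0
    for _ in range(max_steps):
        word = memory[pc]                 # this fetch treats the word as an instruction
        opcode, addr = divmod(word, 100)  # hypothetical encoding: opcode*100 + address
        pc += 1                           # sequential execution
        if opcode == 0:                   # HALT
            return acc
        elif opcode == 1:                 # LOAD: the same memory is now read as data
            acc = memory[addr]
        elif opcode == 2:                 # ADD
            acc += memory[addr]
        elif opcode == 3:                 # STORE
            memory[addr] = acc
        else:
            raise ValueError(f"unknown opcode {opcode}")
    raise RuntimeError("step limit reached")


# Words 0-3 are the program, words 5-7 are data; nothing in memory marks the difference.
memory = [105, 206, 307, 0, 0, 20, 22, 0]
result = run(memory)
print(result, memory[7])
```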

### Fundamentals of Uarch Performance Tradeoffs



- Instruction Supply
  - Zero-cycle latency (no cache miss)
  - No branch mispredicts
  - No fetch breaks
- Data Path (Functional Units)
  - Perfect data flow (reg/memory dependencies)
  - Zero-cycle interconnect (operand communication)
  - Enough functional units
  - Zero latency compute?
- Data Supply
  - Zero-cycle latency
  - Infinite capacity
  - Zero cost

We will examine all these throughout the course (especially data supply)

## How to Evaluate Performance Tradeoffs
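Execution time is commonly decomposed into three multiplicative factors, the same three that the following slides set out to reduce:

$$
\text{Execution time} = \frac{\text{time}}{\text{program}} = \frac{\text{instructions}}{\text{program}} \times \frac{\text{cycles}}{\text{instruction}} \times \frac{\text{time}}{\text{cycle}}
$$

A design choice that improves one factor often worsens another, so a tradeoff has to be judged by its effect on the product.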

# Improving Performance

- Reducing instructions/program
- Reducing cycles/instruction (CPI)
- Reducing time/cycle (clock period)

# Improving Performance (Reducing Exec Time)

- Reducing instructions/program
  - More efficient algorithms and programs
  - Better ISA?
- Reducing cycles/instruction (CPI)
  - Better microarchitecture design
    - Execute multiple instructions at the same time
    - Reduce latency of instructions (1-cycle vs. 100-cycle memory access)
- Reducing time/cycle (clock period)
  - Technology scaling
  - Pipelining
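The effect of these three levers on execution time can be checked numerically. The sketch below assumes the usual product of instruction count, CPI, and clock period; all numbers are made up for illustration, and the function names `execution_time_ns` and `speedup` are hypothetical.

```python
def execution_time_ns(instructions: float, cpi: float, clock_period_ns: float) -> float:
    """Execution time as instructions/program * cycles/instruction * time/cycle."""
    return instructions * cpi * clock_period_ns


def speedup(old_time: float, new_time: float) -> float:
    """How many times faster the new design is than the old one."""
    return old_time / new_time


# Hypothetical baseline: 1e9 instructions, CPI 2.0, 1.0 ns clock period.
baseline = execution_time_ns(1e9, cpi=2.0, clock_period_ns=1.0)
# Hypothetical redesign: CPI halves, but the clock period grows by 25%.
redesign = execution_time_ns(1e9, cpi=1.0, clock_period_ns=1.25)

print(baseline, redesign, speedup(baseline, redesign))
```

Here the CPI improvement outweighs the slower clock, but the same function shows the opposite outcome if the clock period grows by more than 2x.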

# Improving Performance: Semantic Gap

- Reducing instructions/program
  - Complex instructions: small code size (+)
  - Simple instructions: large code size (--)
- Reducing cycles/instruction (CPI)
  - Complex instructions: (can) take more cycles to execute (--)
    - REP MOVS
    - How about ADD with condition code setting?
  - Simple instructions: (can) take fewer cycles to execute (+)
- Reducing time/cycle (clock period)
  - Does instruction complexity affect this?
    - It depends
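As a hedged illustration with made-up numbers (none of them from the lecture), suppose a complex-instruction ISA needs $0.6N$ instructions at a CPI of 3, a simple-instruction ISA needs $N$ instructions at a CPI of 1.5, and both run with the same clock period $T$:

$$
t_{\text{complex}} = 0.6N \times 3 \times T = 1.8\,NT, \qquad t_{\text{simple}} = N \times 1.5 \times T = 1.5\,NT
$$

Under these assumptions the smaller instruction count does not translate into lower execution time; different CPI or clock-period assumptions can reverse the result.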