#### 18-344: Computer Systems and the Hardware-Software Interface Fall 2025



COUISE DESCRIPTION Lecture 15: Advanced Architecture: Superscalar and Out of Order

This course covers the design and implementation of computer systems from the perspective of the hardware software interface. The purpose of this course is for students to understand the relationship between the operating system, software, and computer architecture. Students that complete the course will have learned operating system fundamentals, computer architecture fundamentals, compilation to hardware abstractions, and how software actually executes from the perspective of the hardware software/boundary. The course will focus especially on understanding the relationships between software and hardware, and how those relationships influence the design of a computer system's software and hardware. The course will convey these topics through a series of practical, implementation-oriented lab assignments.

Credit: Brandon Lucia

### Today: Advanced Microarchitecture Techniques

 Advanced Instruction-Level Parallelism: Multiple Issue, Out of Order Execution, Register Renaming, SMT













What is the best performance that we can ever get out of a pipeline like the one we have been studying? (how do we answer this question?)



Iron Law of Processor Performance:

Instr / Prog x Cycles / Instr x Seconds / Cycle

Fundamental limits to each of these terms in our current pipeline?

# Thinking about latency (again) to optimize for cycle time



What is the implication of mul having a 3ns latency, compared to the latency of each of the other stages?

# Thinking about latency (again) to optimize for cycle time



What is the implication of mul having a 3ns latency, compared to the latency of each of the other stages?

333MHz max clock frequency

(despite 1GHz being OK for non-mul operations)



Break the multiply unit into 3 parts, each of which takes 1ns, equalizing all stages' latencies



Back-to-back multiplies keep the mul pipe full, at 1GHz latency



Back-to-back non-mul ops keep the pipe full, at 1GHz latency



Question: What about add mul add mul?



Question: What about add mul add mul?



Question: What about add mul add mul?



**Problem?** 

### Instructions might complete out of order if we are not careful!



In addition to the unfortunate **stall in the memory stage**, the add and the mul **execute in the wrong order**!

#### Avoiding out-of-order completion



Hard to avoid the *stall...*Can avoid the *ordering problem* with extra stall logic in **Ex** 

### Let's Rewind: Anything interesting about this snapshot in time?



# Independent FUs allow us to optimize IPC directly by increasing ILP



This pipeline is **Ex**ecuting multiple instructions at the same time on different functional units. **ILP begets IPC!** 

Superscalar Out of Order Execution

### A Superscalar Processor Executes Multiple Instructions at the Same Time



Scalar executes one instruction at a time
Superscalar executes multiple instructions at a time



(Here, we give up on the detailed pipeline diagram due to the increased complexity of the design.)





Third idea: Add issue queue

required for n-wide issue?



Fifth idea: Add *multiple* 

execute units to which to

dispatch operations after they

**Fourth idea:** Decouple **register read** from decode. Register read happens for **issued** instructions now

Seventh idea: Add *multiple* write ports to register file to allow simultaneous multiple register writebacks



Sixth idea: Handle multiple outstanding memory operations in memory system (complex! we will mostly ignore this part)



Fetch:



Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

**Decode:** 



Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

Decode: Not too bad, just replication of resources



Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

Decode: Not too bad, just replication of resources



Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

Issue:

Decode: Not too bad, just replication of resources



Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

Issue: Dependence / hazard detection logic complexity.
Need to detect dependences between all instructions in issue queue and some combinations of instructions cannot issue simultaneously

### Superscalar processors: Challenges & sources

of complexity

Decode: Not too bad, just replication of resources

0010
1100
1010
0101
0100
0100

mul

Decode

Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

Reg Read: Multi-porting register file has high cost (4-wide = 8 read ports) & area cost is proportional to square of port count









Issue: Dependence / hazard detection logic complexity.
Need to detect dependences between all instructions in issue queue and some combinations of instructions cannot issue simultaneously

#### Superscalar processors: Challenges & sources

of complexity

Decode: Not too bad, just replication of resources

add mul

Decode

Fetch: Branch prediction more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

0010

1100

1010

0101

0100

0100

Reg Read: Multi-porting register file has high cost (4-wide = 8 read ports) & area cost is proportional to square of port count









Execute / Memory:

Issue: Dependence / hazard detection logic complexity.
Need to detect dependences between all instructions in issue queue and some combinations of instructions cannot issue simultaneously

#### Superscalar processors: Challenges & sources

of complexity

Decode: Not too bad, just replication of resources

add mul

add mul mul add **Reg Read: Multi-porting** register file has high cost (4wide = 8 read ports) & area cost is proportional to square of port count









**Fetch: Branch prediction** more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

0010

1100

1010

0101

0100

0100

Issue: Dependence / hazard detection logic complexity. **Need to detect dependences** between all instructions in issue queue and some combinations of instructions cannot issue simultaneously

**Execute / Memory: More** execute units, more cache ports. Forwarding paths & input operand selection logic become very complicated.

#### Superscalar processors: Challenges & sources

of complexity

Decode: Not too bad, just replication of resources

**Reg Read: Multi-porting** register file has high cost (4wide = 8 read ports) & area cost is proportional to square of port count

Reg. WB: Write port per instruction that may complete that writes a register (4-wide = 4 write ports)















**Fetch: Branch prediction** more complex. Risk of overfetch because we're fetching a whole block? Must consider multiple, sequential fetches based on predictions

Issue: Dependence / hazard detection logic complexity. **Need to detect dependences** between all instructions in issue queue and some combinations of instructions cannot issue simultaneously

**Execute / Memory: More** execute units, more cache ports. Forwarding paths & input operand selection logic become very complicated.

### Remaining limits on performance of this processor?



Application itself may not have ample ILP

#### In-order issue rule:

"Unlucky" sequence of instructions may prevent multiple issue. (e.g., the first add and the mul can issue together, but the second add prevents it.)

#### Out of Order Execution

**Execute** instructions from the issue In-order Front-end In-order Commit window fully out of order even if instructions have a WAW or WAR dependence that would prevent 0010 them from superscalar issuing 1100 add together (how!?) 1010 add mul add add mu1 mu] mul 0101 0100 ALU (non-SW mul mul 0100 mul) (Rename) lw mul /Dispatch ALU (non-Decode **Commit** SW mul) **Commit** in order lw **Dispatch** instructions into an *issue* add lw to respect window that issues instructions to original program Register execute *as soon as input operands* Memory semantics Reg. Read mul0 mul1mul2 Write-Back Issue are available

**Out of Order Execution** 

Register Renaming Resolves Dependences that Prevent Instructions from Executing Together



Eliminate WAW, WAR, and preserve RAW (why?)

# In-order commit tracks instruction completion and ensures architectural state updates in order



#### All Types of Data Hazards Matter in OoO Execution

```
      sub
      x6
      x8
      x16
      x4
      lw
      x6
      0xabc

      lw
      x16
      0xabc
      add
      x16
      x6
      x14
      sub
      x6
      x5
      x4

      add
      x12
      x6
      x14
      lw
      x16
      0xabc
      add
      x12
      x6
      x14
```

Read-After-Write (RAW)

Write-After-Read (WAR)

Write-After-Write (WAW)

Only Read-After-Write (RAW) hazards are possible in our simple pipeline

lw x6 0xabc
sub x6 x5 x4
add x12 x6 x14

Write-After-Write (WAW)

lw x6 0xabc

Fetch

Decode

**Execute** 

Memory

Memory

Memory

Register Write-Back

lw x6 0xabc
sub x6 x5 x4
add x12 x6 x14

Write-After-Write (WAW)



sub x6 x5 x4

lw x6 0xabc
sub x6 x5 x4
add x12 x6 x14

Write-After-Write (WAW)



lw x6 0xabc
sub x6 x5 x4
add x12 x6 x14

Write-After-Write (WAW)



lw x6 0xabc
sub x6 x5 x4
add x12 x6 x14

Write-After-Write (WAW)

Fetch

Decode

Execute

Memory



**Write-After-Write (WAW)** 

#### Multi-cycle latency memory op



#### Non-mem-op, single memory cycle

Earlier 1w instruction finishes after later sub instruction. Both write x6. Wrong final value in x6. Explicitly handled with logic to maintain ordering in processors that allow this behavior (not our datapath)

sub x8 x16 x4 add x16 x6 x14 lw x11 0xabc

Stalled at decode/reg. read

Write-After-Read (WAR)



Completes quickly and writes reg.

Later add instruction writes x16 before earlier sub instruction reads x16. sub sees wrong value!

A1: add x6 x8 x11 M1: mul x9 x6 x13 A2: add x6 x17 x30 A3: add x7 x9 x14 M2: add x8 x18 x6 A4: add x6 x7 x9



Question: How can instructions issue to our out-of-order pipeline in which instructions may execute and complete out of order?

If WAW or WAR, can't just dispatch or OoO execution may read regs not yet updated

A1: add x6 x8 x11 M1: mul x9 x6 x13 A2: add x6 x17 x30 A3: add x7 x9 x14 M2: add x8 x18 x6 A4: add x6 x7 x9



Rename Table A1.x6 -> r0

A1: add x6 x8 x11 M1: mul x9 x6 x13 A2: add x6 x17 x30 A3: add x7 x9 x14 M2: add x8 x18 x6 A4: add x6 x7 x9



Rename Table

A1.x6 -> r0

M1.x9 -> r1

M1.x6 <- r0

RAW dependence on x6 M1 waiting on result from A1 (r0)

A1: add x6 x8 x11 M1: mul x9 x6 x13 A2: add x6 x17 x30 A3: add x7 x9 x14 M2: add x8 x18 x6 A4: add x6 x7 x9



Rename Table

 $A1.x6 \rightarrow r0$ 

M1.x9 -> r1

M1.x6 <- r0

A2.x6 -> r2

WAW dep b/w A1 & A2 & WAR dep w/ M1 Resolved by renaming output regs

A1: add x6 x8 x11 M1: mul x9 x6 x13 A2: add x6 x17 x30 A3: add x7 x9 x14 M2: add x8 x18 x6 A4: add x6 x7 x9



# Rename Table A1.x6 -> r0 M1.x9 -> r1 M1.x6 <- r0 A2.x6 -> r2 A3.x7 -> r3 A3.x9 <- r1 M2.x8 -> r4

RAW dependence between M1 & A3 Cannot be resolved by renaming

```
A1: add x6 x8 x11
M1: mul x9 x6 x13
A2: add x6 x17 x30
A3: add x7 x9 x14
M2: add x8 x18 x6
A4: add x6 x7 x9
```



Rename Table
A1.x6 -> r0
M1.x9 -> r1
M1.x6 <- r0
A2.x6 -> r2
A3.x7 -> r3
A3.x9 <- r1
M2.x8 -> r4
M2.x6 <- r2

WAW dep w/ A1 resolved by renaming True dep w/ A2 resolved by looking up renamed result of A2

```
A1: add x6 x8 x11
M1: mul x9 x6 x13
A2: add x6 x17 x30
A3: add x7 x9 x14
M2: add x8 x18 x6
A4: add x6 x7 x9
```



# Rename Table A1.x6 -> r0 M1.x9 -> r1 M1.x6 <- r0 A2.x6 -> r2 A3.x7 -> r3 A3.x9 <- r1 M2.x8 -> r4 M2.x6 <- r2 A4.x6 -> r5 A4.x7 <- r3 A4.x9 <- r1

WAR dep with M2 & WAW w/ A2 resolved by renaming
True deps w/ A3 and M1 resolved by looking up renamed regs in table

```
A1: add x6 x8 x11
M1: mul x9 x6 x13
A2: add x6 x17 x30
A3: add x7 x9 x14
M2: add x8 x18 x6
A4: add x6 x7 x9
```



A1.x6 -> r0
M1.x9 -> r1
M1.x6 <- r0
A2.x6 -> r2
A3.x7 -> r3
A3.x9 <- r1
M2.x8 -> r4
M2.x6 <- r2
A4.x6 -> r5
A4.x7 <- r3
A4.x9 <- r1

Rename Table

After register renaming, only RAW dependences (i.e., "True Dependences") remain in the execution

```
A1: add r0 x8 x11
M1: mul r1 r0 x13
A2: add r2 x17 x30
A3: add r3 r1 x14
M2: add r4 x18 r2
A4: add r5 r3 r1
```



A1.x6 -> r0
M1.x9 -> r1
M1.x6 <- r0
A2.x6 -> r2
A3.x7 -> r3
A3.x9 <- r1
M2.x8 -> r4
M2.x6 <- r2
A4.x6 -> r5
A4.x7 <- r3
A4.x9 <- r1

Rename Table

After register renaming, only RAW dependences (i.e., "True Dependences") remain in the execution

#### Renaming Avoids False Deps

sub x8 x16 x4
add r1 x6 x14
lw x11 0xabc

Stalled at decode/reg. read

Write-After-Read (WAR)



Completes quickly and writes reg.

Later add instruction writes r1 before earlier sub instruction reads x16, which is perfectly ok!

# Superscalar Out of Order Execution is extremely complex to implement

In-order Front-end In-order Commit We will leave out of order execution details here, but there is a lot more to learn about this topic. 0010 Register renaming algorithms, how to do forwarding in 1100 add SS/OoO, what to do on exceptions in SS/OoO... 447 & 740 1010 add mul mu1 add mul mul 0101 0100 ALU (non-SW mul mul 0100 mul) (Rename) lw mul /Dispatch ALU (non-**Decode** Commit SW lmul) lw add lw Register Memory Reg. Read mul0 mul1mul2 Issue Write-Back Out of Order Execution

Scheduling Techniques to Maximize ILP

#### Superscalar execution exploits ILP to increase IPC

Empty issue slot represent wasted opportunity to do some work on a cycle

Performance in a superscalar processor depends on the existence of ILP in the program.

We need there to be parallelizable instructions in the instruction stream that we fetch, dispatch, and issue.

Question: how to avoid issue slot waste?



#### Superscalar execution exploits ILP to increase IPC

Empty issue slot represent wasted opportunity to do some work on a cycle

#### Question: how to avoid issue slot waste?

- Schedule code in program to avoid dependences
- Schedule code in loops to align with fetch granularity
- Schedule code to avoid oversubscribing functional units (i.e., a sequence of consecutive multiplies can't issue together)



#### Simultaneous Multi-Threading (SMT)

Also known as "Hyper-threading" on Intel processors, used for decades now.



#### Simultaneous Multi-Threading (SMT)

**Question: Sources of hardware complexity for SMT?** 

ALU (non-

mul0 mul1mul2

Out of Order Execution

mul)

SW

Register

Write-Back

lw

lw

Memory

Need fetch to support multiple streams (including branch prediction logic...) Fill empty issue slots with Need to tag functional units, rename table entries, ROB entries (and other instructions from another structures) to route values to correct downstream instructions thread ALU (non-SW mul) Issue Width lw

Issue

**Issue Time** 

Reg. Read

#### Very Large Instruction Word (VLIW) Architectures

Change the ISA! In VLIW, the ISA exposes the issue width architecturally Each fetch / issue is on a packet of instructions, hopefully independent



#### The compiler plays a crucial role

- We will pick up next time with more discussion of hardware/software interfaces that expose opportunities for parallelism
- We will study how the compiler exposes parallelism and exploits the opportunities for parallelism in the architecture
- More VLIW, Vector architectures
- Then we will look at some compiler fundamentals and see how all of these ideas converge in software