

# 18-344: Computer Systems and the Hardware-Software Interface


#### 18-344: Computer Systems and the Hardware-Software Interface (Fall 2023 Lecture 2)



#### **Course Description**

#### **LECTURE 2**

This course covers the design and implementation of computer systems from the perspective of the hardware-software interface. Its purpose is for students to understand the relationship between the operating system, software, and computer architecture. Students who complete the course will have learned operating system fundamentals, computer architecture fundamentals, compilation to hardware abstractions, and how software actually executes from the perspective of the hardware/software boundary. The course focuses especially on the relationships between software and hardware, and on how those relationships influence the design of a computer system's software and hardware. It conveys these topics through a series of practical, implementation-oriented lab assignments.

## What is the hardware/software boundary?

- An ISA or Computer Architecture?
- A division of labor between Computer Engineers and Programmers?
- A split between what you can change and what you cannot change?
- Python vs. Verilog?

#### The 213 view of the world

- ISA is the *immutable* foundation of the system
- High-level language compiles to ISA
- Linux (or other) OS provides important low-level services
- Low-level optimization: know HW structure to make smart code changes









#### The 240 view of the world

- What's an ISA? (RISCV-240 notwithstanding)
- SystemVerilog describes your hardware
- What's an OS? What is \*software\* even?
- Implement through simulation, ASIC fabrication or FPGA configuration



# Relative Mutability/Non-Recurring Eng. (NRE) Cost?



## Relative Observability?



## Relative Optimizability?





# Our first hw/sw interface: The Von Neumann Computing Model



John von Neumann's Big Idea:

Programs are data.





 Let's optimize! Where is there a bottleneck in the Von Neumann abstract machine?



 Data & Program share a bus into the CPU. Need to time multiplex access to the bus.



```
I1: x = y + z
I2: a = b * c
I3: r = s + t
```
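The bus contention described above can be sketched with a toy model. This is a minimal Python sketch under assumptions that are not from the lecture: each instruction costs one instruction fetch plus three data accesses (two operand reads and one result write), every bus transfer takes one cycle, and a hypothetical split-bus design overlaps instruction and data traffic perfectly. The function names are illustrative only.

```python
# Idealized bus-occupancy model (illustrative assumptions, not a real CPU).
INSTRUCTIONS = ["x = y + z", "a = b * c", "r = s + t"]  # I1..I3 from above

FETCHES_PER_INSTR = 1        # assumed: one instruction fetch per instruction
DATA_ACCESSES_PER_INSTR = 3  # assumed: two operand reads + one result write


def unified_bus_cycles(n_instr: int) -> int:
    """Shared bus: fetches and data accesses take turns (time multiplexing)."""
    return n_instr * (FETCHES_PER_INSTR + DATA_ACCESSES_PER_INSTR)


def split_bus_cycles(n_instr: int) -> int:
    """Separate instruction and data buses, assumed to overlap perfectly."""
    instruction_bus = n_instr * FETCHES_PER_INSTR
    data_bus = n_instr * DATA_ACCESSES_PER_INSTR
    return max(instruction_bus, data_bus)  # the busier bus sets the pace


n = len(INSTRUCTIONS)
print("unified bus:", unified_bus_cycles(n), "bus cycles")
print("split buses:", split_bus_cycles(n), "bus cycles")
```

Under these assumptions the shared bus serializes all 12 transfers for I1 to I3, while the split design is limited only by the 9 data transfers.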





# Alternative to von Neumann: the Harvard Architecture






### Optimizing our Harvard Architecture






## Thinking about the costs of HW optimization





Figure: memory energy estimates from `./destiny config/SRAM 128 1 32.cfg` (4-byte Memory Interface) and `./destiny config/SRAM 128 1 128.cfg` (16-byte Memory Interface). Reported values: Read Energy = 1.51pJ, Read Energy = 0.836pJ, Write Energy = 0.738pJ, Write Energy = 1.30pJ.

Diagram labels: CPU; Control & Instruction Sequencing ("Control"); Arithmetic, Logic, and Data Manipulation ("ALU"); Unified Bus; Dual-ported Data Bus (2x64-bit read, 64-bit write port); Instruction Bus (32-bit read); Memory (Program, Data).

There is No Free Lunch!

## How about optimizing instruction supply?






# Is this optimization a good tradeoff?






# Is this optimization always a good tradeoff?



### How about changing the code?



A key Law of the HW/SW Universe

#### Let's revisit the proposition that we should optimize instruction supply




How do we decide if this part of the system is really worth optimizing?

Imagine we have a perfectly precise measurement tool to break down 100% of execution time:

- 45% - Memory Accesses
- 20% - Control Flow
- 17.5% - Integer
- 12.5% - Fetch
- 5% - Floating Point

What can we say about optimizing instruction supply (fetch) if this is our situation?

#### What if we make fetch 4x faster?



Assume the original execution takes 10s.

- 45% + 20% + 17.5% + 5% = 87.5% of the execution runs at its original speed (i.e., 87.5% of 10s = 8.75s).
- 12.5% of the execution (fetch) runs 4x faster. Originally, fetch took 12.5% of 10s = 1.25s. At 4x faster, fetch takes 0.3125s, a savings of 0.9375s (about 0.94s).

A 4x improvement in fetch means execution takes 8.75s + 0.3125s = 9.0625s instead of 10s, so new time / old time = 9.0625s / 10s ≈ 0.906.

What does 0.906 mean? The optimized program still takes about 90.6% of the original time: 0.94s / 10s \* 100 = 9.4% time savings. That is less than a 10% improvement overall, despite a 4x improvement on one part of the system!

#### What if we make memory accesses 4x faster?






55% of 10s + (45% of 10s) / 4 = 6.625s

Evidently, we should optimize the memory part before fetch!




Amdahl's Law: if fraction $p$ of the original time $T_{orig}$ can be optimized by a factor $s$ (the speedup), while the rest runs at speedup 1:

$T_{opt} = \frac{(1-p)\,T_{orig}}{1} + \frac{p\,T_{orig}}{s}$
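The two worked examples above (fetch 4x faster, memory accesses 4x faster) can be checked with a few lines of Python. This is a minimal sketch; the function name `optimized_time` and its parameter names are illustrative choices, and the 10s baseline is the one used in the examples.

```python
def optimized_time(total_time: float, p: float, speedup: float) -> float:
    """Amdahl's Law: only fraction p of total_time runs `speedup` times faster."""
    unchanged = (1 - p) * total_time        # part that keeps its original speed
    improved = (p * total_time) / speedup   # part that was optimized
    return unchanged + improved


# Fractions taken from the execution-time breakdown: fetch 12.5%, memory 45%.
fetch_4x = optimized_time(10.0, 0.125, 4.0)
memory_4x = optimized_time(10.0, 0.45, 4.0)

print(f"fetch 4x faster:  {fetch_4x:.4f}s")
print(f"memory 4x faster: {memory_4x:.4f}s")
```

The same 4x factor saves far more time when applied to the larger slice.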




100% of execution time: 45% - Memory Accesses, 20% - Control Flow, 17.5% - Integer, 12.5% - Fetch, 5% - Floating Point

#### Amdahl's Law:

Overall Speedup = $T_{orig} / T_{opt}$

Overall Speedup = $\frac{1}{1 - p + p/s}$
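The step between the two forms of the speedup can be written out explicitly. The derivation below is a sketch that assumes $T_{orig}$ is the original time, $T_{opt}$ the optimized time, $p$ the optimizable fraction, and $s$ its speedup; the two numeric lines reuse the fetch and memory examples with $s = 4$.

```latex
\begin{align*}
T_{opt} &= (1-p)\,T_{orig} + \frac{p\,T_{orig}}{s}
         = T_{orig}\left(1 - p + \frac{p}{s}\right) \\
\text{Overall Speedup} &= \frac{T_{orig}}{T_{opt}}
         = \frac{T_{orig}}{T_{orig}\left(1 - p + \frac{p}{s}\right)}
         = \frac{1}{1 - p + \frac{p}{s}} \\
\text{Fetch } (p = 0.125,\ s = 4):&\quad \frac{1}{0.875 + 0.03125} = \frac{1}{0.90625} \approx 1.10 \\
\text{Memory } (p = 0.45,\ s = 4):&\quad \frac{1}{0.55 + 0.1125} = \frac{1}{0.6625} \approx 1.51
\end{align*}
```

Note that $T_{orig}$ cancels, so the overall speedup depends only on $p$ and $s$.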


## By how much do we have to improve the memory part of the system to get a 2x total speedup?






```
0.55 x 10s + (0.45 x 10s) / speedup = 10s / 2 = 5s
4.5s / speedup = 5s - 5.5s = -0.5s
speedup = -9x
```

Well, that's a strange amount by which to speed up a program... conclusion?
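One way to make the negative answer less mysterious is to solve for the required speedup in code and flag the impossible cases. This is a minimal sketch; `required_speedup` is a hypothetical helper name, and the 1.5x target in the second call is an example chosen here for contrast, not a figure from the lecture.

```python
from typing import Optional


def required_speedup(p: float, target: float) -> Optional[float]:
    """Speedup s needed on fraction p to reach an overall speedup of `target`.

    Solves 1 / ((1 - p) + p / s) = target for s. Returns None when even an
    infinitely fast optimized part could not reach the target.
    """
    # Normalized time budget left for the optimized part after the
    # unoptimized part (1 - p) has taken its share.
    budget = 1.0 / target - (1.0 - p)
    if budget <= 0:
        return None  # the unoptimized part alone already exceeds the budget
    return p / budget


print(required_speedup(0.45, 2.0))  # memory accesses, 2x overall target
print(required_speedup(0.45, 1.5))  # memory accesses, assumed 1.5x target
```

A `None` result corresponds to the negative speedup in the algebra: the part that is not being optimized already takes longer than the whole target time.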




Let's try again: what if we *completely* optimize away the optimizable part? How much is left over?

- Completely optimizing away fetch (12.5% of 10s) leaves 8.75s optimized execution time.
- Completely optimizing away memory accesses (45% of 10s) leaves 5.5s optimized execution time.

Even with memory accesses removed entirely, 5.5s is still more than the 5s needed for a 2x total speedup, so no speedup of the memory part alone can reach 2x.





#### Amdahl's Law:

Optimized time: $T_{opt} = \frac{(1-p)\,T_{orig}}{1.0} + \frac{p\,T_{orig}}{s}$

Overall speedup: $\frac{1}{1 - p + p/s}$



Amdahl's Law with *infinite* speedup:

- Optimized time with infinite speedup of $p$: $T_{opt} = \frac{(1-p)\,T_{orig}}{1.0}$
- Overall speedup with infinite speedup of $p$: $\frac{1}{1 - p}$
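The infinite-speedup limit gives a quick upper bound for each slice of the execution-time breakdown. The sketch below is illustrative: the dictionary `BREAKDOWN` and the helper names are assumptions made here, and the fractions are the ones from the pie chart.

```python
# Fractions of execution time from the breakdown used in this lecture.
BREAKDOWN = {
    "Memory Accesses": 0.45,
    "Control Flow": 0.20,
    "Integer": 0.175,
    "Fetch": 0.125,
    "Floating Point": 0.05,
}


def overall_speedup(p: float, s: float) -> float:
    """Amdahl's Law for a finite speedup s on fraction p."""
    return 1.0 / ((1.0 - p) + p / s)


def max_speedup(p: float) -> float:
    """Amdahl's Law in the limit of infinite speedup on fraction p."""
    return 1.0 / (1.0 - p)


for name, p in BREAKDOWN.items():
    print(f"{name:16s} limit = {max_speedup(p):.3f}x")

# Larger and larger speedups on memory accesses approach, but never pass, the limit.
for s in (2, 10, 100, 1000):
    print(f"memory s={s:<5d} overall = {overall_speedup(0.45, s):.3f}x")
```

No finite speedup on a slice can exceed that slice's limit, which is why the 2x target for memory accesses alone is out of reach.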



### Amdahl's Law is Extremely Versatile



Works for any optimization problem and goal. Always focus on the biggest slice & the rest doesn't matter.


**Bottleneck: memory accesses** 



New bottleneck: control flow



New bottleneck: memory accesses (again!)



Remember: Amdahl tells us to optimize the biggest slice
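The way the bottleneck moves as each biggest slice is optimized can be sketched as a small greedy loop. Everything here is an assumption for illustration: the function name, the uniform 2x factor per round, and the starting breakdown (in seconds out of the 10s example). The exact sequence of bottlenecks depends on the factors chosen, so it need not match the slides.

```python
from typing import Dict, List


def optimize_biggest(slices: Dict[str, float], factor: float, rounds: int) -> List[str]:
    """Repeatedly speed up the largest slice by `factor`.

    Returns the name of the bottleneck that was optimized in each round.
    """
    times = dict(slices)  # work on a copy so the caller's data is untouched
    history = []
    for _ in range(rounds):
        bottleneck = max(times, key=times.get)  # biggest remaining slice
        history.append(bottleneck)
        times[bottleneck] /= factor
    return history


# Seconds per slice for a 10s run, from the breakdown used in this lecture.
start = {
    "Memory Accesses": 4.5,
    "Control Flow": 2.0,
    "Integer": 1.75,
    "Fetch": 1.25,
    "Floating Point": 0.5,
}

for round_number, name in enumerate(optimize_biggest(start, 2.0, 3), start=1):
    print(f"round {round_number}: bottleneck = {name}")
```

Each round shrinks the current biggest slice, so another slice eventually becomes the new bottleneck.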



Idea: find an *optimizable* part of your system and make it *bigger*. If we know that memory is optimizable, why not optimize more and do more memory accesses?

Gustafson's Law: the sequential part does not grow as the optimizable part grows. We can always add more optimizable work and make the sequential part matter less.

- Assume that we can scale up the number of parallel memory accesses, N.
- Assume we can scale the input up to use all N parallel accesses.

```
data_size = 10
data[data_size] = {...}
if (...) { ... }
// ...18 more of these conditionals
if (...) { ... }
for d in 0..data_size { data[d]++ }
```

```
data_size = 100000
data[data_size] = {...}
if (...) { ... }
// ...18 more of these conditionals
if (...) { ... }
#parallel[N=1000]
for d in 0..data_size { data[d]++ }
```

85% - Memory Accesses

### Gustafson's Law for overall speedup with speedup factor of N:

- (Assume) Optimized time: $T = 1$
- Unoptimized time: $T' = (1-p)T + pTN = (1-p) + pN$
- Scaled speedup: $T' / T = (1-p) + pN$

#### Scale parallel memory accesses, N, up to 1000?

```
Scaled Speedup = 1-p + 1000p = 999p + 1
Scaled Speedup = 999 * 0.85 + 1 = 850.15 ≈ 850x
```
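The scaled-speedup arithmetic can be reproduced in Python, alongside the fixed-size Amdahl figure for the same p and N. This is a minimal sketch; the function names are illustrative, and placing the two laws side by side is an addition made here for contrast.

```python
def scaled_speedup(p: float, n: int) -> float:
    """Gustafson's Law: the optimizable fraction p grows with n, the rest does not."""
    return (1.0 - p) + p * n


def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's Law for comparison: problem size fixed, fraction p sped up n times."""
    return 1.0 / ((1.0 - p) + p / n)


p, n = 0.85, 1000  # 85% memory accesses, N = 1000 parallel accesses
print(f"Gustafson scaled speedup:  {scaled_speedup(p, n):.2f}x")
print(f"Amdahl fixed-size speedup: {amdahl_speedup(p, n):.2f}x")
```

The gap between the two numbers comes from the assumption about the workload: Gustafson lets the input grow to use all N parallel accesses, while Amdahl holds the input fixed.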


### What did we just learn?

- Two high-level architectural models
- Identify performance bottlenecks
- Develop optimizations to mitigate bottlenecks
- Analyze resulting improvement from mitigating bottlenecks
- Identify persistent performance limiters (e.g., branches)
- Optimize in software or hardware
- (Almost) never bet against Gene Amdahl in an optimization contest!

#### What to think about next?

- What is a computer architecture?
- What matters when defining a HW/SW interface?
- What is above the ISA and what is below the ISA?
- What is hidden from the programmer and what is exposed?