

Buzz Words

This is a list of keywords related to the topics covered in the class, meant to jog your memory. For a comprehensive list, please refer to the lecture notes and the notes you take during class.

Lecture 2

ISA, Trade-offs, Performance
  • ISA vs Microarchitecture
  • Latency vs Throughput trade-offs (bit-serial adders vs carry ripple/look-ahead adders)
  • Speculation (branch prediction overview)
  • Superscalar processing (multiple instruction issue, VLIW)
  • Prefetching (briefly covered)
  • Power/clock gating (briefly covered)
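The latency vs throughput trade-off between serial and parallel adders can be illustrated with a ripple-carry sketch: the carry must ripple through every bit position, so latency grows linearly with operand width (the width and function name here are illustrative, not from the lecture):

```python
# Sketch of a ripple-carry adder: each loop iteration models one full
# adder, and the carry ripples through all n bit positions in sequence.

def ripple_add(a, b, n=8):
    carry, result = 0, 0
    for i in range(n):                      # one full adder per bit
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                 # sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))
        result |= s << i
    return result & ((1 << n) - 1)          # wrap at n bits

print(ripple_add(100, 55))  # 155, computed one bit (and one carry) at a time
```

A carry look-ahead adder computes carries in parallel instead, trading more hardware for lower latency; a bit-serial adder goes the other way, reusing one full adder over many cycles.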

Lecture 3

Performance
  • Addressing modes (Registers, Memory, Base-index-displacement)
  • 0/1/2/3 Address machines (accumulator machines, stack machines)
  • Aligned accesses (Hardware handled vs Software handled: trade-offs)
  • Transactional memory (overview)
  • Von Neumann model (Sequential instruction execution)
  • Data flow machines (Data driven execution)
  • Evaluating performance
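The performance-evaluation discussion usually rests on the iron law of processor performance; as a reminder:

```latex
\text{Execution time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
```

The ISA, the microarchitecture, and the circuit technology each primarily influence one of the three factors, which is why comparing machines on any single factor (e.g. clock frequency alone) can mislead.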

Lecture 4

Pipelining
  • Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
  • Amdahl's law
  • Pipelining vs Multi-cycle machines
  • Stalling (dependencies)
  • Data dependencies (true, output, anti)
  • Value prediction
  • Control dependencies
  • Branch prediction / Predication
  • Fine-grained multi-threading
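Amdahl's law, listed above, bounds the speedup from accelerating only part of a program. If a fraction $f$ of the execution time is sped up by a factor $s$:

```latex
\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{s}}
```

For example, with $f = 0.9$ and $s = 10$, the speedup is $1/(0.1 + 0.09) \approx 5.26$, far below 10: the serial 10% dominates.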

Lecture 5

Precise Exceptions
  • Need for separate instruction and data caches
  • Exceptions vs Interrupts
  • Reorder buffer/History buffer/Future files
  • Impact of size of architectural register file
  • Checkpointing
  • Reservation stations
  • Precise exceptions
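The reorder buffer idea above can be sketched in a few lines: instructions may finish out of order, but they update architectural state only when they reach the ROB head, which is what makes exceptions precise (the entry format below is illustrative):

```python
# Sketch of in-order retirement from a reorder buffer (ROB).
from collections import deque

def retire_ready(rob):
    """rob: deque of dicts oldest-first with a 'done' flag.
    Pops and returns the names of instructions retired this cycle."""
    retired = []
    while rob and rob[0]["done"]:       # retire only from the head
        retired.append(rob.popleft()["name"])
    return retired

rob = deque([{"name": "i1", "done": True},
             {"name": "i2", "done": False},
             {"name": "i3", "done": True}])   # i3 finished early
print(retire_ready(rob))  # ['i1']: i3 must wait behind unfinished i2
```

If i2 raises an exception, i3 has not yet touched architectural state, so the machine can roll back cleanly to the excepting instruction.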

Lecture 6

Virtual Memory
  • Problems in early machines (size, protection, relocation, contiguity, sharing)
  • Segmentation (solves some of these problems)
  • Virtual memory (high level)
  • Implementing translations (page tables, inverted page tables)
  • TLBs (caching translations)
  • Virtually-tagged caches and Physically-tagged caches
  • Address space identifiers
  • Sharing (overview)
  • Super pages (overview)
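The page-table translation step above can be sketched as follows, assuming 4 KiB pages and a flat (one-level) table; real implementations use multi-level or inverted tables, and the mappings here are made up for illustration:

```python
# Minimal sketch of virtual-to-physical address translation.
PAGE_SIZE = 4096           # 4 KiB pages -> 12 offset bits
OFFSET_BITS = 12

# Hypothetical page table: virtual page number -> physical frame number
page_table = {0x12: 0x345, 0x13: 0xABC}

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS          # upper bits index the table
    offset = vaddr & (PAGE_SIZE - 1)    # lower bits pass through
    if vpn not in page_table:
        raise KeyError("page fault: VPN 0x%x not mapped" % vpn)
    return (page_table[vpn] << OFFSET_BITS) | offset

print(hex(translate(0x12ABC)))  # 0x345abc: frame 0x345, offset 0xABC
```

A TLB simply caches recent (vpn, frame) pairs so most translations skip the table walk.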

Lecture 7

Out-of-Order Execution
  • Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
  • Superscalar vs Out-of-Order execution
  • Precise exceptions (summary)
  • Branch prediction vs exceptions
  • Out-of-Order execution (preventing dispatch stalls)
  • Tomasulo's algorithm
  • Dynamic data-flow graph generation

Lecture 8

Exploiting ILP
  • Overview of Tomasulo's algorithm
  • Problems with increasing instruction window to entire program
    1. number of comparators
    2. ROB size - instruction window
  • Implementation issues in out-of-order processing (decoupling structures)
    1. physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file)
    2. Register alias tables and architectural register alias table
  • Handling branch misprediction (don't have to wait for the branch to become the oldest instruction)
    1. checkpoint the register alias table on each branch
    2. what if we have a lot of branches? (confidence estimation)
  • Handling stores
    1. store buffers
    2. load searches store buffer/cache
    3. what if addresses are not yet computed for some stores? (unknown address problem)
    4. how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
  • Problems with physical register files
    1. register file read always on the critical path
  • Multiple instruction issue
  • Distributed/Centralized reservation stations
  • Scheduling
    1. Design Issues
  • Dependents on a load miss
    1. FIFO for load dependents
    2. Register deallocation
  • Latency tolerance of out of order processors

Lecture 9

Caching Basics
  • Caches
  • Direct mapped caches
  • Set associative caches
  • Miss rate
  • Average Memory Access Time
  • Cache placement
  • Cache replacement
  • Handling writes in caches
  • Inclusion/Exclusion
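The Average Memory Access Time metric listed above combines hit time, miss rate, and miss penalty:

```latex
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
```

For example, a cache with a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty gives an AMAT of $1 + 0.05 \times 100 = 6$ cycles, so either halving the miss rate or halving the miss penalty has a far larger effect than shaving the hit time.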

Lecture 10

Runahead and MLP
  • Lack of temporal and spatial locality (long strides, large working sets)
  • Stride prefetching (software vs hardware: complexity, dynamic information)
  • Irregular accesses (hash tables, linked structures?)
  • Where OoO cannot really benefit (L2 cache misses; need large instruction windows)
  • Run-ahead execution (Generating cache misses that can be serviced in parallel)
  • Problems with run-ahead
    1. length of run-ahead
    2. what if new cache-miss is dependent on original miss?
    3. Branch mispredictions and miss dependent branches
  • DRAM bank organization
  • Tolerating memory latencies
    1. Caching
    2. Prefetching
    3. Multi-threading
    4. Out of order execution
  • Fine-grained multi-threading (design and costs)
  • Causes of inefficiency in Run-ahead (energy consumption)
  • Breaking dependence
    1. address prediction (AVD prediction)
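The stride prefetching idea above can be sketched as a PC-indexed table: for each load instruction, remember the last address and last stride, and issue a prefetch once the same stride repeats (the table format is illustrative; real designs also track confidence and prefetch distance):

```python
# Minimal sketch of a PC-indexed stride prefetcher.
class StridePrefetcher:
    def __init__(self):
        self.table = {}   # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Return a prefetch address if a stable stride is detected."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride   # stride confirmed, run ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch

pf = StridePrefetcher()
for a in (100, 164, 228):
    hint = pf.access(pc=0x400, addr=a)
print(hint)  # 292: a stride of 64 repeated, so prefetch 228 + 64
```

This is what a hardware stride prefetcher learns dynamically; a compiler doing software prefetching would instead insert prefetch instructions using statically known strides.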

Lecture 11

OOO wrap-up and Advanced Caching
  • Dual Core Execution (DCE)
  • Comparison between run ahead and DCE
    1. Lag between the front and the back cores - controlled by result queue sizing
  • Slipstreaming
  • SMT Architectures for slipstreaming instead of 2 separate cores
  • Result queue length in DCE
  • Store-Load dependencies
  • Store buffer design
    1. Content associative, age ordered list of stores
  • Memory Disambiguation
    1. Load dependence/independence on previous stores
    2. Store/Load dependence prediction
  • Speculative execution and data coherence
    1. Load buffer
  • Research issues in OoO
    1. Scalable and energy-efficient instruction windows
    2. Packing more MLP into a small window
  • OOO in Multi core systems
    1. Memory system contention - bigger issue
    2. Multiple cores to perform OOO
    3. Asymmetric Multi-cores
  • Symmetric vs Asymmetric multi cores
    1. Accelerating critical sections
    2. Core fusion
  • Inclusion/Exclusion
  • Multi-level caching
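The store buffer design above (a content-associative, age-ordered list of stores) can be sketched by how a load searches it: the load must receive the value of the *youngest* older store to the same address, else it falls through to the cache (the request format here is assumed for illustration):

```python
# Sketch of store-to-load forwarding from an age-ordered store buffer.
def load_forward(store_buffer, addr):
    """store_buffer: oldest-first list of (addr, value).
    Returns the forwarded value, or None if the load must access the cache."""
    for st_addr, value in reversed(store_buffer):   # search youngest first
        if st_addr == addr:
            return value
    return None

sb = [(0x100, 7), (0x200, 9), (0x100, 42)]
print(load_forward(sb, 0x100))  # 42: the youngest matching store wins
print(load_forward(sb, 0x300))  # None: no match, read from the cache
```

The unknown-address problem arises when some store in the buffer has not yet computed its address: the load cannot prove independence, which is where memory-dependence prediction comes in.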

Lecture 12

Advanced Caching
  • Handling writes
    1. Write-back
    2. Write-through
    3. Write allocate/no allocate
  • Instruction/Data Caching
  • Cache Replacement Policies
    1. Random
    2. FIFO
    3. Least Recently Used
    4. Not Most Recently Used
    5. Least Frequently used
  • LRU vs Random - Random is as good as LRU for most practical workloads
  • Optimal Replacement Policy
  • MLP aware cache replacement
  • Cache Performance
    1. Reducing miss rate
    2. Reducing miss latency/cost
  • Cache Parameters
    1. Cache size vs hit rate
    2. Block size
    3. Large Blocks - Critical words
    4. Large blocks - bandwidth wastage
    5. Sub-blocking
    6. Associativity
    7. Power of 2 associativity?
    8. Hybrid Replacement policies
    9. Sampling based hybrid (random/LRU) replacement
  • Cache Misses
    1. Compulsory
    2. Conflict
    3. Capacity
    4. Coherence
  • Cache aware schedulers - cache affinity based application mapping
  • Victim caches
  • Hashing - Randomizing index functions
  • Pseudo associativity - serial cache lookup

Lecture 13

More Caching
  • Speculative partial tag comparison
  • Skewed associative caches
    1. Randomizing the index for different ways
  • Improving hit rate in software
    1. Loop interchange - Row major, Column major
    2. Blocking
    3. Loop fusion, Array merging
    4. Data structure layout - Packing frequently used fields in arrays
  • Handling multiple outstanding misses
    1. Non blocking caches
    2. Miss Status Handling Registers (MSHR)
    3. Accessing MSHRs
  • Reducing miss latency through software
    1. Compiler level reordering of loops
    2. Software prefetching
  • Handling multiple accesses in a cycle
    1. True/Virtual multiporting
    2. Banking/Interleaving
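The loop-interchange point above is easy to see from the address pattern: for a row-major array, traversing row-by-row touches consecutive memory, while traversing column-by-column strides by a whole row per access (the array size here is illustrative):

```python
# Sketch of why loop interchange matters for a row-major 2-D array.
N = 4
# Row-major flattening: element (i, j) lives at linear index i * N + j
def addr(i, j):
    return i * N + j

row_major_order = [addr(i, j) for i in range(N) for j in range(N)]
col_major_order = [addr(i, j) for j in range(N) for i in range(N)]

print(row_major_order[:5])  # [0, 1, 2, 3, 4]:  unit stride, cache friendly
print(col_major_order[:5])  # [0, 4, 8, 12, 1]: stride N, poor spatial locality
```

Blocking (tiling) applies the same reasoning to both loop dimensions at once, sizing the tile so the working set fits in the cache.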

Lecture 14

Prefetching
  • Compulsory/conflict/capacity misses and prefetching
  • Coherence misses and prefetching
  • False sharing
    1. Word/byte based coherence
    2. Value prediction/Speculative execution
  • Prefetching and correctness
  • What/When/Where/How
    1. Accuracy
    2. Timeliness
    3. Coverage
    4. Prefetch buffers
    5. Skewing prefetches towards demand fetches
    6. Software/Hardware/Execution-based prefetchers
  • Software prefetching
    1. Binding/Non-binding prefetches
    2. Prefetching during pointer chasing
    3. x86 prefetch instructions - prefetching into different levels of cache
    4. Handling of prefetches that cause TLB misses/page faults
    5. Compiler driven prefetching
    6. Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
  • Hardware prefetching
    1. Next line prefetchers
    2. Stride prefetchers
    3. Instruction based stride prefetchers
    4. Stream buffers
    5. Locality based prefetching
  • Prefetcher performance
    1. Accuracy
    2. Coverage
    3. Timeliness
    4. Aggressiveness
      1. Prefetcher distance
      2. Prefetcher degree
  • Irregular prefetching
    1. Markov prefetchers

Lecture 15

Prefetching (wrap up)
  • Power 4 System Microarchitecture (prefetchers) (IBM Journal of R & D)
  • Irregular patterns (indirect array accesses, linked structures)
    1. Markov prefetching
      1. linked lists or trees
      2. Markov prefetchers vs stride prefetchers
  • Content directed prefetching
    1. pointer based structures
    2. identifying pointers (software mechanism/hardware prediction)
    3. compiler analysis to provide hints for useful prefetches
  • Hybrid prefetchers
  • Execution based prefetchers
    1. pre-execution thread for creating prefetches for the main program
    2. determining when to start the pre-execution
    3. similar idea for branch prediction
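The Markov prefetching idea above can be sketched with a single-successor table: remember which miss address historically followed each miss address, and prefetch the recorded successor the next time that miss recurs (real Markov prefetchers keep several weighted successors per entry; this one-successor form is a simplification):

```python
# Minimal sketch of a Markov prefetcher over the miss-address stream.
class MarkovPrefetcher:
    def __init__(self):
        self.next_of = {}    # miss addr -> most recent successor miss
        self.prev = None

    def miss(self, addr):
        """Record the transition and return a prefetch hint, if any."""
        if self.prev is not None:
            self.next_of[self.prev] = addr   # learn prev -> addr
        self.prev = addr
        return self.next_of.get(addr)        # replay learned successor

pf = MarkovPrefetcher()
for a in (10, 70, 30, 10):
    hint = pf.miss(a)
print(hint)  # 70: last time we missed on 10, the next miss was 70
```

Unlike a stride prefetcher, this captures arbitrary repeating patterns (e.g. linked-list traversals), at the cost of a large correlation table and no ability to prefetch addresses never seen before.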

Lecture 16

  • Prefetching in multicores
  • Importance of prefetch efficiency
  • Issues with local prefetcher throttling
  • Hierarchical prefetcher throttling
  • Cache coherence
    1. Snoopy cache coherence
  • Shared caches in multicores
  • Utility based cache partitioning

Lecture 17

  • Software based cache management
    1. Thread scheduling
    2. Page coloring
    3. Dynamic partitioning through page recoloring
  • Cache placement
    1. Insertion
    2. Re-insertion
    3. Circular reference model
    4. Dynamic insertion policy - LRU and Bimodal insertion

Lecture 19

Main memory system

  • Memory hierarchy
  • SRAM/DRAM cell structures
  • Memory bank organization
  • Page mode DRAM
    1. Bank operation
  • Basic DRAM operation
    1. Controller latency
    2. Bank latency
  • DRAM chips/DIMMs
  • DRAM channels
  • Address mapping/interleaving
  • Bank mapping randomization
  • DRAM refresh
  • DRAM controller issues
    1. Memory controller placement

Lecture 20

  • DRAM controller functions
    1. Refresh
    2. Scheduling
  • DRAM Scheduling policies
    1. FCFS
    2. FR-FCFS
  • Row buffer management policies
    1. Open row
    2. Closed row
  • DRAM controller design
    1. Machine learning - a possibility
  • DRAM power states
  • Inter-thread interference in DRAM
  • Multi-core DRAM controllers
    1. Stall-time fairness
      1. Unfairness
      2. Estimating alone runtime
      3. Providing system software support
    2. Parallelism aware batch scheduling
      1. Request batching
      2. Within batch scheduling
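The FR-FCFS policy listed above (first-ready, first-come-first-served) can be sketched as a two-level pick: among queued requests, prefer row-buffer hits; break ties by arrival order (the request format here is assumed for illustration):

```python
# Sketch of FR-FCFS request selection for one DRAM bank.
def fr_fcfs_pick(queue, open_row):
    """queue: list of (arrival, row) tuples, oldest first.
    open_row: the row currently open in the bank's row buffer.
    Returns the request to schedule next."""
    hits = [r for r in queue if r[1] == open_row]   # row-buffer hits first
    return hits[0] if hits else queue[0]            # else plain FCFS

queue = [(0, "rowA"), (1, "rowB"), (2, "rowB")]
print(fr_fcfs_pick(queue, open_row="rowB"))  # (1, 'rowB'): oldest row hit
print(fr_fcfs_pick(queue, open_row="rowC"))  # (0, 'rowA'): no hits -> FCFS
```

This maximizes row-buffer hit rate and hence single-thread throughput, but a thread with high row locality can starve others, which motivates the fairness-aware multi-core schedulers above.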

Lecture 21

Superscalar processing

  • Types of parallelism
    1. Task
    2. Thread
    3. Instruction
  • Fetch stage
    1. Instruction alignment Issue
    2. Solution - Split cache line fetch
    3. Fetch break - branches in the fetch block
    4. Solution - Short distance predicted-taken branch
    5. Solution - Basic block reordering
    6. Solution - Super block code optimization
    7. Solution - Trace cache
