Table of Contents
Buzz Words
This is a list of keywords related to the topics covered in the class, intended to jog your memory. For a comprehensive list, please refer to the lecture notes and the notes you take during class.
Lecture 2
ISA, Trade-offs, Performance
- ISA vs Microarchitecture
- Latency vs Throughput trade-offs (bit-serial adders vs ripple-carry/carry-lookahead adders)
- Speculation (branch prediction overview)
- Superscalar processing (multiple instruction issue, VLIW)
- Prefetching (briefly covered)
- Power/clock gating (briefly covered)
Lecture 3
Performance
- Addressing modes (Registers, Memory, Base-index-displacement)
- 0/1/2/3 Address machines (accumulator machines, stack machines)
- Aligned accesses (Hardware handled vs Software handled: trade-offs)
- Transactional memory (overview)
- Von Neumann model (Sequential instruction execution)
- Data flow machines (Data driven execution)
- Evaluating performance
Lecture 4
Pipelining
- Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
- Amdahl's law
- Pipelining vs Multi-cycle machines
- Stalling (dependencies)
- Data dependencies (true, output, anti)
- Value prediction
- Control dependencies
- Branch prediction / Predication
- Fine-grained multi-threading
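The Amdahl's law bullet above is easy to make concrete with a small calculation. A minimal sketch (the function name is mine, not from the lecture): overall speedup is limited by the fraction of execution that is not improved.

```python
def amdahl_speedup(enhanced_fraction, enhancement_speedup):
    """Amdahl's law: overall speedup when a fraction of execution
    time is sped up by a given factor; the rest runs unchanged."""
    return 1.0 / ((1.0 - enhanced_fraction)
                  + enhanced_fraction / enhancement_speedup)

# Speeding up 80% of the program by 4x yields only 2.5x overall,
# because the untouched 20% dominates.
print(amdahl_speedup(0.8, 4))  # → 2.5
```

This is why pipelining the common case matters: the serial (unenhanced) fraction bounds the achievable speedup no matter how fast the enhanced part becomes.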
Lecture 5
Precise Exceptions
- Need for separate instruction and data caches
- Exceptions vs Interrupts
- Reorder buffer/History buffer/Future files
- Impact of size of architectural register file
- Checkpointing
- Reservation stations
- Precise exceptions
Lecture 6
Virtual Memory
- Problems in early machines (size, protection, relocation, contiguity, sharing)
- Segmentation (solves some of these problems)
- Virtual memory (high level)
- Implementing translations (page tables, inverted page tables)
- TLBs (caching translations)
- Virtually-tagged caches and Physically-tagged caches
- Address space identifiers
- Sharing (overview)
- Super pages (overview)
Lecture 7
Out-of-Order Execution
- Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
- Super scalar vs Out-of-Order execution
- Precise exceptions (summary)
- Branch prediction vs exceptions
- Out-of-Order execution (preventing dispatch stalls)
- Tomasulo's algorithm
- Dynamic data-flow graph generation
Lecture 8
Exploiting ILP
- Overview of Tomasulo's algorithm
- Problems with increasing instruction window to entire program
- number of comparators
- ROB size - instruction window
- Implementation issues in out-of-order processing (decoupling structures)
- physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file)
- Register alias tables and architectural register alias table
- Handling branch misprediction (don't have to wait for branch to become oldest)
- checkpoint the register alias table on each branch
- what if we have a lot of branches? (confidence estimation)
- Handling stores
- store buffers
- load searches store buffer/cache
- what if addresses are not yet computed for some stores? (unknown address problem)
- how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
- Problems with physical register files
- register file read always on the critical path
- Multiple instruction issue
- Distributed/Centralized reservation stations
- Scheduling
- Design Issues
- Dependents on a load miss
- FIFO for load dependents
- Register deallocation
- Latency tolerance of out of order processors
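The register alias table bullets above can be sketched in a few lines. This is a simplified illustration (instruction encoding and register counts are my assumptions, not from the lecture): each destination write is assigned a fresh physical register, which removes WAR and WAW hazards and leaves only true dependences.

```python
def rename(instrs, num_arch_regs=8):
    """Register renaming via a register alias table (RAT).
    Each instruction is (dst, src1, src2) over architectural regs;
    returns the same instructions over physical regs."""
    rat = {r: r for r in range(num_arch_regs)}  # arch reg -> phys reg
    next_phys = num_arch_regs                   # free-list stand-in
    renamed = []
    for dst, src1, src2 in instrs:
        p1, p2 = rat[src1], rat[src2]  # read sources via current mapping
        rat[dst] = next_phys           # fresh phys reg kills WAW/WAR hazards
        renamed.append((next_phys, p1, p2))
        next_phys += 1
    return renamed

# r1 = r2+r3 ; r1 = r1+r4 : the second write gets its own phys reg,
# but its source correctly reads the first instruction's result.
print(rename([(1, 2, 3), (1, 1, 4)]))
```

Checkpointing the RAT on each branch (as listed above) is then just saving a copy of this small map, which is why recovery does not need to wait for the branch to retire.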
Lecture 9
Caching Basics
- Caches
- Direct mapped caches
- Set associative caches
- Miss rate
- Average Memory Access Time
- Cache placement
- Cache replacement
- Handling writes in caches
- Inclusion/Exclusion
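Two of the bullets above, average memory access time and direct-mapped placement, reduce to short formulas. A minimal sketch (function names are mine; a standard offset/index/tag address split is assumed):

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average Memory Access Time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty

def dm_index_tag(addr, block_size, num_sets):
    """Direct-mapped placement: drop the block-offset bits, the low
    bits of the block address pick the set, the rest form the tag."""
    block = addr // block_size
    return block % num_sets, block // num_sets

# 1-cycle hit, 5% miss rate, 100-cycle miss penalty:
print(amat(1, 0.05, 100))  # → 6.0
# Address 0x1234 in a cache with 64 B blocks and 128 sets:
print(dm_index_tag(0x1234, 64, 128))
```

The same index computation is what set-associative caches reuse, except that the chosen set holds several ways and replacement picks among them.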
Lecture 10
Runahead and MLP
- Lack of temporal and spatial locality (long strides, large working sets)
- Stride prefetching (software vs hardware: complexity, dynamic information)
- Irregular accesses (hash tables, linked structures)
- Where OoO cannot really benefit (L2 cache misses; need large instruction windows)
- Run-ahead execution (Generating cache misses that can be serviced in parallel)
- Problems with run-ahead
- length of run-ahead
- what if new cache-miss is dependent on original miss?
- Branch mispredictions and miss dependent branches
- DRAM bank organization
- Tolerating memory latencies
- Caching
- Prefetching
- Multi-threading
- Out of order execution
- Fine-grained multi-threading (design and costs)
- Causes of inefficiency in Run-ahead (energy consumption)
- Breaking dependence
- address prediction (AVD prediction)
Lecture 11
OOO wrap-up and Advanced Caching
- Dual Core Execution (DCE)
- Comparison between run ahead and DCE
- Lag between the front and the back cores - controlled by result queue sizing
- Slipstreaming
- SMT Architectures for slipstreaming instead of 2 separate cores
- Result queue length in DCE
- Store-Load dependencies
- Store buffer design
- Content associative, age ordered list of stores
- Memory Disambiguation
- Load dependence/independence on previous stores
- Store/Load dependence prediction
- Speculative execution and data coherence
- Load buffer
- Research issues in OoO
- Scalable and energy-efficient instruction windows
- Packing more MLP into a small window
- OOO in Multi core systems
- Memory system contention - bigger issue
- Multiple cores to perform OOO
- Asymmetric Multi-cores
- Symmetric vs Asymmetric multi cores
- Accelerating critical sections
- Core fusion
- Inclusion/Exclusion
- Multi-level caching
Lecture 12
Advanced Caching
- Handling writes
- Write-back
- Write-through
- Write allocate/no allocate
- Instruction/Data Caching
- Cache Replacement Policies
- Random
- FIFO
- Least Recently Used
- Not Most Recently Used
- Least Frequently used
- LRU vs Random - Random is as good as LRU for most practical workloads
- Optimal Replacement Policy
- MLP aware cache replacement
- Cache Performance
- Reducing miss rate
- Reducing miss latency/cost
- Cache Parameters
- Cache size vs hit rate
- Block size
- Large Blocks - Critical words
- Large blocks - bandwidth wastage
- Sub blocking
- Associativity
- Power of 2 associativity?
- Hybrid Replacement policies
- Sampling based hybrid (random/LRU) replacement
- Cache Misses
- Compulsory
- Conflict
- Capacity
- Coherence
- Cache aware schedulers - cache affinity based application mapping
- Victim caches
- Hashing - Randomizing index functions
- Pseudo associativity - serial cache lookup
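The LRU replacement policy listed above can be sketched compactly for a fully associative cache. This is an illustrative model (class and field names are mine): hits refresh a line's recency, and on a miss with a full cache the least recently used line is evicted.

```python
from collections import OrderedDict

class LRUCache:
    """Fully associative cache with least-recently-used replacement.
    Keys are block addresses; the OrderedDict keeps newest at the end."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.lines = OrderedDict()
        self.hits = self.misses = 0

    def access(self, block):
        if block in self.lines:
            self.lines.move_to_end(block)   # refresh recency on a hit
            self.hits += 1
            return True
        self.misses += 1
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # evict the LRU line
        self.lines[block] = True
        return False

# Capacity 3, accesses A B C A D: the reuse of A saves it, so B is evicted.
c = LRUCache(3)
for b in "ABCAD":
    c.access(b)
print(sorted(c.lines))  # A, C, D remain
```

True LRU like this needs full recency ordering per set, which is why real set-associative caches often approximate it (e.g. not-most-recently-used, as in the bullets above).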
Lecture 13
More Caching
- Speculative partial tag comparison
- Skewed associative caches
- Randomizing the index for different ways
- Improving hit rate in software
- Loop interchange - Row major, Column major
- Blocking
- Loop fusion, Array merging
- Data structure layout - Packing frequently used fields in arrays
- Handling multiple outstanding misses
- Non blocking caches
- Miss Status Handling Registers (MSHR)
- Accessing MSHRs
- Reducing miss latency through software
- Compiler level reordering of loops
- Software prefetching
- Handling multiple accesses in a cycle
- True/Virtual multiporting
- Banking/Interleaving
Lecture 14
Prefetching
- Compulsory/conflict/capacity misses and prefetching
- Coherence misses and prefetching
- False sharing
- Word/byte based coherence
- Value prediction/Speculative execution
- Prefetching and correctness
- What/When/Where/How
- Accuracy
- Timeliness
- Coverage
- Prefetch buffers
- Skewing prefetches towards demand fetches
- Software/Hardware/Execution-based prefetchers
- Software prefetching
- Binding/Non-binding prefetches
- Prefetching during pointer chasing
- x86 prefetch instructions - prefetching into different levels of cache
- Handling of prefetches that cause TLB misses/page faults
- Compiler driven prefetching
- Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
- Hardware prefetching
- Next line prefetchers
- Stride prefetchers
- Instruction based stride prefetchers
- Stream buffers
- Locality based prefetching
- Prefetcher performance
- Accuracy
- Coverage
- Timeliness
- Aggressiveness
- Prefetcher distance
- Prefetcher degree
- Irregular prefetching
- Markov prefetchers
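The instruction-based stride prefetcher in the list above can be sketched as a small table indexed by load PC. This is an illustrative model (table format and confidence threshold are my assumptions): a prefetch is issued only after the same stride has repeated, and the degree controls how many blocks ahead are fetched.

```python
class StridePrefetcher:
    """Per-PC stride detection: track (last address, stride, confidence)
    for each load; prefetch once the stride has repeated twice."""
    def __init__(self, degree=1):
        self.table = {}      # pc -> (last_addr, stride, confidence)
        self.degree = degree  # how many blocks ahead to prefetch

    def access(self, pc, addr):
        last, stride, conf = self.table.get(pc, (addr, 0, 0))
        new_stride = addr - last
        # Build confidence only on a repeated, nonzero stride.
        conf = conf + 1 if new_stride == stride and stride != 0 else 0
        self.table[pc] = (addr, new_stride, conf)
        if conf >= 2:
            return [addr + new_stride * d for d in range(1, self.degree + 1)]
        return []

# A load at pc=1 streaming through 100, 108, 116, 124 trains the entry;
# the fourth access prefetches the next block at 132.
p = StridePrefetcher()
for a in (100, 108, 116, 124):
    out = p.access(1, a)
print(out)  # → [132]
```

Raising the degree and distance makes the prefetcher more aggressive, which trades accuracy (and bandwidth) for timeliness, exactly the tension in the performance bullets above.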
Lecture 15
Prefetching (wrap up)
- Power 4 System Microarchitecture (prefetchers) (IBM Journal of R & D)
- Irregular patterns (indirect array accesses, linked structures)
- Markov prefetching
- linked lists or trees
- Markov prefetchers vs stride prefetchers
- Content directed prefetching
- pointer based structures
- identifying pointers (software mechanism/hardware prediction)
- compiler analysis to provide hints for useful prefetches
- Hybrid prefetchers
- Execution based prefetchers
- pre-execution thread for creating prefetches for the main program
- determining when to start the pre-execution
- similar idea for branch prediction
Lecture 16
- Prefetching in multicores
- Importance of prefetch efficiency
- Issues with local prefetcher throttling
- Hierarchical prefetcher throttling
- Cache coherence
- Snoopy cache coherence
- Shared caches in multicores
- Utility based cache partitioning
Lecture 17
- Software based cache management
- Thread scheduling
- Page coloring
- Dynamic partitioning through page recoloring
- Cache placement
- Insertion
- Re-insertion
- Circular reference model
- Dynamic insertion policy - LRU and Bimodal insertion
Lecture 19
Main memory system
- Memory hierarchy
- SRAM/DRAM cell structures
- Memory bank organization
- Page mode DRAM
- Bank operation
- Basic DRAM operation
- Controller latency
- Bank latency
- DRAM chips/DIMMs
- DRAM channels
- Address mapping/interleaving
- Bank mapping randomization
- DRAM refresh
- DRAM controller issues
- Memory controller placement
Lecture 20
- DRAM controller functions
- Refresh
- Scheduling
- DRAM Scheduling policies
- FCFS
- FR-FCFS
- Row buffer management policies
- Open row
- Closed row
- DRAM controller design
- Machine learning - a possibility
- DRAM power states
- Inter-thread interference in DRAM
- Multi-core DRAM controllers
- Stall-time fairness
- Unfairness
- Estimating alone runtime
- Providing system software support
- Parallelism aware batch scheduling
- Request batching
- Within batch scheduling
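The FR-FCFS policy listed above fits in a few lines. A simplified sketch (the request representation is mine): among pending requests to a bank, row-buffer hits are served first; ties, and the no-hit case, fall back to oldest-first.

```python
def fr_fcfs(queue, open_row):
    """First-Ready, First-Come-First-Served for one DRAM bank.
    `queue` holds (arrival_order, row) pairs in arrival order.
    Prefer requests to the currently open row (row-buffer hits);
    otherwise serve the oldest request."""
    row_hits = [req for req in queue if req[1] == open_row]
    return row_hits[0] if row_hits else queue[0]

# Row 3 is open: the younger request to row 3 bypasses the older
# request to row 5, because a row hit avoids a precharge + activate.
queue = [(0, 5), (1, 3), (2, 3)]
print(fr_fcfs(queue, open_row=3))  # → (1, 3)
```

This row-hit preference is exactly what creates the inter-thread unfairness in the bullets above: a thread with high row-buffer locality can starve others, motivating stall-time-fair and parallelism-aware batch scheduling.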
Lecture 21
Super scalar processing I
- Types of parallelism
- Task
- Thread
- Instruction
- Fetch stage
- Instruction alignment Issue
- Solution - Split cache line fetch
- Fetch break - branches in the fetch block
- Fetch break solutions
- Short distance predicted-taken branch
- Basic block reordering
- Super block code optimization
- Trace cache
Lecture 22
Super scalar processing II
- Trace Caches
- Multiple branch prediction aliasing issue
- Inactive issue
- Promoting highly biased branches to static branches
- Saturating counters
- Fill unit optimizations
- Making highly biased branch paths atomic
- Redundancy - Solution : Block based trace cache
- An Enhanced Instruction cache vs a trace cache
- Pentium 4 trace cache
- Block structured ISA
- Enlarged block branches - faults
- Super block vs Block structured ISAs
- Decode in superscalar processing
- Predecoding
- Decode cache
- CISC to RISC translation in hardware
- Micro code sequencing
- Pentium 4 decoders
- Pentium Pro decoders
- simple decoders
- Complex decoder
- Micro op sequencer
- Instruction buffering fetch and decode
Lecture 23
Superscalar Processing III
- Renaming multiple instructions
- dependency check logic (n^2 comparators)
- help from compiler
- ensure instructions are independent (difficult for wide fetches)
- hardware-software co-design to simplify dependency logic
- Dispatching multiple instructions
- wakeup logic (compare all tags in reservation station with all the tags that are broadcast)
- select logic (hierarchical tree based selection)
- Execute
- enough execution units
- enough forwarding paths (broadcast tag/value to all functional units)
- Reducing dispatch+bypass delays
- clustering (divide window into multiple clusters)
- intra-cluster bypass is fast
- inter-cluster bypass can be slow
- Register file
- need multiple reads/writes per cycle
- Replicate or partition the register files
- using block-structured ISA
- Retirement
- updating architectural register map
Lecture 24
Control Flow
- Problem of branches
- Types
- conditional, unconditional, call, return, indirect branches
- Handling conditional branches
- Predicate combining
- condition codes vs condition registers
- Delayed branching
- Fine-grained multi-threading
- Branch prediction
- predicting if an instruction is a branch (predecoding)
- predicting the direction of the branch
- predicting the target address of a branch
- Static branch prediction
- always taken/not taken
- backward taken, forward not taken
- by compiler based on profiling
- Dynamic branch prediction
- last time predictor
- history based predictors
- two-level predictors
Lecture 25
Control Flow - II
- 2-bit counter based prediction
- Global branch prediction
- Global branch correlation
- Global two-level prediction
- Global history register
- Local two-level prediction
- Pattern history table
- Interference in the pattern history table
- Randomizing the index into the pattern history table
- Agree prediction
- Alpha 21264 Tournament Predictor
- Perceptron branch predictor
- Perceptron - learns a target boolean function of N inputs
- Call and Return Prediction
- Indirect branch prediction
- Virtual Conditional Branch prediction
- Branch prediction issues
- Need to know a branch as soon as it is fetched
- Latency
- State recovery upon misprediction
- Predicated execution
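Several bullets above, the 2-bit counters, the global history register, and the pattern history table, compose into one small model. This is an illustrative gshare-style sketch (sizes and the XOR indexing choice are mine; the lecture's global two-level scheme indexes the PHT with global history): 2-bit saturating counters predict taken when in the upper half of their range, and hashing the PC into the index reduces the PHT interference listed above.

```python
class TwoLevelPredictor:
    """Global two-level branch predictor: the global history register
    (GHR), XORed with the branch PC, indexes a pattern history table
    of 2-bit saturating counters (0-1 not taken, 2-3 taken)."""
    def __init__(self, history_bits=4):
        self.ghr = 0
        self.mask = (1 << history_bits) - 1
        self.pht = [1] * (1 << history_bits)  # init: weakly not-taken

    def _index(self, pc):
        return (self.ghr ^ pc) & self.mask    # gshare-style hashing

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | taken) & self.mask

# An always-taken branch trains the counters along its history path,
# after which the predictor says taken.
bp = TwoLevelPredictor()
for _ in range(8):
    bp.update(pc=0, taken=1)
print(bp.predict(pc=0))  # → True
```

The 2-bit counter's hysteresis is the point of the first bullet above: one anomalous outcome (e.g. a loop exit) flips the counter only one step, not the prediction.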
Lecture 26
Control Flow - III & Concurrency
- Predicated Execution
- Predication decisions at the compiler
- Rename stage modifications
- Limitations of predication
- Adaptivity
- Complex Control Flow Graphs
- ISA support
- Wish branches
- Wish jump/join
- Wish loop
- Wish branches vs Predicated Execution
- Wish branches vs Branch prediction
- Diverge-Merge Processor
- Dynamic-Hammock
- Multi-path Execution
- Research issues in control flow handling
- Hardware/software cooperation
- Fetch gating
- Recycling useful work done on wrong path
Concurrency
- Classification of machines
- SISD
- SIMD
- MIMD
- Decoupled Access/Execute
- Astronautics ZS-1
- Loop unrolling
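The loop unrolling bullet above can be shown concretely. A sketch (function name and unroll factor are mine): unrolling by 4 removes three quarters of the branch overhead, and keeping four independent partial sums breaks the serial dependence chain, exposing ILP that a superscalar or VLIW machine (next lecture) can exploit.

```python
def dot_unrolled(a, b):
    """Dot product with the loop unrolled 4x. The four partial sums
    s0..s3 are independent, so the multiplies and adds of one
    iteration group can execute in parallel."""
    n = len(a)
    s0 = s1 = s2 = s3 = 0
    i = 0
    while i + 4 <= n:
        s0 += a[i]     * b[i]
        s1 += a[i + 1] * b[i + 1]
        s2 += a[i + 2] * b[i + 2]
        s3 += a[i + 3] * b[i + 3]
        i += 4
    # Epilogue: handle the leftover iterations when n is not a multiple of 4.
    tail = sum(a[j] * b[j] for j in range(i, n))
    return s0 + s1 + s2 + s3 + tail
```

The cost is code size and a cleanup epilogue, which is one reason unrolling is usually a compiler decision driven by register pressure and I-cache footprint.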
Lecture 27
VLIW
- Each VLIW instruction - a bundle of independent instructions (identified by compiler)
- Each instruction bundle executed by hardware in lockstep
- Commercial VLIW machines
- TI C6000, Trimedia, STMicro
- Intel IA-64 - Partially VLIW
- Encoding VLIW NOPs
- Static Instruction Scheduling for VLIW
- Code motion - Safety & Legality
- Trace scheduling
- List scheduling
- Super block scheduling
- Hyperblock scheduling
- The Intel IA-64 architecture
- No lock step execution of a bundle
- Specify dependencies between instructions within a bundle
- Template bits
- What hinders static code motion?
- Exceptions
- Loads/Stores