Buzz Words

This is a list of keywords related to the topics covered in the class; they are meant to jog your memory. For a comprehensive list, please refer to the lecture notes and the notes you take during class.

Lecture 2

ISA, Trade-offs, Performance
  • ISA vs Microarchitecture
  • Latency vs Throughput trade-offs (bit-serial adders vs ripple-carry/carry-lookahead adders)
  • Speculation (branch prediction overview)
  • Superscalar processing (multiple instruction issue, VLIW)
  • Prefetching (briefly covered)
  • Power/clock gating (briefly covered)

Lecture 3

Performance
  • Addressing modes (Registers, Memory, Base-index-displacement)
  • 0/1/2/3 Address machines (accumulator machines, stack machines)
  • Aligned accesses (Hardware handled vs Software handled: trade-offs)
  • Transactional memory (overview)
  • Von Neumann model (Sequential instruction execution)
  • Data flow machines (Data driven execution)
  • Evaluating performance

Lecture 4

Pipelining
  • Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
  • Amdahl's law (see the worked formula after this list)
  • Pipelining vs Multi-cycle machines
  • Stalling (dependencies)
  • Data dependencies (true, output, anti)
  • Value prediction
  • Control dependencies
  • Branch prediction / Predication
  • Fine-grained multi-threading
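
For reference, Amdahl's law in its speedup form, with one made-up example plugged in (the 90%/10x numbers are illustrative, not from the lecture):

    \text{Speedup} = \frac{1}{(1 - p) + p/s},
    \qquad p = 0.9,\; s = 10 \;\Rightarrow\; \frac{1}{0.1 + 0.09} \approx 5.3

Here p is the fraction of execution time that benefits from the enhancement and s is the speedup of that fraction; the serial remainder (1 - p) bounds the overall gain.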

Lecture 5

Precise Exceptions
  • Need for separate instruction and data caches
  • Exceptions vs Interrupts
  • Reorder buffer/History buffer/Future files
  • Impact of size of architectural register file
  • Checkpointing
  • Reservation stations
  • Precise exceptions

Lecture 6

Virtual Memory
  • Problems in early machines (size, protection, relocation, contiguity, sharing)
  • Segmentation (solves some of these problems)
  • Virtual memory (high level)
  • Implementing translations (page tables, inverted page tables; see the sketch after this list)
  • TLBs (caching translations)
  • Virtually-tagged caches and Physically-tagged caches
  • Address space identifiers
  • Sharing (overview)
  • Super pages (overview)
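
A minimal sketch of the translation path (TLB first, then the page table), assuming a flat one-level table, 4 KB pages, and made-up mappings; real page tables are multi-level or inverted, as covered in the lecture:

    PAGE_SHIFT = 12                      # 4 KB pages (assumed)
    PAGE_MASK = (1 << PAGE_SHIFT) - 1

    page_table = {0x00400: 0x12345,      # VPN -> PFN (hypothetical contents)
                  0x00401: 0x0BEEF}
    tlb = {}                             # caches recent VPN -> PFN translations

    def translate(vaddr):
        vpn, offset = vaddr >> PAGE_SHIFT, vaddr & PAGE_MASK
        if vpn in tlb:                   # TLB hit: no page-table walk needed
            pfn = tlb[vpn]
        elif vpn in page_table:          # TLB miss: walk the table, fill the TLB
            pfn = tlb[vpn] = page_table[vpn]
        else:
            raise RuntimeError("page fault: VPN %#x not mapped" % vpn)
        return (pfn << PAGE_SHIFT) | offset

    print(hex(translate(0x00400ABC)))    # -> 0x12345abc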

Lecture 7

Out-of-Order Execution
  • Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
  • Superscalar vs Out-of-Order execution
  • Precise exceptions (summary)
  • Branch prediction vs exceptions
  • Out-of-Order execution (preventing dispatch stalls)
  • Tomasulo's algorithm
  • Dynamic data-flow graph generation

Lecture 8

Exploiting ILP
  • Overview of Tomasulo's algorithm
  • Problems with increasing the instruction window to the entire program
    1. number of comparators
    2. ROB size - instruction window
  • Implementation issues in out-of-order processing (decoupling structures)
    1. physical register file (removes the need to store/broadcast values in reservation stations and to keep a separate architectural register file)
    2. Register alias tables and the architectural register alias table (see the renaming sketch after this list)
  • Handling branch misprediction (don't have to wait for the branch to become the oldest)
    1. checkpoint the register alias table on each branch
    2. what if we have a lot of branches? (confidence estimation)
  • Handling stores
    1. store buffers
    2. load searches store buffer/cache
    3. what if addresses are not yet computed for some stores? (unknown address problem)
    4. how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
  • Problems with physical register files
    1. register file read always on the critical path
  • Multiple instruction issue
  • Distributed/Centralized reservation stations
  • Scheduling
    1. Design Issues
  • Dependents on a load miss
    1. FIFO for load dependents
    2. Register deallocation
  • Latency tolerance of out-of-order processors
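
A minimal sketch of renaming through a register alias table backed by a physical register file, per the decoupling bullets above; the three architectural registers, the tag names, and the free-list size are assumptions for illustration:

    from collections import deque

    rat = {"r1": "p0", "r2": "p1", "r3": "p2"}      # arch reg -> phys reg tag
    free_list = deque("p%d" % i for i in range(3, 8))

    def rename(dst, src1, src2):
        ps1, ps2 = rat[src1], rat[src2]             # sources read current mappings
        pd = free_list.popleft()                    # destination gets a fresh tag
        rat[dst] = pd                               # later readers see the new tag
        return pd, ps1, ps2

    # r1 = r2 + r3, then r2 = r1 + r3: the second instruction's r1 source
    # picks up the renamed tag, and WAR/WAW hazards on r1/r2 disappear.
    print(rename("r1", "r2", "r3"))                 # ('p3', 'p1', 'p2')
    print(rename("r2", "r1", "r3"))                 # ('p4', 'p3', 'p2')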

Lecture 9

Caching Basics
  • Caches
  • Direct mapped caches
  • Set associative caches
  • Miss rate
  • Average Memory Access Time (see the sketch after this list)
  • Cache placement
  • Cache replacement
  • Handling writes in caches
  • Inclusion/Exclusion
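
Two of the bullets above condensed into a sketch; the latency numbers and the 64 B-block, 1024-set direct-mapped geometry are invented for illustration:

    def amat(hit_time, miss_rate, miss_penalty):
        # Average Memory Access Time = hit time + miss rate * miss penalty
        return hit_time + miss_rate * miss_penalty

    print(amat(hit_time=1, miss_rate=0.05, miss_penalty=100))   # 6.0 cycles

    # Direct-mapped placement: the block address splits into tag | index | offset,
    # and each block can live in exactly one set.
    BLOCK_BITS, NUM_SETS = 6, 1024                # 64 B blocks, 1024 sets (assumed)

    def placement(addr):
        block = addr >> BLOCK_BITS
        return block % NUM_SETS, block // NUM_SETS    # (set index, tag)

    print(placement(0x12345678))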

Lecture 10

Runahead and MLP
  • Lack of temporal and spatial locality (long strides, large working sets)
  • Stride prefetching (software vs hardware: complexity, dynamic information)
  • Irregular accesses (hash tables, linked structures)
  • Where OoO cannot really benefit (L2 cache misses; would need large instruction windows)
  • Run-ahead execution (Generating cache misses that can be serviced in parallel)
  • Problems with run-ahead
    1. length of run-ahead
    2. what if new cache-miss is dependent on original miss?
    3. Branch mispredictions and miss dependent branches
  • DRAM bank organization
  • Tolerating memory latencies
    1. Caching
    2. Prefetching
    3. Multi-threading
    4. Out-of-order execution
  • Fine-grained multi-threading (design and costs)
  • Causes of inefficiency in Run-ahead (energy consumption)
  • Breaking dependence
    1. address prediction (AVD prediction)

Lecture 11

OoO wrap-up and Advanced Caching
  • Dual Core Execution (DCE)
  • Comparison between run-ahead and DCE
    1. Lag between the front and the back cores - controlled by result queue sizing
  • Slipstreaming
  • SMT Architectures for slipstreaming instead of 2 separate cores
  • Result queue length in DCE
  • Store-Load dependencies
  • Store buffer design
    1. Content-associative, age-ordered list of stores (see the sketch after this list)
  • Memory Disambiguation
    1. Load dependence/independence on previous stores
    2. Store/Load dependence prediction
  • Speculative execution and data coherence
    1. Load buffer
  • Research issues in OoO
    1. Scalable and energy-efficient instruction windows
    2. Packing more MLP into a small window
  • OoO in multi-core systems
    1. Memory system contention - a bigger issue
    2. Multiple cores to perform OoO
    3. Asymmetric multi-cores
  • Symmetric vs Asymmetric multi-cores
    1. Accelerating critical sections
    2. Core fusion
  • Inclusion/Exclusion
  • Multi-level caching
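
A minimal sketch of the store buffer search a load performs (youngest matching store wins), assuming all store addresses are already known; handling stores with uncomputed addresses (the unknown-address problem from Lecture 8) is deliberately left out:

    store_buffer = []                  # age-ordered, oldest first: (address, value)
    memory = {0x200: 7}                # stand-in for the cache/memory hierarchy

    def store(addr, value):
        store_buffer.append((addr, value))       # written to memory at retirement

    def load(addr):
        # Associative search, youngest to oldest, so the latest store forwards.
        for a, v in reversed(store_buffer):
            if a == addr:
                return v                         # store-to-load forwarding
        return memory.get(addr, 0)               # no match: read the cache/memory

    store(0x100, 1)
    store(0x100, 2)
    print(load(0x100))   # 2: forwarded from the youngest matching store
    print(load(0x200))   # 7: no matching store, falls through to memory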

Lecture 12

Advanced Caching
  • Handling writes
    1. Write-back
    2. Write-through
    3. Write allocate/no allocate
  • Instruction/Data Caching
  • Cache Replacement Policies
    1. Random
    2. FIFO
    3. Least Recently Used (see the sketch after this list)
    4. Not Most Recently Used
    5. Least Frequently Used
  • LRU vs Random - Random is as good as LRU for most practical workloads
  • Optimal Replacement Policy
  • MLP aware cache replacement
  • Cache Performance
    1. Reducing miss rate
    2. Reducing miss latency/cost
  • Cache Parameters
    1. Cache size vs hit rate
    2. Block size
    3. Large blocks - critical word first
    4. Large blocks - bandwidth wastage
    5. Sub-blocking
    6. Associativity
    7. Power of 2 associativity?
    8. Hybrid Replacement policies
    9. Sampling based hybrid (random/LRU) replacement
  • Cache Misses
    1. Compulsory
    2. Conflict
    3. Capacity
    4. Coherence
  • Cache aware schedulers - cache affinity based application mapping
  • Victim caches
  • Hashing - Randomizing index functions
  • Pseudo associativity - serial cache lookup
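
A minimal sketch of LRU replacement for a single set, treating an ordered dict as the recency stack; the 4-way associativity and the tag stream are made up:

    from collections import OrderedDict

    WAYS = 4
    lru_set = OrderedDict()                      # most recently used tag is last

    def access(tag):
        if tag in lru_set:
            lru_set.move_to_end(tag)             # hit: promote to MRU
            return "hit"
        if len(lru_set) == WAYS:
            lru_set.popitem(last=False)          # full set: evict the LRU block
        lru_set[tag] = True                      # fill the new block as MRU
        return "miss"

    for tag in (1, 2, 3, 4, 1, 5, 2):
        print(tag, access(tag))   # 5 evicts 2, so the final access to 2 misses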

Lecture 13

More Caching
  • Speculative partial tag comparison
  • Skewed associative caches
    1. Randomizing the index for different ways
  • Improving hit rate in software
    1. Loop interchange - Row major, Column major
    2. Blocking (see the sketch after this list)
    3. Loop fusion, Array merging
    4. Data structure layout - Packing frequently used fields in arrays
  • Handling multiple outstanding misses
    1. Non blocking caches
    2. Miss Status Handling Registers (MSHR)
    3. Accessing MSHRs
  • Reducing miss latency through software
    1. Compiler level reordering of loops
    2. Software prefetching
  • Handling multiple accesses in a cycle
    1. True/Virtual multiporting
    2. Banking/Interleaving
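
A minimal sketch of blocking (tiling) from the hit-rate list above, shown on a simple 2D sum; the 8x8 matrix and 4x4 tile are arbitrary. The inner loops walk j fastest, which in row-major layout is also what loop interchange aims for:

    N, B = 8, 4                                    # matrix and tile sizes (assumed)
    a = [[i * N + j for j in range(N)] for i in range(N)]

    def sum_blocked(a):
        total = 0
        for ii in range(0, N, B):                  # visit one B x B tile at a time
            for jj in range(0, N, B):              # so its working set stays cached
                for i in range(ii, ii + B):
                    for j in range(jj, jj + B):    # row-major friendly order
                        total += a[i][j]
        return total

    assert sum_blocked(a) == sum(sum(row) for row in a)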

Lecture 14

Prefetching
  • Compulsory/conflict/capacity misses and prefetching
  • Coherence misses and prefetching
  • False sharing
    1. Word/byte based coherence
    2. Value prediction/Speculative execution
  • Prefetching and correctness
  • What/When/Where/How
    1. Accuracy
    2. Timeliness
    3. Coverage
    4. Prefetch buffers
    5. Skewing prefetches towards demand fetches
    6. Software/Hardware/Execution-based prefetchers
  • Software prefetching
    1. Binding/Non-binding prefetches
    2. Prefetching during pointer chasing
    3. x86 prefetch instructions - prefetching into different levels of cache
    4. Handling of prefetches that cause TLB misses/page faults
    5. Compiler driven prefetching
    6. Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
  • Hardware prefetching
    1. Next line prefetchers
    2. Stride prefetchers
    3. Instruction-based stride prefetchers (see the sketch after this list)
    4. Stream buffers
    5. Locality based prefetching
  • Prefetcher performance
    1. Accuracy
    2. Coverage
    3. Timeliness
    4. Aggressiveness
      1. Prefetcher distance
      2. Prefetcher degree
  • Irregular prefetching
    1. Markov prefetchers
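
A minimal sketch of an instruction-based stride prefetcher: a table indexed by the load's PC tracks the last address and stride, and prefetches once the stride repeats. The table format, the two-strides-to-train rule, and the prefetch degree of one are simplifying assumptions:

    table = {}    # PC -> (last_addr, stride, confident)

    def train_and_prefetch(pc, addr):
        last, stride, confident = table.get(pc, (None, 0, False))
        if last is not None:
            new_stride = addr - last
            confident = (new_stride == stride)   # same stride twice in a row
            stride = new_stride
        table[pc] = (addr, stride, confident)
        if confident and stride != 0:
            return addr + stride                 # prefetch the predicted next access
        return None

    for addr in (0x1000, 0x1040, 0x1080, 0x10C0):
        print(hex(addr), train_and_prefetch(pc=0x400, addr=addr))
    # after two equal strides of 0x40, every access prefetches addr + 0x40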

Lecture 15

Prefetching (wrap up)
  • POWER4 system microarchitecture (prefetchers) (IBM Journal of R&D)
  • Irregular patterns (indirect array accesses, linked structures)
    1. Markov prefetching (see the sketch after this list)
      1. linked lists or trees
      2. Markov prefetchers vs stride prefetchers
  • Content directed prefetching
    1. pointer based structures
    2. identifying pointers (software mechanism/hardware prediction)
    3. compiler analysis to provide hints for useful prefetches
  • Hybrid prefetchers
  • Execution based prefetchers
    1. pre-execution thread for creating prefetches for the main program
    2. determining when to start the pre-execution
    3. similar idea for branch prediction
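
A minimal sketch of Markov prefetching for linked structures: remember which miss addresses historically followed which, and prefetch the recorded successors on a repeat. Keeping every successor with equal weight is a simplification; real proposals store a few, ranked by probability:

    from collections import defaultdict

    successors = defaultdict(list)     # miss address -> misses seen right after it
    last_miss = None

    def on_miss(addr):
        global last_miss
        if last_miss is not None and addr not in successors[last_miss]:
            successors[last_miss].append(addr)    # learn the miss correlation
        last_miss = addr
        return list(successors[addr])             # candidates to prefetch now

    for addr in (0x10, 0x58, 0xA3, 0x10, 0x58):   # a repeating pointer chase
        print(hex(addr), [hex(p) for p in on_miss(addr)])
    # the second miss on 0x10 prefetches 0x58, exactly the next miss in the chain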

Lecture 16

  • Prefetching in multicores
  • Importance of prefetch efficiency
  • Issues with local prefetcher throttling
  • Hierarchical prefetcher throttling
  • Cache coherence
    1. Snoopy cache coherence
  • Shared caches in multicores
  • Utility based cache partitioning

Lecture 17

  • Software based cache management
    1. Thread scheduling
    2. Page coloring
    3. Dynamic partitioning through page recoloring
  • Cache placement
    1. Insertion
    2. Re-insertion
    3. Circular reference model
    4. Dynamic insertion policy - LRU and Bimodal insertion

Lecture 19

Main memory system
  • Memory hierarchy
  • SRAM/DRAM cell structures
  • Memory bank organization
  • Page mode DRAM
    1. Bank operation
  • Basic DRAM operation
    1. Controller latency
    2. Bank latency
  • DRAM chips/DIMMs
  • DRAM channels
  • Address mapping/interleaving (see the sketch after this list)
  • Bank mapping randomization
  • DRAM refresh
  • DRAM controller issues
    1. Memory controller placement
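
A minimal sketch of address mapping: slicing a physical address into column/channel/bank/row fields, lowest bits first. The field order and widths here are invented; real controllers pick (and sometimes randomize) these bits to spread access streams across banks and channels:

    FIELDS = (("col", 7), ("channel", 1), ("bank", 3), ("row", 14))  # assumed widths

    def map_address(addr):
        out = {}
        for name, width in FIELDS:               # peel fields off the low end
            out[name] = addr & ((1 << width) - 1)
            addr >>= width
        return out

    print(map_address(0x3FAB1C40))               # field breakdown for one address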

Lecture 20

  • DRAM controller functions
    1. Refresh
    2. Scheduling
  • DRAM Scheduling policies
    1. FCFS
    2. FR-FCFS (see the sketch after this list)
  • Row buffer management policies
    1. Open row
    2. Closed row
  • DRAM controller design
    1. Machine learning - a possibility
  • DRAM power states
  • Inter-thread interference in DRAM
  • Multi-core DRAM controllers
    1. Stall-time fairness
      1. Unfairness
      2. Estimating alone runtime
      3. Providing system software support
    2. Parallelism aware batch scheduling
      1. Request batching
      2. Within batch scheduling
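
A minimal sketch of FR-FCFS: scan the age-ordered queue for a row hit first, and fall back to the oldest request. The (bank, row) request format and the open-row bookkeeping are simplifications:

    open_row = {0: 7}                    # bank -> currently open row (assumed state)

    def pick_request(queue):
        # queue is age-ordered: index 0 holds the oldest request
        for i, (bank, row) in enumerate(queue):
            if open_row.get(bank) == row:
                return queue.pop(i)      # first ready: a row hit skips precharge/activate
        return queue.pop(0)              # no row hit: oldest request first (FCFS)

    queue = [(0, 5), (1, 9), (0, 7)]
    print(pick_request(queue))   # (0, 7): youngest request, but it hits the open row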

Lecture 21

Superscalar Processing I
  • Types of parallelism
    1. Task
    2. Thread
    3. Instruction
  • Fetch stage
    1. Instruction alignment issue
    2. Solution - Split cache line fetch
    3. Fetch break - branches in the fetch block
  • Fetch break solutions
    1. Short distance predicted-taken branch
    2. Basic block reordering
    3. Super block code optimization
    4. Trace cache

Lecture 22

Superscalar Processing II
  • Trace Caches
    1. Multiple branch prediction aliasing issue
    2. Inactive issue
    3. Promoting highly biased branches to static branches
    4. Saturating counters
    5. Fill unit optimizations
      1. Making highly biased branch paths atomic
    6. Redundancy - solution: block-based trace cache
    7. An Enhanced Instruction cache vs a trace cache
    8. Pentium 4 trace cache
  • Block structured ISA
    1. Enlarged block branches - faults
    2. Super block vs Block structured ISAs
  • Decode in superscalar processing
    1. Predecoding
    2. Decode cache
    3. CISC to RISC translation in hardware
    4. Microcode sequencing
    5. Pentium 4 decoders
    6. Pentium Pro decoders
      1. Simple decoders
      2. Complex decoder
      3. Micro-op sequencer
    7. Instruction buffering between fetch and decode

Lecture 23

Superscalar Processing III
  • Renaming multiple instructions
    1. dependency check logic (n^2 comparators)
    2. help from compiler
      • ensure instructions are independent (difficult for wide fetches)
      • hardware-software co-design to simplify dependency logic
  • Dispatching multiple instructions
    1. wakeup logic (compare all tags in reservation station with all the tags that are broadcast)
    2. select logic (hierarchical tree based selection)
  • Execute
    1. enough execution units
    2. enough forwarding paths (broadcast tag/value to all functional units)
  • Reducing dispatch+bypass delays
    1. clustering (divide window into multiple clusters)
    2. intra-cluster bypass is fast
    3. inter-cluster bypass can be slow
  • Register file
    1. need multiple reads/writes per cycle
    2. Replicate or partition the register files
    3. using block-structured ISA
  • Retirement
    1. updating architectural register map

Lecture 24

Control Flow
  • Problem of branches
  • Types
    • conditional, unconditional, call, return, indirect branches
  • Handling conditional branches
  • Predicate combining
    • condition codes vs condition registers
  • Delayed branching
  • Fine-grained multi-threading
  • Branch prediction
    • predicting if an instruction is a branch (predecoding)
    • predicting the direction of the branch
    • predicting the target address of a branch
  • Static branch prediction
    • always taken/not taken
    • backward taken, forward not taken
    • by compiler based on profiling
  • Dynamic branch prediction
    • last time predictor
    • history based predictors
    • two-level predictors

Lecture 25

Control Flow - II
  • 2-bit counter based prediction (combined with global history in the sketch after this list)
  • Global branch prediction
  • Global branch correlation
  • Global two-level prediction
    1. Global history register
  • Local two-level prediction
    1. Pattern history table
    2. Interference in the pattern history table
      1. Randomizing the index into the pattern history table
      2. Agree prediction
  • Alpha 21264 Tournament Predictor
  • Perceptron branch predictor
    1. Perceptron - learns a target Boolean function of N inputs
  • Call and Return Prediction
  • Indirect branch prediction
    1. Virtual Conditional Branch prediction
  • Branch prediction issues
    1. Need to know a branch as soon as it is fetched
    2. Latency
    3. State recovery upon misprediction
  • Predicated execution
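
A minimal sketch tying the pieces above together: a global history register XORed with the PC (gshare-style) indexes a pattern history table of 2-bit saturating counters. The table size, history length, and XOR indexing are assumptions about one common organization, not the only one from the lecture:

    PHT_BITS, HIST_LEN = 10, 10
    pht = [1] * (1 << PHT_BITS)          # 2-bit counters, start weakly not-taken
    ghr = 0                              # global history register

    def index(pc):
        return (pc ^ ghr) & ((1 << PHT_BITS) - 1)   # history randomizes the index

    def predict(pc):
        return pht[index(pc)] >= 2                  # counter value 2 or 3 means taken

    def update(pc, taken):
        global ghr
        i = index(pc)
        pht[i] = min(pht[i] + 1, 3) if taken else max(pht[i] - 1, 0)
        ghr = ((ghr << 1) | taken) & ((1 << HIST_LEN) - 1)   # shift in the outcome

    for outcome in (1, 1, 1, 0, 1):      # a mostly-taken branch at one PC
        print(predict(0x400))
        update(0x400, outcome)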

Lecture 26

Control Flow - III & Concurrency
  • Predicated Execution
    1. Predication decisions at the compiler
    2. Rename stage modifications
  • Limitations of predication
    1. Adaptivity
    2. Complex Control Flow Graphs
    3. ISA support
  • Wish branches
    1. Wish jump/join
    2. Wish loop
  • Wish branches vs Predicated Execution
  • Wish branches vs Branch prediction
  • Diverge-Merge Processor
  • Dynamic-Hammock
  • Multi-path Execution
  • Research issues in control flow handling
    1. Hardware/software cooperation
    2. Fetch gating
    3. Recycling useful work done on wrong path

Concurrency

  • Classification of machines
    1. SISD
    2. SIMD
    3. MIMD
  • Decoupled Access/Execute
  • Astronautics ZS-1
  • Loop unrolling
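
A minimal sketch of loop unrolling on a dot product, unrolled by four with separate partial sums so the multiply-adds are independent; the unroll factor and the assumption that the length is a multiple of four are simplifications:

    def dot_unrolled(x, y):
        s0 = s1 = s2 = s3 = 0.0
        for i in range(0, len(x), 4):    # one branch/induction update per 4 elements
            s0 += x[i]     * y[i]        # four independent chains the hardware
            s1 += x[i + 1] * y[i + 1]    # (or a VLIW compiler) can overlap
            s2 += x[i + 2] * y[i + 2]
            s3 += x[i + 3] * y[i + 3]
        return s0 + s1 + s2 + s3

    print(dot_unrolled([1.0] * 8, [2.0] * 8))   # 16.0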

Lecture 27

VLIW
  • Each VLIW instruction - a bundle of independent instructions (identified by the compiler)
  • Each instruction bundle is executed by the hardware in lockstep
  • Commercial VLIW machines
    1. TI C6000, TriMedia, STMicro
  • Intel IA-64 - Partially VLIW
  • Encoding VLIW NOPs
  • Static Instruction Scheduling for VLIW
  • Code motion - Safety & Legality
  • Trace scheduling
  • List scheduling
  • Super block scheduling
  • Hyperblock scheduling
  • The Intel IA-64 architecture
    1. No lock step execution of a bundle
    2. Specify dependencies between instructions within a bundle
    3. Template bits
  • What hinders static code motion?
    1. Exceptions
    2. Loads/Stores
