

Buzz Words

This is a list of keywords related to the topics covered in the class, meant to jog your memory. For a comprehensive list, please refer to the lecture notes and the notes you take during class.

Lecture 2

ISA, Trade-offs, Performance
  • ISA vs Microarchitecture
  • Latency vs Throughput trade-offs (bit-serial adders vs carry ripple/look-ahead adders)
  • Speculation (branch prediction overview)
  • Superscalar processing (multiple instruction issue, VLIW)
  • Prefetching (briefly covered)
  • Power/clock gating (briefly covered)
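The latency vs throughput trade-off between serial and parallel adders can be illustrated with a ripple-carry sketch: the carry must ripple through every bit position, so latency grows linearly with operand width (the width and function name here are illustrative, not from the lecture):

```python
# Sketch of a ripple-carry adder: each loop iteration models one full
# adder, and the carry ripples through all n bit positions in sequence.

def ripple_add(a, b, n=8):
    carry, result = 0, 0
    for i in range(n):                      # one full adder per bit
        ai, bi = (a >> i) & 1, (b >> i) & 1
        s = ai ^ bi ^ carry                 # sum bit
        carry = (ai & bi) | (carry & (ai ^ bi))
        result |= s << i
    return result & ((1 << n) - 1)          # wrap at n bits

print(ripple_add(100, 55))  # 155, computed one bit (and one carry) at a time
```

A carry look-ahead adder computes carries in parallel instead, trading more hardware for lower latency; a bit-serial adder goes the other way, reusing one full adder over many cycles.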

Lecture 3

Performance
  • Addressing modes (Registers, Memory, Base-index-displacement)
  • 0/1/2/3 Address machines (accumulator machines, stack machines)
  • Aligned accesses (Hardware handled vs Software handled: trade-offs)
  • Transactional memory (overview)
  • Von Neumann model (Sequential instruction execution)
  • Data flow machines (Data driven execution)
  • Evaluating performance
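The performance-evaluation discussion usually rests on the iron law of processor performance; as a reminder:

```latex
\text{Execution time} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}
```

The ISA, the microarchitecture, and the circuit technology each primarily influence one of the three factors, which is why comparing machines on any single factor (e.g. clock frequency alone) can mislead.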

Lecture 4

Pipelining
  • Performance metrics (Instructions per second, FLOPS, SPEC/MHz)
  • Amdahl's law
  • Pipelining vs Multi-cycle machines
  • Stalling (dependencies)
  • Data dependencies (true, output, anti)
  • Value prediction
  • Control dependencies
  • Branch prediction / Predication
  • Fine-grained multi-threading
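Amdahl's law, listed above, bounds the speedup from accelerating only part of a program. If a fraction $f$ of the execution time is sped up by a factor $s$:

```latex
\text{Speedup} = \frac{1}{(1 - f) + \dfrac{f}{s}}
```

For example, with $f = 0.9$ and $s = 10$, the speedup is $1/(0.1 + 0.09) \approx 5.26$, far below 10: the serial 10% dominates.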

Lecture 5

Precise Exceptions
  • Need for separate instruction and data caches
  • Exceptions vs Interrupts
  • Reorder buffer/History buffer/Future files
  • Impact of size of architectural register file
  • Checkpointing
  • Reservation stations
  • Precise exceptions
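The reorder buffer idea above can be sketched in a few lines: instructions may finish out of order, but they update architectural state only when they reach the ROB head, which is what makes exceptions precise (the entry format below is illustrative):

```python
# Sketch of in-order retirement from a reorder buffer (ROB).
from collections import deque

def retire_ready(rob):
    """rob: deque of dicts oldest-first with a 'done' flag.
    Pops and returns the names of instructions retired this cycle."""
    retired = []
    while rob and rob[0]["done"]:       # retire only from the head
        retired.append(rob.popleft()["name"])
    return retired

rob = deque([{"name": "i1", "done": True},
             {"name": "i2", "done": False},
             {"name": "i3", "done": True}])   # i3 finished early
print(retire_ready(rob))  # ['i1']: i3 must wait behind unfinished i2
```

If i2 raises an exception, i3 has not yet touched architectural state, so the machine can roll back cleanly to the excepting instruction.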

Lecture 6

Virtual Memory
  • Problems in early machines (size, protection, relocation, contiguity, sharing)
  • Segmentation (solves some of these problems)
  • Virtual memory (high level)
  • Implementing translations (page tables, inverted page tables)
  • TLBs (caching translations)
  • Virtually-tagged caches and Physically-tagged caches
  • Address space identifiers
  • Sharing (overview)
  • Super pages (overview)
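The page-table translation step above can be sketched as follows, assuming 4 KiB pages and a flat (one-level) table; real implementations use multi-level or inverted tables, and the mappings here are made up for illustration:

```python
# Minimal sketch of virtual-to-physical address translation.
PAGE_SIZE = 4096           # 4 KiB pages -> 12 offset bits
OFFSET_BITS = 12

# Hypothetical page table: virtual page number -> physical frame number
page_table = {0x12: 0x345, 0x13: 0xABC}

def translate(vaddr):
    vpn = vaddr >> OFFSET_BITS          # upper bits index the table
    offset = vaddr & (PAGE_SIZE - 1)    # lower bits pass through
    if vpn not in page_table:
        raise KeyError("page fault: VPN 0x%x not mapped" % vpn)
    return (page_table[vpn] << OFFSET_BITS) | offset

print(hex(translate(0x12ABC)))  # 0x345abc: frame 0x345, offset 0xABC
```

A TLB simply caches recent (vpn, frame) pairs so most translations skip the table walk.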

Lecture 7

Out-of-Order Execution
  • Caches (direct-mapped vs fully associative, Virtually-indexed physically tagged caches, page coloring)
  • Superscalar vs Out-of-Order execution
  • Precise exceptions (summary)
  • Branch prediction vs exceptions
  • Out-of-Order execution (preventing dispatch stalls)
  • Tomasulo's algorithm
  • Dynamic data-flow graph generation

Lecture 8

Exploiting ILP
  • Overview of Tomasulo's algorithm
  • Problems with increasing instruction window to entire program
    1. number of comparators
    2. ROB size - instruction window
  • Implementation issues in out-of-order processing (decoupling structures)
    1. physical register file (removing the need to store/broadcast values in reservation station and storing architectural reg file)
    2. Register alias tables and architectural register alias table
  • Handling branch misprediction (don't have to wait for the branch to become the oldest instruction)
    1. checkpoint the register alias table on each branch
    2. what if we have a lot of branches? (confidence estimation)
  • Handling stores
    1. store buffers
    2. load searches store buffer/cache
    3. what if addresses are not yet computed for some stores? (unknown address problem)
    4. how to detect that a load is dependent on some store that hasn't computed an address yet? (load buffer)
  • Problems with physical register files
    1. register file read always on the critical path
  • Multiple instruction issue
  • Distributed/Centralized reservation stations
  • Scheduling
    1. Design Issues
  • Dependents on a load miss
    1. FIFO for load dependents
    2. Register deallocation
  • Latency tolerance of out of order processors

Lecture 9

Caching Basics
  • Caches
  • Direct mapped caches
  • Set associative caches
  • Miss rate
  • Average Memory Access Time
  • Cache placement
  • Cache replacement
  • Handling writes in caches
  • Inclusion/Exclusion
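The Average Memory Access Time metric listed above combines hit time, miss rate, and miss penalty:

```latex
\text{AMAT} = \text{Hit time} + \text{Miss rate} \times \text{Miss penalty}
```

For example, a cache with a 1-cycle hit time, 5% miss rate, and 100-cycle miss penalty gives an AMAT of $1 + 0.05 \times 100 = 6$ cycles, so either halving the miss rate or halving the miss penalty has a far larger effect than shaving the hit time.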

Lecture 10

Runahead and MLP
  • Lack of temporal and spatial locality (long strides, large working sets)
  • Stride prefetching (software vs hardware: complexity, dynamic information)
  • Irregular accesses (hash tables, linked structures?)
  • Where OoO cannot really benefit (L2 cache misses; need large instruction windows)
  • Run-ahead execution (Generating cache misses that can be serviced in parallel)
  • Problems with run-ahead
    1. length of run-ahead
    2. what if new cache-miss is dependent on original miss?
    3. Branch mispredictions and miss dependent branches
  • DRAM bank organization
  • Tolerating memory latencies
    1. Caching
    2. Prefetching
    3. Multi-threading
    4. Out of order execution
  • Fine-grained multi-threading (design and costs)
  • Causes of inefficiency in Run-ahead (energy consumption)
  • Breaking dependence
    1. address prediction (AVD prediction)
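The stride prefetching idea above can be sketched as a PC-indexed table: for each load instruction, remember the last address and last stride, and issue a prefetch once the same stride repeats (the table format is illustrative; real designs also track confidence and prefetch distance):

```python
# Minimal sketch of a PC-indexed stride prefetcher.
class StridePrefetcher:
    def __init__(self):
        self.table = {}   # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Return a prefetch address if a stable stride is detected."""
        prefetch = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride != 0 and stride == last_stride:
                prefetch = addr + stride   # stride confirmed, run ahead
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetch

pf = StridePrefetcher()
for a in (100, 164, 228):
    hint = pf.access(pc=0x400, addr=a)
print(hint)  # 292: a stride of 64 repeated, so prefetch 228 + 64
```

This is what a hardware stride prefetcher learns dynamically; a compiler doing software prefetching would instead insert prefetch instructions using statically known strides.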

Lecture 11

OOO wrap-up and Advanced Caching
  • Dual Core Execution (DCE)
  • Comparison between run ahead and DCE
    1. Lag between the front and the back cores - controlled by result queue sizing
  • Slipstreaming
  • SMT Architectures for slipstreaming instead of 2 separate cores
  • Result queue length in DCE
  • Store-Load dependencies
  • Store buffer design
    1. Content associative, age ordered list of stores
  • Memory Disambiguation
    1. Load dependence/independence on previous stores
    2. Store/Load dependence prediction
  • Speculative execution and data coherence
    1. Load buffer
  • Research issues in OoO
    1. Scalable and energy-efficient instruction windows
    2. Packing more MLP into a small window
  • OOO in Multi core systems
    1. Memory system contention - bigger issue
    2. Multiple cores to perform OOO
    3. Asymmetric Multi-cores
  • Symmetric vs Asymmetric multi cores
    1. Accelerating critical sections
    2. Core fusion
  • Inclusion/Exclusion
  • Multi-level caching
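The store buffer design above (a content-associative, age-ordered list of stores) can be sketched by how a load searches it: the load must receive the value of the *youngest* older store to the same address, else it falls through to the cache (the request format here is assumed for illustration):

```python
# Sketch of store-to-load forwarding from an age-ordered store buffer.
def load_forward(store_buffer, addr):
    """store_buffer: oldest-first list of (addr, value).
    Returns the forwarded value, or None if the load must access the cache."""
    for st_addr, value in reversed(store_buffer):   # search youngest first
        if st_addr == addr:
            return value
    return None

sb = [(0x100, 7), (0x200, 9), (0x100, 42)]
print(load_forward(sb, 0x100))  # 42: the youngest matching store wins
print(load_forward(sb, 0x300))  # None: no match, read from the cache
```

The unknown-address problem arises when some store in the buffer has not yet computed its address: the load cannot prove independence, which is where memory-dependence prediction comes in.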

Lecture 12

Advanced Caching
  • Handling writes
    1. Write-back
    2. Write-through
    3. Write allocate/no allocate
  • Instruction/Data Caching
  • Cache Replacement Policies
    1. Random
    2. FIFO
    3. Least Recently Used
    4. Not Most Recently Used
    5. Least Frequently used
  • LRU vs Random - Random is as good as LRU for most practical workloads
  • Optimal Replacement Policy
  • MLP aware cache replacement
  • Cache Performance
    1. Reducing miss rate
    2. Reducing miss latency/cost
  • Cache Parameters
    1. Cache size vs hit rate
    2. Block size
    3. Large Blocks - Critical words
    4. Large blocks - bandwidth wastage
    5. Sub-blocking
    6. Associativity
    7. Power of 2 associativity?
    8. Hybrid Replacement policies
    9. Sampling based hybrid (random/LRU) replacement
  • Cache Misses
    1. Compulsory
    2. Conflict
    3. Capacity
    4. Coherence
  • Cache aware schedulers - cache affinity based application mapping
  • Victim caches
  • Hashing - Randomizing index functions
  • Pseudo associativity - serial cache lookup

Lecture 13

More Caching
  • Speculative partial tag comparison
  • Skewed associative caches
    1. Randomizing the index for different ways
  • Improving hit rate in software
    1. Loop interchange - Row major, Column major
    2. Blocking
    3. Loop fusion, Array merging
    4. Data structure layout - Packing frequently used fields in arrays
  • Handling multiple outstanding misses
    1. Non blocking caches
    2. Miss Status Handling Registers (MSHR)
    3. Accessing MSHRs
  • Reducing miss latency through software
    1. Compiler level reordering of loops
    2. Software prefetching
  • Handling multiple accesses in a cycle
    1. True/Virtual multiporting
    2. Banking/Interleaving
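The loop-interchange point above is easy to see from the address pattern: for a row-major array, traversing row-by-row touches consecutive memory, while traversing column-by-column strides by a whole row per access (the array size here is illustrative):

```python
# Sketch of why loop interchange matters for a row-major 2-D array.
N = 4
# Row-major flattening: element (i, j) lives at linear index i * N + j
def addr(i, j):
    return i * N + j

row_major_order = [addr(i, j) for i in range(N) for j in range(N)]
col_major_order = [addr(i, j) for j in range(N) for i in range(N)]

print(row_major_order[:5])  # [0, 1, 2, 3, 4]:  unit stride, cache friendly
print(col_major_order[:5])  # [0, 4, 8, 12, 1]: stride N, poor spatial locality
```

Blocking (tiling) applies the same reasoning to both loop dimensions at once, sizing the tile so the working set fits in the cache.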

Lecture 14

Prefetching
  • Compulsory/conflict/capacity misses and prefetching
  • Coherence misses and prefetching
  • False sharing
    1. Word/byte based coherence
    2. Value prediction/Speculative execution
  • Prefetching and correctness
  • What/When/Where/How
    1. Accuracy
    2. Timeliness
    3. Coverage
    4. Prefetch buffers
    5. Skewing prefetches towards demand fetches
    6. Software/Hardware/Execution-based prefetchers
  • Software prefetching
    1. Binding/Non-binding prefetches
    2. Prefetching during pointer chasing
    3. x86 prefetch instructions - prefetching into different levels of cache
    4. Handling of prefetches that cause TLB misses/page faults
    5. Compiler driven prefetching
    6. Accuracy vs Timeliness tradeoff - Branches between prefetch and actual load
  • Hardware prefetching
    1. Next line prefetchers
    2. Stride prefetchers
    3. Instruction based stride prefetchers
    4. Stream buffers
    5. Locality based prefetching
  • Prefetcher performance
    1. Accuracy
    2. Coverage
    3. Timeliness
    4. Aggressiveness
      1. Prefetcher distance
      2. Prefetcher degree
  • Irregular prefetching
    1. Markov prefetchers

Lecture 15

Prefetching (wrap up)
  • Power 4 System Microarchitecture (prefetchers) (IBM Journal of R & D)
  • Irregular patterns (indirect array accesses, linked structures)
    1. Markov prefetching
      1. linked lists or trees
      2. Markov prefetchers vs stride prefetchers
  • Content directed prefetching
    1. pointer based structures
    2. identifying pointers (software mechanism/hardware prediction)
    3. compiler analysis to provide hints for useful prefetches
  • Hybrid prefetchers
  • Execution based prefetchers
    1. pre-execution thread for creating prefetches for the main program
    2. determining when to start the pre-execution
    3. similar idea for branch prediction
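The Markov prefetching idea above can be sketched with a single-successor table: remember which miss address historically followed each miss address, and prefetch the recorded successor the next time that miss recurs (real Markov prefetchers keep several weighted successors per entry; this one-successor form is a simplification):

```python
# Minimal sketch of a Markov prefetcher over the miss-address stream.
class MarkovPrefetcher:
    def __init__(self):
        self.next_of = {}    # miss addr -> most recent successor miss
        self.prev = None

    def miss(self, addr):
        """Record the transition and return a prefetch hint, if any."""
        if self.prev is not None:
            self.next_of[self.prev] = addr   # learn prev -> addr
        self.prev = addr
        return self.next_of.get(addr)        # replay learned successor

pf = MarkovPrefetcher()
for a in (10, 70, 30, 10):
    hint = pf.miss(a)
print(hint)  # 70: last time we missed on 10, the next miss was 70
```

Unlike a stride prefetcher, this captures arbitrary repeating patterns (e.g. linked-list traversals), at the cost of a large correlation table and no ability to prefetch addresses never seen before.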

Lecture 16

  • Prefetching in multicores
  • Importance of prefetch efficiency
  • Issues with local prefetcher throttling
  • Hierarchical prefetcher throttling
  • Cache coherence
    1. Snoopy cache coherence
  • Shared caches in multicores
  • Utility based cache partitioning

Lecture 17

  • Software based cache management
    1. Thread scheduling
    2. Page coloring
    3. Dynamic partitioning through page recoloring
  • Cache placement
    1. Insertion
    2. Re-insertion
    3. Circular reference model
    4. Dynamic insertion policy - LRU and Bimodal insertion

Lecture 19

Main memory system

  • Memory hierarchy
  • SRAM/DRAM cell structures
  • Memory bank organization
  • Page mode DRAM
    1. Bank operation
  • Basic DRAM operation
    1. Controller latency
    2. Bank latency
  • DRAM chips/DIMMs
  • DRAM channels
  • Address mapping/interleaving
  • Bank mapping randomization
  • DRAM refresh
  • DRAM controller issues
    1. Memory controller placement

Lecture 20

  • DRAM controller functions
    1. Refresh
    2. Scheduling
  • DRAM Scheduling policies
    1. FCFS
    2. FR-FCFS
  • Row buffer management policies
    1. Open row
    2. Closed row
  • DRAM controller design
    1. Machine learning - a possibility
  • DRAM power states
  • Inter-thread interference in DRAM
  • Multi-core DRAM controllers
    1. Stall-time fairness
      1. Unfairness
      2. Estimating alone runtime
      3. Providing system software support
    2. Parallelism aware batch scheduling
      1. Request batching
      2. Within batch scheduling
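The FR-FCFS policy listed above (first-ready, first-come-first-served) can be sketched as a two-level pick: among queued requests, prefer row-buffer hits; break ties by arrival order (the request format here is assumed for illustration):

```python
# Sketch of FR-FCFS request selection for one DRAM bank.
def fr_fcfs_pick(queue, open_row):
    """queue: list of (arrival, row) tuples, oldest first.
    open_row: the row currently open in the bank's row buffer.
    Returns the request to schedule next."""
    hits = [r for r in queue if r[1] == open_row]   # row-buffer hits first
    return hits[0] if hits else queue[0]            # else plain FCFS

queue = [(0, "rowA"), (1, "rowB"), (2, "rowB")]
print(fr_fcfs_pick(queue, open_row="rowB"))  # (1, 'rowB'): oldest row hit
print(fr_fcfs_pick(queue, open_row="rowC"))  # (0, 'rowA'): no hits -> FCFS
```

This maximizes row-buffer hit rate and hence single-thread throughput, but a thread with high row locality can starve others, which motivates the fairness-aware multi-core schedulers above.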

Lecture 21

Superscalar processing

  • Types of parallelism
    1. Task
    2. Thread
    3. Instruction
  • Fetch stage
    1. Instruction alignment Issue
    2. Solution - Split cache line fetch
    3. Fetch break - branches in the fetch block
    4. Solution - Short distance predicted-taken branch
    5. Solution - Basic block reordering
    6. Solution - Super block code optimization
    7. Solution - Trace cache
