This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Next revision Both sides next revision | ||
buzzword [2015/02/12 19:18] kevincha [Lecture 11 (2/11 Mon.)] |
buzzword [2015/02/25 21:05] kevincha [Lecture 16 (2/23 Mon.)] |
||
---|---|---|---|
Line 460: | Line 460: | ||
* Advantages? | * Advantages? | ||
+ | ===== Lecture 12 (2/13 Fri.) ===== | ||
+ | * Renaming | ||
+ | * Register renaming table | ||
+ | * Predictor (branch predictor, cache line predictor ...) | ||
+ | * Power budget (and its importance) | ||
+ | * Architectural state, precise state | ||
+ | * Memory dependence is known dynamically | ||
+ | * Register state is not shared across threads/processors | ||
+ | * Memory state is shared across threads/processors | ||
+ | * How to maintain speculative memory states | ||
+ | * Write buffers (helps simplify the process of checking the reorder buffer) | ||
+ | * Overall OoO mechanism | ||
+ | * What are other ways of eliminating dispatch stalls | ||
+ | * Dispatch when the sources are ready | ||
+ | * Retired instructions make the source available | ||
+ | * Register renaming | ||
+ | * Reservation station | ||
+ | * What goes into the reservation station | ||
+ | * Tags required in the reservation station | ||
+ | * Tomasulo's algorithm | ||
+ | * Without precise exception, OoO is hard to debug | ||
+ | * Arch. register ID | ||
+ | * Examples in the slides | ||
+ | * Slides 28 --> register renaming | ||
+ | * Slides 30-35 --> Exercise (also on the board) | ||
+ | * This will be usefull for the midterm | ||
+ | * Register aliasing table | ||
+ | * Broadcasting tags | ||
+ | * Using dataflow | ||
+ | |||
+ | |||
+ | ===== Lecture 13 (2/16 Mon.) ===== | ||
+ | |||
+ | * OoO --> Restricted Dataflow | ||
+ | * Extracting parallelism | ||
+ | * What are the bottlenecks? | ||
+ | * Issue width | ||
+ | * Dispatch width | ||
+ | * Parallelism in the program | ||
+ | * What does it mean to be restricted data flow | ||
+ | * Still visible as a Von Neumann model | ||
+ | * Where does the efficiency come from? | ||
+ | * Size of the scheduling windors/reorder buffer. Tradeoffs? What make sense? | ||
+ | * Load/store handling | ||
+ | * Would like to schedule them out of order, but make them visible in-order | ||
+ | * When do you schedule the load/store instructions? | ||
+ | * Can we predict if load/store are dependent? | ||
+ | * This is one of the most complex structure of the load/store handling | ||
+ | * What information can be used to predict these load/store optimization? | ||
+ | * Centralized vs. distributed? What are the tradeoffs? | ||
+ | * How to handle when there is a misprediction/recovery | ||
+ | * OoO + branch prediction? | ||
+ | * Speculatively update the history register | ||
+ | * When do you update the GHR? | ||
+ | * Token dataflow arch. | ||
+ | * What are tokens? | ||
+ | * How to match tokens | ||
+ | * Tagged token dataflow arch. | ||
+ | * What are the tradeoffs? | ||
+ | * Difficulties? | ||
+ | |||
+ | ===== Lecture 14 (2/18 Wed.) ===== | ||
+ | |||
+ | * SISD/SIMD/MISD/MIMD | ||
+ | * Array processor | ||
+ | * Vector processor | ||
+ | * Data parallelism | ||
+ | * Where does the concurrency arise? | ||
+ | * Differences between array processor vs. vector processor | ||
+ | * VLIW | ||
+ | * Compactness of an array processor | ||
+ | * Vector operates on a vector of data (rather than a single datum (scalar)) | ||
+ | * Vector length (also applies to array processor) | ||
+ | * No dependency within a vector --> can have a deep pipeline | ||
+ | * Highly parallel (both instruction level (ILP) and memory level (MLP)) | ||
+ | * But the program needs to be very parallel | ||
+ | * Memory can be the bottleneck (due to very high MLP) | ||
+ | * What does the functional units look like? Deep pipelin and simpler control. | ||
+ | * CRAY-I is one of the examples of vector processor | ||
+ | * Memory access pattern in a vector processor | ||
+ | * How do the memory accesses benefit the memory bandwidth? | ||
+ | * Memory level parallelism | ||
+ | * Stride length vs. the number of banks | ||
+ | * stride length should be relatively prime to the number of banks | ||
+ | * Tradeoffs between row major and column major --> How can the vector processor deals with the two | ||
+ | * How to calculate the efficiency and performance of vector processors | ||
+ | * What if there are multiple memory ports? | ||
+ | * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | ||
+ | * Helps handling sparse metrices | ||
+ | * Conditional operation | ||
+ | * Structure of vector units | ||
+ | * How to automatically parallelize code through the compiler? | ||
+ | * This is a hard problem. Compiler does not know the memory address. | ||
+ | * What do we need to ensure for both vector and array processor? | ||
+ | * Sequential bottleneck | ||
+ | * Amdahl's law | ||
+ | * Intel MMX --> An example of Intel's approach to SIMD | ||
+ | * No VLEN, use OpCode to define the length | ||
+ | * Stride is one in MMX | ||
+ | * Intel SSE --> Modern version of MMX | ||
+ | |||
+ | ===== Lecture 15 (2/20 Fri.) ===== | ||
+ | * GPU | ||
+ | * Warp/Wavefront | ||
+ | * A bunch of threads sharing the same PC | ||
+ | * SIMT | ||
+ | * Lanes | ||
+ | * FGMT + massively parallel | ||
+ | * Tolerate long latency | ||
+ | * Warp based SIMD vs. traditional SIMD | ||
+ | * SPMD (Programming model) | ||
+ | * Single program operates on multiple data | ||
+ | * can have synchronization point | ||
+ | * Many scientific applications are programmed in this manner | ||
+ | * Control flow problem (branch divergence) | ||
+ | * Masking (in a branch, mask threads that should not execute that path) | ||
+ | * Lower SIMD efficiency | ||
+ | * What if you have layers of branches? | ||
+ | * Dynamic wrap formation | ||
+ | * Combining threads from different warps to increase SIMD utilization | ||
+ | * This can cause memory divergence | ||
+ | * VLIW | ||
+ | * Wide fetch | ||
+ | * IA-64 | ||
+ | * Tradeoffs | ||
+ | * Simple hardware (no dynamic scheduling, no dependency checking within VLIW) | ||
+ | * A lot of loads at the compiler level | ||
+ | * Decoupled access/execute | ||
+ | * Limited form of OoO | ||
+ | * Tradeoffs | ||
+ | * How to street the instruction (determine dependency/stalling)? | ||
+ | * Instruction scheduling techniques (static vs. dynamic) | ||
+ | * Systoric arrays | ||
+ | * Processing elements transform data in chains | ||
+ | * Develop for image processing (for example, convolution) | ||
+ | * Stage processing | ||
+ | | ||
+ | |||
+ | |||
+ | ===== Lecture 16 (2/23 Mon.) ===== | ||
+ | * Systoric arrays | ||
+ | * Processing elements transform data in chains | ||
+ | * Can be arrays of multi-dimensional processing elements | ||
+ | * Develop for image processing (for example, convolution) | ||
+ | * Can be use to break stages in pipeline programs, using a set of queues and processing elements | ||
+ | * Can enable high concurrency and good for regular programs | ||
+ | * Very special purpose | ||
+ | * The warp computer | ||
+ | * Static instruction scheduling | ||
+ | * How do we find the next instruction to execute? | ||
+ | * Live-in and live-out | ||
+ | * Basic blocks | ||
+ | * Rearranging instructions in the basic block | ||
+ | * Code movement from one basic block to another | ||
+ | * Straight line code | ||
+ | * Independent instructions | ||
+ | * How to identify independent instructions | ||
+ | * Atomicity | ||
+ | * Trace scheduling | ||
+ | * Side entrance | ||
+ | * Fixed up code | ||
+ | * How scheudling is done | ||
+ | * Instruction scheduling | ||
+ | * Prioritization heuristics | ||
+ | * Superblock | ||
+ | * Traces with no side-entrance | ||
+ | * Hyperblock | ||
+ | * BS-ISA | ||
+ | * Tradeoffs betwwen trace cache/Hyperblock/Superblock/BS-ISA | ||
+ | | ||
+ | ===== Lecture 17 (2/25 Wed.) ===== | ||
+ | * IA-64 | ||
+ | * EPIC | ||
+ | * IA-64 instruction bundle | ||
+ | * Multiple instructions in the bundle along with the template bit | ||
+ | * Template bits | ||
+ | * Stop bits | ||
+ | * Non-faulting loads and exception propagation | ||
+ | * Aggressive ST-LD reordering | ||
+ | * Phyiscal memory system | ||
+ | * Ideal pipelines | ||
+ | * Ideal cache | ||
+ | * More capacity | ||
+ | * Fast | ||
+ | * Cheap | ||
+ | * High bandwidth | ||
+ | * DRAM cell | ||
+ | * Cheap | ||
+ | * Sense the purturbation through sense amplifier | ||
+ | * Slow and leaky | ||
+ | * SRAM cell (Cross coupled inverter) | ||
+ | * Expensice | ||
+ | * Fast (easier to sense the value in the cell) | ||
+ | * Memory bank | ||
+ | * Read access sequence | ||
+ | * DRAM: Activate -> Read -> Precharge (if needed) | ||
+ | * What dominate the access laatency for DRAM and SRAM | ||
+ | * Scaling issue | ||
+ | * Hard to scale the scale to be small | ||
+ | * Memory hierarchy | ||
+ | * Prefetching | ||
+ | * Caching | ||
+ | * Spatial and temporal locality | ||
+ | * Cache can exploit these | ||
+ | * Recently used data is likely to be accessed | ||
+ | * Nearby data is likely to be accessed | ||
+ | * Caching in a pipeline design | ||
+ | * Cache management | ||
+ | * Manual | ||
+ | * Data movement is managed manually | ||
+ | * Embedded processor | ||
+ | * GPU scratchpad | ||
+ | * Automatic | ||
+ | * HW manage data movements | ||
+ | * Latency analysis | ||
+ | * Based on the hit and miss status, next level access time (if miss), and the current level access time | ||
+ | * Cache basics | ||
+ | * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc. | ||
+ | * Cache access | ||
+ | * How to access tag and data (in parallel vs serially) | ||
+ | * How do tag and index get used? | ||
+ | * Modern processors perform serial access for higher level cache (L3 for example) to save power | ||
+ | * Cost and benefit of having more associativity | ||
+ | * Given the associativity, which block should be replace if it is full | ||
+ | * Replacement poligy | ||
+ | * Random | ||
+ | * Least recently used (LRU) | ||
+ | * Least frequently used | ||
+ | * Least costly to refetch | ||
+ | * etc. | ||
+ | * How to implement LRU | ||
+ | * How to keep track of access ordering | ||
+ | * Complexity increases rapidly | ||
+ | * Approximate LRU | ||
+ | * Victim and next Victim policy |