This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Next revision Both sides next revision | ||
buzzword [2014/01/17 19:18] rachata |
buzzword [2014/04/30 18:12] rachata |
||
---|---|---|---|
Line 94: | Line 94: | ||
* Tradeoffs between control-driven and data-driven | * Tradeoffs between control-driven and data-driven | ||
* What are easier to program? | * What are easier to program? | ||
- | * Which are easy to compile? | + | * Which are easy to compile? |
- | * What are more parallel (does that mean it is faster?) | + | * What are more parallel (does that mean it is faster?) |
- | * Which machines are more complex to design? | + | * Which machines are more complex to design? |
* In control flow, when a program is stop, there is a pointer to the current state (precise state). | * In control flow, when a program is stop, there is a pointer to the current state (precise state). | ||
* ISA vs. Microarchitecture | * ISA vs. Microarchitecture | ||
* Semantics in the ISA | * Semantics in the ISA | ||
- | * uArch should obey the ISA | + | * uArch should obey the ISA |
- | * Changing ISA is costly, can affect compatibility. | + | * Changing ISA is costly, can affect compatibility. |
* Instruction pointers | * Instruction pointers | ||
* uArch techniques: common and powerful techniques break Vonn Neumann model if done at the ISA level | * uArch techniques: common and powerful techniques break Vonn Neumann model if done at the ISA level | ||
Line 118: | Line 118: | ||
* Things that are not suppose to be visible to the programmer/software but typically make the processor faster and/or consumes less power and/or less complex | * Things that are not suppose to be visible to the programmer/software but typically make the processor faster and/or consumes less power and/or less complex | ||
- | ===== Lecture 3 (1/17 Wed.) ===== | + | ===== Lecture 3 (1/17 Fri.) ===== |
* Design tradeoffs | * Design tradeoffs | ||
Line 164: | Line 164: | ||
* Optimization opportunity | * Optimization opportunity | ||
+ | ===== Lecture 4 (1/22 Wed.) ===== | ||
+ | |||
+ | * Semantic gap | ||
+ | * Small vs. Large semantic gap (CISC vs. RISC) | ||
+ | * Benefit of RISC vs. CISC | ||
+ | * Micro operations/microcode | ||
+ | * Translate complex instructions into smaller instructions | ||
+ | * Parallelism (motivation for RISC) | ||
+ | * Compiler optimization | ||
+ | * Code optimization through translation | ||
+ | * VLIW | ||
+ | * Fixed vs. variable length instructions | ||
+ | * Tradeoffs | ||
+ | * Alignment issues? (fetch/decode) | ||
+ | * Decoding issues? | ||
+ | * Code size? | ||
+ | * Adding additional instructions? | ||
+ | * Memory bandwidth and cache utilization? | ||
+ | * Energy? | ||
+ | * Encoding in variable length instructions | ||
+ | * Structure of Alpha instructions and other uniform decode instructions | ||
+ | * Different type of instructions | ||
+ | * Benefit of knowing what type of instructions | ||
+ | * Speculatively operate future instructions | ||
+ | * x86 and other non-uniform decode instructions | ||
+ | * Tradeoff vs. uniform decode | ||
+ | * Tradeoffs for different number of registers | ||
+ | * Spilling into memory if the number of registers is small | ||
+ | * Compiler optimization on how to manage which value to keep/spill | ||
+ | * Addressing modes | ||
+ | * Benefits? | ||
+ | * Types? | ||
+ | * Different uses of addressing modes? | ||
+ | * Various ISA-level tradeoffs | ||
+ | * Virtual memory | ||
+ | * Unalign memory access/aligned memory access | ||
+ | * Cost vs. benefit of unaligned access | ||
+ | * ISA specification | ||
+ | * Things you have to obey/specifie in the ISA specification | ||
+ | * Architectural states | ||
+ | * Microarchitecture implements how arch. state A transformed to the next arch. state A' | ||
+ | * Single cycle machines | ||
+ | * Critical path in the single cycle machine | ||
+ | * Multi cycle machines | ||
+ | * Functional units | ||
+ | * Performance metrics | ||
+ | * CPI/IPC | ||
+ | * CPI of a single cycle microarchitecture | ||
+ | |||
+ | ===== Lecture 5 (1/24 Fri.) ===== | ||
+ | |||
+ | * Instruction processing | ||
+ | * Fetch | ||
+ | * Decode | ||
+ | * Execute | ||
+ | * Memory fetch | ||
+ | * Writeback | ||
+ | * Datapath & Control logic in microprocessors | ||
+ | * Different types of instructions (I-type, R-type, etc.) | ||
+ | * Control flow instructions | ||
+ | * Non-control flow instructions | ||
+ | * Delayed slot/Delayed branch | ||
+ | * Single cycle control logic | ||
+ | * Lockstep | ||
+ | * Critical path analysis | ||
+ | * Critical path of a single cycle processor | ||
+ | * Combinational logic & Sequential logic | ||
+ | * Control store | ||
+ | * Tradeoffs of a single cycle uarch | ||
+ | * Dynamic power/Static power | ||
+ | * Speedup calculation | ||
+ | * Parallelism | ||
+ | * Serial bottleneck | ||
+ | * Amdahl's bottleneck | ||
+ | * Design principles | ||
+ | * Common case design | ||
+ | * Critical path design | ||
+ | * Balanced designs | ||
+ | * Multi cycle design | ||
+ | |||
+ | ===== Lecture 6 (1/27 Mon.) ===== | ||
+ | |||
+ | * Microcoded/Microprogrammed machines | ||
+ | * States | ||
+ | * Microinstructions | ||
+ | * Microsequencing | ||
+ | * Control store - Product control signals | ||
+ | * Microsequencer | ||
+ | * Control signal | ||
+ | * What do they have to control? | ||
+ | * Instruction processing cycle | ||
+ | * Latch signals | ||
+ | * State machine | ||
+ | * State variables | ||
+ | * Condition code | ||
+ | * Steering bits | ||
+ | * Branch enable logic | ||
+ | * Difference between gating and loading? (write enable vs. driving the bus) | ||
+ | * Memory mapped I/O | ||
+ | * Hardwired logic | ||
+ | * What control signals come from hardwired logic? | ||
+ | * Variable latency memory | ||
+ | * Handling interrupts | ||
+ | * Difference betwen interrupts and exceptions | ||
+ | * Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | ||
+ | * Updating machine behavior | ||
+ | * Horizontal microcode | ||
+ | * Vertical microcode | ||
+ | * Primitives | ||
+ | |||
+ | ===== Lecture 7 (1/29 Wed.) ===== | ||
+ | |||
+ | * Pipelining | ||
+ | * Limitations of the multi-programmed design | ||
+ | * Idle resources | ||
+ | * Throughput of a pipelined design | ||
+ | * What dictacts the throughput of a pipelined design? | ||
+ | * Latency of the pipelined design | ||
+ | * Dependency | ||
+ | * Overhead of pipelining | ||
+ | * Latch cost? | ||
+ | * Data forwarding/bypassing | ||
+ | * What are the ideal pipeline? | ||
+ | * External fragmentation | ||
+ | * Issues in pipeline designs | ||
+ | * Stalling | ||
+ | * Dependency (Hazard) | ||
+ | * Flow dependence | ||
+ | * Output dependence | ||
+ | * Anti dependence | ||
+ | * How to handle them? | ||
+ | * Resource contention | ||
+ | * Keeping the pipeline full | ||
+ | * Handling exception/interrupts | ||
+ | * Pipeline flush | ||
+ | * Speculation | ||
+ | * Interlocking | ||
+ | * Multipath execution | ||
+ | * Fine grain multithreading | ||
+ | * No-op (Bubbles in the pipeline) | ||
+ | * Valid bits in the instructions | ||
+ | |||
+ | ===== Lecture 8 (1/31 Fri.) ===== | ||
+ | * Branch prediction | ||
+ | * Different types of data dependence | ||
+ | * Pipeline stalls | ||
+ | * bubbles | ||
+ | * How to handle stalls | ||
+ | * Stall conditions | ||
+ | * Stall signals | ||
+ | * Dependences | ||
+ | * Distant between dependences | ||
+ | * Data forwarding/bypassing | ||
+ | * Maintaining the correct dataflow | ||
+ | * Different ways to design data forwarding path/logic | ||
+ | * Different techniques to handle interlockings | ||
+ | * SW based | ||
+ | * HW based | ||
+ | * Profiling | ||
+ | * Static profiling | ||
+ | * Helps from the software (compiler) | ||
+ | * Superblock optimization | ||
+ | * Analyzing basic blocks | ||
+ | * How to deal with branches? | ||
+ | * Branch prediction | ||
+ | * Delayed branching (branch delay slot) | ||
+ | * Forward control flow/backward control flow | ||
+ | * Branch prediction accuracy | ||
+ | * Profile guided code positioning | ||
+ | * Based on the profile info. position the code based on it | ||
+ | * Try to make the next sequential instruction be the next inst. to be executed | ||
+ | * Trace cache | ||
+ | * Predicate combining (combine predicate for a branch instruction) | ||
+ | * Predicated execution (control dependence becomes data dependence) | ||
+ | * Definition of basic blocks | ||
+ | * Control flow graph | ||
+ | |||
+ | ===== Lecture 9 (2/3 Mon.) ===== | ||
+ | * Delayed branching | ||
+ | * benefit? | ||
+ | * What does it eliminates? | ||
+ | * downside? | ||
+ | * Delayed branching in SPARC (with squashing) | ||
+ | * Backward compatibility with the delayed slot | ||
+ | * What should be filled in the delayed slot | ||
+ | * How to ensure correctness | ||
+ | * Fine-grained multithreading | ||
+ | * fetch from different threads | ||
+ | * What are the issues (what if the program doesn't have many threads) | ||
+ | * CDC 6000 | ||
+ | * Denelcor HEP | ||
+ | * No dependency checking | ||
+ | * Inst. from different thread can fill-in the bubbles | ||
+ | * Cost? | ||
+ | * Simulteneuos multithreading | ||
+ | * Branch prediction | ||
+ | * Guess what to fetch next. | ||
+ | * Misprediction penalty | ||
+ | * Need to guess the direction and target | ||
+ | * How to perform the performance analysis? | ||
+ | * Given the branch prediction accuracy and penalty cost, how to compute a cost of a branch misprediction. | ||
+ | * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions. | ||
+ | * How many extra instructions are being fetched? | ||
+ | * What is the performance degredation? | ||
+ | * How to reduce the miss penalty? | ||
+ | * Predicting the next address (non PC+4 address) | ||
+ | * Branch target buffer (BTB) | ||
+ | * Predicting the address of the branch | ||
+ | * Global branch history - for directions | ||
+ | * Can use compiler to profile and get more info | ||
+ | * Input set dictacts the accuracy | ||
+ | * Add time to compilation | ||
+ | * Heuristics that are common and doesn't require profiling. | ||
+ | * Might be inaccurate | ||
+ | * Does not require profiling | ||
+ | * Programmer can tell the hardware (via pragmas (hints)) | ||
+ | * For example, x86 has the hint bit | ||
+ | * Dynamic branch prediction | ||
+ | * Last time predictor | ||
+ | * Two bits counter based prediction | ||
+ | * One more bit for hysteresis | ||
+ | |||
+ | |||
+ | ===== Lecture 10 (2/5 Wed.) ===== | ||
+ | |||
+ | * Branch prediction accuracy | ||
+ | * Why are they very important? | ||
+ | * Differences between 99% accuracy and 98% accuracy | ||
+ | * Cost of a misprediction when the pipeline is veryd eep | ||
+ | * Value prediction | ||
+ | * Global branch correlation | ||
+ | * Some branches are correlated | ||
+ | * Local branch correlation | ||
+ | * Some branches can depend on the result of past branches | ||
+ | * Pattern history table | ||
+ | * Record global taken/not taken results. | ||
+ | * Cost vs. accuracy (What to record, do you record PC? Just taken/not taken info.?) | ||
+ | * One-level branch predictor | ||
+ | * What information are used | ||
+ | * Two-level branch prediction | ||
+ | * What entries do you keep in the glocal history? | ||
+ | * What entries do you keep in the local history? | ||
+ | * How many table? | ||
+ | * Cost when training a table | ||
+ | * What are the purposes of each table? | ||
+ | * Potential problems of a two-level history | ||
+ | * GShare predictor | ||
+ | * Global history predictor is hashed with the PC | ||
+ | * Store both GHP and PC in one combined information | ||
+ | * How do you use the information? Why does the XOR result still usable? | ||
+ | * Slides (page 16-18) for a good overview of one- and two-level predictors | ||
+ | * Warmup cost of the branch predictor | ||
+ | * Hybrid solution? Fast warmup is used first, then switch to the slower one. | ||
+ | * Tournament predictor (Alpha 21264) | ||
+ | * Other types of branch predictor | ||
+ | * Using machine learning? | ||
+ | * Geometric history length | ||
+ | * Look at branches far behind (but using geometric step) | ||
+ | * Predicated execution - eliminate branches | ||
+ | * What are the tradeoffs | ||
+ | * What is the block is big (can lead to execution a lot of useless work) | ||
+ | * Allows easier code optimization | ||
+ | * From the compiler PoV, predicated execution combine multiple basic blocks into one bigger basic block | ||
+ | * Reduce control dependences | ||
+ | * Need ISA support | ||
+ | * Wish branches | ||
+ | * Compiler generate both predicated and non-predicated codes | ||
+ | * HW design which one to use | ||
+ | * Use branch prediction on an easy to predict code | ||
+ | * Use predicated execution on a hard to predict code | ||
+ | * Compiler can be more aggressive in optimimzing the code | ||
+ | * What are the tradeoffs (slide# 47) | ||
+ | * Multi-path execution | ||
+ | * Execute both paths | ||
+ | * Can lead to wasted work | ||
+ | |||
+ | |||
+ | ===== Lecture 11 (2/12 Wed.) ===== | ||
+ | |||
+ | * Call and return prediction | ||
+ | * Direct call is easy to predict | ||
+ | * Retun is harder (indirect branches) | ||
+ | * Nested calls make return easier to predict | ||
+ | * Can use stack to predict the return | ||
+ | * Indirect branch prediction | ||
+ | * These branches have multiple targets | ||
+ | * For switch-case, virtual function calls, jump tables, interface calls | ||
+ | * BTB to predict the target address - low accuracy | ||
+ | * History based: BTB + GHR | ||
+ | * Virtual program counter prediction | ||
+ | * Complications in superscalar processors | ||
+ | * Fetch? What if multiple branches are fetched at the same time? | ||
+ | * Logic requires to ensure correctness? | ||
+ | * Multi-cycle executions (Different functional units take different number of cycles) | ||
+ | * Instructions can retire out-of-order | ||
+ | * How to deal with this case? Stall? Throw exceptions if there are problems? | ||
+ | * Exceptions and Interrupts | ||
+ | * When they are handled? | ||
+ | * Why are some interrupts should be handled right away? | ||
+ | * Precise exception | ||
+ | * arch. state should be consistent before handling the exception/interrupts | ||
+ | * Easier to debug (you see the sequential flow when the interrupt occurs) | ||
+ | * Deterministic | ||
+ | * Easier to recover from the exception | ||
+ | * Easier to restart the processes | ||
+ | * How to ensure precise exception? | ||
+ | * Tradeoffs between each method | ||
+ | * Reorder buffer | ||
+ | * Reorder results before they are visible to the arch. state | ||
+ | * Need to presearve the sequential sematic and data | ||
+ | * What are the informatinos in the ROB entry | ||
+ | * Where to get the value from (forwarding path? reorder buffer?) | ||
+ | * Extra logic to check where the youngest instructions/value is | ||
+ | * Content addressible search | ||
+ | * A lot of comparators | ||
+ | * Different ways to simplify the reorder buffer | ||
+ | * Register renaming | ||
+ | * Same register refers to independent values (lacks of registers) | ||
+ | * Where does the exception happen (after retire) | ||
+ | * History buffer | ||
+ | * Update the register file when the instruction complete. Unroll if there is an exception. | ||
+ | * Future file (commonly used, along with reorder buffer) | ||
+ | * Keep two set of register files | ||
+ | * An updated value (Speculative), called fiture file | ||
+ | * A backup value (to restore the state quickly | ||
+ | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone) | ||
+ | * Branch misprediction resembles Exception | ||
+ | * The difference is that branch misprediction is not visible to the software | ||
+ | * Also much more common (say, divide by zero vs. a mispredicted branch) | ||
+ | * Recovery is similar to exception handling | ||
+ | * Latency of the state recovery | ||
+ | * What to do during the state recovery | ||
+ | * Checkpointing | ||
+ | * Advantages? | ||
+ | | ||
+ | ===== Lecture 14 (2/19 Wed.) ===== | ||
+ | |||
+ | * Predictor (branch predictor, cache line predictor ...) | ||
+ | * Power budget (and its importance) | ||
+ | * Architectural state, precise state | ||
+ | * Memory dependence is known dynamically | ||
+ | * Register state is not shared across threads/processors | ||
+ | * Memory state is shared across threads/processors | ||
+ | * How to maintain speculative memory states | ||
+ | * Write buffers (helps simplify the process of checking the reorder buffer) | ||
+ | * Overall OoO mechanism | ||
+ | * What are other ways of eliminating dispatch stalls | ||
+ | * Dispatch when the sources are ready | ||
+ | * Retired instructions make the source available | ||
+ | * Register renaming | ||
+ | * Reservation station | ||
+ | * What goes into the reservation station | ||
+ | * Tags required in the reservation station | ||
+ | * Tomasulo's algorithm | ||
+ | * Without precise exception, OoO is hard to debug | ||
+ | * Arch. register ID | ||
+ | * Examples in the slides | ||
+ | * Slides 28 --> register renaming | ||
+ | * Slides 30-35 --> Exercise (also on the board) | ||
+ | * This will be usefull for the midterm | ||
+ | * Register aliasing table | ||
+ | |||
+ | ===== Lecture 15 (2/21 Fri.) ===== | ||
+ | |||
+ | * OoO --> Restricted Dataflow | ||
+ | * Extracting parallelism | ||
+ | * What are the bottlenecks? | ||
+ | * Issue width | ||
+ | * Dispatch width | ||
+ | * Parallelism in the program | ||
+ | * More example on slide #10 | ||
+ | * What does it mean to be restricted data flow | ||
+ | * Still visible as a Von Neumann model | ||
+ | * Where does the efficiency come from? | ||
+ | * Size of the scheduling windors/reorder buffer. Tradeoffs? What make sense? | ||
+ | * Load/store handling | ||
+ | * Would like to schedule them out of order, but make them visible in-order | ||
+ | * When do you schedule the load/store instructions? | ||
+ | * Can we predict if load/store are dependent? | ||
+ | * This is one of the most complex structure of the load/store handling | ||
+ | * What information can be used to predict these load/store optimization? | ||
+ | * Note: IPC = 1/CPI | ||
+ | * Centralized vs. distributed? What are the tradeoffs? | ||
+ | * How to handle when there is a misprediction/recovery | ||
+ | * Token dataflow arch. | ||
+ | * What are tokens? | ||
+ | * How to match tokens | ||
+ | * Tagged token dataflow arch. | ||
+ | * What are the tradeoffs? | ||
+ | * Difficulties? | ||
+ | |||
+ | ===== Lecture 16 (2/24 Mon.) ===== | ||
+ | |||
+ | * SISD/SIMD/MISD/MIMD | ||
+ | * Array processor | ||
+ | * Vector processor | ||
+ | * Data parallelism | ||
+ | * Where does the concurrency arise? | ||
+ | * Differences between array processor vs. vector processor | ||
+ | * VLIW | ||
+ | * Compactness of an array processor | ||
+ | * Vector operates on a vector of data (rather than a single datum (scalar)) | ||
+ | * Vector length (also applies to array processor) | ||
+ | * No dependency within a vector --> can have a deep pipeline | ||
+ | * Highly parallel (both instruction level (ILP) and memory level (MLP)) | ||
+ | * But the program needs to be very parallel | ||
+ | * Memory can be the bottleneck (due to very high MLP) | ||
+ | * What does the functional units look like? Deep pipelin and simpler control. | ||
+ | * CRAY-I is one of the examples of vector processor | ||
+ | * Memory access pattern in a vector processor | ||
+ | * How do the memory accesses benefit the memory bandwidth? | ||
+ | * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a breif explanation of memory level parallelism | ||
+ | * Stride length vs. the number of banks | ||
+ | * stride length should be relatively prime to the number of banks | ||
+ | * Tradeoffs between row major and column major --> How can the vector processor deals with the two | ||
+ | * How to calculate the efficiency and performance of vector processors | ||
+ | * What if there are multiple memory ports? | ||
+ | * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | ||
+ | * Helps handling sparse metrices | ||
+ | * Conditional operation | ||
+ | * Structure of vector units | ||
+ | * How to automatically parallelize code through the compiler? | ||
+ | * This is a hard problem. Compiler does not know the memory address. | ||
+ | * What do we need to ensure for both vector and array processor? | ||
+ | * Sequential bottleneck | ||
+ | * Amdahl's law | ||
+ | * Intel MMX --> An example of Intel's approach to SIMD | ||
+ | * No VLEN, use OpCode to define the length | ||
+ | * Stride is one in MMX | ||
+ | * Intel SSE --> Modern version of MMX | ||
+ | |||
+ | ===== Lecture 17 (2/26 Wed.) ===== | ||
+ | |||
+ | * GPU | ||
+ | * Warp/Wavefront | ||
+ | * A bunch of threads sharing the same PC | ||
+ | * SIMT | ||
+ | * Lanes | ||
+ | * FGMT + massively parallel | ||
+ | * Tolerate long latency | ||
+ | * Warp based SIMD vs. traditional SIMD | ||
+ | * SPMD (Programming model) | ||
+ | * Single program operates on multiple data | ||
+ | * can have synchronization point | ||
+ | * Many scientific applications are programmed in this manner | ||
+ | * Control flow problem (branch divergence) | ||
+ | * Masking (in a branch, mask threads that should not execute that path) | ||
+ | * Lower SIMD efficiency | ||
+ | * What if you have layers of branches? | ||
+ | * Dynamic wrap formation | ||
+ | * Combining threads from different warps to increase SIMD utilization | ||
+ | * This can cause memory divergence | ||
+ | * VLIW | ||
+ | * Wide fetch | ||
+ | * IA-64 | ||
+ | * Tradeoffs | ||
+ | * Simple hardware (no dynamic scheduling, no dependency checking within VLIW) | ||
+ | * A lot of loads at the compiler level | ||
+ | * Decoupled access/execute | ||
+ | * Limited form of OoO | ||
+ | * Tradeoffs | ||
+ | * How to street the instruction (determine dependency/stalling)? | ||
+ | * Instruction scheduling techniques (static vs. dynamic) | ||
+ | * Systoric arrays | ||
+ | * Processing elements transform data in chains | ||
+ | * Develop for image processing (for example, convolution) | ||
+ | * Stage processing | ||
+ | |||
+ | ===== Lecture 18 (2/28 Fri.) ===== | ||
+ | |||
+ | * Tradeoffs of VLIW | ||
+ | * Why does VLIW required static instruction scheduling | ||
+ | * Whose job it is? | ||
+ | * Compiler can rearrange basic blocks/instruction | ||
+ | * Basic block | ||
+ | * Benefits of having large basic block | ||
+ | * Entry/Exit | ||
+ | * Handling entries/exits | ||
+ | * Trace cache | ||
+ | * How to ensure correctness? | ||
+ | * Profiling | ||
+ | * Fixing up the instruction order to ensure correctness | ||
+ | * Dealing with multiple entries into the block | ||
+ | * Dealing with multiple exits into the block | ||
+ | * Super block | ||
+ | * How to form super blocks? | ||
+ | * Benefit of super block | ||
+ | * Tradeoff between not forming a super block and forming a super block | ||
+ | * Ambiguous branch (after profiling, both taken/not taken are equally likely) | ||
+ | * Cleaning up | ||
+ | * What scenario would make trace cache/superblock/profiling less effective? | ||
+ | * List scheduling | ||
+ | * Help figuring out which instructions VLIW should fetch | ||
+ | * Try to maximize instruction throughput | ||
+ | * How to assign priorities | ||
+ | * What if some instructions take longer than others | ||
+ | * Block structured ISA (BS-ISA) | ||
+ | * Problems with trace scheduling? | ||
+ | * What type of program will benefit from BS-ISA | ||
+ | * How to form blocks in BS-ISA? | ||
+ | * Combining basic blocks | ||
+ | * multiples of merged basic blocks | ||
+ | * How to deal with entries/exits in BS-ISA? | ||
+ | * undo the executed instructions from the entry point, then fetch the new block | ||
+ | * Advantages over trace cache | ||
+ | * Benefit of VLIW + Static instruction scheduling | ||
+ | * Intel IA-64 | ||
+ | * Static instruction scheduling and VLIW | ||
+ | |||
+ | ===== Lecture 19 (3/19 Wed.) ===== | ||
+ | |||
+ | * Ideal cache | ||
+ | * More capacity | ||
+ | * Fast | ||
+ | * Cheap | ||
+ | * High bandwidth | ||
+ | * DRAM cell | ||
+ | * Cheap | ||
+ | * Sense the purturbation through sense amplifier | ||
+ | * Slow and leaky | ||
+ | * SRAM cell (Cross coupled inverter) | ||
+ | * Expensice | ||
+ | * Fast (easier to sense the value in the cell) | ||
+ | * Memory bank | ||
+ | * Read access sequence | ||
+ | * DRAM: Activate -> Read -> Precharge (if needed) | ||
+ | * What dominate the access laatency for DRAM and SRAM | ||
+ | * Scaling issue | ||
+ | * Hard to scale the scale to be small | ||
+ | * Memory hierarchy | ||
+ | * Prefetching | ||
+ | * Caching | ||
+ | * Spatial and temporal locality | ||
+ | * Cache can exploit these | ||
+ | * Recently used data is likely to be accessed | ||
+ | * Nearby data is likely to be accessed | ||
+ | * Caching in a pipeline design | ||
+ | * Cache management | ||
+ | * Manual | ||
+ | * Data movement is managed manually | ||
+ | * Embedded processor | ||
+ | * GPU scratchpad | ||
+ | * Automatic | ||
+ | * HW manage data movements | ||
+ | * Latency analysis | ||
+ | * Based on the hit and miss status, next level access time (if miss), and the current level access time | ||
+ | * Cache basics | ||
+ | * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc. | ||
+ | * Cache access | ||
+ | * How to access tag and data (in parallel vs serially) | ||
+ | * How do tag and index get used? | ||
+ | * Modern processors perform serial access for higher level cache (L3 for example) to save power | ||
+ | * Cost and benefit of having more associativity | ||
+ | * Given the associativity, which block should be replace if it is full | ||
+ | * Replacement poligy | ||
+ | * Random | ||
+ | * Least recently used (LRU) | ||
+ | * Least frequently used | ||
+ | * Least costly to refetch | ||
+ | * etc. | ||
+ | * How to implement LRU | ||
+ | * How to keep track of access ordering | ||
+ | * Complexity increases rapidly | ||
+ | * Approximate LRU | ||
+ | * Victim and next Victim policy | ||
+ | |||
+ | ===== Lecture 20 (3/21 Fri.) ===== | ||
+ | |||
+ | * Set thrashing | ||
+ | * Working set is bigger than the associativity | ||
+ | * Belady's OPT | ||
+ | * Is this optimal? | ||
+ | * Complexity? | ||
+ | * Similarity between cache and page table | ||
+ | * Number of blocks vs pages | ||
+ | * Time to find the block/page to replace | ||
+ | * Handling writes | ||
+ | * Write through | ||
+ | * Need a modified bit to make sure accesses to data got the updated data | ||
+ | * Write back | ||
+ | * Simpler, no consistency issues | ||
+ | * Sectored cache | ||
+ | * Use subblock | ||
+ | * lower bandwidth | ||
+ | * more complex | ||
+ | * Instruction vs data cache | ||
+ | * Where to place instructions | ||
+ | * Unified vs. separated | ||
+ | * In the first level cache | ||
+ | * Cache access | ||
+ | * First level access | ||
+ | * Second level access | ||
+ | * When to start the second level access | ||
+ | * Performance vs. energy | ||
+ | * Address translation | ||
+ | * Homonym and Synonyms | ||
+ | * Homonym: Same VA but maps to different PA | ||
+ | * With multiple processes | ||
+ | * Synonyms: Multiple VAs map to the same PA | ||
+ | * Shared libraries, shared data, copy-on-write | ||
+ | * I/O | ||
+ | * Can these create problems when we have the cache | ||
+ | * How to eliminate these problems? | ||
+ | * Page coloring | ||
+ | * Interaction between cache and TLB | ||
+ | * Virtually indexed vs. physically indexed | ||
+ | * Virtually tagged vs. physically tagged | ||
+ | * Virtually indexed physically tagged | ||
+ | * Virtual memory in DRAM | ||
+ | * Control where data is mapped to in channel/rank/bank | ||
+ | * More parallelism | ||
+ | * Reduce interference | ||
+ | |||
+ | ===== Lecture 21 (3/24 Mon.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Different parameters that affect cache miss | ||
+ | * Thrashing | ||
+ | * Different types of cache misses | ||
+ | * Compulsory misses | ||
+ | * Can mitigate with prefetches | ||
+ | * Capacity misses | ||
+ | * More assoc | ||
+ | * Victim cache | ||
+ | * Conflict misses | ||
+ | * Hashing | ||
+ | * Large block vs. small block | ||
+ | * Subblocks | ||
+ | * Victim cache | ||
+ | * Small, but fully assoc. cache behind the actual cache | ||
+ | * Cached misses cache block | ||
+ | * Prevent ping-ponging | ||
+ | * Pseudo associativity | ||
+ | * Simpler way to implement associative cache | ||
+ | * Skewed assoc. cache | ||
+ | * Different hashing functions for each way | ||
+ | * Restructure data access pattern | ||
+ | * Order of loop traversal | ||
+ | * Blocking | ||
+ | * Memory level parallelism | ||
+ | * Cost per miss of a parallel cache miss is less costly compared to serial misses | ||
+ | * MSHR | ||
+ | * Keep track of pending cache | ||
+ | * Think of this as the load/store buffer-ish for cache | ||
+ | * What information goes into the MSHR? | ||
+ | * When do you access the MSHR? | ||
+ | |||
+ | |||
+ | ===== Lecture 22 (3/26 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Multi-porting | ||
+ | * Virtual multi-porting | ||
+ | * Time-share the port, not too scalable but cheap | ||
+ | * True multiporting | ||
+ | * Multiple cache copies | ||
+ | * Banking | ||
+ | * Can have bank conflict | ||
+ | * Extra interconnects across banks | ||
+ | * Address mapping can mitigate bank conflict | ||
+ | * Common in main memory (note that regFile in GPU is also banked, but mainly for the pupose of reducing complexity) | ||
+ | * Accessing DRAM | ||
+ | * Row bits | ||
+ | * Column bits | ||
+ | * Addressibility | ||
+ | * DRAM has its own clock | ||
+ | * DRAM (2T) vs. SRAM (6T) | ||
+ | * Cost | ||
+ | * Latency | ||
+ | * Interleaving in DRAM | ||
+ | * Effects from address mapping on memory interleaving | ||
+ | * Effects from memory access patterns from the program on interleaving | ||
+ | * DRAM Bank | ||
+ | * To minimize the cost of interleaving (Shared the data bus and the command bus) | ||
+ | * DRAM Rank | ||
+ | * Minimize the cost of the chip (a bundle of chips operated together) | ||
+ | * DRAM Channel | ||
+ | * An interface to DRAM, each with its own ranks/banks | ||
+ | * DIMM | ||
+ | * More DIMM adds the interconnect complexity | ||
+ | * List of commands to read/write data into DRAM | ||
+ | * Activate -> read/write -> precharge | ||
+ | * Activate moves data into the row buffer | ||
+ | * Precharge prepare the bank for the next access | ||
+ | * Row buffer hit | ||
+ | * Row buffer conflict | ||
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
+ | * Memory controller | ||
+ | * Sending DRAM commands | ||
+ | * Periodically send commands to refresh DRAM cells | ||
+ | * Ensure correctness and data integrity | ||
+ | * Where to place the memory controller | ||
+ | * On CPU chip vs. at the main memory | ||
+ | * Higher BW on-chip | ||
+ | * Determine the order of requests that will be serviced in DRAM | ||
+ | * Request queues that hold requests | ||
+ | * Send requests whenever the request can be sent to the bank | ||
+ | * Determine which command (across banks) should be sent to DRAM | ||
+ | * Priority of demand vs. prefetch requests | ||
+ | * Memory scheduling policies | ||
+ | * FCFS | ||
+ | * FR-FCFS | ||
+ | * Capped FR-FCFS: FR-FCFS with a timeout | ||
+ | * Usually this is done in a command level (read/write commands and precharge/activate commands) | ||
+ | | ||
+ | | ||
+ | ===== Lecture 23 (3/28 Fri.) ===== | ||
+ | |||
+ | * DRAM design choices | ||
+ | * Cost/density/latency/BW/Yield | ||
+ | * Sense Amplifier | ||
+ | * How do they work | ||
+ | * Dual data rate | ||
+ | * Subarray | ||
+ | * Rowclone | ||
+ | * Moving bulk of data from one row to others | ||
+ | * Lower latency and BW when performing copies/zeroes out the data | ||
+ | * TL-DRAM | ||
+ | * Far segment | ||
+ | * Near segment | ||
+ | * What causes the long latency | ||
+ | * Benefit of TL-DRAM | ||
+ | * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) | ||
+ | |||
+ | | ||
+ | | ||
+ | | ||
+ | ===== Lecture 24 (3/31 Mon.) ===== | ||
+ | | ||
+ | |||
+ | * Memory controller | ||
+ | * Different commands | ||
+ | * Memory scheduler | ||
+ | * Determine the order of requests to be issued to DRAM | ||
+ | * Age/hit-miss status/types(load/store/prefetch/from GPU/from CPU)/criticality | ||
+ | * Row buffer | ||
+ | * hit/conflict | ||
+ | * open/closed row | ||
+ | * Open row policy | ||
+ | * Closed row policy | ||
+ | * Tradeoffs between open and closed row policy | ||
+ | * What if the programs has high row buffer locality: open row might benefit more | ||
+ | * Closed row will service misses request faster | ||
+ | * Bank conflict | ||
+ | * Interference from different applications/threads | ||
+ | * Differnt programs/processes/threads interfere with each other | ||
+ | * introduce more row buffer/bank conflicts | ||
+ | * Memory schedule has to manage these interference | ||
+ | * Memory hog problems | ||
+ | * Interference in the data/command bus | ||
+ | * FR-FCFS | ||
+ | * Why does FR-FCFS make sense? | ||
+ | * Row buffer has lower lantecy | ||
+ | * Issues with FR-FCFS | ||
+ | * Unfairness | ||
+ | * STFM | ||
+ | * Fairness issue in memory scheduling | ||
+ | * How does STFM calculate the fairness and slowdown | ||
+ | * How to estimate slowdown time when it is runing alone | ||
+ | * Definition of fairness (based on STFM, different papers/areas define fairness differently) | ||
+ | * PAR-BS | ||
+ | * Parallelism in programs | ||
+ | * Intereference across banks | ||
+ | * How to form a batch | ||
+ | * How to determine ranking between batches/within a batch | ||
+ | | ||
+ | |||
+ | |||
+ | ===== Lecture 25 (2/2 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Latency sensitivity | ||
+ | * Performance drops a lot when the memory request latency is long | ||
+ | * TCM | ||
+ | * Tradeoff between throughput and fairness | ||
+ | * Latency sensitive cluster (non-intensive cluster) | ||
+ | * Ranking based on memory intensity | ||
+ | * Bandwidth intensive cluster | ||
+ | * Round robin within the cluster | ||
+ | * Generally latency sensitive cluster has more priority | ||
+ | * Provide robust fairness vs. throughput | ||
+ | * Complexity of TCM? | ||
+ | * Different ways to control interference in DRAM | ||
+ | * Partitioning of resource | ||
+ | * Channel partitioning: map applications that interfere with each other in a different channel | ||
+ | * Keep track of application's characteristics | ||
+ | * Dedicate a channel might waste the bandwidth | ||
+ | * Need OS support to determine the channel bits | ||
+ | * Source throttling | ||
+ | * A controller throttle the core depends on the performance target | ||
+ | * Example: Fairness via source throttling | ||
+ | * Detect unfairness and throttle application that is interfering | ||
+ | * How do you estimate slowdown? | ||
+ | * Threshold based solution: hard to configure | ||
+ | * App/thread scheduling | ||
+ | * Critical threads usually stall the progress | ||
+ | * Designing DRAM controller | ||
+ | * Has to handle the normal DRAM operations | ||
+ | * Read/write/refresh/all the timing constraints | ||
+ | * Keep track of resources | ||
+ | * Assign priorities to different requests | ||
+ | * Manage requests to banks | ||
+ | * Self-optimizing controller | ||
+ | * Use machine learning to improve DRAM controller | ||
+ | * DRAM Refresh | ||
+ | * Why does DRAM has to refresh every 64ms | ||
+ | * Banks are unavailable during refresh | ||
+ | * LPDDR mitigate this by using a per-bank refresh | ||
+ | * Has to spend longer time with bigger DRAM | ||
+ | * Distributed refresh: stagger refresh every 64 ms in a distributed manner | ||
+ | * As oppose to burst refresh (long pause time) | ||
+ | * RAIDR: Reduce DRAM refresh by profiling and binning | ||
+ | * Some row do not have to be refresh very frequently | ||
+ | * Profile the row | ||
+ | * High temperature changes the retention time: need online profiling | ||
+ | * Bloom filter | ||
+ | * Represent set membership | ||
+ | * Approximated | ||
+ | * Can contain false positive | ||
+ | * Better/more hash function helps eliminate this | ||
+ | | ||
+ | | ||
+ | | ||
+ | ===== Lecture 26 (4/7 Mon.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Tolerate latency can be costly | ||
+ | * Instruction window is complex | ||
+ | * Benefit also diminishes | ||
+ | * Designing the buffers can be complex | ||
+ | * A simpler way to tolerate out of order is desirable | ||
+ | * Different sources that cause the core to stall in OoO | ||
+ | * Cache miss | ||
+ | * Note that stall happens if the inst. window is full | ||
+ | * Scaling instruction window size is hard | ||
+ | * It is better (less complex) to make the windows more efficient | ||
+ | * Runahead execution | ||
+ | * Try to optain MLP w/o increasing instruction windows | ||
+ | * Runahead (i.e. execute ahead) when there is a long memory instruction | ||
+ | * Long memory instruction stall processor for a while anyways, so it's better to make use out of it | ||
+ | * Execute future instruction to generate accurate prefetches | ||
+ | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoing the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | * Cost and benefit of runahead execution (slide number 27) | ||
+ | * Runahead can have inefficiency | ||
+ | * Runahead period that are useless | ||
+ | * Get rid of useless inefficient period | ||
+ | * What if there is a dependent cache miss | ||
+ | * Cannot be paralellized in a vanilla runahead | ||
+ | * Can predict the value of the dependent load | ||
+ | * How to predict the address of the load | ||
+ | * Delta value information | ||
+ | * Stride predictor | ||
+ | * AVD prediction | ||
+ | | ||
+ | ===== Lecture 28 (4/9 Wed.) ===== | ||
+ | |||
+ | |||
+ | * Questions regarding prefetching | ||
+ | * What to prefetch | ||
+ | * When to prefetch | ||
+ | * how do we prefetch | ||
+ | * where to prefetch from | ||
+ | * Prefetching can cause thrasing (evict a useful block) | ||
+ | * Prefetching can also be useless (not being used) | ||
+ | * Need to be efficient | ||
+ | * Can cause memory bandwidth problem in GPU | ||
+ | * Prefetch the whole block, more than one block, or subblock? | ||
+ | * Each one of them has pros and cons | ||
+ | * Big prefetch is more likely to waste bandwidth | ||
+ | * Commonly done in a cache block granularity | ||
+ | * Prefetch accuracy: fraction of useful prefetches out of all the prefetches | ||
+ | * Prefetcher usually predict based on | ||
+ | * Past knowledge | ||
+ | * Compiler hints | ||
+ | * Prefetcher has to prefetch at the right time | ||
+ | * Prefetch that is too early might get evicted | ||
+ | * It might also evict other useful data | ||
+ | * Prefetch too late does not hide the whole memory latency | ||
+ | * Previous prefetches at the same PC can be used as the history | ||
+ | * Previous demand requests also is a good information to use for prefetches | ||
+ | * Prefetch buffer | ||
+ | * Place the prefetch data to avoid thrashing | ||
+ | * Can treat demand/prefetch requests separately | ||
+ | * More complex | ||
+ | * Generally, demand block is more important | ||
+ | * This means eviction should prefer prefetch block as oppose to demand block | ||
+ | * Tradeoffs between where do we place the prefetcher | ||
+ | * Look at L1 hits and misses | ||
+ | * Look at L1 misses only | ||
+ | * Look at L2 misses | ||
+ | * Different access pattern affect accuracy | ||
+ | * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss) | ||
+ | * Software vs. hardware vs. execution based prefetching | ||
+ | * Software: ISA previde prefetch instructions, software utilize it | ||
+ | * What information are useful | ||
+ | * How to make sure the prefetch is timely | ||
+ | * What if you have a pointer based structure | ||
+ | * Not easy to prefetch pointer chasing (because in many case the work between prefetches is short, so you cannot predict the next one timely enough) | ||
+ | * Can be solved by hinting the nextnext and/or nextnextnext address | ||
+ | * Hardware: Identify the pattern and prefetch | ||
+ | * Execution driven: Oppotunistically try to prefetch (runahead, dual-core execution) | ||
+ | * Stride prefetcher | ||
+ | * Predict strides, which is common in many programs | ||
+ | * Cache block based or instruction based | ||
+ | * Stream buffer design | ||
+ | * Buffer the stream of accesses (next address) | ||
+ | * Use the information to prefetch | ||
+ | * What affect prefetcher performance | ||
+ | * Prefetch distance | ||
+ | * How far ahead should we prefetch | ||
+ | * Prefetch degree | ||
+ | * How many prefetches do we prefetch | ||
+ | * Prefetcher performance | ||
+ | * Coverage | ||
+ | * Out of the demand requests, how many are actually from the prefetch request | ||
+ | * Accuracy | ||
+ | * Out of all the prefetch requests, how many are actually getting used | ||
+ | * Timeliness | ||
+ | * How much memory latency can we hide from the prefetch requests | ||
+ | * Feedback directed prefetcher | ||
+ | * Use the result of the prefetcher as a feedback to the prefetcher | ||
+ | * with accuracy, timeliness, polluting information | ||
+ | * Markov prefetcher | ||
+ | * Prefetch based on the previous history | ||
+ | * Use markov model to predict | ||
+ | * Pros: Can cover arbitary pattern (easy for link list traversal or trees) | ||
+ | * Downside: High cost, cannot help with compulsory misses (no history) | ||
+ | * Content directed prefetching | ||
+ | * Indentify the content in memory for pointers (which is used as the address to prefetch | ||
+ | * Not very efficient (hard to figure out which block is the pointer) | ||
+ | * Software can give hints | ||
+ | | ||
+ | ===== Lecture 28 (4/14 Mon.) ===== | ||
+ | |||
+ | |||
+ | * Execution based prefetcher | ||
+ | * Helper thread/speculative thread | ||
+ | * Use another thread to pre-execute a program | ||
+ | * Can be a software based or hardware based | ||
+ | * Discover misses before the main program (to prefetch data in a timely manner) | ||
+ | * How do you construct the helper thread | ||
+ | * Preexecute instruction (one example of how to initialize a speculative thread), slide 9 | ||
+ | * Benefit of multiprocessor | ||
+ | * Improve performace without not significantly increase power consumption | ||
+ | * Better cost efficiency and easier to scale | ||
+ | * Improve dependability (in case the other core is faulty | ||
+ | * Different types of parallelism | ||
+ | * Instruction level parallelism | ||
+ | * Data level parallelism | ||
+ | * Task level parallelism | ||
+ | * Task level parallelism | ||
+ | * Partition a single, potentially big, task into multiple parallel sub-task | ||
+ | * Can be done explicitly (parallel programming by the programmer) | ||
+ | * Or implicitly (hardware partition a single thread specilatively) | ||
+ | * Or, run multiple independent tasks (still improve throughput, but the speedup of any single tasks are not better, also simpler to implement) | ||
+ | * Loosely coupled multiprocessor | ||
+ | * No shared global address space | ||
+ | * Message passing to communicate between different sources | ||
+ | * Simple to manage memory | ||
+ | * Tightly coupled multiprocesor | ||
+ | * Shared global address space | ||
+ | * Need to ensure consistency of data | ||
+ | * Switch on event multithreading | ||
+ | * Switch to another context when there is an event (for example, a cache miss) | ||
+ | * Simulteneous multithreading | ||
+ | * Dispatch instruction from multiple threads at the same time | ||
+ | * Amdahl's law | ||
+ | * Maximum speedup | ||
+ | * Parallel portion is not perfect | ||
+ | * Serial bottleneck | ||
+ | * Synchronization cost | ||
+ | * Load balance | ||
+ | * Some threads has more work, requires more time to hit the sync. point | ||
+ | * Issue in parallel programming | ||
+ | * Correctness | ||
+ | * Synchronization | ||
+ | * Consistency | ||
+ | | ||
+ | ===== Lecture 29 (4/16 Wed.) ===== | ||
+ | | ||
+ | |||
+ | |||
+ | * Ordering of instructions | ||
+ | * Maintaining memory consistency when there are multiple threads and shared memory | ||
+ | * Need to ensure the semantic is not changed | ||
+ | * Making sire the shared data is properly locked when used | ||
+ | * Support mutual exclusion | ||
+ | * Ordering depends on when each processor is executed | ||
+ | * Debugging is also difficult (non-deterministic behavior) | ||
+ | * Weak consistency: global ordering when sync | ||
+ | * programmer hints where the synchronizations are | ||
+ | * Total store order model: global ordering only with store | ||
+ | * Cache coherence | ||
+ | * Can be done in the software level or hardware level | ||
+ | * Coherence protocol | ||
+ | * Need to ensure that all the processors see and update the correct state of the cache block | ||
+ | * Need to make sure that writes get propagated and serialized | ||
+ | * Simple protocol are not scalable (one point of synchrnization) | ||
+ | * Update vs. invalidate | ||
+ | * For invalidate, only the core that needs to read retains the correct copy | ||
+ | * Can lead to ping-ponging (tons of read/writes from several processors) | ||
+ | * For updates, bus becomes the bottleneck | ||
+ | * Snoopy bus | ||
+ | * Bus based, single point of serialization | ||
+ | * More efficient with small number of processors | ||
+ | * All cache snoop other caches read/write requests to keep the cache block coherent | ||
+ | * Directory based | ||
+ | * Single point of serialization per block | ||
+ | * Directory coordinate the coherency | ||
+ | * More scalable | ||
+ | * The directory keeps track of where the copies of each block resides | ||
+ | * Supply data on a read | ||
+ | * Invalide the block on a write | ||
+ | * Has an exclusive state | ||
+ | * MSI coherent protocol | ||
+ | * Slide number 56-57 | ||
+ | * Consume bus bandwidth (need an "exclusive" state | ||
+ | * MESI coherent protocal | ||
+ | * Add the exclusive state: this is the only cache copy and it is clean state to MSI | ||
+ | * Tradeoffs between snooping and directory based | ||
+ | * Slide 71 has a good summary on this | ||
+ | * MOESI | ||
+ | * Improvement over MESI protocol | ||
+ | |||
+ | | ||
+ | ===== Lecture 29 (4/18 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Interference | ||
+ | * Complexity of the memory scheduler | ||
+ | * Ranking/prioritization has cost | ||
+ | * Complex scheduler has higher latency | ||
+ | * Performance metric for multicore/multithead applications | ||
+ | * Speedup | ||
+ | * Slowdown | ||
+ | * Harmonic vs wrighted | ||
+ | * Fairness mertic | ||
+ | * Maximum slowdown | ||
+ | * Why does it make sense | ||
+ | * Any scenario that it does not make sense? | ||
+ | * Predictable performance | ||
+ | * Why is it important? | ||
+ | * In server environment, different jobs are on the same server | ||
+ | * In a mobile environment, there are multiple sources that can slowdown other sources | ||
+ | * How to relate slowdown with request service rate | ||
+ | * MISE: soft slowdown guarantee | ||
+ | * BDI | ||
+ | * Memory wall | ||
+ | * What is the concern regarding the memory wall | ||
+ | * Size of the cache on the die (CPU die) | ||
+ | * One possible solution: cache compression | ||
+ | * What is the problems of existing cache compression mechanism | ||
+ | * Some are too complex | ||
+ | * Decompression is in the critical path | ||
+ | * Need to decompress when reading the data -> decompression should not be in the critical path | ||
+ | * Important factor to the performance | ||
+ | * Software compression is not good enough to compress everything | ||
+ | * Zero value compression | ||
+ | * Simple | ||
+ | * Good compression ratio | ||
+ | * What is data does not have many zeroes | ||
+ | * Frequent value compression | ||
+ | * Some data appear fequently | ||
+ | * Simple and good compression ratio | ||
+ | * have to profile | ||
+ | * decompression is complex | ||
+ | * Frequent pattern compression | ||
+ | * Still to complex in terms of decompression | ||
+ | * Based delta compression | ||
+ | * Easy to decompress but retain the benefit of compression | ||
+ | | ||
+ | | ||
+ | ===== Lecture 31 (4/28 Mon.) ===== | ||
+ | |||
+ | * Directory based cache coherent | ||
+ | * Each directory has to handle validate/invalidation | ||
+ | * Extra cost of syncronization | ||
+ | * Need to ensure race conditions are resolved | ||
+ | * Interconnection | ||
+ | * Topology | ||
+ | * Bus | ||
+ | * Mesh | ||
+ | * Torus | ||
+ | * Tree | ||
+ | * Butterfly | ||
+ | * Ring | ||
+ | * Bi-directional ring | ||
+ | * More scalable | ||
+ | * Hierarchical ring | ||
+ | * Even more scalable | ||
+ | * More complex | ||
+ | * Crossbar | ||
+ | * etc. | ||
+ | * Circuit switching | ||
+ | * Multistage network | ||
+ | * Butterfly | ||
+ | * Delta network | ||
+ | * Handling contention | ||
+ | * Buffering vs. dropping/deflection (no buffering) | ||
+ | * Routing algorithm | ||
+ | * Handling deadlock | ||
+ | * X-Y routing | ||
+ | * Turn model (to avoid deadlocks) | ||
+ | * Add more buffering for an escape path | ||
+ | * Oblivious routing | ||
+ | * Can take different path | ||
+ | * DOR between each intermediate location | ||
+ | * Balance network load | ||
+ | * Adaptive routing | ||
+ | * Use the state of the network to determine the route | ||
+ | * Aware of local and/or global congestions | ||
+ | * Non minimal adaptive routing can have livelocks | ||
+ | |||
+ | ===== Lecture 32 (4/30 Wed.) ===== | ||
+ | |||
+ | |||
+ | * Serialized code section | ||
+ | * Degrade performance | ||
+ | * Waste energy | ||
+ | * Heterogeneous cores | ||
+ | * Can execute serialized portion on a powerful large core | ||
+ | * Tradeoff between multiple small cores, multiple large cores or heterogenerous cores | ||
+ | * Critical section | ||
+ | * bottleneck in several multithreaded workloads | ||
+ | * Assymmetry can help | ||
+ | * Accelerated critical section | ||
+ | * Use a large core to run serialized portion of the code | ||
+ | * How to correctly support ACS | ||
+ | * False serialization | ||
+ | * Handling private/shared data | ||
+ | * BIS | ||
+ | * Ideltify the bottleneck | ||
+ | * Serial bottleneck | ||
+ | * Barrier | ||
+ | * Critical section | ||
+ | * Pipeline stages | ||
+ | * Application might wait on different types of bottlenecks | ||
+ | * Allow bottleneckcall and bottleneckreturn | ||
+ | * Acceleration can be done in multiple ways | ||
+ | * ship to a big core | ||
+ | * increase the frequency | ||
+ | * Priorize the thread in share resources (memory scheduler always schedule reqeusts from the thread first, etc.) | ||
+ | * Bottleneck table keeps track of different thread's bottleneck and determine the criticality | ||
+ | | ||
+ | | ||
+ | |