  * Advantages?

===== Lecture 14 (2/19 Wed.) =====

  * Predictors (branch predictor, cache line predictor, ...)
  * Power budget (and its importance)
  * Architectural state, precise state
  * Memory dependences are known only dynamically
  * Register state is not shared across threads/processors
  * Memory state is shared across threads/processors
    * How to maintain speculative memory state
    * Write buffers (help simplify the process of checking the reorder buffer)
  * Overall OoO mechanism
  * What are other ways of eliminating dispatch stalls?
    * Dispatch when the sources are ready
    * Retired instructions make the sources available
  * Register renaming (a minimal sketch follows this list)
  * Reservation stations
    * What goes into a reservation station
    * Tags required in a reservation station
  * Tomasulo's algorithm
  * Without precise exceptions, OoO is hard to debug
  * Arch. register ID
  * Examples in the slides
    * Slide 28 --> register renaming
    * Slides 30-35 --> exercise (also on the board)
      * This will be useful for the midterm
  * Register alias table (RAT)

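The register renaming bullet above lends itself to a tiny code sketch. This is a hypothetical, simplified model (the class name and tag scheme are made up, not from the slides): a register alias table (RAT) maps each architectural register to the tag of its latest in-flight producer, which is how false WAR/WAW dependences disappear.

<code python>
# Minimal register-renaming sketch (illustrative only).
class Renamer:
    def __init__(self, num_arch_regs):
        # None means the value lives in the architectural register file
        self.rat = {r: None for r in range(num_arch_regs)}
        self.next_tag = 0

    def rename(self, dest, src1, src2):
        # Sources read the current mapping: a producer tag or the reg file
        s1, s2 = self.rat[src1], self.rat[src2]
        # The destination gets a fresh tag; later readers wait on that tag
        tag = self.next_tag
        self.next_tag += 1
        self.rat[dest] = tag
        return tag, s1, s2

r = Renamer(8)
print(r.rename(dest=1, src1=2, src2=3))  # (0, None, None)
# The next write to r1 gets a new tag, so the WAW hazard on r1 is gone,
# and the read of r1 is linked to tag 0 (its producer), not the reg file.
print(r.rename(dest=1, src1=1, src2=4))  # (1, 0, None)
</code>
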
===== Lecture 15 (2/21 Fri.) =====

  * OoO --> restricted dataflow
    * Extracting parallelism
    * What are the bottlenecks?
      * Issue width
      * Dispatch width
      * Parallelism in the program
    * More examples on slide #10
  * What does it mean to be restricted dataflow?
    * Still visible as a Von Neumann model
  * Where does the efficiency come from?
  * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * We would like to schedule them out of order, but make them visible in order
    * When do you schedule the load/store instructions?
    * Can we predict whether loads/stores are dependent?
      * This is one of the most complex parts of load/store handling
      * What information can be used to predict these load/store optimizations?
  * Note: IPC = 1/CPI (see the worked example after this list)
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle a misprediction/recovery
  * Token dataflow arch.
    * What are tokens?
    * How to match tokens
    * Tagged token dataflow arch.
    * What are the tradeoffs?
    * Difficulties?

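A quick worked example for the IPC = 1/CPI note above, with made-up numbers:

<code python>
# If a program needs 8e9 cycles to execute 4e9 instructions (assumed
# numbers), then CPI = cycles/instructions and IPC is its reciprocal.
instructions = 4e9
cycles = 8e9
cpi = cycles / instructions   # 2.0 cycles per instruction
ipc = 1 / cpi                 # 0.5 instructions per cycle
print(cpi, ipc)
</code>
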
===== Lecture 16 (2/24 Mon.) =====

  * SISD/SIMD/MISD/MIMD
  * Array processor
  * Vector processor
  * Data parallelism
    * Where does the concurrency arise?
  * Differences between an array processor and a vector processor
  * VLIW
  * Compactness of an array processor
  * A vector processor operates on a vector of data (rather than a single datum (scalar))
    * Vector length (also applies to an array processor)
    * No dependences within a vector --> can have a deep pipeline
    * Highly parallel (both instruction level (ILP) and memory level (MLP))
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
  * What do the functional units look like? Deep pipelines and simpler control.
  * CRAY-1 is a classic example of a vector processor
  * Memory access patterns in a vector processor
    * How do the memory accesses benefit the memory bandwidth?
    * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a brief explanation of memory level parallelism
  * Stride length vs. the number of banks
    * The stride length should be relatively prime to the number of banks (see the toy model after this list)
  * Tradeoffs between row major and column major --> how can the vector processor deal with the two?
  * How to calculate the efficiency and performance of vector processors
  * What if there are multiple memory ports?
  * Gather/scatter allows a vector processor to be a lot more programmable (i.e., gather data for parallelism)
    * Helps handle sparse matrices
  * Conditional operation
  * Structure of vector units
  * How to automatically parallelize code through the compiler?
    * This is a hard problem. The compiler does not know the memory addresses.
  * What do we need to ensure for both vector and array processors?
  * Sequential bottleneck
    * Amdahl's law
  * Intel MMX --> an example of Intel's approach to SIMD
    * No VLEN register; the opcode defines the vector length
    * Stride is always 1 in MMX
  * Intel SSE --> modern version of MMX

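A toy model of the stride-vs-banks bullet above (the bank count and number of accesses are arbitrary assumptions): if element i maps to bank (i * stride) mod num_banks, a stride sharing a factor with the bank count concentrates accesses on a few banks, while a relatively prime stride spreads them across all of them.

<code python>
# Which banks does a strided vector access pattern touch?
def banks_touched(stride, num_banks, num_accesses=16):
    return {(i * stride) % num_banks for i in range(num_accesses)}

print(banks_touched(stride=8, num_banks=8))  # {0}: every access conflicts
print(banks_touched(stride=7, num_banks=8))  # all 8 banks: accesses overlap
</code>
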
===== Lecture 17 (2/26 Wed.) =====

  * GPU
    * Warp/wavefront
      * A bunch of threads sharing the same PC
    * SIMT
    * Lanes
    * FGMT + massively parallel
      * Tolerates long latency
  * Warp-based SIMD vs. traditional SIMD
  * SPMD (programming model)
    * A single program operates on multiple data
      * Can have synchronization points
    * Many scientific applications are programmed in this manner
  * Control flow problem (branch divergence)
    * Masking (in a branch, mask off the threads that should not execute that path; see the sketch after this list)
    * Lowers SIMD efficiency
    * What if you have layers of branches?
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
  * VLIW
    * Wide fetch
    * IA-64
    * Tradeoffs
      * Simple hardware (no dynamic scheduling, no dependency checking within a VLIW instruction)
      * Puts a heavy load on the compiler
  * Decoupled access/execute
    * A limited form of OoO
    * Tradeoffs
    * How to steer the instructions (determine dependences/stalling)?
  * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
    * Staged processing

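A minimal sketch of the masking bullet above (pure illustration, not any real GPU ISA): all threads in a warp share one PC, so on a divergent branch both paths are executed, and a per-thread mask disables the threads that are not on the current path.

<code python>
# Execute an if/else over a 4-thread warp using masks.
def simt_branch(values):
    taken = [v > 0 for v in values]   # per-thread branch outcome (the mask)
    result = list(values)
    for i in range(len(values)):      # "if" path: only masked-in threads commit
        if taken[i]:
            result[i] = values[i] * 2
    for i in range(len(values)):      # "else" path: the complementary mask
        if not taken[i]:
            result[i] = 0
    return result

# Both paths run over the whole warp, so SIMD efficiency is 50% here.
print(simt_branch([3, -1, 5, -2]))    # [6, 0, 10, 0]
</code>
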
===== Lecture 18 (2/28 Fri.) =====

  * Tradeoffs of VLIW
  * Why does VLIW require static instruction scheduling?
    * Whose job is it?
    * The compiler can rearrange basic blocks/instructions
  * Basic block
    * Benefits of having large basic blocks
    * Entry/exit
    * Handling entries/exits
  * Trace cache
    * How to ensure correctness?
    * Profiling
    * Fixing up the instruction order to ensure correctness
    * Dealing with multiple entries into the block
    * Dealing with multiple exits from the block
  * Superblock
    * How to form superblocks?
    * Benefits of a superblock
    * Tradeoff between forming a superblock and not forming one
      * Ambiguous branch (after profiling, taken/not-taken are equally likely)
    * Cleaning up
    * What scenarios would make trace cache/superblock/profiling less effective?
  * List scheduling (a small sketch follows this list)
    * Helps figure out which instructions the VLIW machine should fetch
    * Tries to maximize instruction throughput
    * How to assign priorities
    * What if some instructions take longer than others?
  * Block-structured ISA (BS-ISA)
    * Problems with trace scheduling?
    * What types of programs will benefit from BS-ISA?
    * How to form blocks in BS-ISA?
      * Combining basic blocks
      * Multiple versions of merged basic blocks
    * How to deal with entries/exits in BS-ISA?
      * Undo the executed instructions from the entry point, then fetch the new block
    * Advantages over a trace cache
  * Benefits of VLIW + static instruction scheduling
    * Intel IA-64
      * Static instruction scheduling and VLIW

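A small list-scheduling sketch for the bullet above (the priority function and issue width are assumptions; critical-path height is one common priority choice): each cycle, pick up to the issue width of ready instructions, highest priority first.

<code python>
# Dependence graph: instruction -> the instructions it reads from.
deps = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}

def height(n):
    # Priority: length of the longest dependence chain starting at n
    users = [m for m, srcs in deps.items() if n in srcs]
    return 1 + max((height(u) for u in users), default=0)

done, schedule, width = set(), [], 2
while len(done) < len(deps):
    ready = [n for n in deps
             if n not in done and all(s in done for s in deps[n])]
    picked = sorted(ready, key=height, reverse=True)[:width]
    schedule.append(picked)
    done.update(picked)

print(schedule)  # [['a', 'b'], ['c', 'd'], ['e']]: one list per VLIW cycle
</code>
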
===== Lecture 19 (3/19 Wed.) =====

  * Ideal cache
    * More capacity
    * Fast
    * Cheap
    * High bandwidth
  * DRAM cell
    * Cheap
    * Senses the perturbation through a sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
    * Expensive
    * Fast (easier to sense the value in the cell)
  * Memory bank
    * Read access sequence
    * DRAM: Activate -> Read -> Precharge (if needed)
    * What dominates the access latency for DRAM and SRAM?
  * Scaling issue
    * Hard to scale the cells down to smaller sizes
  * Memory hierarchy
    * Prefetching
    * Caching
  * Spatial and temporal locality
    * A cache can exploit these
    * Recently used data is likely to be accessed again
    * Nearby data is likely to be accessed
  * Caching in a pipelined design
  * Cache management
    * Manual
      * Data movement is managed manually
        * Embedded processors
        * GPU scratchpad
    * Automatic
      * HW manages data movement
  * Latency analysis
    * Based on the hit/miss status, the next level's access time (on a miss), and the current level's access time
  * Cache basics
    * Set/block (line)/placement/replacement/direct-mapped vs. associative caches/etc.
  * Cache access
    * How to access tag and data (in parallel vs. serially)
    * How do the tag and index get used? (see the sketch after this list)
    * Modern processors perform serial access for higher-level caches (L3, for example) to save power
  * Cost and benefit of having more associativity
  * Given the associativity, which block should be replaced if the set is full?
  * Replacement policy
    * Random
    * Least recently used (LRU)
    * Least frequently used
    * Least costly to refetch
    * etc.
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly with associativity
    * Approximate LRU
      * Victim/next-victim policy

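To make the tag/index question above concrete, here is a minimal address-decomposition sketch (the 64 B block size and 128 sets are assumed parameters, not from the lecture):

<code python>
BLOCK_BITS = 6   # 64-byte blocks -> low 6 bits are the byte-in-block offset
INDEX_BITS = 7   # 128 sets      -> next 7 bits select the set

def split(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

# The index picks the set; the tag is compared against the tags stored
# in that set to detect a hit.
print(split(0x12345))  # (9, 13, 5)
</code>
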
===== Lecture 20 (3/21 Fri.) =====

  * Set thrashing
    * The working set is bigger than the associativity
  * Belady's OPT (see the sketch after this list)
    * Is this optimal?
    * Complexity?
  * Similarity between caches and page tables
    * Number of blocks vs. pages
    * Time to find the block/page to replace
  * Handling writes
    * Write-through
      * Simpler, no consistency issues between levels
    * Write-back
      * Needs a modified (dirty) bit to make sure accesses to the data get the updated copy
  * Sectored caches
    * Use subblocks
    * Lower bandwidth requirement
    * More complex
  * Instruction vs. data caches
    * Where to place instructions
      * Unified vs. separate
      * In the first-level cache
  * Cache access
    * First-level access
    * Second-level access
      * When to start the second-level access
      * Performance vs. energy
  * Address translation
  * Homonyms and synonyms
    * Homonym: the same VA maps to different PAs
      * With multiple processes
    * Synonyms: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
      * I/O
    * Can these create problems when we have caches?
    * How to eliminate these problems?
      * Page coloring
  * Interaction between the cache and the TLB
    * Virtually indexed vs. physically indexed
    * Virtually tagged vs. physically tagged
    * Virtually indexed, physically tagged
  * Virtual memory in DRAM
    * Control where data is mapped to in channel/rank/bank
      * More parallelism
      * Reduce interference

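A small sketch of Belady's OPT for the bullets above (illustrative): evict the resident block whose next use is farthest in the future. It minimizes misses but needs the whole future access trace, which is why it serves as a bound rather than a practical policy.

<code python>
def opt_misses(trace, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]
            # Evict the block reused farthest in the future (or never)
            victim = max(cache, key=lambda b: future.index(b)
                         if b in future else len(future) + 1)
            cache.remove(victim)
        cache.add(block)
    return misses

print(opt_misses(["a", "b", "c", "a", "b", "d", "a"], capacity=2))  # 5
</code>
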
===== Lecture 21 (3/24 Mon.) =====

  * Different parameters that affect cache misses
    * Thrashing
  * Different types of cache misses
    * Compulsory misses
      * Can be mitigated with prefetching
    * Capacity misses
      * More associativity
      * Victim cache
    * Conflict misses
      * Hashing
  * Large blocks vs. small blocks
    * Subblocks
  * Victim cache
    * A small but fully associative cache behind the actual cache
    * Holds recently evicted (victim) blocks
    * Prevents ping-ponging
  * Pseudo-associativity
    * A simpler way to implement an associative cache
  * Skewed associative cache
    * A different hashing function for each way
  * Restructuring data access patterns
    * Order of loop traversal
    * Blocking (see the tiling sketch after this list)
  * Memory level parallelism
    * The cost per miss of parallel cache misses is lower than that of serialized misses
  * MSHR
    * Keeps track of pending cache misses
    * Think of it as the load/store-buffer-ish structure for the cache
    * What information goes into the MSHR?
    * When do you access the MSHR?

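A minimal loop-blocking (tiling) sketch for the bullet above (the matrix size and tile size are arbitrary): traversing the matrix tile by tile keeps each tile cache-resident while it is reused, which cuts capacity misses.

<code python>
N, B = 8, 4                           # matrix dimension and tile size
A = [[i * N + j for j in range(N)] for i in range(N)]

total = 0
for ii in range(0, N, B):             # tile row
    for jj in range(0, N, B):         # tile column
        for i in range(ii, ii + B):   # all accesses below stay in one tile
            for j in range(jj, jj + B):
                total += A[i][j]

print(total)  # same sum as the untiled loops, but with better locality
</code>
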
===== Lecture 22 (3/26 Wed.) =====

  * Multiporting
    * Virtual multiporting
      * Time-share the port; not too scalable, but cheap
    * True multiporting
    * Multiple cache copies
  * Banking
    * Can have bank conflicts
    * Extra interconnect across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regfile in a GPU is also banked, but mainly for the purpose of reducing complexity)
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
  * DRAM (1T-1C cell) vs. SRAM (6T cell)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effects of the address mapping on memory interleaving
    * Effects of the program's memory access patterns on interleaving
  * DRAM bank
    * Minimizes the cost of interleaving (banks share the data bus and the command bus)
  * DRAM rank
    * Minimizes the cost per chip (a bundle of chips operated together)
  * DRAM channel
    * An interface to DRAM, each with its own ranks/banks
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data in DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface when DRAM needs to read 32 bits
  * Address mapping (see the sketch after this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order in which requests are serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever a request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
      * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS
    * Capped FR-FCFS: FR-FCFS with a timeout
    * Usually this is done at the command level (read/write commands and precharge/activate commands)

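A toy sketch of the two address mappings above (all field widths are assumptions): the same physical address is sliced two ways. Row interleaving keeps consecutive addresses in one bank's row (more row buffer hits); cache block interleaving sends consecutive blocks to different banks (more bank-level parallelism).

<code python>
def fields(addr, layout):
    # Slice addr into named fields, lowest-order field first.
    out = {}
    for name, width in layout:
        out[name] = addr & ((1 << width) - 1)
        addr >>= width
    return out

row_interleaved   = [("offset", 6), ("column", 8), ("bank", 3), ("row", 15)]
block_interleaved = [("offset", 6), ("bank", 3), ("column", 8), ("row", 15)]

addr = 0x1234ABC
print(fields(addr, row_interleaved))    # consecutive blocks -> same bank
print(fields(addr, block_interleaved))  # consecutive blocks -> spread over banks
</code>
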
===== Lecture 23 (3/28 Fri.) =====

  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifiers
    * How do they work?
  * Dual data rate
  * Subarray
  * RowClone
    * Moves bulk data from one row to another
    * Lower latency and higher BW when copying or zeroing out data
  * TL-DRAM
    * Far segment
    * Near segment
    * What causes the long latency?
    * Benefits of TL-DRAM
    * TL-DRAM vs. a DRAM cache (adding a small cache in DRAM)

===== Lecture 24 (3/31 Mon.) =====

  * Memory controller
    * Different commands
  * Memory scheduler
    * Determines the order of requests to be issued to DRAM
    * Age / hit-miss status / type (load/store/prefetch, from GPU or CPU) / criticality
  * Row buffer
    * Hit/conflict
    * Open/closed row
    * Open-row policy
    * Closed-row policy
    * Tradeoffs between open- and closed-row policies
      * If the program has high row buffer locality, an open-row policy might benefit more
      * A closed-row policy services row-miss requests faster
  * Bank conflicts
  * Interference from different applications/threads
    * Different programs/processes/threads interfere with each other
      * Introduces more row buffer/bank conflicts
    * The memory scheduler has to manage this interference
    * The memory hog problem
    * Interference on the data/command bus
  * FR-FCFS (see the sketch after this list)
    * Why does FR-FCFS make sense?
      * The row buffer has lower latency
    * Issues with FR-FCFS
      * Unfairness
  * STFM
    * Fairness issues in memory scheduling
    * How does STFM calculate fairness and slowdown?
      * How to estimate the time a thread would take if it were running alone
    * Definition of fairness (based on STFM; different papers/areas define fairness differently)
  * PAR-BS
    * Parallelism in programs
    * Interference across banks
    * How to form a batch
    * How to determine ranking between batches/within a batch

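A minimal FR-FCFS sketch for the bullets above (the request format and open-row state are made up): among queued requests, first prefer row buffer hits (the open row has lower latency), then the oldest request.

<code python>
requests = [(0, 0, 7), (1, 0, 3), (2, 1, 5), (3, 0, 7)]  # (age, bank, row)
open_row = {0: 7, 1: 2}          # currently open row in each bank

def fr_fcfs_pick(reqs):
    # Sort key: row hits before row misses, then older before younger.
    return min(reqs, key=lambda r: (open_row.get(r[1]) != r[2], r[0]))

# (0, 0, 7) wins: it is a row hit and older than the other hit (3, 0, 7).
# This preference is also where the unfairness comes from: applications
# with high row buffer locality keep winning over the others.
print(fr_fcfs_pick(requests))
</code>
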
===== Lecture 25 (4/2 Wed.) =====

  * Latency sensitivity
    * Performance drops a lot when the memory request latency is long
  * TCM
    * Tradeoff between throughput and fairness
    * Latency-sensitive cluster (non-intensive cluster)
      * Ranking based on memory intensity
    * Bandwidth-intensive cluster
      * Round-robin within the cluster
    * Generally the latency-sensitive cluster gets higher priority
    * Provides robust fairness vs. throughput
    * Complexity of TCM?
  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keep track of applications' characteristics
        * Dedicating a channel might waste bandwidth
        * Needs OS support to determine the channel bits
    * Source throttling
      * A controller throttles the cores depending on the performance target
      * Example: fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold-based solutions are hard to configure
    * App/thread scheduling
      * Critical threads usually stall the progress
  * Designing a DRAM controller
    * Has to handle the normal DRAM operations
      * Read/write/refresh/all the timing constraints
    * Keeps track of resources
    * Assigns priorities to different requests
    * Manages requests to banks
  * Self-optimizing controller
    * Uses machine learning to improve the DRAM controller
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Refresh takes longer with bigger DRAM
    * Distributed refresh: stagger refreshes within the 64 ms window in a distributed manner
      * As opposed to burst refresh (long pause times)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
    * Profile the rows
      * High temperature changes the retention time: need online profiling
  * Bloom filter (a minimal sketch follows this list)
    * Represents set membership
    * Approximate
    * Can return false positives
      * Better/more hash functions help reduce them
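
A minimal Bloom filter sketch for the bullets above (the bit-array size and hash count are arbitrary choices): k hash functions set k bits per inserted element, so membership tests can return false positives but never false negatives, which is what lets a mechanism like RAIDR track its refresh bins compactly.

<code python>
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def maybe_contains(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("row 42")
print(bf.maybe_contains("row 42"))  # True: inserted items are always found
print(bf.maybe_contains("row 7"))   # usually False; True = false positive
</code>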