This shows you the differences between two versions of the page.
buzzword [2014/03/21 18:16] rachata |
buzzword [2014/04/07 18:17] rachata |
||
---|---|---|---|
Line 776: | Line 776: | ||
* More parallelism | * More parallelism | ||
* Reduce interference | * Reduce interference | ||
+ | |||
+ | ===== Lecture 21 (3/24 Mon.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Different parameters that affect cache misses | ||
+ | * Thrashing | ||
+ | * Different types of cache misses | ||
+ | * Compulsory misses | ||
+ | * Can mitigate with prefetches | ||
+ | * Capacity misses | ||
+ | * More assoc | ||
+ | * Victim cache | ||
+ | * Conflict misses | ||
+ | * Hashing | ||
+ | * Large block vs. small block | ||
+ | * Subblocks | ||
+ | * Victim cache | ||
+ | * Small, but fully assoc. cache behind the actual cache | ||
+ | * Caches recently evicted cache blocks | ||
+ | * Prevent ping-ponging | ||
+ | * Pseudo associativity | ||
+ | * Simpler way to implement associative cache | ||
+ | * Skewed assoc. cache | ||
+ | * Different hashing functions for each way | ||
+ | * Restructure data access pattern | ||
+ | * Order of loop traversal | ||
+ | * Blocking | ||
+ | * Memory level parallelism | ||
+ | * The per-miss cost of parallel cache misses is lower than that of serial misses | ||
+ | * MSHR | ||
+ | * Keep track of pending cache misses | ||
+ | * Think of this as something like a load/store buffer for the cache | ||
+ | * What information goes into the MSHR? | ||
+ | * When do you access the MSHR? | ||
+ | |||
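The victim cache and ping-ponging items above can be illustrated with a small simulation. This is a minimal sketch, not the lecture's design: the class name `DirectMappedCache`, the helper `count_misses`, the swap-on-victim-hit behavior, and the FIFO replacement in the victim cache are all assumptions made for illustration.

```python
from collections import OrderedDict


class DirectMappedCache:
    """Direct-mapped cache with an optional small, fully associative victim cache."""

    def __init__(self, num_sets, victim_entries=0):
        self.num_sets = num_sets
        self.lines = [None] * num_sets   # one block address per set
        self.victim_entries = victim_entries
        self.victims = OrderedDict()     # evicted blocks, oldest first (FIFO, an assumption)
        self.misses = 0                  # accesses that must go to the next level

    def access(self, block_addr):
        index = block_addr % self.num_sets
        if self.lines[index] == block_addr:
            return "hit"
        evicted = self.lines[index]
        if block_addr in self.victims:
            # Found behind the main cache: bring it back without a full miss.
            del self.victims[block_addr]
            result = "victim hit"
        else:
            self.misses += 1
            result = "miss"
        self.lines[index] = block_addr
        if evicted is not None and self.victim_entries > 0:
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)  # drop the oldest victim
        return result


def count_misses(trace, num_sets, victim_entries):
    """Run a trace of block addresses and return the number of full misses."""
    cache = DirectMappedCache(num_sets, victim_entries)
    for addr in trace:
        cache.access(addr)
    return cache.misses


# Blocks 0 and 8 map to the same set of an 8-set cache, so they ping-pong.
trace = [0, 8] * 4
print(count_misses(trace, num_sets=8, victim_entries=0))
print(count_misses(trace, num_sets=8, victim_entries=2))
```

Without the victim cache every access in this trace misses; with two victim entries only the two compulsory misses remain, because each evicted block is caught behind the main cache.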
+ | |||
+ | ===== Lecture 22 (3/26 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Multi-porting | ||
+ | * Virtual multi-porting | ||
+ | * Time-share the port, not too scalable but cheap | ||
+ | * True multiporting | ||
+ | * Multiple cache copies | ||
+ | * Banking | ||
+ | * Can have bank conflict | ||
+ | * Extra interconnects across banks | ||
+ | * Address mapping can mitigate bank conflict | ||
+ | * Common in main memory (note that regFile in GPU is also banked, but mainly for the purpose of reducing complexity) | ||
+ | * Accessing DRAM | ||
+ | * Row bits | ||
+ | * Column bits | ||
+ | * Addressability | ||
+ | * DRAM has its own clock | ||
+ | * DRAM (1T-1C) vs. SRAM (6T) | ||
+ | * Cost | ||
+ | * Latency | ||
+ | * Interleaving in DRAM | ||
+ | * Effects from address mapping on memory interleaving | ||
+ | * Effects from memory access patterns from the program on interleaving | ||
+ | * DRAM Bank | ||
+ | * Minimizes the cost of interleaving (banks share the data bus and the command bus) | ||
+ | * DRAM Rank | ||
+ | * Minimize the cost of the chip (a bundle of chips operated together) | ||
+ | * DRAM Channel | ||
+ | * An interface to DRAM, each with its own ranks/banks | ||
+ | * DIMM | ||
+ | * More DIMMs add interconnect complexity | ||
+ | * List of commands to read/write data into DRAM | ||
+ | * Activate -> read/write -> precharge | ||
+ | * Activate moves data into the row buffer | ||
+ | * Precharge prepares the bank for the next access | ||
+ | * Row buffer hit | ||
+ | * Row buffer conflict | ||
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32 bits through an 8-bit interface when DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
+ | * Memory controller | ||
+ | * Sending DRAM commands | ||
+ | * Periodically send commands to refresh DRAM cells | ||
+ | * Ensure correctness and data integrity | ||
+ | * Where to place the memory controller | ||
+ | * On CPU chip vs. at the main memory | ||
+ | * Higher BW on-chip | ||
+ | * Determine the order of requests that will be serviced in DRAM | ||
+ | * Request queues that hold requests | ||
+ | * Send requests whenever the request can be sent to the bank | ||
+ | * Determine which command (across banks) should be sent to DRAM | ||
+ | * Priority of demand vs. prefetch requests | ||
+ | * Memory scheduling policies | ||
+ | * FCFS | ||
+ | * FR-FCFS | ||
+ | * Capped FR-FCFS: FR-FCFS with a timeout | ||
+ | * Usually this is done at the command level (read/write commands and precharge/activate commands) | ||
+ | | ||
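The two address mappings listed above (row interleaved and cache block interleaved) can be compared with a short sketch. The geometry here (`BLOCKS_PER_ROW = 128`, `NUM_BANKS = 8`) and the function names are assumptions chosen for illustration; real controllers use other bit layouts and often hash the bank bits.

```python
BLOCKS_PER_ROW = 128  # assumed: cache blocks per DRAM row
NUM_BANKS = 8         # assumed: banks per channel


def row_interleaved(block):
    """Consecutive rows go to consecutive banks; a whole row stays in one bank."""
    column = block % BLOCKS_PER_ROW
    bank = (block // BLOCKS_PER_ROW) % NUM_BANKS
    row = block // (BLOCKS_PER_ROW * NUM_BANKS)
    return bank, row, column


def block_interleaved(block):
    """Consecutive cache blocks go to consecutive banks."""
    bank = block % NUM_BANKS
    column = (block // NUM_BANKS) % BLOCKS_PER_ROW
    row = block // (NUM_BANKS * BLOCKS_PER_ROW)
    return bank, row, column


def banks_touched(mapping, blocks):
    """Return the sorted set of banks that a sequence of block numbers maps to."""
    return sorted({mapping(b)[0] for b in blocks})


sequential = range(8)  # eight consecutive cache blocks
print(banks_touched(row_interleaved, sequential))
print(banks_touched(block_interleaved, sequential))
```

Under these assumptions a sequential stream stays in one bank (and one row) with row interleaving, which favors row buffer hits, while cache block interleaving spreads the same stream over all banks, which favors bank-level parallelism.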
+ | | ||
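The difference between FCFS and FR-FCFS from the scheduling policies above can be sketched as follows. This is a simplified model under stated assumptions: all requests are already queued, timing constraints are ignored, and the names `Request`, `fcfs_pick`, `fr_fcfs_pick`, and `schedule` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Request:
    arrival: int  # arrival order, smaller is older
    bank: int
    row: int


def fcfs_pick(queue, open_rows):
    """FCFS: always service the oldest request."""
    return min(queue, key=lambda r: r.arrival)


def fr_fcfs_pick(queue, open_rows):
    """FR-FCFS: row buffer hits first, then the oldest request."""
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))


def schedule(requests, pick):
    """Drain the queue with the given policy; return service order and row hit count."""
    queue = list(requests)
    open_rows = {}  # bank -> currently open row
    order, row_hits = [], 0
    while queue:
        req = pick(queue, open_rows)
        queue.remove(req)
        if open_rows.get(req.bank) == req.row:
            row_hits += 1
        open_rows[req.bank] = req.row  # the accessed row stays open (open row policy)
        order.append(req.arrival)
    return order, row_hits


requests = [Request(0, 0, 1), Request(1, 0, 2), Request(2, 0, 1), Request(3, 0, 2)]
print(schedule(requests, fcfs_pick))
print(schedule(requests, fr_fcfs_pick))
```

On this queue FCFS alternates between rows 1 and 2 and gets no row hits, while FR-FCFS reorders the requests to get two. The same reordering is what lets a thread with high row buffer locality starve others, which a cap or timeout is meant to limit.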
+ | ===== Lecture 23 (3/28 Fri.) ===== | ||
+ | |||
+ | * DRAM design choices | ||
+ | * Cost/density/latency/BW/Yield | ||
+ | * Sense Amplifier | ||
+ | * How do they work | ||
+ | * Double data rate (DDR) | ||
+ | * Subarray | ||
+ | * Rowclone | ||
+ | * Moving bulk of data from one row to others | ||
+ | * Lower latency and bandwidth consumption when copying or zeroing out data | ||
+ | * TL-DRAM | ||
+ | * Far segment | ||
+ | * Near segment | ||
+ | * What causes the long latency | ||
+ | * Benefit of TL-DRAM | ||
+ | * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) | ||
+ | |||
+ | | ||
+ | | ||
+ | | ||
+ | ===== Lecture 24 (3/31 Mon.) ===== | ||
+ | | ||
+ | |||
+ | * Memory controller | ||
+ | * Different commands | ||
+ | * Memory scheduler | ||
+ | * Determine the order of requests to be issued to DRAM | ||
+ | * Age/hit-miss status/types(load/store/prefetch/from GPU/from CPU)/criticality | ||
+ | * Row buffer | ||
+ | * hit/conflict | ||
+ | * open/closed row | ||
+ | * Open row policy | ||
+ | * Closed row policy | ||
+ | * Tradeoffs between open and closed row policy | ||
+ | * If the program has high row buffer locality, the open row policy may benefit more | ||
+ | * Closed row policy services row-miss requests faster | ||
+ | * Bank conflict | ||
+ | * Interference from different applications/threads | ||
+ | * Different programs/processes/threads interfere with each other | ||
+ | * Introduces more row buffer/bank conflicts | ||
+ | * Memory scheduler has to manage this interference | ||
+ | * Memory hog problems | ||
+ | * Interference in the data/command bus | ||
+ | * FR-FCFS | ||
+ | * Why does FR-FCFS make sense? | ||
+ | * Row buffer hits have lower latency | ||
+ | * Issues with FR-FCFS | ||
+ | * Unfairness | ||
+ | * STFM | ||
+ | * Fairness issue in memory scheduling | ||
+ | * How does STFM calculate the fairness and slowdown | ||
+ | * How to estimate the stall time when the thread is running alone | ||
+ | * Definition of fairness (based on STFM, different papers/areas define fairness differently) | ||
+ | * PAR-BS | ||
+ | * Parallelism in programs | ||
+ | * Interference across banks | ||
+ | * How to form a batch | ||
+ | * How to determine ranking between batches/within a batch | ||
+ | | ||
+ | |||
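The STFM items above (slowdown, fairness, and estimating alone-run time) can be made concrete with a small sketch. It assumes the commonly cited formulation: a thread's slowdown is its memory stall time when sharing divided by its stall time when running alone, and unfairness is the ratio of the largest to the smallest slowdown. The function names, the threshold parameter, and the idea that the alone-run stall time is simply given as an input are assumptions; estimating it in hardware is the hard part.

```python
def slowdown(stall_shared, stall_alone):
    """Memory slowdown of one thread: shared stall time over alone stall time."""
    return stall_shared / stall_alone


def unfairness(slowdowns):
    """Ratio of the largest to the smallest slowdown across threads."""
    return max(slowdowns.values()) / min(slowdowns.values())


def pick_policy(slowdowns, threshold):
    """Return the thread to prioritize, or None to keep the baseline policy."""
    if unfairness(slowdowns) > threshold:
        # Too unfair: favor the most slowed-down thread.
        return max(slowdowns, key=slowdowns.get)
    return None


slowdowns = {
    "A": slowdown(stall_shared=300, stall_alone=100),  # slowed down 3x
    "B": slowdown(stall_shared=150, stall_alone=100),  # slowed down 1.5x
}
print(unfairness(slowdowns))
print(pick_policy(slowdowns, threshold=1.1))
```

With these made-up numbers the unfairness is 2.0, so any threshold below that makes the sketch prioritize thread A; above it, the scheduler would stay with its baseline (for example FR-FCFS).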
+ | |||
+ | ===== Lecture 25 (4/2 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Latency sensitivity | ||
+ | * Performance drops a lot when the memory request latency is long | ||
+ | * TCM | ||
+ | * Tradeoff between throughput and fairness | ||
+ | * Latency sensitive cluster (non-intensive cluster) | ||
+ | * Ranking based on memory intensity | ||
+ | * Bandwidth intensive cluster | ||
+ | * Round robin within the cluster | ||
+ | * Generally latency sensitive cluster has more priority | ||
+ | * Provide robust fairness vs. throughput | ||
+ | * Complexity of TCM? | ||
+ | * Different ways to control interference in DRAM | ||
+ | * Partitioning of resource | ||
+ | * Channel partitioning: map applications that interfere with each other to different channels | ||
+ | * Keep track of application's characteristics | ||
+ | * Dedicating a channel might waste bandwidth | ||
+ | * Need OS support to determine the channel bits | ||
+ | * Source throttling | ||
+ | * A controller throttles the core depending on the performance target | ||
+ | * Example: Fairness via source throttling | ||
+ | * Detect unfairness and throttle the application that is interfering | ||
+ | * How do you estimate slowdown? | ||
+ | * Threshold based solution: hard to configure | ||
+ | * App/thread scheduling | ||
+ | * Critical threads usually stall the progress | ||
+ | * Designing DRAM controller | ||
+ | * Has to handle the normal DRAM operations | ||
+ | * Read/write/refresh/all the timing constraints | ||
+ | * Keep track of resources | ||
+ | * Assign priorities to different requests | ||
+ | * Manage requests to banks | ||
+ | * Self-optimizing controller | ||
+ | * Use machine learning to improve DRAM controller | ||
+ | * DRAM Refresh | ||
+ | * Why does DRAM have to refresh every 64 ms? | ||
+ | * Banks are unavailable during refresh | ||
+ | * LPDDR mitigates this by using a per-bank refresh | ||
+ | * Refresh takes longer with larger DRAM | ||
+ | * Distributed refresh: stagger refresh every 64 ms in a distributed manner | ||
+ | * As opposed to burst refresh (long pause time) | ||
+ | * RAIDR: Reduce DRAM refresh by profiling and binning | ||
+ | * Some rows do not have to be refreshed very frequently | ||
+ | * Profile the row | ||
+ | * High temperature changes the retention time: need online profiling | ||
+ | * Bloom filter | ||
+ | * Represent set membership | ||
+ | * Approximated | ||
+ | * Can contain false positive | ||
+ | * Better/more hash functions help reduce this | ||
+ | | ||
+ | | ||
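The Bloom filter properties listed above (approximate set membership, possible false positives, no false negatives) can be seen in a minimal sketch. The class below is illustrative only: the sizes, the use of SHA-256 with a seed prefix as the hash family, and the `weak_rows` example are assumptions, not the hardware design used for refresh binning.

```python
import hashlib


class BloomFilter:
    """Approximate set: may report false positives, never false negatives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # True only if every position is set; unrelated items can collide.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical use: remember which rows need the most frequent refresh.
weak_rows = BloomFilter(num_bits=1024, num_hashes=3)
for row in (17, 4242, 90001):
    weak_rows.add(row)

print(17 in weak_rows)  # inserted rows are always reported as present
```

For refresh binning, a false positive only means a row is refreshed more often than it needs to be, so correctness is preserved and only some of the savings are lost.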
+ | | ||
+ | ===== Lecture 26 (4/7 Mon.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Tolerating latency can be costly | ||
+ | * Instruction window is complex | ||
+ | * Benefit also diminishes | ||
+ | * Designing the buffers can be complex | ||
+ | * A simpler way to tolerate latency than a large out-of-order window is desirable | ||
+ | * Different sources that cause the core to stall in OoO | ||
+ | * Cache miss | ||
+ | * Note that stall happens if the inst. window is full | ||
+ | * Scaling instruction window size is hard | ||
+ | * It is better (less complex) to make the window more efficient | ||
+ | * Runahead execution | ||
+ | * Try to obtain MLP without increasing the instruction window size | ||
+ | * Runahead (i.e. execute ahead) when there is a long-latency memory instruction | ||
+ | * A long-latency memory instruction stalls the processor for a while anyway, so it is better to make use of that time | ||
+ | * Execute future instructions to generate accurate prefetches | ||
+ | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoint the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | * Cost and benefit of runahead execution (slide number 27) | ||
+ | * Runahead can have inefficiency | ||
+ | * Runahead periods that are useless | ||
+ | * Get rid of useless, inefficient periods | ||
+ | * What if there is a dependent cache miss | ||
+ | * Cannot be parallelized in a vanilla runahead | ||
+ | * Can predict the value of the dependent load | ||
+ | * How to predict the address of the load | ||
+ | * Delta value information | ||
+ | * Stride predictor | ||
+ | * AVD prediction | ||
+ | | ||
+ | | ||
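The stride predictor mentioned above for predicting load addresses can be sketched in a few lines. This is a minimal illustration, not the lecture's design: the per-PC table layout, the rule that a stride must repeat once before it is trusted, and the class name `StridePredictor` are assumptions.

```python
class StridePredictor:
    """Per-PC stride predictor: predicts last address plus the last seen stride."""

    def __init__(self):
        self.table = {}  # pc -> (last_addr, stride, confident)

    def predict(self, pc):
        """Return the predicted next address for this load, or None if unsure."""
        entry = self.table.get(pc)
        if entry is None:
            return None
        last_addr, stride, confident = entry
        return last_addr + stride if confident else None

    def update(self, pc, addr):
        """Train with the address the load actually accessed."""
        entry = self.table.get(pc)
        if entry is None:
            self.table[pc] = (addr, 0, False)
            return
        last_addr, stride, _ = entry
        new_stride = addr - last_addr
        # Trust the stride only once the same stride has been seen twice in a row.
        self.table[pc] = (addr, new_stride, new_stride == stride)


predictor = StridePredictor()
load_pc = 0x400  # hypothetical load instruction address
for addr in (100, 108, 116):
    predictor.update(load_pc, addr)
print(predictor.predict(load_pc))
```

A predictor like this covers loads that walk an array with a constant stride. Pointer-chasing loads, where the address depends on a previously loaded value, are the dependent cache miss case where value or delta-based prediction is needed instead.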