This shows you the differences between two versions of the page.
buzzword [2015/03/25 18:19] rachata |
buzzword [2015/04/01 18:17] rachata |
||
---|---|---|---|
Line 948: | Line 948: | ||
+ | ===== Lecture 23 (03/27 Fri.) ===== | ||
+ | |||
+ | * Different ways to control interference in DRAM | ||
+ | * Partitioning of resources | ||
+ | * Channel partitioning: map applications that interfere with each other to different channels | ||
+ | * Keep track of application's characteristics | ||
+ | * Dedicating a channel might waste bandwidth | ||
+ | * Need OS support to determine the channel bits | ||
+ | * Source throttling | ||
+ | * A controller throttles the cores depending on the performance target | ||
+ | * Example: Fairness via source throttling | ||
+ | * Detect unfairness and throttle the application that is interfering | ||
+ | * How do you estimate slowdown? | ||
+ | * Threshold based solution: hard to configure | ||
+ | * App/thread scheduling | ||
+ | * Critical threads usually stall the progress | ||
+ | * Designing DRAM controller | ||
+ | * Has to handle the normal DRAM operations | ||
+ | * Read/write/refresh/all the timing constraints | ||
+ | * Keep track of resources | ||
+ | * Assign priorities to different requests | ||
+ | * Manage requests to banks | ||
+ | * Self-optimizing controller | ||
+ | * Use machine learning to improve DRAM controller | ||
+ | * A-DRM | ||
+ | * Architecture-aware distributed resource management | ||
+ | * Multithreading | ||
+ | * Synchronization | ||
+ | * Pipeline programs | ||
+ | * Producer consumer model | ||
+ | * Critical path | ||
+ | * Limiter threads | ||
+ | * Prioritization between threads | ||
+ | * Different power modes in DRAM | ||
+ | * DRAM Refresh | ||
+ | * Why does DRAM have to refresh every 64 ms? | ||
+ | * Banks are unavailable during refresh | ||
+ | * LPDDR mitigates this by using per-bank refresh | ||
+ | * Refresh takes longer as DRAM capacity grows | ||
+ | * Distributed refresh: spread the row refreshes evenly across each 64 ms window | ||
+ | * As opposed to burst refresh (long pause time) | ||
+ | * RAIDR: Reduce DRAM refresh by profiling and binning | ||
+ | * Some rows do not have to be refreshed very frequently | ||
+ | * Profile the rows | ||
+ | * High temperature changes the retention time: need online profiling | ||
+ | * Bloom filter | ||
+ | * Represent set membership | ||
+ | * Approximate | ||
+ | * Can produce false positives | ||
+ | * Better/more hash functions help reduce this | ||
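The Bloom filter notes above can be sketched in a few lines of Python. This is a minimal illustration, not the RAIDR hardware design: the class name, the salted SHA-256 hashing, the filter size, and the 64 ms / 256 ms intervals are all assumptions chosen for the example.

```python
import hashlib


class BloomFilter:
    """Approximate set membership: no false negatives, possible false positives."""

    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by salting a single hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # True may be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(item))


# Hypothetical profiling result: rows with short retention time go in the filter.
weak_rows = BloomFilter(num_bits=1024, num_hashes=3)
for row in (17, 4242, 9001):
    weak_rows.add(row)


def refresh_interval_ms(row):
    # Assumed intervals: rows in the filter are refreshed at the default rate,
    # all other rows less often.
    return 64 if weak_rows.might_contain(row) else 256
```

A false positive here only means a strong row is refreshed more often than it needs to be, which costs some energy but never loses data. That is why an approximate structure is acceptable for this use.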
| | ||
+ | ===== Lecture 24 (03/30 Mon.) ===== | ||
+ | |||
+ | * Simulation | ||
+ | * Drawbacks of RTL simulations | ||
+ | * Time consuming | ||
+ | * Complex to develop | ||
+ | * Hard to perform design explorations | ||
+ | * Explore the design space quickly | ||
+ | * Match the behavior of existing systems | ||
+ | * Tradeoffs: speed, accuracy, flexibility | ||
+ | * High-level simulation vs. detailed simulation | ||
+ | * High-level simulation is faster, but less accurate | ||
+ | * Controllers that work with multiple types of cores | ||
+ | * Design problems: how to find a good scheduling policy on its own? | ||
+ | * Self-optimizing memory controller: using machine learning | ||
+ | * Can adapt to the applications | ||
+ | * The complexity is very high | ||
+ | * Tolerating latency can be costly | ||
+ | * Instruction window is complex | ||
+ | * Benefit also diminishes | ||
+ | * Designing the buffers can be complex | ||
+ | * A simpler way to tolerate latency than a large out-of-order window is desirable | ||
+ | * Different sources that cause the core to stall in OoO | ||
+ | * Cache miss | ||
+ | * Note that a stall happens when the instruction window is full | ||
+ | * Scaling instruction window size is hard | ||
+ | * It is better (less complex) to make the window more efficient | ||
+ | * Runahead execution | ||
+ | * Try to obtain MLP (memory-level parallelism) without enlarging the instruction window | ||
+ | * Runahead (i.e. execute ahead) when there is a long-latency memory instruction | ||
+ | * A long-latency memory instruction stalls the processor for a while anyway, so it is better to make use of the stall time | ||
+ | * Execute future instructions to generate accurate prefetches | ||
+ | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoint the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
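To make the benefit of runahead concrete, here is a toy cycle-count model. It is only a sketch under strong assumptions, none of which come from the lecture: the core blocks on every miss, a miss costs a fixed 100 cycles, prefetches issued in runahead mode fully overlap with the miss that triggered them, and a load marked as dependent can never be pre-executed. The function name `run` and the trace format are hypothetical.

```python
MISS_LATENCY = 100  # assumed cost of a cache miss, in cycles
HIT_LATENCY = 1     # assumed cost of a cache hit, in cycles


def run(trace, runahead_depth=0):
    """Return total cycles for a list of (address, is_dependent) loads.

    is_dependent marks a load whose address comes from a load that misses,
    so its address is unknown in runahead mode (simplification).
    runahead_depth is how many future loads are pre-executed during a miss.
    """
    cache = set()
    cycles = 0
    for i, (addr, _) in enumerate(trace):
        if addr in cache:
            cycles += HIT_LATENCY
            continue
        # Demand miss: the core stalls for the full miss latency.
        cycles += MISS_LATENCY
        cache.add(addr)
        # Runahead mode: pre-execute upcoming loads during the stall and
        # prefetch the ones whose addresses can be computed.
        for future_addr, dependent in trace[i + 1 : i + 1 + runahead_depth]:
            if dependent:
                continue  # address is invalid until the miss data returns
            cache.add(future_addr)
    return cycles


independent = [(0x100, False), (0x200, False), (0x300, False), (0x400, False)]
pointer_chase = [(0x100, False), (0x200, True), (0x300, True), (0x400, True)]
```

With these traces, `run(independent)` gives 400 cycles and `run(independent, runahead_depth=3)` gives 103, because the three later misses overlap with the first one. `run(pointer_chase, runahead_depth=3)` stays at 400, since dependent misses cannot be parallelized by plain runahead.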
+ | |||
+ | |||
+ | ===== Lecture 25 (04/01 Wed.) ===== | ||
+ | |||
+ | * More on runahead execution | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoint the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | * Cost and benefit of runahead execution (slide number 27) | ||
+ | * Runahead can be inefficient | ||
+ | * Some runahead periods are useless | ||
+ | * Get rid of useless, inefficient periods | ||
+ | * What if there is a dependent cache miss? | ||
+ | * Cannot be parallelized in vanilla runahead | ||
+ | * Can predict the value of the dependent load | ||
+ | * How to predict the address of the load | ||
+ | * Delta value information | ||
+ | * Stride predictor | ||
+ | * AVD prediction | ||
+ | * Questions regarding prefetching | ||
+ | * What to prefetch | ||
+ | * When to prefetch | ||
+ | * How do we prefetch | ||
+ | * Where to prefetch from | ||
+ | * Prefetching can cause thrashing (evicting a useful block) | ||
+ | * Prefetching can also be useless (not being used) | ||
+ | * Need to be efficient | ||
+ | * Can cause memory bandwidth problems in GPUs | ||
+ | * Prefetch the whole block, more than one block, or subblock? | ||
+ | * Each one of them has pros and cons | ||
+ | * A large prefetch is more likely to waste bandwidth | ||
+ | * Commonly done at cache block granularity | ||
+ | * Prefetch accuracy: fraction of useful prefetches out of all the prefetches | ||
+ | * Prefetchers usually predict based on | ||
+ | * Past knowledge | ||
+ | * Compiler hints | ||
+ | * The prefetcher has to prefetch at the right time | ||
+ | * A prefetch that is too early might get evicted before it is used | ||
+ | * It might also evict other useful data | ||
+ | * A prefetch that is too late does not hide the whole memory latency | ||
+ | * Previous prefetches at the same PC can be used as the history | ||
+ | * Previous demand requests are also good information to use for prefetching | ||
+ | * Prefetch buffer | ||
+ | * Place prefetched data in a separate buffer to avoid thrashing the cache | ||
+ | * Can treat demand/prefetch requests separately | ||
+ | * More complex | ||
+ | * Generally, demand block is more important | ||
+ | * This means eviction should prefer prefetched blocks as opposed to demand blocks | ||
+ | * Tradeoffs in where to place the prefetcher | ||
+ | * Look at L1 hits and misses | ||
+ | * Look at L1 misses only | ||
+ | * Look at L2 misses | ||
+ | * Different access patterns affect accuracy | ||
+ | * Tradeoff between handling more requests (seeing L1 hits and misses) and having less visibility (seeing only L2 misses) | ||
+ | * Software vs. hardware vs. execution based prefetching | ||
+ | * Software: the ISA provides prefetch instructions and software uses them | ||
+ | * What information is useful? | ||
+ | * How to make sure the prefetch is timely? | ||
+ | * What if you have a pointer-based structure? | ||
+ | * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one in time) | ||
+ | * Can be solved by hinting the next-next and/or next-next-next address | ||
+ | * Hardware: Identify the pattern and prefetch | ||
+ | * Execution driven: Opportunistically try to prefetch (runahead, dual-core execution) | ||
+ | * Stride prefetcher | ||
+ | * Predict strides, which are common in many programs | ||
+ | * Cache block based or instruction based | ||
+ | * Stream buffer design | ||
+ | * Buffer the stream of accesses (next address) | ||
+ | * Use the information to prefetch | ||
+ | * What affects prefetcher performance | ||
+ | * Prefetch distance | ||
+ | * How far ahead should we prefetch | ||
+ | * Prefetch degree | ||
+ | * How many prefetches do we issue at a time | ||
+ | * Prefetcher performance | ||
+ | * Coverage | ||
+ | * Out of the demand requests, how many are actually served by prefetched blocks | ||
+ | * Accuracy | ||
+ | * Out of all the prefetch requests, how many are actually used | ||
+ | * Timeliness | ||
+ | * How much memory latency can we hide from the prefetch requests | ||
+ | * Cache pollution | ||
+ | * How many extra demand misses did the prefetcher cause? | ||
+ | * Hard to quantify |
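The stride prefetcher and the coverage and accuracy metrics above can be tied together with a small simulation. This is a rough sketch rather than a real hardware design: the function name, the PC-indexed table layout, and the rule of prefetching after the same stride is seen twice in a row are assumptions, and it assumes every demand access would miss without prefetching (no cache reuse).

```python
def simulate_stride_prefetcher(accesses, degree=1):
    """Return (coverage, accuracy) for a list of (pc, block_address) accesses.

    Instruction-based stride prefetcher: one table entry per load PC.
    degree is the number of prefetches issued per trigger.
    """
    table = {}          # pc -> (last_address, last_stride)
    prefetched = set()  # blocks brought in by prefetches and not yet used
    issued = 0          # total prefetch requests sent
    useful = 0          # demand accesses served by a prefetched block
    for pc, addr in accesses:
        if addr in prefetched:
            useful += 1
            prefetched.discard(addr)
        entry = table.get(pc)
        if entry is None:
            table[pc] = (addr, 0)
            continue
        last_addr, last_stride = entry
        stride = addr - last_addr
        # Trigger only when the same nonzero stride repeats.
        if stride != 0 and stride == last_stride:
            for k in range(1, degree + 1):
                target = addr + k * stride
                if target not in prefetched:
                    prefetched.add(target)
                    issued += 1
        table[pc] = (addr, stride)
    coverage = useful / len(accesses) if accesses else 0.0
    accuracy = useful / issued if issued else 0.0
    return coverage, accuracy


# One load instruction walking an array with a constant stride of 8 blocks.
strided = [(0x400, a) for a in range(0, 80, 8)]
```

For the `strided` trace with `degree=1`, the first three accesses train the table, the remaining seven are covered, and eight prefetches are issued (the last one is never used), so the result is a coverage of 0.7 and an accuracy of 0.875. Raising `degree` prefetches further ahead but leaves more unused prefetches at the end of the stream, which lowers accuracy.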