* Can contain false positives
  * Better/more hash functions help eliminate this

===== Lecture 26 (4/7 Mon.) =====

* Tolerating latency can be costly
  * The instruction window is complex
  * Its benefit also diminishes as it grows
  * Designing the buffers can be complex
  * A simpler way to tolerate latency than full out-of-order execution is desirable
* Different sources that cause the core to stall in OoO
  * Cache misses
  * Note that a stall happens once the instruction window is full
* Scaling the instruction window size is hard
  * It is better (less complex) to make the window more efficient
* Runahead execution (see the sketch after this list)
  * Tries to obtain MLP without increasing the instruction window
  * Run ahead (i.e., execute ahead) when there is a long-latency memory instruction
    * A long memory instruction stalls the processor for a while anyway, so it is better to make use of that time
  * Execute future instructions to generate accurate prefetches
    * Allows future data to already be in the cache
* How to support runahead execution?
  * Need a way to checkpoint the state when entering runahead mode
  * How to make executing on the wrong path useful?
  * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
* Cost and benefit of runahead execution (slide number 27)
* Runahead can have inefficiencies
  * Runahead periods that are useless
  * Get rid of useless, inefficient periods
* What if there is a dependent cache miss?
  * Cannot be parallelized in vanilla runahead
  * Can predict the value of the dependent load
    * How to predict the address of the load
      * Delta value information
      * Stride predictor
      * AVD prediction
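
A minimal sketch of the runahead idea above, as a toy trace-driven model. Everything here (the ToyCache class, the 64-byte blocks, the fixed runahead window) is an illustrative assumption, not the lecture's mechanism; real runahead checkpoints architectural state and discards the speculative results, keeping only the prefetched blocks.

<code python>
# Toy model of runahead execution: on a demand miss, keep "executing"
# the future access stream so that future misses turn into prefetches.

class ToyCache:
    def __init__(self):
        self.blocks = set()

    def access(self, addr):
        """Return True on hit; fetch the block on a miss either way."""
        block = addr // 64            # hypothetical 64-byte blocks
        hit = block in self.blocks
        self.blocks.add(block)
        return hit

def run(trace, runahead=True, window=32):
    cache = ToyCache()
    stalls = 0
    for i, addr in enumerate(trace):
        if cache.access(addr):
            continue
        stalls += 1                   # demand miss: the pipeline stalls
        if runahead:
            # While stalled, pseudo-execute the next `window` accesses;
            # their misses become prefetches (results are thrown away).
            for future in trace[i + 1 : i + 1 + window]:
                cache.access(future)
    return stalls

trace = [64 * n for n in range(100)] * 2
print("stalls w/o runahead:", run(trace, runahead=False))
print("stalls w/  runahead:", run(trace, runahead=True))
</code>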

===== Lecture 27 (4/9 Wed.) =====

* Questions regarding prefetching
  * What to prefetch
  * When to prefetch
  * How do we prefetch
  * Where to prefetch from
* Prefetching can cause thrashing (evicting a useful block)
* Prefetching can also be useless (the prefetched data never gets used)
  * Needs to be efficient
* Can cause memory bandwidth problems in GPUs
* Prefetch the whole block, more than one block, or a subblock?
  * Each of these has pros and cons
  * A big prefetch is more likely to waste bandwidth
  * Commonly done at cache block granularity
* Prefetch accuracy: fraction of useful prefetches out of all prefetches
* A prefetcher usually predicts based on
  * Past knowledge
  * Compiler hints
* A prefetcher has to prefetch at the right time
  * A prefetch that is too early might get evicted before it is used
    * It might also evict other useful data
  * A prefetch that is too late does not hide the whole memory latency
* Previous prefetches at the same PC can be used as history
* Previous demand requests are also good information to use for prefetching
* Prefetch buffer
  * Holds prefetched data separately to avoid thrashing the cache
  * Can treat demand/prefetch requests separately
    * More complex
* Generally, a demand block is more important
  * This means eviction should prefer prefetched blocks over demand blocks
* Tradeoffs in where to place the prefetcher
  * Look at L1 hits and misses
  * Look at L1 misses only
  * Look at L2 misses
  * Different access patterns affect accuracy
  * Tradeoff between handling more requests (seeing L1 hits and misses) and having less visibility (seeing only L2 misses)
* Software vs. hardware vs. execution-based prefetching
  * Software: the ISA provides prefetch instructions; software utilizes them
    * What information is useful
    * How to make sure the prefetch is timely
    * What if you have a pointer-based structure?
      * Pointer chasing is not easy to prefetch (in many cases the work between prefetches is short, so you cannot predict the next address early enough)
      * Can be mitigated by hinting the next-next and/or next-next-next address
  * Hardware: identify the pattern and prefetch
  * Execution-driven: opportunistically try to prefetch (runahead, dual-core execution)
* Stride prefetcher (see the sketch after this list)
  * Predicts strides, which are common in many programs
  * Cache-block based or instruction (PC) based
* Stream buffer design
  * Buffers the stream of accesses (next addresses)
  * Uses this information to prefetch
* What affects prefetcher performance
  * Prefetch distance
    * How far ahead we prefetch
  * Prefetch degree
    * How many prefetches we issue at a time
* Prefetcher performance (see the metrics helper after this list)
  * Coverage
    * Out of all demand misses, how many are eliminated by prefetch requests
  * Accuracy
    * Out of all prefetch requests, how many actually get used
  * Timeliness
    * How much of the memory latency the prefetches hide
* Feedback-directed prefetcher
  * Uses the prefetcher's results (accuracy, timeliness, pollution information) as feedback to the prefetcher
* Markov prefetcher (see the sketch after this list)
  * Prefetches based on previous history
  * Uses a Markov model to predict the next miss
  * Pro: can cover arbitrary patterns (easy for linked-list traversals or trees)
  * Downside: high cost; cannot help with compulsory misses (no history)
* Content-directed prefetching
  * Identify content in memory that looks like pointers (which are then used as addresses to prefetch)
  * Not very efficient (hard to figure out which words are pointers)
    * Software can give hints
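
A minimal sketch of the stride prefetcher bullet above, indexed by load PC. The table organization and the degree/distance knobs are illustrative assumptions, not a specific design from the lecture.

<code python>
# PC-indexed stride prefetcher: each entry tracks (last address, stride,
# confidence); on a confident repeating stride, issue prefetches ahead.

class StridePrefetcher:
    def __init__(self, degree=2, distance=4):
        self.table = {}               # pc -> [last_addr, stride, confidence]
        self.degree = degree          # how many prefetches per trigger
        self.distance = distance      # how far ahead (in strides) to start

    def access(self, pc, addr):
        if pc not in self.table:
            self.table[pc] = [addr, 0, 0]
            return []
        last, stride, conf = self.table[pc]
        new_stride = addr - last
        if new_stride == stride:
            conf = min(conf + 1, 3)   # saturating confidence counter
        else:
            conf = max(conf - 1, 0)
            stride = new_stride
        self.table[pc] = [addr, stride, conf]
        if conf >= 2 and stride != 0:
            base = addr + stride * self.distance
            return [base + k * stride for k in range(self.degree)]
        return []

pf = StridePrefetcher()
for i in range(6):                    # a load at PC 0x400 striding by 64
    print(hex(i * 64), "->", [hex(a) for a in pf.access(0x400, i * 64)])
</code>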
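
The coverage/accuracy bullets above can be written out directly. A small helper, assuming we can count misses with and without the prefetcher and tag prefetched blocks as used or unused (the counter names are hypothetical):

<code python>
# Prefetcher performance metrics:
#   accuracy = useful prefetches / all prefetches issued
#   coverage = fraction of demand misses removed by prefetching

def accuracy(used_prefetches, issued_prefetches):
    return used_prefetches / issued_prefetches if issued_prefetches else 0.0

def coverage(misses_without_pf, misses_with_pf):
    if misses_without_pf == 0:
        return 0.0
    return (misses_without_pf - misses_with_pf) / misses_without_pf

# Example: 1000 misses without prefetching, 400 remain with it,
# and 900 of the 1200 issued prefetches were actually used.
print("coverage:", coverage(1000, 400))    # 0.6
print("accuracy:", accuracy(900, 1200))    # 0.75
</code>

Timeliness is harder to reduce to one counter: a prefetch that arrives after the demand request only hides part of the memory latency, so it is usually measured as the fraction of latency hidden.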
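
A minimal sketch of the Markov prefetcher bullet above: a correlation table remembers which miss addresses historically followed each miss address and prefetches the most frequent successors. The table here is unbounded and prefetches the top two successors; both are assumptions, since a real design caps the table size.

<code python>
# Markov (correlation) prefetcher: addr -> counts of the miss addresses
# that followed it; on a miss, prefetch the most likely successors.

from collections import defaultdict, Counter

class MarkovPrefetcher:
    def __init__(self, successors=2):
        self.table = defaultdict(Counter)
        self.last_miss = None
        self.successors = successors

    def miss(self, addr):
        if self.last_miss is not None:
            self.table[self.last_miss][addr] += 1   # learn the transition
        self.last_miss = addr
        ranked = self.table[addr].most_common(self.successors)
        return [a for a, _ in ranked]               # prefetch candidates

pf = MarkovPrefetcher()
for addr in [1, 5, 9, 1, 5, 9, 1]:   # e.g. a repeating linked-list walk
    print(addr, "->", pf.miss(addr))
</code>

This covers arbitrary repeating patterns, but as the bullets note it cannot help the first time around (compulsory misses), and the history table is costly.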

===== Lecture 28 (4/14 Mon.) =====

* Execution-based prefetchers
  * Helper thread/speculative thread
    * Use another thread to pre-execute the program
  * Can be software based or hardware based
  * Discover misses before the main program does (to prefetch data in a timely manner)
  * How do you construct the helper thread?
    * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9
* Benefits of multiprocessors
  * Improve performance without significantly increasing power consumption
  * Better cost efficiency, and easier to scale
  * Improve dependability (in case another core is faulty)
* Different types of parallelism
  * Instruction level parallelism
  * Data level parallelism
  * Task level parallelism
* Task level parallelism
  * Partition a single, potentially big, task into multiple parallel sub-tasks
    * Can be done explicitly (parallel programming by the programmer)
    * Or implicitly (hardware partitions a single thread speculatively)
  * Or, run multiple independent tasks (still improves throughput, but no single task finishes faster; also simpler to implement)
* Loosely coupled multiprocessors
  * No shared global address space
  * Message passing to communicate between the different processors
  * Simple to manage memory
* Tightly coupled multiprocessors
  * Shared global address space
  * Need to ensure consistency of data
* Switch-on-event multithreading
  * Switch to another context when there is an event (for example, a cache miss)
* Simultaneous multithreading
  * Dispatch instructions from multiple threads at the same time
* Amdahl's law (see the numeric example after this list)
  * Bounds the maximum speedup
  * The parallel portion is not perfect
    * Serial bottleneck
    * Synchronization cost
    * Load imbalance
      * Some threads have more work and take longer to reach the synchronization point
* Issues in parallel programming
  * Correctness
  * Synchronization
  * Consistency
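
A small numeric illustration of Amdahl's law from the bullets above, folding synchronization cost into the serial fraction (the 5% numbers are made up):

<code python>
# Amdahl's law: speedup(n) = 1 / (s + (1 - s) / n), where s is the
# serial fraction. Sync cost and load imbalance effectively grow s.

def amdahl_speedup(serial_fraction, n_cores, sync_overhead=0.0):
    s = serial_fraction + sync_overhead
    return 1.0 / (s + (1.0 - s) / n_cores)

for n in (2, 4, 16, 64, 1024):
    print(f"{n:4d} cores: {amdahl_speedup(0.05, n):6.2f}"
          f"  | with 5% sync cost: {amdahl_speedup(0.05, n, 0.05):6.2f}")
</code>

Even with 1024 cores the speedup stays below 1/s = 20, which is the maximum-speedup bullet above.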

===== Lecture 29 (4/16 Wed.) =====

* Ordering of instructions
  * Maintaining memory consistency when there are multiple threads and shared memory
  * Need to ensure the semantics are not changed
  * Making sure shared data is properly locked when used
    * Support mutual exclusion
  * Ordering depends on when each processor executes
  * Debugging is also difficult (non-deterministic behavior)
* Weak consistency: global ordering only at synchronization points
  * The programmer hints where the synchronizations are
* Total store order model: global ordering only for stores
* Cache coherence
  * Can be done at the software level or the hardware level
* Coherence protocols
  * Need to ensure that all processors see and update the correct state of a cache block
  * Need to make sure that writes get propagated and serialized
  * Simple protocols are not scalable (a single point of synchronization)
* Update vs. invalidate
  * With invalidation, only the core that accesses the block retains a valid copy
    * Can lead to ping-ponging (tons of reads/writes bouncing a block between several processors)
  * With updates, the bus becomes the bottleneck
* Snoopy bus
  * Bus based, single point of serialization
  * More efficient with a small number of processors
  * All caches snoop the other caches' read/write requests to keep cache blocks coherent
* Directory based
  * Single point of serialization per block
  * The directory coordinates coherence
  * More scalable
  * The directory keeps track of where the copies of each block reside
    * Supplies data on a read
    * Invalidates the block on a write
    * Has an exclusive state
* MSI coherence protocol
  * Slides 56-57
  * Consumes bus bandwidth (motivates adding an "exclusive" state)
* MESI coherence protocol (see the state-machine sketch after this list)
  * Adds the exclusive state to MSI: this cache holds the only copy, and it is clean
* Tradeoffs between snooping and directory-based coherence
  * Slide 71 has a good summary of this
* MOESI
  * Improvement over the MESI protocol
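
A minimal state-machine sketch of MESI as discussed above, for a single cache line seen from one cache. Bus interactions are reduced to two snooped events, the Invalid-to-Exclusive transition assumes no other cache holds the line (otherwise it would go to Shared), and MOESI's Owned state is omitted; all of that is a simplification, not the full protocol from the slides.

<code python>
# Simplified MESI transitions for one line in one cache.
# Own-CPU events: 'read', 'write'. Snooped events: 'bus_read', 'bus_write'.

TRANSITIONS = {
    ('I', 'read'):      'E',   # assumes no other sharer; else -> 'S'
    ('I', 'write'):     'M',
    ('E', 'read'):      'E',
    ('E', 'write'):     'M',   # silent upgrade: no bus traffic needed
    ('E', 'bus_read'):  'S',
    ('E', 'bus_write'): 'I',
    ('S', 'read'):      'S',
    ('S', 'write'):     'M',   # must broadcast an invalidation first
    ('S', 'bus_write'): 'I',
    ('M', 'bus_read'):  'S',   # write back / supply the dirty data
    ('M', 'bus_write'): 'I',   # write back, then invalidate
}

def mesi_next(state, event):
    return TRANSITIONS.get((state, event), state)

state = 'I'
for event in ('read', 'write', 'bus_read', 'bus_write'):
    state = mesi_next(state, event)
    print(f"{event:10s} -> {state}")
</code>

The Exclusive state is exactly the bandwidth saving mentioned under MSI above: a clean, sole copy can be written without broadcasting on the bus.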

===== Lecture 30 (4/18 Wed.) =====

* Interference
* Complexity of the memory scheduler
  * Ranking/prioritization has a cost
  * A complex scheduler has higher latency
* Performance metrics for multicore/multithreaded applications (see the helper after this list)
  * Speedup
  * Slowdown
  * Harmonic vs. weighted
* Fairness metrics
  * Maximum slowdown
    * Why does it make sense?
    * Any scenario where it does not make sense?
* Predictable performance
  * Why is it important?
    * In a server environment, different jobs share the same server
    * In a mobile environment, there are multiple sources that can slow down other sources
  * How to relate slowdown to request service rate
    * MISE: soft slowdown guarantees
* BDI
* Memory wall
  * What is the concern regarding the memory wall?
  * Size of the cache on the (CPU) die
* One possible solution: cache compression
  * What are the problems of existing cache compression mechanisms?
    * Some are too complex
    * Decompression is on the critical path
      * Data has to be decompressed when it is read, so decompression should not be on the critical path
      * An important factor for performance
  * Software compression is not good enough to compress everything
* Zero value compression
  * Simple
  * Good compression ratio
  * What if the data does not have many zeroes?
* Frequent value compression
  * Some values appear frequently
  * Simple, and a good compression ratio
  * Has to profile
  * Decompression is complex
* Frequent pattern compression
  * Still too complex in terms of decompression
* Base-delta compression (see the sketch after this list; this is the BDI mentioned above)
  * Easy to decompress, but retains the benefit of compression
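
The speedup/slowdown metrics above, written out for a two-application workload (the IPC numbers are made up; "alone" is an application running by itself, "shared" is it running with interference):

<code python>
# Multiprogrammed performance and fairness metrics.

def slowdowns(ipc_alone, ipc_shared):
    return [a / s for a, s in zip(ipc_alone, ipc_shared)]

def weighted_speedup(ipc_alone, ipc_shared):
    return sum(s / a for a, s in zip(ipc_alone, ipc_shared))

def harmonic_speedup(ipc_alone, ipc_shared):
    return len(ipc_alone) / sum(a / s for a, s in zip(ipc_alone, ipc_shared))

def max_slowdown(ipc_alone, ipc_shared):       # the fairness metric above
    return max(slowdowns(ipc_alone, ipc_shared))

alone  = [2.0, 1.0]    # IPC of each application running alone
shared = [1.0, 0.8]    # IPC of each application running together
print("slowdowns:       ", slowdowns(alone, shared))        # [2.0, 1.25]
print("weighted speedup:", weighted_speedup(alone, shared)) # 1.3
print("harmonic speedup:", round(harmonic_speedup(alone, shared), 3))
print("max slowdown:    ", max_slowdown(alone, shared))     # 2.0
</code>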
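
A minimal sketch of the base+delta idea behind BDI and the base-delta compression bullet: store one base value plus narrow deltas when all values in a line are close together. Real BDI tries several (base size, delta size) combinations plus an implicit zero base; this single 8-byte-base/1-byte-delta variant is an illustrative assumption.

<code python>
# Base-delta compression sketch for a cache line of 8-byte values:
# keep one 8-byte base plus 1-byte signed deltas when they all fit.

def bd_compress(values):
    base = values[0]
    deltas = [v - base for v in values]
    if all(-128 <= d <= 127 for d in deltas):
        return ("compressed", base, deltas)    # 8 + len(values) bytes
    return ("uncompressed", values)            # 8 * len(values) bytes

def bd_decompress(line):
    if line[0] == "compressed":                # cheap: one add per value
        _, base, deltas = line
        return [base + d for d in deltas]
    return line[1]

line = [0x1000, 0x1008, 0x1010, 0x1018]        # e.g. pointers into one page
packed = bd_compress(line)
print(packed)                                  # 12 bytes instead of 32
assert bd_decompress(packed) == line
</code>

Decompression is just an add of each delta to the base (parallelizable across the line), which keeps decompression latency low and addresses the critical-path concern above.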

===== Lecture 31 (4/28 Mon.) =====

* Directory-based cache coherence
  * Each directory has to handle validation/invalidation
  * Extra cost of synchronization
  * Need to ensure race conditions are resolved
* Interconnects
  * Topology
    * Bus
    * Mesh
    * Torus
    * Tree
    * Butterfly
    * Ring
      * Bi-directional ring
        * More scalable
      * Hierarchical ring
        * Even more scalable
        * More complex
    * Crossbar
    * etc.
* Circuit switching
* Multistage networks
  * Butterfly
  * Delta network
* Handling contention
  * Buffering vs. dropping/deflection (no buffering)
* Routing algorithms (X-Y routing is sketched after this list)
  * Handling deadlock
  * X-Y routing
    * Turn model (to avoid deadlocks)
    * Add more buffering for an escape path
  * Oblivious routing
    * Can take different paths
    * DOR between each intermediate location
    * Balances network load
  * Adaptive routing
    * Uses the state of the network to determine the route
    * Aware of local and/or global congestion
    * Non-minimal adaptive routing can have livelocks
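
A minimal sketch of X-Y (dimension-ordered) routing on a 2D mesh, per the bullets above: route fully in the X dimension, then in Y. Forbidding the Y-to-X turns breaks every cycle in the channel-dependence graph, which is what makes it deadlock-free. The coordinates in the example are illustrative.

<code python>
# X-Y routing on a 2D mesh: correct the X coordinate first, then Y.
# Deterministic and deadlock-free (half of the possible turns are banned).

def xy_route(src, dst):
    (x, y), (dst_x, dst_y) = src, dst
    hops = []
    while x != dst_x:                 # X dimension first
        x += 1 if dst_x > x else -1
        hops.append((x, y))
    while y != dst_y:                 # then the Y dimension
        y += 1 if dst_y > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 3)))
# [(1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]
</code>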

===== Lecture 32 (4/30 Wed.) =====

* Serialized code sections
  * Degrade performance
  * Waste energy
* Heterogeneous cores
  * Can execute the serialized portion on a powerful large core
* Tradeoff between multiple small cores, multiple large cores, or heterogeneous cores
* Critical sections
  * A bottleneck in several multithreaded workloads
  * Asymmetry can help
* Accelerated critical sections (ACS; see the toy model after this list)
  * Use a large core to run the serialized portion of the code
  * How to correctly support ACS
    * False serialization
    * Handling private/shared data
* BIS
  * Identify the bottlenecks
    * Serial bottleneck
    * Barrier
    * Critical section
    * Pipeline stages
  * An application might wait on different types of bottlenecks
  * Adds BottleneckCall and BottleneckReturn instructions
  * Acceleration can be done in multiple ways
    * Ship the bottleneck to a big core
    * Increase the frequency
    * Prioritize the thread in shared resources (e.g., the memory scheduler always schedules that thread's requests first)
  * A bottleneck table keeps track of each thread's bottlenecks and determines their criticality
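
A back-of-the-envelope model of why accelerated critical sections help, in the spirit of the ACS bullets above. All numbers, and the assumption that the big core is simply a fixed factor faster, are illustrative; the real tradeoff also gives up several small cores to fit the big one.

<code python>
# Toy ACS model: parallel work scales with the small cores, while the
# serialized critical-section work runs on one (optionally big) core.

def exec_time(parallel_work, cs_work, n_small, big_speedup=1.0):
    return parallel_work / n_small + cs_work / big_speedup

par, cs = 100.0, 20.0
base = exec_time(par, cs, n_small=16)                  # CS on a small core
acs  = exec_time(par, cs, n_small=16, big_speedup=2.0) # CS shipped to big core
print(f"baseline {base:.2f}, ACS {acs:.2f}, speedup {base / acs:.2f}")
</code>

The serialized term dominates as core counts grow, which is why accelerating it on an asymmetric design pays off even if the big core costs area.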