buzzword [2015/04/06 19:11] kevincha [Lecture 27 (4/6 Mon.)]
buzzword [2015/04/22 16:47] rachata
  * Synchronization
  * Consistency

===== Lecture 28 (4/8 Wed.) =====
  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics of the program are not changed
    * Making sure shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Dekker's algorithm
    * Inconsistency -- the two processors did NOT see the same order of operations to memory
  * Sequential consistency
    * Multiple correct global orders
    * Two issues:
      * Too conservative/strict
      * Performance limiting
  * Weak consistency: global ordering only at synchronization points
    * Programmer hints where the synchronization points are
    * Memory fence
    * More burden on the programmers
  * Cache coherence
    * Can be done at the software level or the hardware level
    * Snoop-based coherence
      * A simple protocol with two states, broadcasting reads/writes on a bus
    * Maintaining coherence
      * Needs to provide 1) write propagation and 2) write serialization
    * Update vs. invalidate
      * Two cache coherence methods
    * Snoopy bus
      * Bus based, single point of serialization
      * More efficient with a small number of processors
      * Processors snoop other caches' read/write requests to keep the cache block coherent
    * Directory
      * Single point of serialization per block
      * The directory coordinates coherence
      * More scalable
      * The directory keeps track of where the copies of each block reside
      * Supplies data on a read
      * Invalidates the block on a write
      * Has an exclusive state
+ | |||
+ | ===== Lecture 29 (4/10 Fri.) ===== | ||
+ | * MSI coherent protocol | ||
+ | * The problem: unnecessary broadcasts of invalidations | ||
+ | * MESI coherent protocol | ||
+ | * Add the exclusive state: this is the only cache copy and it is a clean state to MSI | ||
+ | * Multiple invalidation tradeoffs | ||
+ | * Problem: memory can be unnecessarily updated | ||
+ | * A possible owner state (MOESI) | ||
+ | * Tradeoffs between snooping and directory based coherence protocols | ||
+ | * Slide 31 has a good summary | ||
+ | * Directory: data structures | ||
+ | * Bit vectors vs. linked lists | ||
+ | * Scalability of directories | ||
+ | * Size? Latency? Thousand of nodes? Best of both snooping and directory? | ||
+ | |||
+ | | ||
+ | ===== Lecture 30 (4/13 Mon.) ===== | ||
+ | * In-memory computing | ||
+ | * Design goals of DRAM | ||
+ | * DRAM structures | ||
+ | * Banks | ||
+ | * Capacitors and sense amplifiers | ||
+ | * Trade-offs b/w number of sense amps and cells | ||
+ | * Width of bank I/O vs. row size | ||
+ | * DRAM operations | ||
+ | * ACTIVATE, READ/WRITE, and PRECHARGE | ||
+ | * Trade-offs | ||
+ | * Latency | ||
+ | * Bandwidth: Chip vs. rank vs. bank | ||
+ | * What's the benefit of having 8 chips? | ||
+ | * Parallelism | ||
+ | * RowClone | ||
+ | * What are the problems? | ||
+ | * Copying b/w two rows that share the same sense amplifier | ||
+ | * System software support | ||
+ | * Bitwise AND/OR | ||
+ | |||
+ | ===== Lecture 31 (4/15 Wed.) ===== | ||
+ | |||
+ | * Application slowdown | ||
+ | * Interference between different applications | ||
+ | * Applications' performance depends on other applications that they are running with | ||
+ | * Predictable performance | ||
+ | * Why are they important? | ||
+ | * Applications that need predictibility | ||
+ | * How to predict the performance? | ||
+ | * What information are useful? | ||
+ | * What need to be guarantee? | ||
+ | * How to estimate the performance when running with others? | ||
+ | * Easy, just measure the performance while it is running. | ||
+ | * How to estimate the performance when the application is running by itself. | ||
+ | * Hard if there is no profiling. | ||
+ | * The relationship between memory service rate and the performance. | ||
+ | * Key assumption: applications are memory bound | ||
+ | * Behavior of memory-bound applications | ||
+ | * With and without interference | ||
+ | * Memory phase vs. compute phase | ||
+ | * MISE | ||
+ | * Estimating slowdown using request service rate | ||
+ | * Inaccuracy when measuring request service rate alone | ||
+ | * Non-memory-bound applications | ||
+ | * Control slowdown and provide soft guarantee | ||
+ | * Taking into account of the shared cache | ||
+ | * MISE model + cache resource management | ||
+ | * Aug tag store | ||
+ | * Separate tag store for different cores | ||
+ | * Cache access rate alone and shared as the metric to estimate slowdown | ||
+ | * Cache paritiioning | ||
+ | * How to determine partitioning | ||
+ | * Utility based cache partitioning | ||
+ | * Others | ||
+ | * Maximum slowdown and fairness metric | ||

===== Lecture 32 (4/20 Mon.) =====
  * Heterogeneous systems
    * Asymmetric cores: different types of cores on the chip
      * Each of these cores is optimized for different workloads/requirements/goals
      * Multiple special-purpose processors
      * Flexible: can adapt to workload behavior
      * Disadvantages: complex and high overhead
    * Examples: CPU-GPU systems, heterogeneity in execution models
  * Heterogeneous resources
    * Example: reliable and non-reliable DRAM in the same system
  * Key problems in modern systems
    * Memory system
    * Efficiency
    * Predictability
    * Asymmetric design can help solve these problems
  * Serialized code sections
    * Bottleneck in multicore execution
    * Parallelizable vs. serial portion
    * Accelerating critical sections
    * Cache ping-ponging
    * Synchronization latency
  * Symmetric vs. asymmetric design
    * Large cores + small cores
    * Core asymmetry
  * Amdahl's law with heterogeneous cores
  * Parallel bottlenecks
    * Resource contention
      * Depends on what is running
  * Accelerated critical sections
    * Ship critical sections to large cores
    * Small modifications and low overhead
    * False serialization might become the bottleneck
    * Can reduce parallel throughput
    * Effect on private cache misses and shared cache misses