buzzword [2014/04/14 18:16] → [2014/05/02 18:14] rachata
  * Synchronization
  * Consistency

===== Lecture 29 (4/16 Wed.) =====

  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
  * Making sure the shared data is properly locked when used
    * Supports mutual exclusion
  * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Weak consistency: global ordering only at synchronization points
    * The programmer hints where the synchronizations are
  * Total store order model: global ordering only for stores
  * Cache coherence
    * Can be done at the software level or the hardware level
    * Coherence protocol
      * Need to ensure that all the processors see and update the correct state of a cache block
      * Need to make sure that writes get propagated and serialized
      * Simple protocols are not scalable (single point of synchronization)
    * Update vs. invalidate
      * With invalidation, only the core performing the write retains a valid copy
        * Can lead to ping-ponging (lots of reads/writes from several processors)
      * With updates, the bus becomes the bottleneck
    * Snoopy bus
      * Bus based, single point of serialization
      * More efficient with a small number of processors
      * All caches snoop other caches' read/write requests to keep cache blocks coherent
    * Directory based
      * Single point of serialization per block
      * The directory coordinates coherence
      * More scalable
      * The directory keeps track of where the copies of each block reside
        * Supplies data on a read
        * Invalidates the block on a write
        * Has an exclusive state
    * MSI coherence protocol
      * Slides 56-57
      * Consumes bus bandwidth (needs an "exclusive" state)
    * MESI coherence protocol
      * Adds an exclusive state to MSI: this is the only cached copy and it is clean
    * Tradeoffs between snooping and directory based
      * Slide 71 has a good summary of this
    * MOESI
      * An improvement over the MESI protocol
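The invalidation-based MESI transitions above can be sketched as a toy state machine for a single cache block. This is an illustration, not the exact protocol from the slides; the core names and the dict representation of per-core states are assumptions.

```python
# Minimal sketch of MESI transitions for one cache block across cores.
# States: M (Modified), E (Exclusive), S (Shared), I (Invalid).

def mesi_access(states, core, op):
    """Apply one read/write by `core`; other cores snoop and react."""
    others = [c for c in states if c != core]
    if op == "read":
        if states[core] == "I":
            if any(states[c] != "I" for c in others):
                # A Modified/Exclusive owner writes back and drops to Shared
                for c in others:
                    if states[c] in ("M", "E"):
                        states[c] = "S"
                states[core] = "S"
            else:
                states[core] = "E"   # only cached copy, clean
        # Read hits in M/E/S need no bus transaction
    elif op == "write":
        for c in others:
            states[c] = "I"          # invalidate all other copies
        states[core] = "M"
    return states

states = {"core0": "I", "core1": "I"}
mesi_access(states, "core0", "read")    # core0 -> E (only clean copy)
mesi_access(states, "core1", "read")    # both -> S
mesi_access(states, "core0", "write")   # core0 -> M, core1 -> I
print(states)                           # {'core0': 'M', 'core1': 'I'}
```

Note how the final write invalidates core1's copy: repeated writes from alternating cores would produce exactly the ping-ponging mentioned above.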
+ | |||
+ | |||
+ | ===== Lecture 29 (4/18 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
  * Interference
  * Complexity of the memory scheduler
    * Ranking/prioritization has a cost
    * A complex scheduler has higher latency
  * Performance metrics for multicore/multithreaded applications
    * Speedup
    * Slowdown
      * Harmonic vs. weighted
    * Fairness metric
      * Maximum slowdown
        * Why does it make sense?
        * Any scenario where it does not make sense?
  * Predictable performance
    * Why is it important?
      * In a server environment, different jobs run on the same server
      * In a mobile environment, there are multiple sources that can slow down other sources
    * How to relate slowdown to request service rate
    * MISE: soft slowdown guarantee
  * BDI (Base-Delta-Immediate compression)
  * Memory wall
    * What is the concern regarding the memory wall?
    * Size of the cache on the die (CPU die)
    * One possible solution: cache compression
      * What are the problems of existing cache compression mechanisms?
        * Some are too complex
        * Decompression is in the critical path
          * Need to decompress when reading the data -> decompression should not be in the critical path
          * An important factor in performance
        * Software compression is not good enough to compress everything
      * Zero value compression
        * Simple
        * Good compression ratio
        * What if the data does not have many zeroes?
      * Frequent value compression
        * Some data values appear frequently
        * Simple and good compression ratio
        * Has to profile
        * Decompression is complex
      * Frequent pattern compression
        * Still too complex in terms of decompression
      * Base-delta compression
        * Easy to decompress but retains the benefit of compression
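The appeal of base+delta compression (the idea behind BDI) is that decompression is a single addition per word, which keeps it off the critical path. A minimal sketch, with illustrative field widths (the real design uses multiple bases and several delta sizes):

```python
# Sketch of base+delta compression: store one base value plus small
# per-word deltas instead of full 4-byte words.

def bd_compress(words, delta_bytes=1):
    """Return (base, deltas) if every delta fits in `delta_bytes`, else None."""
    base = words[0]
    limit = 1 << (8 * delta_bytes - 1)       # signed delta range
    deltas = [w - base for w in words]
    if all(-limit <= d < limit for d in deltas):
        return base, deltas                  # compressible line
    return None                              # store uncompressed

def bd_decompress(base, deltas):
    # Decompression is just one add per word -> short critical path
    return [base + d for d in deltas]

cache_line = [0x1000, 0x1008, 0x1010, 0x1004]   # e.g., nearby pointers
packed = bd_compress(cache_line)
assert packed is not None
assert bd_decompress(*packed) == cache_line
# Compressed: 4-byte base + 4 one-byte deltas = 8 bytes vs. 16 bytes raw
```

Values that cluster around a common base (pointers, counters) compress well; a line of unrelated values returns None and is stored uncompressed.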
+ | |||
+ | |||
+ | ===== Lecture 31 (4/28 Mon.) ===== | ||
+ | |||
  * Directory-based cache coherence
    * Each directory has to handle validation/invalidation
    * Extra cost of synchronization
    * Need to ensure race conditions are resolved
  * Interconnection
    * Topology
      * Bus
      * Mesh
      * Torus
      * Tree
      * Butterfly
      * Ring
        * Bi-directional ring
          * More scalable
        * Hierarchical ring
          * Even more scalable
          * More complex
      * Crossbar
      * etc.
    * Circuit switching
    * Multistage networks
      * Butterfly
      * Delta network
    * Handling contention
      * Buffering vs. dropping/deflection (no buffering)
    * Routing algorithm
      * Handling deadlock
        * X-Y routing
        * Turn model (to avoid deadlocks)
        * Add more buffering for an escape path
      * Oblivious routing
        * Can take different paths
        * DOR (dimension-order routing) between each intermediate location
        * Balances network load
      * Adaptive routing
        * Uses the state of the network to determine the route
        * Aware of local and/or global congestion
        * Non-minimal adaptive routing can have livelocks
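X-Y routing can be sketched in a few lines: route fully along the X dimension first, then along Y. Because every packet orders the dimensions the same way, the disallowed turns rule out cyclic channel dependences, which is what avoids deadlock. The coordinate/mesh representation here is an illustrative assumption.

```python
# Deterministic X-Y (dimension-order) routing on a 2D mesh.

def xy_route(src, dst):
    """Return the list of hops from src to dst as (x, y) coordinates."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# [(0, 0), (1, 0), (2, 0), (2, 1)]
```

The path is always minimal but fixed, so X-Y routing cannot steer around congestion; that is the gap oblivious and adaptive routing try to fill.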
+ | |||
+ | ===== Lecture 32 (4/30 Wed.) ===== | ||
+ | |||
+ | |||
  * Serialized code sections
    * Degrade performance
    * Waste energy
  * Heterogeneous cores
    * Can execute the serialized portion on a powerful large core
    * Tradeoff between multiple small cores, multiple large cores, or heterogeneous cores
  * Critical sections
    * A bottleneck in several multithreaded workloads
    * Asymmetry can help
    * Accelerated critical sections (ACS)
      * Use a large core to run the serialized portion of the code
      * How to correctly support ACS
        * False serialization
        * Handling private/shared data
  * BIS
    * Identify the bottleneck
      * Serial bottlenecks
        * Barriers
        * Critical sections
        * Pipeline stages
      * An application might wait on different types of bottlenecks
    * Allows BottleneckCall and BottleneckReturn
    * Acceleration can be done in multiple ways
      * Ship to a big core
      * Increase the frequency
      * Prioritize the thread in shared resources (e.g., the memory scheduler always schedules requests from that thread first)
    * The bottleneck table keeps track of different threads' bottlenecks and determines criticality
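A toy sketch of the bottleneck-table idea: accumulate the cycles threads spend waiting on each bottleneck (lock, barrier, pipeline stage) and treat the largest accumulator as the most critical, i.e., the one worth accelerating. The function names and the software dict are illustrative assumptions; BIS implements this as a hardware table.

```python
# Toy bottleneck table: bottleneck id -> total cycles threads waited on it.
bottleneck_table = {}

def bottleneck_wait(bid, cycles):
    """A thread reports `cycles` spent waiting on bottleneck `bid`."""
    bottleneck_table[bid] = bottleneck_table.get(bid, 0) + cycles

def most_critical():
    """The bottleneck with the most accumulated waiting is most critical."""
    return max(bottleneck_table, key=bottleneck_table.get)

bottleneck_wait("lock_A", 120)
bottleneck_wait("barrier_1", 45)
bottleneck_wait("lock_A", 200)        # many threads contend on lock_A
print(most_critical())                 # lock_A -> accelerate its owner
```

Accumulating waiting cycles rather than counting acquisitions is what lets the scheme compare different bottleneck types (barriers vs. locks) with one metric.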
+ | |||
+ | |||
+ | ===== Lecture 33 (5/2 Fri.) ===== | ||
+ | |||
+ | |||
  * DRAM scaling problem
  * Possible solutions to the scaling problem
    * Lower-leakage DRAM
    * Heterogeneous DRAM (TL-DRAM, etc.)
    * Adding more functionality to DRAM
    * Denser designs (3D stacking)
    * Different technology
      * NVM
  * Non-volatile memory
    * Resistive memory
      * PCM
        * Inject current to change the phase
        * Scales better than DRAM
        * Multiple bits per cell
          * Wider resistance range
        * No refresh is needed
        * Downsides: latency and write endurance
      * STT-MRAM
        * Inject current to change the polarity
      * Memristor
        * Inject current to change the structure
    * Persistency - data stays there even without power
      * Unified memory and storage management (persistent data structures) - single-level store
        * Improves energy and performance
        * Simplifies the programming model
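The "multiple bits per cell" point relies on PCM's wide resistance range: a multi-level cell divides that range into bands and stores one 2-bit value per band. A sketch of the read side, with made-up threshold values (real sensing circuits and resistance bands differ):

```python
# Illustrative multi-level PCM read: map a sensed resistance to 2 bits
# by comparing it against three band thresholds (values are assumptions).
THRESHOLDS = [10_000, 100_000, 1_000_000]   # ohms, band boundaries

def read_cell(resistance_ohms):
    """Map a sensed resistance to a 2-bit value (0-3)."""
    return sum(resistance_ohms >= t for t in THRESHOLDS)

print(read_cell(5_000))      # 0 (low resistance, fully crystalline)
print(read_cell(500_000))    # 2 (intermediate band)
```

Packing more levels into the same range shrinks the margin between bands, which is part of why multi-level cells trade capacity for read latency and reliability.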