buzzword [2015/04/03 18:20] rachata
buzzword [2015/04/13 16:08] kevincha
  * Based on the characteristics
    * Frequently accessed data that need lower write latency in DRAM

===== Lecture 27 (4/6 Mon.) =====

  * Flynn's taxonomy
  * Parallelism
    * Reduces power consumption (P ~ CV^2F)
    * Better cost efficiency and easier to scale
    * Improves dependability (in case the other core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Data level parallelism
    * Task level parallelism
  * Task level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is no better; also simpler to implement)
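As a small sketch of the explicit case (an illustration, not from the lecture; the function names are made up), one big task can be partitioned into sub-tasks that run on a pool of workers:

```python
# Explicit task-level parallelism: partition one big task (summing a
# range of integers) into sub-tasks and run them on a worker pool.
from concurrent.futures import ThreadPoolExecutor

def partial_sum(lo, hi):
    """One sub-task: sum the integers in [lo, hi)."""
    return sum(range(lo, hi))

def parallel_sum(n, workers=4):
    # Partition [0, n) into one chunk per worker (last chunk takes the remainder).
    chunk = n // workers
    bounds = [(i * chunk, (i + 1) * chunk if i < workers - 1 else n)
              for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(lambda b: partial_sum(*b), bounds))

print(parallel_sum(1_000_000))  # 499999500000, same as sum(range(1_000_000))
```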
  * Loosely coupled multiprocessor
    * No shared global address space
    * Message passing to communicate between different processors
    * Simple to manage memory
  * Tightly coupled multiprocessor
    * Shared global address space
    * Need to ensure consistency of data
    * Programming issues
  * Hardware-based multithreading
    * Coarse grained
    * Fine grained
    * Simultaneous: dispatch instructions from multiple threads at the same time
  * Parallel speedup
    * Superlinear speedup
    * Utilization, Redundancy, Efficiency
  * Amdahl's law
    * Maximum speedup
    * Parallel portion is not perfect
      * Serial bottleneck
      * Synchronization cost
      * Load balance
        * Some threads have more work, requiring more time to reach the sync. point
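The maximum speedup can be made concrete with a small calculation (a sketch, where p is the parallel fraction of the work and n the number of processors):

```python
# Amdahl's law: with parallel fraction p and n processors,
# speedup = 1 / ((1 - p) + p / n); the serial fraction (1 - p)
# bounds the speedup no matter how large n grows.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% parallel code, 1024 processors give less than 20x:
print(round(amdahl_speedup(0.95, 1024), 1))   # 19.6
print(round(amdahl_speedup(0.95, 10**9), 1))  # 20.0 -- the 1/(1-p) limit
```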
  * Critical sections
    * Enforce mutually exclusive access to shared data
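A minimal sketch of a critical section (illustrative, not from the lecture): a lock makes the shared-counter update mutually exclusive, so no increments are lost even with several threads updating it:

```python
# A lock enforces mutually exclusive access to shared data: only one
# thread at a time executes the critical section below.
import threading

counter = 0
lock = threading.Lock()

def worker(n_increments):
    global counter
    for _ in range(n_increments):
        with lock:          # enter critical section
            counter += 1    # shared data touched by one thread at a time
        # lock released on block exit

threads = [threading.Thread(target=worker, args=(50_000,)) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 200000 -- no updates lost
```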
  * Issues in parallel programming
    * Correctness
    * Synchronization
    * Consistency
+ | |||
+ | |||

===== Lecture 28 (4/8 Wed.) =====

  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Dekker's algorithm
    * Inconsistency -- the two processors did NOT see the same order of operations to memory
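A sketch of Dekker's algorithm for two threads (an illustration; the lecture's point is exactly that this is only correct under sequential consistency -- if the two processors do not see the same order of the flag writes and reads, both threads can enter the critical section at once). CPython happens to execute this sequentially consistently, so the sketch works here:

```python
# Dekker's algorithm for two threads (ids 0 and 1). Correct only under
# sequential consistency: if the writes to flag[] could be reordered
# past the reads, mutual exclusion would break.
import sys
import threading

sys.setswitchinterval(1e-4)  # switch threads often so busy-waits stay short

flag = [False, False]        # flag[i]: thread i wants to enter
turn = 0                     # which thread may insist on entering
counter = 0

def lock(i):
    other = 1 - i
    flag[i] = True
    while flag[other]:           # the other thread also wants in
        if turn == other:        # not our turn: back off and wait
            flag[i] = False
            while turn == other:
                pass
            flag[i] = True

def unlock(i):
    global turn
    turn = 1 - i                 # hand the turn to the other thread
    flag[i] = False

def worker(i, n):
    global counter
    for _ in range(n):
        lock(i)
        counter += 1             # critical section
        unlock(i)

threads = [threading.Thread(target=worker, args=(i, 500)) for i in (0, 1)]
for t in threads: t.start()
for t in threads: t.join()
print(counter)  # 1000 -- mutual exclusion held
```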
  * Sequential consistency
    * Multiple correct global orders
    * Two issues:
      * Too conservative/strict
      * Performance limiting
  * Weak consistency: global ordering only at synchronization points
    * Programmer hints where the synchronizations are
      * Memory fence
    * More burden on the programmers
  * Cache coherence
    * Can be done at the software level or the hardware level
    * Snoop-based coherence
      * A simple protocol with two states, broadcasting reads/writes on a bus
    * Maintaining coherence
      * Needs to provide 1) write propagation and 2) write serialization
      * Update vs. Invalidate
    * Two cache coherence methods
      * Snoopy bus
        * Bus based, single point of serialization
        * More efficient with a small number of processors
        * Processors snoop other caches' read/write requests to keep the cache block coherent
      * Directory
        * Single point of serialization per block
        * The directory coordinates the coherence
        * More scalable
        * The directory keeps track of where the copies of each block reside
          * Supplies data on a read
          * Invalidates the block on a write
          * Has an exclusive state
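A toy sketch of a directory entry (an assumption for illustration, not the lecture's exact design): a bit vector tracks which caches hold a copy; a read adds the reader as a sharer and supplies data, while a write invalidates every other copy and records one exclusive owner:

```python
# Toy directory entry for one cache block: a sharer bit vector plus an
# exclusive-owner field. Illustrative sketch, not a full protocol.

class DirectoryEntry:
    def __init__(self, n_nodes):
        self.sharers = [False] * n_nodes   # bit vector: which caches hold a copy
        self.exclusive = None              # node id with the sole dirty copy, if any

    def read(self, node):
        # If some other node holds it exclusively, it must supply the
        # data and downgrade to a shared copy first.
        if self.exclusive is not None and self.exclusive != node:
            self.sharers[self.exclusive] = True
            self.exclusive = None
        self.sharers[node] = True          # supply data; record the new sharer

    def write(self, node):
        # Invalidate every other copy, then grant exclusive ownership.
        invalidated = [i for i, s in enumerate(self.sharers) if s and i != node]
        self.sharers = [False] * len(self.sharers)
        self.sharers[node] = True
        self.exclusive = node
        return invalidated                 # nodes that receive invalidations

entry = DirectoryEntry(4)
entry.read(0)
entry.read(2)
print(entry.write(1))  # [0, 2] -- both shared copies are invalidated
```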
+ | |||

===== Lecture 29 (4/13 Mon.) =====

  * MSI coherence protocol
    * The problem: unnecessary broadcasts of invalidations
  * MESI coherence protocol
    * Adds an exclusive state to MSI: the only cached copy, and it is clean
    * Multiple invalidation tradeoffs
      * Problem: memory can be unnecessarily updated
    * A possible owner state (MOESI)
  * Tradeoffs between snooping and directory based coherence protocols
    * Slide 31 has a good summary
  * Directory: data structures
    * Bit vectors vs. linked lists
  * Scalability of directories
    * Size? Latency? Thousands of nodes? Best of both snooping and directory?
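The MSI transitions for one cache's copy of a block can be sketched as a small state machine (a sketch only; real protocols also generate the bus transactions and data transfers these transitions imply). Events are local reads/writes (PrRd/PrWr) and snooped bus traffic from other caches (BusRd/BusRdX):

```python
# Per-cache MSI state machine for one block.
MSI = {
    # (state, event): next_state
    ("I", "PrRd"):   "S",  # read miss: fetch the block, enter Shared
    ("I", "PrWr"):   "M",  # write miss: fetch and invalidate others, Modified
    ("I", "BusRd"):  "I",  # no copy here: nothing to do
    ("I", "BusRdX"): "I",
    ("S", "PrRd"):   "S",  # read hit
    ("S", "PrWr"):   "M",  # upgrade: broadcast an invalidation first
    ("S", "BusRd"):  "S",  # another reader: stay Shared
    ("S", "BusRdX"): "I",  # another writer: invalidate our copy
    ("M", "PrRd"):   "M",
    ("M", "PrWr"):   "M",
    ("M", "BusRd"):  "S",  # supply the dirty data, downgrade to Shared
    ("M", "BusRdX"): "I",  # supply the data, then invalidate
}

def run(events, state="I"):
    for e in events:
        state = MSI[(state, e)]
    return state

# Read, then write, then watch another cache read and write the block:
print(run(["PrRd", "PrWr", "BusRd", "BusRdX"]))  # I
```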