This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
buzzword [2015/03/25 18:19] rachata |
buzzword [2015/04/27 18:20] rachata |
||
---|---|---|---|
Line 948: | Line 948: | ||
+ | ===== Lecture 23 (03/27 Fri.) ===== | ||
+ | |||
+ | * Different ways to control interference in DRAM | ||
+ | * Partitioning of resource | ||
+ | * Channel partitioning: map applications that interfere with each other in a different channel | ||
+ | * Keep track of application's characteristics | ||
+ | * Dedicate a channel might waste the bandwidth | ||
+ | * Need OS support to determine the channel bits | ||
+ | * Source throttling | ||
+ | * A controller throttle the core depends on the performance target | ||
+ | * Example: Fairness via source throttling | ||
+ | * Detect unfairness and throttle application that is interfering | ||
+ | * How do you estimate slowdown? | ||
+ | * Threshold based solution: hard to configure | ||
+ | * App/thread scheduling | ||
+ | * Critical threads usually stall the progress | ||
+ | * Designing DRAM controller | ||
+ | * Has to handle the normal DRAM operations | ||
+ | * Read/write/refresh/all the timing constraints | ||
+ | * Keep track of resources | ||
+ | * Assign priorities to different requests | ||
+ | * Manage requests to banks | ||
+ | * Self-optimizing controller | ||
+ | * Use machine learning to improve DRAM controller | ||
+ | * A-DRM | ||
+ | * Architecture aware DRAM | ||
+ | * Multithread | ||
+ | * synchronization | ||
+ | * Pipeline programs | ||
+ | * Producer consumer model | ||
+ | * Critical path | ||
+ | * Limiter threads | ||
+ | * Prioritization between threads | ||
+ | * Different power mode in DRAM | ||
+ | * DRAM Refresh | ||
+ | * Why does DRAM has to refresh every 64ms | ||
+ | * Banks are unavailable during refresh | ||
+ | * LPDDR mitigate this by using a per-bank refresh | ||
+ | * Has to spend longer time with bigger DRAM | ||
+ | * Distributed refresh: stagger refresh every 64 ms in a distributed manner | ||
+ | * As oppose to burst refresh (long pause time) | ||
+ | * RAIDR: Reduce DRAM refresh by profiling and binning | ||
+ | * Some row do not have to be refresh very frequently | ||
+ | * Profile the row | ||
+ | * High temperature changes the retention time: need online profiling | ||
+ | * Bloom filter | ||
+ | * Represent set membership | ||
+ | * Approximated | ||
+ | * Can contain false positive | ||
+ | * Better/more hash function helps eliminate this | ||
| | ||
+ | ===== Lecture 24 (03/30 Mon.) ===== | ||
+ | |||
+ | * Simulation | ||
+ | * Drawbacks of RTL simulations | ||
+ | * Time consuming | ||
+ | * Complex to develop | ||
+ | * Hard to perform design explorations | ||
+ | * Explore the design space quickly | ||
+ | * Match the behavior of existing systems | ||
+ | * Tradeoffs: speed, accuracy, flexibility | ||
+ | * High-level simulation vs. detailed simulation | ||
+ | * High-level simulation is faster, but lower accuracy | ||
+ | * Controllers that works on multiple types of cores | ||
+ | * Design problems: how to find a good scheduling policy on its own? | ||
+ | * Self-optimizing memory controller: using machine learning | ||
+ | * Can adapt to the applications | ||
+ | * The complexity is very high | ||
+ | * Tolerate latency can be costly | ||
+ | * Instruction window is complex | ||
+ | * Benefit also diminishes | ||
+ | * Designing the buffers can be complex | ||
+ | * A simpler way to tolerate out of order is desirable | ||
+ | * Different sources that cause the core to stall in OoO | ||
+ | * Cache miss | ||
+ | * Note that stall happens if the inst. window is full | ||
+ | * Scaling instruction window size is hard | ||
+ | * It is better (less complex) to make the windows more efficient | ||
+ | * Runahead execution | ||
+ | * Try to optain MLP w/o increasing instruction windows | ||
+ | * Runahead (i.e. execute ahead) when there is a long memory instruction | ||
+ | * Long memory instruction stall processor for a while anyways, so it's better to make use out of it | ||
+ | * Execute future instruction to generate accurate prefetches | ||
+ | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoing the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | |||
+ | |||
+ | ===== Lecture 25 (4/1 Wed.) ===== | ||
+ | |||
+ | * More Runahead executions | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoing the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | * Cost and benefit of runahead execution (slide number 27) | ||
+ | * Runahead can have inefficiency | ||
+ | * Runahead period that are useless | ||
+ | * Get rid of useless inefficient period | ||
+ | * What if there is a dependent cache miss | ||
+ | * Cannot be paralellized in a vanilla runahead | ||
+ | * Can predict the value of the dependent load | ||
+ | * How to predict the address of the load | ||
+ | * Delta value information | ||
+ | * Stride predictor | ||
+ | * AVD prediction | ||
+ | * Questions regarding prefetching | ||
+ | * What to prefetch | ||
+ | * When to prefetch | ||
+ | * how do we prefetch | ||
+ | * where to prefetch from | ||
+ | * Prefetching can cause thrasing (evict a useful block) | ||
+ | * Prefetching can also be useless (not being used) | ||
+ | * Need to be efficient | ||
+ | * Can cause memory bandwidth problem in GPU | ||
+ | * Prefetch the whole block, more than one block, or subblock? | ||
+ | * Each one of them has pros and cons | ||
+ | * Big prefetch is more likely to waste bandwidth | ||
+ | * Commonly done in a cache block granularity | ||
+ | * Prefetch accuracy: fraction of useful prefetches out of all the prefetches | ||
+ | * Prefetcher usually predict based on | ||
+ | * Past knowledge | ||
+ | * Compiler hints | ||
+ | * Prefetcher has to prefetch at the right time | ||
+ | * Prefetch that is too early might get evicted | ||
+ | * It might also evict other useful data | ||
+ | * Prefetch too late does not hide the whole memory latency | ||
+ | * Previous prefetches at the same PC can be used as the history | ||
+ | * Previous demand requests also is a good information to use for prefetches | ||
+ | * Prefetch buffer | ||
+ | * Place the prefetch data to avoid thrashing | ||
+ | * Can treat demand/prefetch requests separately | ||
+ | * More complex | ||
+ | * Generally, demand block is more important | ||
+ | * This means eviction should prefer prefetch block as oppose to demand block | ||
+ | * Tradeoffs between where do we place the prefetcher | ||
+ | * Look at L1 hits and misses | ||
+ | * Look at L1 misses only | ||
+ | * Look at L2 misses | ||
+ | * Different access pattern affect accuracy | ||
+ | * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss) | ||
+ | * Software vs. hardware vs. execution based prefetching | ||
+ | * Software: ISA previde prefetch instructions, software utilize it | ||
+ | * What information are useful | ||
+ | * How to make sure the prefetch is timely | ||
+ | * What if you have a pointer based structure | ||
+ | * Not easy to prefetch pointer chasing (because in many case the work between prefetches is short, so you cannot predict the next one timely enough) | ||
+ | * Can be solved by hinting the nextnext and/or nextnextnext address | ||
+ | * Hardware: Identify the pattern and prefetch | ||
+ | * Execution driven: Oppotunistically try to prefetch (runahead, dual-core execution) | ||
+ | * Stride prefetcher | ||
+ | * Predict strides, which is common in many programs | ||
+ | * Cache block based or instruction based | ||
+ | * Stream buffer design | ||
+ | * Buffer the stream of accesses (next address) | ||
+ | * Use the information to prefetch | ||
+ | * What affect prefetcher performance | ||
+ | * Prefetch distance | ||
+ | * How far ahead should we prefetch | ||
+ | * Prefetch degree | ||
+ | * How many prefetches do we prefetch | ||
+ | * Prefetcher performance | ||
+ | * Coverage | ||
+ | * Out of the demand requests, how many are actually from the prefetch request | ||
+ | * Accuracy | ||
+ | * Out of all the prefetch requests, how many are actually getting used | ||
+ | * Timeliness | ||
+ | * How much memory latency can we hide from the prefetch requests | ||
+ | * Cache pullition | ||
+ | * How much did the prefetcher cause misses in the demand misses? | ||
+ | * Hard to quantify | ||
+ | |||
+ | |||
+ | ===== Lecture 26 (4/3 Fri.) ===== | ||
+ | |||
+ | * Feedback directed prefetcher | ||
+ | * Use the result of the prefetcher as a feedback to the prefetcher | ||
+ | * with accuracy, timeliness, polluting information | ||
+ | * Markov prefetcher | ||
+ | * Prefetch based on the previous history | ||
+ | * Use markov model to predict | ||
+ | * Pros: Can cover arbitary pattern (easy for link list traversal or trees) | ||
+ | * Downside: High cost, cannot help with compulsory misses (no history) | ||
+ | * Content directed prefetching | ||
+ | * Indentify the content in memory for pointers (which is used as the address to prefetch | ||
+ | * Not very efficient (hard to figure out which block is the pointer) | ||
+ | * Software can give hints | ||
+ | * Correlation table | ||
+ | * Address correlation | ||
+ | * Execution based prefetcher | ||
+ | * Helper thread/speculative thread | ||
+ | * Use another thread to pre-execute a program | ||
+ | * Can be a software based or hardware based | ||
+ | * Discover misses before the main program (to prefetch data in a timely manner) | ||
+ | * How do you construct the helper thread | ||
+ | * Preexecute instruction (one example of how to initialize a speculative thread), slide 9 | ||
+ | * Thread-based pre-execution | ||
+ | * Error tolerance | ||
+ | * Solution to errors | ||
+ | * Tolerate errors | ||
+ | * New interface, new design | ||
+ | * Eliminate or minimize errors | ||
+ | * New technology, system-wide rethinking | ||
+ | * Embrace errors | ||
+ | * Map data that can tolerate errors to error-prone area | ||
+ | * Hybrid memory systesm | ||
+ | * Combining multiple memory technology together | ||
+ | * What can emerging technology help? | ||
+ | * Scalability | ||
+ | * Lower the cost | ||
+ | * Energy efficiency | ||
+ | * Possible solutions to the scaling problem | ||
+ | * Less leakage DRAM | ||
+ | * Heterogeneous DRAM (TL-DRAM, etc.) | ||
+ | * Add more functionality to DRAM | ||
+ | * Denser design (3D stack) | ||
+ | * Different technology | ||
+ | * NVM | ||
+ | * Charge vs. resistice memory | ||
+ | * How data is written? | ||
+ | * How to read the data? | ||
+ | * Non volatile memory | ||
+ | * Resistive memory | ||
+ | * PCM | ||
+ | * Inject current to change the phase | ||
+ | * Scales better than DRAM | ||
+ | * Multiple bits per cell | ||
+ | * Wider resistence range | ||
+ | * No refresh is needed | ||
+ | * Downside: Latency and write endurance | ||
+ | * STT-MRAM | ||
+ | * Inject current to change the polarity | ||
+ | * Memristor | ||
+ | * Inject current to change the structure | ||
+ | * Pros and cons between different technologies | ||
+ | * Persistency - data stay there even without power | ||
+ | * Unified memory and storage management (persistent data structure) - Single level store | ||
+ | * Improve energy and performance | ||
+ | * Simplify programming model | ||
+ | * Different design options for DRAM + NVM | ||
+ | * DRAM as a cache | ||
+ | * Place some data in DRAM and other in PCM | ||
+ | * Based on the characteristics | ||
+ | * Frequently accessed data that need lower write latency in DRAM | ||
+ | | ||
+ | |||
+ | ===== Lecture 27 (4/6 Mon.) ===== | ||
+ | * Flynn's taxonomy | ||
+ | * Parallelism | ||
+ | * Reduces power consumption (P ~ CV^2F) | ||
+ | * Better cost efficiency and easier to scale | ||
+ | * Improves dependability (in case the other core is faulty | ||
+ | * Different types of parallelism | ||
+ | * Instruction level parallelism | ||
+ | * Data level parallelism | ||
+ | * Task level parallelism | ||
+ | * Task level parallelism | ||
+ | * Partition a single, potentially big, task into multiple parallel sub-task | ||
+ | * Can be done explicitly (parallel programming by the programmer) | ||
+ | * Or implicitly (hardware partitions a single thread speculatively) | ||
+ | * Or, run multiple independent tasks (still improves throughput, but the speedup of any single tasks is not better, also simpler to implement) | ||
+ | * Loosely coupled multiprocessor | ||
+ | * No shared global address space | ||
+ | * Message passing to communicate between different sources | ||
+ | * Simple to manage memory | ||
+ | * Tightly coupled multiprocessor | ||
+ | * Shared global address space | ||
+ | * Need to ensure consistency of data | ||
+ | * Programming issues | ||
+ | * Hardware-based multithreading | ||
+ | * Coarse grained | ||
+ | * Find grained | ||
+ | * Simultaneous: Dispatch instruction from multiple threads at the same time | ||
+ | * Parallel speedup | ||
+ | * Superlinear speedup | ||
+ | * Utilization, Redundancy, Efficiency | ||
+ | * Amdahl's law | ||
+ | * Maximum speedup | ||
+ | * Parallel portion is not perfect | ||
+ | * Serial bottleneck | ||
+ | * Synchronization cost | ||
+ | * Load balance | ||
+ | * Some threads has more work, requires more time to hit the sync. point | ||
+ | * Critical sections | ||
+ | * Enforce mutually exclusive access to shared data | ||
+ | * Issues in parallel programming | ||
+ | * Correctness | ||
+ | * Synchronization | ||
+ | * Consistency | ||
+ | |||
+ | |||
+ | ===== Lecture 28 (4/8 Wed.) ===== | ||
+ | * Ordering of instructions | ||
+ | * Maintaining memory consistency when there are multiple threads and shared memory | ||
+ | * Need to ensure the semantic is not changed | ||
+ | * Making sure the shared data is properly locked when used | ||
+ | * Support mutual exclusion | ||
+ | * Ordering depends on when each processor is executed | ||
+ | * Debugging is also difficult (non-deterministic behavior) | ||
+ | * Dekker's algorithm | ||
+ | * Inconsistency -- the two processors did NOT see the same order of operations to memory | ||
+ | * Sequential consistency | ||
+ | * Multiple correct global orders | ||
+ | * Two issues: | ||
+ | * Too conservative/strict | ||
+ | * Performance limiting | ||
+ | * Weak consistency: global ordering when sync | ||
+ | * programmer hints where the synchronizations are | ||
+ | * Memory fence | ||
+ | * More burden on the programmers | ||
+ | * Cache coherence | ||
+ | * Can be done in the software level or hardware level | ||
+ | * Snoop-based coherence | ||
+ | * A simple protocol with two states by broadcasting reads/writes on a bus | ||
+ | * Maintaining coherence | ||
+ | * Needs to provide 1) write propagation and 2) write serialization | ||
+ | * Update vs. Invalidate | ||
+ | * Two cache coherence methods | ||
+ | * Snoopy bus | ||
+ | * Bus based, single point of serialization | ||
+ | * More efficient with small number of processors | ||
+ | * Processors snoop other caches read/write requests to keep the cache block coherent | ||
+ | * Directory | ||
+ | * Single point of serialization per block | ||
+ | * Directory coordinates the coherency | ||
+ | * More scalable | ||
+ | * The directory keeps track of where the copies of each block resides | ||
+ | * Supplies data on a read | ||
+ | * Invalidates the block on a write | ||
+ | * Has an exclusive state | ||
+ | |||
+ | ===== Lecture 29 (4/10 Fri.) ===== | ||
+ | * MSI coherent protocol | ||
+ | * The problem: unnecessary broadcasts of invalidations | ||
+ | * MESI coherent protocol | ||
+ | * Add the exclusive state: this is the only cache copy and it is a clean state to MSI | ||
+ | * Multiple invalidation tradeoffs | ||
+ | * Problem: memory can be unnecessarily updated | ||
+ | * A possible owner state (MOESI) | ||
+ | * Tradeoffs between snooping and directory based coherence protocols | ||
+ | * Slide 31 has a good summary | ||
+ | * Directory: data structures | ||
+ | * Bit vectors vs. linked lists | ||
+ | * Scalability of directories | ||
+ | * Size? Latency? Thousand of nodes? Best of both snooping and directory? | ||
+ | |||
+ | | ||
+ | ===== Lecture 30 (4/13 Mon.) ===== | ||
+ | * In-memory computing | ||
+ | * Design goals of DRAM | ||
+ | * DRAM structures | ||
+ | * Banks | ||
+ | * Capacitors and sense amplifiers | ||
+ | * Trade-offs b/w number of sense amps and cells | ||
+ | * Width of bank I/O vs. row size | ||
+ | * DRAM operations | ||
+ | * ACTIVATE, READ/WRITE, and PRECHARGE | ||
+ | * Trade-offs | ||
+ | * Latency | ||
+ | * Bandwidth: Chip vs. rank vs. bank | ||
+ | * What's the benefit of having 8 chips? | ||
+ | * Parallelism | ||
+ | * RowClone | ||
+ | * What are the problems? | ||
+ | * Copying b/w two rows that share the same sense amplifier | ||
+ | * System software support | ||
+ | * Bitwise AND/OR | ||
+ | |||
+ | ===== Lecture 31 (4/15 Wed.) ===== | ||
+ | |||
+ | * Application slowdown | ||
+ | * Interference between different applications | ||
+ | * Applications' performance depends on other applications that they are running with | ||
+ | * Predictable performance | ||
+ | * Why are they important? | ||
+ | * Applications that need predictibility | ||
+ | * How to predict the performance? | ||
+ | * What information are useful? | ||
+ | * What need to be guarantee? | ||
+ | * How to estimate the performance when running with others? | ||
+ | * Easy, just measure the performance while it is running. | ||
+ | * How to estimate the performance when the application is running by itself. | ||
+ | * Hard if there is no profiling. | ||
+ | * The relationship between memory service rate and the performance. | ||
+ | * Key assumption: applications are memory bound | ||
+ | * Behavior of memory-bound applications | ||
+ | * With and without interference | ||
+ | * Memory phase vs. compute phase | ||
+ | * MISE | ||
+ | * Estimating slowdown using request service rate | ||
+ | * Inaccuracy when measuring request service rate alone | ||
+ | * Non-memory-bound applications | ||
+ | * Control slowdown and provide soft guarantee | ||
+ | * Taking into account of the shared cache | ||
+ | * MISE model + cache resource management | ||
+ | * Aug tag store | ||
+ | * Separate tag store for different cores | ||
+ | * Cache access rate alone and shared as the metric to estimate slowdown | ||
+ | * Cache paritiioning | ||
+ | * How to determine partitioning | ||
+ | * Utility based cache partitioning | ||
+ | * Others | ||
+ | * Maximum slowdown and fairness metric | ||
+ | | ||
+ | |||
+ | |||
+ | ===== Lecture 32 (4/20 Mon.) ===== | ||
+ | |||
+ | * Heterogeneous systems | ||
+ | * Asymmetric cores: different types of cores on the chip | ||
+ | * Each of these cores are optimized for different workloads/requirements/goals | ||
+ | * Multiple special purpose processors | ||
+ | * Flexible and can adapt to workload behavior | ||
+ | * Disadvantages: complex and high overhead | ||
+ | * Examples: CPU-GPU systems, heterogeneity in execution models | ||
+ | * Heterogeneous resources | ||
+ | * Example: reliable and non-reliable DRAM in the same system | ||
+ | * Key problems in modern systems | ||
+ | * Memory system | ||
+ | * Efficiency | ||
+ | * Predictability | ||
+ | * Assymmetric design can help solving these problems | ||
+ | * Serialized code sections | ||
+ | * Bottleneck in multicore execution | ||
+ | * Parallelizable vs. serial portion | ||
+ | * Accelerate critical section | ||
+ | * Cache ping-ponging | ||
+ | * Synchronization latency | ||
+ | * Symmetric vs. assymmetric design | ||
+ | * Large cores + small cores | ||
+ | * Core assymmetry | ||
+ | * Amdahl's law with heterogeneous cores | ||
+ | * Parallel bottlenecks | ||
+ | * Resource contention | ||
+ | * Depends on what are running | ||
+ | * Accelerated critical section | ||
+ | * Ship critical sections to large cores | ||
+ | * Small modifications and low overhead | ||
+ | * False serialization might become the bottleneck | ||
+ | * Can reduce parallel throughput | ||
+ | * Effect on private cache misses and shared cache misses | ||
+ | | ||
+ | | ||
+ | ===== Lecture 33 (4/27 Mon.) ===== | ||
+ | |||
+ | * Interconnects | ||
+ | * Connecting multiple components together | ||
+ | * Goal: Scalability, flexibility, performance and energy efficiency | ||
+ | * Metric: Performance, bandwidth, bisection bandwidth, cost, energy efficienct, system performance, contention, latency | ||
+ | * Saturation point | ||
+ | * Saturation throughput | ||
+ | * Topology | ||
+ | * How to wire components together, affects routing, throughput, latency | ||
+ | * Bus: All nodes connected to a single ring | ||
+ | * Hard to increase frequency, bandwidth, poor scalability but simple | ||
+ | * Point-to-point | ||
+ | * Low contention and potentially low latency. Costly, not scalable and hard to wire. | ||
+ | * Crossbar | ||
+ | * No contention. Concurrent request from different src/dest can be sent concurrently. Costly. | ||
+ | * Multistage logarithmic network | ||
+ | * Indirect network, low contention, multiple request can be sent concurrently. More scalable compared to crossbar. | ||
+ | * Circuit switch | ||
+ | * Omega network, delta network. | ||
+ | * Butterfly network | ||
+ | * Intermediate switch between sources and destinations | ||
+ | * Switching vs. topology | ||
+ | * Ring | ||
+ | * Each node connected to two other nodes, forming a ring | ||
+ | * Low overhead, high latency, not as scalable. | ||
+ | * Unidirectional ring and bi-directional ring | ||
+ | * Hierarchical Rings | ||
+ | * Layers of rings. More scalable, lower latency. | ||
+ | * Bridge router connect multiple rings together | ||
+ | * Mesh | ||
+ | * 4 input and output ports | ||
+ | * More bisection bandwidth and more scalable | ||
+ | * Easy to layout | ||
+ | * Path diversity | ||
+ | * Routers are more complex | ||
+ | * Tree | ||
+ | * Another hierarchical topology | ||
+ | * Specialized topology | ||
+ | * Good for local traffic | ||
+ | * Fat tree: higher level have more bandwidth | ||
+ | * CM-5 Fat tree | ||
+ | * Fat tree with 4x2 switches | ||
+ | * Hypercube | ||
+ | * N-Dimensional cubes | ||
+ | * Caltech cosmic cube | ||
+ | * Very complex | ||
+ | * Routing algorithm | ||
+ | * How does message get sent from source to destination | ||
+ | * Static or adaptive | ||
+ | * Handling contention | ||
+ | * Buffering helps handling contention, but adds complexity | ||
+ | * Three types of routing algorithms | ||
+ | * Deterministic: always takes the same path | ||
+ | * Oblivious: takes different paths without taking into account of the state of the network | ||
+ | * For example, Valiant algorithm | ||
+ | * Adaptive: takes different paths taking into account of the state of the network | ||
+ | * Non-minimal adaptive routing vs. minimal adaptive routing | ||
+ | * Minimal path: path that has minimum number of hops | ||
+ | * Buffering and flow control | ||
+ | * How to store within the network | ||
+ | * Handling oversubscription | ||
+ | * Source throttling | ||
+ | * Bufferless vs. buffered crossbars | ||
+ | * Buffer overflow | ||
+ | * Bufferless deflection routing | ||
+ | * Deflect packets when there is contention | ||
+ | * Hot-potato routing | ||
+ | |||
+ | |||
+ | |