This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Next revision Both sides next revision | ||
buzzword [2015/02/25 21:05] kevincha [Lecture 17 (2/25 Wed.)] |
buzzword [2015/04/10 16:38] kevincha [Lecture 28 (4/8 Wed.)] |
||
---|---|---|---|
Line 45: | Line 45: | ||
* Memory wall (a part of scaling issue) | * Memory wall (a part of scaling issue) | ||
* Scaling issue | * Scaling issue | ||
- | * Transister are getting smaller | + | * Transistor are getting smaller |
* Key components of a computer | * Key components of a computer | ||
* Design points | * Design points | ||
Line 54: | Line 54: | ||
* Reliability problems that cause errors | * Reliability problems that cause errors | ||
* Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) | * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) | ||
- | * Pre paradigm science | + | * Pre-paradigm science |
* Normal science | * Normal science | ||
- | * Revolutionalry science | + | * Revolutionary science |
* Components of a computer | * Components of a computer | ||
* Computation | * Computation | ||
Line 76: | Line 76: | ||
* Operands | * Operands | ||
* Live-outs/Live-ins | * Live-outs/Live-ins | ||
- | * DIfferent types of data flow nodes (conditional/relational/barrier) | + | * Different types of data flow nodes (conditional/relational/barrier) |
* How to do transactional transaction in dataflow? | * How to do transactional transaction in dataflow? | ||
* Example: bank transactions | * Example: bank transactions | ||
Line 121: | Line 121: | ||
* Tradeoffs between 0,1,2,3 address machines | * Tradeoffs between 0,1,2,3 address machines | ||
* Postfix notation | * Postfix notation | ||
- | * Instructions/Opcode/Operade specifiers (i.e. addressing modes) | + | * Instructions/Opcode/Operand specifiers (i.e. addressing modes) |
* Simply vs. complex data type (and their tradeoffs) | * Simply vs. complex data type (and their tradeoffs) | ||
* Semantic gap and level | * Semantic gap and level | ||
Line 236: | Line 236: | ||
* Variable latency memory | * Variable latency memory | ||
* Handling interrupts | * Handling interrupts | ||
- | * Difference betwen interrupts and exceptions | + | * Difference between interrupts and exceptions |
* Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | * Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | ||
* Updating machine behavior | * Updating machine behavior | ||
Line 257: | Line 257: | ||
* Idle resources | * Idle resources | ||
* Throughput of a pipelined design | * Throughput of a pipelined design | ||
- | * What dictacts the throughput of a pipelined design? | + | * What dictates the throughput of a pipelined design? |
* Latency of the pipelined design | * Latency of the pipelined design | ||
* Dependency | * Dependency | ||
Line 336: | Line 336: | ||
* Inst. from different thread can fill-in the bubbles | * Inst. from different thread can fill-in the bubbles | ||
* Cost? | * Cost? | ||
- | * Simulteneuos multithreading | + | * Simultaneuos multithreading |
* Branch prediction | * Branch prediction | ||
* Guess what to fetch next. | * Guess what to fetch next. | ||
Line 345: | Line 345: | ||
* Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions. | * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions. | ||
* How many extra instructions are being fetched? | * How many extra instructions are being fetched? | ||
- | * What is the performance degredation? | + | * What is the performance degradation? |
* How to reduce the miss penalty? | * How to reduce the miss penalty? | ||
* Predicting the next address (non PC+4 address) | * Predicting the next address (non PC+4 address) | ||
Line 352: | Line 352: | ||
* Global branch history - for directions | * Global branch history - for directions | ||
* Can use compiler to profile and get more info | * Can use compiler to profile and get more info | ||
- | * Input set dictacts the accuracy | + | * Input set dictates the accuracy |
* Add time to compilation | * Add time to compilation | ||
* Heuristics that are common and doesn't require profiling. | * Heuristics that are common and doesn't require profiling. | ||
Line 358: | Line 358: | ||
* Does not require profiling | * Does not require profiling | ||
* Static branch prediction | * Static branch prediction | ||
- | * Pregrammer provides pragmas, hinting the likelihood of taken/not taken branch | + | * Programmer provides pragmas, hinting the likelihood of taken/not taken branch |
* For example, x86 has the hint bit | * For example, x86 has the hint bit | ||
* Dynamic branch prediction | * Dynamic branch prediction | ||
Line 369: | Line 369: | ||
* Why are they very important? | * Why are they very important? | ||
* Differences between 99% accuracy and 98% accuracy | * Differences between 99% accuracy and 98% accuracy | ||
- | * Cost of a misprediction when the pipeline is veryd eep | + | * Cost of a misprediction when the pipeline is very deep |
* Global branch correlation | * Global branch correlation | ||
* Some branches are correlated | * Some branches are correlated | ||
Line 405: | Line 405: | ||
* Use branch prediction on an easy to predict code | * Use branch prediction on an easy to predict code | ||
* Use predicated execution on a hard to predict code | * Use predicated execution on a hard to predict code | ||
- | * Compiler can be more aggressive in optimimzing the code | + | * Compiler can be more aggressive in optmizing the code |
* What are the tradeoffs (slide# 47) | * What are the tradeoffs (slide# 47) | ||
* Multi-path execution | * Multi-path execution | ||
Line 411: | Line 411: | ||
* Can lead to wasted work | * Can lead to wasted work | ||
* VLIW | * VLIW | ||
- | * SuperScalar | + | * Superscalar |
Line 434: | Line 434: | ||
* Reorder buffer | * Reorder buffer | ||
* Reorder results before they are visible to the arch. state | * Reorder results before they are visible to the arch. state | ||
- | * Need to presearve the sequential sematic and data | + | * Need to preserve the sequential semantic and data |
- | * What are the informatinos in the ROB entry | + | * What are the information in the ROB entry |
* Where to get the value from (forwarding path? reorder buffer?) | * Where to get the value from (forwarding path? reorder buffer?) | ||
* Extra logic to check where the youngest instructions/value is | * Extra logic to check where the youngest instructions/value is | ||
- | * Content addressible search (CAM) | + | * Content addressable search (CAM) |
* A lot of comparators | * A lot of comparators | ||
* Different ways to simplify the reorder buffer | * Different ways to simplify the reorder buffer | ||
Line 450: | Line 450: | ||
* An updated value (Speculative), called future file | * An updated value (Speculative), called future file | ||
* A backup value (to restore the state quickly | * A backup value (to restore the state quickly | ||
- | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone) | + | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressable memory (compared to ROB alone) |
* Branch misprediction resembles Exception | * Branch misprediction resembles Exception | ||
* The difference is that branch misprediction is not visible to the software | * The difference is that branch misprediction is not visible to the software | ||
Line 486: | Line 486: | ||
* Slides 28 --> register renaming | * Slides 28 --> register renaming | ||
* Slides 30-35 --> Exercise (also on the board) | * Slides 30-35 --> Exercise (also on the board) | ||
- | * This will be usefull for the midterm | + | * This will be useful for the midterm |
* Register aliasing table | * Register aliasing table | ||
* Broadcasting tags | * Broadcasting tags | ||
Line 503: | Line 503: | ||
* Still visible as a Von Neumann model | * Still visible as a Von Neumann model | ||
* Where does the efficiency come from? | * Where does the efficiency come from? | ||
- | * Size of the scheduling windors/reorder buffer. Tradeoffs? What make sense? | + | * Size of the scheduling windows/reorder buffer. Tradeoffs? What make sense? |
* Load/store handling | * Load/store handling | ||
* Would like to schedule them out of order, but make them visible in-order | * Would like to schedule them out of order, but make them visible in-order | ||
Line 538: | Line 538: | ||
* But the program needs to be very parallel | * But the program needs to be very parallel | ||
* Memory can be the bottleneck (due to very high MLP) | * Memory can be the bottleneck (due to very high MLP) | ||
- | * What does the functional units look like? Deep pipelin and simpler control. | + | * What does the functional units look like? Deep pipeline and simpler control. |
* CRAY-I is one of the examples of vector processor | * CRAY-I is one of the examples of vector processor | ||
* Memory access pattern in a vector processor | * Memory access pattern in a vector processor | ||
Line 549: | Line 549: | ||
* What if there are multiple memory ports? | * What if there are multiple memory ports? | ||
* Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | ||
- | * Helps handling sparse metrices | + | * Helps handling sparse matrices |
* Conditional operation | * Conditional operation | ||
* Structure of vector units | * Structure of vector units | ||
Line 579: | Line 579: | ||
* Lower SIMD efficiency | * Lower SIMD efficiency | ||
* What if you have layers of branches? | * What if you have layers of branches? | ||
- | * Dynamic wrap formation | + | * Dynamic warp formation |
* Combining threads from different warps to increase SIMD utilization | * Combining threads from different warps to increase SIMD utilization | ||
* This can cause memory divergence | * This can cause memory divergence | ||
Line 593: | Line 593: | ||
* How to street the instruction (determine dependency/stalling)? | * How to street the instruction (determine dependency/stalling)? | ||
* Instruction scheduling techniques (static vs. dynamic) | * Instruction scheduling techniques (static vs. dynamic) | ||
- | * Systoric arrays | + | * Systolic arrays |
* Processing elements transform data in chains | * Processing elements transform data in chains | ||
* Develop for image processing (for example, convolution) | * Develop for image processing (for example, convolution) | ||
Line 601: | Line 601: | ||
===== Lecture 16 (2/23 Mon.) ===== | ===== Lecture 16 (2/23 Mon.) ===== | ||
- | * Systoric arrays | + | * Systolic arrays |
* Processing elements transform data in chains | * Processing elements transform data in chains | ||
* Can be arrays of multi-dimensional processing elements | * Can be arrays of multi-dimensional processing elements | ||
Line 622: | Line 622: | ||
* Side entrance | * Side entrance | ||
* Fixed up code | * Fixed up code | ||
- | * How scheudling is done | + | * How scheduling is done |
* Instruction scheduling | * Instruction scheduling | ||
* Prioritization heuristics | * Prioritization heuristics | ||
Line 629: | Line 629: | ||
* Hyperblock | * Hyperblock | ||
* BS-ISA | * BS-ISA | ||
- | * Tradeoffs betwwen trace cache/Hyperblock/Superblock/BS-ISA | + | * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA |
| | ||
+ | ===== Lecture 17 (2/25 Wed.) ===== | ||
* IA-64 | * IA-64 | ||
* EPIC | * EPIC | ||
Line 640: | Line 640: | ||
* Non-faulting loads and exception propagation | * Non-faulting loads and exception propagation | ||
* Aggressive ST-LD reordering | * Aggressive ST-LD reordering | ||
- | * Phyiscal memory system | + | * Physical memory system |
* Ideal pipelines | * Ideal pipelines | ||
* Ideal cache | * Ideal cache | ||
Line 649: | Line 649: | ||
* DRAM cell | * DRAM cell | ||
* Cheap | * Cheap | ||
- | * Sense the purturbation through sense amplifier | + | * Sense the perturbation through sense amplifier |
* Slow and leaky | * Slow and leaky | ||
* SRAM cell (Cross coupled inverter) | * SRAM cell (Cross coupled inverter) | ||
Line 657: | Line 657: | ||
* Read access sequence | * Read access sequence | ||
* DRAM: Activate -> Read -> Precharge (if needed) | * DRAM: Activate -> Read -> Precharge (if needed) | ||
- | * What dominate the access laatency for DRAM and SRAM | + | * What dominate the access latency for DRAM and SRAM |
* Scaling issue | * Scaling issue | ||
* Hard to scale the scale to be small | * Hard to scale the scale to be small | ||
Line 685: | Line 685: | ||
* Cost and benefit of having more associativity | * Cost and benefit of having more associativity | ||
* Given the associativity, which block should be replace if it is full | * Given the associativity, which block should be replace if it is full | ||
- | * Replacement poligy | + | * Replacement policy |
* Random | * Random | ||
* Least recently used (LRU) | * Least recently used (LRU) | ||
Line 696: | Line 696: | ||
* Approximate LRU | * Approximate LRU | ||
* Victim and next Victim policy | * Victim and next Victim policy | ||
+ | |||
+ | ===== Lecture 18 (2/27 Fri.) ===== | ||
+ | * Tag store and data store | ||
+ | * Cache hit rate | ||
+ | * Average memory access time (AMAT) | ||
+ | * AMAT vs. Stall time | ||
+ | * Cache basics | ||
+ | * Direct mapped vs. associative cache | ||
+ | * Set/block (line)/Placement/replacement | ||
+ | * How do tag and index get used? | ||
+ | * Full associativity | ||
+ | * Set associative cache | ||
+ | * insertion, promotion, eviction (replacement) | ||
+ | * Various replacement policies | ||
+ | * How to implement LRU | ||
+ | * How to keep track of access ordering | ||
+ | * Complexity increases rapidly | ||
+ | * Approximate LRU | ||
+ | * Victim and next Victim policy | ||
+ | * Set thrashing | ||
+ | * Working set is bigger than the associativity | ||
+ | * Belady's OPT | ||
+ | * Is this optimal? | ||
+ | * Complexity? | ||
+ | * DRAM as a cache for disk | ||
+ | * Handling writes | ||
+ | * Write through | ||
+ | * Need a modified bit to make sure accesses to data got the updated data | ||
+ | * Write back | ||
+ | * Simpler, no consistency issues | ||
+ | * Sectored cache | ||
+ | * Use subblock | ||
+ | * lower bandwidth | ||
+ | * more complex | ||
+ | * Instruction vs data cache | ||
+ | * Where to place instructions | ||
+ | * Unified vs. separated | ||
+ | * In the first level cache | ||
+ | * Cache access | ||
+ | * First level access | ||
+ | * Second level access | ||
+ | * When to start the second level access | ||
+ | * Cache performance | ||
+ | * capacity | ||
+ | * block size | ||
+ | * associativity | ||
+ | * Classification of cache misses | ||
+ | ===== Lecture 19 (03/02 Mon.) ===== | ||
+ | * Subblocks | ||
+ | * Victim cache | ||
+ | * Small, but fully assoc. cache behind the actual cache | ||
+ | * Cached misses cache block | ||
+ | * Prevent ping-ponging | ||
+ | * Pseudo associativity | ||
+ | * Simpler way to implement associative cache | ||
+ | * Skewed assoc. cache | ||
+ | * Different hashing functions for each way | ||
+ | * Restructure data access pattern | ||
+ | * Order of loop traversal | ||
+ | * Blocking | ||
+ | * Memory level parallelism | ||
+ | * Cost per miss of a parallel cache miss is less costly compared to serial misses | ||
+ | * MSHR | ||
+ | * Keep track of pending cache | ||
+ | * Think of this as the load/store buffer-ish for cache | ||
+ | * What information goes into the MSHR? | ||
+ | * When do you access the MSHR? | ||
+ | * Memory banks | ||
+ | * Shared caches in multi-core processors | ||
+ | ===== Lecture 20 (03/04 Wed.) ===== | ||
+ | * Virtual vs. physical memory | ||
+ | * System's management on memory | ||
+ | * Benefits | ||
+ | * Problem: physical memory has limited size | ||
+ | * Mechanisms: indirection, virtual addresses, and translation | ||
+ | * Demand paging | ||
+ | * Physical memory as a cache | ||
+ | * Tasks of system SW for VM | ||
+ | * Serving a page fault | ||
+ | * Address translation | ||
+ | * Page table | ||
+ | * PTE (page table entry) | ||
+ | * Page replacement algorithm | ||
+ | * CLOCK algo. | ||
+ | * Inverted page table | ||
+ | * Page size trade-offs | ||
+ | * Protection | ||
+ | * Multi-level page tables | ||
+ | * x86 implementation of page table | ||
+ | * TLB | ||
+ | * Handling misses | ||
+ | * When to do address translation? | ||
+ | * Homonym and Synonyms | ||
+ | * Homonym: Same VA but maps to different PA with multiple processes | ||
+ | * Synonyms: Multiple VAs map to the same PA | ||
+ | * Shared libraries, shared data, copy-on-write | ||
+ | * Virtually indexed vs. physically indexed | ||
+ | * Virtually tagged vs. physically tagged | ||
+ | * Virtually indexed physically tagged | ||
+ | * Can these create problems when we have the cache | ||
+ | * How to eliminate these problems? | ||
+ | * Page coloring | ||
+ | * Interaction between cache and TLB | ||
+ | |||
+ | |||
+ | ===== Lecture 21 (03/23 Mon.) ===== | ||
+ | |||
+ | * DRAM scaling problem | ||
+ | * Demands/trends affecting the main memory | ||
+ | * More capacity | ||
+ | * Low energy | ||
+ | * More bandwidth | ||
+ | * QoS | ||
+ | * ECC in DRAM | ||
+ | * Multi-porting | ||
+ | * Virtual multi-porting | ||
+ | * Time-share the port, not too scalable but cheap | ||
+ | * True multiporting | ||
+ | * Multiple cache copies | ||
+ | * Alignment | ||
+ | * Banking | ||
+ | * Can have bank conflict | ||
+ | * Extra interconnects across banks | ||
+ | * Address mapping can mitigate bank conflict | ||
+ | * Common in main memory (note that regFile in GPU is also banked, but mainly for the pupose of reducing complexity) | ||
+ | * Bank mapping | ||
+ | * How to avoid bank conflicts? | ||
+ | * Channel mapping | ||
+ | * Address mapping to minimize bank conflict | ||
+ | * Page coloring | ||
+ | * Virtual to physical mapping that can help reducing conflicts | ||
+ | * Accessing DRAM | ||
+ | * Row bits | ||
+ | * Column bits | ||
+ | * Addressibility | ||
+ | * DRAM has its own clock | ||
+ | * Sense amplifier | ||
+ | * Bit lines | ||
+ | * Word lines | ||
+ | * DRAM (2T) vs. SRAM (6T) | ||
+ | * Cost | ||
+ | * Latency | ||
+ | * Interleaving in DRAM | ||
+ | * Effects from address mapping on memory interleaving | ||
+ | * Effects from memory access patterns from the program on interleaving | ||
+ | * DRAM Bank | ||
+ | * To minimize the cost of interleaving (Shared the data bus and the command bus) | ||
+ | * DRAM Rank | ||
+ | * Minimize the cost of the chip (a bundle of chips operated together) | ||
+ | * DRAM Channel | ||
+ | * An interface to DRAM, each with its own ranks/banks | ||
+ | * DRAM Chip | ||
+ | * DIMM | ||
+ | * More DIMM adds the interconnect complexity | ||
+ | * List of commands to read/write data into DRAM | ||
+ | * Activate -> read/write -> precharge | ||
+ | * Activate moves data into the row buffer | ||
+ | * Precharge prepare the bank for the next access | ||
+ | * Row buffer hit | ||
+ | * Row buffer conflict | ||
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
+ | * Memory controller | ||
+ | * Sending DRAM commands | ||
+ | * Periodically send commands to refresh DRAM cells | ||
+ | * Ensure correctness and data integrity | ||
+ | * Where to place the memory controller | ||
+ | * On CPU chip vs. at the main memory | ||
+ | * Higher BW on-chip | ||
+ | * Determine the order of requests that will be serviced in DRAM | ||
+ | * Request queues that hold requests | ||
+ | * Send requests whenever the request can be sent to the bank | ||
+ | * Determine which command (across banks) should be sent to DRAM | ||
+ | |||
+ | ===== Lecture 22 (03/25 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Flash controller | ||
+ | * Flash memory | ||
+ | * Garbage collection in flash | ||
+ | * Overhead in flash memory | ||
+ | * Erase (off the critical path, but takes a long time) | ||
+ | * Different types of DRAM | ||
+ | * DRAM design choices | ||
+ | * Cost/density/latency/BW/Yield | ||
+ | * Sense Amplifier | ||
+ | * How do they work | ||
+ | * Dual data rate | ||
+ | * Subarray | ||
+ | * Rowclone | ||
+ | * Moving bulk of data from one row to others | ||
+ | * Lower latency and BW when performing copies/zeroes out the data | ||
+ | * TL-DRAM | ||
+ | * Far segment | ||
+ | * Near segment | ||
+ | * What causes the long latency | ||
+ | * Benefit of TL-DRAM | ||
+ | * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) | ||
+ | * List of commands to read/write data into DRAM | ||
+ | * Activate -> read/write -> precharge | ||
+ | * Activate moves data into the row buffer | ||
+ | * Precharge prepare the bank for the next access | ||
+ | * Row buffer hit | ||
+ | * Row buffer conflict | ||
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
+ | * Memory controller | ||
+ | * Sending DRAM commands | ||
+ | * Periodically send commands to refresh DRAM cells | ||
+ | * Ensure correctness and data integrity | ||
+ | * Where to place the memory controller | ||
+ | * On CPU chip vs. at the main memory | ||
+ | * Higher BW on-chip | ||
+ | * Determine the order of requests that will be serviced in DRAM | ||
+ | * Request queues that hold requests | ||
+ | * Send requests whenever the request can be sent to the bank | ||
+ | * Determine which command (across banks) should be sent to DRAM | ||
+ | * Priority of demand vs. prefetch requests | ||
+ | * Memory scheduling policies | ||
+ | * FCFS | ||
+ | * FR-FCFS | ||
+ | * Try to maximize row buffer hit rate | ||
+ | * Capped FR-FCFS: FR-FCFS with a timeout | ||
+ | * Usually this is done in a command level (read/write commands and precharge/activate commands) | ||
+ | * PAR-BS | ||
+ | * Key benefits | ||
+ | * stall time | ||
+ | * shortest job first | ||
+ | * STFM | ||
+ | * ATLAS | ||
+ | * TCM | ||
+ | * Key benefits | ||
+ | * Configurability | ||
+ | * Fairness + performance at the same time | ||
+ | * Robuestness isuees | ||
+ | * Open row policy | ||
+ | * Closed row policy | ||
+ | * QoS | ||
+ | * QoS issues in memory scheduling | ||
+ | * Fairness | ||
+ | * Performance guarantee | ||
+ | |||
+ | |||
+ | ===== Lecture 23 (03/27 Fri.) ===== | ||
+ | |||
+ | * Different ways to control interference in DRAM | ||
+ | * Partitioning of resource | ||
+ | * Channel partitioning: map applications that interfere with each other in a different channel | ||
+ | * Keep track of application's characteristics | ||
+ | * Dedicate a channel might waste the bandwidth | ||
+ | * Need OS support to determine the channel bits | ||
+ | * Source throttling | ||
+ | * A controller throttle the core depends on the performance target | ||
+ | * Example: Fairness via source throttling | ||
+ | * Detect unfairness and throttle application that is interfering | ||
+ | * How do you estimate slowdown? | ||
+ | * Threshold based solution: hard to configure | ||
+ | * App/thread scheduling | ||
+ | * Critical threads usually stall the progress | ||
+ | * Designing DRAM controller | ||
+ | * Has to handle the normal DRAM operations | ||
+ | * Read/write/refresh/all the timing constraints | ||
+ | * Keep track of resources | ||
+ | * Assign priorities to different requests | ||
+ | * Manage requests to banks | ||
+ | * Self-optimizing controller | ||
+ | * Use machine learning to improve DRAM controller | ||
+ | * A-DRM | ||
+ | * Architecture aware DRAM | ||
+ | * Multithread | ||
+ | * synchronization | ||
+ | * Pipeline programs | ||
+ | * Producer consumer model | ||
+ | * Critical path | ||
+ | * Limiter threads | ||
+ | * Prioritization between threads | ||
+ | * Different power mode in DRAM | ||
+ | * DRAM Refresh | ||
+ | * Why does DRAM has to refresh every 64ms | ||
+ | * Banks are unavailable during refresh | ||
+ | * LPDDR mitigate this by using a per-bank refresh | ||
+ | * Has to spend longer time with bigger DRAM | ||
+ | * Distributed refresh: stagger refresh every 64 ms in a distributed manner | ||
+ | * As oppose to burst refresh (long pause time) | ||
+ | * RAIDR: Reduce DRAM refresh by profiling and binning | ||
+ | * Some row do not have to be refresh very frequently | ||
+ | * Profile the row | ||
+ | * High temperature changes the retention time: need online profiling | ||
+ | * Bloom filter | ||
+ | * Represent set membership | ||
+ | * Approximated | ||
+ | * Can contain false positive | ||
+ | * Better/more hash function helps eliminate this | ||
+ | | ||
+ | ===== Lecture 24 (03/30 Mon.) ===== | ||
+ | |||
+ | * Simulation | ||
+ | * Drawbacks of RTL simulations | ||
+ | * Time consuming | ||
+ | * Complex to develop | ||
+ | * Hard to perform design explorations | ||
+ | * Explore the design space quickly | ||
+ | * Match the behavior of existing systems | ||
+ | * Tradeoffs: speed, accuracy, flexibility | ||
+ | * High-level simulation vs. detailed simulation | ||
+ | * High-level simulation is faster, but lower accuracy | ||
+ | * Controllers that works on multiple types of cores | ||
+ | * Design problems: how to find a good scheduling policy on its own? | ||
+ | * Self-optimizing memory controller: using machine learning | ||
+ | * Can adapt to the applications | ||
+ | * The complexity is very high | ||
+ | * Tolerate latency can be costly | ||
+ | * Instruction window is complex | ||
+ | * Benefit also diminishes | ||
+ | * Designing the buffers can be complex | ||
+ | * A simpler way to tolerate out of order is desirable | ||
+ | * Different sources that cause the core to stall in OoO | ||
+ | * Cache miss | ||
+ | * Note that stall happens if the inst. window is full | ||
+ | * Scaling instruction window size is hard | ||
+ | * It is better (less complex) to make the windows more efficient | ||
+ | * Runahead execution | ||
+ | * Try to optain MLP w/o increasing instruction windows | ||
+ | * Runahead (i.e. execute ahead) when there is a long memory instruction | ||
+ | * Long memory instruction stall processor for a while anyways, so it's better to make use out of it | ||
+ | * Execute future instruction to generate accurate prefetches | ||
+ | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoing the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | |||
+ | |||
+ | ===== Lecture 25 (4/1 Wed.) ===== | ||
+ | |||
+ | * More Runahead executions | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoing the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | * Cost and benefit of runahead execution (slide number 27) | ||
+ | * Runahead can have inefficiency | ||
+ | * Runahead period that are useless | ||
+ | * Get rid of useless inefficient period | ||
+ | * What if there is a dependent cache miss | ||
+ | * Cannot be paralellized in a vanilla runahead | ||
+ | * Can predict the value of the dependent load | ||
+ | * How to predict the address of the load | ||
+ | * Delta value information | ||
+ | * Stride predictor | ||
+ | * AVD prediction | ||
+ | * Questions regarding prefetching | ||
+ | * What to prefetch | ||
+ | * When to prefetch | ||
+ | * how do we prefetch | ||
+ | * where to prefetch from | ||
+ | * Prefetching can cause thrasing (evict a useful block) | ||
+ | * Prefetching can also be useless (not being used) | ||
+ | * Need to be efficient | ||
+ | * Can cause memory bandwidth problem in GPU | ||
+ | * Prefetch the whole block, more than one block, or subblock? | ||
+ | * Each one of them has pros and cons | ||
+ | * Big prefetch is more likely to waste bandwidth | ||
+ | * Commonly done in a cache block granularity | ||
+ | * Prefetch accuracy: fraction of useful prefetches out of all the prefetches | ||
+ | * Prefetcher usually predict based on | ||
+ | * Past knowledge | ||
+ | * Compiler hints | ||
+ | * Prefetcher has to prefetch at the right time | ||
+ | * Prefetch that is too early might get evicted | ||
+ | * It might also evict other useful data | ||
+ | * Prefetch too late does not hide the whole memory latency | ||
+ | * Previous prefetches at the same PC can be used as the history | ||
+ | * Previous demand requests also is a good information to use for prefetches | ||
+ | * Prefetch buffer | ||
+ | * Place the prefetch data to avoid thrashing | ||
+ | * Can treat demand/prefetch requests separately | ||
+ | * More complex | ||
+ | * Generally, demand block is more important | ||
+ | * This means eviction should prefer prefetch block as oppose to demand block | ||
+ | * Tradeoffs between where do we place the prefetcher | ||
+ | * Look at L1 hits and misses | ||
+ | * Look at L1 misses only | ||
+ | * Look at L2 misses | ||
+ | * Different access pattern affect accuracy | ||
+ | * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss) | ||
+ | * Software vs. hardware vs. execution based prefetching | ||
+ | * Software: ISA previde prefetch instructions, software utilize it | ||
+ | * What information are useful | ||
+ | * How to make sure the prefetch is timely | ||
+ | * What if you have a pointer based structure | ||
+ | * Not easy to prefetch pointer chasing (because in many case the work between prefetches is short, so you cannot predict the next one timely enough) | ||
+ | * Can be solved by hinting the nextnext and/or nextnextnext address | ||
+ | * Hardware: Identify the pattern and prefetch | ||
+ | * Execution driven: Oppotunistically try to prefetch (runahead, dual-core execution) | ||
+ | * Stride prefetcher | ||
+ | * Predict strides, which is common in many programs | ||
+ | * Cache block based or instruction based | ||
+ | * Stream buffer design | ||
+ | * Buffer the stream of accesses (next address) | ||
+ | * Use the information to prefetch | ||
+ | * What affect prefetcher performance | ||
+ | * Prefetch distance | ||
+ | * How far ahead should we prefetch | ||
+ | * Prefetch degree | ||
+ | * How many prefetches do we prefetch | ||
+ | * Prefetcher performance | ||
+ | * Coverage | ||
+ | * Out of the demand requests, how many are actually from the prefetch request | ||
+ | * Accuracy | ||
+ | * Out of all the prefetch requests, how many are actually getting used | ||
+ | * Timeliness | ||
+ | * How much memory latency can we hide from the prefetch requests | ||
+ | * Cache pullition | ||
+ | * How much did the prefetcher cause misses in the demand misses? | ||
+ | * Hard to quantify | ||
+ | |||
+ | |||
+ | ===== Lecture 26 (4/3 Fri.) ===== | ||
+ | |||
+ | * Feedback directed prefetcher | ||
+ | * Use the result of the prefetcher as a feedback to the prefetcher | ||
+ | * with accuracy, timeliness, polluting information | ||
+ | * Markov prefetcher | ||
+ | * Prefetch based on the previous history | ||
+ | * Use markov model to predict | ||
+ | * Pros: Can cover arbitary pattern (easy for link list traversal or trees) | ||
+ | * Downside: High cost, cannot help with compulsory misses (no history) | ||
+ | * Content directed prefetching | ||
+ | * Indentify the content in memory for pointers (which is used as the address to prefetch | ||
+ | * Not very efficient (hard to figure out which block is the pointer) | ||
+ | * Software can give hints | ||
+ | * Correlation table | ||
+ | * Address correlation | ||
+ | * Execution based prefetcher | ||
+ | * Helper thread/speculative thread | ||
+ | * Use another thread to pre-execute a program | ||
+ | * Can be a software based or hardware based | ||
+ | * Discover misses before the main program (to prefetch data in a timely manner) | ||
+ | * How do you construct the helper thread | ||
+ | * Preexecute instruction (one example of how to initialize a speculative thread), slide 9 | ||
+ | * Thread-based pre-execution | ||
+ | * Error tolerance | ||
+ | * Solution to errors | ||
+ | * Tolerate errors | ||
+ | * New interface, new design | ||
+ | * Eliminate or minimize errors | ||
+ | * New technology, system-wide rethinking | ||
+ | * Embrace errors | ||
+ | * Map data that can tolerate errors to error-prone area | ||
+ | * Hybrid memory systesm | ||
+ | * Combining multiple memory technology together | ||
+ | * What can emerging technology help? | ||
+ | * Scalability | ||
+ | * Lower the cost | ||
+ | * Energy efficiency | ||
+ | * Possible solutions to the scaling problem | ||
+ | * Less leakage DRAM | ||
+ | * Heterogeneous DRAM (TL-DRAM, etc.) | ||
+ | * Add more functionality to DRAM | ||
+ | * Denser design (3D stack) | ||
+ | * Different technology | ||
+ | * NVM | ||
+ | * Charge vs. resistice memory | ||
+ | * How data is written? | ||
+ | * How to read the data? | ||
+ | * Non volatile memory | ||
+ | * Resistive memory | ||
+ | * PCM | ||
+ | * Inject current to change the phase | ||
+ | * Scales better than DRAM | ||
+ | * Multiple bits per cell | ||
+ | * Wider resistence range | ||
+ | * No refresh is needed | ||
+ | * Downside: Latency and write endurance | ||
+ | * STT-MRAM | ||
+ | * Inject current to change the polarity | ||
+ | * Memristor | ||
+ | * Inject current to change the structure | ||
+ | * Pros and cons between different technologies | ||
+ | * Persistency - data stay there even without power | ||
+ | * Unified memory and storage management (persistent data structure) - Single level store | ||
+ | * Improve energy and performance | ||
+ | * Simplify programming model | ||
+ | * Different design options for DRAM + NVM | ||
+ | * DRAM as a cache | ||
+ | * Place some data in DRAM and other in PCM | ||
+ | * Based on the characteristics | ||
+ | * Frequently accessed data that need lower write latency in DRAM | ||
+ | | ||
+ | |||
+ | ===== Lecture 27 (4/6 Mon.) ===== | ||
+ | * Flynn's taxonomy | ||
+ | * Parallelism | ||
+ | * Reduces power consumption (P ~ CV^2F) | ||
+ | * Better cost efficiency and easier to scale | ||
+ | * Improves dependability (in case the other core is faulty | ||
+ | * Different types of parallelism | ||
+ | * Instruction level parallelism | ||
+ | * Data level parallelism | ||
+ | * Task level parallelism | ||
+ | * Task level parallelism | ||
+ | * Partition a single, potentially big, task into multiple parallel sub-task | ||
+ | * Can be done explicitly (parallel programming by the programmer) | ||
+ | * Or implicitly (hardware partitions a single thread speculatively) | ||
+ | * Or, run multiple independent tasks (still improves throughput, but the speedup of any single tasks is not better, also simpler to implement) | ||
+ | * Loosely coupled multiprocessor | ||
+ | * No shared global address space | ||
+ | * Message passing to communicate between different sources | ||
+ | * Simple to manage memory | ||
+ | * Tightly coupled multiprocessor | ||
+ | * Shared global address space | ||
+ | * Need to ensure consistency of data | ||
+ | * Programming issues | ||
+ | * Hardware-based multithreading | ||
+ | * Coarse grained | ||
+ | * Find grained | ||
+ | * Simultaneous: Dispatch instruction from multiple threads at the same time | ||
+ | * Parallel speedup | ||
+ | * Superlinear speedup | ||
+ | * Utilization, Redundancy, Efficiency | ||
+ | * Amdahl's law | ||
+ | * Maximum speedup | ||
+ | * Parallel portion is not perfect | ||
+ | * Serial bottleneck | ||
+ | * Synchronization cost | ||
+ | * Load balance | ||
+ | * Some threads has more work, requires more time to hit the sync. point | ||
+ | * Critical sections | ||
+ | * Enforce mutually exclusive access to shared data | ||
+ | * Issues in parallel programming | ||
+ | * Correctness | ||
+ | * Synchronization | ||
+ | * Consistency | ||
+ | |||
+ | |||
+ | ===== Lecture 28 (4/8 Wed.) ===== | ||
+ | * Ordering of instructions | ||
+ | * Maintaining memory consistency when there are multiple threads and shared memory | ||
+ | * Need to ensure the semantic is not changed | ||
+ | * Making sure the shared data is properly locked when used | ||
+ | * Support mutual exclusion | ||
+ | * Ordering depends on when each processor is executed | ||
+ | * Debugging is also difficult (non-deterministic behavior) | ||
+ | * Dekker's algorithm | ||
+ | * Inconsistency -- the two processors did NOT see the same order of operations to memory | ||
+ | * Sequential consistency | ||
+ | * Multiple correct global orders | ||
+ | * Two issues: | ||
+ | * Too conservative/strict | ||
+ | * Performance limiting | ||
+ | * Weak consistency: global ordering when sync | ||
+ | * programmer hints where the synchronizations are | ||
+ | * Memory fence | ||
+ | * More burden on the programmers | ||
+ | * Cache coherence | ||
+ | * Can be done in the software level or hardware level | ||
+ | * Snoop-based coherence | ||
+ | * A simple protocol with two states by broadcasting reads/writes on a bus | ||
+ | * Maintaining coherence | ||
+ | * Needs to provide 1) write propagation and 2) write serialization | ||
+ | * Update vs. Invalidate | ||
+ | * Two cache coherence methods | ||
+ | * Snoopy bus | ||
+ | * Bus based, single point of serialization | ||
+ | * More efficient with small number of processors | ||
+ | * Processors snoop other caches read/write requests to keep the cache block coherent | ||
+ | * Directory | ||
+ | * Single point of serialization per block | ||
+ | * Directory coordinates the coherency | ||
+ | * More scalable | ||
+ | * The directory keeps track of where the copies of each block resides | ||
+ | * Supplies data on a read | ||
+ | * Invalidates the block on a write | ||
+ | * Has an exclusive state |