buzzword [2014/04/28 18:18] rachata
buzzword [2015/04/15 18:16] rachata
====== Buzzwords ======

Buzzwords are terms mentioned during lecture that are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of the course material.
===== Lecture 1 (1/12 Mon.) =====
* Level of transformation
* Algorithm
* Compiler
* Cross abstraction layers
* Tradeoffs
* Caches
* DRAM/memory controller
* DRAM banks
* Row buffer hit/miss
* Row buffer locality
* Unfairness
* Memory performance hog
* Shared DRAM memory system
* Streaming access vs. random access
* Memory scheduling policies
* Scheduling priority
* Retention time of DRAM
* Process variation
* Retention time profile
* Power consumption
* Bloom filter
* Hamming code
* Hamming distance
* DRAM row hammer
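To make the Bloom filter buzzword concrete, here is a minimal sketch in Python (the class name, sizes, and hash choice are illustrative, not from the lecture). It shows the key property: a Bloom filter can return false positives, but never false negatives.

```python
# Minimal Bloom filter sketch: a bit array plus k hash functions.
# Membership queries may return false positives, never false negatives.
class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive k bit positions by salting Python's built-in hash
        # (illustrative only; real designs use independent hash functions).
        return [hash((seed, item)) % self.num_bits for seed in range(self.num_hashes)]

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def maybe_contains(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos] for pos in self._positions(item))
```

Inserting an element sets k bits; a lookup that finds all k bits set may be a false positive caused by other insertions, which is the pros/cons tradeoff against an exact set.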

===== Lecture 2 (1/14 Wed.) =====
* Moore's Law
* Algorithm --> step-by-step procedure to solve a problem
* in-order execution
* out-of-order execution
* Memory wall (a part of scaling issue)
* Scaling issue
* Transistors are getting smaller
* Key components of a computer
* Design points
* Design processors to meet the design points
* Software stack
* Design decisions
* Datacenters
* Reliability problems that cause errors
* Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
* Pre-paradigm science
* Normal science
* Revolutionary science
* Components of a computer
* Computation
* Communication
* Storage
* DRAM
* NVRAM (Non-volatile memory): PCM, STT-MRAM
* Storage (Flash/Harddrive)
* Von Neumann Model (Control flow model)
* Stored program computer
* Properties of Von Neumann Model: Stored program, sequential instruction processing
* Unified memory
* When is an instruction interpreted as an instruction (as opposed to a datum)?
* Program counter
* Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
* Data flow model
* Data flow machine
* Operands
* Live-outs/Live-ins
* Different types of data flow nodes (conditional/relational/barrier)
* How to do transactions in dataflow?
* Example: bank transactions
* Tradeoffs between control-driven and data-driven
* Which is easier to program?
* Which is easier to compile?
* Which is more parallel (and does that mean it is faster)?
* Which machines are more complex to design?
* In control flow, when a program is stopped, there is a pointer to the current state (precise state).
* ISA vs. Microarchitecture
* Semantics in the ISA
* uArch should obey the ISA
* Changing the ISA is costly and can affect compatibility.
* Instruction pointers
* uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
* Out-of-order executions
* etc.
* Design techniques
* Adder implementation (Bit serial, ripple carry, carry lookahead)
* Connection machine (an example of a machine that uses bit-serial adders to trade off latency for more parallelism)
* Microprocessor: ISA + uArch + circuits
* What is part of the ISA? Instructions, memory, etc.
===== Lecture 3 (1/17 Fri.) =====
* Microarchitecture
* Three major tradeoffs of computer architecture
* Macro-architecture
* LC-3b ISA
* Unused instructions
* Bit steering
* Instruction processing style
* 0,1,2,3 address machines
* Stack machine
* Accumulator machine
* 2-operand machine
* 3-operand machine
* Tradeoffs between 0,1,2,3 address machines
* Postfix notation
* Instructions/Opcode/Operand specifiers (i.e. addressing modes)
* Simple vs. complex data types (and their tradeoffs)
* Semantic gap and level
* Translation layer
* Addressability
* Byte/bit addressable machines
* Virtual memory
* Big/little endian
* Benefits of having registers (data locality)
* Programmer visible (Architectural) state
* Programmers can access this directly
* What are the benefits?
* Microarchitectural state
* Programmers cannot access this directly
* Evolution of registers (from accumulators to registers)
* Different types of instructions
* Control instructions
* Data instructions
* Operation instructions
* Addressing modes
* Tradeoffs (complexity, flexibility, etc.)
* Orthogonal ISA
* Addressing modes that are orthogonal to instruction types
* I/O devices
* Vectored vs. non-vectored interrupts
* Complex vs. simple instructions
* Tradeoffs
* RISC vs. CISC
* Tradeoff
* Backward compatibility
* Performance
* Optimization opportunity
* Translation

===== Lecture 4 (1/21 Wed.) =====
* Fixed vs. variable length instruction
* Huffman encoding
* Uniform vs. non-uniform decode
* Registers
* Tradeoffs between number of registers
* Alignments
* How does MIPS load words across the alignment boundary

===== Lecture 5 (1/26 Mon.) =====
* Tradeoffs in ISA: Instruction length
* Uniform vs. non-uniform
* Design point/Use cases
* What dictates the design point?
* Architectural states
* uArch
* How to implement the ISA in the uArch
* Different stages in the uArch
* Clock cycles
* Multi-cycle machine
* Datapath and control logic
* Control signals
* Execution time of instructions/program
* Metrics and what they mean
* Instruction processing
* Fetch
* Memory fetch
* Writeback
* Encoding and semantics
* Different types of instructions (I-type, R-type, etc.)
* Control flow instructions
* Critical path analysis
* Critical path of a single cycle processor
* What is in the control signals?
* Combinational logic & Sequential logic
* Control store
* Tradeoffs of a single cycle uarch
* Design principles
* Common case design
* Critical path design
* Balanced designs
* Dynamic power/Static power
* Increases in power due to frequency
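The "Execution time of instructions/program" and "Metrics" items boil down to one formula: execution time = instruction count × CPI × clock period. A small sketch with illustrative numbers (not from the lecture), comparing a single-cycle design point (CPI = 1, but the clock must fit the slowest instruction) against a multi-cycle design point (CPI > 1, much shorter clock):

```python
# Execution time = instruction count x CPI x clock period (seconds).
def execution_time(instructions, cpi, clock_period_s):
    return instructions * cpi * clock_period_s

# Illustrative design points: a single-cycle machine pays the slowest
# instruction's latency on every cycle; a multi-cycle machine has a
# higher CPI but a much shorter clock period.
single_cycle = execution_time(1_000_000, cpi=1.0, clock_period_s=5e-9)
multi_cycle = execution_time(1_000_000, cpi=3.5, clock_period_s=1e-9)
```

With these (made-up) numbers the multi-cycle machine wins despite its higher CPI, which is why neither metric alone is a meaningful measure of performance.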

===== Lecture 6 (1/28 Mon.) =====
* Design principles
* Common case design
* Critical path design
* Balanced designs
* Multi cycle design
* Microcoded/Microprogrammed machines
* States
* Translation from one state to another
* Microinstructions
* Microsequencing
* Variable latency memory
* Handling interrupts
* Difference between interrupts and exceptions
* Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
* Updating machine behavior
* Vertical microcode
* Primitives

===== Lecture 7 (1/30 Fri.) =====
* Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
* Updating machine behavior
* Horizontal microcode
* Vertical microcode
* Primitives
* nanocode and millicode
* what are the differences between nano/milli/microcode
* microprogrammed vs. hardwired control
* Pipelining
* Limitations of the multi-programmed design
* Idle resources
* Throughput of a pipelined design
* What dictates the throughput of a pipelined design?
* Latency of the pipelined design
* Dependency
* Pipeline flush
* Speculation
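The throughput and latency items above can be put in one formula. Ideally, an N-stage pipeline finishes its first instruction after N cycles and then completes one per cycle, so I instructions take N + I - 1 cycles instead of N × I. A minimal sketch (function names are illustrative):

```python
# Ideal (no-stall) pipeline timing: the first instruction takes N cycles
# to drain, then one instruction completes per cycle.
def pipelined_cycles(num_stages, num_instructions):
    return num_stages + num_instructions - 1

# Without pipelining, each instruction occupies all N stages serially.
def unpipelined_cycles(num_stages, num_instructions):
    return num_stages * num_instructions
```

Note what improves: per-instruction latency does not (it can even worsen slightly due to latch overhead); only throughput does, approaching one instruction per cycle for long instruction streams.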

===== Lecture 8 (2/2 Mon.) =====

* Interlocking
* Multipath execution
* No-op (Bubbles in the pipeline)
* Valid bits in the instructions
* Branch prediction
* Different types of data dependence
* Based on the profile info, position the code accordingly
* Try to make the next sequential instruction be the next inst. to be executed
* Predicate combining (combine predicates for a branch instruction)
* Predicated execution (control dependence becomes data dependence)

===== Lecture 9 (2/4 Wed.) =====
* Definition of basic blocks
* Control flow graph
* Delayed branching
* benefit?
* Inst. from different threads can fill in the bubbles
* Cost?
* Simultaneous multithreading
* Branch prediction
* Guess what to fetch next.
* Given the program/number of instructions, percent of branches, branch prediction accuracy, and penalty cost, how to compute the cost coming from branch mispredictions.
* How many extra instructions are being fetched?
* What is the performance degradation?
* How to reduce the miss penalty?
* Predicting the next address (non PC+4 address)
* Global branch history - for directions
* Can use compiler to profile and get more info
* Input set dictates the accuracy
* Adds time to compilation
* Heuristics that are common and don't require profiling.
* Might be inaccurate
* Does not require profiling
* Static branch prediction
* Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch
* For example, x86 has the hint bit
* Dynamic branch prediction
* One more bit for hysteresis
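The last items (dynamic branch prediction with "one more bit for hysteresis") describe the classic two-bit saturating counter. A minimal sketch (the state encoding and starting state are illustrative choices):

```python
# Two-bit saturating counter: states 0..3, predict taken when state >= 2.
# The second bit adds hysteresis: one wrong outcome weakens the
# prediction but does not flip it.
class TwoBitCounter:
    def __init__(self, state=2):  # start at "weakly taken" (illustrative)
        self.state = state

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)  # saturate at strongly taken
        else:
            self.state = max(0, self.state - 1)  # saturate at strongly not-taken
```

Starting from strongly taken, a single not-taken outcome only weakens the prediction; it takes two consecutive not-taken outcomes to flip it, which is why loop-closing branches mispredict only once per loop exit.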
===== Lecture 10 (2/6 Fri.) =====
* Branch prediction accuracy
* Why is it very important?
* Differences between 99% accuracy and 98% accuracy
* Cost of a misprediction when the pipeline is very deep
* Global branch correlation
* Some branches are correlated
* What information is used
* Two-level branch prediction
* What entries do you keep in the global history?
* What entries do you keep in the local history?
* How many tables?
* Store both GHR and PC in one combined index
* How do you use the information? Why is the XOR result still usable?
* Warmup cost of the branch predictor
* Hybrid solution? Fast warmup is used first, then switch to the slower one.
* Tournament predictor (Alpha 21264)
* Predicated execution - eliminate branches
* What are the tradeoffs
* What if the block is big (can lead to executing a lot of useless work)
* Allows easier code optimization
* From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
* Use branch prediction on easy-to-predict code
* Use predicated execution on hard-to-predict code
* Compiler can be more aggressive in optimizing the code
* What are the tradeoffs (slide# 47)
* Multi-path execution
* Execute both paths
* Can lead to wasted work
* VLIW
* Superscalar
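The XOR items above describe gshare-style indexing: fold the global history register (GHR) and the PC into a single predictor-table index, so one entry captures both sources of correlation. A minimal sketch (the bit widths and the 4-byte-instruction assumption are illustrative, not from the lecture):

```python
# gshare-style indexing: XOR the global history register with PC bits.
# Different (PC, history) pairs tend to map to different table entries,
# which is why the XOR result is still usable as an index.
def gshare_index(pc, ghr, index_bits):
    mask = (1 << index_bits) - 1
    # Drop the low offset bits of the PC (assumes 4-byte instructions).
    return ((pc >> 2) ^ ghr) & mask
```

The same branch seen under two different histories indexes two different counters, letting the predictor learn a per-history direction for correlated branches.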

===== Lecture 11 (2/11 Wed.) =====
* Geometric GHR length for branch prediction
* Perceptron branch predictor
* Multi-cycle executions (Different functional units take different numbers of cycles)
* Instructions can retire out-of-order
* Exceptions and Interrupts
* When are they handled?
* Why should some interrupts be handled right away?
* Precise exception
* arch. state should be consistent before handling the exception/interrupts
* Easier to restart the processes
* How to ensure precise exceptions?
* Tradeoffs between each method
* Reorder buffer
* Reorder results before they are visible to the arch. state
* Need to preserve the sequential semantics and data
* What information is in the ROB entry
* Where to get the value from (forwarding path? reorder buffer?)
* Extra logic to check where the youngest instruction/value is
* Content addressable search (CAM)
* A lot of comparators
* Different ways to simplify the reorder buffer
* Future file (commonly used, along with reorder buffer)
* Keep two sets of register files
* An updated value (Speculative), called the future file
* A backup value (to restore the state quickly)
* Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to ROB alone)
* Branch misprediction resembles Exception
* The difference is that branch misprediction is not visible to the software
* Checkpointing
* Advantages?
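The reorder-buffer items can be summarized in a few lines: instructions complete out of order, but retirement (making results visible to the architectural state) happens only from the head, in program order, which is what keeps exceptions precise. A minimal sketch (the entry format is illustrative; a real ROB entry also holds the destination, value, and exception bits):

```python
from collections import deque

# ROB sketch: entries are allocated in fetch order; results complete
# out of order, but retirement only happens from the head.
class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # FIFO in program order

    def allocate(self, name):
        entry = {"name": name, "done": False}
        self.entries.append(entry)
        return entry

    def complete(self, entry):
        entry["done"] = True  # out-of-order completion

    def retire(self):
        # In-order retirement: stop at the first incomplete entry.
        retired = []
        while self.entries and self.entries[0]["done"]:
            retired.append(self.entries.popleft()["name"])
        return retired
```

If the head instruction raises an exception, everything behind it in the ROB is simply discarded, so the architectural state is exactly the sequential state at the faulting instruction.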

===== Lecture 12 (2/13 Fri.) =====

* Renaming
* Register renaming table
* Predictor (branch predictor, cache line predictor ...)
* Power budget (and its importance)
* Slides 28 --> register renaming
* Slides 30-35 --> Exercise (also on the board)
* This will be useful for the midterm
* Register aliasing table
* Broadcasting tags
* Using dataflow
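Register renaming with a renaming (aliasing) table can be sketched in a few lines: every new destination gets a fresh physical tag, and sources read the latest mapping, which removes false (WAR/WAW) dependences while preserving true ones. The function name and the (dst, src1, src2) tuple format are illustrative, not from the lecture:

```python
# Register renaming sketch: map architectural registers to fresh tags.
def rename(instructions):
    rat = {}       # register aliasing table: arch. reg -> latest tag
    next_tag = 0
    renamed = []
    for dst, src1, src2 in instructions:
        # Sources read the newest mapping; unmapped sources keep their name.
        s1 = rat.get(src1, src1)
        s2 = rat.get(src2, src2)
        # Each destination write gets a brand-new tag (removes WAW/WAR).
        tag = "p%d" % next_tag
        next_tag += 1
        rat[dst] = tag
        renamed.append((tag, s1, s2))
    return renamed
```

After renaming, the second write to r1 gets a different tag than the first, so the only ordering left between instructions is true (flow) dependence, which is what the dataflow-style tag broadcast exploits.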

===== Lecture 13 (2/16 Mon.) =====

* OoO --> Restricted Dataflow
* Dispatch width
* Parallelism in the program
* What does it mean to be restricted data flow
* Still visible as a Von Neumann model
* Where does the efficiency come from?
* Size of the scheduling windows/reorder buffer. Tradeoffs? What makes sense?
* Load/store handling
* Would like to schedule them out of order, but make them visible in-order
* This is one of the most complex structures in load/store handling
* What information can be used to predict these load/store optimizations?
* Centralized vs. distributed? What are the tradeoffs?
* How to handle when there is a misprediction/recovery
* OoO + branch prediction?
* Speculatively update the history register
* When do you update the GHR?
* Token dataflow arch.
* What are tokens?
* Difficulties?
- | ===== Lecture 16 (2/24 Mon.) ===== | + | ===== Lecture 14 (2/18 Wed.) ===== |
* SISD/SIMD/MISD/MIMD | * SISD/SIMD/MISD/MIMD | ||
Line 571: | Line 538: | ||
* But the program needs to be very parallel | * But the program needs to be very parallel | ||
* Memory can be the bottleneck (due to very high MLP) | * Memory can be the bottleneck (due to very high MLP) | ||
- | * What does the functional units look like? Deep pipelin and simpler control. | + | * What does the functional units look like? Deep pipeline and simpler control. |
* CRAY-I is one of the examples of vector processor | * CRAY-I is one of the examples of vector processor | ||
* Memory access pattern in a vector processor | * Memory access pattern in a vector processor | ||
* How do the memory accesses benefit the memory bandwidth? | * How do the memory accesses benefit the memory bandwidth? | ||
- | * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a breif explanation of memory level parallelism | + | * Memory level parallelism |
* Stride length vs. the number of banks | * Stride length vs. the number of banks | ||
* stride length should be relatively prime to the number of banks | * stride length should be relatively prime to the number of banks | ||
Line 582: | Line 549: | ||
* What if there are multiple memory ports? | * What if there are multiple memory ports? | ||
* Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | ||
- | * Helps handling sparse metrices | + | * Helps handling sparse matrices |
* Conditional operation | * Conditional operation | ||
* Structure of vector units | * Structure of vector units | ||
* Intel SSE --> Modern version of MMX | * Intel SSE --> Modern version of MMX | ||
- | ===== Lecture 17 (2/26 Wed.) ===== | + | ===== Lecture 15 (2/20 Fri.) ===== |
* GPU | * GPU | ||
* Warp/Wavefront | * Warp/Wavefront | ||
* Lower SIMD efficiency | * Lower SIMD efficiency | ||
* What if you have layers of branches? | * What if you have layers of branches? | ||
- | * Dynamic wrap formation | + | * Dynamic warp formation |
* Combining threads from different warps to increase SIMD utilization | * Combining threads from different warps to increase SIMD utilization | ||
* This can cause memory divergence | * This can cause memory divergence | ||
* How to steer the instruction (determine dependency/stalling)? | * How to steer the instruction (determine dependency/stalling)? | ||
* Instruction scheduling techniques (static vs. dynamic) | * Instruction scheduling techniques (static vs. dynamic) | ||
- | * Systoric arrays | + | * Systolic arrays |
* Processing elements transform data in chains | * Processing elements transform data in chains | ||
* Developed for image processing (for example, convolution) | * Developed for image processing (for example, convolution) | ||
* Stage processing | * Stage processing | ||
+ | | ||
- | ===== Lecture 18 (2/28 Fri.) ===== | ||
- | |||
- | * Tradeoffs of VLIW | ||
- | * Why does VLIW required static instruction scheduling | ||
- | * Whose job it is? | ||
- | * Compiler can rearrange basic blocks/instruction | ||
- | * Basic block | ||
- | * Benefits of having large basic block | ||
- | * Entry/Exit | ||
- | * Handling entries/exits | ||
- | * Trace cache | ||
- | * How to ensure correctness? | ||
- | * Profiling | ||
- | * Fixing up the instruction order to ensure correctness | ||
- | * Dealing with multiple entries into the block | ||
- | * Dealing with multiple exits into the block | ||
- | * Super block | ||
- | * How to form super blocks? | ||
- | * Benefit of super block | ||
- | * Tradeoff between not forming a super block and forming a super block | ||
- | * Ambiguous branch (after profiling, both taken/not taken are equally likely) | ||
- | * Cleaning up | ||
- | * What scenario would make trace cache/superblock/profiling less effective? | ||
- | * List scheduling | ||
- | * Help figuring out which instructions VLIW should fetch | ||
- | * Try to maximize instruction throughput | ||
- | * How to assign priorities | ||
- | * What if some instructions take longer than others | ||
- | * Block structured ISA (BS-ISA) | ||
- | * Problems with trace scheduling? | ||
- | * What type of program will benefit from BS-ISA | ||
- | * How to form blocks in BS-ISA? | ||
- | * Combining basic blocks | ||
- | * multiples of merged basic blocks | ||
- | * How to deal with entries/exits in BS-ISA? | ||
- | * undo the executed instructions from the entry point, then fetch the new block | ||
- | * Advantages over trace cache | ||
- | * Benefit of VLIW + Static instruction scheduling | ||
- | * Intel IA-64 | ||
- | * Static instruction scheduling and VLIW | ||
- | |||
- | ===== Lecture 19 (3/19 Wed.) ===== | ||
+ | ===== Lecture 16 (2/23 Mon.) ===== | ||
+ | * Systolic arrays | ||
+ | * Processing elements transform data in chains | ||
+ | * Can be arrays of multi-dimensional processing elements | ||
+ | * Developed for image processing (for example, convolution) | ||
+ | * Can be used to break stages in pipeline programs, using a set of queues and processing elements | ||
+ | * Can enable high concurrency and good for regular programs | ||
+ | * Very special purpose | ||
+ | * The Warp computer | ||
+ | * Static instruction scheduling | ||
+ | * How do we find the next instruction to execute? | ||
+ | * Live-in and live-out | ||
+ | * Basic blocks | ||
+ | * Rearranging instructions in the basic block | ||
+ | * Code movement from one basic block to another | ||
+ | * Straight line code | ||
+ | * Independent instructions | ||
+ | * How to identify independent instructions | ||
+ | * Atomicity | ||
+ | * Trace scheduling | ||
+ | * Side entrance | ||
+ | * Fixed up code | ||
+ | * How scheduling is done | ||
+ | * Instruction scheduling | ||
+ | * Prioritization heuristics | ||
+ | * Superblock | ||
+ | * Traces with no side-entrance | ||
+ | * Hyperblock | ||
+ | * BS-ISA | ||
+ | * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA | ||
+ | | ||
+ | ===== Lecture 17 (2/25 Wed.) ===== | ||
+ | * IA-64 | ||
+ | * EPIC | ||
+ | * IA-64 instruction bundle | ||
+ | * Multiple instructions in the bundle along with the template bit | ||
+ | * Template bits | ||
+ | * Stop bits | ||
+ | * Non-faulting loads and exception propagation | ||
+ | * Aggressive ST-LD reordering | ||
+ | * Physical memory system | ||
+ | * Ideal pipelines | ||
* Ideal cache | * Ideal cache | ||
* More capacity | * More capacity | ||
* DRAM cell | * DRAM cell | ||
* Cheap | * Cheap | ||
- | * Sense the purturbation through sense amplifier | + | * Sense the perturbation through sense amplifier |
* Slow and leaky | * Slow and leaky | ||
* SRAM cell (Cross coupled inverter) | * SRAM cell (Cross coupled inverter) | ||
* Memory bank | * Memory bank | ||
* Read access sequence | * Read access sequence | ||
- | * DRAM: Activate -> Read -> Precharge (if needed) | + | * DRAM: Activate -> Read -> Precharge (if needed) |
- | * What dominate the access laatency for DRAM and SRAM | + | * What dominates the access latency for DRAM and SRAM |
* Scaling issue | * Scaling issue | ||
* Hard to scale the cell to be small | * Hard to scale the cell to be small | ||
* Cost and benefit of having more associativity | * Cost and benefit of having more associativity | ||
* Given the associativity, which block should be replace if it is full | * Given the associativity, which block should be replace if it is full | ||
- | * Replacement poligy | + | * Replacement policy |
* Random | * Random | ||
* Least recently used (LRU) | * Least recently used (LRU) | ||
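The LRU bookkeeping above (and why true LRU gets costly, motivating approximate LRU) can be sketched for one cache set. This is an illustrative software model, not a hardware design; class and tag names are made up.

```python
# Minimal model of TRUE LRU replacement for one cache set. Keeping the exact
# access ordering is the expensive part in hardware; real caches approximate it.
from collections import OrderedDict

class LRUSet:
    def __init__(self, associativity):
        self.assoc = associativity
        self.blocks = OrderedDict()            # tag -> data, ordered by recency

    def access(self, tag):
        if tag in self.blocks:                 # hit: promote to most-recently-used
            self.blocks.move_to_end(tag)
            return True
        if len(self.blocks) == self.assoc:
            self.blocks.popitem(last=False)    # evict the least-recently-used block
        self.blocks[tag] = None                # insert new block as MRU
        return False

s = LRUSet(associativity=2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
# "B" is evicted when "C" is inserted (A was promoted), so the last access misses
```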
* Approximate LRU | * Approximate LRU | ||
* Victim and next Victim policy | * Victim and next Victim policy | ||
- | + | ||
- | ===== Lecture 20 (3/21 Fri.) ===== | + | ===== Lecture 18 (2/27 Fri.) ===== |
+ | * Tag store and data store | ||
+ | * Cache hit rate | ||
+ | * Average memory access time (AMAT) | ||
+ | * AMAT vs. Stall time | ||
+ | * Cache basics | ||
+ | * Direct mapped vs. associative cache | ||
+ | * Set/block (line)/Placement/replacement | ||
+ | * How do tag and index get used? | ||
+ | * Full associativity | ||
+ | * Set associative cache | ||
+ | * insertion, promotion, eviction (replacement) | ||
+ | * Various replacement policies | ||
+ | * How to implement LRU | ||
+ | * How to keep track of access ordering | ||
+ | * Complexity increases rapidly | ||
+ | * Approximate LRU | ||
+ | * Victim and next Victim policy | ||
* Set thrashing | * Set thrashing | ||
* Working set is bigger than the associativity | * Working set is bigger than the associativity | ||
* Is this optimal? | * Is this optimal? | ||
* Complexity? | * Complexity? | ||
- | * Similarity between cache and page table | + | * DRAM as a cache for disk |
- | * Number of blocks vs pages | + | |
- | * Time to find the block/page to replace | + | |
* Handling writes | * Handling writes | ||
* Write through | * Write through | ||
* In the first level cache | * In the first level cache | ||
* Cache access | * Cache access | ||
- | * First level access | + | * First level access |
* Second level access | * Second level access | ||
* When to start the second level access | * When to start the second level access | ||
- | * Performance vs. energy | + | * Cache performance |
- | * Address translation | + | * capacity |
- | * Homonym and Synonyms | + | * block size |
- | * Homonym: Same VA but maps to different PA | + | * associativity |
- | * With multiple processes | + | * Classification of cache misses |
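The cache performance terms above (AMAT, two-level access) can be made concrete with a small worked example; all latencies and miss rates below are assumed for illustration.

```python
# Average memory access time (AMAT), applied recursively over two cache levels:
# AMAT = hit_time + miss_rate * miss_penalty

def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

l2 = amat(hit_time=10, miss_rate=0.2, miss_penalty=200)   # L2 misses go to memory
l1 = amat(hit_time=1, miss_rate=0.1, miss_penalty=l2)     # L1 misses go to L2
# l2 = 50 cycles, l1 = 6 cycles on average (with these assumed numbers)
```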
- | * Synonyms: Multiple VAs map to the same PA | + | ===== Lecture 19 (03/02 Mon.) ===== |
- | * Shared libraries, shared data, copy-on-write | + | |
- | * I/O | + | |
- | * Can these create problems when we have the cache | + | |
- | * How to eliminate these problems? | + | |
- | * Page coloring | + | |
- | * Interaction between cache and TLB | + | |
- | * Virtually indexed vs. physically indexed | + | |
- | * Virtually tagged vs. physically tagged | + | |
- | * Virtually indexed physically tagged | + | |
- | * Virtual memory in DRAM | + | |
- | * Control where data is mapped to in channel/rank/bank | + | |
- | * More parallelism | + | |
- | * Reduce interference | + | |
- | + | ||
- | ===== Lecture 21 (3/24 Mon.) ===== | + | |
- | + | ||
- | + | ||
- | + | ||
- | * Different parameters that affect cache miss | + | |
- | * Thrashing | + | |
- | * Different types of cache misses | + | |
- | * Compulsory misses | + | |
- | * Can mitigate with prefetches | + | |
- | * Capacity misses | + | |
- | * More assoc | + | |
- | * Victim cache | + | |
- | * Conflict misses | + | |
- | * Hashing | + | |
- | * Large block vs. small block | + | |
* Subblocks | * Subblocks | ||
* Victim cache | * Victim cache | ||
* What information goes into the MSHR? | * What information goes into the MSHR? | ||
* When do you access the MSHR? | * When do you access the MSHR? | ||
+ | * Memory banks | ||
+ | * Shared caches in multi-core processors | ||
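The MSHR questions above (what goes into an entry, when it is accessed) can be sketched as a small model. This is a hypothetical sketch: entry format and return values are made up for illustration.

```python
# MSHR sketch: one entry per outstanding miss block. Secondary misses to the
# same block are merged instead of sending duplicate requests to memory.

class MSHRFile:
    def __init__(self, num_entries):
        self.num_entries = num_entries
        self.entries = {}                      # block_addr -> waiting targets

    def miss(self, block_addr, target):
        """'primary' = new request to memory, 'secondary' = merged,
        'stall' = structural hazard (no free MSHR)."""
        if block_addr in self.entries:
            self.entries[block_addr].append(target)
            return "secondary"
        if len(self.entries) == self.num_entries:
            return "stall"
        self.entries[block_addr] = [target]
        return "primary"

    def fill(self, block_addr):
        return self.entries.pop(block_addr)    # data returned: wake all merged targets

m = MSHRFile(num_entries=2)
r = [m.miss(0x40, "ld1"), m.miss(0x40, "ld2"), m.miss(0x80, "st1"), m.miss(0xC0, "ld3")]
# -> ['primary', 'secondary', 'primary', 'stall']
```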
+ | ===== Lecture 20 (03/04 Wed.) ===== | ||
+ | * Virtual vs. physical memory | ||
+ | * System's management on memory | ||
+ | * Benefits | ||
+ | * Problem: physical memory has limited size | ||
+ | * Mechanisms: indirection, virtual addresses, and translation | ||
+ | * Demand paging | ||
+ | * Physical memory as a cache | ||
+ | * Tasks of system SW for VM | ||
+ | * Serving a page fault | ||
+ | * Address translation | ||
+ | * Page table | ||
+ | * PTE (page table entry) | ||
+ | * Page replacement algorithm | ||
+ | * CLOCK algo. | ||
+ | * Inverted page table | ||
+ | * Page size trade-offs | ||
+ | * Protection | ||
+ | * Multi-level page tables | ||
+ | * x86 implementation of page table | ||
+ | * TLB | ||
+ | * Handling misses | ||
+ | * When to do address translation? | ||
+ | * Homonym and Synonyms | ||
+ | * Homonym: Same VA but maps to different PA with multiple processes | ||
+ | * Synonyms: Multiple VAs map to the same PA | ||
+ | * Shared libraries, shared data, copy-on-write | ||
+ | * Virtually indexed vs. physically indexed | ||
+ | * Virtually tagged vs. physically tagged | ||
+ | * Virtually indexed physically tagged | ||
+ | * Can these create problems when we have the cache | ||
+ | * How to eliminate these problems? | ||
+ | * Page coloring | ||
+ | * Interaction between cache and TLB | ||
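The address-translation path above (page table, PTE, TLB miss handling) can be sketched minimally, assuming 4 KB pages and a flat page-table dict; real x86 tables are multi-level, so this is only illustrative.

```python
# Virtual-to-physical translation with a tiny TLB. On a TLB miss, walk the
# (here: flat, illustrative) page table and fill the TLB.

PAGE_SIZE = 4096

def translate(vaddr, page_table, tlb):
    vpn, offset = vaddr // PAGE_SIZE, vaddr % PAGE_SIZE
    if vpn in tlb:                       # TLB hit: no page table walk needed
        return tlb[vpn] * PAGE_SIZE + offset
    ppn = page_table[vpn]                # TLB miss: walk the page table (PTE lookup)
    tlb[vpn] = ppn                       # fill the TLB for next time
    return ppn * PAGE_SIZE + offset

page_table = {0: 7, 1: 3}
tlb = {}
paddr = translate(0x1ABC, page_table, tlb)   # VPN 1 -> PPN 3, offset 0xABC
```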
- | ===== Lecture 22 (3/26 Wed.) ===== | + | ===== Lecture 21 (03/23 Mon.) ===== |
- | + | ||
+ | * DRAM scaling problem | ||
+ | * Demands/trends affecting the main memory | ||
+ | * More capacity | ||
+ | * Low energy | ||
+ | * More bandwidth | ||
+ | * QoS | ||
+ | * ECC in DRAM | ||
* Multi-porting | * Multi-porting | ||
* Virtual multi-porting | * Virtual multi-porting | ||
* True multiporting | * True multiporting | ||
* Multiple cache copies | * Multiple cache copies | ||
+ | * Alignment | ||
* Banking | * Banking | ||
* Can have bank conflict | * Can have bank conflict | ||
* Address mapping can mitigate bank conflict | * Address mapping can mitigate bank conflict | ||
* Common in main memory (note that regFile in GPU is also banked, but mainly for the purpose of reducing complexity) | * Common in main memory (note that regFile in GPU is also banked, but mainly for the purpose of reducing complexity) | ||
+ | * Bank mapping | ||
+ | * How to avoid bank conflicts? | ||
+ | * Channel mapping | ||
+ | * Address mapping to minimize bank conflict | ||
+ | * Page coloring | ||
+ | * Virtual to physical mapping that can help reducing conflicts | ||
* Accessing DRAM | * Accessing DRAM | ||
* Row bits | * Row bits | ||
* Addressability | * Addressability | ||
* DRAM has its own clock | * DRAM has its own clock | ||
+ | * Sense amplifier | ||
+ | * Bit lines | ||
+ | * Word lines | ||
* DRAM (2T) vs. SRAM (6T) | * DRAM (2T) vs. SRAM (6T) | ||
* Cost | * Cost | ||
* DRAM Channel | * DRAM Channel | ||
* An interface to DRAM, each with its own ranks/banks | * An interface to DRAM, each with its own ranks/banks | ||
+ | * DRAM Chip | ||
* DIMM | * DIMM | ||
* More DIMM adds the interconnect complexity | * More DIMM adds the interconnect complexity | ||
* Send requests whenever the request can be sent to the bank | * Send requests whenever the request can be sent to the bank | ||
* Determine which command (across banks) should be sent to DRAM | * Determine which command (across banks) should be sent to DRAM | ||
- | * Priority of demand vs. prefetch requests | ||
- | * Memory scheduling policies | ||
- | * FCFS | ||
- | * FR-FCFS | ||
- | * Capped FR-FCFS: FR-FCFS with a timeout | ||
- | * Usually this is done in a command level (read/write commands and precharge/activate commands) | ||
- | | ||
- | | ||
- | ===== Lecture 23 (3/28 Fri.) ===== | ||
+ | ===== Lecture 22 (03/25 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Flash controller | ||
+ | * Flash memory | ||
+ | * Garbage collection in flash | ||
+ | * Overhead in flash memory | ||
+ | * Erase (off the critical path, but takes a long time) | ||
+ | * Different types of DRAM | ||
* DRAM design choices | * DRAM design choices | ||
* Cost/density/latency/BW/Yield | * Cost/density/latency/BW/Yield | ||
* Benefit of TL-DRAM | * Benefit of TL-DRAM | ||
* TL-DRAM vs. DRAM cache (adding a small cache in DRAM) | * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) | ||
- | + | * List of commands to read/write data into DRAM | |
- | + | * Activate -> read/write -> precharge | |
- | + | * Activate moves data into the row buffer | |
- | + | * Precharge prepare the bank for the next access | |
- | ===== Lecture 24 (3/31 Mon.) ===== | + | * Row buffer hit |
- | + | * Row buffer conflict | |
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32 bits from an 8-bit interface if DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
* Memory controller | * Memory controller | ||
- | * Different commands | + | * Sending DRAM commands |
- | * Memory scheduler | + | * Periodically send commands to refresh DRAM cells |
- | * Determine the order of requests to be issued to DRAM | + | * Ensure correctness and data integrity |
- | * Age/hit-miss status/types(load/store/prefetch/from GPU/from CPU)/criticality | + | * Where to place the memory controller |
- | * Row buffer | + | * On CPU chip vs. at the main memory |
- | * hit/conflict | + | * Higher BW on-chip |
- | * open/closed row | + | * Determine the order of requests that will be serviced in DRAM |
- | * Open row policy | + | * Request queues that hold requests |
- | * Closed row policy | + | * Send requests whenever the request can be sent to the bank |
- | * Tradeoffs between open and closed row policy | + | * Determine which command (across banks) should be sent to DRAM |
- | * What if the programs has high row buffer locality: open row might benefit more | + | * Priority of demand vs. prefetch requests |
- | * Closed row will service misses request faster | + | * Memory scheduling policies |
- | * Bank conflict | + | * FCFS |
- | * Interference from different applications/threads | + | * FR-FCFS |
- | * Differnt programs/processes/threads interfere with each other | + | * Try to maximize row buffer hit rate |
- | * introduce more row buffer/bank conflicts | + | * Capped FR-FCFS: FR-FCFS with a timeout |
- | * Memory schedule has to manage these interference | + | * Usually this is done in a command level (read/write commands and precharge/activate commands) |
- | * Memory hog problems | + | * PAR-BS |
- | * Interference in the data/command bus | + | * Key benefits |
- | * FR-FCFS | + | * stall time |
- | * Why does FR-FCFS make sense? | + | * shortest job first |
- | * Row buffer has lower lantecy | + | * STFM |
- | * Issues with FR-FCFS | + | * ATLAS |
- | * Unfairness | + | * TCM |
- | * STFM | + | * Key benefits |
- | * Fairness issue in memory scheduling | + | * Configurability |
- | * How does STFM calculate the fairness and slowdown | + | * Fairness + performance at the same time |
- | * How to estimate slowdown time when it is runing alone | + | * Robustness issues |
- | * Definition of fairness (based on STFM, different papers/areas define fairness differently) | + | * Open row policy |
- | * PAR-BS | + | * Closed row policy |
- | * Parallelism in programs | + | * QoS |
- | * Intereference across banks | + | * QoS issues in memory scheduling |
- | * How to form a batch | + | * Fairness |
- | * How to determine ranking between batches/within a batch | + | * Performance guarantee |
- | | + | |
- | + | ||
- | + | ||
- | ===== Lecture 25 (2/2 Wed.) ===== | + | |
+ | ===== Lecture 23 (03/27 Fri.) ===== | ||
- | * Latency sensitivity | ||
- | * Performance drops a lot when the memory request latency is long | ||
- | * TCM | ||
- | * Tradeoff between throughput and fairness | ||
- | * Latency sensitive cluster (non-intensive cluster) | ||
- | * Ranking based on memory intensity | ||
- | * Bandwidth intensive cluster | ||
- | * Round robin within the cluster | ||
- | * Generally latency sensitive cluster has more priority | ||
- | * Provide robust fairness vs. throughput | ||
- | * Complexity of TCM? | ||
* Different ways to control interference in DRAM | * Different ways to control interference in DRAM | ||
* Partitioning of resource | * Partitioning of resource | ||
* Self-optimizing controller | * Self-optimizing controller | ||
* Use machine learning to improve DRAM controller | * Use machine learning to improve DRAM controller | ||
+ | * A-DRM | ||
+ | * Architecture aware DRAM | ||
+ | * Multithread | ||
+ | * synchronization | ||
+ | * Pipeline programs | ||
+ | * Producer consumer model | ||
+ | * Critical path | ||
+ | * Limiter threads | ||
+ | * Prioritization between threads | ||
+ | * Different power mode in DRAM | ||
* DRAM Refresh | * DRAM Refresh | ||
* Why does DRAM have to refresh every 64ms | * Why does DRAM have to refresh every 64ms | ||
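The refresh-overhead question above can be answered back-of-envelope: every row must be refreshed once per 64 ms retention window. All parameter values below are assumed for illustration.

```python
# Back-of-envelope DRAM refresh overhead with illustrative parameters.

rows_per_bank = 8192
refresh_latency_ns = 100          # time one per-row refresh occupies the bank
retention_ms = 64                 # each row must be refreshed within this window

busy_ns = rows_per_bank * refresh_latency_ns        # 819,200 ns spent refreshing
overhead = busy_ns / (retention_ms * 1_000_000)     # fraction of bank time lost
# ~1.3% with these numbers; the overhead grows as DRAM density (row count) grows
```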
* Better/more hash function helps eliminate this | * Better/more hash function helps eliminate this | ||
| | ||
- | + | ===== Lecture 24 (03/30 Mon.) ===== | |
- | + | ||
- | ===== Lecture 26 (4/7 Mon.) ===== | + | |
- | + | ||
+ | * Simulation | ||
+ | * Drawbacks of RTL simulations | ||
+ | * Time consuming | ||
+ | * Complex to develop | ||
+ | * Hard to perform design explorations | ||
+ | * Explore the design space quickly | ||
+ | * Match the behavior of existing systems | ||
+ | * Tradeoffs: speed, accuracy, flexibility | ||
+ | * High-level simulation vs. detailed simulation | ||
+ | * High-level simulation is faster, but lower accuracy | ||
+ | * Controllers that work on multiple types of cores | ||
+ | * Design problems: how to find a good scheduling policy on its own? | ||
+ | * Self-optimizing memory controller: using machine learning | ||
+ | * Can adapt to the applications | ||
+ | * The complexity is very high | ||
* Tolerate latency can be costly | * Tolerate latency can be costly | ||
* Instruction window is complex | * Instruction window is complex | ||
* Execute future instruction to generate accurate prefetches | * Execute future instruction to generate accurate prefetches | ||
* Allow future data to be in the cache | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoint the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | |||
+ | |||
+ | ===== Lecture 25 (4/1 Wed.) ===== | ||
+ | |||
+ | * More Runahead executions | ||
* How to support runahead execution? | * How to support runahead execution? | ||
* Need a way to checkpoint the state when entering runahead mode | * Need a way to checkpoint the state when entering runahead mode | ||
* Stride predictor | * Stride predictor | ||
* AVD prediction | * AVD prediction | ||
- | | ||
- | ===== Lecture 28 (4/9 Wed.) ===== | ||
- | |||
- | |||
* Questions regarding prefetching | * Questions regarding prefetching | ||
* What to prefetch | * What to prefetch | ||
* Timeliness | * Timeliness | ||
* How much memory latency can we hide from the prefetch requests | * How much memory latency can we hide from the prefetch requests | ||
+ | * Cache pollution | ||
+ | * How many demand misses did the prefetcher cause? | ||
+ | * Hard to quantify | ||
+ | |||
+ | |||
+ | ===== Lecture 26 (4/3 Fri.) ===== | ||
+ | |||
* Feedback directed prefetcher | * Feedback directed prefetcher | ||
* Use the result of the prefetcher as a feedback to the prefetcher | * Use the result of the prefetcher as a feedback to the prefetcher | ||
* Not very efficient (hard to figure out which block is the pointer) | * Not very efficient (hard to figure out which block is the pointer) | ||
* Software can give hints | * Software can give hints | ||
- | + | * Correlation table | |
- | ===== Lecture 28 (4/14 Mon.) ===== | + | * Address correlation |
- | + | ||
* Execution based prefetcher | * Execution based prefetcher | ||
* Helper thread/speculative thread | * Helper thread/speculative thread | ||
* How do you construct the helper thread | * How do you construct the helper thread | ||
* Preexecute instruction (one example of how to initialize a speculative thread), slide 9 | * Preexecute instruction (one example of how to initialize a speculative thread), slide 9 | ||
- | * Benefit of multiprocessor | + | * Thread-based pre-execution |
- | * Improve performace without not significantly increase power consumption | + | * Error tolerance |
+ | * Solution to errors | ||
+ | * Tolerate errors | ||
+ | * New interface, new design | ||
+ | * Eliminate or minimize errors | ||
+ | * New technology, system-wide rethinking | ||
+ | * Embrace errors | ||
+ | * Map data that can tolerate errors to error-prone area | ||
+ | * Hybrid memory systems | ||
+ | * Combining multiple memory technology together | ||
+ | * What can emerging technology help? | ||
+ | * Scalability | ||
+ | * Lower the cost | ||
+ | * Energy efficiency | ||
+ | * Possible solutions to the scaling problem | ||
+ | * Less leakage DRAM | ||
+ | * Heterogeneous DRAM (TL-DRAM, etc.) | ||
+ | * Add more functionality to DRAM | ||
+ | * Denser design (3D stack) | ||
+ | * Different technology | ||
+ | * NVM | ||
+ | * Charge vs. resistive memory | ||
+ | * How data is written? | ||
+ | * How to read the data? | ||
+ | * Non volatile memory | ||
+ | * Resistive memory | ||
+ | * PCM | ||
+ | * Inject current to change the phase | ||
+ | * Scales better than DRAM | ||
+ | * Multiple bits per cell | ||
+ | * Wider resistance range | ||
+ | * No refresh is needed | ||
+ | * Downside: Latency and write endurance | ||
+ | * STT-MRAM | ||
+ | * Inject current to change the polarity | ||
+ | * Memristor | ||
+ | * Inject current to change the structure | ||
+ | * Pros and cons between different technologies | ||
+ | * Persistency - data stay there even without power | ||
+ | * Unified memory and storage management (persistent data structure) - Single level store | ||
+ | * Improve energy and performance | ||
+ | * Simplify programming model | ||
+ | * Different design options for DRAM + NVM | ||
+ | * DRAM as a cache | ||
+ | * Place some data in DRAM and other in PCM | ||
+ | * Based on the characteristics | ||
+ | * Frequently accessed data that need lower write latency in DRAM | ||
+ | |||
+ | |||
+ | ===== Lecture 27 (4/6 Mon.) ===== | ||
+ | * Flynn's taxonomy | ||
+ | * Parallelism | ||
+ | * Reduces power consumption (P ~ CV^2F) | ||
* Better cost efficiency and easier to scale | * Better cost efficiency and easier to scale | ||
- | * Improve dependability (in case the other core is faulty | + | * Improves dependability (in case the other core is faulty |
* Different types of parallelism | * Different types of parallelism | ||
* Instruction level parallelism | * Instruction level parallelism | ||
* Partition a single, potentially big, task into multiple parallel sub-task | * Partition a single, potentially big, task into multiple parallel sub-task | ||
* Can be done explicitly (parallel programming by the programmer) | * Can be done explicitly (parallel programming by the programmer) | ||
- | * Or implicitly (hardware partition a single thread specilatively) | + | * Or implicitly (hardware partitions a single thread speculatively) |
- | * Or, run multiple independent tasks (still improve throughput, but the speedup of any single tasks are not better, also simpler to implement) | + | * Or, run multiple independent tasks (still improves throughput, but the speedup of any single tasks is not better, also simpler to implement) |
* Loosely coupled multiprocessor | * Loosely coupled multiprocessor | ||
* No shared global address space | * No shared global address space | ||
* Message passing to communicate between different sources | * Message passing to communicate between different sources | ||
* Simple to manage memory | * Simple to manage memory | ||
- | * Tightly coupled multiprocesor | + | * Tightly coupled multiprocessor |
* Shared global address space | * Shared global address space | ||
* Need to ensure consistency of data | * Need to ensure consistency of data | ||
- | * Switch on event multithreading | + | * Programming issues |
- | * Switch to another context when there is an event (for example, a cache miss) | + | * Hardware-based multithreading |
- | * Simulteneous multithreading | + | * Coarse grained |
- | * Dispatch instruction from multiple threads at the same time | + | * Find grained |
+ | * Simultaneous: Dispatch instruction from multiple threads at the same time | ||
+ | * Parallel speedup | ||
+ | * Superlinear speedup | ||
+ | * Utilization, Redundancy, Efficiency | ||
* Amdahl's law | * Amdahl's law | ||
* Maximum speedup | * Maximum speedup | ||
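Amdahl's law and the maximum-speedup point above reduce to one formula: with parallelizable fraction p on n processors, speedup = 1 / ((1 - p) + p/n), which is capped at 1/(1 - p). The numbers below are illustrative.

```python
# Amdahl's law: the serial fraction (1 - p) bounds the achievable speedup.

def amdahl_speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

s16 = amdahl_speedup(p=0.9, n=16)        # 6.4x with 16 processors
limit = amdahl_speedup(p=0.9, n=10**9)   # approaches 1/(1 - p) = 10x, never exceeds it
```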
* Load balance | * Load balance | ||
* Some threads have more work and require more time to reach the sync. point | * Some threads have more work and require more time to reach the sync. point | ||
- | * Issue in parallel programming | + | * Critical sections |
+ | * Enforce mutually exclusive access to shared data | ||
+ | * Issues in parallel programming | ||
* Correctness | * Correctness | ||
* Synchronization | * Synchronization | ||
* Consistency | * Consistency | ||
- | | ||
- | ===== Lecture 29 (4/16 Wed.) ===== | ||
- | | ||
+ | ===== Lecture 28 (4/8 Wed.) ===== | ||
* Ordering of instructions | * Ordering of instructions | ||
* Maintaining memory consistency when there are multiple threads and shared memory | * Maintaining memory consistency when there are multiple threads and shared memory | ||
* Need to ensure the semantic is not changed | * Need to ensure the semantic is not changed | ||
- | * Making sire the shared data is properly locked when used | + | * Making sure the shared data is properly locked when used |
* Support mutual exclusion | * Support mutual exclusion | ||
* Ordering depends on when each processor is executed | * Ordering depends on when each processor is executed | ||
* Debugging is also difficult (non-deterministic behavior) | * Debugging is also difficult (non-deterministic behavior) | ||
+ | * Dekker's algorithm | ||
+ | * Inconsistency -- the two processors did NOT see the same order of operations to memory | ||
+ | * Sequential consistency | ||
+ | * Multiple correct global orders | ||
+ | * Two issues: | ||
+ | * Too conservative/strict | ||
+ | * Performance limiting | ||
* Weak consistency: global ordering when sync | * Weak consistency: global ordering when sync | ||
* programmer hints where the synchronizations are | * programmer hints where the synchronizations are | ||
- | * Total store order model: global ordering only with store | + | * Memory fence |
+ | * More burden on the programmers | ||
* Cache coherence | * Cache coherence | ||
* Can be done in the software level or hardware level | * Can be done in the software level or hardware level | ||
- | * Coherence protocol | + | * Snoop-based coherence |
- | * Need to ensure that all the processors see and update the correct state of the cache block | + | * A simple protocol with two states by broadcasting reads/writes on a bus |
- | * Need to make sure that writes get propagated and serialized | + | * Maintaining coherence |
- | * Simple protocol are not scalable (one point of synchrnization) | + | * Needs to provide 1) write propagation and 2) write serialization |
- | * Update vs. invalidate | + | * Update vs. Invalidate |
- | * For invalidate, only the core that needs to read retains the correct copy | + | * Two cache coherence methods |
- | * Can lead to ping-ponging (tons of read/writes from several processors) | + | * Snoopy bus |
- | * For updates, bus becomes the bottleneck | + | * Bus based, single point of serialization |
- | * Snoopy bus | + | * More efficient with small number of processors |
- | * Bus based, single point of serialization | + | * Processors snoop other caches read/write requests to keep the cache block coherent |
- | * More efficient with small number of processors | + | * Directory |
- | * All cache snoop other caches read/write requests to keep the cache block coherent | + | * Single point of serialization per block |
- | * Directory based | + | * Directory coordinates the coherency |
- | * Single point of serialization per block | + | * More scalable |
- | * Directory coordinate the coherency | + | * The directory keeps track of where the copies of each block resides |
- | * More scalable | + | * Supplies data on a read |
- | * The directory keeps track of where the copies of each block resides | + | * Invalidates the block on a write |
- | * Supply data on a read | + | * Has an exclusive state |
- | * Invalide the block on a write | + | |
- | * Has an exclusive state | + | ===== Lecture 29 (4/13 Mon.) ===== |
* MSI coherence protocol | * MSI coherence protocol | ||
- | * Slide number 56-57 | + | * The problem: unnecessary broadcasts of invalidations |
- | * Consume bus bandwidth (need an "exclusive" state | + | * MESI coherent protocol |
- | * MESI coherent protocal | + | * MESI coherence protocol |
- | * Add the exclusive state: this is the only cache copy and it is clean state to MSI | + | * Multiple invalidation tradeoffs |
- | * Tradeoffs between snooping and directory based | + | * Problem: memory can be unnecessarily updated |
- | * Slide 71 has a good summary on this | + | * A possible owner state (MOESI) |
- | * MOESI | + | * Tradeoffs between snooping and directory based coherence protocols |
- | * Improvement over MESI protocol | + | * Slide 31 has a good summary |
+ | * Directory: data structures | ||
+ | * Bit vectors vs. linked lists | ||
+ | * Scalability of directories | ||
+ | * Size? Latency? Thousands of nodes? Best of both snooping and directory? |
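A toy transition table can make the MSI/MESI bullets concrete. This is a hedged sketch, not a full protocol: it omits data responses, write-backs, and transient states, and it simplifies the read-miss case that the added Exclusive state exists for.

```python
# Minimal MESI transition sketch for a single cache line.
# Events: local "read"/"write"; snooped "bus_read"/"bus_write" from others.
MESI = {
    ("I", "read"):      "S",  # simplification: real MESI enters E when no
                              # other cache has the block (the added state)
    ("I", "write"):     "M",  # read-for-ownership, then modify
    ("S", "write"):     "M",  # must first invalidate other sharers
    ("S", "bus_write"): "I",
    ("E", "write"):     "M",  # silent upgrade: no bus traffic needed
    ("E", "bus_read"):  "S",
    ("E", "bus_write"): "I",
    ("M", "bus_read"):  "S",  # supply the dirty data, demote to Shared
    ("M", "bus_write"): "I",
}

def step(state, event):
    # Events without an entry leave the state unchanged
    # (e.g. a local read hit in S, E, or M).
    return MESI.get((state, event), state)

state = "I"
for ev in ("read", "write", "bus_read"):
    state = step(state, ev)
print(state)   # I -> S -> M -> S
```

The E-to-M "silent upgrade" row is the bandwidth saving over MSI noted in the bullets: a write to an Exclusive line needs no bus transaction.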
- | + | ===== Lecture 30 (4/15 Wed.) ===== | |
- | ===== Lecture 29 (4/18 Wed.) ===== | + | |
- | + | * Application slowdown | |
- | + | * Interference between different applications | |
- | * Interference | + | * Applications' performance depends on other applications that they are running with |
- | * Complexity of the memory scheduler | + | |
- | * Ranking/prioritization has a cost | + | |
- | * Complex scheduler has higher latency | + | |
- | * Performance metrics for multicore/multithreaded applications | + | |
- | * Speedup | + | |
- | * Slowdown | + | |
- | * Harmonic vs. weighted | + | |
- | * Fairness metric | + | |
- | * Maximum slowdown | + | |
- | * Why does it make sense? | + | |
- | * Any scenario where it does not make sense? | + | |
* Predictable performance | * Predictable performance | ||
- | * Why is it important? | + | * Why are they important? |
- | * In a server environment, different jobs are on the same server | + | * Applications that need predictability |
- | * In a mobile environment, there are multiple sources that can slow down other sources | + | * How to predict the performance? |
- | * How to relate slowdown with request service rate | + | * What information is useful? |
- | * MISE: soft slowdown guarantee | + | * What needs to be guaranteed? |
- | * BDI | + | * How to estimate the performance when running with others? |
- | * Memory wall | + | * Easy, just measure the performance while it is running. |
- | * What is the concern regarding the memory wall? | + | * How to estimate the performance when the application is running by itself? |
- | * Size of the cache on the die (CPU die) | + | * Hard if there is no profiling. |
- | * One possible solution: cache compression | + | * The relationship between memory service rate and the performance. |
- | * What are the problems of existing cache compression mechanisms? | + | * Key assumption: applications are memory bound |
- | * Some are too complex | + | * Behavior of memory-bound applications |
- | * Decompression is in the critical path | + | * With and without interference |
- | * Need to decompress when reading the data -> decompression should not be in the critical path | + | * Memory phase vs. compute phase |
- | * Important factor to the performance | + | * MISE |
- | * Software compression is not good enough to compress everything | + | * Estimating slowdown using request service rate |
- | * Zero value compression | + | * Inaccuracy when measuring request service rate alone |
- | * Simple | + | * Non-memory-bound applications |
- | * Good compression ratio | + | * Control slowdown and provide soft guarantee |
- | * What if the data does not have many zeroes? | + | * Taking the shared cache into account |
- | * Frequent value compression | + | * MISE model + cache resource management |
- | * Some data appear frequently | + | * Auxiliary tag store |
- | * Simple and good compression ratio | + | * Separate tag store for different cores |
- | * Have to profile | + | * Cache access rate alone vs. shared as the metric to estimate slowdown |
- | * Decompression is complex | + | * Cache partitioning |
- | * Frequent pattern compression | + | * How to determine partitioning |
- | * Still too complex in terms of decompression | + | * Utility based cache partitioning |
- | * Base-delta compression | + | * Others |
- | * Easy to decompress but retains the benefit of compression | + | * Maximum slowdown and fairness metric |
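The MISE slowdown bullets above can be written out as a small formula: for a memory-bound application, slowdown is roughly the alone request service rate divided by the shared request service rate, and the memory-phase fraction generalizes this to non-memory-bound applications. A sketch with made-up numbers:

```python
# Sketch of a MISE-style slowdown estimate (the rates below are invented).
# The alone request service rate is sampled by periodically giving the
# application highest priority at the memory controller.

def mise_slowdown(rate_alone, rate_shared, alpha=1.0):
    """alpha = fraction of time in the memory phase (1.0 = memory-bound).

    Non-memory-bound applications only slow down during their memory
    phase, hence the (1 - alpha) compute-phase term.
    """
    return (1 - alpha) + alpha * (rate_alone / rate_shared)

print(mise_slowdown(10.0, 4.0))             # memory-bound
print(mise_slowdown(10.0, 4.0, alpha=0.5))  # half compute phase
```

A soft slowdown guarantee then becomes a control problem: adjust an application's memory scheduling priority until this estimate stays below the target.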
- | | + | |
| | ||
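The base-delta idea from the compression bullets (easy to decompress while keeping a good compression ratio) can be sketched in a few lines. This is a deliberate simplification of the actual BDI design, which also supports multiple bases and zero/immediate encodings; the function names are illustrative.

```python
# Toy base+delta compression of a cache line of word values.

def bd_compress(words, delta_bits=8):
    """Return (base, deltas) if every word fits as base + small delta."""
    base = words[0]
    deltas = [w - base for w in words]
    limit = 1 << (delta_bits - 1)
    if all(-limit <= d < limit for d in deltas):
        return base, deltas          # compressed representation
    return None                      # fall back to storing uncompressed

def bd_decompress(base, deltas):
    # Decompression is just a vector add -- cheap enough to keep
    # off the critical path of a read.
    return [base + d for d in deltas]

line = [0x1000, 0x1004, 0x1008, 0x1010]   # nearby pointers: small deltas
base, deltas = bd_compress(line)
assert bd_decompress(base, deltas) == line
print(deltas)   # -> [0, 4, 8, 16]
```

The low-dynamic-range pattern (pointers, array indices, narrow values) is exactly what makes this work where zero-value or frequent-value compression would miss.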
- | ===== Lecture 31 (4/28 Mon.) ===== | ||
- | |||
- | * Directory based cache coherence | ||
- | * Each directory has to handle validation/invalidation | ||
- | * Extra cost of synchronization | ||
- | * Need to ensure race conditions are resolved | ||
- | * Interconnection | ||
- | * Topology | ||
- | * Bus | ||
- | * Mesh | ||
- | * Torus | ||
- | * Tree | ||
- | * Butterfly | ||
- | * Ring | ||
- | * Bi-directional ring | ||
- | * More scalable | ||
- | * Hierarchical ring | ||
- | * Even more scalable | ||
- | * More complex | ||
- | * Crossbar | ||
- | * etc. | ||
- | * Circuit switching | ||
- | * Multistage network | ||
- | * Butterfly | ||
- | * Delta network | ||
- | * Handling contention | ||
- | * Buffering vs. dropping/deflection (no buffering) | ||
- | * Routing algorithm | ||
- | * Handling deadlock | ||
- | * X-Y routing | ||
- | * Turn model (to avoid deadlocks) | ||
- | * Add more buffering for an escape path | ||
- | * Oblivious routing | ||
- | * Can take different paths | ||
- | * DOR between each intermediate location | ||
- | * Balance network load | ||
- | * Adaptive routing | ||
- | * Use the state of the network to determine the route | ||
- | * Aware of local and/or global congestions | ||
- | * Non-minimal adaptive routing can have livelocks | ||
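The X-Y routing bullet in the deadlock-handling list can be illustrated with a short sketch: route fully along the X dimension first, then along Y. Forbidding Y-then-X turns removes cyclic channel dependences, which is how this dimension-order scheme avoids deadlock.

```python
# Deterministic X-Y (dimension-order) routing on a 2D mesh.
# Nodes are (x, y) coordinates; the route is the list of hops taken.

def xy_route(src, dst):
    (x, y), (dx, dy) = src, dst
    hops = []
    while x != dx:                 # traverse the X dimension first
        x += 1 if dx > x else -1
        hops.append((x, y))
    while y != dy:                 # then the Y dimension
        y += 1 if dy > y else -1
        hops.append((x, y))
    return hops

print(xy_route((0, 0), (2, 1)))   # -> [(1, 0), (2, 0), (2, 1)]
```

Oblivious routing as described in the bullets can reuse this routine: pick a random intermediate node, then run dimension-order routing (DOR) to it and again from it to the destination.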
- | |||
| | ||