====== Buzzwords ======

Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.

===== Lecture 1 (1/13 Mon.) =====
* Level of transformation
* Algorithm
* Compiler
* Cross abstraction layers
* Expose an interface
* Tradeoffs
* Caches
* Multi-thread
* Multi-core
* Unfairness
* DRAM controller/Memory controller
* DRAM banks
* Memory performance hog
* Shared DRAM memory system
* Row buffer hit/miss
* Row buffer locality
* Streaming access vs. random access
* Memory scheduling policies
* Scheduling priority
* DRAM refresh
* Retention time of DRAM
* Process variation
* Profiling DRAM retention time
* Power consumption
* Wimpy cores
* Bloom filter (see the sketch at the end of this section)
* Pros/Cons
* False positives
* Hamming code
* Hamming distance
* DRAM row hammer
* Simulation
* Memory performance attacks
* RTL design
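
A minimal Bloom filter sketch in Python (the sizes and hash scheme below are made up for illustration, not taken from the lecture): it shows why a lookup can return a false positive but never a false negative.

<code python>
# Minimal Bloom filter sketch: k hash functions set/check k bits in a bit array.
# A query can be a false positive (all k bits set by other elements),
# but never a false negative. Parameters below are invented for illustration.
import hashlib

class BloomFilter:
    def __init__(self, m_bits=1024, k_hashes=3):
        self.m = m_bits
        self.k = k_hashes
        self.bits = [0] * m_bits

    def _positions(self, item):
        # Derive k bit positions from a hash of the item.
        digest = hashlib.sha256(item.encode()).digest()
        return [int.from_bytes(digest[4*i:4*i+4], 'little') % self.m
                for i in range(self.k)]

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def maybe_contains(self, item):
        # True means "possibly in the set"; False means "definitely not".
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("row 0x1A")
print(bf.maybe_contains("row 0x1A"))   # True
print(bf.maybe_contains("row 0xFF"))   # False (or a rare false positive)
</code>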

===== Lecture 2 (1/15 Wed.) =====
* Optimizing for energy vs. optimizing for performance
* Generally you should optimize for the users
* State-of-the-art
* RTL simulation
* Long, slow and can be costly
* High level simulation
* What should be employed?
* Important to get the idea of how they are implemented in RTL
* Allows the designer to filter out techniques that do not work well
* Design points
* Design processors to meet the design points
* Software stack
* Design decisions
* Datacenters
* MIPS R2000
* What architectural techniques improve the performance of a processor over the MIPS R2000?
* Moore's Law
* Algorithm --> step-by-step procedure to solve a problem
* In-order execution
* Out-of-order execution
* Scaling issue
* Transistors are getting smaller
* Key components of a computer
* Reliability problems that cause errors
* Analogies from Kuhn's "The Structure of Scientific Revolutions" (recommended book)
* Computation
* Communication
* Storage
* DRAM
* NVRAM (Non-volatile memory): PCM, STT-MRAM
* Storage (Flash/Hard drive)
* Von Neumann Model (Control flow model)
* Stored program computer
* Properties of the Von Neumann Model: stored program, sequential instruction processing
* Unified memory
* When is an instruction interpreted as an instruction (as opposed to a datum)?
* Program counter
* Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
* Data flow model
* Data flow machine
* Tradeoffs between control-driven and data-driven
* Which are easier to program?
* Which are easier to compile?
* Which are more parallel (and does that mean faster)?
* Which machines are more complex to design?
* In control flow, when a program is stopped, there is a pointer to the current state (precise state).
* ISA vs. Microarchitecture
* Semantics in the ISA
* uArch should obey the ISA
* Changing the ISA is costly, can affect compatibility.
* Instruction pointers
* uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
* Out-of-order execution
* etc.
* Design techniques
* Adder implementation (bit serial, ripple carry, carry lookahead) (see the sketch at the end of this section)
* Connection Machine (an example of a machine that uses bit-serial ALUs to trade off latency for more parallelism)
* Microprocessor: ISA + uArch + circuits
* What is part of the ISA? Instructions, memory, etc.
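
Where the list mentions adder implementations, a toy bit-level model helps; this sketch (not from the lecture) makes the ripple-carry chain explicit, which is the serial dependence that carry-lookahead adders shorten.

<code python>
# Ripple-carry adder sketch: each full adder waits for the carry from the
# previous bit, so the critical path grows linearly with the bit width.
def full_adder(a, b, carry_in):
    s = a ^ b ^ carry_in
    carry_out = (a & b) | (carry_in & (a ^ b))
    return s, carry_out

def ripple_carry_add(a_bits, b_bits):
    # a_bits/b_bits are lists of bits, least-significant bit first.
    carry = 0
    result = []
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)
        result.append(s)
    return result, carry

# 0b0110 (6) + 0b0011 (3) = 0b1001 (9)
print(ripple_carry_add([0, 1, 1, 0], [1, 1, 0, 0]))  # ([1, 0, 0, 1], 0)
</code>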

===== Lecture 3 (1/17 Fri.) =====
* Microarchitecture
* Three major tradeoffs of computer architecture
* Design tradeoffs
* Macro-architecture
* Reconfigurability vs. specialized designs
* Parallelism (instruction, data parallelism)
* LC-3b ISA
* Unused instructions
* Uniform decode (Example: Alpha)
* Steering bits (Sub-opcode)
* Bit steering
* Instruction processing style
* 0,1,2,3 address machines
* Stack machine
* Accumulator machine
* 2-operand machine
* 3-operand machine
* Tradeoffs between 0,1,2,3 address machines (see the sketch at the end of this section)
* Postfix notation
* Instructions/Opcode/Operand specifiers (i.e. addressing modes)
* Simple vs. complex data types (and their tradeoffs)
* Semantic gap and level
* Translation layer
* Addressability
* Byte/bit addressable machines
* Virtual memory
* Big/little endian
* Benefits of having registers (data locality)
* Programmer visible (Architectural) state
* Programmers can access this directly
* What are the benefits?
* Microarchitectural state
* Programmers cannot access this directly
* Evolution of registers (from accumulators to registers)
* Different types of instructions
* Control instructions
* Data instructions
* Operation instructions
* Addressing modes
* Tradeoffs (complexity, flexibility, etc.)
* Orthogonal ISA
* Addressing modes that are orthogonal to instruction types
* I/O devices
* Vectored vs. non-vectored interrupts
* Complex vs. simple instructions
* Tradeoffs
* RISC vs. CISC
* Tradeoffs
* Backward compatibility
* Performance
* Optimization opportunity
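
A rough illustration of the 0-address (stack machine) style mentioned above, using postfix notation; the tiny evaluator and variable names are hypothetical.

<code python>
# 0-address (stack) machine sketch: operands are implicit, so "D = A + B - C"
# becomes postfix "A B + C -" and every operation pops/pushes the stack.
# A 3-address machine would instead encode both sources and the destination
# in each instruction (e.g., ADD D, A, B), trading code size for fewer
# instructions and fewer stack shuffles.
def eval_postfix(tokens, env):
    stack = []
    for tok in tokens:
        if tok in ('+', '-', '*'):
            b = stack.pop()
            a = stack.pop()
            stack.append({'+': a + b, '-': a - b, '*': a * b}[tok])
        else:
            stack.append(env[tok])      # "push A" style load
    return stack.pop()

print(eval_postfix(['A', 'B', '+', 'C', '-'], {'A': 5, 'B': 7, 'C': 2}))  # 10
</code>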

===== Lecture 4 (1/22 Wed.) =====
* Semantic gap
* Small vs. large semantic gap (CISC vs. RISC)
* Benefits of RISC vs. CISC
* Micro-operations/microcode
* Translate complex instructions into smaller instructions
* Parallelism (motivation for RISC)
* Compiler optimization
* Code optimization through translation
* VLIW
* Fixed vs. variable length instructions
* Tradeoffs
* Alignment issues? (fetch/decode)
* Decoding issues?
* Code size?
* Adding additional instructions?
* Memory bandwidth and cache utilization?
* Energy?
* Encoding in variable length instructions
* Structure of Alpha instructions and other uniform decode instructions
* Different types of instructions
* Benefit of knowing the type of an instruction
* Speculatively operate on future instructions
* x86 and other non-uniform decode instructions
* Tradeoff vs. uniform decode
* Tradeoffs for different numbers of registers
* Spilling into memory if the number of registers is small
* Compiler optimization on which values to keep in registers vs. spill
* Addressing modes
* Benefits?
* Types?
* Different uses of addressing modes?
* Various ISA-level tradeoffs
* Virtual memory
* Unaligned vs. aligned memory access
* Cost vs. benefit of unaligned access
* ISA specification
* Things you have to obey/specify in the ISA specification
* Architectural states
* Microarchitecture implements how arch. state A is transformed to the next arch. state A'
* Single cycle machines
* Critical path in the single cycle machine
* Multi cycle machines
* Functional units
* Performance metrics
* CPI/IPC (see the sketch at the end of this section)
* CPI of a single cycle microarchitecture
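
A small, hedged illustration of the CPI/IPC buzzwords above; all numbers are invented.

<code python>
# Performance-equation sketch (numbers are made up): execution time =
# instruction count x CPI x clock period.  A single-cycle machine has CPI = 1
# but its clock period is set by the slowest instruction's critical path.
def exec_time(inst_count, cpi, clock_period_ns):
    return inst_count * cpi * clock_period_ns

single_cycle = exec_time(1_000_000, cpi=1.0, clock_period_ns=5.0)   # long cycle
multi_cycle  = exec_time(1_000_000, cpi=4.2, clock_period_ns=1.0)   # short cycle
print(single_cycle, multi_cycle)   # 5000000.0 ns vs 4200000.0 ns
print("IPC =", 1 / 4.2)            # IPC is simply 1/CPI
</code>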

===== Lecture 5 (1/24 Fri.) =====
* Instruction processing
* Fetch
* Decode
* Execute
* Memory fetch
* Writeback
* Datapath & control logic in microprocessors
* Different types of instructions (I-type, R-type, etc.)
* Control flow instructions
* Non-control flow instructions
* Delay slot/delayed branch
* Single cycle control logic
* Lockstep
* Critical path analysis
* Critical path of a single cycle processor
* Combinational logic & sequential logic
* Control store
* Tradeoffs of a single cycle uarch
* Dynamic power/static power
* Speedup calculation
* Parallelism
* Serial bottleneck
* Amdahl's bottleneck (see the sketch at the end of this section)
* Design principles
* Common case design
* Critical path design
* Balanced designs
* Multi cycle design
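
A quick numerical illustration of Amdahl's bottleneck; the 90% parallel fraction is an assumption for the example.

<code python>
# Amdahl's Law sketch: overall speedup is limited by the serial fraction,
# no matter how much the parallel part is accelerated.
def amdahl_speedup(parallel_fraction, speedup_of_parallel_part):
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / speedup_of_parallel_part)

# 90% of the work parallelized across more and more units:
for s in (2, 10, 100, 1_000_000):
    print(s, round(amdahl_speedup(0.9, s), 2))
# Even with "infinite" units the speedup approaches only 1/0.1 = 10x.
</code>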

===== Lecture 6 (1/27 Mon.) =====
* Microcoded/microprogrammed machines
* States
* Microinstructions
* Microsequencing
* Control store - produces control signals
* Microsequencer
* Control signals
* What do they have to control?
* Instruction processing cycle
* Latch signals
* State machine
* State variables
* Condition codes
* Steering bits
* Branch enable logic
* Difference between gating and loading? (write enable vs. driving the bus)
* Memory mapped I/O
* Hardwired logic
* What control signals come from hardwired logic?
* Variable latency memory
* Handling interrupts
* Difference between interrupts and exceptions
* Emulator (i.e., uCode allows a minimal datapath to emulate the ISA)
* Updating machine behavior
* Horizontal microcode
* Vertical microcode
* Primitives

===== Lecture 7 (1/29 Wed.) =====
* Pipelining
* Limitations of the multi-cycle design
* Idle resources
* Throughput of a pipelined design (see the sketch at the end of this section)
* What dictates the throughput of a pipelined design?
* Latency of the pipelined design
* Dependency
* Overhead of pipelining
* Latch cost?
* Data forwarding/bypassing
* What is an ideal pipeline?
* External fragmentation
* Issues in pipeline designs
* Stalling
* Dependency (hazard)
* Flow dependence
* Output dependence
* Anti dependence
* How to handle them?
* Resource contention
* Keeping the pipeline full
* Handling exceptions/interrupts
* Pipeline flush
* Speculation
* Interlocking
* Multipath execution
* Fine-grained multithreading
* No-op (bubbles in the pipeline)
* Valid bits in the instructions
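
A back-of-the-envelope sketch of pipelined throughput vs. latency; the 5 ns datapath, 5 stages, and latch overhead are made-up numbers.

<code python>
# Pipelining sketch (made-up numbers): splitting a 5 ns datapath into 5 stages
# adds latch overhead to each stage, so single-instruction latency goes up,
# but throughput improves because a new instruction can start every cycle.
combinational_delay_ns = 5.0
stages = 5
latch_overhead_ns = 0.2

cycle_time = combinational_delay_ns / stages + latch_overhead_ns   # 1.2 ns
latency_one_inst = cycle_time * stages                             # 6.0 ns (worse)
throughput = 1.0 / cycle_time                                      # ~0.83 inst/ns

print(cycle_time, latency_one_inst, round(throughput, 2))
# An ideal pipeline (no hazards, no latch overhead) would approach a
# 5x throughput improvement; stalls and bubbles push it below that.
</code>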

===== Lecture 8 (1/31 Fri.) =====
* Branch prediction
* Different types of data dependence
* Pipeline stalls
* Bubbles
* How to handle stalls
* Stall conditions
* Stall signals
* Dependences
* Distance between dependences
* Data forwarding/bypassing
* Maintaining the correct dataflow
* Different ways to design data forwarding paths/logic
* Different techniques to handle interlocking
* SW based
* HW based
* Profiling
* Static profiling
* Help from the software (compiler)
* Superblock optimization
* Analyzing basic blocks
* How to deal with branches?
* Branch prediction
* Delayed branching (branch delay slot)
* Forward control flow/backward control flow
* Branch prediction accuracy
* Profile-guided code positioning
* Based on the profile info, position the code accordingly
* Try to make the next sequential instruction be the next inst. to be executed
* Trace cache
* Predicate combining (combine predicates for a branch instruction)
* Predicated execution (control dependence becomes data dependence)
* Definition of basic blocks
* Control flow graph

===== Lecture 9 (2/3 Mon.) =====
* Delayed branching
* Benefit?
* What does it eliminate?
* Downside?
* Delayed branching in SPARC (with squashing)
* Backward compatibility with the delay slot
* What should be filled into the delay slot
* How to ensure correctness
* Fine-grained multithreading
* Fetch from different threads
* What are the issues (what if the program doesn't have many threads)
* CDC 6000
* Denelcor HEP
* No dependency checking
* Instructions from different threads can fill in the bubbles
* Cost?
* Simultaneous multithreading
* Branch prediction
* Guess what to fetch next
* Misprediction penalty
* Need to guess the direction and target
* How to perform the performance analysis? (see the sketch at the end of this section)
* Given the branch prediction accuracy and penalty cost, how to compute the cost of a branch misprediction
* Given the program/number of instructions, percentage of branches, branch prediction accuracy and penalty cost, how to compute the cost coming from branch mispredictions
* How many extra instructions are being fetched?
* What is the performance degradation?
* How to reduce the miss penalty?
* Predicting the next address (non PC+4 address)
* Branch target buffer (BTB)
* Predicting the address of the branch
* Global branch history - for directions
* Can use the compiler to profile and get more info
* Input set dictates the accuracy
* Adds time to compilation
* Heuristics that are common and don't require profiling
* Might be inaccurate
* Does not require profiling
* Programmer can tell the hardware (via pragmas (hints))
* For example, x86 has the hint bit
* Dynamic branch prediction
* Last time predictor
* Two-bit counter based prediction
* One more bit for hysteresis
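
A hedged example of the misprediction-cost analysis asked about above; the base CPI, branch fraction, and penalty are assumptions.

<code python>
# Branch misprediction cost sketch (all numbers invented for illustration).
# CPI = base CPI + (branch fraction) x (misprediction rate) x (penalty).
base_cpi    = 1.0
branch_frac = 0.20          # 20% of instructions are branches
penalty     = 20            # a pipeline flush costs ~20 cycles

for accuracy in (0.98, 0.99):
    cpi = base_cpi + branch_frac * (1 - accuracy) * penalty
    print(f"accuracy={accuracy:.0%}  CPI={cpi:.2f}")
# 98% -> CPI 1.08, 99% -> CPI 1.04: halving the mispredictions removes
# half of the misprediction stall cycles, which matters in deep pipelines.
</code>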

===== Lecture 10 (2/5 Wed.) =====
* Branch prediction accuracy
* Why is it so important?
* Differences between 99% accuracy and 98% accuracy
* Cost of a misprediction when the pipeline is very deep
* Value prediction
* Global branch correlation
* Some branches are correlated
* Local branch correlation
* Some branches can depend on the results of past branches
* Pattern history table
* Records global taken/not taken results
* Cost vs. accuracy (What to record? Do you record the PC? Just taken/not taken info?)
* One-level branch predictor
* What information is used
* Two-level branch prediction
* What entries do you keep in the global history?
* What entries do you keep in the local history?
* How many tables?
* Cost when training a table
* What is the purpose of each table?
* Potential problems of a two-level history
* GShare predictor (see the sketch at the end of this section)
* Global history is hashed with the PC
* Store both GHR and PC information in one combined index
* How do you use the information? Why is the XOR result still usable?
* Slides (pages 16-18) give a good overview of one- and two-level predictors
* Warmup cost of the branch predictor
* Hybrid solution? Fast warmup is used first, then switch to the slower one.
* Tournament predictor (Alpha 21264)
* Other types of branch predictors
* Using machine learning?
* Geometric history length
* Look at branches far behind (but using a geometric step)
* Predicated execution - eliminates branches
* What are the tradeoffs
* What if the block is big (can lead to executing a lot of useless work)
* Allows easier code optimization
* From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
* Reduces control dependences
* Needs ISA support
* Wish branches
* Compiler generates both predicated and non-predicated code
* HW decides which one to use
* Use branch prediction on easy-to-predict code
* Use predicated execution on hard-to-predict code
* Compiler can be more aggressive in optimizing the code
* What are the tradeoffs (slide 47)
* Multi-path execution
* Execute both paths
* Can lead to wasted work
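
A toy gshare-style predictor with 2-bit saturating counters; table sizes and the exact hashing are illustrative assumptions, not the scheme from the slides.

<code python>
# Gshare-style predictor sketch: a global history register is XORed with PC
# bits to index a table of 2-bit saturating counters.
class GsharePredictor:
    def __init__(self, index_bits=12, history_bits=12):
        self.mask = (1 << index_bits) - 1
        self.hist_mask = (1 << history_bits) - 1
        self.ghr = 0                                   # global history register
        self.counters = [1] * (1 << index_bits)        # start weakly not-taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2     # True = predict taken

    def update(self, pc, taken):
        i = self._index(pc)
        # Saturating 2-bit counter: 0..3
        self.counters[i] = min(3, self.counters[i] + 1) if taken else max(0, self.counters[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.hist_mask

bp = GsharePredictor()
for outcome in [True, True, False, True, True, False]:   # loop-like pattern
    guess = bp.predict(0x400100)
    bp.update(0x400100, outcome)
</code>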

===== Lecture 11 (2/12 Wed.) =====
* Call and return prediction
* Direct calls are easy to predict
* Return is harder (indirect branches)
* Nested calls make the return harder to predict
* Can use a stack to predict the return
* Indirect branch prediction
* These branches have multiple targets
* For switch-case, virtual function calls, jump tables, interface calls
* BTB to predict the target address - low accuracy
* History based: BTB + GHR
* Virtual program counter prediction
* Complications in superscalar processors
* Fetch? What if multiple branches are fetched at the same time?
* Logic required to ensure correctness?
* Multi-cycle executions (different functional units take different numbers of cycles)
* Instructions can retire out-of-order
* How to deal with this case? Stall? Throw exceptions if there are problems?
* Exceptions and interrupts
* When are they handled?
* Why should some interrupts be handled right away?
* Precise exceptions
* Arch. state should be consistent before handling the exception/interrupt
* Easier to debug (you see the sequential flow when the interrupt occurs)
* Deterministic
* Easier to recover from the exception
* Easier to restart the processes
* How to ensure precise exceptions?
* Tradeoffs between each method
* Reorder buffer
* Reorders results before they are visible to the arch. state
* Need to preserve the sequential semantics and data
* What information is in a ROB entry?
* Where to get the value from (forwarding path? reorder buffer?)
* Extra logic to check where the youngest instruction/value is
* Content addressable search
* A lot of comparators
* Different ways to simplify the reorder buffer
* Register renaming
* Same register refers to independent values (lack of registers)
* Where does the exception happen (after retire)
* History buffer
* Update the register file when the instruction completes. Unroll if there is an exception.
* Future file (commonly used, along with the reorder buffer)
* Keep two sets of register files
* An updated value (speculative), called the future file
* A backup value (to restore the state quickly)
* Doubles the cost of the regfile, but reduces the area since you don't have to use a content addressable memory (compared to the ROB alone)
* Branch misprediction resembles an exception
* The difference is that a branch misprediction is not visible to the software
* Also much more common (say, divide by zero vs. a mispredicted branch)
* Recovery is similar to exception handling
* Latency of the state recovery
* What to do during the state recovery
* Checkpointing
* Advantages?

===== Lecture 14 (2/19 Wed.) =====
* Predictors (branch predictor, cache line predictor, ...)
* Power budget (and its importance)
* Architectural state, precise state
* Memory dependence is known dynamically
* Register state is not shared across threads/processors
* Memory state is shared across threads/processors
* How to maintain speculative memory states
* Write buffers (help simplify the process of checking the reorder buffer)
* Overall OoO mechanism
* What are other ways of eliminating dispatch stalls
* Dispatch when the sources are ready
* Retired instructions make the source available
* Register renaming (see the sketch at the end of this section)
* Reservation stations
* What goes into the reservation station
* Tags required in the reservation station
* Tomasulo's algorithm
* Without precise exceptions, OoO is hard to debug
* Arch. register ID
* Examples in the slides
* Slide 28 --> register renaming
* Slides 30-35 --> exercise (also on the board)
* This will be useful for the midterm
* Register alias table (RAT)
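
A highly simplified register-renaming sketch in the spirit of Tomasulo's algorithm; the RAT/tag representation here is a toy assumption, not the exact structures from the lecture.

<code python>
# Register-renaming sketch: a register alias table (RAT) maps each
# architectural register either to the register file or to the tag of the
# instruction that will produce it, removing WAR/WAW dependences.
regfile = {'R1': 5, 'R2': 7, 'R3': 0, 'R4': 0}
rat = {r: None for r in regfile}      # None = value lives in the register file

def rename(dst, srcs, tag):
    # Sources read either a ready value or the tag they must wait for.
    operands = [(('val', regfile[s]) if rat[s] is None else ('tag', rat[s]))
                for s in srcs]
    rat[dst] = tag                    # later readers of dst now wait on this tag
    return operands

print(rename('R3', ['R1', 'R2'], tag='RS1'))   # [('val', 5), ('val', 7)]
print(rename('R4', ['R3', 'R1'], tag='RS2'))   # [('tag', 'RS1'), ('val', 5)]
# A later write to R3 would just overwrite rat['R3'] with a new tag,
# so anti/output dependences never stall the pipeline.
</code>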

===== Lecture 15 (2/21 Fri.) =====
* OoO --> restricted dataflow
* Extracting parallelism
* What are the bottlenecks?
* Issue width
* Dispatch width
* Parallelism in the program
* More examples on slide 10
* What does it mean to be restricted dataflow
* Still visible as a Von Neumann model
* Where does the efficiency come from?
* Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
* Load/store handling
* Would like to schedule them out of order, but make them visible in order
* When do you schedule the load/store instructions?
* Can we predict if loads/stores are dependent?
* This is one of the most complex structures in load/store handling
* What information can be used to predict these load/store dependences?
* Note: IPC = 1/CPI
* Centralized vs. distributed? What are the tradeoffs?
* How to handle a misprediction/recovery
* Token dataflow arch.
* What are tokens?
* How to match tokens
* Tagged token dataflow arch.
* What are the tradeoffs?
* Difficulties?

===== Lecture 16 (2/24 Mon.) =====
* SISD/SIMD/MISD/MIMD
* Array processor
* Vector processor
* Data parallelism
* Where does the concurrency arise?
* Differences between an array processor and a vector processor
* VLIW
* Compactness of an array processor
* A vector processor operates on a vector of data (rather than a single datum (scalar))
* Vector length (also applies to the array processor)
* No dependency within a vector --> can have a deep pipeline
* Highly parallel (both instruction level (ILP) and memory level (MLP))
* But the program needs to be very parallel
* Memory can be the bottleneck (due to very high MLP)
* What do the functional units look like? Deep pipelines and simpler control.
* CRAY-1 is one example of a vector processor
* Memory access pattern in a vector processor
* How do the memory accesses benefit the memory bandwidth?
* Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a brief explanation of memory level parallelism
* Stride length vs. the number of banks (see the sketch at the end of this section)
* Stride length should be relatively prime to the number of banks
* Tradeoffs between row major and column major --> how can the vector processor deal with the two?
* How to calculate the efficiency and performance of vector processors
* What if there are multiple memory ports?
* Gather/Scatter allows the vector processor to be a lot more programmable (i.e. gather data for parallelism)
* Helps in handling sparse matrices
* Conditional operation
* Structure of vector units
* How to automatically parallelize code through the compiler?
* This is a hard problem. The compiler does not know the memory addresses.
* What do we need to ensure for both vector and array processors?
* Sequential bottleneck
* Amdahl's law
* Intel MMX --> an example of Intel's approach to SIMD
* No VLEN; the opcode defines the length
* Stride is one in MMX
* Intel SSE --> modern version of MMX
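
A small illustration of why the stride should be relatively prime to the number of banks; the 8-bank memory is an assumption.

<code python>
# Bank-conflict sketch: element i of a strided access stream maps to bank
# (i * stride) % num_banks.  When the stride shares a factor with the number
# of banks, only a subset of banks is used and accesses serialize.
def banks_touched(stride, num_banks, num_accesses=64):
    return len({(i * stride) % num_banks for i in range(num_accesses)})

for stride in (1, 2, 4, 7):
    print(f"stride={stride}: uses {banks_touched(stride, 8)} of 8 banks")
# stride=1 and stride=7 (relatively prime to 8) spread over all 8 banks;
# stride=2 hits 4 banks, stride=4 hits only 2.
</code>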

===== Lecture 17 (2/26 Wed.) =====
* GPU
* Warp/Wavefront
* A bunch of threads sharing the same PC
* SIMT
* Lanes
* FGMT + massively parallel
* Tolerate long latency
* Warp-based SIMD vs. traditional SIMD
* SPMD (programming model)
* A single program operates on multiple data
* Can have synchronization points
* Many scientific applications are programmed in this manner
* Control flow problem (branch divergence)
* Masking (in a branch, mask threads that should not execute that path)
* Lower SIMD efficiency
* What if you have layers of branches?
* Dynamic warp formation
* Combining threads from different warps to increase SIMD utilization
* This can cause memory divergence
* VLIW
* Wide fetch
* IA-64
* Tradeoffs
* Simple hardware (no dynamic scheduling, no dependency checking within a VLIW instruction)
* A lot of the load is placed on the compiler
* Decoupled access/execute
* Limited form of OoO
* Tradeoffs
* How to steer the instructions (determine dependency/stalling)?
* Instruction scheduling techniques (static vs. dynamic)
* Systolic arrays
* Processing elements transform data in chains
* Developed for image processing (for example, convolution)
* Stage processing

===== Lecture 18 (2/28 Fri.) =====
* Tradeoffs of VLIW
* Why does VLIW require static instruction scheduling?
* Whose job is it?
* Compiler can rearrange basic blocks/instructions
* Basic block
* Benefits of having large basic blocks
* Entry/Exit
* Handling entries/exits
* Trace cache
* How to ensure correctness?
* Profiling
* Fixing up the instruction order to ensure correctness
* Dealing with multiple entries into the block
* Dealing with multiple exits from the block
* Super block
* How to form super blocks?
* Benefit of super blocks
* Tradeoff between not forming a super block and forming a super block
* Ambiguous branch (after profiling, both taken/not taken are equally likely)
* Cleaning up
* What scenario would make trace cache/superblock/profiling less effective?
* List scheduling
* Helps figure out which instructions VLIW should fetch
* Try to maximize instruction throughput
* How to assign priorities
* What if some instructions take longer than others
* Block structured ISA (BS-ISA)
* Problems with trace scheduling?
* What types of programs will benefit from BS-ISA?
* How to form blocks in BS-ISA?
* Combining basic blocks
* Multiples of merged basic blocks
* How to deal with entries/exits in BS-ISA?
* Undo the executed instructions from the entry point, then fetch the new block
* Advantages over the trace cache
* Benefit of VLIW + static instruction scheduling
* Intel IA-64
* Static instruction scheduling and VLIW

===== Lecture 19 (3/19 Wed.) =====
* Ideal cache
* More capacity
* Fast
* Cheap
* High bandwidth
* DRAM cell
* Cheap
* Sense the perturbation through a sense amplifier
* Slow and leaky
* SRAM cell (cross-coupled inverters)
* Expensive
* Fast (easier to sense the value in the cell)
* Memory bank
* Read access sequence
* DRAM: Activate -> Read -> Precharge (if needed)
* What dominates the access latency for DRAM and SRAM
* Scaling issue
* Hard to scale the cell size down
* Memory hierarchy
* Prefetching
* Caching
* Spatial and temporal locality
* Caches can exploit these
* Recently used data is likely to be accessed again
* Nearby data is likely to be accessed
* Caching in a pipelined design
* Cache management
* Manual
* Data movement is managed manually
* Embedded processors
* GPU scratchpad
* Automatic
* HW manages data movement
* Latency analysis
* Based on the hit and miss status, next level access time (if miss), and the current level access time
* Cache basics (see the sketch at the end of this section)
* Set/block (line)/placement/replacement/direct mapped vs. associative caches/etc.
* Cache access
* How to access tag and data (in parallel vs. serially)
* How do tag and index get used?
* Modern processors perform serial access for higher level caches (L3 for example) to save power
* Cost and benefit of having more associativity
* Given the associativity, which block should be replaced if the set is full
* Replacement policy
* Random
* Least recently used (LRU)
* Least frequently used
* Least costly to refetch
* etc.
* How to implement LRU
* How to keep track of access ordering
* Complexity increases rapidly
* Approximate LRU
* Victim and next-victim policy
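
A sketch of how a byte address splits into tag/index/offset; the 32 KB, 4-way, 64 B-block cache is a hypothetical configuration.

<code python>
# Cache addressing sketch: for a byte address, the low bits select the byte
# within the block, the next bits select the set (index), and the rest form
# the tag.  Parameters describe a hypothetical 32 KB, 4-way, 64 B-block cache.
block_size   = 64          # bytes -> 6 offset bits
num_sets     = 128         # 32 KB / (64 B * 4 ways) -> 7 index bits
offset_bits  = block_size.bit_length() - 1
index_bits   = num_sets.bit_length() - 1

def split_address(addr):
    offset = addr & (block_size - 1)
    index  = (addr >> offset_bits) & (num_sets - 1)
    tag    = addr >> (offset_bits + index_bits)
    return tag, index, offset

print(split_address(0x12345))   # (9, 13, 5) -> tag, set index, byte offset
</code>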

===== Lecture 20 (3/21 Fri.) =====
* Set thrashing
* Working set is bigger than the associativity
* Belady's OPT (see the sketch at the end of this section)
* Is this optimal?
* Complexity?
* Similarity between caches and page tables
* Number of blocks vs. pages
* Time to find the block/page to replace
* Handling writes
* Write back
* Needs a modified (dirty) bit to make sure accesses to data get the updated data
* Write through
* Simpler, no consistency issues
* Sectored caches
* Use subblocks
* Lower bandwidth
* More complex
* Instruction vs. data caches
* Where to place instructions
* Unified vs. separated
* In the first level cache
* Cache access
* First level access
* Second level access
* When to start the second level access
* Performance vs. energy
* Address translation
* Homonyms and synonyms
* Homonym: the same VA maps to different PAs
* With multiple processes
* Synonym: multiple VAs map to the same PA
* Shared libraries, shared data, copy-on-write
* I/O
* Can these create problems when we have a cache?
* How to eliminate these problems?
* Page coloring
* Interaction between the cache and the TLB
* Virtually indexed vs. physically indexed
* Virtually tagged vs. physically tagged
* Virtually indexed physically tagged
* Virtual memory in DRAM
* Control where data is mapped to in channel/rank/bank
* More parallelism
* Reduce interference
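
A compact sketch of Belady's OPT on a made-up reference trace; it is an offline policy (it needs the whole future reference stream), so it serves as an upper bound to compare real policies like LRU against.

<code python>
# Belady's OPT sketch: on a miss with a full cache, evict the block whose next
# use is farthest in the future (or that is never used again).
def opt_misses(refs, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(refs):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = refs[i + 1:]
            # Victim = block reused farthest away (or never reused).
            victim = max(cache, key=lambda b: future.index(b) if b in future
                         else float('inf'))
            cache.remove(victim)
        cache.add(block)
    return misses

trace = ['A', 'B', 'C', 'A', 'B', 'D', 'A', 'B', 'C', 'D']
print(opt_misses(trace, capacity=2))   # 7 misses with only 2 blocks of space
</code>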

===== Lecture 21 (3/24 Mon.) =====
* Different parameters that affect cache misses
* Thrashing
* Different types of cache misses
* Compulsory misses
* Can mitigate with prefetches
* Capacity misses
* More associativity
* Victim cache
* Conflict misses
* Hashing
* Large blocks vs. small blocks
* Subblocks
* Victim cache
* Small, but fully assoc. cache behind the actual cache
* Caches recently evicted cache blocks
* Prevents ping-ponging
* Pseudo associativity
* A simpler way to implement an associative cache
* Skewed assoc. cache
* Different hashing functions for each way
* Restructure the data access pattern
* Order of loop traversal
* Blocking
* Memory level parallelism
* Cost per miss of parallel cache misses is lower compared to serial misses
* MSHR
* Keeps track of pending cache misses
* Think of this as a load/store-buffer-like structure for the cache
* What information goes into the MSHR?
* When do you access the MSHR?

===== Lecture 22 (3/26 Wed.) =====
* Multi-porting
* Virtual multi-porting
* Time-share the port; not too scalable but cheap
* True multiporting
* Multiple cache copies
* Banking
* Can have bank conflicts
* Extra interconnects across banks
* Address mapping can mitigate bank conflicts
* Common in main memory (note that the regfile in a GPU is also banked, but mainly for the purpose of reducing complexity)
* Accessing DRAM
* Row bits
* Column bits
* Addressability
* DRAM has its own clock
* DRAM (1T-1C) vs. SRAM (6T)
* Cost
* Latency
* Interleaving in DRAM
* Effects of address mapping on memory interleaving
* Effects of the program's memory access patterns on interleaving
* DRAM bank
* To minimize the cost of interleaving (shares the data bus and the command bus)
* DRAM rank
* Minimizes the cost of the chips (a bundle of chips operated together)
* DRAM channel
* An interface to DRAM, each with its own ranks/banks
* DIMM
* More DIMMs add interconnect complexity
* List of commands to read/write data into DRAM
* Activate -> read/write -> precharge
* Activate moves data into the row buffer
* Precharge prepares the bank for the next access
* Row buffer hit
* Row buffer conflict
* Scheduling memory requests to lower row conflicts
* Burst mode of DRAM
* Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits
* Address mapping (see the sketch at the end of this section)
* Row interleaved
* Cache block interleaved
* Memory controller
* Sending DRAM commands
* Periodically sends commands to refresh DRAM cells
* Ensures correctness and data integrity
* Where to place the memory controller
* On the CPU chip vs. at the main memory
* Higher BW on-chip
* Determines the order of requests that will be serviced in DRAM
* Request queues that hold requests
* Send requests whenever the request can be sent to the bank
* Determine which command (across banks) should be sent to DRAM
* Priority of demand vs. prefetch requests
* Memory scheduling policies
* FCFS
* FR-FCFS
* Capped FR-FCFS: FR-FCFS with a timeout
* Usually this is done at the command level (read/write commands and precharge/activate commands)
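
A sketch of row-interleaved vs. cache-block-interleaved address mapping; the field widths are invented.

<code python>
# Address-mapping sketch: the same physical address can be split so that
# consecutive cache blocks stay in one row (row interleaved) or are spread
# across banks (cache-block interleaved).  Field widths are made up.
ROW_BITS, BANK_BITS, COL_BITS, BLOCK_BITS = 14, 3, 7, 6   # hypothetical DRAM

def row_interleaved(addr):
    # row | bank | column | block offset  (low bits walk within one row)
    blk  = addr >> BLOCK_BITS
    col  = blk & ((1 << COL_BITS) - 1)
    bank = (blk >> COL_BITS) & ((1 << BANK_BITS) - 1)
    row  = blk >> (COL_BITS + BANK_BITS)
    return row, bank, col

def block_interleaved(addr):
    # row | column | bank | block offset  (consecutive blocks hit different banks)
    blk  = addr >> BLOCK_BITS
    bank = blk & ((1 << BANK_BITS) - 1)
    col  = (blk >> BANK_BITS) & ((1 << COL_BITS) - 1)
    row  = blk >> (BANK_BITS + COL_BITS)
    return row, bank, col

for a in (0x0000, 0x0040, 0x0080):        # three consecutive 64 B blocks
    print(row_interleaved(a), block_interleaved(a))
# Row interleaved keeps them in one bank (good row-buffer locality for streams);
# block interleaved spreads them across banks (more bank-level parallelism).
</code>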

===== Lecture 23 (3/28 Fri.) =====
* DRAM design choices
* Cost/density/latency/BW/yield
* Sense amplifier
* How do they work?
* Double data rate
* Subarray
* RowClone
* Moving bulk data from one row to another
* Lower latency and bandwidth cost when performing copies/zeroing out the data
* TL-DRAM
* Far segment
* Near segment
* What causes the long latency
* Benefit of TL-DRAM
* TL-DRAM vs. DRAM cache (adding a small cache in DRAM)

===== Lecture 24 (3/31 Mon.) =====
* Memory controller
* Different commands
* Memory scheduler
* Determines the order of requests to be issued to DRAM
* Age/hit-miss status/type (load/store/prefetch/from GPU/from CPU)/criticality
* Row buffer
* Hit/conflict
* Open/closed row
* Open row policy
* Closed row policy
* Tradeoffs between open and closed row policies
* If the program has high row buffer locality, an open row policy might benefit more
* A closed row policy will service row-miss requests faster
* Bank conflicts
* Interference from different applications/threads
* Different programs/processes/threads interfere with each other
* Introduce more row buffer/bank conflicts
* The memory scheduler has to manage this interference
* Memory hog problems
* Interference on the data/command bus
* FR-FCFS (see the sketch at the end of this section)
* Why does FR-FCFS make sense?
* Row buffer hits have lower latency
* Issues with FR-FCFS
* Unfairness
* STFM
* Fairness issues in memory scheduling
* How does STFM calculate the fairness and slowdown
* How to estimate the slowdown relative to when the thread is running alone
* Definition of fairness (based on STFM; different papers/areas define fairness differently)
* PAR-BS
* Parallelism in programs
* Interference across banks
* How to form a batch
* How to determine ranking between batches/within a batch
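
A toy single-bank FR-FCFS ordering sketch; real controllers schedule at the DRAM command level, so treat this only as an illustration of the row-hit-first policy and the unfairness it can create.

<code python>
# FR-FCFS sketch: among ready requests to a bank, row-buffer hits are served
# first (First-Ready), and ties are broken by arrival order (FCFS).
def fr_fcfs_order(requests, initial_open_row=None):
    # requests: list of (arrival_order, row) tuples
    pending, open_row, schedule = list(requests), initial_open_row, []
    while pending:
        hits = [r for r in pending if r[1] == open_row]
        # Prefer the oldest row hit; otherwise the oldest request overall.
        nxt = min(hits, key=lambda r: r[0]) if hits else min(pending, key=lambda r: r[0])
        pending.remove(nxt)
        open_row = nxt[1]            # the serviced row is now open
        schedule.append(nxt)
    return schedule

reqs = [(0, 'row A'), (1, 'row B'), (2, 'row A'), (3, 'row A')]
print(fr_fcfs_order(reqs, initial_open_row='row A'))
# [(0, 'row A'), (2, 'row A'), (3, 'row A'), (1, 'row B')] - a streaming
# thread's row hits jump ahead of request 1, which is exactly the
# unfairness/"memory hog" problem that STFM and PAR-BS try to address.
</code>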