This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Next revision Both sides next revision | ||
buzzword [2015/01/10 21:03] kevincha [Lecture 1 (1/13 Mon.)] |
buzzword [2015/02/12 19:01] kevincha [Lecture 11 (2/11 Mon.)] |
||
---|---|---|---|
Line 4: | Line 4: | ||
===== Lecture 1 (1/12 Mon.) ===== | ===== Lecture 1 (1/12 Mon.) ===== | ||
+ | * Level of transformation | ||
+ | * Algorithm | ||
+ | * System software | ||
+ | * Compiler | ||
+ | * Cross abstraction layers | ||
+ | * Tradeoffs | ||
+ | * Caches | ||
+ | * DRAM/memory controller | ||
+ | * DRAM banks | ||
+ | * Row buffer hit/miss | ||
+ | * Row buffer locality | ||
+ | * Unfairness | ||
+ | * Memory performance hog | ||
+ | * Shared DRAM memory system | ||
+ | * Streaming access vs. random access | ||
+ | * Memory scheduling policies | ||
+ | * Scheduling priority | ||
+ | * Retention time of DRAM | ||
+ | * Process variation | ||
+ | * Retention time profile | ||
+ | * Power consumption | ||
+ | * Bloom filter | ||
+ | * Hamming code | ||
+ | * Hamming distance | ||
+ | * DRAM row hammer | ||
+ | |||
+ | ===== Lecture 2 (1/14 Wed.) ===== | ||
+ | * Moore's Law | ||
+ | * Algorithm --> step-by-step procedure to solve a problem | ||
+ | * in-order execution | ||
+ | * out-of-order execution | ||
+ | * technologies that are available on cellphones | ||
+ | * new applications that are made available through new computer architecture techniques | ||
+ | * more data mining (genomics/medical areas) | ||
+ | * lower power (cellphones) | ||
+ | * smaller cores (cellphones/computers) | ||
+ | * etc. | ||
+ | * Performance bottlenecks in a single thread/core processors | ||
+ | * multi-core as an alternative | ||
+ | * Memory wall (a part of scaling issue) | ||
+ | * Scaling issue | ||
+ | * Transister are getting smaller | ||
+ | * Key components of a computer | ||
+ | * Design points | ||
+ | * Design processors to meet the design points | ||
+ | * Software stack | ||
+ | * Design decisions | ||
+ | * Datacenters | ||
+ | * Reliability problems that cause errors | ||
+ | * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) | ||
+ | * Pre paradigm science | ||
+ | * Normal science | ||
+ | * Revolutionalry science | ||
+ | * Components of a computer | ||
+ | * Computation | ||
+ | * Communication | ||
+ | * Storage | ||
+ | * DRAM | ||
+ | * NVRAM (Non-volatile memory): PCM, STT-MRAM | ||
+ | * Storage (Flash/Harddrive) | ||
+ | * Von Neumann Model (Control flow model) | ||
+ | * Stored program computer | ||
+ | * Properties of Von Neumann Model: Stored program, sequential instruction processing | ||
+ | * Unified memory | ||
+ | * When does an instruction is being interpreted as an instruction (as oppose to a datum)? | ||
+ | * Program counter | ||
+ | * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS | ||
+ | * Data flow model | ||
+ | * Data flow machine | ||
+ | * Data flow graph | ||
+ | * Operands | ||
+ | * Live-outs/Live-ins | ||
+ | * DIfferent types of data flow nodes (conditional/relational/barrier) | ||
+ | * How to do transactional transaction in dataflow? | ||
+ | * Example: bank transactions | ||
+ | * Tradeoffs between control-driven and data-driven | ||
+ | * What are easier to program? | ||
+ | * Which are easy to compile? | ||
+ | * What are more parallel (does that mean it is faster?) | ||
+ | * Which machines are more complex to design? | ||
+ | * In control flow, when a program is stop, there is a pointer to the current state (precise state). | ||
+ | * ISA vs. Microarchitecture | ||
+ | * Semantics in the ISA | ||
+ | * uArch should obey the ISA | ||
+ | * Changing ISA is costly, can affect compatibility. | ||
+ | * Instruction pointers | ||
+ | * uArch techniques: common and powerful techniques break Vonn Neumann model if done at the ISA level | ||
+ | * Conceptual techniques | ||
+ | * Pipelining | ||
+ | * Multiple instructions at a time | ||
+ | * Out-of-order executions | ||
+ | * etc. | ||
+ | * Design techniques | ||
+ | * Adder implementation (Bit serial, ripple carry, carry lookahead) | ||
+ | * Connection machine (an example of a machine that use bit serial to tradeoff latency for more parallelism) | ||
+ | * Microprocessor: ISA + uArch + circuits | ||
+ | * What are a part of the ISA? Instructions, memory, etc. | ||
+ | * Things that are visible to the programmer/software | ||
+ | * What are not a part of the ISA? (what goes inside: uArch techniques) | ||
+ | * Things that are not suppose to be visible to the programmer/software but typically make the processor faster and/or consumes less power and/or less complex | ||
+ | |||
+ | ===== Lecture 3 (1/17 Fri.) ===== | ||
+ | |||
+ | * Microarchitecture | ||
+ | * Three major tradeoffs of computer architecture | ||
+ | * Macro-architecture | ||
+ | * LC-3b ISA | ||
+ | * Unused instructions | ||
+ | * Bit steering | ||
+ | * Instruction processing style | ||
+ | * 0,1,2,3 address machines | ||
+ | * Stack machine | ||
+ | * Accumulator machine | ||
+ | * 2-operand machine | ||
+ | * 3-operand machine | ||
+ | * Tradeoffs between 0,1,2,3 address machines | ||
+ | * Postfix notation | ||
+ | * Instructions/Opcode/Operade specifiers (i.e. addressing modes) | ||
+ | * Simply vs. complex data type (and their tradeoffs) | ||
+ | * Semantic gap and level | ||
+ | * Translation layer | ||
+ | * Addressability | ||
+ | * Byte/bit addressable machines | ||
+ | * Virtual memory | ||
+ | * Big/little endian | ||
+ | * Benefits of having registers (data locality) | ||
+ | * Programmer visible (Architectural) state | ||
+ | * Programmers can access this directly | ||
+ | * What are the benefits? | ||
+ | * Microarchitectural state | ||
+ | * Programmers cannot access this directly | ||
+ | * Evolution of registers (from accumulators to registers) | ||
+ | * Different types of instructions | ||
+ | * Control instructions | ||
+ | * Data instructions | ||
+ | * Operation instructions | ||
+ | * Addressing modes | ||
+ | * Tradeoffs (complexity, flexibility, etc.) | ||
+ | * Orthogonal ISA | ||
+ | * Addressing modes that are orthogonal to instruction types | ||
+ | * I/O devices | ||
+ | * Vectored vs. non-vectored interrupts | ||
+ | * Complex vs. simple instructions | ||
+ | * Tradeoffs | ||
+ | * RISC vs. CISC | ||
+ | * Tradeoff | ||
+ | * Backward compatibility | ||
+ | * Performance | ||
+ | * Optimization opportunity | ||
+ | * Translation | ||
+ | |||
+ | ===== Lecture 4 (1/21 Wed.) ===== | ||
+ | |||
+ | * Fixed vs. variable length instruction | ||
+ | * Huffman encoding | ||
+ | * Uniform vs. non-uniform decode | ||
+ | * Registers | ||
+ | * Tradeoffs between number of registers | ||
+ | * Alignments | ||
+ | * How does MIPS load words across alignment the boundary | ||
+ | |||
+ | ===== Lecture 5 (1/26 Mon.) ===== | ||
+ | |||
+ | * Tradeoffs in ISA: Instruction length | ||
+ | * Uniform vs. non-uniform | ||
+ | * Design point/Use cases | ||
+ | * What dictates the design point? | ||
+ | * Architectural states | ||
+ | * uArch | ||
+ | * How to implement the ISA in the uArch | ||
+ | * Different stages in the uArch | ||
+ | * Clock cycles | ||
+ | * Multi-cycle machine | ||
+ | * Datapath and control logic | ||
+ | * Control signals | ||
+ | * Execution time of instructions/program | ||
+ | * Metrics and what do they means | ||
+ | * Instruction processing | ||
+ | * Fetch | ||
+ | * Decode | ||
+ | * Execute | ||
+ | * Memory fetch | ||
+ | * Writeback | ||
+ | * Encoding and semantics | ||
+ | * Different types of instructions (I-type, R-type, etc.) | ||
+ | * Control flow instructions | ||
+ | * Non-control flow instructions | ||
+ | * Delayed slot/Delayed branch | ||
+ | * Single cycle control logic | ||
+ | * Lockstep | ||
+ | * Critical path analysis | ||
+ | * Critical path of a single cycle processor | ||
+ | * What is in the control signals? | ||
+ | * Combinational logic & Sequential logic | ||
+ | * Control store | ||
+ | * Tradeoffs of a single cycle uarch | ||
+ | * Design principles | ||
+ | * Common case design | ||
+ | * Critical path design | ||
+ | * Balanced designs | ||
+ | * Dynamic power/Static power | ||
+ | * Increases in power due to frequency | ||
+ | |||
+ | ===== Lecture 6 (1/28 Mon.) ===== | ||
+ | |||
+ | * Design principles | ||
+ | * Common case design | ||
+ | * Critical path design | ||
+ | * Balanced designs | ||
+ | * Multi cycle design | ||
+ | * Microcoded/Microprogrammed machines | ||
+ | * States | ||
+ | * Translation from one state to another | ||
+ | * Microinstructions | ||
+ | * Microsequencing | ||
+ | * Control store - Product control signals | ||
+ | * Microsequencer | ||
+ | * Control signal | ||
+ | * What do they have to control? | ||
+ | * Instruction processing cycle | ||
+ | * Latch signals | ||
+ | * State machine | ||
+ | * State variables | ||
+ | * Condition code | ||
+ | * Steering bits | ||
+ | * Branch enable logic | ||
+ | * Difference between gating and loading? (write enable vs. driving the bus) | ||
+ | * Memory mapped I/O | ||
+ | * Hardwired logic | ||
+ | * What control signals come from hardwired logic? | ||
+ | * Variable latency memory | ||
+ | * Handling interrupts | ||
+ | * Difference betwen interrupts and exceptions | ||
+ | * Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | ||
+ | * Updating machine behavior | ||
+ | * Horizontal microcode | ||
+ | * Vertical microcode | ||
+ | * Primitives | ||
+ | | ||
+ | ===== Lecture 7 (1/30 Fri.) ===== | ||
+ | |||
+ | * Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | ||
+ | * Updating machine behavior | ||
+ | * Horizontal microcode | ||
+ | * Vertical microcode | ||
+ | * Primitives | ||
+ | * nanocode and millicode | ||
+ | * what are the differences between nano/milli/microcode | ||
+ | * microprogrammed vs. hardwire control | ||
+ | * Pipelining | ||
+ | * Limitations of the multi-programmed design | ||
+ | * Idle resources | ||
+ | * Throughput of a pipelined design | ||
+ | * What dictacts the throughput of a pipelined design? | ||
+ | * Latency of the pipelined design | ||
+ | * Dependency | ||
+ | * Overhead of pipelining | ||
+ | * Latch cost? | ||
+ | * Data forwarding/bypassing | ||
+ | * What are the ideal pipeline? | ||
+ | * External fragmentation | ||
+ | * Issues in pipeline designs | ||
+ | * Stalling | ||
+ | * Dependency (Hazard) | ||
+ | * Flow dependence | ||
+ | * Output dependence | ||
+ | * Anti dependence | ||
+ | * How to handle them? | ||
+ | * Resource contention | ||
+ | * Keeping the pipeline full | ||
+ | * Handling exception/interrupts | ||
+ | * Pipeline flush | ||
+ | * Speculation | ||
+ | | ||
+ | ===== Lecture 8 (2/2 Mon.) ===== | ||
+ | |||
+ | * Interlocking | ||
+ | * Multipath execution | ||
+ | * Fine grain multithreading | ||
+ | * No-op (Bubbles in the pipeline) | ||
+ | * Valid bits in the instructions | ||
+ | * Branch prediction | ||
+ | * Different types of data dependence | ||
+ | * Pipeline stalls | ||
+ | * bubbles | ||
+ | * How to handle stalls | ||
+ | * Stall conditions | ||
+ | * Stall signals | ||
+ | * Dependences | ||
+ | * Distant between dependences | ||
+ | * Data forwarding/bypassing | ||
+ | * Maintaining the correct dataflow | ||
+ | * Different ways to design data forwarding path/logic | ||
+ | * Different techniques to handle interlockings | ||
+ | * SW based | ||
+ | * HW based | ||
+ | * Profiling | ||
+ | * Static profiling | ||
+ | * Helps from the software (compiler) | ||
+ | * Superblock optimization | ||
+ | * Analyzing basic blocks | ||
+ | * How to deal with branches? | ||
+ | * Branch prediction | ||
+ | * Delayed branching (branch delay slot) | ||
+ | * Forward control flow/backward control flow | ||
+ | * Branch prediction accuracy | ||
+ | * Profile guided code positioning | ||
+ | * Based on the profile info. position the code based on it | ||
+ | * Try to make the next sequential instruction be the next inst. to be executed | ||
+ | * Predicate combining (combine predicate for a branch instruction) | ||
+ | * Predicated execution (control dependence becomes data dependence) | ||
+ | | ||
+ | | ||
+ | ===== Lecture 9 (2/4 Wed.) ===== | ||
+ | * Definition of basic blocks | ||
+ | * Control flow graph | ||
+ | * Delayed branching | ||
+ | * benefit? | ||
+ | * What does it eliminates? | ||
+ | * downside? | ||
+ | * Delayed branching in SPARC (with squashing) | ||
+ | * Backward compatibility with the delayed slot | ||
+ | * What should be filled in the delayed slot | ||
+ | * How to ensure correctness | ||
+ | * Fine-grained multithreading | ||
+ | * fetch from different threads | ||
+ | * What are the issues (what if the program doesn't have many threads) | ||
+ | * CDC 6000 | ||
+ | * Denelcor HEP | ||
+ | * No dependency checking | ||
+ | * Inst. from different thread can fill-in the bubbles | ||
+ | * Cost? | ||
+ | * Simulteneuos multithreading | ||
+ | * Branch prediction | ||
+ | * Guess what to fetch next. | ||
+ | * Misprediction penalty | ||
+ | * Need to guess the direction and target | ||
+ | * How to perform the performance analysis? | ||
+ | * Given the branch prediction accuracy and penalty cost, how to compute a cost of a branch misprediction. | ||
+ | * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions. | ||
+ | * How many extra instructions are being fetched? | ||
+ | * What is the performance degredation? | ||
+ | * How to reduce the miss penalty? | ||
+ | * Predicting the next address (non PC+4 address) | ||
+ | * Branch target buffer (BTB) | ||
+ | * Predicting the address of the branch | ||
+ | * Global branch history - for directions | ||
+ | * Can use compiler to profile and get more info | ||
+ | * Input set dictacts the accuracy | ||
+ | * Add time to compilation | ||
+ | * Heuristics that are common and doesn't require profiling. | ||
+ | * Might be inaccurate | ||
+ | * Does not require profiling | ||
+ | * Static branch prediction | ||
+ | * Pregrammer provides pragmas, hinting the likelihood of taken/not taken branch | ||
+ | * For example, x86 has the hint bit | ||
+ | * Dynamic branch prediction | ||
+ | * Last time predictor | ||
+ | * Two bits counter based prediction | ||
+ | * One more bit for hysteresis | ||
+ | |||
+ | ===== Lecture 10 (2/6 Fri.) ===== | ||
+ | * Branch prediction accuracy | ||
+ | * Why are they very important? | ||
+ | * Differences between 99% accuracy and 98% accuracy | ||
+ | * Cost of a misprediction when the pipeline is veryd eep | ||
+ | * Global branch correlation | ||
+ | * Some branches are correlated | ||
+ | * Local branch correlation | ||
+ | * Some branches can depend on the result of past branches | ||
+ | * Pattern history table | ||
+ | * Record global taken/not taken results. | ||
+ | * Cost vs. accuracy (What to record, do you record PC? Just taken/not taken info.?) | ||
+ | * One-level branch predictor | ||
+ | * What information are used | ||
+ | * Two-level branch prediction | ||
+ | * What entries do you keep in the global history? | ||
+ | * What entries do you keep in the local history? | ||
+ | * How many table? | ||
+ | * Cost when training a table | ||
+ | * What are the purposes of each table? | ||
+ | * Potential problems of a two-level history | ||
+ | * GShare predictor | ||
+ | * Global history predictor is hashed with the PC | ||
+ | * Store both GHP and PC in one combined information | ||
+ | * How do you use the information? Why does the XOR result still usable? | ||
+ | * Warmup cost of the branch predictor | ||
+ | * Hybrid solution? Fast warmup is used first, then switch to the slower one. | ||
+ | * Tournament predictor (Alpha 21264) | ||
+ | * Predicated execution - eliminate branches | ||
+ | * What are the tradeoffs | ||
+ | * What if the block is big (can lead to execution a lot of useless work) | ||
+ | * Allows easier code optimization | ||
+ | * From the compiler PoV, predicated execution combine multiple basic blocks into one bigger basic block | ||
+ | * Reduce control dependences | ||
+ | * Need ISA support | ||
+ | * Wish branches | ||
+ | * Compiler generate both predicated and non-predicated codes | ||
+ | * HW design which one to use | ||
+ | * Use branch prediction on an easy to predict code | ||
+ | * Use predicated execution on a hard to predict code | ||
+ | * Compiler can be more aggressive in optimimzing the code | ||
+ | * What are the tradeoffs (slide# 47) | ||
+ | * Multi-path execution | ||
+ | * Execute both paths | ||
+ | * Can lead to wasted work | ||
+ | * VLIW | ||
+ | * SuperScalar | ||
+ | |||
+ | |||
+ | ===== Lecture 11 (2/11 Mon.) ===== | ||
+ | |||
+ | * Geometric GHR length for branch prediction | ||
+ | * Perceptron branch predictor | ||
+ | * Multi-cycle executions (Different functional units take different number of cycles) | ||
+ | * Instructions can retire out-of-order | ||
+ | * How to deal with this case? Stall? Throw exceptions if there are problems? | ||
+ | * Exceptions and Interrupts | ||
+ | * When they are handled? | ||
+ | * Why are some interrupts should be handled right away? | ||
+ | * Precise exception | ||
+ | * arch. state should be consistent before handling the exception/interrupts | ||
+ | * Easier to debug (you see the sequential flow when the interrupt occurs) | ||
+ | * Deterministic | ||
+ | * Easier to recover from the exception | ||
+ | * Easier to restart the processes | ||
+ | * How to ensure precise exception? | ||
+ | * Tradeoffs between each method | ||
+ | * Reorder buffer | ||
+ | * Reorder results before they are visible to the arch. state | ||
+ | * Need to presearve the sequential sematic and data | ||
+ | * What are the informatinos in the ROB entry | ||
+ | * Where to get the value from (forwarding path? reorder buffer?) | ||
+ | * Extra logic to check where the youngest instructions/value is | ||
+ | * Content addressible search (CAM) | ||
+ | * A lot of comparators | ||
+ | * Different ways to simplify the reorder buffer | ||
+ | * Register renaming | ||
+ | * Same register refers to independent values (lacks of registers) | ||
+ | * Where does the exception happen (after retire) | ||
+ | * History buffer | ||
+ | * Update the register file when the instruction complete. Unroll if there is an exception. | ||
+ | * Future file (commonly used, along with reorder buffer) | ||
+ | * Keep two set of register files | ||
+ | * An updated value (Speculative), called future file | ||
+ | * A backup value (to restore the state quickly | ||
+ | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone) | ||
+ | * Branch misprediction resembles Exception | ||
+ | * The difference is that branch misprediction is not visible to the software | ||
+ | * Also much more common (say, divide by zero vs. a mispredicted branch) | ||
+ | * Recovery is similar to exception handling | ||
+ | * Latency of the state recovery | ||
+ | * What to do during the state recovery | ||
+ | * Checkpointing | ||
+ | * Advantages? | ||
+ | |||