====== Buzzwords ======
  
Buzzwords are terms mentioned during lecture that are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of the course material.
  
===== Lecture 1 (1/12 Mon.) =====
  * Level of transformation
    * Algorithm
    * System software
    * Compiler
  * Cross abstraction layers
  * Tradeoffs
  * Caches
  * DRAM/memory controller
  * DRAM banks
  * Row buffer hit/miss
  * Row buffer locality
  * Unfairness
  * Memory performance hog
  * Shared DRAM memory system
  * Streaming access vs. random access
  * Memory scheduling policies
  * Scheduling priority
  * Retention time of DRAM
  * Process variation
  * Retention time profile
  * Power consumption
  * Bloom filter
  * Hamming code
  * Hamming distance
  * DRAM row hammer
  
===== Lecture 2 (1/14 Wed.) =====
  * Moore's Law
  * Algorithm --> step-by-step procedure to solve a problem
  * In-order execution
  * Out-of-order execution
  * Technologies that are available on cellphones
  * New applications that are made available through new computer architecture techniques
    * More data mining (genomics/medical areas)
    * Lower power (cellphones)
    * Smaller cores (cellphones/computers)
    * etc.
  * Performance bottlenecks in single thread/core processors
    * Multi-core as an alternative
  * Memory wall (a part of the scaling issue)
  * Scaling issue
    * Transistors are getting smaller
  * Key components of a computer
  * Design points
    * Design processors to meet the design points
  * Software stack
  * Design decisions
  * Datacenters
  * Reliability problems that cause errors
  * Analogies from Kuhn's "The Structure of Scientific Revolutions" (recommended book)
    * Pre-paradigm science
    * Normal science
    * Revolutionary science
  * Components of a computer
    * Computation
    * Communication
    * Storage
      * DRAM
      * NVRAM (non-volatile memory): PCM, STT-MRAM
      * Storage (Flash/hard drive)
  * Von Neumann model (control flow model)
    * Stored program computer
    * Properties of the Von Neumann model: stored program, sequential instruction processing
    * Unified memory
      * When is an instruction interpreted as an instruction (as opposed to a datum)?
    * Program counter
    * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
  * Data flow model
    * Data flow machine
      * Data flow graph
    * Operands
    * Live-outs/live-ins
    * Different types of data flow nodes (conditional/relational/barrier)
    * How to do transactions in dataflow?
      * Example: bank transactions
  * Tradeoffs between control-driven and data-driven
    * Which is easier to program?
    * Which is easier to compile?
    * Which is more parallel (and does that mean it is faster?)
    * Which machines are more complex to design?
    * In control flow, when a program is stopped, there is a pointer to the current state (precise state).
  * ISA vs. microarchitecture
    * Semantics in the ISA
    * uArch should obey the ISA
    * Changing the ISA is costly and can affect compatibility.
  * Instruction pointers
  * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
    * Conceptual techniques
      * Pipelining
      * Multiple instructions at a time
      * Out-of-order execution
      * etc.
    * Design techniques
      * Adder implementation (bit serial, ripple carry, carry lookahead); see the sketch after this list
      * Connection Machine (an example of a machine that uses bit-serial adders to trade off latency for more parallelism)
  * Microprocessor: ISA + uArch + circuits
  * What is part of the ISA? Instructions, memory, etc.
    * Things that are visible to the programmer/software
  * What is not part of the ISA? (what goes inside: uArch techniques)
    * Things that are not supposed to be visible to the programmer/software but typically make the processor faster and/or consume less power and/or make it less complex

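As a refresher on the adder design techniques above, here is a minimal C sketch (my own illustration, not from the lecture slides) of a ripple-carry adder: each bit waits for the carry from the previous bit, which is exactly the serial carry chain that carry-lookahead designs shorten.

<code c>
#include <stdio.h>
#include <stdint.h>

uint32_t ripple_carry_add(uint32_t a, uint32_t b) {
    uint32_t sum = 0, carry = 0;
    for (int i = 0; i < 32; i++) {        /* one full adder per bit */
        uint32_t ai = (a >> i) & 1, bi = (b >> i) & 1;
        sum   |= (ai ^ bi ^ carry) << i;                  /* sum bit uses carry-in */
        carry  = (ai & bi) | (ai & carry) | (bi & carry); /* carry-out to next bit */
    }
    return sum;
}

int main(void) {
    printf("%u\n", ripple_carry_add(40, 2)); /* prints 42 */
    return 0;
}
</code>
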
  
===== Lecture 3 (1/16 Fri.) =====

  * Microarchitecture
  * Three major tradeoffs of computer architecture
  * Macro-architecture
  * LC-3b ISA
  * Unused instructions
  * Bit steering
  * Instruction processing style
  * 0,1,2,3-address machines
    * Stack machine
    * Accumulator machine
    * 2-operand machine
    * 3-operand machine
    * Tradeoffs between 0,1,2,3-address machines
  * Postfix notation (see the sketch after this list)
  * Instructions/opcode/operand specifiers (i.e. addressing modes)
  * Simple vs. complex data types (and their tradeoffs)
  * Semantic gap and level
  * Translation layer
  * Addressability
  * Byte/bit addressable machines
  * Virtual memory
  * Big/little endian
  * Benefits of having registers (data locality)
  * Programmer visible (architectural) state
    * Programmers can access this directly
    * What are the benefits?
  * Microarchitectural state
    * Programmers cannot access this directly
  * Evolution of registers (from accumulators to registers)
  * Different types of instructions
    * Control instructions
    * Data instructions
    * Operation instructions
  * Addressing modes
    * Tradeoffs (complexity, flexibility, etc.)
  * Orthogonal ISA
    * Addressing modes that are orthogonal to instruction types
  * I/O devices
  * Vectored vs. non-vectored interrupts
  * Complex vs. simple instructions
    * Tradeoffs
  * RISC vs. CISC
    * Tradeoffs
    * Backward compatibility
    * Performance
    * Optimization opportunity
    * Translation

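A minimal sketch (my own illustration, not course-provided) of why postfix notation maps naturally onto a 0-address (stack) machine: every operator simply pops its operands and pushes the result, so no operand specifiers are needed.

<code c>
#include <stdio.h>
#include <ctype.h>

int main(void) {
    const char *postfix = "23*4+";   /* (2 * 3) + 4 in postfix */
    int stack[64], top = 0;
    for (const char *p = postfix; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[top++] = *p - '0';                    /* PUSH immediate */
        } else {
            int b = stack[--top], a = stack[--top];     /* POP two operands */
            stack[top++] = (*p == '*') ? a * b : a + b; /* PUSH result */
        }
    }
    printf("%d\n", stack[0]);        /* prints 10 */
    return 0;
}
</code>
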
===== Lecture 4 (1/21 Wed.) =====

  * Fixed vs. variable length instructions
  * Huffman encoding
  * Uniform vs. non-uniform decode
  * Registers
    * Tradeoffs in the number of registers
  * Alignment
    * How does MIPS load words across the alignment boundary?

===== Lecture 5 (1/26 Mon.) =====

  * Tradeoffs in ISA: instruction length
    * Uniform vs. non-uniform
  * Design point/use cases
    * What dictates the design point?
  * Architectural states
  * uArch
    * How to implement the ISA in the uArch
  * Different stages in the uArch
  * Clock cycles
  * Multi-cycle machine
  * Datapath and control logic
    * Control signals
  * Execution time of instructions/programs (see the worked example after this list)
    * Metrics and what they mean
  * Instruction processing
    * Fetch
    * Decode
    * Execute
    * Memory fetch
    * Writeback
  * Encoding and semantics
  * Different types of instructions (I-type, R-type, etc.)
  * Control flow instructions
  * Non-control flow instructions
  * Delay slot/delayed branch
  * Single cycle control logic
  * Lockstep
  * Critical path analysis
    * Critical path of a single cycle processor
  * What is in the control signals?
    * Combinational logic & sequential logic
  * Control store
  * Tradeoffs of a single cycle uArch
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
    * Dynamic power/static power
      * Increase in power due to frequency

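A small worked example (all numbers made up) of the execution-time equation from this lecture: execution time = instruction count x CPI x clock period.

<code c>
#include <stdio.h>

int main(void) {
    double insts  = 1e9;      /* dynamic instruction count */
    double cpi    = 1.2;      /* average cycles per instruction */
    double f_hz   = 2e9;      /* 2 GHz clock -> period = 1/f */
    double time_s = insts * cpi / f_hz;
    printf("execution time = %.3f s\n", time_s); /* 0.600 s */
    return 0;
}
</code>
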
===== Lecture 6 (1/28 Wed.) =====

  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
  * Multi-cycle design
  * Microcoded/microprogrammed machines
    * States
    * Translation from one state to another
    * Microinstructions
    * Microsequencing
    * Control store - produces the control signals
    * Microsequencer
    * Control signals
      * What do they have to control?
  * Instruction processing cycle
  * Latch signals
  * State machine
  * State variables
  * Condition code
  * Steering bits
  * Branch enable logic
  * Difference between gating and loading (write enable vs. driving the bus)
  * Memory mapped I/O
  * Hardwired logic
    * What control signals come from hardwired logic?
  * Variable latency memory
  * Handling interrupts
  * Difference between interrupts and exceptions
  * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives

===== Lecture 7 (1/30 Fri.) =====

  * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives
  * Nanocode and millicode
    * What are the differences between nano/milli/microcode?
  * Microprogrammed vs. hardwired control
  * Pipelining
  * Limitations of the multi-cycle design
    * Idle resources
  * Throughput of a pipelined design (see the sketch after this list)
    * What dictates the throughput of a pipelined design?
  * Latency of the pipelined design
  * Dependency
  * Overhead of pipelining
    * Latch cost?
  * Data forwarding/bypassing
  * What is the ideal pipeline?
  * External fragmentation
  * Issues in pipeline designs
    * Stalling
      * Dependency (hazard)
        * Flow dependence
        * Output dependence
        * Anti dependence
        * How to handle them?
    * Resource contention
    * Keeping the pipeline full
    * Handling exceptions/interrupts
    * Pipeline flush
    * Speculation

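A back-of-the-envelope sketch (made-up numbers) of the pipelining tradeoff above: throughput grows with stage count, but the per-stage latch overhead keeps the speedup below the number of stages.

<code c>
#include <stdio.h>

int main(void) {
    double t_logic = 20.0;  /* ns of combinational logic, unpipelined */
    double t_latch = 0.5;   /* ns of latch overhead added per stage */
    for (int stages = 1; stages <= 8; stages *= 2) {
        double cycle = t_logic / stages + t_latch;   /* per-stage cycle time */
        printf("%d stage(s): cycle = %5.2f ns, throughput = %.3f inst/ns\n",
               stages, cycle, 1.0 / cycle);
    }
    return 0;
}
</code>
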
===== Lecture 8 (2/2 Mon.) =====

  * Interlocking
  * Multipath execution
  * Fine grained multithreading
  * No-op (bubbles in the pipeline)
  * Valid bits in the instructions
  * Branch prediction
  * Different types of data dependence
  * Pipeline stalls
    * Bubbles
    * How to handle stalls
    * Stall conditions
    * Stall signals
    * Dependences
      * Distance between dependences
    * Data forwarding/bypassing
    * Maintaining the correct dataflow
  * Different ways to design the data forwarding path/logic
  * Different techniques to handle interlocking
    * SW based
    * HW based
  * Profiling
    * Static profiling
    * Help from the software (compiler)
      * Superblock optimization
      * Analyzing basic blocks
  * How to deal with branches?
    * Branch prediction
    * Delayed branching (branch delay slot)
    * Forward control flow/backward control flow
    * Branch prediction accuracy
  * Profile guided code positioning
    * Position the code based on the profile info
    * Try to make the next sequential instruction be the next inst. to be executed
  * Predicate combining (combine predicates for a branch instruction)
  * Predicated execution (control dependence becomes data dependence)

===== Lecture 9 (2/4 Wed.) =====
  * Definition of basic blocks
  * Control flow graph
  * Delayed branching
    * Benefit?
    * What does it eliminate?
    * Downside?
    * Delayed branching in SPARC (with squashing)
    * Backward compatibility with the delay slot
    * What should be filled in the delay slot
    * How to ensure correctness
  * Fine-grained multithreading
    * Fetch from different threads
    * What are the issues (what if the program doesn't have many threads)?
    * CDC 6600
    * Denelcor HEP
    * No dependency checking
    * Inst. from different threads can fill in the bubbles
    * Cost?
  * Simultaneous multithreading
  * Branch prediction
    * Guess what to fetch next.
    * Misprediction penalty
    * Need to guess the direction and target
    * How to perform the performance analysis?
      * Given the branch prediction accuracy and penalty cost, how to compute the cost of a branch misprediction
      * Given the program/number of instructions, percentage of branches, branch prediction accuracy, and penalty cost, how to compute the cost coming from branch mispredictions
        * How many extra instructions are being fetched?
        * What is the performance degradation?
    * How to reduce the miss penalty?
    * Predicting the next address (non PC+4 address)
    * Branch target buffer (BTB)
      * Predicting the address of the branch
    * Global branch history - for directions
    * Can use the compiler to profile and get more info
      * Input set dictates the accuracy
      * Adds time to compilation
    * Heuristics that are common and don't require profiling
      * Might be inaccurate
      * Does not require profiling
    * Static branch prediction
      * Programmer provides pragmas, hinting the likelihood of a taken/not taken branch
      * For example, x86 has the hint bit
    * Dynamic branch prediction
      * Last time predictor
      * Two-bit counter based prediction (see the sketch after this list)
        * One more bit for hysteresis

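A minimal sketch (a single counter for one branch; taken = 1) of the two-bit prediction counter above: the extra bit adds hysteresis, so one mispredicted outcome does not flip the prediction.

<code c>
#include <stdio.h>

static int ctr = 2;                 /* 0,1 = predict not taken; 2,3 = predict taken */

int predict(void) { return ctr >= 2; }

void train(int taken) {             /* saturating increment/decrement */
    if (taken && ctr < 3) ctr++;
    if (!taken && ctr > 0) ctr--;
}

int main(void) {
    int outcomes[] = {1, 1, 0, 1, 1};   /* a mostly-taken branch */
    for (int i = 0; i < 5; i++) {
        printf("predict %d, actual %d\n", predict(), outcomes[i]);
        train(outcomes[i]);
    }
    return 0;
}
</code>
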
===== Lecture 10 (2/6 Fri.) =====
  * Branch prediction accuracy
    * Why is it so important?
      * Difference between 99% accuracy and 98% accuracy
      * Cost of a misprediction when the pipeline is very deep
  * Global branch correlation
    * Some branches are correlated
  * Local branch correlation
    * Some branches can depend on the results of their own past instances
  * Pattern history table
    * Records global taken/not taken results
    * Cost vs. accuracy (what do you record? the PC? just taken/not taken info?)
  * One-level branch predictor
    * What information is used
  * Two-level branch prediction
    * What entries do you keep in the global history?
    * What entries do you keep in the local history?
    * How many tables?
    * Cost when training a table
    * What are the purposes of each table?
    * Potential problems of a two-level history
  * GShare predictor (see the sketch after this list)
    * Global history is hashed with the PC
    * Stores both GHR and PC in one combined piece of information
    * How do you use the information? Why is the XOR result still usable?
  * Warmup cost of the branch predictor
    * Hybrid solution? The fast-warmup predictor is used first, then switch to the slower one.
  * Tournament predictor (Alpha 21264)
  * Predicated execution - eliminates branches
    * What are the tradeoffs?
    * What if the block is big (can lead to executing a lot of useless work)?
    * Allows easier code optimization
      * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
      * Reduces control dependences
    * Needs ISA support
  * Wish branches
    * Compiler generates both predicated and non-predicated code
    * HW decides which one to use
      * Use branch prediction on an easy-to-predict branch
      * Use predicated execution on a hard-to-predict branch
      * Compiler can be more aggressive in optimizing the code
    * What are the tradeoffs (slide #47)?
  * Multi-path execution
    * Execute both paths
    * Can lead to wasted work
    * VLIW
    * Superscalar

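A minimal sketch of GShare indexing (table size and the exact hash are my assumptions): the global history register is XORed with the PC, so the same branch under different histories maps to different two-bit counters.

<code c>
#include <stdio.h>
#include <stdint.h>

#define PHT_BITS 12
static uint8_t  pht[1 << PHT_BITS];  /* 2-bit counters, start at 0 (not taken) */
static uint32_t ghr;                 /* global history of taken/not-taken bits */

static uint32_t index_of(uint32_t pc) {
    return ((pc >> 2) ^ ghr) & ((1u << PHT_BITS) - 1);  /* the GShare hash */
}

int gshare_predict(uint32_t pc) { return pht[index_of(pc)] >= 2; }

void gshare_train(uint32_t pc, int taken) {
    uint32_t idx = index_of(pc);
    if (taken  && pht[idx] < 3) pht[idx]++;   /* saturating update */
    if (!taken && pht[idx] > 0) pht[idx]--;
    ghr = (ghr << 1) | (taken & 1);           /* update the history */
}

int main(void) {
    for (int i = 0; i < 8; i++) {             /* a loop branch: taken 7x, then exit */
        int taken = (i < 7);
        printf("PC 0x400: predict %d, actual %d\n", gshare_predict(0x400), taken);
        gshare_train(0x400, taken);
    }
    return 0;
}
</code>
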
===== Lecture 11 (2/11 Wed.) =====

  * Geometric GHR length for branch prediction
  * Perceptron branch predictor
  * Multi-cycle executions (different functional units take different numbers of cycles)
    * Instructions can retire out-of-order
      * How to deal with this case? Stall? Throw exceptions if there are problems?
  * Exceptions and interrupts
    * When are they handled?
    * Why should some interrupts be handled right away?
  * Precise exceptions
    * Arch. state should be consistent before handling the exception/interrupt
      * Easier to debug (you see the sequential flow when the interrupt occurs)
        * Deterministic
      * Easier to recover from the exception
      * Easier to restart processes
    * How to ensure precise exceptions?
    * Tradeoffs between each method
  * Reorder buffer
    * Reorders results before they are visible to the arch. state
      * Need to preserve the sequential semantics and data
    * What information is in a ROB entry?
    * Where to get the value from (forwarding path? reorder buffer?)
      * Extra logic to check where the youngest instruction/value is
      * Content addressable search (CAM)
        * A lot of comparators
    * Different ways to simplify the reorder buffer
    * Register renaming
      * The same register refers to independent values (lack of registers)
    * Where does the exception happen (after retire)
  * History buffer
    * Update the register file when the instruction completes. Unroll if there is an exception.
  * Future file (commonly used, along with the reorder buffer)
    * Keep two sets of register files
      * An updated (speculative) value, called the future file
      * A backup value (to restore the state quickly)
    * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to the ROB alone)
  * Branch misprediction resembles an exception
    * The difference is that a branch misprediction is not visible to the software
      * Also much more common (say, divide by zero vs. a mispredicted branch)
    * Recovery is similar to exception handling
  * Latency of the state recovery
  * What to do during the state recovery
  * Checkpointing
    * Advantages?

===== Lecture 12 (2/13 Fri.) =====

  * Renaming
  * Register renaming table
  * Predictors (branch predictor, cache line predictor, ...)
  * Power budget (and its importance)
  * Architectural state, precise state
  * Memory dependences are only known dynamically
  * Register state is not shared across threads/processors
  * Memory state is shared across threads/processors
  * How to maintain speculative memory states
  * Write buffers (help simplify the process of checking the reorder buffer)
  * Overall OoO mechanism
    * What are other ways of eliminating dispatch stalls?
    * Dispatch when the sources are ready
    * Retired instructions make the sources available
    * Register renaming (see the sketch after this list)
    * Reservation stations
      * What goes into the reservation station
      * Tags required in the reservation station
    * Tomasulo's algorithm
    * Without precise exceptions, OoO is hard to debug
    * Arch. register ID
    * Examples in the slides
      * Slide 28 --> register renaming
      * Slides 30-35 --> exercise (also on the board)
        * This will be useful for the midterm
    * Register alias table
    * Broadcasting tags
    * Using dataflow

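A minimal sketch (register counts are arbitrary; this is not the full Tomasulo machinery) of a register alias table: each architectural register is remapped to a fresh tag on every write, which removes false (anti/output) dependences while true dependences keep flowing through the tags.

<code c>
#include <stdio.h>

#define NUM_ARCH 8
static int rat[NUM_ARCH];      /* arch reg -> physical reg / tag */
static int next_phys;

void init(void) {
    for (int r = 0; r < NUM_ARCH; r++) rat[r] = next_phys++;
}

/* Rename one instruction "dst <- src1 op src2". */
void rename(int dst, int src1, int src2) {
    int p1 = rat[src1], p2 = rat[src2];   /* read current source mappings */
    rat[dst] = next_phys++;               /* allocate a new dest mapping */
    printf("p%d <- p%d op p%d\n", rat[dst], p1, p2);
}

int main(void) {
    init();
    rename(1, 2, 3);   /* r1 <- r2 op r3 */
    rename(1, 1, 4);   /* r1 <- r1 op r4: reads the old tag, writes a new one */
    return 0;
}
</code>
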
===== Lecture 13 (2/16 Mon.) =====

  * OoO --> restricted dataflow
    * Extracting parallelism
    * What are the bottlenecks?
      * Issue width
      * Dispatch width
      * Parallelism in the program
    * What does it mean to be restricted dataflow?
      * Still visible as a Von Neumann model
    * Where does the efficiency come from?
    * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * Would like to schedule them out of order, but make them visible in order
    * When do you schedule the load/store instructions?
    * Can we predict if loads/stores are dependent?
    * This is one of the most complex structures in load/store handling
    * What information can be used to predict these load/store optimizations?
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle a misprediction/recovery
    * OoO + branch prediction?
    * Speculatively update the history register
      * When do you update the GHR?
  * Token dataflow arch.
    * What are tokens?
    * How to match tokens
    * Tagged token dataflow arch.
    * What are the tradeoffs?
    * Difficulties?

===== Lecture 14 (2/18 Wed.) =====

  * SISD/SIMD/MISD/MIMD
  * Array processor
  * Vector processor
  * Data parallelism
    * Where does the concurrency arise?
  * Differences between an array processor and a vector processor
  * VLIW
  * Compactness of an array processor
  * A vector processor operates on a vector of data (rather than a single datum (scalar))
    * Vector length (also applies to array processors)
    * No dependency within a vector --> can have a deep pipeline
    * Highly parallel (both instruction level (ILP) and memory level (MLP))
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
    * What do the functional units look like? Deep pipelines and simpler control.
    * CRAY-1 is an example of a vector processor
    * Memory access patterns in a vector processor
      * How do the memory accesses benefit the memory bandwidth?
      * Memory level parallelism
      * Stride length vs. the number of banks
        * The stride length should be relatively prime to the number of banks (see the sketch after this list)
      * Tradeoffs between row major and column major --> how can the vector processor deal with the two?
    * How to calculate the efficiency and performance of vector processors
    * What if there are multiple memory ports?
    * Gather/scatter allows the vector processor to be a lot more programmable (i.e. gather data for parallelism)
      * Helps handle sparse matrices
    * Conditional operation
    * Structure of vector units
    * How to automatically parallelize code through the compiler?
      * This is a hard problem. The compiler does not know the memory addresses.
  * What do we need to ensure for both vector and array processors?
  * Sequential bottleneck
    * Amdahl's law
  * Intel MMX --> an example of Intel's approach to SIMD
    * No VLEN; the opcode defines the length
    * Stride is one in MMX
    * Intel SSE --> modern version of MMX

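A quick experiment (parameters are made up) showing why the stride should be relatively prime to the number of banks: stride 8 with 8 banks hits one bank over and over, while stride 7 cycles through all 8 banks and gets full memory-level parallelism.

<code c>
#include <stdio.h>

void count_banks(int stride, int banks, int accesses) {
    int used[64] = {0}, distinct = 0;
    for (int i = 0; i < accesses; i++) {
        int b = (i * stride) % banks;         /* bank of the i-th vector element */
        if (!used[b]) { used[b] = 1; distinct++; }
    }
    printf("stride %d, %d banks -> %d distinct banks used\n",
           stride, banks, distinct);
}

int main(void) {
    count_banks(8, 8, 64);   /* -> 1 bank: fully serialized */
    count_banks(7, 8, 64);   /* -> 8 banks: fully parallel */
    return 0;
}
</code>
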
===== Lecture 15 (2/20 Fri.) =====
  * GPU
    * Warp/wavefront
      * A bunch of threads sharing the same PC
    * SIMT
    * Lanes
    * FGMT + massively parallel
      * Tolerates long latency
    * Warp based SIMD vs. traditional SIMD
  * SPMD (programming model)
    * A single program operates on multiple data
      * Can have synchronization points
    * Many scientific applications are programmed in this manner
  * Control flow problem (branch divergence)
    * Masking (in a branch, mask threads that should not execute that path)
    * Lowers SIMD efficiency
    * What if you have layers of branches?
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
  * VLIW
    * Wide fetch
    * IA-64
    * Tradeoffs
      * Simple hardware (no dynamic scheduling, no dependency checking within a VLIW)
      * A lot of load on the compiler
  * Decoupled access/execute
    * Limited form of OoO
    * Tradeoffs
    * How to steer the instructions (determine dependency/stalling)?
    * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
  * Stage processing

===== Lecture 16 (2/23 Mon.) =====
  * Systolic arrays
    * Processing elements transform data in chains
    * Can be arrays of multi-dimensional processing elements
    * Developed for image processing (for example, convolution)
    * Can be used to break stages in pipelined programs, using a set of queues and processing elements
    * Can enable high concurrency and is good for regular programs
    * Very special purpose
    * The WARP computer
  * Static instruction scheduling
    * How do we find the next instruction to execute?
  * Live-in and live-out
  * Basic blocks
    * Rearranging instructions in the basic block
    * Code movement from one basic block to another
  * Straight line code
  * Independent instructions
    * How to identify independent instructions
  * Atomicity
  * Trace scheduling
    * Side entrance
    * Fix-up code
    * How scheduling is done
  * Instruction scheduling
    * Prioritization heuristics
  * Superblock
    * Traces with no side entrances
  * Hyperblock
  * BS-ISA
  * Tradeoffs between trace cache/hyperblock/superblock/BS-ISA

===== Lecture 17 (2/25 Wed.) =====
  * IA-64
    * EPIC
  * IA-64 instruction bundle
    * Multiple instructions in the bundle along with the template bits
    * Template bits
    * Stop bits
    * Non-faulting loads and exception propagation
  * Aggressive ST-LD reordering
  * Physical memory system
  * Ideal pipelines
  * Ideal cache
    * More capacity
    * Fast
    * Cheap
    * High bandwidth
  * DRAM cell
    * Cheap
    * Senses the perturbation through the sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
    * Expensive
    * Fast (easier to sense the value in the cell)
  * Memory bank
    * Read access sequence
    * DRAM: Activate -> Read -> Precharge (if needed)
    * What dominates the access latency for DRAM and SRAM
  * Scaling issue
    * Hard to scale the cell down to smaller sizes
  * Memory hierarchy
    * Prefetching
    * Caching
  * Spatial and temporal locality
    * Caches can exploit these
    * Recently used data is likely to be accessed
    * Nearby data is likely to be accessed
  * Caching in a pipelined design
  * Cache management
    * Manual
      * Data movement is managed manually
        * Embedded processors
        * GPU scratchpad
    * Automatic
      * HW manages data movement
  * Latency analysis
    * Based on the hit and miss status, next level access time (if miss), and the current level access time
  * Cache basics
    * Set/block (line)/placement/replacement/direct mapped vs. associative caches/etc.
  * Cache access
    * How to access tag and data (in parallel vs. serially)
    * How do the tag and index get used? (see the sketch after this list)
    * Modern processors perform serial access for higher level caches (L3 for example) to save power
  * Cost and benefit of having more associativity
    * Given the associativity, which block should be replaced if it is full?
    * Replacement policies
      * Random
      * Least recently used (LRU)
      * Least frequently used
      * Least costly to refetch
      * etc.
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly
    * Approximate LRU
      * Victim and next-victim policy

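A minimal sketch of how the tag and index are carved out of an address (the cache geometry here - 32 KB, 64 B blocks, 4-way - is an assumption for illustration): the offset selects the byte within the block, the index selects the set, and the remaining high bits form the tag that is compared on lookup.

<code c>
#include <stdio.h>
#include <stdint.h>

int main(void) {
    const uint32_t block_bytes = 64;                   /* -> 6 offset bits */
    const uint32_t sets = 32 * 1024 / (64 * 4);        /* 128 sets -> 7 index bits */
    uint32_t addr = 0xDEADBEEF;
    uint32_t offset = addr % block_bytes;
    uint32_t index  = (addr / block_bytes) % sets;
    uint32_t tag    = addr / block_bytes / sets;       /* everything above index */
    printf("addr 0x%08X -> tag 0x%X, index %u, offset %u\n",
           addr, tag, index, offset);
    return 0;
}
</code>
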
===== Lecture 18 (2/27 Fri.) =====
  * Tag store and data store
  * Cache hit rate
  * Average memory access time (AMAT); see the worked example after this list
  * AMAT vs. stall time
  * Cache basics
    * Direct mapped vs. associative caches
    * Set/block (line)/placement/replacement
    * How do the tag and index get used?
  * Full associativity
  * Set associative caches
    * Insertion, promotion, eviction (replacement)
  * Various replacement policies
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly
    * Approximate LRU
      * Victim and next-victim policy
  * Set thrashing
    * Working set is bigger than the associativity
  * Belady's OPT
    * Is this optimal?
    * Complexity?
  * DRAM as a cache for disk
  * Handling writes
    * Write back
      * Needs a modified (dirty) bit to make sure accesses to the data get the updated data
    * Write through
      * Simpler, no consistency issues
  * Sectored caches
    * Use subblocks
      * Lower bandwidth
      * More complex
  * Instruction vs. data caches
    * Where to place instructions
      * Unified vs. separated
    * In the first level cache
  * Cache access
    * First level access
    * Second level access
      * When to start the second level access
  * Cache performance
    * Capacity
    * Block size
    * Associativity
  * Classification of cache misses

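A worked AMAT example (all numbers made up) for a two-level hierarchy: AMAT = hit time + miss rate x miss penalty, applied recursively so that the L1 miss penalty is itself the AMAT of L2.

<code c>
#include <stdio.h>

int main(void) {
    double l1_hit = 1,  l1_miss_rate = 0.05;   /* cycles, fraction of accesses */
    double l2_hit = 10, l2_miss_rate = 0.20;
    double mem    = 200;                       /* main memory latency, cycles */
    double l2_amat = l2_hit + l2_miss_rate * mem;      /* 10 + 0.2*200 = 50 */
    double amat    = l1_hit + l1_miss_rate * l2_amat;  /* 1 + 0.05*50  = 3.5 */
    printf("AMAT = %.2f cycles\n", amat);
    return 0;
}
</code>
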
===== Lecture 19 (03/02 Mon.) =====
  * Subblocks
  * Victim cache
    * Small, but fully assoc. cache behind the actual cache
    * Caches the evicted cache blocks
    * Prevents ping-ponging
  * Pseudo associativity
    * Simpler way to implement an associative cache
  * Skewed assoc. cache
    * Different hashing function for each way
  * Restructuring the data access pattern
    * Order of loop traversal
    * Blocking (see the sketch after this list)
  * Memory level parallelism
    * A parallel cache miss is less costly per miss than serialized misses
  * MSHR
    * Keeps track of pending cache misses
      * Think of this as the load/store buffer-ish for the cache
    * What information goes into the MSHR?
    * When do you access the MSHR?
  * Memory banks
  * Shared caches in multi-core processors

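A minimal sketch of restructuring an access pattern by blocking (tiling); the tile size is an arbitrary assumption. The inner loops walk a small tile that fits in the cache, so each element is reused before it is evicted.

<code c>
#include <stdio.h>
#define N    256
#define TILE 32

static double a[N][N], b[N][N];

void transpose_blocked(void) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)   /* the TILE x TILE tile */
                for (int j = jj; j < jj + TILE; j++)   /* stays cache-resident */
                    b[j][i] = a[i][j];
}

int main(void) {
    a[3][5] = 1.0;
    transpose_blocked();
    printf("%.1f\n", b[5][3]);   /* prints 1.0 */
    return 0;
}
</code>
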
===== Lecture 20 (03/04 Wed.) =====
  * Virtual vs. physical memory
  * The system's management of memory
    * Benefits
  * Problem: physical memory has limited size
  * Mechanisms: indirection, virtual addresses, and translation
  * Demand paging
  * Physical memory as a cache
  * Tasks of system SW for VM
  * Serving a page fault
  * Address translation (see the sketch after this list)
  * Page table
    * PTE (page table entry)
  * Page replacement algorithms
    * CLOCK algo.
    * Inverted page table
  * Page size trade-offs
  * Protection
  * Multi-level page tables
  * x86 implementation of the page table
  * TLB
    * Handling misses
  * When to do address translation?
  * Homonyms and synonyms
    * Homonym: the same VA maps to different PAs across multiple processes
    * Synonym: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
  * Virtually indexed vs. physically indexed
  * Virtually tagged vs. physically tagged
  * Virtually indexed, physically tagged
  * Can these create problems when we have a cache?
  * How to eliminate these problems?
  * Page coloring
  * Interaction between the cache and the TLB

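A minimal sketch of address translation (4 KB pages and a flat one-level page table with no valid/permission bits; both are simplifying assumptions): split the VA into a virtual page number and an offset, look the VPN up, and keep the offset unchanged.

<code c>
#include <stdio.h>
#include <stdint.h>

#define PAGE_BITS 12                       /* 4 KB pages */

uint32_t translate(uint32_t va, const uint32_t *page_table) {
    uint32_t vpn    = va >> PAGE_BITS;               /* virtual page number */
    uint32_t offset = va & ((1u << PAGE_BITS) - 1);  /* byte within the page */
    uint32_t pfn    = page_table[vpn];               /* VPN -> PFN lookup */
    return (pfn << PAGE_BITS) | offset;
}

int main(void) {
    static uint32_t pt[16] = { [2] = 7 };  /* map VPN 2 -> PFN 7 */
    uint32_t va = (2u << PAGE_BITS) | 0x123;
    printf("VA 0x%05X -> PA 0x%05X\n", va, translate(va, pt)); /* 0x02123 -> 0x07123 */
    return 0;
}
</code>
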
===== Lecture 21 (03/23 Mon.) =====

  * DRAM scaling problem
  * Demands/trends affecting the main memory
    * More capacity
    * Low energy
    * More bandwidth
    * QoS
  * ECC in DRAM
  * Multi-porting
    * Virtual multi-porting
      * Time-share the port; not too scalable, but cheap
    * True multi-porting
  * Multiple cache copies
  * Alignment
  * Banking
    * Can have bank conflicts
    * Extra interconnect across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regFile in a GPU is also banked, but mainly for the purpose of reducing complexity)
  * Bank mapping
    * How to avoid bank conflicts?
  * Channel mapping
    * Address mapping to minimize bank conflicts
    * Page coloring
      * Virtual-to-physical mapping that can help reduce conflicts
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
    * Sense amplifier
    * Bit lines
    * Word lines
  * DRAM (1T) vs. SRAM (6T)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effects of the address mapping on memory interleaving
    * Effects of the program's memory access patterns on interleaving
  * DRAM bank
    * To minimize the cost of interleaving (shares the data bus and the command bus)
  * DRAM rank
    * Minimizes the cost of the chips (a bundle of chips operated together)
  * DRAM channel
    * An interface to DRAM, each with its own ranks/banks
  * DRAM chip
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data in DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits from an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping (see the sketch after this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM

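A minimal sketch (the field widths are arbitrary assumptions) contrasting the two address mappings above: row interleaving keeps consecutive cache blocks in the same row of the same bank (good for row buffer hits), while cache block interleaving spreads consecutive blocks across banks (good for bank-level parallelism).

<code c>
#include <stdio.h>
#include <stdint.h>

#define BLOCK_BITS 6   /* 64 B cache blocks */
#define COL_BITS   5   /* column bits above the block offset */
#define BANK_BITS  3   /* 8 banks */

void map(uint32_t addr) {
    uint32_t blk = addr >> BLOCK_BITS;
    /* Row interleaved:         row | bank | column | offset */
    uint32_t ri_bank  = (blk >> COL_BITS) & ((1 << BANK_BITS) - 1);
    /* Cache block interleaved: row | column | bank | offset */
    uint32_t cbi_bank = blk & ((1 << BANK_BITS) - 1);
    printf("addr 0x%06X: row-interleaved bank %u, block-interleaved bank %u\n",
           addr, ri_bank, cbi_bank);
}

int main(void) {
    for (uint32_t a = 0; a < 4 * 64; a += 64) map(a); /* 4 consecutive blocks */
    return 0;   /* same bank every time vs. banks 0,1,2,3 */
}
</code>
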
===== Lecture 22 (03/25 Wed.) =====

  * Flash controller
  * Flash memory
  * Garbage collection in flash
  * Overheads in flash memory
    * Erase (off the critical path, but takes a long time)
  * Different types of DRAM
  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifier
    * How do they work?
  * Double data rate
  * Subarray
  * RowClone
    * Moving bulk data from one row to others
    * Lower latency and bandwidth consumption when copying/zeroing out data
  * TL-DRAM
    * Far segment
    * Near segment
    * What causes the long latency
    * Benefits of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
  * List of commands to read/write data in DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits from an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS
      * Tries to maximize the row buffer hit rate
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
    * PAR-BS
      * Key benefits
      * Stall time
      * Shortest job first
    * STFM
    * ATLAS
    * TCM
      * Key benefits
      * Configurability
      * Fairness + performance at the same time
      * Robustness issues
  * Open row policy
  * Closed row policy
  * QoS
    * QoS issues in memory scheduling
    * Fairness
    * Performance guarantees

===== Lecture 23 (03/27 Fri.) =====

  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keep track of each application's characteristics
        * Dedicating a channel might waste bandwidth
        * Need OS support to determine the channel bits
    * Source throttling
      * A controller throttles the core depending on the performance target
      * Example: fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold based solution: hard to configure
    * App/thread scheduling
      * Critical threads usually stall the progress
    * Designing a DRAM controller
      * Has to handle the normal DRAM operations
        * Read/write/refresh/all the timing constraints
      * Keep track of resources
      * Assign priorities to different requests
      * Manage requests to banks
    * Self-optimizing controller
      * Use machine learning to improve the DRAM controller
    * A-DRM
      * Architecture-aware DRAM
  * Multithreading
    * Synchronization
    * Pipelined programs
      * Producer-consumer model
    * Critical path
    * Limiter threads
    * Prioritization between threads
  * Different power modes in DRAM
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Has to spend a longer time refreshing with bigger DRAM
    * Distributed refresh: stagger refreshes every 64 ms in a distributed manner
      * As opposed to burst refresh (long pause time)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
      * Profile the rows
        * High temperature changes the retention time: need online profiling
  * Bloom filter (see the sketch after this list)
    * Represents set membership
    * Approximate
    * Can contain false positives
      * Better/more hash functions help reduce this

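A minimal Bloom filter sketch (the filter size and the two hash functions are my own simplistic choices) matching the properties above: membership tests can give false positives but never false negatives, which is why RAIDR can safely use one to track retention-time bins.

<code c>
#include <stdio.h>
#include <stdint.h>

#define M 64                       /* bits in the filter */
static uint64_t bits;

static unsigned h1(uint32_t x) { return (x * 2654435761u) % M; }
static unsigned h2(uint32_t x) { return ((x ^ (x >> 7)) * 40503u) % M; }

void insert(uint32_t x) { bits |= (1ull << h1(x)) | (1ull << h2(x)); }

int maybe_member(uint32_t x) {     /* 1 = "possibly in set", 0 = "definitely not" */
    return ((bits >> h1(x)) & 1) && ((bits >> h2(x)) & 1);
}

int main(void) {
    insert(100); insert(200);      /* e.g., weak rows that need frequent refresh */
    printf("100 -> %d (inserted, always 1)\n", maybe_member(100));
    printf("300 -> %d (0, or 1 on a false positive)\n", maybe_member(300));
    return 0;
}
</code>
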
===== Lecture 24 (03/30 Mon.) =====

  * Simulation
    * Drawbacks of RTL simulation
      * Time consuming
      * Complex to develop
      * Hard to perform design exploration
    * Explore the design space quickly
    * Match the behavior of existing systems
    * Tradeoffs: speed, accuracy, flexibility
    * High-level simulation vs. detailed simulation
      * High-level simulation is faster, but has lower accuracy
  * Controllers that work on multiple types of cores
    * Design problem: how to find a good scheduling policy on its own?
    * Self-optimizing memory controller: using machine learning
      * Can adapt to the applications
      * The complexity is very high
  * Tolerating latency can be costly
    * The instruction window is complex
      * The benefit also diminishes
    * Designing the buffers can be complex
    * A simpler way than OoO to tolerate memory latency is desirable
  * Different sources that cause the core to stall in OoO
    * Cache misses
    * Note that a stall happens if the inst. window is full
  * Scaling the instruction window size is hard
    * It is better (less complex) to make the window more efficient
  * Runahead execution
    * Tries to obtain MLP w/o increasing the instruction window
    * Run ahead (i.e. execute ahead) when there is a long-latency memory instruction
      * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
      * Execute future instructions to generate accurate prefetches
      * Allows future data to be in the cache
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing in the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)

===== Lecture 25 (4/1 Wed.) =====

  * More on runahead execution
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing in the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
    * Cost and benefit of runahead execution (slide number 27)
    * Runahead can have inefficiencies
      * Runahead periods that are useless
        * Get rid of useless, inefficient periods
    * What if there is a dependent cache miss?
      * Cannot be parallelized in vanilla runahead
      * Can predict the value of the dependent load
        * How to predict the address of the load
          * Delta value information
          * Stride predictor
          * AVD prediction
  * Questions regarding prefetching
    * What to prefetch
    * When to prefetch
    * How do we prefetch
    * Where to prefetch from
  * Prefetching can cause thrashing (evict a useful block)
  * Prefetches can also be useless (not being used)
    * Need to be efficient
  * Can cause memory bandwidth problems in GPUs
  * Prefetch the whole block, more than one block, or a subblock?
    * Each one of them has pros and cons
    * Big prefetches are more likely to waste bandwidth
    * Commonly done at a cache block granularity
  * Prefetch accuracy: fraction of useful prefetches out of all the prefetches
  * Prefetchers usually predict based on
    * Past knowledge
    * Compiler hints
  * A prefetcher has to prefetch at the right time
    * A prefetch that is too early might get evicted
      * It might also evict other useful data
    * A prefetch that is too late does not hide the whole memory latency
  * Previous prefetches at the same PC can be used as the history
  * Previous demand requests are also good information to use for prefetches
  * Prefetch buffer
    * Place the prefetched data there to avoid thrashing
      * Can treat demand/prefetch requests separately
      * More complex
  * Generally, demand blocks are more important
    * This means eviction should prefer prefetch blocks as opposed to demand blocks
  * Tradeoffs in where to place the prefetcher
    * Look at L1 hits and misses
    * Look at L1 misses only
    * Look at L2 misses
    * Different access patterns affect accuracy
      * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only seeing L2 misses)
  * Software vs. hardware vs. execution based prefetching
    * Software: the ISA provides prefetch instructions, and software utilizes them
      * What information is useful
      * How to make sure the prefetch is timely
      * What if you have a pointer based structure?
        * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot issue the next prefetch timely enough)
          * Can be solved by hinting the next-next and/or next-next-next address
    * Hardware: identify the pattern and prefetch
    * Execution driven: opportunistically try to prefetch (runahead, dual-core execution)
  * Stride prefetcher (see the sketch after this list)
    * Predicts strides, which are common in many programs
    * Cache block based or instruction based
  * Stream buffer design
    * Buffers the stream of accesses (next addresses)
    * Uses the information to prefetch
  * What affects prefetcher performance
    * Prefetch distance
      * How far ahead should we prefetch?
    * Prefetch degree
      * How many prefetches do we issue?
  * Prefetcher performance
    * Coverage
      * Out of all demand misses, how many are covered by prefetches?
    * Accuracy
      * Out of all the prefetch requests, how many actually get used?
    * Timeliness
      * How much memory latency can we hide with the prefetch requests?
    * Cache pollution
      * How many extra demand misses did the prefetcher cause?
        * Hard to quantify

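A minimal sketch of an instruction (PC) based stride prefetcher; the one-entry table, the confidence threshold of 2, and the prefetch distance of 4 are all my assumptions. The idea: detect a repeating delta in a load's addresses, and once confident, prefetch ahead of the demand stream.

<code c>
#include <stdio.h>
#include <stdint.h>

struct entry { uint32_t last_addr; int32_t stride; int conf; };
static struct entry e;            /* one load PC's entry, for illustration */

void observe(uint32_t addr) {
    int32_t delta = (int32_t)(addr - e.last_addr);
    if (delta == e.stride) { if (e.conf < 3) e.conf++; }  /* stride repeats */
    else { e.stride = delta; e.conf = 0; }                /* retrain */
    e.last_addr = addr;
    if (e.conf >= 2)              /* confident: prefetch 4 strides ahead */
        printf("  prefetch 0x%X\n", addr + 4 * e.stride);
}

int main(void) {
    for (uint32_t a = 0x1000; a < 0x1000 + 6 * 64; a += 64) {
        printf("demand load 0x%X\n", a);   /* a streaming, stride-64 load */
        observe(a);
    }
    return 0;
}
</code>
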
===== Lecture 26 (4/3 Fri.) =====

  * Feedback directed prefetcher
    * Uses the result of the prefetcher as feedback to the prefetcher
      * With accuracy, timeliness, and pollution information
  * Markov prefetcher
    * Prefetches based on the previous history
    * Uses a Markov model to predict
    * Pros: can cover arbitrary patterns (easy for linked list traversals or trees)
    * Downside: high cost; cannot help with compulsory misses (no history)
  * Content directed prefetching
    * Identify content in memory that looks like pointers (which are used as the addresses to prefetch)
    * Not very efficient (hard to figure out which block contains the pointer)
      * Software can give hints
  * Correlation table
    * Address correlation
  * Execution based prefetchers
    * Helper thread/speculative thread
      * Use another thread to pre-execute a program
    * Can be software based or hardware based
    * Discovers misses before the main program (to prefetch data in a timely manner)
    * How do you construct the helper thread?
    * Preexecute instructions (one example of how to initialize a speculative thread), slide 9
    * Thread-based pre-execution
  * Error tolerance
  * Solutions to errors
    * Tolerate errors
      * New interfaces, new designs
    * Eliminate or minimize errors
      * New technology, system-wide rethinking
    * Embrace errors
      * Map data that can tolerate errors to error-prone areas
  * Hybrid memory systems
    * Combining multiple memory technologies together
  * What can emerging technologies help with?
    * Scalability
    * Lowering the cost
    * Energy efficiency
  * Possible solutions to the scaling problem
    * Less-leaky DRAM
    * Heterogeneous DRAM (TL-DRAM, etc.)
    * Adding more functionality to DRAM
    * Denser design (3D stacking)
    * Different technology
      * NVM
  * Charge vs. resistive memory
    * How is data written?
    * How is the data read?
  * Non-volatile memory
    * Resistive memory
      * PCM
        * Inject current to change the phase
        * Scales better than DRAM
          * Multiple bits per cell
            * Wider resistance range
        * No refresh is needed
        * Downsides: latency and write endurance
      * STT-MRAM
        * Inject current to change the polarity
      * Memristor
        * Inject current to change the structure
    * Pros and cons of the different technologies
    * Persistency - data stays there even without power
      * Unified memory and storage management (persistent data structures) - single level store
        * Improves energy and performance
        * Simplifies the programming model
  * Different design options for DRAM + NVM
    * DRAM as a cache
    * Place some data in DRAM and other data in PCM
      * Based on the characteristics
        * Frequently accessed data that needs lower write latency goes in DRAM

===== Lecture 27 (4/6 Mon.) =====
  * Flynn's taxonomy
  * Parallelism
    * Reduces power consumption (P ~ CV^2f)
    * Better cost efficiency and easier to scale
    * Improves dependability (in case one core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Data level parallelism
    * Task level parallelism
  * Task level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement)
  * Loosely coupled multiprocessors
    * No shared global address space
      * Message passing to communicate between the different sources
    * Simple to manage memory
  * Tightly coupled multiprocessors
    * Shared global address space
    * Need to ensure consistency of data
    * Programming issues
  * Hardware-based multithreading
    * Coarse grained
    * Fine grained
    * Simultaneous: dispatch instructions from multiple threads at the same time
  * Parallel speedup
    * Superlinear speedup
  * Utilization, redundancy, efficiency
  * Amdahl's law (see the worked example after this list)
    * Maximum speedup
    * Parallel portion is not perfect
      * Serial bottleneck
      * Synchronization cost
      * Load balance
        * Some threads have more work, requiring more time to reach the sync. point
  * Critical sections
    * Enforce mutually exclusive access to shared data
  * Issues in parallel programming
    * Correctness
    * Synchronization
    * Consistency

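A worked Amdahl's law example (the parallel fraction is made up): speedup = 1 / ((1-p) + p/n), showing how the serial fraction caps the speedup no matter how many cores are added.

<code c>
#include <stdio.h>

int main(void) {
    double p = 0.95;                         /* parallelizable fraction */
    int cores[] = {1, 4, 16, 64, 1024};
    for (int i = 0; i < 5; i++) {
        int n = cores[i];
        printf("n=%4d -> speedup %.2f\n", n, 1.0 / ((1 - p) + p / n));
    }
    return 0;   /* approaches the limit 1/(1-p) = 20, never reaches it */
}
</code>
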
===== Lecture 28 (4/8 Wed.) =====
  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure the shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Dekker's algorithm (see the sketch after this list)
    * Inconsistency -- the two processors did NOT see the same order of operations to memory
  * Sequential consistency
    * Multiple correct global orders
    * Two issues:
      * Too conservative/strict
      * Performance limiting
  * Weak consistency: global ordering only at synchronization points
    * The programmer hints where the synchronizations are
    * Memory fence
    * More burden on the programmers
  * Cache coherence
    * Can be done at the software level or the hardware level
  * Snoop-based coherence
    * A simple protocol with two states, broadcasting reads/writes on a bus
  * Maintaining coherence
    * Needs to provide 1) write propagation and 2) write serialization
    * Update vs. invalidate
  * Two cache coherence methods
    * Snoopy bus
      * Bus based, single point of serialization
      * More efficient with a small number of processors
      * Processors snoop other caches' read/write requests to keep the cache blocks coherent
    * Directory
      * Single point of serialization per block
      * The directory coordinates coherence
      * More scalable
      * The directory keeps track of where the copies of each block reside
        * Supplies data on a read
        * Invalidates the block on a write
        * Has an exclusive state

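A sketch of the Dekker-style flag handshake discussed above, reduced to its core (my own simplification, using C11 seq_cst atomics; compile with -pthread): under sequential consistency at most one thread can enter, because each thread's store to its flag is globally ordered before its load of the other flag. On a weaker ordering (e.g., relaxed atomics) both loads could see 0 and both threads could enter, which is exactly the inconsistency the lecture points at.

<code c>
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

atomic_int flag0, flag1;        /* "I want to enter" flags, start at 0 */
int in_critical;                /* how many threads got in */

void *t0(void *arg) {
    atomic_store(&flag0, 1);            /* announce intent (seq_cst) */
    if (atomic_load(&flag1) == 0)       /* other thread not interested? */
        in_critical++;
    return NULL;
}

void *t1(void *arg) {
    atomic_store(&flag1, 1);
    if (atomic_load(&flag0) == 0)
        in_critical++;
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, t0, NULL);
    pthread_create(&b, NULL, t1, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("threads in critical section: %d (0 or 1 under SC)\n", in_critical);
    return 0;
}
</code>
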
===== Lecture 29 (4/10 Fri.) =====
  * MSI coherence protocol
    * The problem: unnecessary broadcasts of invalidations
  * MESI coherence protocol
    * Adds the exclusive state to MSI: this is the only cached copy and it is clean
    * Multiple invalidation tradeoffs
    * Problem: memory can be unnecessarily updated
    * A possible owner state (MOESI)
  * Tradeoffs between snooping and directory based coherence protocols
    * Slide 31 has a good summary
  * Directory: data structures
    * Bit vectors vs. linked lists
  * Scalability of directories
    * Size? Latency? Thousands of nodes? Best of both snooping and directory?

===== Lecture 30 (4/13 Mon.) =====
  * In-memory computing
  * Design goals of DRAM
  * DRAM structures
    * Banks
    * Capacitors and sense amplifiers
    * Trade-offs b/w the number of sense amps and cells
    * Width of the bank I/O vs. the row size
  * DRAM operations
    * ACTIVATE, READ/WRITE, and PRECHARGE
  * Trade-offs
    * Latency
    * Bandwidth: chip vs. rank vs. bank
      * What's the benefit of having 8 chips?
    * Parallelism
  * RowClone
    * What are the problems?
    * Copying b/w two rows that share the same sense amplifiers
    * System software support
  * Bitwise AND/OR
===== Lecture 31 (4/15 Wed.) =====

  * Application slowdown
  * Interference between different applications
    * An application's performance depends on the other applications it is running with
  * Predictable performance
    * Why is it important?
    * Applications that need predictability
    * How to predict the performance?
      * What information is useful?
      * What needs to be guaranteed?
      * How to estimate the performance when running with others?
        * Easy, just measure the performance while it is running.
      * How to estimate the performance when the application is running by itself?
        * Hard if there is no profiling.
      * The relationship between the memory service rate and performance.
        * Key assumption: applications are memory bound
    * Behavior of memory-bound applications
      * With and without interference
  * Memory phase vs. compute phase
  * MISE (see the sketch after this list)
    * Estimating slowdown using request service rates
    * Inaccuracy when measuring the request service rate alone
    * Non-memory-bound applications
    * Control slowdown and provide a soft guarantee
  * Taking the shared cache into account
    * MISE model + cache resource management
    * Auxiliary tag store
      * Separate tag store for different cores
    * Cache access rate alone and shared as the metrics to estimate slowdown
  * Cache partitioning
    * How to determine the partitioning
      * Utility based cache partitioning
      * Others
  * Maximum slowdown and fairness metrics

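A tiny sketch of the MISE-style estimate for a memory-bound application (the service rates are made up; the real mechanism measures the alone rate by periodically giving the application highest priority): slowdown is approximated as the ratio of the request service rates.

<code c>
#include <stdio.h>

int main(void) {
    double rate_alone  = 8.0e6;   /* requests/s when prioritized (estimated alone) */
    double rate_shared = 5.0e6;   /* requests/s when running with others */
    printf("estimated slowdown = %.2f\n", rate_alone / rate_shared); /* 1.60 */
    return 0;
}
</code>
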
===== Lecture 32 (4/20 Mon.) =====

  * Heterogeneous systems
    * Asymmetric cores: different types of cores on the chip
      * Each of these cores is optimized for different workloads/requirements/goals
      * Multiple special purpose processors
      * Flexible and can adapt to workload behavior
      * Disadvantages: complex and high overhead
    * Examples: CPU-GPU systems, heterogeneity in execution models
    * Heterogeneous resources
      * Example: reliable and non-reliable DRAM in the same system
  * Key problems in modern systems
    * Memory system
    * Efficiency
    * Predictability
    * Asymmetric design can help solve these problems
  * Serialized code sections
    * Bottleneck in multicore execution
    * Parallelizable vs. serial portion
    * Accelerating critical sections
    * Cache ping-ponging
      * Synchronization latency
    * Symmetric vs. asymmetric design
  * Large cores + small cores
    * Core asymmetry
  * Amdahl's law with heterogeneous cores
  * Parallel bottlenecks
    * Resource contention
      * Depends on what is running
  * Accelerated critical sections
    * Ship critical sections to the large core
    * Small modifications and low overhead
    * False serialization might become the bottleneck
    * Can reduce parallel throughput
    * Effect on private cache misses and shared cache misses

===== Lecture 33 (4/27 Mon.) =====

  * Interconnects
    * Connecting multiple components together
    * Goals: scalability, flexibility, performance, and energy efficiency
  * Metrics: performance, bandwidth, bisection bandwidth, cost, energy efficiency, system performance, contention, latency
    * Saturation point
      * Saturation throughput
  * Topology
    * How to wire components together; affects routing, throughput, latency
    * Bus: all nodes connected to a single shared link
      * Hard to increase frequency and bandwidth, poor scalability, but simple
    * Point-to-point
      * Low contention and potentially low latency. Costly, not scalable, and hard to wire.
    * Crossbar
      * No contention. Requests between different src/dest pairs can be sent concurrently. Costly.
    * Multistage logarithmic network
      * Indirect network; low contention; multiple requests can be sent concurrently. More scalable compared to a crossbar.
      * Circuit switching
      * Omega network, delta network
      * Butterfly network
      * Intermediate switches between sources and destinations
    * Switching vs. topology
    * Ring
      * Each node connected to two other nodes, forming a ring
      * Low overhead, high latency, not as scalable
      * Unidirectional ring and bi-directional ring
    * Hierarchical rings
      * Layers of rings. More scalable, lower latency.
      * Bridge routers connect multiple rings together
    * Mesh
      * 4 input and output ports
      * More bisection bandwidth and more scalable
      * Easy to lay out
      * Path diversity
      * Routers are more complex
    * Tree
      * Another hierarchical topology
      * Specialized topology
      * Good for local traffic
      * Fat tree: higher levels have more bandwidth
      * CM-5 fat tree
        * Fat tree with 4x2 switches
    * Hypercube
      * N-dimensional cube
      * Caltech Cosmic Cube
      * Very complex
  * Routing algorithms
    * How does a message get sent from source to destination?
    * Static or adaptive
    * Handling contention
      * Buffering helps handle contention, but adds complexity
    * Three types of routing algorithms
      * Deterministic: always takes the same path (see the sketch after this list)
      * Oblivious: takes different paths without taking the state of the network into account
        * For example, Valiant's algorithm
      * Adaptive: takes different paths, taking the state of the network into account
        * Non-minimal adaptive routing vs. minimal adaptive routing
      * Minimal path: a path that has the minimum number of hops
  * Buffering and flow control
    * How to store within the network
    * Handling oversubscription
    * Source throttling
    * Bufferless vs. buffered crossbars
    * Buffer overflow
    * Bufferless deflection routing
      * Deflect packets when there is contention
      * Hot-potato routing

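A minimal sketch of deterministic dimension-order (XY) routing on a mesh, a common example of the "always takes the same path" class above: route fully in X first, then in Y, so every packet between the same src/dest pair takes the same minimal path.

<code c>
#include <stdio.h>

void route_xy(int sx, int sy, int dx, int dy) {
    int x = sx, y = sy;
    printf("(%d,%d)", x, y);
    while (x != dx) { x += (dx > x) ? 1 : -1; printf(" -> (%d,%d)", x, y); }
    while (y != dy) { y += (dy > y) ? 1 : -1; printf(" -> (%d,%d)", x, y); }
    printf("\n");
}

int main(void) {
    route_xy(0, 0, 2, 3);   /* hop in X to column 2, then in Y to row 3 */
    return 0;
}
</code>
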