Differences

This shows you the differences between two versions of the page.

--- buzzword [2014/03/31 18:15]
rachata
+++ buzzword [2015/01/28 19:18]
rachata
@@ Line 1: / Line 1: @@
 ====== Buzzwords ======
-Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly.  This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.
+Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.
-===== Lecture 1 (1/13 Mon.) =====
+===== Lecture 1 (1/12 Mon.) =====
   * Level of transformation
     * Algorithm
@@ Line 10: / Line 9: @@
     * Compiler
   * Cross abstraction layers
-    * Expose an interface
   * Tradeoffs
   * Caches
-  * Multi-thread
+  * DRAM/memory controller
-  * Multi-core
+  * DRAM banks
-  * Unfairness
-  * DRAM controller/Memory controller
-  * Memory hog
   * Row buffer hit/miss
   * Row buffer locality
-  * Streaming access/ Random access
+  * Unfairness
-  * DRAM refresh
+  * Memory performance hog
-  * Retention time
+  * Shared DRAM memory system
-  * Profiling DRAM retention time
+  * Streaming access vs. random access
+  * Memory scheduling policies
+  * Scheduling priority
+  * Retention time of DRAM
+  * Process variation
+  * Retention time profile
   * Power consumption
-  * Wimpy cores
   * Bloom filter
-    * Pros/Cons
+  * Hamming code
-    * False Positive
+  * Hamming distance
-  * Simulation
+  * DRAM row hammer
-  * Memory performance attacks
-  * RTL design
-===== Lecture 2 (1/15 Wed.) =====
+===== Lecture 2 (1/14 Wed.) =====
-  * Optimizing for energy/ Optimizing for the performance
-    * Generally you should optimize for the users
-  * state-of-the-art
-  * RTL Simulation
-    * Long, slow and can be costly
-  * High level simulation
-    * What should be employed?
-	* Important to get the idea of how they are implemented in RTL
-	* Allows designer to filter out techniques that do not work well
-  * Design points
-    * Design processors to meet the design points
-  * Software stack
-  * Design decisions
-  * Datacenters
-  * MIPS R2000
-    * What are architectural techniques that improve the performance of a processor over MIPS 2000
   * Moore's Law
+  * Algorithm --> step-by-step procedure to solve a problem
   * in-order execution
   * out-of-order execution
@@ Line 65: / Line 46: @@
   * Scaling issue
     * Transister are getting smaller
+  * Key components of a computer
+  * Design points
+    * Design processors to meet the design points
+  * Software stack
+  * Design decisions
+  * Datacenters
   * Reliability problems that cause errors
   * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
@@ Line 73: / Line 60: @@
     * Computation
       * Communication
-        * Storage
+      * Storage
-          * DRAM
+        * DRAM
-          * NVRAM (Non-volatile memory): PCM, STT-MRAM
+        * NVRAM (Non-volatile memory): PCM, STT-MRAM
-          * Storage (Flash/Harddrive)
+        * Storage (Flash/Harddrive)
   * Von Neumann Model (Control flow model)
     * Stored program computer
-        * Properties of Von Neumann Model: Stored program, sequential instruction processing
+      * Properties of Von Neumann Model: Stored program, sequential instruction processing
-        * Unified memory
+      * Unified memory
-          * When does an instruction is being interpreted as an instruction (as oppose to a datum)?
+        * When does an instruction is being interpreted as an instruction (as oppose to a datum)?
-        * Program counter
+      * Program counter
-        * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
+      * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
   * Data flow model
     * Data flow machine
@@ Line 94: / Line 81: @@
   * Tradeoffs between control-driven and data-driven
     * What are easier to program?
-    * Which are easy to compile?
+      * Which are easy to compile?
-    * What are more parallel (does that mean it is faster?)
+      * What are more parallel (does that mean it is faster?)
-    * Which machines are more complex to design?
+      * Which machines are more complex to design?
     * In control flow, when a program is stop, there is a pointer to the current state (precise state).
   * ISA vs. Microarchitecture
     * Semantics in the ISA
-    * uArch should obey the ISA
+      * uArch should obey the ISA
-    * Changing ISA is costly, can affect compatibility.
+      * Changing ISA is costly, can affect compatibility.
   * Instruction pointers
   * uArch techniques: common and powerful techniques break Vonn Neumann model if done at the ISA level
@@ Line 109: / Line 96: @@
       * Out-of-order executions
       * etc.
-    * Design techniques
+        * Design techniques
-      * Adder implementation (Bit serial, ripple carry, carry lookahead)
+          * Adder implementation (Bit serial, ripple carry, carry lookahead)
-      * Connection machine (an example of a machine that use bit serial to tradeoff latency for more parallelism)
+          * Connection machine (an example of a machine that use bit serial to tradeoff latency for more parallelism)
   * Microprocessor: ISA + uArch + circuits
   * What are a part of the ISA? Instructions, memory, etc.
@@ Line 120: / Line 107: @@
 ===== Lecture 3 (1/17 Fri.) =====
-  * Design tradeoffs
+   * Microarchitecture
-  * Macro Architectures
+   * Three major tradeoffs of computer architecture
-  * Reconfiguribility vs. specialized designs
+   * Macro-architecture
-  * Parallelism (instructions, data parallel)
+   * LC-3b ISA
-  * Uniform decode (Example: Alpha)
+   * Unused instructions
-  * Steering bits (Sub-opcode)
+   * Bit steering
-  * 0,1,2,3 address machines
+   * Instruction processing style
-    * Stack machine
+   * 0,1,2,3 address machines
-    * Accumulator machine
+   * Stack machine
-    * 2-operand machine
+   * Accumulator machine
-    * 3-operand machine
+   * 2-operand machine
-    * Tradeoffs between 0,1,2,3 address machines
+   * 3-operand machine
-  * Instructions/Opcode/Operade specifiers (i.e. addressing modes)
+   * Tradeoffs between 0,1,2,3 address machines
-  * Simply vs. complex data type (and their tradeoffs)
+   * Postfix notation
-  * Semantic gap
+   * Instructions/Opcode/Operade specifiers (i.e. addressing modes)
-  * Translation layer
+   * Simply vs. complex data type (and their tradeoffs)
-  * Addressability
+   * Semantic gap and level
-  * Byte/bit addressable machines
+   * Translation layer
-  * Virtual memory
+   * Addressability
-  * Big/little endian
+   * Byte/bit addressable machines
-  * Benefits of having registers (data locality)
+   * Virtual memory
-  * Programmer visible (Architectural) state
+   * Big/little endian
-    * Programmers can access this directly
+   * Benefits of having registers (data locality)
-    * What are the benefits?
+   * Programmer visible (Architectural) state
-  * Microarchitectural state
+   * Programmers can access this directly
-    * Programmers cannot access this directly
+   * What are the benefits?
-  * Evolution of registers (from accumulators to registers)
+   * Microarchitectural state
-  * Different types of instructions
+   * Programmers cannot access this directly
-    * Control instructions
+   * Evolution of registers (from accumulators to registers)
-    * Data instructions
+   * Different types of instructions
-    * Operation instructions
+   * Control instructions
-  * Addressing modes
+   * Data instructions
-    * Tradeoffs (complexity, flexibility, etc.)
+   * Operation instructions
-  * Orthogonal ISA
+   * Addressing modes
-    * Addressing modes that are orthogonal to instructino types
+   * Tradeoffs (complexity, flexibility, etc.)
-  * Vectors vs. non vectored interrupts
+   * Orthogonal ISA
-  * Complex vs. simple instructions
+   * Addressing modes that are orthogonal to instruction types
-    * Tradeoffs
+   * I/O devices
-  * RISC vs. CISC
+   * Vectored vs. non-vectored interrupts
-    * Tradeoff
+   * Complex vs. simple instructions
-    * Backward compatibility
+   * Tradeoffs
-    * Performance
+   * RISC vs. CISC
-    * Optimization opportunity
+   * Tradeoff
+   * Backward compatibility
+   * Performance
+   * Optimization opportunity
+   * Translation
-===== Lecture 4 (1/22 Wed.) =====
+===== Lecture 4 (1/21 Wed.) =====
-  * Semantic gap
+  * Fixed vs. variable length instruction
-    * Small vs. Large semantic gap (CISC vs. RISC)
+  * Huffman encoding
-    * Benefit of RISC vs. CISC
+  * Uniform vs. non-uniform decode
-  * Micro operations/microcode
+  * Registers
-    * Translate complex instructions into smaller instructions
+    * Tradeoffs between number of registers
-  * Parallelism (motivation for RISC)
+  * Alignments
-  * Compiler optimization
+    * How does MIPS load words across alignment the boundary
-  * Code optimization through translation
-  * VLIW
-  * Fixed vs. variable length instructions
-    * Tradeoffs
-      * Alignment issues? (fetch/decode)
-      * Decoding issues?
-      * Code size?
-      * Adding additional instructions?
-      * Memory bandwidth and cache utilization?
-      * Energy?
-    * Encoding in variable length instructions
-  * Structure of Alpha instructions and other uniform decode instructions
-    * Different type of instructions
-    * Benefit of knowing what type of instructions
-      * Speculatively operate future instructions
-  * x86 and other non-uniform decode instructions
-    * Tradeoff vs. uniform decode
-  * Tradeoffs for different number of registers
-    * Spilling into memory if the number of registers is small
-    * Compiler optimization on how to manage which value to keep/spill
-  * Addressing modes
-    * Benefits?
-    * Types?
-    * Different uses of addressing modes?
-  * Various ISA-level tradeoffs
-  * Virtual memory
-  * Unalign memory access/aligned memory access
-    * Cost vs. benefit of unaligned access
-  * ISA specification
-    * Things you have to obey/specifie in the ISA specification
-  * Architectural states
-  * Microarchitecture implements how arch. state A transformed to the next arch. state A'
-  * Single cycle machines
-    * Critical path in the single cycle machine
-  * Multi cycle machines
-  * Functional units
-  * Performance metrics
-    * CPI/IPC
-      * CPI of a single cycle microarchitecture
-===== Lecture 5 (1/24 Fri.) =====
+===== Lecture 5 (1/26 Mon.) =====
+  * Tradeoffs in ISA: Instruction length
+    * Uniform vs. non-uniform
+  * Design point/Use cases
+    * What dictates the design point?
+  * Architectural states
+  * uArch
+    * How to implement the ISA in the uArch
+  * Different stages in the uArch
+  * Clock cycles
+  * Multi-cycle machine
+  * Datapath and control logic
+    * Control signals
+  * Execution time of instructions/program
+    * Metrics and what do they means
   * Instruction processing
     * Fetch
@@ Line 221: / Line 187: @@
     * Memory fetch
     * Writeback
-  * Datapath & Control logic in microprocessors
+  * Encoding and semantics
   * Different types of instructions (I-type, R-type, etc.)
   * Control flow instructions
@@ Line 230: / Line 196: @@
   * Critical path analysis
     * Critical path of a single cycle processor
-  * Combinational logic & Sequential logic
+  * What is in the control signals?
+    * Combinational logic & Sequential logic
   * Control store
   * Tradeoffs of a single cycle uarch
-  * Dynamic power/Static power
-  * Speedup calculation
-    * Parallelism
-    * Serial bottleneck
-    * Amdahl's bottleneck
   * Design principles
     * Common case design
     * Critical path design
     * Balanced designs
-  * Multi cycle design
+    * Dynamic power/Static power
+      * Increases in power due to frequency
-===== Lecture 6 (1/27 Mon.) =====
+===== Lecture 6 (1/28 Mon.) =====
+  * Design principles
+    * Common case design
+    * Critical path design
+    * Balanced designs
+  * Multi cycle design
   * Microcoded/Microprogrammed machines
     * States
+    * Translation from one state to another
     * Microinstructions
     * Microsequencing
@@ Line 273: / Line 242: @@
   * Vertical microcode
   * Primitives
-===== Lecture 7 (1/29 Wed.) =====
-  * Pipelining
-  * Limitations of the multi-programmed design
-    * Idle resources
-  * Throughput of a pipelined design
-    * What dictacts the throughput of a pipelined design?
-  * Latency of the pipelined design
-  * Dependency
-  * Overhead of pipelining
-    * Latch cost?
-  * Data forwarding/bypassing
-  * What are the ideal pipeline?
-  * External fragmentation
-  * Issues in pipeline designs
-    * Stalling
-      * Dependency (Hazard)
-        * Flow dependence
-        * Output dependence
-        * Anti dependence
-        * How to handle them?
-    * Resource contention
-    * Keeping the pipeline full
-    * Handling exception/interrupts
-    * Pipeline flush
-    * Speculation
-  * Interlocking
-  * Multipath execution
-  * Fine grain multithreading
-  * No-op (Bubbles in the pipeline)
-  * Valid bits in the instructions
-===== Lecture 8 (1/31 Fri.) =====
-  * Branch prediction
-  * Different types of data dependence
-  * Pipeline stalls
-    * bubbles
-    * How to handle stalls
-    * Stall conditions
-    * Stall signals
-    * Dependences
-      * Distant between dependences
-    * Data forwarding/bypassing
-    * Maintaining the correct dataflow
-  * Different ways to design data forwarding path/logic
-  * Different techniques to handle interlockings
-    * SW based
-    * HW based
-  * Profiling
-    * Static profiling
-    * Helps from the software (compiler)
-      * Superblock optimization
-      * Analyzing basic blocks
-  * How to deal with branches?
-    * Branch prediction
-    * Delayed branching (branch delay slot)
-    * Forward control flow/backward control flow
-    * Branch prediction accuracy
-  * Profile guided code positioning
-    * Based on the profile info. position the code based on it
-    * Try to make the next sequential instruction be the next inst. to be executed
-  * Trace cache
-  * Predicate combining (combine predicate for a branch instruction)
-  * Predicated execution (control dependence becomes data dependence)
-  * Definition of basic blocks
-  * Control flow graph
-===== Lecture 9 (2/3 Mon.) =====
-  * Delayed branching
-    * benefit?
-    * What does it eliminates?
-    * downside?
-    * Delayed branching in SPARC (with squashing)
-    * Backward compatibility with the delayed slot
-    * What should be filled in the delayed slot
-    * How to ensure correctness
-  * Fine-grained multithreading
-    * fetch from different threads
-    * What are the issues (what if the program doesn't have many threads)
-    * CDC 6000
-    * Denelcor HEP
-    * No dependency checking
-    * Inst. from different thread can fill-in the bubbles
-    * Cost?
-  * Simulteneuos multithreading
-  * Branch prediction
-    * Guess what to fetch next.
-    * Misprediction penalty
-    * Need to guess the direction and target
-    * How to perform the performance analysis?
-      * Given the branch prediction accuracy and penalty cost, how to compute a cost of a branch misprediction.
-      * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions.
-        * How many extra instructions are being fetched?
-        * What is the performance degredation?
-    * How to reduce the miss penalty?
-    * Predicting the next address (non PC+4 address)
-    * Branch target buffer (BTB)
-      * Predicting the address of the branch
-    * Global branch history - for directions
-    * Can use compiler to profile and get more info
-      * Input set dictacts the accuracy
-      * Add time to compilation
-    * Heuristics that are common and doesn't require profiling.
-      * Might be inaccurate
-      * Does not require profiling
-    * Programmer can tell the hardware (via pragmas (hints))
-      * For example, x86 has the hint bit
-    * Dynamic branch prediction
-      * Last time predictor
-      * Two bits counter based prediction
-        * One more bit for hysteresis
-===== Lecture 10 (2/5 Wed.) =====
-  * Branch prediction accuracy
-    * Why are they very important?
-      * Differences between 99% accuracy and 98% accuracy
-      * Cost of a misprediction when the pipeline is veryd eep
-  * Value prediction
-  * Global branch correlation
-    * Some branches are correlated
-  * Local branch correlation
-    * Some branches can depend on the result of past branches
-  * Pattern history table
-    * Record global taken/not taken results.
-    * Cost vs. accuracy (What to record, do you record PC? Just taken/not taken info.?)
-  * One-level branch predictor
-    * What information are used
-  * Two-level branch prediction
-    * What entries do you keep in the glocal history?
-    * What entries do you keep in the local history?
-    * How many table?
-    * Cost when training a table
-    * What are the purposes of each table?
-    * Potential problems of a two-level history
-  * GShare predictor
-    * Global history predictor is hashed with the PC
-    * Store both GHP and PC in one combined information
-    * How do you use the information? Why does the XOR result still usable?
-  * Slides (page 16-18) for a good overview of one- and two-level predictors
-  * Warmup cost of the branch predictor
-    * Hybrid solution? Fast warmup is used first, then switch to the slower one.
-  * Tournament predictor (Alpha 21264)
-  * Other types of branch predictor
-    * Using machine learning?
-    * Geometric history length
-      * Look at branches far behind (but using geometric step)
-  * Predicated execution - eliminate branches
-    * What are the tradeoffs
-    * What is the block is big (can lead to execution a lot of useless work)
-    * Allows easier code optimization
-      * From the compiler PoV, predicated execution combine multiple basic blocks into one bigger basic block
-      * Reduce control dependences
-    * Need ISA support
-  * Wish branches
-    * Compiler generate both predicated and non-predicated codes
-    * HW design which one to use
-      * Use branch prediction on an easy to predict code
-      * Use predicated execution on a hard to predict code
-      * Compiler can be more aggressive in optimimzing the code
-    * What are the tradeoffs (slide# 47)
-  * Multi-path execution
-    * Execute both paths
-    * Can lead to wasted work
-===== Lecture 11 (2/12 Wed.) =====
-  * Call and return prediction
-    * Direct call is easy to predict
-    * Retun is harder (indirect branches)
-      * Nested calls make return easier to predict
-        * Can use stack to predict the return
-    * Indirect branch prediction
-      * These branches have multiple targets
-      * For switch-case, virtual function calls, jump tables, interface calls
-      * BTB to predict the target address - low accuracy
-      * History based: BTB + GHR
-      * Virtual program counter prediction
-  * Complications in superscalar processors
-    * Fetch? What if multiple branches are fetched at the same time?
-    * Logic requires to ensure correctness?
-  * Multi-cycle executions (Different functional units take different number of cycles)
-    * Instructions can retire out-of-order
-      * How to deal with this case? Stall? Throw exceptions if there are problems?
-  * Exceptions and Interrupts
-    * When they are handled?
-      * Why are some interrupts should be handled right away?
-  * Precise exception
-    * arch. state should be consistent before handling the exception/interrupts
-      * Easier to debug (you see the sequential flow when the interrupt occurs)
-        * Deterministic
-      * Easier to recover from the exception
-      * Easier to restart the processes
-    * How to ensure precise exception?
-      * Tradeoffs between each method
-  * Reorder buffer
-    * Reorder results before they are visible to the arch. state
-      * Need to presearve the sequential sematic and data
-    * What are the informatinos in the ROB entry
-    * Where to get the value from (forwarding path? reorder buffer?)
-      * Extra logic to check where the youngest instructions/value is
-      * Content addressible search
-        * A lot of comparators
-    * Different ways to simplify the reorder buffer
-    * Register renaming
-      * Same register refers to independent values (lacks of registers)
-    * Where does the exception happen (after retire)
-  * History buffer
-    * Update the register file when the instruction complete. Unroll if there is an exception.
-  * Future file (commonly used, along with reorder buffer)
-    * Keep two set of register files
-      * An updated value (Speculative), called fiture file
-      * A backup value (to restore the state quickly
-    * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone)
-  * Branch misprediction resembles Exception
-    * The difference is that branch misprediction is not visible to the software
-      * Also much more common (say, divide by zero vs. a mispredicted branch)
-    * Recovery is similar to exception handling
-  * Latency of the state recovery
-  * What to do during the state recovery
-  * Checkpointing
-    * Advantages?
-===== Lecture 14 (2/19 Wed.) =====
-  * Predictor (branch predictor, cache line predictor ...)
-  * Power budget (and its importance)
-  * Architectural state, precise state
-  * Memory dependence is known dynamically
-  * Register state is not shared across threads/processors
-  * Memory state is shared across threads/processors
-  * How to maintain speculative memory states
-  * Write buffers (helps simplify the process of checking the reorder buffer)
-  * Overall OoO mechanism
-    * What are other ways of eliminating dispatch stalls
-    * Dispatch when the sources are ready
-    * Retired instructions make the source available
-    * Register renaming
-    * Reservation station
-      * What goes into the reservation station
-      * Tags required in the reservation station
-    * Tomasulo's algorithm
-    * Without precise exception, OoO is hard to debug
-    * Arch. register ID
-    * Examples in the slides
-      * Slides 28 --> register renaming
-      * Slides 30-35 --> Exercise (also on the board)
-        * This will be usefull for the midterm
-    * Register aliasing table
-===== Lecture 15 (2/21 Fri.) =====
-  * OoO --> Restricted Dataflow
-    * Extracting parallelism
-    * What are the bottlenecks?
-      * Issue width
-      * Dispatch width
-      * Parallelism in the program
-    * More example on slide #10
-    * What does it mean to be restricted data flow
-      * Still visible as a Von Neumann model
-    * Where does the efficiency come from?
-    * Size of the scheduling windors/reorder buffer. Tradeoffs? What make sense?
-  * Load/store handling
-    * Would like to schedule them out of order, but make them visible in-order
-    * When do you schedule the load/store instructions?
-    * Can we predict if load/store are dependent?
-    * This is one of the most complex structure of the load/store handling
-    * What information can be used to predict these load/store optimization?
-  * Note: IPC = 1/CPI
-  * Centralized vs. distributed? What are the tradeoffs?
-  * How to handle when there is a misprediction/recovery
-  * Token dataflow arch.
-    * What are tokens?
-    * How to match tokens
-    * Tagged token dataflow arch.
-    * What are the tradeoffs?
-    * Difficulties?
-===== Lecture 16 (2/24 Mon.) =====
-  * SISD/SIMD/MISD/MIMD
-  * Array processor
-  * Vector processor
-  * Data parallelism
-    * Where does the concurrency arise?
-  * Differences between array processor vs. vector processor
-  * VLIW
-  * Compactness of an array processor
-  * Vector operates on a vector of data (rather than a single datum (scalar))
-    * Vector length (also applies to array processor)
-    * No dependency within a vector --> can have a deep pipeline
-    * Highly parallel (both instruction level (ILP) and memory level (MLP))
-    * But the program needs to be very parallel
-    * Memory can be the bottleneck (due to very high MLP)
-    * What does the functional units look like? Deep pipelin and simpler control.
-    * CRAY-I is one of the examples of vector processor
-    * Memory access pattern in a vector processor
-      * How do the memory accesses benefit the memory bandwidth?
-      * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a breif explanation of memory level parallelism
-      * Stride length vs. the number of banks
-        * stride length should be relatively prime to the number of banks
-      * Tradeoffs between row major and column major --> How can the vector processor deals with the two
-    * How to calculate the efficiency and performance of vector processors
-    * What if there are multiple memory ports?
-    * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism)
-      * Helps handling sparse metrices
-    * Conditional operation
-    * Structure of vector units
-    * How to automatically parallelize code through the compiler?
-      * This is a hard problem. Compiler does not know the memory address.
-  * What do we need to ensure for both vector and array processor?
-  * Sequential bottleneck
-    * Amdahl's law
-  * Intel MMX --> An example of Intel's approach to SIMD
-    * No VLEN, use OpCode to define the length
-    * Stride is one in MMX
-    * Intel SSE --> Modern version of MMX
-===== Lecture 17 (2/26 Wed.) =====
-  * GPU
-    * Warp/Wavefront
-      * A bunch of threads sharing the same PC
-    * SIMT
-    * Lanes
-    * FGMT + massively parallel
-      * Tolerate long latency
-    * Warp based SIMD vs. traditional SIMD
-  * SPMD (Programming model)
-    * Single program operates on multiple data
-      * can have synchronization point
-    * Many scientific applications are programmed in this manner
-  * Control flow problem (branch divergence)
-    * Masking (in a branch, mask threads that should not execute that path)
-    * Lower SIMD efficiency
-    * What if you have layers of branches?
-  * Dynamic wrap formation
-    * Combining threads from different warps to increase SIMD utilization
-    * This can cause memory divergence
-  * VLIW
-    * Wide fetch
-    * IA-64
-    * Tradeoffs
-      * Simple hardware (no dynamic scheduling, no dependency checking within VLIW)
-      * A lot of loads at the compiler level
-  * Decoupled access/execute
-    * Limited form of OoO
-    * Tradeoffs
-    * How to street the instruction (determine dependency/stalling)?
-    * Instruction scheduling techniques (static vs. dynamic)
-  * Systoric arrays
-    * Processing elements transform data in chains
-    * Develop for image processing (for example, convolution)
-  * Stage processing
-===== Lecture 18 (2/28 Fri.) =====
-  * Tradeoffs of VLIW
-    * Why does VLIW required static instruction scheduling
-    * Whose job it is?
-      * Compiler can rearrange basic blocks/instruction
-  * Basic block
-    * Benefits of having large basic block
-  * Entry/Exit
-    * Handling entries/exits
-  * Trace cache
-    * How to ensure correctness?
-    * Profiling
-    * Fixing up the instruction order to ensure correctness
-    * Dealing with multiple entries into the block
-    * Dealing with multiple exits into the block
-  * Super block
-    * How to form super blocks?
-    * Benefit of super block
-    * Tradeoff between not forming a super block and forming a super block
-      * Ambiguous branch (after profiling, both taken/not taken are equally likely)
-      * Cleaning up
-  * What scenario would make trace cache/superblock/profiling less effective?
-  * List scheduling
-    * Help figuring out which instructions VLIW should fetch
-    * Try to maximize instruction throughput
-    * How to assign priorities
-    * What if some instructions take longer than others
-  * Block structured ISA (BS-ISA)
-    * Problems with trace scheduling?
-    * What type of program will benefit from BS-ISA
-    * How to form blocks in BS-ISA?
-      * Combining basic blocks
-      * multiples of merged basic blocks
-    * How to deal with entries/exits in BS-ISA?
-      * undo the executed instructions from the entry point, then fetch the new block
-    * Advantages over trace cache
-  * Benefit of VLIW + Static instruction scheduling
-  * Intel IA-64
-    * Static instruction scheduling and VLIW
-===== Lecture 19 (3/19 Wed.) =====
-  * Ideal cache
-    * More capacity
-    * Fast
-    * Cheap
-    * High bandwidth
-  * DRAM cell
-    * Cheap
-    * Sense the purturbation through sense amplifier
-    * Slow and leaky
-  * SRAM cell (Cross coupled inverter)
-    * Expensice
-    * Fast (easier to sense the value in the cell)
-  * Memory bank
-    * Read access sequence
-      * DRAM: Activate -> Read -> Precharge (if needed)
-    * What dominate the access laatency for DRAM and SRAM
-  * Scaling issue
-    * Hard to scale the scale to be small
-  * Memory hierarchy
-    * Prefetching
-    * Caching
-  * Spatial and temporal locality
-    * Cache can exploit these
-    * Recently used data is likely to be accessed
-    * Nearby data is likely to be accessed
-  * Caching in a pipeline design
-  * Cache management
-    * Manual
-      * Data movement is managed manually
-        * Embedded processor
-        * GPU scratchpad
-    * Automatic
-      * HW manage data movements
-  * Latency analysis
-    * Based on the hit and miss status, next level access time (if miss), and the current level access time
-  * Cache basics
-    * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc.
-  * Cache access
-    * How to access tag and data (in parallel vs serially)
-    * How do tag and index get used?
-    * Modern processors perform serial access for higher level cache (L3 for example) to save power
-  * Cost and benefit of having more associativity
-    * Given the associativity, which block should be replace if it is full
-    * Replacement poligy
-      * Random
-      * Least recently used (LRU)
-      * Least frequently used
-      * Least costly to refetch
-      * etc.
-  * How to implement LRU
-    * How to keep track of access ordering
-      * Complexity increases rapidly
-    * Approximate LRU
-      * Victim and next Victim policy
-===== Lecture 20 (3/21 Fri.) =====
-  * Set thrashing
-    * Working set is bigger than the associativity
-  * Belady's OPT
-    * Is this optimal?
-    * Complexity?
-  * Similarity between cache and page table
-    * Number of blocks vs pages
-    * Time to find the block/page to replace
-  * Handling writes
-    * Write through
-      * Need a modified bit to make sure accesses to data got the updated data
-    * Write back
-      * Simpler, no consistency issues
-  * Sectored cache
-    * Use subblock
-      * lower bandwidth
-      * more complex
-  * Instruction vs data cache
-    * Where to place instructions
-      * Unified vs. separated
-    * In the first level cache
-  * Cache access
-    * First level access
-    * Second level access
-      * When to start the second level access
-        * Performance vs. energy
-  * Address translation
-  * Homonym and Synonyms
-    * Homonym: Same VA but maps to different PA
-      * With multiple processes
-    * Synonyms: Multiple VAs map to the same PA
-      * Shared libraries, shared data, copy-on-write
-      * I/O
-    * Can these create problems when we have the cache
-    * How to eliminate these problems?
-      * Page coloring
-  * Interaction between cache and TLB
-    * Virtually indexed vs. physically indexed
-    * Virtually tagged vs. physically tagged
-    * Virtually indexed physically tagged
-  * Virtual memory in DRAM
-    * Control where data is mapped to in channel/rank/bank
-      * More parallelism
-      * Reduce interference
-===== Lecture 21 (3/24 Mon.) =====
-  * Different parameters that affect cache miss
-  * Thrashing
-  * Different types of cache misses
-    * Compulsory misses
-      * Can mitigate with prefetches
-    * Capacity misses
-      * More assoc
-      * Victim cache
-    * Conflict misses
-      * Hashing
-  * Large block vs. small block
-  * Subblocks
-  * Victim cache
-    * Small, but fully assoc. cache behind the actual cache
-    * Cached misses cache block
-    * Prevent ping-ponging
-  * Pseudo associativity
-    * Simpler way to implement associative cache
-  * Skewed assoc. cache
-    * Different hashing functions for each way
-  * Restructure data access pattern
-    * Order of loop traversal
-    * Blocking
-  * Memory level parallelism
-    * Cost per miss of a parallel cache miss is less costly compared to serial misses
-  * MSHR
-    * Keep track of pending cache
-      * Think of this as the load/store buffer-ish for cache
-    * What information goes into the MSHR?
-    * When do you access the MSHR?
-===== Lecture 22 (3/26 Wed.) =====
-  * Multi-porting
-    * Virtual multi-porting
-      * Time-share the port, not too scalable but cheap
-    * True multiporting
-  * Multiple cache copies
-  * Banking
-    * Can have bank conflict
-    * Extra interconnects across banks
-    * Address mapping can mitigate bank conflict
-    * Common in main memory (note that regFile in GPU is also banked, but mainly for the pupose of reducing complexity)
-  * Accessing DRAM
-    * Row bits
-    * Column bits
-    * Addressibility
-    * DRAM has its own clock
-  * DRAM (2T) vs. SRAM (6T)
-    * Cost
-    * Latency
-  * Interleaving in DRAM
-    * Effects from address mapping on memory interleaving
-    * Effects from memory access patterns from the program on interleaving
-  * DRAM Bank
-    * To minimize the cost of interleaving (Shared the data bus and the command bus)
-  * DRAM Rank
-    * Minimize the cost of the chip (a bundle of chips operated together)
-  * DRAM Channel
-    * An interface to DRAM, each with its own ranks/banks
-  * DIMM
-    * More DIMM adds the interconnect complexity
-  * List of commands to read/write data into DRAM
-    * Activate -> read/write -> precharge
-    * Activate moves data into the row buffer
-    * Precharge prepare the bank for the next access
-  * Row buffer hit
-  * Row buffer conflict
-  * Scheduling memory requests to lower row conflicts
-  * Burst mode of DRAM
-    * Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits
-  * Address mapping
-    * Row interleaved
-    * Cache block interleaved
-  * Memory controller
-    * Sending DRAM commands
-    * Periodically send commands to refresh DRAM cells
-    * Ensure correctness and data integrity
-    * Where to place the memory controller
-      * On CPU chip vs. at the main memory
-        * Higher BW on-chip
-    * Determine the order of requests that will be serviced in DRAM
-      * Request queues that hold requests
-      * Send requests whenever the request can be sent to the bank
-      * Determine which command (across banks) should be sent to DRAM
-  * Priority of demand vs. prefetch requests
-  * Memory scheduling policies
-    * FCFS
-    * FR-FCFS
-      * Capped FR-FCFS: FR-FCFS with a timeout
-      * Usually this is done in a command level (read/write commands and precharge/activate commands)
-===== Lecture 23 (3/28 Fri.) =====
-  * DRAM design choices
-    * Cost/density/latency/BW/Yield
-  * Sense Amplifier
-    * How do they work
-  * Dual data rate
-  * Subarray
-  * Rowclone
-    * Moving bulk of data from one row to others
-    * Lower latency and BW when performing copies/zeroes out the data
-  * TL-DRAM
-    * Far segment
-    * Near segment
-    * What causes the long latency
-    * Benefit of TL-DRAM
-      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
-===== Lecture 24 (3/31 Mon.) =====
-  * Memory controller
-    * Different commands
-  * Memory scheduler
-    * Determine the order of requests to be issued to DRAM
-    * Age/hit-miss status/types(load/store/prefetch/from GPU/from CPU)/criticality
-  * Row buffer
-    * hit/conflict
-    * open/closed row
-    * Open row policy
-    * Closed row policy
-    * Tradeoffs between open and closed row policy
-      * What if the programs has high row buffer locality: open row might benefit more
-      * Closed row will service misses request faster
-  * Bank conflict
-  * Interference from different applications/threads
-    * Differnt programs/processes/threads interfere with each other
-      * introduce more row buffer/bank conflicts
-    * Memory schedule has to manage these interference
-    * Memory hog problems
-  * Interference in the data/command bus
-  * FR-FCFS
-    * Why does FR-FCFS make sense?
-      * Row buffer has lower lantecy
-    * Issues with FR-FCFS
-      * Unfairness
-  * STFM
-    * Fairness issue in memory scheduling
-    * How does STFM calculate the fairness and slowdown
-      * How to estimate slowdown time when it is runing alone
-    * Definition of fairness (based on STFM, different papers/areas define fairness differently)
-  * PAR-BS
-    * Parallelism in programs
-    * Intereference across banks
-    * How to form  a batch
-    * How to determine ranking between batches/within a batch

18-447 Introduction to Computer Architecture – Spring 2015

User Tools

Site Tools

Differences

Page Tools