====== Buzzwords ======

Buzzwords are terms mentioned during lecture that are particularly important to understand thoroughly. This page tracks the buzzwords for each lecture and can be used as a reference for finding gaps in your understanding of the course material.

===== Lecture 1 (1/12 Mon.) =====

  * Level of transformation
    * Algorithm
    * Compiler
  * Cross abstraction layers
    * Expose an interface
  * Tradeoffs
  * Caches
  * Multi-thread
  * Multi-core
  * DRAM/memory controller
  * DRAM banks
  * Row buffer hit/miss
  * Row buffer locality
  * Unfairness
  * Memory performance hog
  * Memory performance attacks
  * Shared DRAM memory system
  * Streaming access vs. random access
  * Memory scheduling policies
  * Scheduling priority
  * DRAM refresh
  * Retention time of DRAM
  * Process variation
  * Retention time profile
  * Power consumption
  * Wimpy cores
  * Bloom filter
    * Pros/cons
    * False positives
  * Hamming code
  * Hamming distance (a small sketch follows this list)
  * DRAM row hammer
  * Simulation
  * RTL design
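The Hamming distance between two code words is simply the number of bit positions in which they differ; it is what determines how many bit errors a code can detect or correct. A tiny illustrative sketch (the 4-bit values are made up, not from lecture):

<code python>
def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    diff = a ^ b                  # 1 wherever the two words disagree
    return bin(diff).count("1")

# 0b1011 and 0b1110 differ in bit 0 and bit 2:
print(hamming_distance(0b1011, 0b1110))  # 2
</code>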
  
===== Lecture 2 (1/14 Wed.) =====

  * Optimizing for energy vs. optimizing for performance
    * Generally you should optimize for the users
  * State of the art
  * RTL simulation
    * Long, slow, and can be costly
  * High level simulation
    * What should be employed?
    * Important to get an idea of how techniques would be implemented in RTL
    * Allows the designer to filter out techniques that do not work well
  * MIPS R2000
    * What architectural techniques improve the performance of a processor over the MIPS R2000?
  * Moore's Law
  * Algorithm --> step-by-step procedure to solve a problem
  * in-order execution
  * out-of-order execution
  * Scaling issue
    * Transistors are getting smaller
  * Key components of a computer
  * Design points
    * Design processors to meet the design points
  * Software stack
  * Design decisions
  * Datacenters
  * Reliability problems that cause errors
  * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
  * Computation
  * Communication
  * Storage
    * DRAM
    * NVRAM (Non-volatile memory): PCM, STT-MRAM
    * Storage (Flash/Harddrive)
  * Von Neumann Model (Control flow model)
    * Stored program computer
      * Properties of the Von Neumann Model: stored program, sequential instruction processing
      * Unified memory
        * When is an instruction interpreted as an instruction (as opposed to a datum)?
      * Program counter
      * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
  * Data flow model
    * Data flow machine
  * Tradeoffs between control-driven and data-driven
    * Which is easier to program?
    * Which is easier to compile?
    * Which is more parallel (and does that mean it is faster)?
    * Which machines are more complex to design?
    * In control flow, when a program is stopped, there is a pointer to the current state (precise state).
  * ISA vs. Microarchitecture
    * Semantics in the ISA
      * uArch should obey the ISA
      * Changing the ISA is costly and can affect compatibility.
  * Instruction pointers
  * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
      * Out-of-order executions
      * etc.
    * Design techniques
      * Adder implementation (bit serial, ripple carry, carry lookahead; a small ripple-carry sketch follows this list)
      * Connection Machine (an example of a machine that uses bit-serial arithmetic to trade off latency for more parallelism)
  * Microprocessor: ISA + uArch + circuits
  * What is part of the ISA? Instructions, memory, etc.
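To make the adder-implementation tradeoff above concrete, here is a ripple-carry addition sketch (purely illustrative; the 8-bit width and names are my own choices): each bit's full adder consumes the carry produced by the bit below it, which is why the delay grows with operand width and why carry-lookahead designs exist.

<code python>
def full_adder(a, b, cin):
    """One-bit full adder: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (cin & (a ^ b))
    return s, cout

def ripple_carry_add(x, y, width=8):
    """Add two unsigned integers bit by bit, LSB first, the way a ripple-carry adder would."""
    carry, result = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry          # the final carry out signals unsigned overflow

print(ripple_carry_add(200, 100))  # (44, 1): 300 wraps to 44 with a carry out
</code>
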
===== Lecture 3 (1/17 Fri.) =====
  
  * Microarchitecture
  * Three major tradeoffs of computer architecture
  * Macro-architecture
  * LC-3b ISA
  * Unused instructions
  * Bit steering
  * Instruction processing style
  * 0,1,2,3 address machines
    * Stack machine
    * Accumulator machine
    * 2-operand machine
    * 3-operand machine
    * Tradeoffs between 0,1,2,3 address machines
  * Postfix notation (a small evaluator sketch follows this list)
  * Instructions/Opcode/Operand specifiers (i.e. addressing modes)
  * Simple vs. complex data types (and their tradeoffs)
  * Semantic gap and level
  * Translation layer
  * Addressability
  * Byte/bit addressable machines
  * Virtual memory
  * Big/little endian
  * Benefits of having registers (data locality)
  * Programmer visible (architectural) state
    * Programmers can access this directly
    * What are the benefits?
  * Microarchitectural state
    * Programmers cannot access this directly
  * Evolution of registers (from accumulators to registers)
  * Different types of instructions
    * Control instructions
    * Data instructions
    * Operation instructions
  * Addressing modes
    * Tradeoffs (complexity, flexibility, etc.)
  * Orthogonal ISA
    * Addressing modes that are orthogonal to instruction types
  * I/O devices
  * Vectored vs. non-vectored interrupts
  * Complex vs. simple instructions
    * Tradeoffs
  * RISC vs. CISC
    * Tradeoffs
    * Backward compatibility
    * Performance
    * Optimization opportunity
  * Translation
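To make the 0-address (stack machine) and postfix-notation items concrete, here is a tiny postfix evaluator, a sketch only (the token format and operator set are my own assumptions): operands are pushed, and each operator pops its two sources and pushes the result, so no operand ever needs to be named.

<code python>
def eval_postfix(expr: str) -> int:
    """Evaluate a postfix (reverse Polish) expression the way a 0-address stack machine would."""
    stack = []
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    for tok in expr.split():
        if tok in ops:
            b = stack.pop()       # the second operand is on top of the stack
            a = stack.pop()
            stack.append(ops[tok](a, b))
        else:
            stack.append(int(tok))
    return stack.pop()

# (2 + 3) * 4 written in postfix:
print(eval_postfix("2 3 + 4 *"))  # 20
</code>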
  
===== Lecture 4 (1/21 Wed.) =====
  
  * Fixed vs. variable length instruction
    * Tradeoffs
      * Alignment issues? (fetch/decode)
      * Decoding issues?
      * Code size?
      * Adding additional instructions?
      * Memory bandwidth and cache utilization?
      * Energy?
    * Encoding in variable length instructions
  * Huffman encoding (a small sketch follows this list)
  * Uniform vs. non-uniform decode
    * Structure of Alpha instructions and other uniform decode instructions
      * Different types of instructions
      * Benefit of knowing what type of instruction it is
        * Speculatively operate on future instructions
    * x86 and other non-uniform decode instructions
      * Tradeoff vs. uniform decode
  * Registers
    * Tradeoffs in the number of registers
      * Spilling into memory if the number of registers is small
      * Compiler optimization on how to manage which values to keep/spill
  * Alignments
    * How does MIPS load words across the alignment boundary?
    * Unaligned vs. aligned memory access
      * Cost vs. benefit of unaligned access
  * Semantic gap
    * Small vs. large semantic gap (CISC vs. RISC)
    * Benefit of RISC vs. CISC
  * Micro operations/microcode
    * Translate complex instructions into smaller instructions
  * Parallelism (motivation for RISC)
  * Compiler optimization
  * Code optimization through translation
  * VLIW
  * Addressing modes
    * Benefits?
    * Types?
    * Different uses of addressing modes?
  * Various ISA-level tradeoffs
  * Virtual memory
  * ISA specification
    * Things you have to obey/specify in the ISA specification
  * Architectural states
  * Microarchitecture implements how arch. state A is transformed to the next arch. state A'
  * Single cycle machines
    * Critical path in the single cycle machine
  * Multi cycle machines
  * Functional units
  * Performance metrics
    * CPI/IPC
      * CPI of a single cycle microarchitecture
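A minimal Huffman-coding sketch, to connect variable-length instruction encoding with opcode frequency (the opcode names and frequencies below are invented for illustration): frequent opcodes end up with short encodings and rare ones with long encodings, which is the basic argument for variable-length formats.

<code python>
import heapq

def huffman_code(freqs):
    """Build a prefix code in which frequent symbols get shorter codewords."""
    # Each heap entry: (frequency, tie-breaker, {symbol: codeword-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)     # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}
        merged.update({s: "1" + w for s, w in c2.items()})
        heapq.heappush(heap, (f1 + f2, count, merged))
        count += 1
    return heap[0][2]

# Hypothetical dynamic opcode frequencies:
print(huffman_code({"load": 0.4, "add": 0.3, "branch": 0.2, "mul": 0.1}))
# e.g. {'load': '0', 'add': '10', 'mul': '110', 'branch': '111'}
</code>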
  
===== Lecture 5 (1/26 Mon.) =====

  * Tradeoffs in ISA: instruction length
    * Uniform vs. non-uniform
  * Design point/use cases
    * What dictates the design point?
  * Architectural states
  * uArch
    * How to implement the ISA in the uArch
  * Different stages in the uArch
  * Clock cycles
  * Multi-cycle machine
  * Datapath and control logic
    * Control signals
  * Execution time of instructions/programs
    * Metrics and what they mean (a worked example follows this list)
  * Instruction processing
    * Fetch
    * Memory fetch
    * Writeback
  * Encoding and semantics
  * Different types of instructions (I-type, R-type, etc.)
  * Control flow instructions
  * Critical path analysis
    * Critical path of a single cycle processor
  * What is in the control signals?
    * Combinational logic & sequential logic
  * Control store
  * Tradeoffs of a single cycle uarch
  * Speedup calculation
    * Parallelism
    * Serial bottleneck
    * Amdahl's bottleneck
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
    * Dynamic power/static power
      * Increases in power due to frequency
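A quick worked example for the execution-time metric (all numbers are made up, just to show how the terms combine): execution time = instruction count x CPI x clock period, and IPC is simply 1/CPI.

<code python>
# Iron law of processor performance (numbers are illustrative assumptions):
#   execution_time = instruction_count * CPI * clock_period
insts = 1_000_000          # dynamic instruction count
cpi = 1.25                 # average cycles per instruction
clock_hz = 2_000_000_000   # 2 GHz clock

execution_time = insts * cpi / clock_hz
print(f"IPC = {1 / cpi:.2f}")                              # 0.80
print(f"execution time = {execution_time * 1e6:.0f} us")   # 625 us
</code>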
  
===== Lecture 6 (1/28 Wed.) =====

  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
  * Multi cycle design
  * Microcoded/microprogrammed machines
    * States
    * Translation from one state to another
    * Microinstructions
    * Microsequencing (a small sketch follows this list)
  * Vertical microcode
  * Primitives
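A very small sketch of the microprogrammed-control idea (the states and control signals are invented; this is not the real LC-3b control store): each control-store entry supplies the control signals for one step plus the next state, and the microsequencer simply follows that chain.

<code python>
# Toy control store: state -> (control signals asserted this cycle, next state).
# A real control store is a ROM of bit vectors; names here are made up.
CONTROL_STORE = {
    "FETCH1":  ({"MAR<-PC", "PC<-PC+1"}, "FETCH2"),
    "FETCH2":  ({"MDR<-Mem[MAR]"},       "DECODE"),
    "DECODE":  ({"IR<-MDR"},             "EXECUTE"),
    "EXECUTE": ({"ALU", "RegWrite"},     "FETCH1"),
}

def microsequencer(start="FETCH1", steps=6):
    state = start
    for _ in range(steps):
        signals, state_next = CONTROL_STORE[state]
        print(f"{state:8s} asserts {sorted(signals)}")
        state = state_next

microsequencer()
</code>
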
===== Lecture 7 (1/30 Fri.) =====

  * Emulator (i.e., uCode allows a minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives
  * nanocode and millicode
    * What are the differences between nano/milli/microcode?
  * Microprogrammed vs. hardwired control
  * Pipelining
  * Limitations of the multi-programmed design
    * Pipeline flush
    * Speculation

===== Lecture 8 (2/2 Mon.) =====

  * Interlocking
  * Multipath execution
  * No-op (Bubbles in the pipeline)
  * Valid bits in the instructions
  * Branch prediction
  * Different types of data dependence
    * Based on the profile info, position the code accordingly
    * Try to make the next sequential instruction be the next inst. to be executed
  * Trace cache
  * Predicate combining (combine predicates for a branch instruction)
  * Predicated execution (control dependence becomes data dependence)

===== Lecture 9 (2/4 Wed.) =====

  * Definition of basic blocks
  * Control flow graph
  * Delayed branching
    * Benefit?
      * Might be inaccurate
      * Does not require profiling
    * Static branch prediction
      * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch
      * For example, x86 has the hint bit
    * Dynamic branch prediction (a two-bit counter sketch follows this list)
        * One more bit for hysteresis
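A minimal two-bit saturating counter sketch (the usual textbook encoding; illustrative only): the second bit provides the hysteresis mentioned above, so a single atypical outcome does not immediately flip the prediction.

<code python>
class TwoBitCounter:
    """2-bit saturating counter: states 0,1 predict not-taken; 2,3 predict taken."""
    def __init__(self, state=2):
        self.state = state
    def predict(self) -> bool:
        return self.state >= 2
    def update(self, taken: bool):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

ctr = TwoBitCounter()
for outcome in [True, True, False, True]:   # one not-taken outcome does not flip the prediction
    print(ctr.predict(), end=" ")           # True True True True
    ctr.update(outcome)
</code>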
  
===== Lecture 10 (2/6 Fri.) =====

  * Branch prediction accuracy
    * Why is it so important?
      * Differences between 99% accuracy and 98% accuracy
      * Cost of a misprediction when the pipeline is very deep
  * Value prediction
  * Global branch correlation
    * Some branches are correlated
    * What information is used?
  * Two-level branch prediction
    * What entries do you keep in the global history?
    * What entries do you keep in the local history?
    * How many tables?
    * Store both the GHR and the PC as one combined piece of information (a gshare-style sketch follows this list)
    * How do you use the information? Why is the XOR result still usable?
  * Slides (pages 16-18) give a good overview of one- and two-level predictors
  * Warmup cost of the branch predictor
    * Hybrid solution? Fast warmup is used first, then switch to the slower one.
  * Tournament predictor (Alpha 21264)
  * Other types of branch predictors
    * Using machine learning?
    * Geometric history length
      * Look at branches far behind (but using geometric steps)
  * Predicated execution - eliminates branches
    * What are the tradeoffs?
    * What if the block is big (can lead to executing a lot of useless work)?
    * Allows easier code optimization
      * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
    * Execute both paths
    * Can lead to wasted work
    * VLIW
    * SuperScalar
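A sketch of the "GHR XOR PC" indexing mentioned above, in the style of a gshare predictor (table size and names are my own assumptions): the XOR folds the global history and the branch address into a single index, so the same branch under different histories, or different branches under the same history, usually land in different counters.

<code python>
TABLE_BITS = 10
counters = [2] * (1 << TABLE_BITS)    # 2-bit counters, initialized weakly taken
ghr = 0                               # global history register

def index(pc: int) -> int:
    return (pc ^ ghr) & ((1 << TABLE_BITS) - 1)   # fold PC and history together

def predict(pc: int) -> bool:
    return counters[index(pc)] >= 2

def update(pc: int, taken: bool):
    global ghr
    i = index(pc)
    counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)
    ghr = ((ghr << 1) | int(taken)) & ((1 << TABLE_BITS) - 1)  # shift the outcome into the history

print(predict(0x4004d0)); update(0x4004d0, True)
</code>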
  
  
===== Lecture 11 (2/11 Mon.) =====
  
  * Geometric GHR length for branch prediction
  * Perceptron branch predictor (a minimal sketch follows this list)
  * Call and return prediction
    * A direct call is easy to predict
    * A return is harder (indirect branches)
      * Nested calls make returns easier to predict
        * Can use a stack to predict the return address
    * Indirect branch prediction
      * These branches have multiple targets
      * For switch-case, virtual function calls, jump tables, interface calls
      * BTB to predict the target address - low accuracy
      * History based: BTB + GHR
      * Virtual program counter prediction
  * Complications in superscalar processors
    * Fetch? What if multiple branches are fetched at the same time?
    * Logic required to ensure correctness?
  * Multi-cycle executions (different functional units take different numbers of cycles)
    * Instructions can retire out-of-order
  * Exceptions and interrupts
    * When are they handled?
    * Why should some interrupts be handled right away?
  * Precise exceptions
    * Arch. state should be consistent before handling the exception/interrupt
      * Easier to restart the process
    * How to ensure precise exceptions?
    * Tradeoffs between each method
  * Reorder buffer
    * Reorders results before they are visible to the arch. state
    * Where to get the value from (forwarding path? reorder buffer?)
      * Extra logic to check where the youngest instruction/value is
      * Content addressable search (CAM)
        * A lot of comparators
    * Different ways to simplify the reorder buffer
  * Future file (commonly used, along with the reorder buffer)
    * Keep two sets of register files
      * An updated value (speculative), called the future file
      * A backup value (to restore the state quickly)
    * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to the ROB alone)
  * Checkpointing
    * Advantages?
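A bare-bones perceptron-predictor sketch (history length, table size, and threshold are arbitrary choices for illustration): the prediction is the sign of a dot product between a per-branch weight vector and the global history, and the weights are nudged only when the prediction was wrong or not confident enough.

<code python>
H = 8                                           # global history length (assumed)
weights = [[0] * (H + 1) for _ in range(64)]    # one small perceptron per table entry
history = [1] * H                               # +1 = taken, -1 = not taken

def output(pc: int) -> int:
    w = weights[pc % 64]
    return w[0] + sum(wi * hi for wi, hi in zip(w[1:], history))   # bias + dot product

def update(pc: int, taken: bool, threshold=15):
    t = 1 if taken else -1
    w = weights[pc % 64]
    if (output(pc) >= 0) != taken or abs(output(pc)) <= threshold:  # train on mispredict or low confidence
        w[0] += t
        for i in range(H):
            w[i + 1] += t * history[i]
    history.pop(0); history.append(t)           # shift the outcome into the global history

update(0x400, True)
print(output(0x400) >= 0)                        # now biased toward predicting taken
</code>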

===== Lecture 14 (2/19 Wed.) =====

  * Predictors (branch predictor, cache line predictor, ...)
  * Power budget (and its importance)
  * Architectural state, precise state
  * Memory dependence is known only dynamically
  * Register state is not shared across threads/processors
  * Memory state is shared across threads/processors
  * How to maintain speculative memory states
  * Write buffers (help simplify the process of checking the reorder buffer)
  * Overall OoO mechanism
    * What are other ways of eliminating dispatch stalls?
    * Dispatch when the sources are ready
    * Retired instructions make the source available
    * Register renaming (a small sketch follows this list)
    * Reservation stations
      * What goes into the reservation station
      * Tags required in the reservation station
    * Tomasulo's algorithm
    * Without precise exceptions, OoO is hard to debug
    * Arch. register ID
    * Examples in the slides
      * Slide 28 --> register renaming
      * Slides 30-35 --> Exercise (also on the board)
        * This will be useful for the midterm
    * Register alias table
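A toy register-renaming sketch (the instruction format and table sizes are invented): the register alias table maps each architectural register to the newest tag that will produce its value, which is how WAR/WAW (false) dependences disappear in Tomasulo-style machines.

<code python>
from itertools import count

rat = {f"r{i}": f"p{i}" for i in range(4)}    # register alias table: arch reg -> latest tag
free_tags = (f"p{i}" for i in count(4))       # endless supply of fresh tags

def rename(dst, src1, src2):
    """Rename one 3-operand instruction; sources read the RAT before dst is remapped."""
    s1, s2 = rat[src1], rat[src2]
    new_tag = next(free_tags)
    rat[dst] = new_tag
    return new_tag, s1, s2

print(rename("r1", "r2", "r3"))   # ('p4', 'p2', 'p3')
print(rename("r1", "r1", "r2"))   # ('p5', 'p4', 'p2')  -- the WAW hazard on r1 is gone
</code>
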
  
===== Lecture 15 (2/21 Fri.) =====

  * OoO --> restricted dataflow
    * Extracting parallelism
    * What are the bottlenecks?
      * Issue width
      * Dispatch width
      * Parallelism in the program
    * More examples on slide 10
    * What does it mean to be restricted dataflow?
      * Still visible as a Von Neumann model
    * Where does the efficiency come from?
    * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * Would like to schedule them out of order, but make them visible in order
    * When do you schedule the load/store instructions?
    * Can we predict whether loads/stores are dependent?
    * This is one of the most complex structures in load/store handling
    * What information can be used to predict these load/store optimizations?
  * Note: IPC = 1/CPI
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle a misprediction/recovery
  * Token dataflow arch.
    * What are tokens?
    * How to match tokens
    * Tagged token dataflow arch.
    * What are the tradeoffs?
    * Difficulties?
===== Lecture 16 (2/24 Mon.) =====

  * SISD/SIMD/MISD/MIMD
  * Array processor
  * Vector processor
  * Data parallelism
    * Where does the concurrency arise?
  * Differences between array processors and vector processors
  * VLIW
  * Compactness of an array processor
  * A vector processor operates on a vector of data (rather than a single datum (scalar))
    * Vector length (also applies to array processors)
    * No dependencies within a vector --> can have a deep pipeline
    * Highly parallel (both instruction level (ILP) and memory level (MLP))
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
    * What do the functional units look like? Deep pipelines and simpler control.
    * CRAY-1 is one example of a vector processor
    * Memory access patterns in a vector processor
      * How do the memory accesses benefit the memory bandwidth?
      * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a brief explanation of memory level parallelism
      * Stride length vs. the number of banks (a small illustration follows this list)
        * The stride length should be relatively prime to the number of banks
      * Tradeoffs between row major and column major --> how can the vector processor deal with the two?
    * How to calculate the efficiency and performance of vector processors
    * What if there are multiple memory ports?
    * Gather/scatter allows vector processors to be a lot more programmable (i.e. gather data for parallelism)
      * Helps handle sparse matrices
    * Conditional operation
    * Structure of vector units
    * How to automatically parallelize code through the compiler?
      * This is a hard problem. The compiler does not know the memory addresses.
  * What do we need to ensure for both vector and array processors?
  * Sequential bottleneck
    * Amdahl's law
  * Intel MMX --> an example of Intel's approach to SIMD
    * No VLEN; use the opcode to define the length
    * The stride is one in MMX
    * Intel SSE --> modern version of MMX
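A quick numeric illustration of the stride-vs.-banks point above (8 banks and the strides are arbitrary): a stride that shares a factor with the number of banks keeps hitting the same few banks, while a relatively prime stride spreads the accesses across all of them.

<code python>
def banks_touched(stride, num_banks=8, accesses=16):
    """Which banks a strided stream hits, assuming element i lives in bank (i * stride) % num_banks."""
    return sorted({(i * stride) % num_banks for i in range(accesses)})

print(banks_touched(stride=8))   # [0]                       -- every access hits the same bank
print(banks_touched(stride=4))   # [0, 4]                    -- only 2 of the 8 banks are used
print(banks_touched(stride=7))   # [0, 1, 2, 3, 4, 5, 6, 7]  -- relatively prime: all banks used
</code>
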
===== Lecture 17 (2/26 Wed.) =====

  * GPU
    * Warp/wavefront
      * A bunch of threads sharing the same PC
    * SIMT
    * Lanes
    * FGMT + massively parallel
      * Tolerates long latency
    * Warp-based SIMD vs. traditional SIMD
  * SPMD (programming model)
    * A single program operates on multiple data
      * Can have synchronization points
    * Many scientific applications are programmed in this manner
  * Control flow problem (branch divergence)
    * Masking (in a branch, mask the threads that should not execute that path)
    * Lowers SIMD efficiency
    * What if you have layers of branches?
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
  * VLIW
    * Wide fetch
    * IA-64
    * Tradeoffs
      * Simple hardware (no dynamic scheduling, no dependency checking within a VLIW)
      * Puts a lot of the load on the compiler
  * Decoupled access/execute
    * Limited form of OoO
    * Tradeoffs
    * How to steer the instructions (determine dependencies/stalling)?
    * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
  * Staged processing
===== Lecture 18 (2/28 Fri.) =====

  * Tradeoffs of VLIW
    * Why does VLIW require static instruction scheduling?
    * Whose job is it?
      * The compiler can rearrange basic blocks/instructions
  * Basic block
    * Benefits of having large basic blocks
  * Entry/exit
    * Handling entries/exits
  * Trace cache
    * How to ensure correctness?
    * Profiling
    * Fixing up the instruction order to ensure correctness
    * Dealing with multiple entries into the block
    * Dealing with multiple exits from the block
  * Super block
    * How to form super blocks?
    * Benefits of super blocks
    * Tradeoff between forming and not forming a super block
      * Ambiguous branch (after profiling, both taken/not taken are equally likely)
      * Cleaning up
  * What scenarios would make trace caches/superblocks/profiling less effective?
  * List scheduling
    * Helps figure out which instructions the VLIW machine should fetch
    * Tries to maximize instruction throughput
    * How to assign priorities
    * What if some instructions take longer than others?
  * Block-structured ISA (BS-ISA)
    * Problems with trace scheduling?
    * What types of programs will benefit from BS-ISA?
    * How to form blocks in BS-ISA?
      * Combining basic blocks
      * Multiples of merged basic blocks
    * How to deal with entries/exits in BS-ISA?
      * Undo the executed instructions from the entry point, then fetch the new block
    * Advantages over the trace cache
  * Benefits of VLIW + static instruction scheduling
  * Intel IA-64
    * Static instruction scheduling and VLIW
===== Lecture 19 (3/19 Wed.) =====

  * Ideal cache
    * More capacity
    * Fast
    * Cheap
    * High bandwidth
  * DRAM cell
    * Cheap
    * Senses the perturbation through a sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
    * Expensive
    * Fast (easier to sense the value in the cell)
  * Memory bank
    * Read access sequence
      * DRAM: Activate -> Read -> Precharge (if needed)
    * What dominates the access latency for DRAM and SRAM?
  * Scaling issue
    * Hard to scale the cell down to smaller sizes
  * Memory hierarchy
    * Prefetching
    * Caching
  * Spatial and temporal locality
    * Caches can exploit these
    * Recently used data is likely to be accessed again
    * Nearby data is likely to be accessed
  * Caching in a pipelined design
  * Cache management
    * Manual
      * Data movement is managed manually
        * Embedded processors
        * GPU scratchpads
    * Automatic
      * HW manages data movement
  * Latency analysis
    * Based on the hit and miss status, the next level access time (if miss), and the current level access time
  * Cache basics
    * Set/block (line)/placement/replacement/direct mapped vs. associative caches/etc.
  * Cache access
    * How to access tag and data (in parallel vs. serially)
    * How do the tag and index get used?
    * Modern processors perform serial access for higher level caches (L3 for example) to save power
  * Cost and benefit of having more associativity
    * Given the associativity, which block should be replaced when the set is full?
    * Replacement policy
      * Random
      * Least recently used (LRU)
      * Least frequently used
      * Least costly to refetch
      * etc.
  * How to implement LRU (a small sketch follows this list)
    * How to keep track of the access ordering
      * Complexity increases rapidly
    * Approximate LRU
      * Victim and next-victim policy
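A tiny sketch of exact LRU bookkeeping for one cache set (the 4-way set and the access trace are made up): keeping the true access ordering is trivial in software, but it is exactly this ordering state that becomes expensive in hardware as associativity grows, which is why approximations like victim/next-victim are used.

<code python>
from collections import OrderedDict

class LRUSet:
    """One cache set with true LRU replacement."""
    def __init__(self, ways=4):
        self.ways = ways
        self.blocks = OrderedDict()            # tag -> None, ordered oldest -> newest

    def access(self, tag):
        if tag in self.blocks:                 # hit: make this block the most recently used
            self.blocks.move_to_end(tag)
            return "hit"
        if len(self.blocks) >= self.ways:      # miss in a full set: evict the least recently used
            self.blocks.popitem(last=False)
        self.blocks[tag] = None
        return "miss"

s = LRUSet()
print([s.access(t) for t in ["A", "B", "C", "D", "A", "E", "A"]])
# ['miss', 'miss', 'miss', 'miss', 'hit', 'miss', 'hit']
</code>
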
===== Lecture 20 (3/21 Fri.) =====

  * Set thrashing
    * The working set is bigger than the associativity
  * Belady's OPT
    * Is this optimal?
    * Complexity?
  * Similarity between caches and page tables
    * Number of blocks vs. pages
    * Time to find the block/page to replace
  * Handling writes
    * Write-through
      * Simpler, no consistency issues
    * Write-back
      * Need a modified (dirty) bit to make sure accesses to the data get the updated data
  * Sectored caches
    * Use subblocks
      * Lower bandwidth
      * More complex
  * Instruction vs. data caches
    * Where to place instructions
      * Unified vs. separated
    * In the first level cache
  * Cache access
    * First level access
    * Second level access
      * When to start the second level access
        * Performance vs. energy
  * Address translation
  * Homonyms and synonyms
    * Homonym: the same VA maps to different PAs
      * With multiple processes
    * Synonyms: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
      * I/O
    * Can these create problems when we have caches?
    * How to eliminate these problems?
      * Page coloring
  * Interaction between the cache and the TLB
    * Virtually indexed vs. physically indexed
    * Virtually tagged vs. physically tagged
    * Virtually indexed, physically tagged
  * Virtual memory in DRAM
    * Control where data is mapped to in channel/rank/bank
      * More parallelism
      * Reduced interference
===== Lecture 21 (3/24 Mon.) =====

  * Different parameters that affect cache misses
  * Thrashing
  * Different types of cache misses
    * Compulsory misses
      * Can be mitigated with prefetching
    * Capacity misses
      * More associativity
      * Victim cache
    * Conflict misses
      * Hashing
  * Large blocks vs. small blocks
  * Subblocks
  * Victim cache
    * A small but fully associative cache behind the actual cache
    * Caches evicted (victim) cache blocks
    * Prevents ping-ponging
  * Pseudo-associativity
    * A simpler way to implement an associative cache
  * Skewed associative cache
    * A different hashing function for each way
  * Restructuring the data access pattern (a loop-blocking sketch follows this list)
    * Order of loop traversal
    * Blocking
  * Memory level parallelism
    * The cost per miss of parallel cache misses is lower than that of serial misses
  * MSHRs
    * Keep track of pending cache misses
      * Think of this as something like a load/store buffer for the cache
    * What information goes into the MSHR?
    * When do you access the MSHR?
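A small sketch of the blocking idea (array and tile sizes are arbitrary): a column-by-column walk over a row-major array strides through memory, while a blocked (tiled) walk touches a small tile at a time so its working set can stay in the cache; both orders compute the same result.

<code python>
N, TILE = 256, 32
A = [[1] * N for _ in range(N)]    # row-major 2D array

def column_major_sum():            # strided accesses: poor spatial locality
    return sum(A[i][j] for j in range(N) for i in range(N))

def blocked_sum():                 # visit the array tile by tile: better locality
    total = 0
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            for i in range(ii, ii + TILE):
                for j in range(jj, jj + TILE):
                    total += A[i][j]
    return total

assert column_major_sum() == blocked_sum() == N * N
</code>
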
===== Lecture 22 (3/26 Wed.) =====

  * Multi-porting
    * Virtual multi-porting
      * Time-share the port; not too scalable, but cheap
    * True multi-porting
  * Multiple cache copies
  * Banking
    * Can have bank conflicts
    * Extra interconnect across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regfile in a GPU is also banked, but mainly for the purpose of reducing complexity)
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
  * DRAM (1T-1C) vs. SRAM (6T)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effects of the address mapping on memory interleaving
    * Effects of the program's memory access patterns on interleaving
  * DRAM bank
    * Minimizes the cost of interleaving (shares the data bus and the command bus)
  * DRAM rank
    * Minimizes the cost of the chips (a bundle of chips operated together)
  * DRAM channel
    * An interface to DRAM, each with its own ranks/banks
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data in DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to reduce row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping (a small sketch follows this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Sends a request whenever the request can be sent to the bank
      * Determines which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
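A sketch of the two address mappings listed above (the field widths are arbitrary examples, not a real controller's layout): the only difference is which address bits select the bank, and that decides whether consecutive cache blocks stay in one bank's row or get spread across banks.

<code python>
BLOCK_BITS, COL_BITS, BANK_BITS = 6, 6, 3    # example widths: 64 B blocks, 8 banks

def row_interleaved(addr):
    """row | bank | column | block offset: a whole row's blocks sit in one bank."""
    bank = (addr >> (BLOCK_BITS + COL_BITS)) & ((1 << BANK_BITS) - 1)
    row = addr >> (BLOCK_BITS + COL_BITS + BANK_BITS)
    return row, bank

def block_interleaved(addr):
    """row | column | bank | block offset: consecutive blocks go to consecutive banks."""
    bank = (addr >> BLOCK_BITS) & ((1 << BANK_BITS) - 1)
    row = addr >> (BLOCK_BITS + BANK_BITS + COL_BITS)
    return row, bank

for blk in range(4):                         # four consecutive 64-byte cache blocks
    addr = blk * 64
    print(row_interleaved(addr), block_interleaved(addr))
# row-interleaved keeps all four in bank 0; block-interleaved uses banks 0, 1, 2, 3
</code>
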
===== Lecture 23 (3/28 Fri.) =====

  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifier
    * How does it work?
  * Double data rate
  * Subarray
  * RowClone
    * Moves bulk data from one row to another
    * Lower latency and bandwidth usage when performing copies/zeroing out the data
  * TL-DRAM
    * Far segment
    * Near segment
    * What causes the long latency?
    * Benefits of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)

===== Lecture 24 (3/31 Mon.) =====

  * Memory controller
    * Different commands
  * Memory scheduler
    * Determines the order of requests to be issued to DRAM
    * Age / hit-miss status / type (load/store/prefetch, from GPU/from CPU) / criticality
  * Row buffer
    * Hit/conflict
    * Open/closed row
    * Open row policy
    * Closed row policy
    * Tradeoffs between the open and closed row policies
      * If the program has high row buffer locality, the open row policy might benefit more
      * The closed row policy will service miss requests faster
  * Bank conflicts
  * Interference from different applications/threads
    * Different programs/processes/threads interfere with each other
      * Introduces more row buffer/bank conflicts
    * The memory scheduler has to manage this interference
    * Memory hog problems
  * Interference on the data/command bus
  * FR-FCFS
    * Why does FR-FCFS make sense?
      * Row buffer hits have lower latency
    * Issues with FR-FCFS
      * Unfairness
  * STFM
    * Fairness issues in memory scheduling
    * How does STFM calculate fairness and slowdown?
      * How to estimate the slowdown relative to when the thread is running alone
    * Definition of fairness (based on STFM; different papers/areas define fairness differently)
  * PAR-BS
    * Parallelism in programs
    * Interference across banks
    * How to form a batch
    * How to determine the ranking between batches/within a batch

===== Lecture 25 (4/2 Wed.) =====

  * Latency sensitivity
    * Performance drops a lot when the memory request latency is long
  * TCM
    * Tradeoff between throughput and fairness
    * Latency-sensitive cluster (non-intensive cluster)
      * Ranking based on memory intensity
    * Bandwidth-intensive cluster
      * Round-robin within the cluster
    * Generally the latency-sensitive cluster gets higher priority
    * Provides robust fairness vs. throughput
    * Complexity of TCM?
  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keep track of each application's characteristics
        * Dedicating a channel might waste bandwidth
        * Need OS support to determine the channel bits
    * Source throttling
      * A controller throttles the cores depending on the performance target
      * Example: Fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold-based solution: hard to configure
    * App/thread scheduling
      * Critical threads usually stall progress
    * Designing the DRAM controller
      * Has to handle the normal DRAM operations
        * Read/write/refresh/all the timing constraints
      * Keep track of resources
      * Assign priorities to different requests
      * Manage requests to banks
    * Self-optimizing controller
      * Uses machine learning to improve the DRAM controller
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Refresh takes longer with bigger DRAM
    * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner
      * As opposed to burst refresh (long pause time)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
      * Profile the rows
        * High temperature changes the retention time: need online profiling
  * Bloom filter (a small sketch follows this list)
    * Represents set membership
    * Approximate
    * Can return false positives
      * Better/more hash functions help reduce this
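A minimal Bloom filter sketch to go with the RAIDR discussion (the size and the simple two-hash scheme are illustrative choices): a membership test can return a false positive but never a false negative, which is the property that makes it safe for remembering which rows need frequent refresh.

<code python>
import hashlib

M, K = 256, 2                       # bit-array size and number of hash functions (assumed)
bits = [0] * M

def _hashes(item: str):
    for k in range(K):              # derive K hashes from a salted digest
        digest = hashlib.sha256(f"{k}:{item}".encode()).hexdigest()
        yield int(digest, 16) % M

def insert(item: str):
    for h in _hashes(item):
        bits[h] = 1

def maybe_contains(item: str) -> bool:
    return all(bits[h] for h in _hashes(item))    # True can be a false positive

insert("row-0x1f3")
print(maybe_contains("row-0x1f3"))   # True
print(maybe_contains("row-0xabc"))   # very likely False, but a false positive is possible
</code>
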
===== Lecture 26 (4/7 Mon.) =====

  * Tolerating latency can be costly
    * The instruction window is complex
      * The benefit also diminishes
    * Designing the buffers can be complex
    * A simpler way to tolerate long latencies than full OoO is desirable
  * Different sources that cause the core to stall in OoO
    * Cache misses
    * Note that a stall happens if the inst. window is full
  * Scaling the instruction window size is hard
    * It is better (less complex) to make the window more efficient
  * Runahead execution
    * Tries to obtain MLP w/o increasing the instruction window
    * Runahead (i.e. execute ahead) when there is a long memory instruction
      * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
      * Execute future instructions to generate accurate prefetches
      * Allow future data to be in the cache
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing on the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
    * Cost and benefit of runahead execution (slide number 27)
    * Runahead can have inefficiency
      * Runahead periods that are useless
        * Get rid of useless, inefficient periods
    * What if there is a dependent cache miss?
      * Cannot be parallelized in vanilla runahead
      * Can predict the value of the dependent load
        * How to predict the address of the load
          * Delta value information
          * Stride predictor
          * AVD prediction

===== Lecture 27 (4/9 Wed.) =====

  * Questions regarding prefetching
    * What to prefetch
    * When to prefetch
    * How do we prefetch
    * Where to prefetch from
  * Prefetching can cause thrashing (evict a useful block)
  * Prefetching can also be useless (not being used)
    * Needs to be efficient
  * Can cause memory bandwidth problems in GPUs
  * Prefetch the whole block, more than one block, or a subblock?
    * Each one of them has pros and cons
    * A big prefetch is more likely to waste bandwidth
    * Commonly done at a cache block granularity
  * Prefetch accuracy: fraction of useful prefetches out of all the prefetches
  * A prefetcher usually predicts based on
    * Past knowledge
    * Compiler hints
  * The prefetcher has to prefetch at the right time
    * A prefetch that is too early might get evicted
      * It might also evict other useful data
    * A prefetch that is too late does not hide the whole memory latency
  * Previous prefetches at the same PC can be used as the history
  * Previous demand requests are also good information to use for prefetches
  * Prefetch buffer
    * Place the prefetched data in a buffer to avoid thrashing
      * Can treat demand/prefetch requests separately
      * More complex
  * Generally, a demand block is more important
    * This means eviction should prefer prefetch blocks over demand blocks
  * Tradeoffs between where we place the prefetcher
    * Look at L1 hits and misses
    * Look at L1 misses only
    * Look at L2 misses
    * Different access patterns affect accuracy
      * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only seeing L2 misses)
  * Software vs. hardware vs. execution-based prefetching
    * Software: the ISA provides prefetch instructions, and software utilizes them
      * What information is useful
      * How to make sure the prefetch is timely
      * What if you have a pointer-based structure
        * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot issue the next prefetch early enough)
          * Can be solved by hinting the next-next and/or next-next-next address
    * Hardware: identify the pattern and prefetch
    * Execution driven: opportunistically try to prefetch (runahead, dual-core execution)
  * Stride prefetcher (a small sketch follows this list)
    * Predicts strides, which are common in many programs
    * Cache block based or instruction based
  * Stream buffer design
    * Buffers the stream of accesses (next addresses)
    * Uses the information to prefetch
  * What affects prefetcher performance
    * Prefetch distance
      * How far ahead should we prefetch?
    * Prefetch degree
      * How many prefetches do we issue?
  * Prefetcher performance
    * Coverage
      * Out of the demand requests, how many are covered by prefetch requests
    * Accuracy
      * Out of all the prefetch requests, how many actually get used
    * Timeliness
      * How much of the memory latency can we hide with the prefetch requests
  * Feedback directed prefetcher
    * Uses the result of the prefetcher as feedback to the prefetcher
      * With accuracy, timeliness, and pollution information
  * Markov prefetcher
    * Prefetches based on the previous history
    * Uses a Markov model to predict
    * Pros: can cover arbitrary patterns (easy for linked list traversals or trees)
    * Downside: high cost; cannot help with compulsory misses (no history)
  * Content directed prefetching
    * Identify the content in memory that looks like pointers (which is used as the address to prefetch)
    * Not very efficient (hard to figure out which words are pointers)
      * Software can give hints
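A toy PC-based stride prefetcher sketch (the table layout and the "prefetch only after the stride repeats" rule are simplifications): each load PC remembers its last address and last stride, and once the same stride is seen twice in a row the next address is predicted.

<code python>
table = {}   # load PC -> (last_addr, last_stride, stride_confirmed)

def observe(pc, addr):
    """Train on a demand access; return a prefetch address once the stride is confirmed."""
    last_addr, last_stride, _ = table.get(pc, (None, None, False))
    prefetch = None
    if last_addr is None:
        table[pc] = (addr, None, False)
    else:
        stride = addr - last_addr
        confirmed = (stride == last_stride)
        if confirmed and stride != 0:
            prefetch = addr + stride
        table[pc] = (addr, stride, confirmed)
    return prefetch

for a in [0x1000, 0x1040, 0x1080, 0x10c0]:
    print(observe(pc=0x400, addr=a))   # None, None, 4288 (0x10c0), 4352 (0x1100)
</code>
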
===== Lecture 28 (4/14 Mon.) =====

  * Execution-based prefetchers
    * Helper thread/speculative thread
      * Use another thread to pre-execute a program
    * Can be software based or hardware based
    * Discover misses before the main program (to prefetch data in a timely manner)
    * How do you construct the helper thread?
    * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9
  * Benefits of multiprocessors
    * Improve performance without significantly increasing power consumption
    * Better cost efficiency and easier to scale
    * Improve dependability (in case another core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Data level parallelism
    * Task level parallelism
  * Task level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement)
  * Loosely coupled multiprocessors
    * No shared global address space
      * Message passing to communicate between the different sources
    * Simple to manage memory
  * Tightly coupled multiprocessors
    * Shared global address space
    * Need to ensure consistency of data
  * Switch-on-event multithreading
    * Switch to another context when there is an event (for example, a cache miss)
  * Simultaneous multithreading
    * Dispatch instructions from multiple threads at the same time
  * Amdahl's law (a small worked example follows this list)
    * Maximum speedup
    * The parallel portion is not perfect
      * Serial bottleneck
      * Synchronization cost
      * Load imbalance
        * Some threads have more work and require more time to reach the sync. point
  * Issues in parallel programming
    * Correctness
    * Synchronization
    * Consistency
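A quick Amdahl's-law calculation for the "maximum speedup" point (the 90% parallel fraction and the core counts are made up): the serial fraction caps the speedup no matter how many cores are added.

<code python>
def amdahl_speedup(parallel_fraction, n_cores):
    return 1.0 / ((1 - parallel_fraction) + parallel_fraction / n_cores)

for n in (2, 8, 64, 1_000_000):
    print(n, round(amdahl_speedup(0.90, n), 2))
# 2 -> 1.82,  8 -> 4.71,  64 -> 8.77,  1,000,000 -> ~10.0 (capped by the 10% serial part)
</code>
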
===== Lecture 29 (4/16 Wed.) =====

  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure the shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Weak consistency: global ordering only at synchronization points
    * The programmer hints where the synchronizations are
  * Total store order model: global ordering only for stores
  * Cache coherence
    * Can be done at the software level or the hardware level
  * Coherence protocols
    * Need to ensure that all the processors see and update the correct state of a cache block
    * Need to make sure that writes get propagated and serialized
    * Simple protocols are not scalable (one point of serialization)
  * Update vs. invalidate
    * For invalidate, only the core that writes retains the correct copy
      * Can lead to ping-ponging (tons of reads/writes from several processors)
    * For updates, the bus becomes the bottleneck
  * Snoopy bus
    * Bus based, single point of serialization
    * More efficient with a small number of processors
    * All caches snoop other caches' read/write requests to keep the cache blocks coherent
  * Directory based
    * Single point of serialization per block
    * The directory coordinates the coherence
    * More scalable
    * The directory keeps track of where the copies of each block reside
      * Supplies data on a read
      * Invalidates the block on a write
      * Has an exclusive state
  * MSI coherence protocol
    * Slides 56-57
    * Consumes bus bandwidth (needs an "exclusive" state)
  * MESI coherence protocol
    * Adds the exclusive state (this is the only cache copy and it is clean) to MSI
  * Tradeoffs between snooping and directory based coherence
    * Slide 71 has a good summary of this
  * MOESI
    * Improvement over the MESI protocol

===== Lecture 30 (4/18 Fri.) =====

  * Interference
  * Complexity of the memory scheduler
    * Ranking/prioritization has a cost
    * A complex scheduler has higher latency
  * Performance metrics for multicore/multithreaded applications
    * Speedup
    * Slowdown
    * Harmonic vs. weighted
  * Fairness metrics
    * Maximum slowdown
      * Why does it make sense?
      * Any scenario where it does not make sense?
  * Predictable performance
    * Why is it important?
      * In a server environment, different jobs run on the same server
      * In a mobile environment, there are multiple sources that can slow down other sources
    * How to relate slowdown with the request service rate
    * MISE: soft slowdown guarantees
  * BDI
    * Memory wall
      * What is the concern regarding the memory wall?
    * Size of the cache on the die (CPU die)
    * One possible solution: cache compression
      * What are the problems with existing cache compression mechanisms?
        * Some are too complex
        * Decompression is on the critical path
          * Need to decompress when reading the data -> decompression should not be on the critical path
          * An important factor for performance
    * Software compression is not good enough to compress everything
    * Zero value compression
      * Simple
      * Good compression ratio
      * What if the data does not have many zeroes?
    * Frequent value compression
      * Some values appear frequently
      * Simple and good compression ratio
      * Have to profile
      * Decompression is complex
    * Frequent pattern compression
      * Still too complex in terms of decompression
    * Base-delta compression
      * Easy to decompress but retains the benefit of compression

===== Lecture 31 (4/28 Mon.) =====

  * Directory based cache coherence
    * Each directory has to handle validation/invalidation
    * Extra cost of synchronization
    * Need to ensure race conditions are resolved
  * Interconnects
    * Topology
      * Bus
      * Mesh
        * Torus
      * Tree
      * Butterfly
      * Ring
        * Bi-directional ring
          * More scalable
        * Hierarchical ring
          * Even more scalable
          * More complex
      * Crossbar
      * etc.
    * Circuit switching
    * Multistage networks
      * Butterfly
      * Delta network
    * Handling contention
      * Buffering vs. dropping/deflection (no buffering)
    * Routing algorithms
      * Handling deadlock
      * X-Y routing
        * Turn model (to avoid deadlocks)
      * Add more buffering for an escape path
      * Oblivious routing
        * Can take different paths
          * DOR between each intermediate location
        * Balances the network load
      * Adaptive routing
        * Uses the state of the network to determine the route
          * Aware of local and/or global congestion
        * Non-minimal adaptive routing can have livelocks
===== Lecture 32 (4/30 Wed.) =====

  * Serialized code sections
    * Degrade performance
    * Waste energy
  * Heterogeneous cores
    * Can execute the serialized portion on a powerful large core
  * Tradeoffs between multiple small cores, multiple large cores, and heterogeneous cores
  * Critical sections
    * A bottleneck in several multithreaded workloads
    * Asymmetry can help
    * Accelerated critical sections (ACS)
      * Use a large core to run the serialized portion of the code
      * How to correctly support ACS
      * False serialization
      * Handling private/shared data
    * BIS
      * Identify the bottleneck
        * Serial bottleneck
        * Barrier
        * Critical section
        * Pipeline stages
      * An application might wait on different types of bottlenecks
      * Allow BottleneckCall and BottleneckReturn
      * Acceleration can be done in multiple ways
        * Ship to a big core
        * Increase the frequency
        * Prioritize the thread in shared resources (the memory scheduler always schedules requests from that thread first, etc.)
      * A bottleneck table keeps track of different threads' bottlenecks and determines criticality