====== Buzzwords ======
  
Buzzwords are terms mentioned in lecture that are particularly important to understand thoroughly. This page tracks the buzzwords for each lecture and can be used as a reference for finding gaps in your understanding of the course material.

===== Lecture 1 (1/12 Mon.) =====
  * Level of transformation
    * Algorithm
    * Compiler
  * Cross abstraction layers
  * Tradeoffs
  * Caches
  * DRAM/memory controller
  * DRAM banks
  * Row buffer hit/miss
  * Row buffer locality
  * Unfairness
  * Memory performance hog
  * Shared DRAM memory system
  * Streaming access vs. random access
  * Memory scheduling policies
  * Scheduling priority
  * Retention time of DRAM
  * Process variation
  * Retention time profile
  * Power consumption
  * Bloom filter (see the sketch after this list)
  * Hamming code
  * Hamming distance
  * DRAM row hammer
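
A minimal Bloom filter sketch (illustrative, not from lecture; the two hash functions and the 64-bit filter size are assumptions). It can answer "definitely not present" or "maybe present": false positives are possible, false negatives are not.

<code c>
// Bloom filter: set both hash positions on insert; report membership only
// if both positions are set.
#include <stdint.h>
#include <stdio.h>

static uint64_t filter;  // 64-bit bitmap

static unsigned h1(uint32_t x) { return (x * 2654435761u) % 64; }
static unsigned h2(uint32_t x) { return ((x >> 3) ^ (x >> 16) ^ x) % 64; }

static void insert(uint32_t x) {
    filter |= (1ULL << h1(x)) | (1ULL << h2(x));
}

static int maybe_contains(uint32_t x) {
    return ((filter >> h1(x)) & 1) && ((filter >> h2(x)) & 1);
}

int main(void) {
    insert(42);
    printf("42: %d\n", maybe_contains(42));  // always 1 (was inserted)
    printf("7:  %d\n", maybe_contains(7));   // usually 0; can be a false positive
    return 0;
}
</code>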
  
===== Lecture 2 (1/14 Wed.) =====

  * Moore's Law
  * Algorithm --> step-by-step procedure to solve a problem
  * in-order execution
  * out-of-order execution
  * Memory wall (a part of the scaling issue)
  * Scaling issue
    * Transistors are getting smaller
  * Key components of a computer
  * Design points
    * Design processors to meet the design points
  * Software stack
  * Design decisions
  * Datacenters
  * Reliability problems that cause errors
  * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
    * Pre-paradigm science
    * Normal science
    * Revolutionary science
  * Components of a computer
    * Computation
    * Communication
    * Storage
      * DRAM
      * NVRAM (Non-volatile memory): PCM, STT-MRAM
      * Storage (Flash/Harddrive)
  * Von Neumann Model (Control flow model)
    * Stored program computer
      * Properties of the Von Neumann Model: stored program, sequential instruction processing
      * Unified memory
        * When is an instruction interpreted as an instruction (as opposed to a datum)?
      * Program counter
      * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
  * Data flow model
    * Data flow machine
    * Operands
    * Live-outs/Live-ins
    * Different types of data flow nodes (conditional/relational/barrier)
    * How to do transactions in dataflow?
      * Example: bank transactions
  * Tradeoffs between control-driven and data-driven
    * Which is easier to program?
    * Which is easier to compile?
    * Which is more parallel (does that mean it is faster?)
    * Which machines are more complex to design?
    * In control flow, when a program stops, there is a pointer to the current state (precise state).
  * ISA vs. Microarchitecture
    * Semantics in the ISA
    * uArch should obey the ISA
    * Changing the ISA is costly and can affect compatibility
  * Instruction pointers
  * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
      * Out-of-order executions
      * etc.
    * Design techniques
      * Adder implementation (bit serial, ripple carry, carry lookahead)
      * Connection Machine (an example of a machine that uses bit-serial adders to trade off latency for more parallelism)
  * Microprocessor: ISA + uArch + circuits
  * What is part of the ISA? Instructions, memory, etc.
===== Lecture 3 (1/17 Fri.) =====
  
  * Microarchitecture
  * Three major tradeoffs of computer architecture
  * Macro-architecture
  * LC-3b ISA
  * Unused instructions
  * Bit steering
  * Instruction processing style
  * 0,1,2,3 address machines
    * Stack machine
    * Accumulator machine
    * 2-operand machine
    * 3-operand machine
    * Tradeoffs between 0,1,2,3 address machines
  * Postfix notation (see the sketch after this list)
  * Instructions/Opcode/Operand specifiers (i.e. addressing modes)
  * Simple vs. complex data types (and their tradeoffs)
  * Semantic gap and level
  * Translation layer
  * Addressability
  * Byte/bit addressable machines
  * Virtual memory
  * Big/little endian
  * Benefits of having registers (data locality)
  * Programmer visible (architectural) state
    * Programmers can access this directly
    * What are the benefits?
  * Microarchitectural state
    * Programmers cannot access this directly
  * Evolution of registers (from accumulators to registers)
  * Different types of instructions
    * Control instructions
    * Data instructions
    * Operation instructions
  * Addressing modes
    * Tradeoffs (complexity, flexibility, etc.)
  * Orthogonal ISA
    * Addressing modes that are orthogonal to instruction types
  * I/O devices
  * Vectored vs. non-vectored interrupts
  * Complex vs. simple instructions
    * Tradeoffs
  * RISC vs. CISC
    * Tradeoffs
    * Backward compatibility
    * Performance
    * Optimization opportunity
  * Translation
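
A toy illustration of why postfix notation maps directly onto a 0-address (stack) machine; the expression format and the one-digit operands are assumptions for brevity:

<code c>
// Evaluate a postfix expression on an operand stack, the way a 0-address
// (stack) ISA would: operands are PUSHed, operators POP two and PUSH one.
#include <ctype.h>
#include <stdio.h>

int eval_postfix(const char *p) {
    int stack[64], top = 0;
    for (; *p; p++) {
        if (isdigit((unsigned char)*p)) {
            stack[top++] = *p - '0';          // PUSH literal
        } else {
            int b = stack[--top];             // POP right operand
            int a = stack[--top];             // POP left operand
            stack[top++] = (*p == '+') ? a + b : a * b;  // PUSH result
        }
    }
    return stack[0];
}

int main(void) {
    // "3 4 + 2 *" is the postfix form of (3 + 4) * 2
    printf("%d\n", eval_postfix("34+2*"));    // prints 14
    return 0;
}
</code>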
  
===== Lecture 4 (1/21 Wed.) =====
  
  * Fixed vs. variable length instructions
  * Huffman encoding
  * Uniform vs. non-uniform decode
  * Registers
    * Tradeoffs in the number of registers
  * Alignment
    * How does MIPS load words across the alignment boundary?
  
===== Lecture 5 (1/26 Mon.) =====
  
  * Tradeoffs in ISA: instruction length
    * Uniform vs. non-uniform
  * Design point/use cases
    * What dictates the design point?
  * Architectural states
  * uArch
    * How to implement the ISA in the uArch
  * Different stages in the uArch
  * Clock cycles
  * Multi-cycle machine
  * Datapath and control logic
    * Control signals
  * Execution time of instructions/programs
    * Metrics and what they mean (see the worked example after this list)
  * Instruction processing
    * Fetch
    * Memory fetch
    * Writeback
  * Encoding and semantics
  * Different types of instructions (I-type, R-type, etc.)
  * Control flow instructions
  * Critical path analysis
    * Critical path of a single cycle processor
  * What is in the control signals?
    * Combinational logic & sequential logic
  * Control store
  * Tradeoffs of a single cycle uarch
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
    * Dynamic power/static power
      * Increases in power due to frequency
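
A worked instance of the execution-time metric above (the iron law: time = instruction count x CPI x clock period); all the numbers are made up for illustration:

<code c>
// Iron law of performance: time = instructions x CPI x clock period.
#include <stdio.h>

int main(void) {
    double insts   = 1e9;   // dynamic instruction count (assumed)
    double cpi     = 1.2;   // average cycles per instruction (assumed)
    double tclk_ns = 0.5;   // clock period in ns, i.e. a 2 GHz clock

    double seconds = insts * cpi * tclk_ns * 1e-9;
    printf("time = %.2f s, IPC = %.2f\n", seconds, 1.0 / cpi);  // 0.60 s, 0.83
    return 0;
}
</code>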
  
===== Lecture 6 (1/28 Mon.) =====
  
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
  * Multi cycle design
  * Microcoded/Microprogrammed machines
    * States
    * Translation from one state to another
    * Microinstructions
    * Microsequencing
  * Variable latency memory
  * Handling interrupts
  * Difference between interrupts and exceptions
  * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Vertical microcode
  * Primitives

===== Lecture 7 (1/30 Fri.) =====

  * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives
  * Nanocode and millicode
    * What are the differences between nano/milli/microcode?
  * Microprogrammed vs. hardwired control
  * Pipelining
  * Limitations of the multi-programmed design
    * Idle resources
  * Throughput of a pipelined design
    * What dictates the throughput of a pipelined design?
  * Latency of the pipelined design
  * Dependency
    * Pipeline flush
    * Speculation

===== Lecture 8 (2/2 Mon.) =====

  * Interlocking
  * Multipath execution
  * No-op (bubbles in the pipeline)
  * Valid bits in the instructions
  * Branch prediction
  * Different types of data dependence
    * Based on the profiling info, position the code accordingly
    * Try to make the next sequential instruction be the next inst. to be executed
  * Predicate combining (combine predicates for a branch instruction)
  * Predicated execution (control dependence becomes data dependence)

===== Lecture 9 (2/4 Wed.) =====

  * Definition of basic blocks
  * Control flow graph
  * Delayed branching
    * Benefit?
    * Inst. from different threads can fill in the bubbles
    * Cost?
  * Simultaneous multithreading
  * Branch prediction
    * Guess what to fetch next.
       * Given the program/​number of instructions,​ percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions.       * Given the program/​number of instructions,​ percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions.
         * How many extra instructions are being fetched?         * How many extra instructions are being fetched?
-        * What is the performance ​degredation?+        * What is the performance ​degradation?
     * How to reduce the miss penalty?     * How to reduce the miss penalty?
     * Predicting the next address (non PC+4 address)     * Predicting the next address (non PC+4 address)
    * Global branch history - for directions
    * Can use the compiler to profile and get more info
      * Input set dictates the accuracy
      * Adds time to compilation
    * Heuristics that are common and don't require profiling
      * Might be inaccurate
      * Does not require profiling
    * Static branch prediction
      * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch
      * For example, x86 has the hint bit
    * Dynamic branch prediction
        * One more bit for hysteresis
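
A back-of-envelope version of the misprediction-cost exercise mentioned above; every parameter below is an assumed example value:

<code c>
// Cost of branch mispredictions: extra flush cycles inflate the CPI.
#include <stdio.h>

int main(void) {
    double insts    = 1e9;   // total dynamic instructions (assumed)
    double br_frac  = 0.20;  // fraction of instructions that are branches
    double accuracy = 0.98;  // branch prediction accuracy
    double penalty  = 20.0;  // flush penalty in cycles per misprediction
    double base_cpi = 1.0;   // CPI with perfect prediction

    double mispred = insts * br_frac * (1.0 - accuracy);
    double cycles  = insts * base_cpi + mispred * penalty;
    printf("mispredictions = %.2e, effective CPI = %.2f\n",
           mispred, cycles / insts);   // 4e6 mispredictions -> CPI 1.08
    return 0;
}
</code>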
  
===== Lecture 10 (2/6 Fri.) =====
  * Branch prediction accuracy
    * Why is it very important?
      * Differences between 99% accuracy and 98% accuracy
      * Cost of a misprediction when the pipeline is very deep
  * Global branch correlation
    * Some branches are correlated
    * What information is used
  * Two-level branch prediction
    * What entries do you keep in the global history?
    * What entries do you keep in the local history?
    * How many tables?
    * Store both GHR and PC in one combined piece of information (see the gshare sketch after this list)
    * How do you use the information? Why is the XOR result still usable?
  * Warmup cost of the branch predictor
    * Hybrid solution? The fast-warmup predictor is used first, then switch to the slower one.
  * Tournament predictor (Alpha 21264)
  * Predicated execution - eliminates branches
    * What are the tradeoffs
    * What if the block is big (can lead to executing a lot of useless work)
    * Allows easier code optimization
      * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
      * Use branch prediction on easy-to-predict code
      * Use predicated execution on hard-to-predict code
      * Compiler can be more aggressive in optimizing the code
    * What are the tradeoffs (slide# 47)
  * Multi-path execution
    * Execute both paths
    * Can lead to wasted work
    * VLIW
    * Superscalar
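
A sketch of the GHR-XOR-PC (gshare-style) indexing and 2-bit saturating counters discussed above; the table size and history length are assumptions. The warmup mispredictions visible in the output illustrate both predictor warmup cost and counter hysteresis:

<code c>
// Gshare-style predictor: XOR global history with PC bits to index a
// table of 2-bit saturating counters (the second bit gives hysteresis).
#include <stdint.h>
#include <stdio.h>

#define IDX_BITS 12
static uint8_t  table[1 << IDX_BITS];   // 2-bit counters, 0..3
static uint16_t ghr;                    // global history register

static unsigned index_of(uint32_t pc) {
    return ((pc >> 2) ^ ghr) & ((1u << IDX_BITS) - 1);  // fold history into PC
}

static int predict(uint32_t pc) {
    return table[index_of(pc)] >= 2;    // MSB decides taken/not-taken
}

static void train(uint32_t pc, int taken) {
    uint8_t *c = &table[index_of(pc)];
    if (taken  && *c < 3) (*c)++;       // saturate at strongly taken
    if (!taken && *c > 0) (*c)--;       // saturate at strongly not-taken
    ghr = (uint16_t)((ghr << 1) | (taken & 1u));  // shift outcome into history
}

int main(void) {
    // An always-taken loop branch: once the history pattern repeats,
    // the 2-bit counter saturates and the predictor locks onto "taken".
    for (int i = 0; i < 24; i++) {
        int guess = predict(0x400100);
        train(0x400100, 1);
        printf("iter %2d: predicted %d\n", i, guess);
    }
    return 0;
}
</code>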
  
  
===== Lecture 11 (2/11 Wed.) =====
  
  * Geometric GHR length for branch prediction
  * Perceptron branch predictor
  * Multi-cycle executions (different functional units take different numbers of cycles)
    * Instructions can retire out-of-order
  * Exceptions and Interrupts
    * When are they handled?
    * Why should some interrupts be handled right away?
  * Precise exceptions
    * Arch. state should be consistent before handling the exception/interrupt
      * Easier to restart the processes
    * How to ensure precise exceptions?
    * Tradeoffs between each method
  * Reorder buffer
    * Reorder results before they are visible to the arch. state
      * Need to preserve the sequential semantics and data
    * What information is in the ROB entry
    * Where to get the value from (forwarding path? reorder buffer?)
      * Extra logic to check where the youngest instruction/value is
      * Content addressable search (CAM)
        * A lot of comparators
    * Different ways to simplify the reorder buffer
  * Future file (commonly used, along with the reorder buffer)
    * Keep two sets of register files
      * An updated (speculative) value, called the future file
      * A backup value (to restore the state quickly)
    * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to ROB alone)
  * Branch misprediction resembles an exception
    * The difference is that branch misprediction is not visible to the software
  * Checkpointing
    * Advantages?

===== Lecture 12 (2/13 Fri.) =====

  * Renaming
  * Register renaming table
  * Predictor (branch predictor, cache line predictor ...)
  * Power budget (and its importance)
      * Slides 28 --> register renaming
      * Slides 30-35 --> Exercise (also on the board)
        * This will be useful for the midterm
    * Register aliasing table
    * Broadcasting tags
    * Using dataflow
  
===== Lecture 13 (2/16 Mon.) =====
  
  * OoO --> Restricted Dataflow
      * Dispatch width
      * Parallelism in the program
    * What does it mean to be restricted data flow
      * Still visible as a Von Neumann model
    * Where does the efficiency come from?
    * Size of the scheduling windows/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * Would like to schedule them out of order, but make them visible in-order
    * This is one of the most complex structures in load/store handling
    * What information can be used to predict these load/store optimizations?
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle when there is a misprediction/recovery
    * OoO + branch prediction?
    * Speculatively update the history register
      * When do you update the GHR?
  * Token dataflow arch.
    * What are tokens?
    * Difficulties?
  
===== Lecture 14 (2/18 Wed.) =====
  
  * SISD/SIMD/MISD/MIMD
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
    * What do the functional units look like? Deep pipeline and simpler control.
    * CRAY-I is one example of a vector processor
    * Memory access pattern in a vector processor
      * How do the memory accesses benefit the memory bandwidth?
      * Memory level parallelism
      * Stride length vs. the number of banks
        * Stride length should be relatively prime to the number of banks (see the sketch after this list)
    * What if there are multiple memory ports?
    * Gather/Scatter allows a vector processor to be a lot more programmable (i.e. gather data for parallelism)
      * Helps handle sparse matrices
    * Conditional operation
    * Structure of vector units
    * Intel SSE --> modern version of MMX
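
A quick demonstration of the stride/bank interaction noted above; the 8 banks and the strides shown are arbitrary example values:

<code c>
// Which bank does each access of a strided stream hit? With 8 banks,
// stride 8 serializes on bank 0, while stride 7 (relatively prime to 8)
// cycles through all banks and keeps them all busy.
#include <stdio.h>

static void show(int stride, int banks, int n) {
    printf("stride %d:", stride);
    for (int i = 0; i < n; i++)
        printf(" %d", (i * stride) % banks);  // bank index of access i
    printf("\n");
}

int main(void) {
    show(8, 8, 8);   // 0 0 0 0 0 0 0 0  -> bank conflict on every access
    show(7, 8, 8);   // 0 7 6 5 4 3 2 1  -> spread across all banks
    return 0;
}
</code>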
  
===== Lecture 15 (2/20 Fri.) =====
  * GPU
    * Warp/Wavefront
    * Lower SIMD efficiency
    * What if you have layers of branches?
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
    * How to steer the instruction (determine dependency/stalling)?
    * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
  * Stage processing
 +===== Lecture 16 (2/23 Mon.) =====
 +  * Systolic arrays
 +    * Processing elements transform data in chains
 +    * Can be arrays of multi-dimensional processing elements
 +    * Develop for image processing (for example, convolution)
 +    * Can be use to break stages in pipeline programs, using a set of queues and processing elements
 +    * Can enable high concurrency and good for regular programs
 +    * Very special purpose
 +    * The warp computer
 +  * Static instruction scheduling
 +    * How do we find the next instruction to execute?
 +  * Live-in and live-out
 +  * Basic blocks
 +    * Rearranging instructions in the basic block
 +    * Code movement from one basic block to another
 +  * Straight line code
 +  * Independent instructions
 +    * How to identify independent instructions
 +  * Atomicity
 +  * Trace scheduling
 +    * Side entrance
 +    * Fixed up code
 +    * How scheduling is done
 +  * Instruction scheduling
 +    * Prioritization heuristics
 +  * Superblock
 +    * Traces with no side-entrance
 +  * Hyperblock
 +  * BS-ISA
 +  * Tradeoffs between trace cache/​Hyperblock/​Superblock/​BS-ISA
 +    ​
 +===== Lecture 17 (2/25 Wed.) =====
 +  * IA-64
 +    * EPIC
 +  * IA-64 instruction bundle
 +    * Multiple instructions in the bundle along with the template bit
 +    * Template bits
 +    * Stop bits
 +    * Non-faulting loads and exception propagation
 +  * Aggressive ST-LD reordering
 +  * Physical memory system
 +  * Ideal pipelines
   * Ideal cache   * Ideal cache
     * More capacity     * More capacity
Line 682: Line 649:
   * DRAM cell   * DRAM cell
     * Cheap     * Cheap
-    * Sense the purturbation ​through sense amplifier+    * Sense the perturbation ​through sense amplifier
     * Slow and leaky     * Slow and leaky
   * SRAM cell (Cross coupled inverter)   * SRAM cell (Cross coupled inverter)
Line 689: Line 656:
   * Memory bank   * Memory bank
     * Read access sequence     * Read access sequence
-      ​* DRAM: Activate -> Read -> Precharge (if needed) +    ​* DRAM: Activate -> Read -> Precharge (if needed) 
-    * What dominate the access ​laatency ​for DRAM and SRAM+    * What dominate the access ​latency ​for DRAM and SRAM
   * Scaling issue   * Scaling issue
     * Hard to scale the scale to be small     * Hard to scale the scale to be small
Line 718: Line 685:
   * Cost and benefit of having more associativity   * Cost and benefit of having more associativity
     * Given the associativity,​ which block should be replace if it is full     * Given the associativity,​ which block should be replace if it is full
-    * Replacement ​poligy+    * Replacement ​policy
       * Random       * Random
       * Least recently used (LRU)       * Least recently used (LRU)
Line 729: Line 696:
     * Approximate LRU     * Approximate LRU
       * Victim and next Victim policy       * Victim and next Victim policy
- +  

===== Lecture 18 (2/27 Fri.) =====

  * Tag store and data store
  * Cache hit rate
  * Average memory access time (AMAT) (see the worked example after this list)
  * AMAT vs. stall time
  * Cache basics
    * Direct mapped vs. associative cache
    * Set/block (line)/placement/replacement
    * How do tag and index get used?
  * Full associativity
  * Set associative cache
    * Insertion, promotion, eviction (replacement)
  * Various replacement policies
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly
    * Approximate LRU
      * Victim and next-victim policy
  * Set thrashing
    * Working set is bigger than the associativity
    * Is this optimal?
    * Complexity?
  * DRAM as a cache for disk
  * Handling writes
    * Write through
    * In the first level cache
  * Cache access
    * First level access
    * Second level access
      * When to start the second level access
  * Cache performance
    * Capacity
    * Block size
    * Associativity
  * Classification of cache misses
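
A worked AMAT computation for the two-level access pattern above; the latencies and miss rates are assumed example numbers:

<code c>
// AMAT = hit time + miss rate * miss penalty, applied recursively to L1/L2.
#include <stdio.h>

int main(void) {
    double l1_hit  = 1.0,  l1_miss = 0.05;  // cycles; fraction of accesses
    double l2_hit  = 10.0, l2_miss = 0.20;  // fraction of L1 misses missing L2
    double mem_lat = 200.0;                 // main memory latency in cycles

    double l2_amat = l2_hit + l2_miss * mem_lat;   // 10 + 0.2*200 = 50
    double amat    = l1_hit + l1_miss * l2_amat;   // 1 + 0.05*50  = 3.5
    printf("AMAT = %.1f cycles\n", amat);
    return 0;
}
</code>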

===== Lecture 19 (03/02 Mon.) =====

  * Subblocks
  * Victim cache
    * What information goes into the MSHR?
    * When do you access the MSHR?
  * Memory banks
  * Shared caches in multi-core processors

===== Lecture 20 (03/04 Wed.) =====

  * Virtual vs. physical memory
  * System's management of memory
    * Benefits
  * Problem: physical memory has limited size
  * Mechanisms: indirection, virtual addresses, and translation
  * Demand paging
  * Physical memory as a cache
  * Tasks of system SW for VM
  * Serving a page fault
  * Address translation (see the sketch after this list)
  * Page table
    * PTE (page table entry)
  * Page replacement algorithm
    * CLOCK algo.
    * Inverted page table
  * Page size trade-offs
  * Protection
  * Multi-level page tables
  * x86 implementation of page table
  * TLB
    * Handling misses
  * When to do address translation?
  * Homonyms and Synonyms
    * Homonym: the same VA maps to different PAs with multiple processes
    * Synonym: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
  * Virtually indexed vs. physically indexed
  * Virtually tagged vs. physically tagged
  * Virtually indexed, physically tagged
  * Can these create problems when we have the cache?
  * How to eliminate these problems?
  * Page coloring
  * Interaction between cache and TLB
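
A toy illustration of the VPN/offset split behind the address-translation bullet above; the 4 KB page size is standard, but the single-level, 8-bit-VPN page table is purely an assumption for brevity:

<code c>
// Translate a virtual address: split into VPN + offset, look up the PFN,
// and splice the offset back on. No faults or permissions handled here.
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12                         // 4 KB pages
#define OFF_MASK  ((1u << PAGE_BITS) - 1)

static uint32_t page_table[1 << 8];          // toy PTEs: VPN -> PFN

static uint32_t translate(uint32_t va) {
    uint32_t vpn = va >> PAGE_BITS;          // virtual page number
    uint32_t off = va & OFF_MASK;            // page offset (unchanged)
    return (page_table[vpn] << PAGE_BITS) | off;
}

int main(void) {
    page_table[0x2] = 0x80;                  // map VPN 0x2 -> PFN 0x80
    printf("VA 0x2345 -> PA 0x%x\n", translate(0x2345));  // 0x80345
    return 0;
}
</code>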
  
  
===== Lecture 21 (03/23 Mon.) =====
  
  * DRAM scaling problem
  * Demands/trends affecting the main memory
    * More capacity
    * Low energy
    * More bandwidth
    * QoS
  * ECC in DRAM
  * Multi-porting
    * Virtual multi-porting
    * True multiporting
  * Multiple cache copies
  * Alignment
  * Banking
    * Can have bank conflicts
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regfile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  * Bank mapping (see the sketch at the end of this lecture's list)
    * How to avoid bank conflicts?
  * Channel mapping
    * Address mapping to minimize bank conflicts
    * Page coloring
      * Virtual to physical mapping that can help reduce conflicts
  * Accessing DRAM
    * Row bits
    * Addressability
    * DRAM has its own clock
    * Sense amplifier
    * Bit lines
    * Word lines
  * DRAM (2T) vs. SRAM (6T)
    * Cost
  * DRAM channel
    * An interface to DRAM, each with its own ranks/banks
  * DRAM chip
  * DIMM
    * More DIMMs add to the interconnect complexity
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
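
The bank-mapping sketch promised above: two simple physical-address-to-bank mappings and how a streaming access pattern lands on banks; all field widths are assumptions:

<code c>
// Row-interleaved vs. cache-block-interleaved bank mapping.
// Different address bits select the bank, so the same access stream
// spreads across banks very differently.
#include <stdint.h>
#include <stdio.h>

#define BANKS      8
#define BLOCK_BITS 6    // 64 B cache blocks
#define COL_BITS   11   // 2 KB rows

// Row interleaved: consecutive rows map to consecutive banks.
static unsigned bank_row_il(uint64_t pa)   { return (pa >> COL_BITS) % BANKS; }
// Block interleaved: consecutive cache blocks map to consecutive banks.
static unsigned bank_block_il(uint64_t pa) { return (pa >> BLOCK_BITS) % BANKS; }

int main(void) {
    for (uint64_t pa = 0; pa < 4 * 64; pa += 64)   // stream of 64 B blocks
        printf("pa %4llu: row-il bank %u, block-il bank %u\n",
               (unsigned long long)pa, bank_row_il(pa), bank_block_il(pa));
    return 0;   // row-il keeps hitting bank 0; block-il rotates banks
}
</code>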
  
===== Lecture 22 (03/25 Wed.) =====

  * Flash controller
  * Flash memory
  * Garbage collection in flash
  * Overhead in flash memory
    * Erase (off the critical path, but takes a long time)
  * Different types of DRAM
  * DRAM design choices
    * Cost/density/latency/BW/yield
    * Benefit of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits from an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sending DRAM commands
    * Periodically send commands to refresh DRAM cells
    * Ensure correctness and data integrity
    * Where to place the memory controller
      * On CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determine the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies (see the FR-FCFS sketch after this list)
    * FCFS
    * FR-FCFS
      * Try to maximize row buffer hit rate
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
    * PAR-BS
      * Key benefits
      * Stall time
      * Shortest job first
    * STFM
    * ATLAS
    * TCM
      * Key benefits
      * Configurability
      * Fairness + performance at the same time
      * Robustness issues
  * Open row policy
  * Closed row policy
  * QoS
    * QoS issues in memory scheduling
    * Fairness
    * Performance guarantee
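
A highly simplified FR-FCFS selection loop, assuming a request struct and per-bank open-row state; real controllers schedule at the command level, as noted above:

<code c>
// FR-FCFS: among queued requests, first prefer row-buffer hits
// (row already open in the bank), then prefer the oldest request.
#include <stdio.h>

struct req { int bank; int row; long arrival; };

static int open_row[8] = { -1, -1, -1, -1, -1, -1, -1, -1 };

// Return the index of the next request to service, or -1 if the queue is empty.
static int fr_fcfs(struct req *q, int n) {
    int best = -1, best_hit = 0;
    for (int i = 0; i < n; i++) {
        int hit = (q[i].row == open_row[q[i].bank]);   // row buffer hit?
        if (best < 0 ||
            hit > best_hit ||                                         // 1) row hits first
            (hit == best_hit && q[i].arrival < q[best].arrival)) {    // 2) then oldest
            best = i;
            best_hit = hit;
        }
    }
    return best;
}

int main(void) {
    struct req q[3] = { {0, 5, 1}, {0, 9, 2}, {1, 3, 3} };
    open_row[0] = 9;                                  // bank 0 has row 9 open
    printf("service request #%d\n", fr_fcfs(q, 3));   // picks #1, the row hit
    return 0;
}
</code>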
  
  
===== Lecture 23 (03/27 Fri.) =====

  * Different ways to control interference in DRAM
    * Partitioning of resources
    * Self-optimizing controller
      * Use machine learning to improve the DRAM controller
    * A-DRM
      * Architecture-aware DRAM
  * Multithreading
    * Synchronization
    * Pipelined programs
      * Producer-consumer model
    * Critical path
    * Limiter threads
    * Prioritization between threads
  * Different power modes in DRAM
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
      * Better/more hash functions help eliminate this

===== Lecture 24 (03/30 Mon.) =====

  * Simulation
    * Drawbacks of RTL simulations
      * Time consuming
      * Complex to develop
      * Hard to perform design exploration
    * Explore the design space quickly
    * Match the behavior of existing systems
    * Tradeoffs: speed, accuracy, flexibility
    * High-level simulation vs. detailed simulation
      * High-level simulation is faster, but has lower accuracy
  * Controllers that work on multiple types of cores
    * Design problem: how to find a good scheduling policy on its own?
    * Self-optimizing memory controller: using machine learning
      * Can adapt to the applications
      * The complexity is very high
Line 1015: Line 1032:
       * Execute future instruction to generate accurate prefetches       * Execute future instruction to generate accurate prefetches
       * Allow future data to be in the cache       * Allow future data to be in the cache
 +    * How to support runahead execution?
 +      * Need a way to checkpoing the state when entering runahead mode
 +      * How to make executing in the wrong path useful?
 +      * Need runahead cache to handle load/store in Runahead mode (since they are speculative)
 +
 +

===== Lecture 25 (4/1 Wed.) =====

  * More on runahead execution
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
           * AVD prediction           * AVD prediction
-  ​ 
-===== Lecture 28 (4/9 Wed.) ===== 
- 
- 
   * Questions regarding prefetching   * Questions regarding prefetching
     * What to prefetch     * What to prefetch
Line 1096: Line 1118:
     * Timeliness     * Timeliness
       * How much memory latency can we hide from the prefetch requests       * How much memory latency can we hide from the prefetch requests
 +    * Cache pullition
 +      * How much did the prefetcher cause misses in the demand misses?
    * Cache pollution
      * How many demand misses did the prefetcher cause?
        * Hard to quantify

===== Lecture 26 (4/3 Fri.) =====

Line 1108: Line 1137:
     * Not very efficient (hard to figure out which block is the pointer)     * Not very efficient (hard to figure out which block is the pointer)
       * Software can give hints       * Software can give hints
-     +  * Correlation table 
-===== Lecture 28 (4/14 Mon.) ===== +    * Address correlation
- +
   * Execution based prefetcher   * Execution based prefetcher
     * Helper thread/​speculative thread     * Helper thread/​speculative thread
Line 1119: Line 1146:
     * How do you construct the helper thread     * How do you construct the helper thread
     * Preexecute instruction (one example of how to initialize a speculative thread), slide 9     * Preexecute instruction (one example of how to initialize a speculative thread), slide 9
-  ​Benefit of multiprocessor +    ​Thread-based pre-execution 
-    * Improve performace ​without ​not significantly increase ​power consumption+  * Error tolerance 
 +  * Solution to errors 
  * Solutions to errors
    * Tolerate errors
      * New interface, new design
    * Eliminate or minimize errors
      * New technology, system-wide rethinking
    * Embrace errors
      * Map data that can tolerate errors to error-prone areas
  * Hybrid memory systems
    * Combining multiple memory technologies together
  * Where can emerging technology help?
    * Scalability
    * Lower the cost
    * Energy efficiency
  * Possible solutions to the scaling problem
    * Less leaky DRAM
    * Heterogeneous DRAM (TL-DRAM, etc.)
    * Add more functionality to DRAM
    * Denser design (3D stacking)
    * Different technology
      * NVM
  * Charge vs. resistive memory
    * How is data written?
    * How is data read?
  * Non-volatile memory
    * Resistive memory
      * PCM
        * Inject current to change the phase
        * Scales better than DRAM
          * Multiple bits per cell
            * Wider resistance range
        * No refresh is needed
        * Downsides: latency and write endurance
      * STT-MRAM
        * Inject current to change the polarity
      * Memristor
        * Inject current to change the structure
    * Pros and cons between different technologies
    * Persistency - data stays there even without power
      * Unified memory and storage management (persistent data structures) - single level store
        * Improve energy and performance
        * Simplify the programming model
  * Different design options for DRAM + NVM
    * DRAM as a cache
    * Place some data in DRAM and other data in PCM
      * Based on the characteristics
        * Frequently accessed data that needs lower write latency in DRAM

===== Lecture 27 (4/6 Mon.) =====

  * Flynn's taxonomy
  * Parallelism
    * Reduces power consumption (P ~ CV^2f)
    * Better cost efficiency and easier to scale
    * Improves dependability (in case the other core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement)
  * Loosely coupled multiprocessor
    * No shared global address space
      * Message passing to communicate between different sources
    * Simple to manage memory
  * Tightly coupled multiprocessor
    * Shared global address space
    * Need to ensure consistency of data
    * Programming issues
  * Hardware-based multithreading
    * Coarse grained
      * Switch on event: switch to another context when there is an event (for example, a cache miss)
    * Fine grained
    * Simultaneous: dispatch instructions from multiple threads at the same time
  * Parallel speedup
    * Superlinear speedup
  * Utilization, Redundancy, Efficiency
  * Amdahl's law (see the sketch at the end of this section)
    * Maximum speedup
      * Load balance
        * Some threads have more work, requiring more time to hit the sync. point
  * Critical sections
    * Enforce mutually exclusive access to shared data
  * Issues in parallel programming
    * Correctness
    * Synchronization
    * Consistency
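A small sketch of Amdahl's law: if a fraction p of the work is parallelizable, speedup on n cores is 1 / ((1 - p) + p/n), and the maximum speedup is 1 / (1 - p). The value of p below is invented for illustration.

<code c>
#include <stdio.h>

/* Amdahl's law: the serial fraction (1 - p) bounds the speedup no matter
 * how many cores are added. */
static double amdahl(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}

int main(void) {
    double p = 0.95;   /* assume 95% of the program is parallel */
    int cores[] = {1, 2, 4, 8, 16, 64, 1024};
    for (size_t i = 0; i < sizeof cores / sizeof cores[0]; i++)
        printf("%4d cores -> speedup %6.2f\n", cores[i], amdahl(p, cores[i]));
    printf("upper bound (serial bottleneck): %.2f\n", 1.0 / (1.0 - p));
    return 0;
}
</code>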
  
  
===== Lecture 28 (4/8 Wed.) =====
  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure the shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Dekker's algorithm (see the sketch below)
    * Inconsistency -- the two processors did NOT see the same order of operations to memory
  * Sequential consistency
    * Multiple correct global orders
    * Two issues:
      * Too conservative/strict
      * Performance limiting
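A minimal sketch of Dekker's algorithm for two threads. It is only correct under sequential consistency, so this version uses C11 seq_cst atomics, which make the compiler and CPU insert the fences that plain loads/stores would lack; the iteration count is arbitrary.

<code c>
#include <stdio.h>
#include <pthread.h>
#include <stdatomic.h>

atomic_int flag[2];   /* flag[i] = thread i wants to enter */
atomic_int turn;      /* whose turn it is to back off */
int counter;          /* shared data protected by the critical section */

static void lock(int self) {
    int other = 1 - self;
    atomic_store(&flag[self], 1);
    while (atomic_load(&flag[other])) {
        if (atomic_load(&turn) == other) {
            atomic_store(&flag[self], 0);   /* back off */
            while (atomic_load(&turn) == other)
                ;                           /* wait for our turn */
            atomic_store(&flag[self], 1);
        }
    }
}

static void unlock(int self) {
    atomic_store(&turn, 1 - self);
    atomic_store(&flag[self], 0);
}

static void *worker(void *arg) {
    int self = *(int *)arg;
    for (int i = 0; i < 100000; i++) {
        lock(self);
        counter++;                          /* critical section */
        unlock(self);
    }
    return NULL;
}

int main(void) {
    pthread_t t0, t1;
    int id0 = 0, id1 = 1;
    pthread_create(&t0, NULL, worker, &id0);
    pthread_create(&t1, NULL, worker, &id1);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("counter = %d (expected 200000)\n", counter);
    return 0;
}
</code>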
  * Weak consistency: global ordering only at synchronization points
    * Programmer hints where the synchronizations are
    * Memory fence
    * More burden on the programmers
  * Total store order model: global ordering only with stores
  * Cache coherence
    * Can be done at the software level or hardware level
  * Snoop-based coherence
    * A simple protocol with two states, broadcasting reads/writes on a bus
  * Maintaining coherence
    * Needs to provide 1) write propagation and 2) write serialization
    * Update vs. invalidate
      * For invalidate, only the core that needs to read retains the correct copy
        * Can lead to ping-ponging (tons of reads/writes from several processors)
      * For update, the bus becomes the bottleneck
  * Two cache coherence methods
    * Snoopy bus
      * Bus based, single point of serialization
      * More efficient with a small number of processors
      * Processors snoop other caches' read/write requests to keep the cache block coherent
    * Directory
      * Single point of serialization per block
      * Directory coordinates the coherency
      * More scalable
      * The directory keeps track of where the copies of each block reside
        * Supplies data on a read
        * Invalidates the block on a write
        * Has an exclusive state

===== Lecture 29 (4/13 Mon.) =====
  * MSI coherence protocol
    * The problem: unnecessary broadcasts of invalidations
  * MESI coherence protocol (see the state-machine sketch at the end of this section)
    * Adds the exclusive state (this is the only cache copy, and it is clean) to MSI
    * Multiple invalidation tradeoffs
    * Problem: memory can be unnecessarily updated
    * A possible owner state (MOESI)
  * Tradeoffs between snooping and directory based coherence protocols
    * Slide 31 has a good summary
  * Directory: data structures
    * Bit vectors vs. linked lists
  * Scalability of directories
    * Size? Latency? Thousands of nodes? Best of both snooping and directory?
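A toy MESI state machine for a single cache line, reacting to local reads/writes and to snooped bus requests. Real protocols also move data and issue the bus transactions; this only tracks the state transitions, and the event names are invented.

<code c>
#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } State;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_RDX } Event;  /* processor/bus events */

static State next_state(State s, Event e, int others_have_copy) {
    switch (e) {
    case PR_RD:    /* local read */
        if (s == INVALID) return others_have_copy ? SHARED : EXCLUSIVE;
        return s;
    case PR_WR:    /* local write: from I/S this requires a BusRdX first */
        return MODIFIED;
    case BUS_RD:   /* another cache reads: supply data, downgrade */
        if (s == MODIFIED || s == EXCLUSIVE) return SHARED;
        return s;
    case BUS_RDX:  /* another cache writes: invalidate our copy */
        return INVALID;
    }
    return s;
}

int main(void) {
    const char *name[] = {"I", "S", "E", "M"};
    State s = INVALID;
    s = next_state(s, PR_RD, 0);   printf("read miss, no sharers -> %s\n", name[s]);
    s = next_state(s, PR_WR, 0);   printf("local write           -> %s\n", name[s]);
    s = next_state(s, BUS_RD, 1);  printf("remote read snooped   -> %s\n", name[s]);
    s = next_state(s, BUS_RDX, 1); printf("remote write snooped  -> %s\n", name[s]);
    return 0;
}
</code>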
  
===== Lecture 30 (4/15 Wed.) =====
  * Application slowdown
  * Interference between different applications
    * Applications' performance depends on the other applications that they are running with
  * Predictable performance
    * Why is it important?
    * Applications that need predictability
    * How to predict the performance?
      * What information is useful?
      * What needs to be guaranteed?
      * How to estimate the performance when running with others?
        * Easy: just measure the performance while it is running.
      * How to estimate the performance when the application is running by itself?
        * Hard if there is no profiling.
      * The relationship between memory service rate and performance.
        * Key assumption: applications are memory bound
    * Behavior of memory-bound applications
      * With and without interference
  * Memory phase vs. compute phase
  * MISE (see the sketch below this list)
    * Estimating slowdown using request service rates
    * Inaccuracy when measuring request service rate alone
    * Non-memory-bound applications
    * Control slowdown and provide a soft guarantee
  * Taking the shared cache into account
    * MISE model + cache resource management
    * Auxiliary tag store
      * Separate tag store for different cores
    * Cache access rate alone and shared as the metric to estimate slowdown
  * Cache partitioning
    * How to determine the partitioning
      * Utility-based cache partitioning
      * Others
  * Maximum slowdown and fairness metric
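A back-of-the-envelope sketch of MISE-style slowdown estimation: for a memory-bound application, slowdown is roughly (request service rate when run alone) / (request service rate when sharing). The numbers, and the memory-phase fraction used to generalize to non-memory-bound applications, are invented for illustration.

<code c>
#include <stdio.h>

/* MISE-style slowdown estimate.  The "alone" rate would be sampled by
 * periodically giving the application highest priority at the memory
 * controller; here both rates are just made-up constants. */
int main(void) {
    double rate_alone  = 100.0; /* requests per usec, estimated alone */
    double rate_shared =  40.0; /* requests per usec, measured shared */
    double alpha = 0.8;         /* assumed fraction of time in memory phase */

    double slowdown_membound = rate_alone / rate_shared;
    double slowdown_general  = (1.0 - alpha) + alpha * slowdown_membound;

    printf("memory-bound estimate : %.2fx\n", slowdown_membound);
    printf("with compute phase    : %.2fx\n", slowdown_general);
    return 0;
}
</code>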

===== Lecture 29 (4/18 Wed.) =====
  * Interference
  * Complexity of the memory scheduler
    * Ranking/prioritization has a cost
    * A complex scheduler has higher latency
  * Performance metrics for multicore/multithreaded applications
    * Speedup
    * Slowdown
    * Harmonic vs. weighted
  * Fairness metric
    * Maximum slowdown
      * Why does it make sense?
      * Any scenario where it does not make sense?
  * Predictable performance
    * Why is it important?
      * In a server environment, different jobs are on the same server
      * In a mobile environment, there are multiple sources that can slow down other sources
    * How to relate slowdown with request service rate
    * MISE: soft slowdown guarantee
  * BDI (see the compression sketch below this list)
    * Memory wall
      * What is the concern regarding the memory wall?
    * Size of the cache on the die (CPU die)
    * One possible solution: cache compression
      * What are the problems of existing cache compression mechanisms?
        * Some are too complex
        * Decompression is in the critical path
          * Need to decompress when reading the data -> decompression should not be in the critical path
          * Important factor for performance
    * Software compression is not good enough to compress everything
    * Zero value compression
      * Simple
      * Good compression ratio
      * What if data does not have many zeroes?
    * Frequent value compression
      * Some data appear frequently
      * Simple and good compression ratio
      * Have to profile
      * Decompression is complex
    * Frequent pattern compression
      * Still too complex in terms of decompression
    * Base-delta compression
      * Easy to decompress but retains the benefit of compression
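A toy version of the base-delta idea: a 32-byte line of 32-bit words is stored as one 4-byte base plus 1-byte deltas when every word is close to the base. Decompression is a simple parallel add, which is why it stays off the critical path. Sizes and the example data are invented.

<code c>
#include <stdio.h>
#include <stdint.h>

#define WORDS 8

/* Returns 1 if the line compresses: 32 bytes -> 4 (base) + 8 (deltas). */
static int compress(const uint32_t line[WORDS], uint32_t *base, int8_t delta[WORDS]) {
    *base = line[0];
    for (int i = 0; i < WORDS; i++) {
        int64_t d = (int64_t)line[i] - (int64_t)*base;
        if (d < -128 || d > 127) return 0;   /* deltas don't fit: store raw */
        delta[i] = (int8_t)d;
    }
    return 1;
}

static void decompress(uint32_t base, const int8_t delta[WORDS], uint32_t out[WORDS]) {
    for (int i = 0; i < WORDS; i++)
        out[i] = base + (int32_t)delta[i];   /* one add per word */
}

int main(void) {
    uint32_t line[WORDS] = {0x1000, 0x1004, 0x1008, 0x1010,
                            0x1000, 0x0FF8, 0x1020, 0x1001};
    uint32_t base, out[WORDS];
    int8_t delta[WORDS];
    if (compress(line, &base, delta)) {
        decompress(base, delta, out);
        printf("compressed 32 bytes to 12; out[5] = 0x%X\n", out[5]);
    }
    return 0;
}
</code>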
     ​     ​

===== Lecture 31 (4/28 Mon.) =====
  * Directory-based cache coherence
    * Each directory has to handle validation/invalidation
    * Extra cost of synchronization
    * Need to ensure race conditions are resolved
  * Interconnection
    * Topology
      * Bus
      * Mesh
        * Torus
      * Tree
      * Butterfly
      * Ring
        * Bi-directional ring
          * More scalable
        * Hierarchical ring
          * Even more scalable
          * More complex
      * Crossbar
      * etc.
    * Circuit switching
    * Multistage network
      * Butterfly
      * Delta network
    * Handling contention
      * Buffering vs. dropping/deflection (no buffering)
    * Routing algorithms (see the X-Y routing sketch after this list)
      * Handling deadlock
      * X-Y routing
        * Turn model (to avoid deadlocks)
      * Add more buffering for an escape path
      * Oblivious routing
        * Can take different paths
          * DOR between each intermediate location
        * Balances network load
      * Adaptive routing
        * Uses the state of the network to determine the route
          * Aware of local and/or global congestion
        * Non-minimal adaptive routing can have livelocks
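A minimal sketch of dimension-order (X-Y) routing on a 2D mesh: route fully in X first, then in Y. Because packets never turn from the Y dimension back into X, the turns that could close a cycle are forbidden, which is what avoids deadlock. Coordinates are invented.

<code c>
#include <stdio.h>

typedef struct { int x, y; } Node;

static void route_xy(Node cur, Node dst) {
    printf("(%d,%d)", cur.x, cur.y);
    while (cur.x != dst.x) {               /* X dimension first */
        cur.x += (dst.x > cur.x) ? 1 : -1;
        printf(" -> (%d,%d)", cur.x, cur.y);
    }
    while (cur.y != dst.y) {               /* then Y dimension */
        cur.y += (dst.y > cur.y) ? 1 : -1;
        printf(" -> (%d,%d)", cur.x, cur.y);
    }
    printf("\n");
}

int main(void) {
    Node src = {0, 0}, dst = {3, 2};
    route_xy(src, dst);  /* (0,0) -> ... -> (3,0) -> (3,1) -> (3,2) */
    return 0;
}
</code>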
     ​     ​
  