====== Buzzwords ======

Buzzwords are terms mentioned during lecture that are particularly important to understand thoroughly. This page tracks the buzzwords for each lecture and can be used as a reference for finding gaps in your understanding of the course material.

===== Lecture 1 (1/12 Mon.) =====
  * Level of transformation
    * Algorithm
    * Compiler
  * Cross abstraction layers
  * Tradeoffs
  * Caches
  * DRAM/memory controller
  * DRAM banks
  * Row buffer hit/miss
  * Row buffer locality
  * Unfairness
  * Memory performance hog
  * Shared DRAM memory system
  * Streaming access vs. random access
  * Memory scheduling policies
  * Scheduling priority
  * Retention time of DRAM
  * Process variation
  * Retention time profile
  * Power consumption
  * Bloom filter (see the sketch after this list)
  * Hamming code
  * Hamming distance
  * DRAM row hammer
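Since the Bloom filter is a recurring tool in this course, here is a minimal sketch of one. The bit-array size, the hash count, and the `hashlib`-based hashing are illustration choices, not anything prescribed in lecture.

<code python>
import hashlib

class BloomFilter:
    # Approximate set membership: false positives are possible,
    # false negatives are not.
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k salts.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def maybe_contains(self, item):
        # False is always correct; True might be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("row 0x1A")
print(bf.maybe_contains("row 0x1A"))   # True
print(bf.maybe_contains("row 0x2B"))   # False (barring a false positive)
</code>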
  
===== Lecture 2 (1/14 Wed.) =====

  * Moore's Law
  * Algorithm --> step-by-step procedure to solve a problem
  * in-order execution
  * out-of-order execution
  * Memory wall (a part of scaling issue)
  * Scaling issue
    * Transistors are getting smaller
  * Key components of a computer
  * Design points
    * Design processors to meet the design points
  * Software stack
  * Design decisions
  * Datacenters
  * Reliability problems that cause errors
  * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
    * Pre-paradigm science
    * Normal science
    * Revolutionary science
  * Components of a computer
    * Computation
    * Communication
    * Storage
      * DRAM
      * NVRAM (Non-volatile memory): PCM, STT-MRAM
      * Storage (Flash/Harddrive)
  * Von Neumann Model (Control flow model)
    * Stored program computer
      * Properties of Von Neumann Model: Stored program, sequential instruction processing
      * Unified memory
        * When is an instruction interpreted as an instruction (as opposed to a datum)?
      * Program counter
      * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
  * Data flow model
    * Data flow machine
    * Operands
    * Live-outs/Live-ins
    * Different types of data flow nodes (conditional/relational/barrier)
    * How to do transactions in dataflow?
      * Example: bank transactions
  * Tradeoffs between control-driven and data-driven
    * Which is easier to program?
    * Which is easier to compile?
    * Which is more parallel? (does that mean it is faster?)
    * Which machines are more complex to design?
    * In control flow, when a program stops, there is a pointer to the current state (precise state).
  * ISA vs. Microarchitecture
    * Semantics in the ISA
    * uArch should obey the ISA
    * Changing the ISA is costly; it can affect compatibility.
  * Instruction pointers
  * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
      * Out-of-order executions
      * etc.
    * Design techniques
      * Adder implementation (bit serial, ripple carry, carry lookahead)
      * Connection Machine (an example of a machine that uses bit-serial computation to trade off latency for more parallelism)
  * Microprocessor: ISA + uArch + circuits
  * What is part of the ISA? Instructions, memory, etc.

===== Lecture 3 (1/16 Fri.) =====
  
  * Microarchitecture
  * Three major tradeoffs of computer architecture
  * Macro-architecture
  * LC-3b ISA
  * Unused instructions
  * Bit steering
  * Instruction processing style
  * 0,1,2,3-address machines
    * Stack machine
    * Accumulator machine
    * 2-operand machine
    * 3-operand machine
    * Tradeoffs between 0,1,2,3-address machines
  * Postfix notation (see the stack-machine sketch after this list)
  * Instructions/Opcode/Operand specifiers (i.e. addressing modes)
  * Simple vs. complex data types (and their tradeoffs)
  * Semantic gap and level
  * Translation layer
  * Addressability
  * Byte/bit addressable machines
  * Virtual memory
  * Big/little endian
  * Benefits of having registers (data locality)
  * Programmer visible (Architectural) state
    * Programmers can access this directly
    * What are the benefits?
  * Microarchitectural state
    * Programmers cannot access this directly
  * Evolution of registers (from accumulators to registers)
  * Different types of instructions
    * Control instructions
    * Data instructions
    * Operation instructions
  * Addressing modes
    * Tradeoffs (complexity, flexibility, etc.)
  * Orthogonal ISA
    * Addressing modes that are orthogonal to instruction types
  * I/O devices
  * Vectored vs. non-vectored interrupts
  * Complex vs. simple instructions
    * Tradeoffs
  * RISC vs. CISC
    * Tradeoffs
    * Backward compatibility
    * Performance
    * Optimization opportunity
  * Translation
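To make the 0-address (stack) machine and postfix notation concrete, here is a toy sketch. The instruction names (`push`, `add`, `mul`) are made up for illustration; the point is that operands are implicit on the stack, so instructions need no operand specifiers.

<code python>
def run_stack_machine(program):
    # Evaluate a program for a toy 0-address (stack) machine.
    # Every operation works on the top of the stack.
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4 in postfix: 2 3 + 4 *
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run_stack_machine(program))   # 20
</code>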
  
===== Lecture 4 (1/21 Wed.) =====

  * Fixed vs. variable length instructions
  * Huffman encoding
  * Uniform vs. non-uniform decode
  * Registers
    * Tradeoffs between the number of registers
  * Alignment
    * How does MIPS load words across the alignment boundary? (see the sketch below)
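On the alignment question: base MIPS `lw` traps on an unaligned address, and an unaligned word is instead assembled from the two aligned words that straddle it (in hardware, the `lwl`/`lwr` pair does this). A rough sketch of the byte-splicing idea, assuming a toy little-endian byte-addressable memory:

<code python>
def load_unaligned_word(mem, addr):
    # Assemble a 4-byte word at an arbitrary address from the two
    # aligned words that contain it (the lwl/lwr idea). In hardware
    # this is two aligned accesses; one slice stands in for them here.
    base = addr & ~0x3                 # address of the first aligned word
    two_words = mem[base:base + 8]     # the two aligned words
    offset = addr & 0x3
    return two_words[offset:offset + 4]

mem = bytes(range(16))                        # toy memory: 0x00, 0x01, ...
print(load_unaligned_word(mem, 6).hex())      # '06070809'
</code>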
  
===== Lecture 5 (1/26 Mon.) =====
  
  * Tradeoffs in ISA: Instruction length
    * Uniform vs. non-uniform
  * Design point/Use cases
    * What dictates the design point?
  * Architectural states
  * uArch
    * How to implement the ISA in the uArch
  * Different stages in the uArch
  * Clock cycles
  * Multi-cycle machine
  * Datapath and control logic
    * Control signals
  * Execution time of instructions/programs
    * Metrics and what they mean (see the worked example after this list)
  * Instruction processing
    * Fetch
    * Memory fetch
    * Writeback
  * Encoding and semantics
  * Different types of instructions (I-type, R-type, etc.)
  * Control flow instructions
  * Critical path analysis
    * Critical path of a single cycle processor
  * What is in the control signals?
    * Combinational logic & Sequential logic
  * Control store
  * Tradeoffs of a single cycle uarch
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
    * Dynamic power/Static power
      * Increases in power due to frequency
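For the execution-time metrics, the relation the lecture builds on is: execution time = instruction count x CPI x clock period. A quick worked example, with made-up numbers:

<code python>
# Iron law of performance: time = instruction count x CPI x clock period.
inst_count = 1_000_000        # instructions executed (made-up)
cpi = 1.2                     # average cycles per instruction (made-up)
clock_hz = 2_000_000_000      # 2 GHz clock

cycles = inst_count * cpi
time_s = cycles / clock_hz
print(f"{cycles:.0f} cycles -> {time_s * 1e3:.3f} ms")
# 1200000 cycles -> 0.600 ms
</code>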
  
===== Lecture 6 (1/28 Wed.) =====
  
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
  * Multi cycle design
  * Microcoded/Microprogrammed machines
    * States
    * Translation from one state to another
    * Microinstructions
    * Microsequencing
  * Variable latency memory
  * Handling interrupts
  * Difference between interrupts and exceptions
  * Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives

===== Lecture 7 (1/30 Fri.) =====

  * Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives
  * Nanocode and millicode
    * What are the differences between nano/milli/microcode?
  * Microprogrammed vs. hardwired control
  * Pipelining
  * Limitations of the multi-programmed design
    * Idle resources
  * Throughput of a pipelined design
    * What dictates the throughput of a pipelined design?
  * Latency of the pipelined design
  * Dependency
    * Pipeline flush
    * Speculation

===== Lecture 8 (2/2 Mon.) =====

  * Interlocking
  * Multipath execution
  * No-op (Bubbles in the pipeline)
  * Valid bits in the instructions
  * Branch prediction
  * Different types of data dependence
    * Based on the profile info, position the code accordingly
    * Try to make the next sequential instruction be the next inst. to be executed
  * Predicate combining (combine predicates for a branch instruction)
  * Predicated execution (control dependence becomes data dependence)

===== Lecture 9 (2/4 Wed.) =====

  * Definition of basic blocks
  * Control flow graph
  * Delayed branching
    * Benefit?
    * Inst. from different threads can fill in the bubbles
    * Cost?
  * Simultaneous multithreading
  * Branch prediction
    * Guess what to fetch next.
      * Given the program/number of instructions, percent of branches, branch prediction accuracy, and penalty cost, how to compute the cost coming from branch mispredictions (see the worked example after this list)
        * How many extra instructions are being fetched?
        * What is the performance degradation?
    * How to reduce the miss penalty?
    * Predicting the next address (non PC+4 address)
    * Global branch history - for directions
    * Can use compiler to profile and get more info
      * Input set dictates the accuracy
      * Adds time to compilation
    * Heuristics that are common and don't require profiling
      * Might be inaccurate
      * Does not require profiling
    * Static branch prediction
      * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch
        * For example, x86 has the hint bit
    * Dynamic branch prediction
        * One more bit for hysteresis
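The misprediction-cost exercise above becomes concrete with a few made-up numbers; only the arithmetic matters here:

<code python>
# Cost of branch mispredictions, with made-up parameters:
insts = 1_000_000_000       # total instructions
branch_frac = 0.20          # 20% of instructions are branches
accuracy = 0.98             # branch prediction accuracy
penalty = 20                # flush penalty (cycles per misprediction)
base_cpi = 1.0              # CPI if every prediction were correct

branches = insts * branch_frac
mispredictions = branches * (1 - accuracy)
extra_cycles = mispredictions * penalty
cpi = base_cpi + extra_cycles / insts
print(f"{mispredictions:.0f} mispredictions, CPI {base_cpi} -> {cpi:.2f}")
# 4000000 mispredictions, CPI 1.0 -> 1.08
</code>

Rerunning with accuracy = 0.99 halves the overhead term (0.08 -> 0.04 CPI), which is why the 98%-vs-99% distinction in Lecture 10 matters.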
  
===== Lecture 10 (2/6 Fri.) =====

  * Branch prediction accuracy
    * Why is it so important?
      * Differences between 99% accuracy and 98% accuracy
      * Cost of a misprediction when the pipeline is very deep
  * Global branch correlation
    * Some branches are correlated
    * What information is used?
  * Two-level branch prediction
    * What entries do you keep in the global history?
    * What entries do you keep in the local history?
    * How many tables?
    * Store both the GHR and PC as one combined piece of information (gshare; see the sketch after this list)
    * How do you use the information? Why is the XOR result still usable?
  * Warmup cost of the branch predictor
    * Hybrid solution? Fast warmup is used first, then switch to the slower one.
  * Tournament predictor (Alpha 21264)
  * Predicated execution - eliminate branches
    * What are the tradeoffs?
    * What if the block is big? (can lead to executing a lot of useless work)
    * Allows easier code optimization
      * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
      * Use branch prediction on easy-to-predict code
      * Use predicated execution on hard-to-predict code
      * Compiler can be more aggressive in optimizing the code
    * What are the tradeoffs? (slide #47)
  * Multi-path execution
    * Execute both paths
    * Can lead to wasted work
  * VLIW
  * Superscalar
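A minimal sketch of the gshare flavor of two-level prediction mentioned above: the global history register (GHR) is XORed with PC bits to index one table of 2-bit saturating counters (the second bit is the hysteresis from Lecture 9). The table size, the `pc >> 2` alignment shift, and the training loop are illustration choices.

<code python>
class GsharePredictor:
    # gshare: index = (PC bits XOR global history), one counter table.
    # The XOR folds both correlation sources into a single index.
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # 2-bit, weakly not-taken
        self.ghr = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # taken if 2 or 3

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

p = GsharePredictor()
for _ in range(16):                  # a run of taken outcomes at one branch
    p.update(0x4000, taken=True)
print(p.predict(0x4000))             # True once the history saturates
</code>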
  
  
===== Lecture 11 (2/11 Wed.) =====
  
  * Geometric GHR length for branch prediction
  * Perceptron branch predictor
  * Multi-cycle executions (different functional units take different numbers of cycles)
    * Instructions can retire out-of-order
  * Exceptions and Interrupts
    * When are they handled?
    * Why should some interrupts be handled right away?
  * Precise exceptions
    * arch. state should be consistent before handling the exception/interrupts
      * Easier to restart the processes
    * How to ensure precise exceptions?
    * Tradeoffs between each method
  * Reorder buffer
    * Reorder results before they are visible to the arch. state
      * Need to preserve the sequential semantics and data
    * What information is in the ROB entry?
    * Where to get the value from (forwarding path? reorder buffer?)
      * Extra logic to check where the youngest instruction/value is
      * Content addressable search (CAM)
        * A lot of comparators
    * Different ways to simplify the reorder buffer
  * Future file (commonly used, along with reorder buffer)
    * Keep two sets of register files
      * An updated value (speculative), called the future file
      * A backup value (to restore the state quickly)
    * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to ROB alone)
  * Branch misprediction resembles an exception
    * The difference is that branch misprediction is not visible to the software
  * Checkpointing
    * Advantages?

===== Lecture 12 (2/13 Fri.) =====

  * Renaming
  * Register renaming table
  * Predictors (branch predictor, cache line predictor ...)
  * Power budget (and its importance)
      * Slide 28 --> register renaming
      * Slides 30-35 --> Exercise (also on the board)
        * This will be useful for the midterm
    * Register aliasing table
    * Broadcasting tags
    * Using dataflow
  
===== Lecture 13 (2/16 Mon.) =====
  
  * OoO --> Restricted Dataflow
      * Dispatch width
      * Parallelism in the program
    * What does it mean to be restricted data flow?
      * Still visible as a Von Neumann model
    * Where does the efficiency come from?
    * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * Would like to schedule them out of order, but make them visible in-order
    * This is one of the most complex structures in load/store handling
    * What information can be used to predict these load/store optimizations?
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle a misprediction/recovery
    * OoO + branch prediction?
    * Speculatively update the history register
      * When do you update the GHR?
  * Token dataflow arch.
    * What are tokens?
    * Difficulties?

===== Lecture 14 (2/18 Wed.) =====
  
  * SISD/SIMD/MISD/MIMD
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
    * What do the functional units look like? Deep pipelines and simpler control.
    * CRAY-I is one example of a vector processor
    * Memory access pattern in a vector processor
      * How do the memory accesses benefit the memory bandwidth?
      * Memory level parallelism
      * Stride length vs. the number of banks
        * Stride length should be relatively prime to the number of banks (see the sketch after this list)
    * What if there are multiple memory ports?
    * Gather/Scatter allows vector processors to be a lot more programmable (i.e. gather data for parallelism)
      * Helps handle sparse matrices
    * Conditional operations
    * Structure of vector units
    * Intel SSE --> Modern version of MMX
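A tiny computation showing why the stride should be relatively prime to the number of banks; bank = address mod number-of-banks is an assumed, simplified mapping:

<code python>
# Which banks does a strided access stream touch?
def banks_touched(stride, num_banks, accesses=16):
    return sorted({(i * stride) % num_banks for i in range(accesses)})

print(banks_touched(stride=8, num_banks=8))  # [0]: every access conflicts
print(banks_touched(stride=7, num_banks=8))  # [0..7]: spread over all banks
</code>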
  
===== Lecture 15 (2/20 Fri.) =====

  * GPU
    * Warp/Wavefront
    * Lower SIMD efficiency
    * What if you have layers of branches? (see the utilization sketch after this list)
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
    * How to steer the instructions (determine dependency/stalling)?
    * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
  * Stage processing
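A back-of-the-envelope sketch of SIMD utilization under branch divergence, assuming a 32-lane warp and one instruction on each side of the branch (both made-up simplifications):

<code python>
# After a divergent branch, both paths issue serially, each with
# only its own lanes active.
warp_size = 32
taken_lanes = 20                       # made-up: 20 lanes take the branch

useful = warp_size + taken_lanes + (warp_size - taken_lanes)
issued = warp_size * 3                 # 3 full-width instruction issues
print(f"SIMD utilization: {useful / issued:.0%}")   # 67%
</code>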

===== Lecture 16 (2/23 Mon.) =====

  * Systolic arrays
    * Processing elements transform data in chains
    * Can be arrays of multi-dimensional processing elements
    * Developed for image processing (for example, convolution)
    * Can be used to break stages in pipelined programs, using a set of queues and processing elements
    * Can enable high concurrency; good for regular programs
    * Very special purpose
    * The WARP computer
  * Static instruction scheduling
    * How do we find the next instruction to execute?
  * Live-ins and live-outs
  * Basic blocks
    * Rearranging instructions in the basic block
    * Code movement from one basic block to another
  * Straight line code
  * Independent instructions
    * How to identify independent instructions
  * Atomicity
  * Trace scheduling
    * Side entrances
    * Fix-up code
    * How scheduling is done
  * Instruction scheduling
    * Prioritization heuristics
  * Superblock
    * Traces with no side entrances
  * Hyperblock
  * BS-ISA
  * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA

===== Lecture 17 (2/25 Wed.) =====

  * IA-64
    * EPIC
  * IA-64 instruction bundle
    * Multiple instructions in the bundle along with the template bits
    * Template bits
    * Stop bits
    * Non-faulting loads and exception propagation
  * Aggressive ST-LD reordering
  * Physical memory system
  * Ideal pipelines
  * Ideal cache
    * More capacity
  * DRAM cell
    * Cheap
    * Senses the perturbation through a sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
  * Memory bank
    * Read access sequence
    * DRAM: Activate -> Read -> Precharge (if needed) (see the row-buffer sketch after this list)
    * What dominates the access latency for DRAM and SRAM?
  * Scaling issue
    * Hard to scale the cell down to smaller sizes
  * Cost and benefit of having more associativity
    * Given the associativity, which block should be replaced if it is full?
    * Replacement policy
      * Random
      * Least recently used (LRU)
    * Approximate LRU
      * Victim and next-Victim policy
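To connect the Activate/Read/Precharge sequence with row buffer hits and conflicts, here is a toy open-row-policy model; the request stream is made up, and the very first access is counted as a conflict for simplicity:

<code python>
def service(row_stream):
    # Open-row policy: leave the last-accessed row in the row buffer.
    open_row, hits, conflicts = None, 0, 0
    for row in row_stream:
        if row == open_row:
            hits += 1          # row buffer hit: READ only
        else:
            conflicts += 1     # PRECHARGE + ACTIVATE + READ
            open_row = row
    return hits, conflicts

# Streaming accesses hit; alternating rows conflict every time.
print(service([3, 3, 3, 7, 7, 3]))   # (3, 3)
</code>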

===== Lecture 18 (2/27 Fri.) =====

  * Tag store and data store
  * Cache hit rate
  * Average memory access time (AMAT) (see the worked example at the end of this section)
  * AMAT vs. stall time
  * Cache basics
    * Direct-mapped vs. associative cache
    * Set/block (line)/placement/replacement
    * How do tag and index get used?
  * Full associativity
  * Set associative cache
    * Insertion, promotion, eviction (replacement)
  * Various replacement policies
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly
    * Approximate LRU
      * Victim and next-Victim policy
  * Set thrashing
    * Working set is bigger than the associativity
    * Is this optimal?
    * Complexity?
  * DRAM as a cache for disk
  * Handling writes
    * Write through
    * In the first level cache
  * Cache access
    * First level access
    * Second level access
      * When to start the second level access
  * Cache performance
    * Capacity
    * Block size
    * Associativity
  * Classification of cache misses
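A worked AMAT example with made-up numbers:

<code python>
# AMAT = hit time + miss rate x miss penalty
hit_time = 2        # cycles (first-level access)
miss_rate = 0.05    # 5% of accesses miss
miss_penalty = 100  # cycles to fetch from the next level

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat} cycles")   # AMAT = 7.0 cycles
</code>

Note the AMAT vs. stall time item above: on an out-of-order core, part of the miss penalty overlaps with useful work, so AMAT does not translate directly into stall cycles.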

===== Lecture 19 (03/02 Mon.) =====

  * Subblocks
  * Victim cache
  * MSHR
    * What information goes into the MSHR? (see the sketch after this list)
    * When do you access the MSHR?
  * Memory banks
  * Shared caches in multi-core processors
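A sketch of the MSHR idea: one entry per outstanding miss, keyed by block address, so later misses to the same block merge instead of issuing duplicate memory requests. The structure and names here are made up for illustration:

<code python>
mshrs = {}   # block address -> list of waiting requesters

def on_miss(block_addr, requester):
    if block_addr in mshrs:
        mshrs[block_addr].append(requester)   # secondary miss: merge
        return "merged"
    mshrs[block_addr] = [requester]           # primary miss: allocate entry
    return "request sent to memory"

def on_fill(block_addr):
    # Data came back: wake every requester waiting on this block.
    return mshrs.pop(block_addr)

print(on_miss(0x40, "ld r1"))   # request sent to memory
print(on_miss(0x40, "ld r2"))   # merged
print(on_fill(0x40))            # ['ld r1', 'ld r2']
</code>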

===== Lecture 20 (03/04 Wed.) =====

  * Virtual vs. physical memory
  * The system's management of memory
    * Benefits
  * Problem: physical memory has limited size
  * Mechanisms: indirection, virtual addresses, and translation
  * Demand paging
  * Physical memory as a cache
  * Tasks of system SW for VM
  * Serving a page fault
  * Address translation (see the sketch after this list)
  * Page table
    * PTE (page table entry)
  * Page replacement algorithms
    * CLOCK algo.
  * Inverted page table
  * Page size trade-offs
  * Protection
  * Multi-level page tables
  * x86 implementation of page tables
  * TLB
    * Handling misses
  * When to do address translation?
  * Homonyms and Synonyms
    * Homonym: the same VA maps to different PAs across multiple processes
    * Synonym: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
  * Virtually indexed vs. physically indexed caches
  * Virtually tagged vs. physically tagged caches
  * Virtually indexed, physically tagged caches
  * Can these create problems when we have a cache?
  * How to eliminate these problems?
  * Page coloring
  * Interaction between cache and TLB
  
===== Lecture 21 (03/23 Mon.) =====
  
  * DRAM scaling problem
  * Demands/trends affecting the main memory
    * More capacity
    * Low energy
    * More bandwidth
    * QoS
  * ECC in DRAM
  * Multi-porting
    * Virtual multi-porting
    * True multi-porting
  * Multiple cache copies
  * Alignment
  * Banking
    * Can have bank conflicts
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  * Bank mapping
    * How to avoid bank conflicts? (see the interleaving sketch at the end of this section)
  * Channel mapping
    * Address mapping to minimize bank conflicts
    * Page coloring
      * Virtual-to-physical mapping that can help reduce conflicts
  * Accessing DRAM
    * Row bits
    * Addressability
    * DRAM has its own clock
    * Sense amplifiers
    * Bit lines
    * Word lines
  * DRAM (1T) vs. SRAM (6T)
    * Cost
  * DRAM Channel
    * An interface to DRAM, each with its own ranks/banks
  * DRAM Chip
  * DIMM
    * More DIMMs add interconnect complexity
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
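A sketch of how the address-to-bank mapping determines conflicts, using a made-up geometry of 4 banks, 64 B cache blocks, and 8 low-order row bits; row vs. cache-block interleaving comes up again in Lecture 22:

<code python>
BLOCK_BITS = 6          # 64 B cache blocks
ROW_LOW_BITS = 8        # made-up number of low-order row bits
NUM_BANKS = 4

def bank_block_interleaved(addr):
    # Consecutive cache blocks map to consecutive banks.
    return (addr >> BLOCK_BITS) % NUM_BANKS

def bank_row_interleaved(addr):
    # A whole row's worth of data stays in one bank.
    return (addr >> (BLOCK_BITS + ROW_LOW_BITS)) % NUM_BANKS

stream = [i * 64 for i in range(8)]                  # sequential blocks
print([bank_block_interleaved(a) for a in stream])   # [0, 1, 2, 3, 0, 1, 2, 3]
print([bank_row_interleaved(a) for a in stream])     # [0, 0, ..., 0]: one busy bank
</code>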
  
===== Lecture 22 (03/25 Wed.) =====

  * Flash controller
  * Flash memory
  * Garbage collection in flash
  * Overhead in flash memory
    * Erase (off the critical path, but takes a long time)
  * Different types of DRAM
  * DRAM design choices
    * Cost/density/latency/BW/yield
    * Benefit of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sending DRAM commands
    * Periodically send commands to refresh DRAM cells
    * Ensure correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determine the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS (see the sketch after this list)
      * Try to maximize row buffer hit rate
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
    * PAR-BS
      * Key benefits
      * Stall time
      * Shortest job first
    * STFM
    * ATLAS
    * TCM
      * Key benefits
      * Configurability
      * Fairness + performance at the same time
      * Robustness issues
  * Open row policy
  * Closed row policy
  * QoS
    * QoS issues in memory scheduling
    * Fairness
    * Performance guarantees
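A minimal sketch of the FR-FCFS decision rule (row hits first, then oldest first); the queue format is made up. Its bias toward row hits is exactly what creates the unfairness that Capped FR-FCFS, STFM, PAR-BS, ATLAS, and TCM address.

<code python>
def pick_fr_fcfs(queue, open_row):
    # queue: list of (arrival_time, row); lower arrival_time = older.
    hits = [req for req in queue if req[1] == open_row]  # row-buffer hits
    candidates = hits if hits else queue
    return min(candidates, key=lambda req: req[0])       # oldest first

queue = [(0, 5), (1, 9), (2, 9), (3, 5)]
print(pick_fr_fcfs(queue, open_row=9))   # (1, 9): a younger row hit wins
</code>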
-      +
-      +
-===== Lecture 25 (2/2 Wed.) =====+
  
  
- 
-  * Latency sensitivity 
-    * Performance drops a lot when the memory request latency is long 
-  * TCM 
-    * Tradeoff between throughput and fairness 
-    * Latency sensitive cluster (non-intensive cluster) 
-      * Ranking based on memory intensity 
-    * Bandwidth intensive cluster 
-      * Round robin within the cluster 
-    * Generally latency sensitive cluster has more priority 
-    * Provide robust fairness vs. throughput 
-    * Complexity of TCM? 
-  * Different ways to control interference in DRAM 
-    * Partitioning of resource 
-      * Channel partitioning:​ map applications that interfere with each other in a different channel 
-        * Keep track of application'​s characteristics 
-        * Dedicate a channel might waste the bandwidth 
-        * Need OS support to determine the channel bits 
-    * Source throttling 
-      * A controller throttle the core depends on the performance target 
-      * Example: Fairness via source throttling 
-        * Detect unfairness and throttle application that is interfering 
-        * How do you estimate slowdown? 
-        * Threshold based solution: hard to configure 
-    * App/thread scheduling 
-      * Critical threads usually stall the progress 
-    * Designing DRAM controller 
-      * Has to handle the normal DRAM operations 
-        * Read/​write/​refresh/​all the timing constraints 
-      * Keep track of resources 
-      * Assign priorities to different requests 
-      * Manage requests to banks 
-    * Self-optimizing controller 
-      * Use machine learning to improve DRAM controller 
-  * DRAM Refresh 
-    * Why does DRAM has to refresh every 64ms 
-    * Banks are unavailable during refresh 
-      * LPDDR mitigate this by using a per-bank refresh 
-    * Has to spend longer time with bigger DRAM 
-    * Distributed refresh: stagger refresh every 64 ms in a distributed manner 
-      * As oppose to burst refresh (long pause time) 
-  * RAIDR: Reduce DRAM refresh by profiling and binning 
-    * Some row do not have to be refresh very frequently 
-      * Profile the row 
-        * High temperature changes the retention time: need online profiling 
-  * Bloom filter 
-    * Represent set membership 
-    * Approximated 
-    * Can contain false positive 
-      * Better/more hash function helps eliminate this 
   ​   ​
-  ​ 
-  ​ 
-===== Lecture 26 (4/7 Mon.) ===== 
- 
- 
- 
-  * Tolerate latency can be costly 
-    * Instruction window is complex 
-      * Benefit also diminishes 
-    * Designing the buffers can be complex 
-    * A simpler way to tolerate out of order is desirable 
-  * Different sources that cause the core to stall in OoO 
-    * Cache miss 
-    * Note that stall happens if the inst. window is full 
-  * Scaling instruction window size is hard 
-    * It is better (less complex) to make the windows more efficient 
-  * Runahead execution 
-    * Try to optain MLP w/o increasing instruction windows 
-    * Runahead (i.e. execute ahead) when there is a long memory instruction 
-      * Long memory instruction stall processor for a while anyways, so it's better to make use out of it 
-      * Execute future instruction to generate accurate prefetches 
-      * Allow future data to be in the cache 
-    * How to support runahead execution? 
-      * Need a way to checkpoing the state when entering runahead mode 
-      * How to make executing in the wrong path useful? 
-      * Need runahead cache to handle load/store in Runahead mode (since they are speculative) 
-    * Cost and benefit of runahead execution (slide number 27) 
-    * Runahead can have inefficiency 
-      * Runahead period that are useless 
-        * Get rid of useless inefficient period 
-    * What if there is a dependent cache miss 
-      * Cannot be paralellized in a vanilla runahead 
-      * Can predict the value of the dependent load 
-        * How to predict the address of the load 
-          * Delta value information 
-          * Stride predictor 
-          * AVD prediction 
-  ​ 
-===== Lecture 28 (4/9 Wed.) ===== 
- 
- 
-  * Questions regarding prefetching 
-    * What to prefetch 
-    * When to prefetch 
-    * how do we prefetch 
-    * where to prefetch from 
-  * Prefetching can cause thrasing (evict a useful block) 
-  * Prefetching can also be useless (not being used) 
-    * Need to be efficient 
-  * Can cause memory bandwidth problem in GPU 
-  * Prefetch the whole block, more than one block, or subblock? 
-    * Each one of them has pros and cons 
-    * Big prefetch is more likely to waste bandwidth 
-    * Commonly done in a cache block granularity 
-  * Prefetch accuracy: fraction of useful prefetches out of all the prefetches 
-  * Prefetcher usually predict based on 
-    * Past knowledge 
-    * Compiler hints 
-  * Prefetcher has to prefetch at the right time 
-    * Prefetch that is too early might get evicted 
-      * It might also evict other useful data 
-    * Prefetch too late does not hide the whole memory latency 
-  * Previous prefetches at the same PC can be used as the history 
-  * Previous demand requests also is a good information to use for prefetches 
-  * Prefetch buffer 
-    * Place the prefetch data to avoid thrashing 
-      * Can treat demand/​prefetch requests separately 
-      * More complex 
-  * Generally, demand block is more important 
-    * This means eviction should prefer prefetch block as oppose to demand block 
-  * Tradeoffs between where do we place the prefetcher 
-    * Look at L1 hits and misses 
-    * Look at L1 misses only 
-    * Look at L2 misses 
-    * Different access pattern affect accuracy 
-      * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss) 
-  * Software vs. hardware vs. execution based prefetching 
-    * Software: ISA previde prefetch instructions,​ software utilize it 
-      * What information are useful 
-      * How to make sure the prefetch is timely 
-      * What if you have a pointer based structure 
-        * Not easy to prefetch pointer chasing (because in many case the work between prefetches is short, so you cannot predict the next one timely enough) 
-          * Can be solved by hinting the nextnext and/or nextnextnext address 
-    * Hardware: Identify the pattern and prefetch 
-    * Execution driven: Oppotunistically try to prefetch (runahead, dual-core execution) 
-  * Stride prefetcher 
-    * Predict strides, which is common in many programs 
-    * Cache block based or instruction based 
-  * Stream buffer design 
-    * Buffer the stream of accesses (next address) ​ 
-    * Use the information to prefetch 
-  * What affect prefetcher performance 
-    * Prefetch distance 
-      * How far ahead should we prefetch 
-    * Prefetch degree 
-      * How many prefetches do we prefetch 
-  * Prefetcher performance 
-    * Coverage 
-      * Out of the demand requests, how many are actually from the prefetch request 
-    * Accuracy 
-      * Out of all the prefetch requests, how many are actually getting used 
-    * Timeliness 
-      * How much memory latency can we hide from the prefetch requests 
-  * Feedback directed prefetcher 
-    * Use the result of the prefetcher as a feedback to the prefetcher 
-      * with accuracy, timeliness, polluting information 
-  * Markov prefetcher 
-    * Prefetch based on the previous history 
-    * Use markov model to predict 
-    * Pros: Can cover arbitary pattern (easy for link list traversal or trees) 
-    * Downside: High cost, cannot help with compulsory misses (no history) 
-  * Content directed prefetching 
-    * Indentify the content in memory for pointers (which is used as the address to prefetch 
-    * Not very efficient (hard to figure out which block is the pointer) 
-      * Software can give hints 
-    ​ 
-===== Lecture 28 (4/14 Mon.) ===== 
- 
- 
-  * Execution based prefetcher 
-    * Helper thread/​speculative thread 
-      * Use another thread to pre-execute a program 
-    * Can be a software based or hardware based 
-    * Discover misses before the main program (to prefetch data in a timely manner) 
-    * How do you construct the helper thread 
-    * Preexecute instruction (one example of how to initialize a speculative thread), slide 9 
-  * Benefit of multiprocessor 
-    * Improve performace without not significantly increase power consumption 
-    * Better cost efficiency and easier to scale 
-    * Improve dependability (in case the other core is faulty 
-  * Different types of parallelism 
-    * Instruction level parallelism 
-    * Data level parallelism 
-    * Task level parallelism 
-  * Task-level parallelism
-    * Partition a single, potentially big, task into multiple parallel sub-tasks
-      * Can be done explicitly (parallel programming by the programmer; see the sketch below)
-      * Or implicitly (hardware partitions a single thread speculatively)
-    * Or, run multiple independent tasks (still improves throughput, but no single task finishes faster; also simpler to implement)
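-    * A minimal sketch of explicit task-level parallelism with pthreads, partitioning one big task (an array sum) into per-thread sub-tasks; the example itself is assumed, not from the lecture:
-<code c>
-#include <pthread.h>
-#include <stdio.h>
-
-#define N        1000000
-#define NTHREADS 4
-
-static long data[N];
-static long partial[NTHREADS];
-
-static void *worker(void *arg)
-{
-    long id = (long)arg;                 /* which sub-task am I?     */
-    long lo = id * (N / NTHREADS);
-    long hi = (id + 1) * (N / NTHREADS);
-    for (long i = lo; i < hi; i++)
-        partial[id] += data[i];          /* private partial sum      */
-    return NULL;
-}
-
-int main(void)
-{
-    pthread_t t[NTHREADS];
-    for (long i = 0; i < N; i++) data[i] = 1;
-
-    for (long i = 0; i < NTHREADS; i++)
-        pthread_create(&t[i], NULL, worker, (void *)i);
-
-    long sum = 0;
-    for (long i = 0; i < NTHREADS; i++) {
-        pthread_join(t[i], NULL);        /* synchronization point    */
-        sum += partial[i];
-    }
-    printf("sum = %ld\n", sum);          /* prints sum = 1000000     */
-    return 0;
-}
-</code>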
-  * Loosely coupled multiprocessors
-    * No shared global address space
-      * Message passing to communicate between processors
-    * Simple to manage memory
-  * Tightly coupled multiprocessors
-    * Shared global address space
-    * Need to ensure consistency of data
-  * Switch-on-event multithreading
-    * Switch to another context when there is an event (for example, a cache miss)
-  * Simultaneous multithreading
-    * Dispatch instructions from multiple threads in the same cycle
-  * Amdahl's law (see the worked equation below)
-    * Maximum speedup
-    * The parallel portion is not perfect
-      * Serial bottleneck
-      * Synchronization cost
-      * Load imbalance
-        * Some threads have more work and take longer to reach the synchronization point
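-    * The standard form of Amdahl's law, where p is the parallel fraction and N the number of processors; the numbers below are an illustrative example, not from the slides:
-<code latex>
-\text{Speedup}(N) = \frac{1}{(1 - p) + p/N}
-
-% Example: p = 0.9 (90% parallel), N = 16 processors:
-% Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4, far below 16.
-</code>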
-  * Issues in parallel programming
-    * Correctness
-    * Synchronization
-    * Consistency
-===== Lecture 29 (4/16 Wed.) ===== 
- 
- 
-  * Ordering of instructions
-    * Maintaining memory consistency when there are multiple threads and shared memory
-    * Need to ensure the program's semantics are not changed
-    * Making sure shared data is properly locked when used
-      * Support mutual exclusion
-    * Observed ordering depends on how each processor's execution interleaves
-    * Debugging is also difficult (non-deterministic behavior)
-  * Weak consistency: global ordering only at synchronization points (see the sketch below)
-    * The programmer hints where the synchronization points are
-  * Total store order (TSO) model: a global ordering is enforced only among stores
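-  * A minimal illustration (an assumed example, not from the slides) of why ordering matters: under a weak model, without fences thread 2 can observe flag == 1 yet still read the old value of data:
-<code c>
-int data = 0;
-int flag = 0;
-
-void thread1(void)          /* producer */
-{
-    data = 42;
-    /* fence needed here under weak ordering:
-     * make the data store visible before the flag store */
-    flag = 1;
-}
-
-void thread2(void)          /* consumer */
-{
-    while (flag == 0)
-        ;                   /* spin until producer signals      */
-    /* fence needed here too: order the flag read before the
-     * data read */
-    int r = data;           /* without fences, r may be 0, not 42 */
-    (void)r;
-}
-</code>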
-  * Cache coherence
-    * Can be done at the software level or the hardware level
-  * Coherence protocol
-    * Need to ensure that all processors see and update the correct state of a cache block
-    * Need to make sure that writes get propagated and serialized
-    * Simple protocols are not scalable (a single point of synchronization)
-  * Update vs. invalidate
-    * With invalidation, only the writing core retains a valid copy
-      * Can lead to ping-ponging (tons of reads/writes to the same block from several processors)
-    * With updates, the bus becomes the bottleneck
-  * Snoopy bus
-    * Bus based, single point of serialization
-    * More efficient with a small number of processors
-    * All caches snoop other caches' read/write requests to keep cache blocks coherent
-  * Directory based
-    * Single point of serialization per block
-    * The directory coordinates coherence (see the sketch below)
-    * More scalable
-    * The directory keeps track of where the copies of each block reside
-      * Supplies data on a read
-      * Invalidates other copies on a write
-      * Has an exclusive state
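-    * A hypothetical sketch of a bit-vector directory entry; field names and sizes are illustrative assumptions:
-<code c>
-#include <stdint.h>
-
-#define NUM_CORES 64
-
-typedef struct {
-    uint64_t sharers;     /* bit i set => core i may have a copy  */
-    uint8_t  exclusive;   /* 1 => exactly one core holds it dirty */
-    uint8_t  owner;       /* valid when exclusive: which core     */
-} DirEntry;
-
-/* On a write from `core`: invalidate all other sharers, record owner. */
-void dir_on_write(DirEntry *e, int core)
-{
-    /* (send invalidations to every core in e->sharers except `core`) */
-    e->sharers   = 1ULL << core;
-    e->exclusive = 1;
-    e->owner     = (uint8_t)core;
-}
-
-/* On a read from `core`: add it to the sharer set, downgrading the
- * owner if the block was held exclusively. */
-void dir_on_read(DirEntry *e, int core)
-{
-    e->sharers  |= 1ULL << core;
-    e->exclusive = 0;
-}
-</code>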
-  * MSI coherence protocol
-    * Slides 56-57
-    * Consumes bus bandwidth (motivates adding an "exclusive" state)
-  * MESI coherence protocol (see the sketch below)
-    * Adds the exclusive state (this is the only cached copy, and it is clean) to MSI
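-    * A simplified sketch of MESI state transitions for one block, seen from one cache; this is an assumed simplification, not the exact lecture notation:
-<code c>
-typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
-
-/* Local processor reads the block. */
-MesiState on_local_read(MesiState s, int other_caches_have_copy)
-{
-    if (s == INVALID)   /* miss: load E if no one else has it, else S */
-        return other_caches_have_copy ? SHARED : EXCLUSIVE;
-    return s;           /* M/E/S all satisfy reads locally            */
-}
-
-/* Local processor writes the block. */
-MesiState on_local_write(MesiState s)
-{
-    if (s == EXCLUSIVE)
-        return MODIFIED;  /* silent upgrade: no bus traffic needed    */
-    /* S or I: must broadcast an invalidation on the bus first        */
-    return MODIFIED;
-}
-
-/* Another cache's request is snooped on the bus. */
-MesiState on_remote_read(MesiState s)
-{
-    /* M supplies/writes back the data; any valid copy downgrades to S */
-    return (s == INVALID) ? INVALID : SHARED;
-}
-
-MesiState on_remote_write(MesiState s)
-{
-    (void)s;
-    return INVALID;       /* our copy is invalidated                  */
-}
-</code>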
-  * Tradeoffs between snooping and directory based 
-    * Slide 71 has a good summary on this 
-  * MOESI
-    * Adds an owned state; an improvement over the MESI protocol
- 
-===== Lecture 30 (4/18 Fri.) =====
- 
- 
- 
-  * Interference
-  * Complexity of the memory scheduler
-    * Ranking/prioritization has a cost
-    * A complex scheduler has higher latency
-  * Performance metrics for multicore/multithreaded applications (see the formulas below)
-    * Speedup
-    * Slowdown
-    * Harmonic vs. weighted speedup
-  * Fairness metric
-    * Maximum slowdown
-      * Why does it make sense?
-      * Are there scenarios where it does not make sense?
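-    * One standard way to write these metrics (notation assumed, not from the slides), where "alone" means the application runs by itself and "shared" means it runs together with others:
-<code latex>
-\text{Slowdown}_i = \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}}
-
-\text{WeightedSpeedup} = \sum_i \frac{\text{IPC}_i^{\text{shared}}}{\text{IPC}_i^{\text{alone}}}
-\qquad
-\text{HarmonicSpeedup} = \frac{N}{\sum_i \text{IPC}_i^{\text{alone}} / \text{IPC}_i^{\text{shared}}}
-
-\text{MaxSlowdown} = \max_i \text{Slowdown}_i
-</code>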
-  * Predictable performance
-    * Why is it important?
-      * In a server environment, different jobs run on the same server
-      * In a mobile environment, there are multiple request sources that can slow one another down
-    * How to relate slowdown to request service rate
-    * MISE: soft slowdown guarantees
-  * BDI (base-delta-immediate compression)
-    * Memory wall
-      * What is the concern regarding the memory wall?
-    * Size of the cache on the (CPU) die
-    * One possible solution: cache compression
-      * What are the problems with existing cache compression mechanisms?
-        * Some are too complex
-        * Decompression is on the critical path
-          * Data must be decompressed on every read -> decompression should not be on the critical path
-          * An important factor in performance
-    * Software compression is not good enough to compress everything
-    * Zero-value compression
-      * Simple
-      * Good compression ratio
-      * What if the data does not have many zeroes?
-    * Frequent-value compression
-      * Some values appear frequently
-      * Simple and good compression ratio
-      * Requires profiling
-      * Decompression is complex
-    * Frequent-pattern compression
-      * Still too complex in terms of decompression
-    * Base-delta compression (see the sketch below)
-      * Easy to decompress but retains the benefit of compression
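-    * An illustrative simplification of the base-delta idea; the 4-word line and 1-byte deltas are assumptions for the sketch, not the actual BDI parameters:
-<code c>
-#include <stdint.h>
-#include <stdio.h>
-
-#define WORDS 4
-
-/* Returns 1 and fills base/deltas if the line is compressible:
- * every word must equal base + a delta that fits in one byte. */
-int bdi_compress(const int64_t line[WORDS], int64_t *base,
-                 int8_t deltas[WORDS])
-{
-    *base = line[0];                      /* first word is the base    */
-    for (int i = 0; i < WORDS; i++) {
-        int64_t d = line[i] - *base;
-        if (d < INT8_MIN || d > INT8_MAX)
-            return 0;                     /* delta too wide: store raw */
-        deltas[i] = (int8_t)d;
-    }
-    return 1;                             /* 32 bytes -> 12 bytes      */
-}
-
-int main(void)
-{
-    int64_t line[WORDS] = {1000, 1003, 998, 1010};  /* close-by values */
-    int64_t base; int8_t d[WORDS];
-    /* Decompression is trivial (off the critical path concern):
-     * word i = base + d[i]. */
-    if (bdi_compress(line, &base, d))
-        printf("compressed: base=%lld\n", (long long)base);
-    return 0;
-}
-</code>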
-===== Lecture 31 (4/28 Mon.) =====  ​ 
- 
-  * Directory-based cache coherence
-    * Each directory has to handle validation/invalidation
-    * Extra cost of synchronization
-    * Need to ensure race conditions are resolved
-  * Interconnection 
-    * Topology 
-      * Bus 
-      * Mesh 
-        * Torus 
-      * Tree 
-      * Butterfly 
-      * Ring 
-        * Bi-directional ring 
-          * More scalable 
-        * Hierarchical ring 
-          * Even more scalable 
-          * More complex 
-      * Crossbar 
-      * etc. 
-    * Circuit switching 
-    * Multistage network 
-      * Butterfly 
-      * Delta network 
-    * Handling contention 
-      * Buffering vs. dropping/​deflection (no buffering) 
-    * Routing algorithms (see the X-Y routing sketch below)
-      * Handling deadlock
-      * X-Y routing
-        * Turn model (to avoid deadlocks)
-      * Add more buffering for an escape path
-      * Oblivious routing
-        * Can take different paths
-          * DOR between intermediate locations
-        * Balances network load
-      * Adaptive routing
-        * Uses the state of the network to determine the route
-          * Aware of local and/or global congestion
-        * Non-minimal adaptive routing can have livelocks
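-    * A minimal sketch of X-Y (dimension-order) routing on a 2D mesh: route fully in X first, then in Y, which avoids the cyclic turn dependences that cause deadlock; port names and coordinates are illustrative:
-<code c>
-#include <stdio.h>
-
-typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } Port;
-
-Port xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
-{
-    if (cur_x < dst_x) return EAST;    /* correct X first          */
-    if (cur_x > dst_x) return WEST;
-    if (cur_y < dst_y) return NORTH;   /* then correct Y           */
-    if (cur_y > dst_y) return SOUTH;
-    return LOCAL;                      /* arrived: eject the packet */
-}
-
-int main(void)
-{
-    /* Hop-by-hop route from (0,0) to (2,1). */
-    int x = 0, y = 0;
-    Port p;
-    while ((p = xy_route(x, y, 2, 1)) != LOCAL) {
-        if (p == EAST) x++; else if (p == WEST) x--;
-        else if (p == NORTH) y++; else y--;
-        printf("now at (%d,%d)\n", x, y);
-    }
-    return 0;
-}
-</code>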
- 
-===== Lecture 32 (4/30 Wed.) =====  ​ 
- 
- 
-  * Serialized code sections
-    * Degrade performance
-    * Waste energy
-  * Heterogeneous cores
-    * Can execute the serialized portion on a powerful large core
-  * Tradeoff between multiple small cores, multiple large cores, or heterogeneous cores
-  * Critical sections
-    * A bottleneck in several multithreaded workloads
-    * Asymmetry can help
-    * Accelerated critical sections (ACS)
-      * Use a large core to run the serialized portion of the code
-      * How to correctly support ACS
-      * False serialization
-      * Handling private/shared data
-    * BIS (bottleneck identification and scheduling)
-      * Identify the bottlenecks
-        * Serial bottleneck
-        * Barrier
-        * Critical section
-        * Pipeline stages
-      * An application might wait on different types of bottlenecks
-      * Allows BottleneckCall and BottleneckReturn
-      * Acceleration can be done in multiple ways
-        * Ship to a big core
-        * Increase the frequency
-        * Prioritize the thread in shared resources (e.g., the memory scheduler schedules that thread's requests first)
-      * The bottleneck table keeps track of each thread's bottlenecks and determines their criticality
- 
-===== Lecture 33 (5/2 Fri.) =====      
- 
- 
-  * DRAM scaling problem 
-  * Possible solutions to the scaling problem 
-    * Lower-leakage DRAM
-    * Heterogeneous DRAM (TL-DRAM, etc.)
-    * Add more functionality to DRAM
-    * Denser designs (3D stacking)
-    * Different technologies
-      * NVM 
-  * Non-volatile memory
-    * Resistive memory
-      * PCM
-        * Inject current to change the phase
-        * Scales better than DRAM
-          * Multiple bits per cell
-            * Wider resistance range
-        * No refresh is needed
-        * Downsides: latency and write endurance
-      * STT-MRAM
-        * Inject current to change the polarity
-      * Memristors
-        * Inject current to change the structure
-    * Persistence: data stays there even without power
-      * Unified memory and storage management (persistent data structures): a single-level store
-        * Improves energy and performance
-        * Simplifies the programming model
- 