====== Buzzwords ======

Buzzwords are terms mentioned during lecture that are particularly important to understand thoroughly. This page tracks the buzzwords for each lecture and can be used as a reference for finding gaps in your understanding of the course material.

===== Lecture 1 (1/12 Mon.) =====
  * Level of transformation
    * Algorithm
    * Compiler
  * Cross abstraction layers
  * Tradeoffs
  * Caches
  * DRAM/memory controller
  * DRAM banks
  * Row buffer hit/miss
  * Row buffer locality
  * Unfairness
  * Memory performance hog
  * Shared DRAM memory system
  * Streaming access vs. random access
  * Memory scheduling policies
  * Scheduling priority
  * Retention time of DRAM
  * Process variation
  * Retention time profile
  * Power consumption
  * Bloom filter (see the sketch after this list)
  * Hamming code
  * Hamming distance
  * DRAM row hammer
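Since the Bloom filter is a recurring tool in this course, here is a minimal sketch of one. The bit-array size, the hash count, and the `hashlib`-based hashing are illustration choices, not anything prescribed in lecture.

<code python>
import hashlib

class BloomFilter:
    # Approximate set membership: false positives are possible,
    # false negatives are not.
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k salts.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.num_bits

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def maybe_contains(self, item):
        # False is always correct; True might be a false positive.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter()
bf.insert("row 0x1A")
print(bf.maybe_contains("row 0x1A"))   # True
print(bf.maybe_contains("row 0x2B"))   # False (barring a false positive)
</code>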
  
===== Lecture 2 (1/14 Wed.) =====

  * Moore's Law
  * Algorithm --> step-by-step procedure to solve a problem
  * in-order execution
  * out-of-order execution
  * Memory wall (a part of scaling issue)
  * Scaling issue
    * Transistors are getting smaller
  * Key components of a computer
  * Design points
    * Design processors to meet the design points
  * Software stack
  * Design decisions
  * Datacenters
  * Reliability problems that cause errors
  * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
    * Pre-paradigm science
    * Normal science
    * Revolutionary science
  * Components of a computer
    * Computation
    * Communication
    * Storage
      * DRAM
      * NVRAM (Non-volatile memory): PCM, STT-MRAM
      * Storage (Flash/Harddrive)
  * Von Neumann Model (Control flow model)
    * Stored program computer
      * Properties of Von Neumann Model: Stored program, sequential instruction processing
      * Unified memory
        * When is an instruction interpreted as an instruction (as opposed to a datum)?
      * Program counter
      * Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
  * Data flow model
    * Data flow machine
    * Operands
    * Live-outs/Live-ins
    * Different types of data flow nodes (conditional/relational/barrier)
    * How to do transactions in dataflow?
      * Example: bank transactions
  * Tradeoffs between control-driven and data-driven
    * Which is easier to program?
    * Which is easier to compile?
    * Which is more parallel? (does that mean it is faster?)
    * Which machines are more complex to design?
    * In control flow, when a program stops, there is a pointer to the current state (precise state).
  * ISA vs. Microarchitecture
    * Semantics in the ISA
    * uArch should obey the ISA
    * Changing the ISA is costly; it can affect compatibility.
  * Instruction pointers
  * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
      * Out-of-order executions
      * etc.
    * Design techniques
      * Adder implementation (bit serial, ripple carry, carry lookahead)
      * Connection Machine (an example of a machine that uses bit-serial computation to trade off latency for more parallelism)
  * Microprocessor: ISA + uArch + circuits
  * What is part of the ISA? Instructions, memory, etc.

===== Lecture 3 (1/16 Fri.) =====
  
  * Microarchitecture
  * Three major tradeoffs of computer architecture
  * Macro-architecture
  * LC-3b ISA
  * Unused instructions
  * Bit steering
  * Instruction processing style
  * 0,1,2,3-address machines
    * Stack machine
    * Accumulator machine
    * 2-operand machine
    * 3-operand machine
    * Tradeoffs between 0,1,2,3-address machines
  * Postfix notation (see the stack-machine sketch after this list)
  * Instructions/Opcode/Operand specifiers (i.e. addressing modes)
  * Simple vs. complex data types (and their tradeoffs)
  * Semantic gap and level
  * Translation layer
  * Addressability
  * Byte/bit addressable machines
  * Virtual memory
  * Big/little endian
  * Benefits of having registers (data locality)
  * Programmer visible (Architectural) state
    * Programmers can access this directly
    * What are the benefits?
  * Microarchitectural state
    * Programmers cannot access this directly
  * Evolution of registers (from accumulators to registers)
  * Different types of instructions
    * Control instructions
    * Data instructions
    * Operation instructions
  * Addressing modes
    * Tradeoffs (complexity, flexibility, etc.)
  * Orthogonal ISA
    * Addressing modes that are orthogonal to instruction types
  * I/O devices
  * Vectored vs. non-vectored interrupts
  * Complex vs. simple instructions
    * Tradeoffs
  * RISC vs. CISC
    * Tradeoffs
    * Backward compatibility
    * Performance
    * Optimization opportunity
  * Translation
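To make the 0-address (stack) machine and postfix notation concrete, here is a toy sketch. The instruction names (`push`, `add`, `mul`) are made up for illustration; the point is that operands are implicit on the stack, so instructions need no operand specifiers.

<code python>
def run_stack_machine(program):
    # Evaluate a program for a toy 0-address (stack) machine.
    # Every operation works on the top of the stack.
    stack = []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()

# (2 + 3) * 4 in postfix: 2 3 + 4 *
program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
print(run_stack_machine(program))   # 20
</code>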
  
===== Lecture 4 (1/21 Wed.) =====

  * Fixed vs. variable length instructions
  * Huffman encoding
  * Uniform vs. non-uniform decode
  * Registers
    * Tradeoffs between the number of registers
  * Alignment
    * How does MIPS load words across the alignment boundary? (see the sketch below)
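On the alignment question: base MIPS `lw` traps on an unaligned address, and an unaligned word is instead assembled from the two aligned words that straddle it (in hardware, the `lwl`/`lwr` pair does this). A rough sketch of the byte-splicing idea, assuming a toy little-endian byte-addressable memory:

<code python>
def load_unaligned_word(mem, addr):
    # Assemble a 4-byte word at an arbitrary address from the two
    # aligned words that contain it (the lwl/lwr idea). In hardware
    # this is two aligned accesses; one slice stands in for them here.
    base = addr & ~0x3                 # address of the first aligned word
    two_words = mem[base:base + 8]     # the two aligned words
    offset = addr & 0x3
    return two_words[offset:offset + 4]

mem = bytes(range(16))                        # toy memory: 0x00, 0x01, ...
print(load_unaligned_word(mem, 6).hex())      # '06070809'
</code>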
  
===== Lecture 5 (1/26 Mon.) =====
  
  * Tradeoffs in ISA: Instruction length
    * Uniform vs. non-uniform
  * Design point/Use cases
    * What dictates the design point?
  * Architectural states
  * uArch
    * How to implement the ISA in the uArch
  * Different stages in the uArch
  * Clock cycles
  * Multi-cycle machine
  * Datapath and control logic
    * Control signals
  * Execution time of instructions/programs
    * Metrics and what they mean (see the worked example after this list)
  * Instruction processing
    * Fetch
    * Memory fetch
    * Writeback
  * Encoding and semantics
  * Different types of instructions (I-type, R-type, etc.)
  * Control flow instructions
  * Critical path analysis
    * Critical path of a single cycle processor
  * What is in the control signals?
    * Combinational logic & Sequential logic
  * Control store
  * Tradeoffs of a single cycle uarch
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
    * Dynamic power/Static power
      * Increases in power due to frequency
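For the execution-time metrics, the relation the lecture builds on is: execution time = instruction count x CPI x clock period. A quick worked example, with made-up numbers:

<code python>
# Iron law of performance: time = instruction count x CPI x clock period.
inst_count = 1_000_000        # instructions executed (made-up)
cpi = 1.2                     # average cycles per instruction (made-up)
clock_hz = 2_000_000_000      # 2 GHz clock

cycles = inst_count * cpi
time_s = cycles / clock_hz
print(f"{cycles:.0f} cycles -> {time_s * 1e3:.3f} ms")
# 1200000 cycles -> 0.600 ms
</code>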
  
===== Lecture 6 (1/28 Wed.) =====
  
  * Design principles
    * Common case design
    * Critical path design
    * Balanced designs
  * Multi cycle design
  * Microcoded/Microprogrammed machines
    * States
    * Translation from one state to another
    * Microinstructions
    * Microsequencing
  * Variable latency memory
  * Handling interrupts
  * Difference between interrupts and exceptions
  * Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives

===== Lecture 7 (1/30 Fri.) =====

  * Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
  * Updating machine behavior
  * Horizontal microcode
  * Vertical microcode
  * Primitives
  * Nanocode and millicode
    * What are the differences between nano/milli/microcode?
  * Microprogrammed vs. hardwired control
  * Pipelining
  * Limitations of the multi-programmed design
    * Idle resources
  * Throughput of a pipelined design
    * What dictates the throughput of a pipelined design?
  * Latency of the pipelined design
  * Dependency
    * Pipeline flush
    * Speculation

===== Lecture 8 (2/2 Mon.) =====

  * Interlocking
  * Multipath execution
  * No-op (Bubbles in the pipeline)
  * Valid bits in the instructions
  * Branch prediction
  * Different types of data dependence
    * Based on the profile info, position the code accordingly
    * Try to make the next sequential instruction be the next inst. to be executed
  * Predicate combining (combine predicates for a branch instruction)
  * Predicated execution (control dependence becomes data dependence)

===== Lecture 9 (2/4 Wed.) =====

  * Definition of basic blocks
  * Control flow graph
  * Delayed branching
    * Benefit?
    * Inst. from different threads can fill in the bubbles
    * Cost?
  * Simultaneous multithreading
  * Branch prediction
    * Guess what to fetch next.
      * Given the program/number of instructions, percent of branches, branch prediction accuracy, and penalty cost, how to compute the cost coming from branch mispredictions (see the worked example after this list)
        * How many extra instructions are being fetched?
        * What is the performance degradation?
    * How to reduce the miss penalty?
    * Predicting the next address (non PC+4 address)
    * Global branch history - for directions
    * Can use compiler to profile and get more info
      * Input set dictates the accuracy
      * Adds time to compilation
    * Heuristics that are common and don't require profiling
      * Might be inaccurate
      * Does not require profiling
    * Static branch prediction
      * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch
        * For example, x86 has the hint bit
    * Dynamic branch prediction
        * One more bit for hysteresis
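The misprediction-cost exercise above becomes concrete with a few made-up numbers; only the arithmetic matters here:

<code python>
# Cost of branch mispredictions, with made-up parameters:
insts = 1_000_000_000       # total instructions
branch_frac = 0.20          # 20% of instructions are branches
accuracy = 0.98             # branch prediction accuracy
penalty = 20                # flush penalty (cycles per misprediction)
base_cpi = 1.0              # CPI if every prediction were correct

branches = insts * branch_frac
mispredictions = branches * (1 - accuracy)
extra_cycles = mispredictions * penalty
cpi = base_cpi + extra_cycles / insts
print(f"{mispredictions:.0f} mispredictions, CPI {base_cpi} -> {cpi:.2f}")
# 4000000 mispredictions, CPI 1.0 -> 1.08
</code>

Rerunning with accuracy = 0.99 halves the overhead term (0.08 -> 0.04 CPI), which is why the 98%-vs-99% distinction in Lecture 10 matters.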
  
===== Lecture 10 (2/6 Fri.) =====

  * Branch prediction accuracy
    * Why is it so important?
      * Differences between 99% accuracy and 98% accuracy
      * Cost of a misprediction when the pipeline is very deep
  * Global branch correlation
    * Some branches are correlated
    * What information is used?
  * Two-level branch prediction
    * What entries do you keep in the global history?
    * What entries do you keep in the local history?
    * How many tables?
    * Store both the GHR and PC as one combined piece of information (gshare; see the sketch after this list)
    * How do you use the information? Why is the XOR result still usable?
  * Warmup cost of the branch predictor
    * Hybrid solution? Fast warmup is used first, then switch to the slower one.
  * Tournament predictor (Alpha 21264)
  * Predicated execution - eliminate branches
    * What are the tradeoffs?
    * What if the block is big? (can lead to executing a lot of useless work)
    * Allows easier code optimization
      * From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
      * Use branch prediction on easy-to-predict code
      * Use predicated execution on hard-to-predict code
      * Compiler can be more aggressive in optimizing the code
    * What are the tradeoffs? (slide #47)
  * Multi-path execution
    * Execute both paths
    * Can lead to wasted work
  * VLIW
  * Superscalar
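A minimal sketch of the gshare flavor of two-level prediction mentioned above: the global history register (GHR) is XORed with PC bits to index one table of 2-bit saturating counters (the second bit is the hysteresis from Lecture 9). The table size, the `pc >> 2` alignment shift, and the training loop are illustration choices.

<code python>
class GsharePredictor:
    # gshare: index = (PC bits XOR global history), one counter table.
    # The XOR folds both correlation sources into a single index.
    def __init__(self, index_bits=12):
        self.mask = (1 << index_bits) - 1
        self.counters = [1] * (1 << index_bits)  # 2-bit, weakly not-taken
        self.ghr = 0

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.counters[self._index(pc)] >= 2   # taken if 2 or 3

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

p = GsharePredictor()
for _ in range(16):                  # a run of taken outcomes at one branch
    p.update(0x4000, taken=True)
print(p.predict(0x4000))             # True once the history saturates
</code>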
  
  
===== Lecture 11 (2/11 Wed.) =====
  
  * Geometric GHR length for branch prediction
  * Perceptron branch predictor
  * Multi-cycle executions (different functional units take different numbers of cycles)
    * Instructions can retire out-of-order
  * Exceptions and Interrupts
    * When are they handled?
    * Why should some interrupts be handled right away?
  * Precise exceptions
    * arch. state should be consistent before handling the exception/interrupts
      * Easier to restart the processes
    * How to ensure precise exceptions?
    * Tradeoffs between each method
  * Reorder buffer
    * Reorder results before they are visible to the arch. state
      * Need to preserve the sequential semantics and data
    * What information is in the ROB entry?
    * Where to get the value from (forwarding path? reorder buffer?)
      * Extra logic to check where the youngest instruction/value is
      * Content addressable search (CAM)
        * A lot of comparators
    * Different ways to simplify the reorder buffer
  * Future file (commonly used, along with reorder buffer)
    * Keep two sets of register files
      * An updated value (speculative), called the future file
      * A backup value (to restore the state quickly)
    * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to ROB alone)
  * Branch misprediction resembles an exception
    * The difference is that branch misprediction is not visible to the software
  * Checkpointing
    * Advantages?

===== Lecture 12 (2/13 Fri.) =====

  * Renaming
  * Register renaming table
  * Predictors (branch predictor, cache line predictor ...)
  * Power budget (and its importance)
      * Slide 28 --> register renaming
      * Slides 30-35 --> Exercise (also on the board)
        * This will be useful for the midterm
    * Register aliasing table
    * Broadcasting tags
    * Using dataflow
  
===== Lecture 13 (2/16 Mon.) =====
  
  * OoO --> Restricted Dataflow
      * Dispatch width
      * Parallelism in the program
    * What does it mean to be restricted data flow?
      * Still visible as a Von Neumann model
    * Where does the efficiency come from?
    * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * Would like to schedule them out of order, but make them visible in-order
    * This is one of the most complex structures in load/store handling
    * What information can be used to predict these load/store optimizations?
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle a misprediction/recovery
    * OoO + branch prediction?
    * Speculatively update the history register
      * When do you update the GHR?
  * Token dataflow arch.
    * What are tokens?
    * Difficulties?

===== Lecture 14 (2/18 Wed.) =====
  
  * SISD/SIMD/MISD/MIMD
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
    * What do the functional units look like? Deep pipelines and simpler control.
    * CRAY-I is one example of a vector processor
    * Memory access pattern in a vector processor
      * How do the memory accesses benefit the memory bandwidth?
      * Memory level parallelism
      * Stride length vs. the number of banks
        * Stride length should be relatively prime to the number of banks (see the sketch after this list)
    * What if there are multiple memory ports?
    * Gather/Scatter allows vector processors to be a lot more programmable (i.e. gather data for parallelism)
      * Helps handle sparse matrices
    * Conditional operations
    * Structure of vector units
    * Intel SSE --> Modern version of MMX
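A tiny computation showing why the stride should be relatively prime to the number of banks; bank = address mod number-of-banks is an assumed, simplified mapping:

<code python>
# Which banks does a strided access stream touch?
def banks_touched(stride, num_banks, accesses=16):
    return sorted({(i * stride) % num_banks for i in range(accesses)})

print(banks_touched(stride=8, num_banks=8))  # [0]: every access conflicts
print(banks_touched(stride=7, num_banks=8))  # [0..7]: spread over all banks
</code>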
  
===== Lecture 15 (2/20 Fri.) =====

  * GPU
    * Warp/Wavefront
    * Lower SIMD efficiency
    * What if you have layers of branches? (see the utilization sketch after this list)
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
    * How to steer the instructions (determine dependency/stalling)?
    * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
  * Stage processing
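A back-of-the-envelope sketch of SIMD utilization under branch divergence, assuming a 32-lane warp and one instruction on each side of the branch (both made-up simplifications):

<code python>
# After a divergent branch, both paths issue serially, each with
# only its own lanes active.
warp_size = 32
taken_lanes = 20                       # made-up: 20 lanes take the branch

useful = warp_size + taken_lanes + (warp_size - taken_lanes)
issued = warp_size * 3                 # 3 full-width instruction issues
print(f"SIMD utilization: {useful / issued:.0%}")   # 67%
</code>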

===== Lecture 16 (2/23 Mon.) =====

  * Systolic arrays
    * Processing elements transform data in chains
    * Can be arrays of multi-dimensional processing elements
    * Developed for image processing (for example, convolution)
    * Can be used to break stages in pipelined programs, using a set of queues and processing elements
    * Can enable high concurrency; good for regular programs
    * Very special purpose
    * The WARP computer
  * Static instruction scheduling
    * How do we find the next instruction to execute?
  * Live-ins and live-outs
  * Basic blocks
    * Rearranging instructions in the basic block
    * Code movement from one basic block to another
  * Straight line code
  * Independent instructions
    * How to identify independent instructions
  * Atomicity
  * Trace scheduling
    * Side entrances
    * Fix-up code
    * How scheduling is done
  * Instruction scheduling
    * Prioritization heuristics
  * Superblock
    * Traces with no side entrances
  * Hyperblock
  * BS-ISA
  * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA

===== Lecture 17 (2/25 Wed.) =====

  * IA-64
    * EPIC
  * IA-64 instruction bundle
    * Multiple instructions in the bundle along with the template bits
    * Template bits
    * Stop bits
    * Non-faulting loads and exception propagation
  * Aggressive ST-LD reordering
  * Physical memory system
  * Ideal pipelines
  * Ideal cache
    * More capacity
  * DRAM cell
    * Cheap
    * Senses the perturbation through a sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
  * Memory bank
    * Read access sequence
    * DRAM: Activate -> Read -> Precharge (if needed) (see the row-buffer sketch after this list)
    * What dominates the access latency for DRAM and SRAM?
  * Scaling issue
    * Hard to scale the cell down to smaller sizes
  * Cost and benefit of having more associativity
    * Given the associativity, which block should be replaced if it is full?
    * Replacement policy
      * Random
      * Least recently used (LRU)
    * Approximate LRU
      * Victim and next-Victim policy
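To connect the Activate/Read/Precharge sequence with row buffer hits and conflicts, here is a toy open-row-policy model; the request stream is made up, and the very first access is counted as a conflict for simplicity:

<code python>
def service(row_stream):
    # Open-row policy: leave the last-accessed row in the row buffer.
    open_row, hits, conflicts = None, 0, 0
    for row in row_stream:
        if row == open_row:
            hits += 1          # row buffer hit: READ only
        else:
            conflicts += 1     # PRECHARGE + ACTIVATE + READ
            open_row = row
    return hits, conflicts

# Streaming accesses hit; alternating rows conflict every time.
print(service([3, 3, 3, 7, 7, 3]))   # (3, 3)
</code>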

===== Lecture 18 (2/27 Fri.) =====

  * Tag store and data store
  * Cache hit rate
  * Average memory access time (AMAT) (see the worked example at the end of this section)
  * AMAT vs. stall time
  * Cache basics
    * Direct-mapped vs. associative cache
    * Set/block (line)/placement/replacement
    * How do tag and index get used?
  * Full associativity
  * Set associative cache
    * Insertion, promotion, eviction (replacement)
  * Various replacement policies
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly
    * Approximate LRU
      * Victim and next-Victim policy
  * Set thrashing
    * Working set is bigger than the associativity
    * Is this optimal?
    * Complexity?
  * DRAM as a cache for disk
  * Handling writes
    * Write through
    * In the first level cache
  * Cache access
    * First level access
    * Second level access
      * When to start the second level access
  * Cache performance
    * Capacity
    * Block size
    * Associativity
  * Classification of cache misses
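A worked AMAT example with made-up numbers:

<code python>
# AMAT = hit time + miss rate x miss penalty
hit_time = 2        # cycles (first-level access)
miss_rate = 0.05    # 5% of accesses miss
miss_penalty = 100  # cycles to fetch from the next level

amat = hit_time + miss_rate * miss_penalty
print(f"AMAT = {amat} cycles")   # AMAT = 7.0 cycles
</code>

Note the AMAT vs. stall time item above: on an out-of-order core, part of the miss penalty overlaps with useful work, so AMAT does not translate directly into stall cycles.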

===== Lecture 19 (03/02 Mon.) =====

  * Subblocks
  * Victim cache
  * MSHR
    * What information goes into the MSHR? (see the sketch after this list)
    * When do you access the MSHR?
  * Memory banks
  * Shared caches in multi-core processors
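A sketch of the MSHR idea: one entry per outstanding miss, keyed by block address, so later misses to the same block merge instead of issuing duplicate memory requests. The structure and names here are made up for illustration:

<code python>
mshrs = {}   # block address -> list of waiting requesters

def on_miss(block_addr, requester):
    if block_addr in mshrs:
        mshrs[block_addr].append(requester)   # secondary miss: merge
        return "merged"
    mshrs[block_addr] = [requester]           # primary miss: allocate entry
    return "request sent to memory"

def on_fill(block_addr):
    # Data came back: wake every requester waiting on this block.
    return mshrs.pop(block_addr)

print(on_miss(0x40, "ld r1"))   # request sent to memory
print(on_miss(0x40, "ld r2"))   # merged
print(on_fill(0x40))            # ['ld r1', 'ld r2']
</code>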

===== Lecture 20 (03/04 Wed.) =====

  * Virtual vs. physical memory
  * The system's management of memory
    * Benefits
  * Problem: physical memory has limited size
  * Mechanisms: indirection, virtual addresses, and translation
  * Demand paging
  * Physical memory as a cache
  * Tasks of system SW for VM
  * Serving a page fault
  * Address translation (see the sketch after this list)
  * Page table
    * PTE (page table entry)
  * Page replacement algorithms
    * CLOCK algo.
  * Inverted page table
  * Page size trade-offs
  * Protection
  * Multi-level page tables
  * x86 implementation of page tables
  * TLB
    * Handling misses
  * When to do address translation?
  * Homonyms and Synonyms
    * Homonym: the same VA maps to different PAs across multiple processes
    * Synonym: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
  * Virtually indexed vs. physically indexed caches
  * Virtually tagged vs. physically tagged caches
  * Virtually indexed, physically tagged caches
  * Can these create problems when we have a cache?
  * How to eliminate these problems?
  * Page coloring
  * Interaction between cache and TLB
  
===== Lecture 21 (03/23 Mon.) =====
  
  * DRAM scaling problem
  * Demands/trends affecting the main memory
    * More capacity
    * Low energy
    * More bandwidth
    * QoS
  * ECC in DRAM
  * Multi-porting
    * Virtual multi-porting
    * True multi-porting
  * Multiple cache copies
  * Alignment
  * Banking
    * Can have bank conflicts
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  * Bank mapping
    * How to avoid bank conflicts? (see the interleaving sketch at the end of this section)
  * Channel mapping
    * Address mapping to minimize bank conflicts
    * Page coloring
      * Virtual-to-physical mapping that can help reduce conflicts
  * Accessing DRAM
    * Row bits
    * Addressability
    * DRAM has its own clock
    * Sense amplifiers
    * Bit lines
    * Word lines
  * DRAM (1T) vs. SRAM (6T)
    * Cost
  * DRAM Channel
    * An interface to DRAM, each with its own ranks/banks
  * DRAM Chip
  * DIMM
    * More DIMMs add interconnect complexity
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
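A sketch of how the address-to-bank mapping determines conflicts, using a made-up geometry of 4 banks, 64 B cache blocks, and 8 low-order row bits; row vs. cache-block interleaving comes up again in Lecture 22:

<code python>
BLOCK_BITS = 6          # 64 B cache blocks
ROW_LOW_BITS = 8        # made-up number of low-order row bits
NUM_BANKS = 4

def bank_block_interleaved(addr):
    # Consecutive cache blocks map to consecutive banks.
    return (addr >> BLOCK_BITS) % NUM_BANKS

def bank_row_interleaved(addr):
    # A whole row's worth of data stays in one bank.
    return (addr >> (BLOCK_BITS + ROW_LOW_BITS)) % NUM_BANKS

stream = [i * 64 for i in range(8)]                  # sequential blocks
print([bank_block_interleaved(a) for a in stream])   # [0, 1, 2, 3, 0, 1, 2, 3]
print([bank_row_interleaved(a) for a in stream])     # [0, 0, ..., 0]: one busy bank
</code>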
  
===== Lecture 22 (03/25 Wed.) =====

  * Flash controller
  * Flash memory
  * Garbage collection in flash
  * Overhead in flash memory
    * Erase (off the critical path, but takes a long time)
  * Different types of DRAM
  * DRAM design choices
    * Cost/density/latency/BW/yield
    * Benefit of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sending DRAM commands
    * Periodically send commands to refresh DRAM cells
    * Ensure correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determine the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS (see the sketch after this list)
      * Try to maximize row buffer hit rate
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
    * PAR-BS
      * Key benefits
      * Stall time
      * Shortest job first
    * STFM
    * ATLAS
    * TCM
      * Key benefits
      * Configurability
      * Fairness + performance at the same time
      * Robustness issues
  * Open row policy
  * Closed row policy
  * QoS
    * QoS issues in memory scheduling
    * Fairness
    * Performance guarantees
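A minimal sketch of the FR-FCFS decision rule (row hits first, then oldest first); the queue format is made up. Its bias toward row hits is exactly what creates the unfairness that Capped FR-FCFS, STFM, PAR-BS, ATLAS, and TCM address.

<code python>
def pick_fr_fcfs(queue, open_row):
    # queue: list of (arrival_time, row); lower arrival_time = older.
    hits = [req for req in queue if req[1] == open_row]  # row-buffer hits
    candidates = hits if hits else queue
    return min(candidates, key=lambda req: req[0])       # oldest first

queue = [(0, 5), (1, 9), (2, 9), (3, 5)]
print(pick_fr_fcfs(queue, open_row=9))   # (1, 9): a younger row hit wins
</code>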
-      +
-      +
-===== Lecture 25 (2/2 Wed.) =====+
  
  
- 
-  * Latency sensitivity 
-    * Performance drops a lot when the memory request latency is long 
-  * TCM 
-    * Tradeoff between throughput and fairness 
-    * Latency sensitive cluster (non-intensive cluster) 
-      * Ranking based on memory intensity 
-    * Bandwidth intensive cluster 
-      * Round robin within the cluster 
-    * Generally latency sensitive cluster has more priority 
-    * Provide robust fairness vs. throughput 
-    * Complexity of TCM? 
-  * Different ways to control interference in DRAM 
-    * Partitioning of resource 
-      * Channel partitioning:​ map applications that interfere with each other in a different channel 
-        * Keep track of application'​s characteristics 
-        * Dedicate a channel might waste the bandwidth 
-        * Need OS support to determine the channel bits 
-    * Source throttling 
-      * A controller throttle the core depends on the performance target 
-      * Example: Fairness via source throttling 
-        * Detect unfairness and throttle application that is interfering 
-        * How do you estimate slowdown? 
-        * Threshold based solution: hard to configure 
-    * App/thread scheduling 
-      * Critical threads usually stall the progress 
-    * Designing DRAM controller 
-      * Has to handle the normal DRAM operations 
-        * Read/​write/​refresh/​all the timing constraints 
-      * Keep track of resources 
-      * Assign priorities to different requests 
-      * Manage requests to banks 
-    * Self-optimizing controller 
-      * Use machine learning to improve DRAM controller 
-  * DRAM Refresh 
-    * Why does DRAM has to refresh every 64ms 
-    * Banks are unavailable during refresh 
-      * LPDDR mitigate this by using a per-bank refresh 
-    * Has to spend longer time with bigger DRAM 
-    * Distributed refresh: stagger refresh every 64 ms in a distributed manner 
-      * As oppose to burst refresh (long pause time) 
-  * RAIDR: Reduce DRAM refresh by profiling and binning 
-    * Some row do not have to be refresh very frequently 
-      * Profile the row 
-        * High temperature changes the retention time: need online profiling 
-  * Bloom filter 
-    * Represent set membership 
-    * Approximated 
-    * Can contain false positive 
-      * Better/more hash function helps eliminate this 
   ​   ​
-  ​ 
-  ​ 
-===== Lecture 26 (4/7 Mon.) ===== 
- 
- 
- 
-  * Tolerate latency can be costly 
-    * Instruction window is complex 
-      * Benefit also diminishes 
-    * Designing the buffers can be complex 
-    * A simpler way to tolerate out of order is desirable 
-  * Different sources that cause the core to stall in OoO 
-    * Cache miss 
-    * Note that stall happens if the inst. window is full 
-  * Scaling instruction window size is hard 
-    * It is better (less complex) to make the windows more efficient 
-  * Runahead execution 
-    * Try to optain MLP w/o increasing instruction windows 
-    * Runahead (i.e. execute ahead) when there is a long memory instruction 
-      * Long memory instruction stall processor for a while anyways, so it's better to make use out of it 
-      * Execute future instruction to generate accurate prefetches 
-      * Allow future data to be in the cache 
-    * How to support runahead execution? 
-      * Need a way to checkpoing the state when entering runahead mode 
-      * How to make executing in the wrong path useful? 
-      * Need runahead cache to handle load/store in Runahead mode (since they are speculative) 
-    * Cost and benefit of runahead execution (slide number 27) 
-    * Runahead can have inefficiency 
-      * Runahead period that are useless 
-        * Get rid of useless inefficient period 
-    * What if there is a dependent cache miss 
-      * Cannot be paralellized in a vanilla runahead 
-      * Can predict the value of the dependent load 
-        * How to predict the address of the load 
-          * Delta value information 
-          * Stride predictor 
-          * AVD prediction 
-  ​ 
-===== Lecture 28 (4/9 Wed.) ===== 
- 
- 
-  * Questions regarding prefetching 
-    * What to prefetch 
-    * When to prefetch 
-    * how do we prefetch 
-    * where to prefetch from 
-  * Prefetching can cause thrasing (evict a useful block) 
-  * Prefetching can also be useless (not being used) 
-    * Need to be efficient 
-  * Can cause memory bandwidth problem in GPU 
-  * Prefetch the whole block, more than one block, or subblock? 
-    * Each one of them has pros and cons 
-    * Big prefetch is more likely to waste bandwidth 
-    * Commonly done in a cache block granularity 
-  * Prefetch accuracy: fraction of useful prefetches out of all the prefetches 
-  * Prefetcher usually predict based on 
-    * Past knowledge 
-    * Compiler hints 
-  * Prefetcher has to prefetch at the right time 
-    * Prefetch that is too early might get evicted 
-      * It might also evict other useful data 
-    * Prefetch too late does not hide the whole memory latency 
-  * Previous prefetches at the same PC can be used as the history 
-  * Previous demand requests also is a good information to use for prefetches 
-  * Prefetch buffer 
-    * Place the prefetch data to avoid thrashing 
-      * Can treat demand/​prefetch requests separately 
-      * More complex 
-  * Generally, demand block is more important 
-    * This means eviction should prefer prefetch block as oppose to demand block 
-  * Tradeoffs between where do we place the prefetcher 
-    * Look at L1 hits and misses 
-    * Look at L1 misses only 
-    * Look at L2 misses 
-    * Different access pattern affect accuracy 
-      * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss) 
-  * Software vs. hardware vs. execution based prefetching 
-    * Software: ISA previde prefetch instructions,​ software utilize it 
-      * What information are useful 
-      * How to make sure the prefetch is timely 
-      * What if you have a pointer based structure 
-        * Not easy to prefetch pointer chasing (because in many case the work between prefetches is short, so you cannot predict the next one timely enough) 
-          * Can be solved by hinting the nextnext and/or nextnextnext address 
-    * Hardware: Identify the pattern and prefetch 
-    * Execution driven: Oppotunistically try to prefetch (runahead, dual-core execution) 
-  * Stride prefetcher 
-    * Predict strides, which is common in many programs 
-    * Cache block based or instruction based 
-  * Stream buffer design 
-    * Buffer the stream of accesses (next address) ​ 
-    * Use the information to prefetch 
-  * What affect prefetcher performance 
-    * Prefetch distance 
-      * How far ahead should we prefetch 
-    * Prefetch degree 
-      * How many prefetches do we prefetch 
-  * Prefetcher performance 
-    * Coverage 
-      * Out of the demand requests, how many are actually from the prefetch request 
-    * Accuracy 
-      * Out of all the prefetch requests, how many are actually getting used 
-    * Timeliness 
-      * How much memory latency can we hide from the prefetch requests 
-  * Feedback directed prefetcher 
-    * Use the result of the prefetcher as a feedback to the prefetcher 
-      * with accuracy, timeliness, polluting information 
-  * Markov prefetcher 
-    * Prefetch based on the previous history 
-    * Use markov model to predict 
-    * Pros: Can cover arbitary pattern (easy for link list traversal or trees) 
-    * Downside: High cost, cannot help with compulsory misses (no history) 
-  * Content directed prefetching 
-    * Indentify the content in memory for pointers (which is used as the address to prefetch 
-    * Not very efficient (hard to figure out which block is the pointer) 
-      * Software can give hints 
-    ​ 
-===== Lecture 28 (4/14 Mon.) ===== 
- 
- 
-  * Execution based prefetcher 
-    * Helper thread/​speculative thread 
-      * Use another thread to pre-execute a program 
-    * Can be a software based or hardware based 
-    * Discover misses before the main program (to prefetch data in a timely manner) 
-    * How do you construct the helper thread 
-    * Preexecute instruction (one example of how to initialize a speculative thread), slide 9 
-  * Benefit of multiprocessor 
-    * Improve performace without not significantly increase power consumption 
-    * Better cost efficiency and easier to scale 
-    * Improve dependability (in case the other core is faulty 
-  * Different types of parallelism 
-    * Instruction level parallelism 
-    * Data level parallelism 
-    * Task level parallelism 
-  * Task-level parallelism
-    * Partition a single, potentially big, task into multiple parallel sub-tasks
-      * Can be done explicitly (parallel programming by the programmer; see the sketch below)
-      * Or implicitly (hardware partitions a single thread speculatively)
-    * Or, run multiple independent tasks (still improves throughput, but no single task finishes faster; also simpler to implement)
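-    * A minimal sketch of explicit task-level parallelism with pthreads, partitioning one big task (an array sum) into per-thread sub-tasks; the example itself is assumed, not from the lecture:
-<code c>
-#include <pthread.h>
-#include <stdio.h>
-
-#define N        1000000
-#define NTHREADS 4
-
-static long data[N];
-static long partial[NTHREADS];
-
-static void *worker(void *arg)
-{
-    long id = (long)arg;                 /* which sub-task am I?     */
-    long lo = id * (N / NTHREADS);
-    long hi = (id + 1) * (N / NTHREADS);
-    for (long i = lo; i < hi; i++)
-        partial[id] += data[i];          /* private partial sum      */
-    return NULL;
-}
-
-int main(void)
-{
-    pthread_t t[NTHREADS];
-    for (long i = 0; i < N; i++) data[i] = 1;
-
-    for (long i = 0; i < NTHREADS; i++)
-        pthread_create(&t[i], NULL, worker, (void *)i);
-
-    long sum = 0;
-    for (long i = 0; i < NTHREADS; i++) {
-        pthread_join(t[i], NULL);        /* synchronization point    */
-        sum += partial[i];
-    }
-    printf("sum = %ld\n", sum);          /* prints sum = 1000000     */
-    return 0;
-}
-</code>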
-  * Loosely coupled multiprocessors
-    * No shared global address space
-      * Message passing to communicate between processors
-    * Simple to manage memory
-  * Tightly coupled multiprocessors
-    * Shared global address space
-    * Need to ensure consistency of data
-  * Switch-on-event multithreading
-    * Switch to another context when there is an event (for example, a cache miss)
-  * Simultaneous multithreading
-    * Dispatch instructions from multiple threads in the same cycle
-  * Amdahl's law (see the worked equation below)
-    * Maximum speedup
-    * The parallel portion is not perfect
-      * Serial bottleneck
-      * Synchronization cost
-      * Load imbalance
-        * Some threads have more work and take longer to reach the synchronization point
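-    * The standard form of Amdahl's law, where p is the parallel fraction and N the number of processors; the numbers below are an illustrative example, not from the slides:
-<code latex>
-\text{Speedup}(N) = \frac{1}{(1 - p) + p/N}
-
-% Example: p = 0.9 (90% parallel), N = 16 processors:
-% Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4, far below 16.
-</code>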
-  * Issues in parallel programming
-    * Correctness
-    * Synchronization
-    * Consistency
-===== Lecture 29 (4/16 Wed.) ===== 
- 
- 
-  * Ordering of instructions
-    * Maintaining memory consistency when there are multiple threads and shared memory
-    * Need to ensure the program's semantics are not changed
-    * Making sure shared data is properly locked when used
-      * Support mutual exclusion
-    * Observed ordering depends on how each processor's execution interleaves
-    * Debugging is also difficult (non-deterministic behavior)
-  * Weak consistency: global ordering only at synchronization points (see the sketch below)
-    * The programmer hints where the synchronization points are
-  * Total store order (TSO) model: a global ordering is enforced only among stores
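-  * A minimal illustration (an assumed example, not from the slides) of why ordering matters: under a weak model, without fences thread 2 can observe flag == 1 yet still read the old value of data:
-<code c>
-int data = 0;
-int flag = 0;
-
-void thread1(void)          /* producer */
-{
-    data = 42;
-    /* fence needed here under weak ordering:
-     * make the data store visible before the flag store */
-    flag = 1;
-}
-
-void thread2(void)          /* consumer */
-{
-    while (flag == 0)
-        ;                   /* spin until producer signals      */
-    /* fence needed here too: order the flag read before the
-     * data read */
-    int r = data;           /* without fences, r may be 0, not 42 */
-    (void)r;
-}
-</code>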
-  * Cache coherence
-    * Can be done at the software level or the hardware level
-  * Coherence protocol
-    * Need to ensure that all processors see and update the correct state of a cache block
-    * Need to make sure that writes get propagated and serialized
-    * Simple protocols are not scalable (a single point of synchronization)
-  * Update vs. invalidate
-    * With invalidation, only the writing core retains a valid copy
-      * Can lead to ping-ponging (tons of reads/writes to the same block from several processors)
-    * With updates, the bus becomes the bottleneck
-  * Snoopy bus
-    * Bus based, single point of serialization
-    * More efficient with a small number of processors
-    * All caches snoop other caches' read/write requests to keep cache blocks coherent
-  * Directory based
-    * Single point of serialization per block
-    * The directory coordinates coherence (see the sketch below)
-    * More scalable
-    * The directory keeps track of where the copies of each block reside
-      * Supplies data on a read
-      * Invalidates other copies on a write
-      * Has an exclusive state
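-    * A hypothetical sketch of a bit-vector directory entry; field names and sizes are illustrative assumptions:
-<code c>
-#include <stdint.h>
-
-#define NUM_CORES 64
-
-typedef struct {
-    uint64_t sharers;     /* bit i set => core i may have a copy  */
-    uint8_t  exclusive;   /* 1 => exactly one core holds it dirty */
-    uint8_t  owner;       /* valid when exclusive: which core     */
-} DirEntry;
-
-/* On a write from `core`: invalidate all other sharers, record owner. */
-void dir_on_write(DirEntry *e, int core)
-{
-    /* (send invalidations to every core in e->sharers except `core`) */
-    e->sharers   = 1ULL << core;
-    e->exclusive = 1;
-    e->owner     = (uint8_t)core;
-}
-
-/* On a read from `core`: add it to the sharer set, downgrading the
- * owner if the block was held exclusively. */
-void dir_on_read(DirEntry *e, int core)
-{
-    e->sharers  |= 1ULL << core;
-    e->exclusive = 0;
-}
-</code>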
-  * MSI coherence protocol
-    * Slides 56-57
-    * Consumes bus bandwidth (motivates adding an "exclusive" state)
-  * MESI coherence protocol (see the sketch below)
-    * Adds the exclusive state (this is the only cached copy, and it is clean) to MSI
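-    * A simplified sketch of MESI state transitions for one block, seen from one cache; this is an assumed simplification, not the exact lecture notation:
-<code c>
-typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;
-
-/* Local processor reads the block. */
-MesiState on_local_read(MesiState s, int other_caches_have_copy)
-{
-    if (s == INVALID)   /* miss: load E if no one else has it, else S */
-        return other_caches_have_copy ? SHARED : EXCLUSIVE;
-    return s;           /* M/E/S all satisfy reads locally            */
-}
-
-/* Local processor writes the block. */
-MesiState on_local_write(MesiState s)
-{
-    if (s == EXCLUSIVE)
-        return MODIFIED;  /* silent upgrade: no bus traffic needed    */
-    /* S or I: must broadcast an invalidation on the bus first        */
-    return MODIFIED;
-}
-
-/* Another cache's request is snooped on the bus. */
-MesiState on_remote_read(MesiState s)
-{
-    /* M supplies/writes back the data; any valid copy downgrades to S */
-    return (s == INVALID) ? INVALID : SHARED;
-}
-
-MesiState on_remote_write(MesiState s)
-{
-    (void)s;
-    return INVALID;       /* our copy is invalidated                  */
-}
-</code>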
-  * Tradeoffs between snooping and directory based 
-    * Slide 71 has a good summary on this 
-  * MOESI
-    * Adds an owned state; an improvement over the MESI protocol
- 
-===== Lecture 30 (4/18 Fri.) =====
- 
- 
- 
-  * Interference
-  * Complexity of the memory scheduler
-    * Ranking/prioritization has a cost
-    * A complex scheduler has higher latency
-  * Performance metrics for multicore/multithreaded applications (see the formulas below)
-    * Speedup
-    * Slowdown
-    * Harmonic vs. weighted speedup
-  * Fairness metric
-    * Maximum slowdown
-      * Why does it make sense?
-      * Are there scenarios where it does not make sense?
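-    * One standard way to write these metrics (notation assumed, not from the slides), where "alone" means the application runs by itself and "shared" means it runs together with others:
-<code latex>
-\text{Slowdown}_i = \frac{T_i^{\text{shared}}}{T_i^{\text{alone}}}
-
-\text{WeightedSpeedup} = \sum_i \frac{\text{IPC}_i^{\text{shared}}}{\text{IPC}_i^{\text{alone}}}
-\qquad
-\text{HarmonicSpeedup} = \frac{N}{\sum_i \text{IPC}_i^{\text{alone}} / \text{IPC}_i^{\text{shared}}}
-
-\text{MaxSlowdown} = \max_i \text{Slowdown}_i
-</code>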
-  * Predictable performance
-    * Why is it important?
-      * In a server environment, different jobs run on the same server
-      * In a mobile environment, there are multiple request sources that can slow one another down
-    * How to relate slowdown to request service rate
-    * MISE: soft slowdown guarantees
-  * BDI (base-delta-immediate compression)
-    * Memory wall
-      * What is the concern regarding the memory wall?
-    * Size of the cache on the (CPU) die
-    * One possible solution: cache compression
-      * What are the problems with existing cache compression mechanisms?
-        * Some are too complex
-        * Decompression is on the critical path
-          * Data must be decompressed on every read -> decompression should not be on the critical path
-          * An important factor in performance
-    * Software compression is not good enough to compress everything
-    * Zero-value compression
-      * Simple
-      * Good compression ratio
-      * What if the data does not have many zeroes?
-    * Frequent-value compression
-      * Some values appear frequently
-      * Simple and good compression ratio
-      * Requires profiling
-      * Decompression is complex
-    * Frequent-pattern compression
-      * Still too complex in terms of decompression
-    * Base-delta compression (see the sketch below)
-      * Easy to decompress but retains the benefit of compression
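-    * An illustrative simplification of the base-delta idea; the 4-word line and 1-byte deltas are assumptions for the sketch, not the actual BDI parameters:
-<code c>
-#include <stdint.h>
-#include <stdio.h>
-
-#define WORDS 4
-
-/* Returns 1 and fills base/deltas if the line is compressible:
- * every word must equal base + a delta that fits in one byte. */
-int bdi_compress(const int64_t line[WORDS], int64_t *base,
-                 int8_t deltas[WORDS])
-{
-    *base = line[0];                      /* first word is the base    */
-    for (int i = 0; i < WORDS; i++) {
-        int64_t d = line[i] - *base;
-        if (d < INT8_MIN || d > INT8_MAX)
-            return 0;                     /* delta too wide: store raw */
-        deltas[i] = (int8_t)d;
-    }
-    return 1;                             /* 32 bytes -> 12 bytes      */
-}
-
-int main(void)
-{
-    int64_t line[WORDS] = {1000, 1003, 998, 1010};  /* close-by values */
-    int64_t base; int8_t d[WORDS];
-    /* Decompression is trivial (off the critical path concern):
-     * word i = base + d[i]. */
-    if (bdi_compress(line, &base, d))
-        printf("compressed: base=%lld\n", (long long)base);
-    return 0;
-}
-</code>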
-===== Lecture 31 (4/28 Mon.) =====  ​ 
- 
-  * Directory-based cache coherence
-    * Each directory has to handle validation/invalidation
-    * Extra cost of synchronization
-    * Need to ensure race conditions are resolved
-  * Interconnection 
-    * Topology 
-      * Bus 
-      * Mesh 
-        * Torus 
-      * Tree 
-      * Butterfly 
-      * Ring 
-        * Bi-directional ring 
-          * More scalable 
-        * Hierarchical ring 
-          * Even more scalable 
-          * More complex 
-      * Crossbar 
-      * etc. 
-    * Circuit switching 
-    * Multistage network 
-      * Butterfly 
-      * Delta network 
-    * Handling contention 
-      * Buffering vs. dropping/​deflection (no buffering) 
-    * Routing algorithms (see the X-Y routing sketch below)
-      * Handling deadlock
-      * X-Y routing
-        * Turn model (to avoid deadlocks)
-      * Add more buffering for an escape path
-      * Oblivious routing
-        * Can take different paths
-          * DOR between intermediate locations
-        * Balances network load
-      * Adaptive routing
-        * Uses the state of the network to determine the route
-          * Aware of local and/or global congestion
-        * Non-minimal adaptive routing can have livelocks
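-    * A minimal sketch of X-Y (dimension-order) routing on a 2D mesh: route fully in X first, then in Y, which avoids the cyclic turn dependences that cause deadlock; port names and coordinates are illustrative:
-<code c>
-#include <stdio.h>
-
-typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } Port;
-
-Port xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
-{
-    if (cur_x < dst_x) return EAST;    /* correct X first          */
-    if (cur_x > dst_x) return WEST;
-    if (cur_y < dst_y) return NORTH;   /* then correct Y           */
-    if (cur_y > dst_y) return SOUTH;
-    return LOCAL;                      /* arrived: eject the packet */
-}
-
-int main(void)
-{
-    /* Hop-by-hop route from (0,0) to (2,1). */
-    int x = 0, y = 0;
-    Port p;
-    while ((p = xy_route(x, y, 2, 1)) != LOCAL) {
-        if (p == EAST) x++; else if (p == WEST) x--;
-        else if (p == NORTH) y++; else y--;
-        printf("now at (%d,%d)\n", x, y);
-    }
-    return 0;
-}
-</code>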
- 
-===== Lecture 32 (4/30 Wed.) =====  ​ 
- 
- 
-  * Serialized code sections
-    * Degrade performance
-    * Waste energy
-  * Heterogeneous cores
-    * Can execute the serialized portion on a powerful large core
-  * Tradeoff between multiple small cores, multiple large cores, or heterogeneous cores
-  * Critical sections
-    * A bottleneck in several multithreaded workloads
-    * Asymmetry can help
-    * Accelerated critical sections (ACS)
-      * Use a large core to run the serialized portion of the code
-      * How to correctly support ACS
-      * False serialization
-      * Handling private/shared data
-    * BIS (bottleneck identification and scheduling)
-      * Identify the bottlenecks
-        * Serial bottleneck
-        * Barrier
-        * Critical section
-        * Pipeline stages
-      * An application might wait on different types of bottlenecks
-      * Allows BottleneckCall and BottleneckReturn
-      * Acceleration can be done in multiple ways
-        * Ship to a big core
-        * Increase the frequency
-        * Prioritize the thread in shared resources (e.g., the memory scheduler schedules that thread's requests first)
-      * The bottleneck table keeps track of each thread's bottlenecks and determines their criticality
- 
-===== Lecture 33 (5/2 Fri.) =====      
- 
- 
-  * DRAM scaling problem 
-  * Possible solutions to the scaling problem 
-    * Lower-leakage DRAM
-    * Heterogeneous DRAM (TL-DRAM, etc.)
-    * Add more functionality to DRAM
-    * Denser designs (3D stacking)
-    * Different technologies
-      * NVM 
-  * Non-volatile memory
-    * Resistive memory
-      * PCM
-        * Inject current to change the phase
-        * Scales better than DRAM
-          * Multiple bits per cell
-            * Wider resistance range
-        * No refresh is needed
-        * Downsides: latency and write endurance
-      * STT-MRAM
-        * Inject current to change the polarity
-      * Memristors
-        * Inject current to change the structure
-    * Persistence: data stays there even without power
-      * Unified memory and storage management (persistent data structures): a single-level store
-        * Improves energy and performance
-        * Simplifies the programming model
- 