Differences

This shows you the differences between two versions of the page.

--- buzzword [2015/02/13 19:19]
rachata
+++ buzzword [2015/03/17 02:29]
kevincha
@@ Line 45: / Line 45: @@
   * Memory wall (a part of scaling issue)
   * Scaling issue
-    * Transister are getting smaller
+    * Transistors are getting smaller
   * Key components of a computer
   * Design points
@@ Line 54: / Line 54: @@
   * Reliability problems that cause errors
   * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
-    * Pre paradigm science
+    * Pre-paradigm science
     * Normal science
-    * Revolutionalry science
+    * Revolutionary science
   * Components of a computer
     * Computation
@@ Line 76: / Line 76: @@
     * Operands
     * Live-outs/Live-ins
-      * DIfferent types of data flow nodes (conditional/relational/barrier)
+      * Different types of data flow nodes (conditional/relational/barrier)
     * How to do transactional transaction in dataflow?
       * Example: bank transactions
@@ Line 121: / Line 121: @@
    * Tradeoffs between 0,1,2,3 address machines
    * Postfix notation
-   * Instructions/Opcode/Operade specifiers (i.e. addressing modes)
+   * Instructions/Opcode/Operand specifiers (i.e. addressing modes)
    * Simple vs. complex data types (and their tradeoffs)
    * Semantic gap and level
@@ Line 236: / Line 236: @@
   * Variable latency memory
   * Handling interrupts
-  * Difference betwen interrupts and exceptions
+  * Difference between interrupts and exceptions
   * Emulator (i.e. uCode allots minimal datapath to emulate the ISA)
   * Updating machine behavior
@@ Line 257: / Line 257: @@
     * Idle resources
   * Throughput of a pipelined design
-    * What dictacts the throughput of a pipelined design?
+    * What dictates the throughput of a pipelined design?
   * Latency of the pipelined design
   * Dependency
@@ Line 336: / Line 336: @@
     * Inst. from different thread can fill-in the bubbles
     * Cost?
-  * Simulteneuos multithreading
+  * Simultaneous multithreading
   * Branch prediction
     * Guess what to fetch next.
@@ Line 345: / Line 345: @@
       * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions.
         * How many extra instructions are being fetched?
-        * What is the performance degredation?
+        * What is the performance degradation?
     * How to reduce the miss penalty?
     * Predicting the next address (non PC+4 address)
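
The misprediction-cost question in this hunk can be worked through with a short calculation. The sketch below is a minimal model, not the lecture's own formula: it assumes a base CPI of 1 with no other stalls, a fixed flush penalty per mispredicted branch, and that every penalty cycle fetches `fetch_width` wrong-path instructions. The function name and the sample numbers are hypothetical.

```python
def misprediction_cost(num_insts, branch_frac, accuracy, penalty_cycles,
                       fetch_width=1, base_cpi=1.0):
    """Estimate the cost of branch mispredictions under a simple fixed-penalty model."""
    branches = num_insts * branch_frac
    mispredicts = branches * (1.0 - accuracy)
    extra_cycles = mispredicts * penalty_cycles      # cycles lost to pipeline flushes
    wrong_path_insts = extra_cycles * fetch_width    # extra instructions fetched (assumed)
    base_cycles = num_insts * base_cpi
    return {
        "mispredicts": mispredicts,
        "extra_cycles": extra_cycles,
        "wrong_path_insts": wrong_path_insts,
        "slowdown": (base_cycles + extra_cycles) / base_cycles,
    }


if __name__ == "__main__":
    # Illustrative inputs: 1M instructions, 20% branches, 20-cycle penalty.
    for acc in (0.98, 0.99):
        result = misprediction_cost(1_000_000, 0.20, acc, 20)
        print(acc, round(result["slowdown"], 4))
```

With these assumed inputs, going from 98% to 99% accuracy halves the number of mispredictions, and the gap widens as the penalty (pipeline depth) grows.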
@@ Line 352: / Line 352: @@
     * Global branch history - for directions
     * Can use compiler to profile and get more info
-      * Input set dictacts the accuracy
+      * Input set dictates the accuracy
       * Add time to compilation
     * Heuristics that are common and don't require profiling.
@@ Line 358: / Line 358: @@
       * Does not require profiling
     * Static branch prediction
-      * Pregrammer provides pragmas, hinting the likelihood of taken/not taken branch
+      * Programmer provides pragmas, hinting the likelihood of taken/not taken branch
       * For example, x86 has the hint bit
     * Dynamic branch prediction
@@ Line 369: / Line 369: @@
     * Why are they very important?
       * Differences between 99% accuracy and 98% accuracy
-      * Cost of a misprediction when the pipeline is veryd eep
+      * Cost of a misprediction when the pipeline is very deep
   * Global branch correlation
     * Some branches are correlated
@@ Line 405: / Line 405: @@
       * Use branch prediction on an easy to predict code
       * Use predicated execution on a hard to predict code
-      * Compiler can be more aggressive in optimimzing the code
+      * Compiler can be more aggressive in optimizing the code
     * What are the tradeoffs (slide# 47)
   * Multi-path execution
@@ Line 411: / Line 411: @@
     * Can lead to wasted work
     * VLIW
-    * SuperScalar
+    * Superscalar
@@ Line 434: / Line 434: @@
   * Reorder buffer
     * Reorder results before they are visible to the arch. state
-      * Need to presearve the sequential sematic and data
+      * Need to preserve the sequential semantics and data
-    * What are the informatinos in the ROB entry
+    * What information is in the ROB entry?
     * Where to get the value from (forwarding path? reorder buffer?)
       * Extra logic to check where the youngest instructions/value is
-      * Content addressible search (CAM)
+      * Content addressable search (CAM)
         * A lot of comparators
     * Different ways to simplify the reorder buffer
@@ Line 450: / Line 450: @@
       * An updated value (Speculative), called future file
       * A backup value (to restore the state quickly)
-    * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone)
+    * Doubles the cost of the regfile, but reduces the area since no content addressable memory is needed (compared to ROB alone)
   * Branch misprediction resembles Exception
     * The difference is that branch misprediction is not visible to the software
@@ Line 486: / Line 486: @@
       * Slides 28 --> register renaming
       * Slides 30-35 --> Exercise (also on the board)
-        * This will be usefull for the midterm
+        * This will be useful for the midterm
     * Register aliasing table
     * Broadcasting tags
     * Using dataflow
+===== Lecture 13 (2/16 Mon.) =====
+  * OoO --> Restricted Dataflow
+    * Extracting parallelism
+    * What are the bottlenecks?
+      * Issue width
+      * Dispatch width
+      * Parallelism in the program
+    * What does it mean to be restricted dataflow?
+      * Still visible as a Von Neumann model
+    * Where does the efficiency come from?
+    * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
+  * Load/store handling
+    * Would like to schedule them out of order, but make them visible in-order
+    * When do you schedule the load/store instructions?
+    * Can we predict if load/store are dependent?
+    * This is one of the most complex structures in load/store handling
+    * What information can be used to make these load/store predictions?
+  * Centralized vs. distributed? What are the tradeoffs?
+  * How to handle a misprediction and its recovery
+    * OoO + branch prediction?
+    * Speculatively update the history register
+      * When do you update the GHR?
+  * Token dataflow arch.
+    * What are tokens?
+    * How to match tokens
+    * Tagged token dataflow arch.
+    * What are the tradeoffs?
+    * Difficulties?
+===== Lecture 14 (2/18 Wed.) =====
+  * SISD/SIMD/MISD/MIMD
+  * Array processor
+  * Vector processor
+  * Data parallelism
+    * Where does the concurrency arise?
+  * Differences between array processors and vector processors
+  * VLIW
+  * Compactness of an array processor
+  * Vector operates on a vector of data (rather than a single datum (scalar))
+    * Vector length (also applies to array processor)
+    * No dependency within a vector --> can have a deep pipeline
+    * Highly parallel (both instruction level (ILP) and memory level (MLP))
+    * But the program needs to be very parallel
+    * Memory can be the bottleneck (due to very high MLP)
+    * What do the functional units look like? Deep pipeline and simpler control.
+    * CRAY-1 is one example of a vector processor
+    * Memory access pattern in a vector processor
+      * How do the memory accesses benefit the memory bandwidth?
+      * Memory level parallelism
+      * Stride length vs. the number of banks
+        * stride length should be relatively prime to the number of banks
+      * Tradeoffs between row major and column major --> How can the vector processor deal with the two?
+    * How to calculate the efficiency and performance of vector processors
+    * What if there are multiple memory ports?
+    * Gather/Scatter allows a vector processor to be a lot more programmable (i.e. gather data for parallelism)
+      * Helps handle sparse matrices
+    * Conditional operation
+    * Structure of vector units
+    * How to automatically parallelize code through the compiler?
+      * This is a hard problem. Compiler does not know the memory address.
+  * What do we need to ensure for both vector and array processors?
+  * Sequential bottleneck
+    * Amdahl's law
+  * Intel MMX --> An example of Intel's approach to SIMD
+    * No VLEN, use OpCode to define the length
+    * Stride is one in MMX
+    * Intel SSE --> Modern version of MMX
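
The Lecture 14 note that the stride length should be relatively prime to the number of banks can be checked with a small experiment. This is a minimal sketch, assuming word-interleaved banks where an address maps to bank `address % num_banks`; the function names are hypothetical.

```python
from math import gcd


def banks_touched(stride, num_banks, num_accesses):
    """Count the distinct banks hit by a strided vector access (element i at i * stride)."""
    return len({(i * stride) % num_banks for i in range(num_accesses)})


def banks_reachable(stride, num_banks):
    """Closed form: a long strided stream reaches num_banks / gcd(stride, num_banks) banks."""
    return num_banks // gcd(stride, num_banks)


if __name__ == "__main__":
    for stride in (1, 2, 3, 8):
        print(stride, banks_touched(stride, 8, 64), banks_reachable(stride, 8))
```

With 8 banks, strides 1 and 3 spread accesses over all banks, stride 2 uses only half of them, and stride 8 sends every access to the same bank, which serializes the accesses.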
+===== Lecture 15 (2/20 Fri.) =====
+  * GPU
+    * Warp/Wavefront
+      * A bunch of threads sharing the same PC
+    * SIMT
+    * Lanes
+    * FGMT + massively parallel
+      * Tolerate long latency
+    * Warp based SIMD vs. traditional SIMD
+  * SPMD (Programming model)
+    * Single program operates on multiple data
+      * Can have synchronization points
+    * Many scientific applications are programmed in this manner
+  * Control flow problem (branch divergence)
+    * Masking (in a branch, mask threads that should not execute that path)
+    * Lower SIMD efficiency
+    * What if you have layers of branches?
+  * Dynamic warp formation
+    * Combining threads from different warps to increase SIMD utilization
+    * This can cause memory divergence
+  * VLIW
+    * Wide fetch
+    * IA-64
+    * Tradeoffs
+      * Simple hardware (no dynamic scheduling, no dependency checking within VLIW)
+      * A lot of work at the compiler level
+  * Decoupled access/execute
+    * Limited form of OoO
+    * Tradeoffs
+    * How to steer the instructions (determine dependency/stalling)?
+    * Instruction scheduling techniques (static vs. dynamic)
+  * Systolic arrays
+    * Processing elements transform data in chains
+    * Developed for image processing (for example, convolution)
+  * Stage processing
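
The masking and SIMD-efficiency items under the control flow problem in Lecture 15 can be illustrated with a toy model. This is a minimal sketch, assuming a 4-lane warp and a single if/else branch; the branch body, the function name, and the efficiency metric (active lane slots divided by issued lane slots) are illustrative choices, not definitions from the lecture.

```python
def run_divergent_branch(values, warp_size=4):
    """Execute `v * 2 if v is even else v + 1` for one warp using per-path masks."""
    assert len(values) == warp_size
    taken = [v % 2 == 0 for v in values]          # per-lane branch outcome
    paths = (
        (taken, lambda v: v * 2),                 # taken path
        ([not t for t in taken], lambda v: v + 1) # not-taken path
    )
    results = list(values)
    passes = 0
    active_slots = 0
    for mask, op in paths:
        if not any(mask):
            continue                              # no lane needs this path
        passes += 1                               # the whole warp issues this path
        for lane, active in enumerate(mask):
            if active:                            # masked-off lanes do no useful work
                results[lane] = op(values[lane])
                active_slots += 1
    efficiency = active_slots / (passes * warp_size)
    return results, efficiency


if __name__ == "__main__":
    print(run_divergent_branch([2, 4, 6, 8]))  # uniform branch
    print(run_divergent_branch([2, 3, 4, 6]))  # divergent branch
```

When all lanes agree, the warp makes one pass at full utilization; when they diverge, both paths are issued and each pass leaves some lanes idle.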
+===== Lecture 16 (2/23 Mon.) =====
+  * Systolic arrays
+    * Processing elements transform data in chains
+    * Can be arrays of multi-dimensional processing elements
+    * Developed for image processing (for example, convolution)
+    * Can be used to break pipelined programs into stages, using a set of queues and processing elements
+    * Can enable high concurrency; good for regular programs
+    * Very special purpose
+    * The warp computer
+  * Static instruction scheduling
+    * How do we find the next instruction to execute?
+  * Live-in and live-out
+  * Basic blocks
+    * Rearranging instructions in the basic block
+    * Code movement from one basic block to another
+  * Straight line code
+  * Independent instructions
+    * How to identify independent instructions
+  * Atomicity
+  * Trace scheduling
+    * Side entrance
+    * Fix-up code
+    * How scheduling is done
+  * Instruction scheduling
+    * Prioritization heuristics
+  * Superblock
+    * Traces with no side-entrance
+  * Hyperblock
+  * BS-ISA
+  * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA
+===== Lecture 17 (2/25 Wed.) =====
+  * IA-64
+    * EPIC
+  * IA-64 instruction bundle
+    * Multiple instructions in the bundle along with the template bits
+    * Template bits
+    * Stop bits
+    * Non-faulting loads and exception propagation
+  * Aggressive ST-LD reordering
+  * Physical memory system
+  * Ideal pipelines
+  * Ideal cache
+    * More capacity
+    * Fast
+    * Cheap
+    * High bandwidth
+  * DRAM cell
+    * Cheap
+    * Sense the perturbation through sense amplifier
+    * Slow and leaky
+  * SRAM cell (Cross coupled inverter)
+    * Expensive
+    * Fast (easier to sense the value in the cell)
+  * Memory bank
+    * Read access sequence
+    * DRAM: Activate -> Read -> Precharge (if needed)
+    * What dominates the access latency for DRAM and SRAM?
+  * Scaling issue
+    * Hard to scale the cell to be small
+  * Memory hierarchy
+    * Prefetching
+    * Caching
+  * Spatial and temporal locality
+    * Cache can exploit these
+    * Recently used data is likely to be accessed
+    * Nearby data is likely to be accessed
+  * Caching in a pipeline design
+  * Cache management
+    * Manual
+      * Data movement is managed manually
+        * Embedded processor
+        * GPU scratchpad
+    * Automatic
+      * HW manages data movement
+  * Latency analysis
+    * Based on the hit and miss status, next level access time (if miss), and the current level access time
+  * Cache basics
+    * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc.
+  * Cache access
+    * How to access tag and data (in parallel vs serially)
+    * How do tag and index get used?
+    * Modern processors perform serial access for higher level cache (L3 for example) to save power
+  * Cost and benefit of having more associativity
+    * Given the associativity, which block should be replaced if the set is full?
+    * Replacement policy
+      * Random
+      * Least recently used (LRU)
+      * Least frequently used
+      * Least costly to refetch
+      * etc.
+  * How to implement LRU
+    * How to keep track of access ordering
+      * Complexity increases rapidly
+    * Approximate LRU
+      * Victim and next Victim policy
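
The latency analysis item in Lecture 17 describes a recursive relationship: the latency at one level depends on its own access time, its hit rate, and the latency of the next level on a miss. The sketch below is one way to write that down, assuming a serial lookup where a miss pays the current level's access time before going to the next level; the function name and sample numbers are hypothetical.

```python
def access_latency(levels, memory_latency):
    """Average latency seen at the first level of a hierarchy.

    levels: list of (access_time, hit_rate) pairs, ordered from L1 outward.
    memory_latency: latency of the final level, which is assumed to always hit.
    """
    latency = memory_latency
    # Work from the outermost cache back toward L1.
    for access_time, hit_rate in reversed(levels):
        latency = access_time + (1.0 - hit_rate) * latency
    return latency


if __name__ == "__main__":
    # Illustrative hierarchy: L1 = 1 cycle, 90% hits; L2 = 10 cycles, 80% hits; memory = 100 cycles.
    print(access_latency([(1, 0.9), (10, 0.8)], 100))
```

For the assumed numbers, L2 sees 10 + 0.2 × 100 = 30 cycles on average, so L1 sees 1 + 0.1 × 30 = 4 cycles.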
+===== Lecture 18 (2/27 Fri.) =====
+  * Tag store and data store
+  * Cache hit rate
+  * Average memory access time (AMAT)
+  * AMAT vs. Stall time
+  * Cache basics
+    * Direct mapped vs. associative cache
+    * Set/block (line)/Placement/replacement
+    * How do tag and index get used?
+  * Full associativity
+  * Set associative cache
+    * insertion, promotion, eviction (replacement)
+  * Various replacement policies
+  * How to implement LRU
+    * How to keep track of access ordering
+      * Complexity increases rapidly
+    * Approximate LRU
+      * Victim and next Victim policy
+  * Set thrashing
+    * Working set is bigger than the associativity
+  * Belady's OPT
+    * Is this optimal?
+    * Complexity?
+  * DRAM as a cache for disk
+  * Handling writes
+    * Write through
+      * Simpler, no consistency issues
+    * Write back
+      * Need a modified (dirty) bit to make sure accesses to data get the updated data
+  * Sectored cache
+    * Use subblock
+      * lower bandwidth
+      * more complex
+  * Instruction vs data cache
+    * Where to place instructions
+      * Unified vs. separated
+    * In the first level cache
+  * Cache access
+    * First level access
+    * Second level access
+      * When to start the second level access
+  * Cache performance
+    * capacity
+    * block size
+    * associativity
+  * Classification of cache misses
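
The Lecture 18 items on insertion, promotion, eviction, and set thrashing can be seen in a model of a single cache set with true LRU replacement. This is a minimal sketch using Python's `OrderedDict` to track access order; the class and function names are hypothetical, and real hardware would typically approximate this ordering rather than store it exactly.

```python
from collections import OrderedDict


class LRUSet:
    """One set of a set-associative cache with true LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tags ordered from LRU (first) to MRU (last)

    def access(self, tag):
        """Return True on a hit, False on a miss."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)          # promotion to MRU
            return True
        if len(self.blocks) >= self.ways:
            self.blocks.popitem(last=False)       # eviction of the LRU block
        self.blocks[tag] = None                   # insertion at MRU
        return False


def hit_rate(ways, trace):
    cache_set = LRUSet(ways)
    hits = sum(cache_set.access(tag) for tag in trace)
    return hits / len(trace)


if __name__ == "__main__":
    trace = [0, 1, 2] * 10                        # three blocks mapping to the same set
    print(hit_rate(2, trace), hit_rate(3, trace))
```

With three blocks cycling through a 2-way set, LRU always evicts the block that is needed next, so every access misses; one more way removes the thrashing.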
+===== Lecture 19 (03/02 Mon.) =====
+  * Subblocks
+  * Victim cache
+    * Small, but fully assoc. cache behind the actual cache
+    * Caches blocks evicted on misses
+    * Prevent ping-ponging
+  * Pseudo associativity
+    * Simpler way to implement associative cache
+  * Skewed assoc. cache
+    * Different hashing functions for each way
+  * Restructure data access pattern
+    * Order of loop traversal
+    * Blocking
+  * Memory level parallelism
+    * The cost per miss is lower for parallel cache misses than for serial misses
+  * MSHR
+    * Keeps track of pending cache misses
+      * Think of this as a load/store-buffer-like structure for the cache
+    * What information goes into the MSHR?
+    * When do you access the MSHR?
+  * Memory banks
+  * Shared caches in multi-core processors
+===== Lecture 20 (03/04 Wed.) =====
+  * Virtual vs. physical memory
+  * System's management of memory
+    * Benefits
+  * Problem: physical memory has limited size
+  * Mechanisms: indirection, virtual addresses, and translation
+  * Demand paging
+  * Physical memory as a cache
+  * Tasks of system SW for VM
+  * Serving a page fault
+  * Address translation
+  * Page table
+    * PTE (page table entry)
+  * Page replacement algorithm
+    * CLOCK algo.
+    * Inverted page table
+  * Page size trade-offs
+  * Protection
+  * Multi-level page tables
+  * x86 implementation of page table
+  * TLB
+    * Handling misses
+  * When to do address translation?
+  * Homonyms and synonyms
+    * Homonyms: the same VA maps to different PAs in different processes
+    * Synonyms: Multiple VAs map to the same PA
+      * Shared libraries, shared data, copy-on-write
+  * Virtually indexed vs. physically indexed
+  * Virtually tagged vs. physically tagged
+  * Virtually indexed physically tagged
+  * Can these create problems when we have the cache?
+  * How to eliminate these problems?
+  * Page coloring
+  * Interaction between cache and TLB
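
The CLOCK algorithm listed under page replacement in Lecture 20 can be sketched in a few lines. This is one common variant, assuming a reference bit per frame that is set on every access (including the initial load) and a hand that clears set bits until it finds a frame with a clear bit; the class name and the trace are hypothetical.

```python
class ClockReplacer:
    """Second-chance (CLOCK) page replacement over a fixed number of frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames   # page held by each frame
        self.ref = [False] * num_frames    # reference bit per frame
        self.hand = 0

    def access(self, page):
        """Touch a page; return True if the access caused a page fault."""
        if page in self.pages:
            self.ref[self.pages.index(page)] = True
            return False
        # Sweep: frames with the reference bit set get a second chance.
        while self.ref[self.hand]:
            self.ref[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        self.pages[self.hand] = page       # replace the victim (or fill an empty frame)
        self.ref[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return True


if __name__ == "__main__":
    clock = ClockReplacer(3)
    print([clock.access(p) for p in [1, 2, 3, 4, 2, 5]], clock.pages)
```

In this trace, page 2 is touched again before the hand reaches it, so it survives the next sweep and page 3 is evicted instead.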

18-447 Introduction to Computer Architecture – Spring 2015
