  * Memory wall (a part of scaling issue)
  * Scaling issue
    * Transistors are getting smaller
  * Key components of a computer
  * Design points
  * Reliability problems that cause errors
  * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book)
    * Pre-paradigm science
    * Normal science
    * Revolutionary science
  * Components of a computer
    * Computation
    * Operands
    * Live-outs/Live-ins
      * Different types of data flow nodes (conditional/relational/barrier)
    * How to do transactions in dataflow?
      * Example: bank transactions
  * Tradeoffs between 0,1,2,3 address machines
  * Postfix notation
  * Instructions/Opcode/Operand specifiers (i.e. addressing modes)
  * Simple vs. complex data types (and their tradeoffs)
  * Semantic gap and level
  * Variable latency memory
  * Handling interrupts
  * Difference between interrupts and exceptions
  * Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
  * Updating machine behavior
    * Idle resources
  * Throughput of a pipelined design
    * What dictates the throughput of a pipelined design?
  * Latency of the pipelined design
  * Dependency
    * Inst. from different threads can fill in the bubbles
    * Cost?
  * Simultaneous multithreading
  * Branch prediction
    * Guess what to fetch next.
      * Given the program/number of instructions, percent of branches, branch prediction accuracy, and penalty cost, how to compute the cost coming from branch mispredictions (worked example after this list).
        * How many extra instructions are being fetched?
        * What is the performance degradation?
      * How to reduce the miss penalty?
    * Predicting the next address (non PC+4 address)
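A worked version of that computation, with assumed numbers (20% branches, 95% prediction accuracy, 20-cycle flush penalty):

<code latex>
\text{stall CPI} = f_{\text{branch}} \times (1 - \text{accuracy}) \times \text{penalty}
                 = 0.20 \times (1 - 0.95) \times 20 = 0.2
</code>

With a base CPI of 1, the machine runs at CPI 1.2, a 20% performance degradation; the extra fetched instructions are the wrong-path ones between each misprediction and the pipeline flush.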
    * Global branch history - for directions
    * Can use compiler to profile and get more info
      * Input set dictates the accuracy
      * Adds time to compilation
    * Heuristics that are common and don't require profiling.
      * Does not require profiling
    * Static branch prediction
      * Programmer provides pragmas, hinting the likelihood of a taken/not-taken branch
      * For example, x86 has the hint bit
    * Dynamic branch prediction
    * Why are they very important?
      * Differences between 99% accuracy and 98% accuracy (see the numbers after this list)
      * Cost of a misprediction when the pipeline is very deep
  * Global branch correlation
    * Some branches are correlated
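The 99% vs. 98% point in numbers: the drop doubles the misprediction rate (1% to 2%), so with the same assumed parameters as the example above, the stall CPI also doubles:

<code latex>
0.20 \times 0.01 \times 20 = 0.04 \quad\text{vs.}\quad 0.20 \times 0.02 \times 20 = 0.08
</code>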
      * Use branch prediction on easy-to-predict code
      * Use predicated execution on hard-to-predict code
      * Compiler can be more aggressive in optimizing the code
    * What are the tradeoffs (slide #47)
  * Multi-path execution
    * Can lead to wasted work
    * VLIW
    * Superscalar
  
  
  * Reorder buffer
    * Reorder results before they are visible to the arch. state
      * Need to preserve the sequential semantics and data
    * What information is in the ROB entry (see the sketch after this list)
      * Where to get the value from (forwarding path? reorder buffer?)
        * Extra logic to check where the youngest instruction/value is
      * Content addressable search (CAM)
        * A lot of comparators
    * Different ways to simplify the reorder buffer
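A minimal sketch of what a ROB entry might hold; the field names and widths are illustrative assumptions, not taken from the slides:

<code c>
#include <stdint.h>

typedef struct {
    uint8_t  valid;      /* entry allocated */
    uint8_t  done;       /* result produced, safe to retire */
    uint8_t  exception;  /* raised flag, handled only at retirement */
    uint8_t  dest_reg;   /* architectural destination register */
    uint64_t value;      /* result, copied to arch. state at retirement */
    uint64_t pc;         /* for precise exceptions / mispredict recovery */
} rob_entry_t;
</code>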
      * An updated value (speculative), called the future file
      * A backup value (to restore the state quickly)
    * Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to ROB alone)
  * Branch misprediction resembles an exception
    * The difference is that a branch misprediction is not visible to the software
      * Slides 28 --> register renaming (minimal sketch after this list)
      * Slides 30-35 --> exercise (also on the board)
        * This will be useful for the midterm
    * Register alias table
    * Broadcasting tags
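A minimal register-renaming sketch (16 architectural registers and ROB-indexed tags are assumptions of this example); it shows only the alias-table bookkeeping, not the tag broadcast/wakeup logic:

<code c>
#define NUM_ARCH_REGS 16

/* rat[r] = ROB tag of the youngest in-flight writer of r,
 * or -1 if the latest value is in the register file.
 * (All entries start at -1 after reset.) */
static int rat[NUM_ARCH_REGS];

/* Rename "dst <- src1 op src2" once it is given ROB entry rob_tag. */
void rename(int dst, int src1, int src2, int rob_tag,
            int *src1_tag, int *src2_tag)
{
    *src1_tag = rat[src1];   /* -1: read the register file directly */
    *src2_tag = rat[src2];   /* >=0: wait for this tag to broadcast */
    rat[dst] = rob_tag;      /* later readers now depend on this tag */
}
</code>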
      * Still visible as a Von Neumann model
    * Where does the efficiency come from?
    * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * Would like to schedule them out of order, but make them visible in-order
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
    * What do the functional units look like? Deep pipeline and simpler control.
    * CRAY-I is one example of a vector processor
    * Memory access pattern in a vector processor
    * What if there are multiple memory ports?
    * Gather/Scatter allows vector processors to be a lot more programmable (i.e. gather data for parallelism; sketch after this list)
      * Helps with handling sparse matrices
    * Conditional operation
    * Structure of vector units
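What a gather does, written out in scalar C (the index vector idx is an assumption of this example): it packs scattered elements, e.g. the nonzeros of a sparse matrix row, into a dense register so the vector lanes can work on them:

<code c>
void gather(const double *mem, const int *idx, int n, double *vreg)
{
    for (int i = 0; i < n; i++)
        vreg[i] = mem[idx[i]];   /* one element per vector lane */
}
</code>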
    * Lower SIMD efficiency
    * What if you have layers of branches?
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
    * How to steer the instruction (determine dependency/stalling)?
    * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
  
===== Lecture 16 (2/23 Mon.) =====
  * Systolic arrays
    * Processing elements transform data in chains (1-D sketch after this list)
    * Can be arrays of multi-dimensional processing elements
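A software stand-in for a 1-D systolic convolution chain (the 3-tap filter is an assumption): in hardware, each k below is a separate PE that holds one weight, accumulates a partial sum, and forwards the input sample to its neighbor every cycle:

<code c>
void systolic_conv1d(const int *x, int n, const int w[3], int *y)
{
    for (int t = 0; t < n; t++) {       /* one beat per input sample */
        int acc = 0;
        for (int k = 0; k < 3; k++)     /* k plays the role of PE k  */
            if (t - k >= 0)
                acc += w[k] * x[t - k];
        y[t] = acc;
    }
}
</code>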
    * Side entrance
    * Fix-up code
    * How scheduling is done
  * Instruction scheduling
    * Prioritization heuristics
  * Hyperblock
  * BS-ISA
  * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA
===== Lecture 17 (2/25 Wed.) =====
    * Non-faulting loads and exception propagation
  * Aggressive ST-LD reordering
  * Physical memory system
  * Ideal pipelines
  * Ideal cache
  * DRAM cell
    * Cheap
    * Sense the perturbation through a sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
    * Read access sequence
    * DRAM: Activate -> Read -> Precharge (if needed)
    * What dominates the access latency for DRAM and SRAM
  * Scaling issue
    * Hard to scale the cell to be small
  * Cost and benefit of having more associativity
    * Given the associativity, which block should be replaced if the set is full
    * Replacement policy (LRU sketch after this list)
      * Random
      * Least recently used (LRU)
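A minimal LRU sketch for one set of an assumed 4-way cache, using an age counter per way:

<code c>
#define WAYS 4

typedef struct { unsigned tag, age; int valid; } way_t;

/* Victim = an invalid way if one exists, else the oldest way. */
int pick_victim(const way_t set[WAYS])
{
    int victim = 0;
    for (int w = 0; w < WAYS; w++)
        if (!set[w].valid) return w;
    for (int w = 1; w < WAYS; w++)
        if (set[w].age > set[victim].age) victim = w;
    return victim;
}

/* On every access: age everyone, mark the touched way youngest. */
void touch(way_t set[WAYS], int hit_way)
{
    for (int w = 0; w < WAYS; w++) set[w].age++;
    set[hit_way].age = 0;
}
</code>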
    * Caches the missed cache block
    * Prevents ping-ponging
  * Pseudo-associativity
    * Simpler way to implement an associative cache
  * Skewed assoc. cache
  * Memory banks
  * Shared caches in multi-core processors
===== Lecture 20 (03/04 Wed.) =====
  * Virtual vs. physical memory
  * System's management of memory
    * Benefits
  * Problem: physical memory has limited size
  * Mechanisms: indirection, virtual addresses, and translation
  * Demand paging
  * Physical memory as a cache
  * Tasks of system SW for VM
  * Serving a page fault
  * Address translation (minimal sketch after this list)
  * Page table
    * PTE (page table entry)
  * Page replacement algorithm
    * CLOCK algo.
    * Inverted page table
  * Page size trade-offs
  * Protection
  * Multi-level page tables
  * x86 implementation of page table
  * TLB
    * Handling misses
  * When to do address translation?
  * Homonyms and synonyms
    * Homonym: same VA maps to different PAs across multiple processes
    * Synonym: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
  * Virtually indexed vs. physically indexed
  * Virtually tagged vs. physically tagged
  * Virtually indexed, physically tagged
  * Can these create problems when we have caches?
  * How to eliminate these problems?
  * Page coloring
  * Interaction between cache and TLB
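A minimal translation sketch, assuming 4 KB pages and a flat page table (real x86 walks the multi-level tables listed above):

<code c>
#include <stdint.h>

#define PAGE_SHIFT 12   /* 4 KB pages */

typedef struct {
    uint64_t pfn      : 52;  /* physical frame number */
    uint64_t present  : 1;
    uint64_t writable : 1;
} pte_t;

/* Returns the physical address, or ~0 if the OS must first serve a
 * page fault (demand paging). */
uint64_t translate(const pte_t *page_table, uint64_t vaddr)
{
    uint64_t vpn    = vaddr >> PAGE_SHIFT;
    uint64_t offset = vaddr & ((1ull << PAGE_SHIFT) - 1);
    if (!page_table[vpn].present)
        return ~0ull;                       /* page fault */
    return (page_table[vpn].pfn << PAGE_SHIFT) | offset;
}
</code>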
===== Lecture 21 (03/23 Mon.) =====

  * DRAM scaling problem
  * Demands/trends affecting the main memory
    * More capacity
    * Low energy
    * More bandwidth
    * QoS
  * ECC in DRAM
  * Multi-porting
    * Virtual multi-porting
      * Time-share the port; not too scalable, but cheap
    * True multiporting
  * Multiple cache copies
  * Alignment
  * Banking
    * Can have bank conflicts
    * Extra interconnects across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regfile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  * Bank mapping
    * How to avoid bank conflicts?
  * Channel mapping
    * Address mapping to minimize bank conflicts
    * Page coloring
      * Virtual-to-physical mapping that can help reduce conflicts
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
    * Sense amplifier
    * Bit lines
    * Word lines
  * DRAM (1T1C) vs. SRAM (6T)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effects of address mapping on memory interleaving
    * Effects of the program's memory access patterns on interleaving
  * DRAM Bank
    * To minimize the cost of interleaving (share the data bus and the command bus)
  * DRAM Rank
    * Minimize the cost of the chip (a bundle of chips operated together)
  * DRAM Channel
    * An interface to DRAM, each with its own ranks/banks
  * DRAM Chip
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits from an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping (bit-slicing sketch after this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sending DRAM commands
    * Periodically send commands to refresh DRAM cells
    * Ensure correctness and data integrity
    * Where to place the memory controller
      * On CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determine the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
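A bit-slicing sketch of the row-interleaved mapping (the field widths are assumed); cache-block interleaving instead places the bank bits just above the block offset, so consecutive blocks land in different banks:

<code c>
#include <stdint.h>

#define COL_BITS  10
#define BANK_BITS 3

/* Row interleaved: | row | bank | column | */
void map_address(uint64_t paddr,
                 uint64_t *row, unsigned *bank, unsigned *col)
{
    *col  =  paddr              & ((1u << COL_BITS)  - 1);
    *bank = (paddr >> COL_BITS) & ((1u << BANK_BITS) - 1);
    *row  =  paddr >> (COL_BITS + BANK_BITS);
}
</code>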
===== Lecture 22 (03/25 Wed.) =====
  * Flash controller
  * Flash memory
  * Garbage collection in flash
  * Overhead in flash memory
    * Erase (off the critical path, but takes a long time)
  * Different types of DRAM
  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifiers
    * How do they work
  * Dual data rate
  * Subarray
  * RowClone
    * Moving bulk data from one row to another
    * Lower latency and BW use when performing copies/zeroing out data
  * TL-DRAM
    * Far segment
    * Near segment
    * What causes the long latency
    * Benefit of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits from an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sending DRAM commands
    * Periodically send commands to refresh DRAM cells
    * Ensure correctness and data integrity
    * Where to place the memory controller
      * On CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determine the order of requests that will be serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever the request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS (sketch after this list)
      * Tries to maximize row buffer hit rate
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually done at the command level (read/write commands and precharge/activate commands)
    * PAR-BS
      * Key benefits
      * Stall time
      * Shortest job first
    * STFM
    * ATLAS
    * TCM
      * Key benefits
      * Configurability
      * Fairness + performance at the same time
      * Robustness issues
  * Open row policy
  * Closed row policy
  * QoS
    * QoS issues in memory scheduling
    * Fairness
    * Performance guarantee
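A minimal FR-FCFS arbitration sketch (the request/queue structures are assumptions of this example): row-buffer hits win, and ties go to the oldest request:

<code c>
typedef struct { int valid, bank, row; long arrival; } req_t;

/* open_row[b] = row currently latched in bank b's row buffer.
 * Returns the index of the request to service next, -1 if none. */
int pick_fr_fcfs(const req_t *q, int n, const int *open_row)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (!q[i].valid) continue;
        if (best < 0) { best = i; continue; }
        int hit      = (q[i].row    == open_row[q[i].bank]);
        int best_hit = (q[best].row == open_row[q[best].bank]);
        int better   = (hit != best_hit)
                         ? hit                               /* first-ready */
                         : (q[i].arrival < q[best].arrival); /* then FCFS   */
        if (better) best = i;
    }
    return best;
}
</code>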
===== Lecture 23 (03/27 Fri.) =====

  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keep track of applications' characteristics
        * Dedicating a channel might waste bandwidth
        * Need OS support to determine the channel bits
    * Source throttling
      * A controller throttles the cores depending on the performance target
      * Example: fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold-based solution: hard to configure
    * App/thread scheduling
      * Critical threads usually stall the progress
    * Designing a DRAM controller
      * Has to handle the normal DRAM operations
        * Read/write/refresh and all the timing constraints
      * Keep track of resources
      * Assign priorities to different requests
      * Manage requests to banks
    * Self-optimizing controller
      * Use machine learning to improve the DRAM controller
    * A-DRM
      * Architecture-aware DRAM
  * Multithreading
    * Synchronization
    * Pipelined programs
      * Producer-consumer model
    * Critical path
    * Limiter threads
    * Prioritization between threads
  * Different power modes in DRAM
  * DRAM refresh
    * Why DRAM has to refresh every 64 ms
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Has to spend a longer time refreshing with bigger DRAM
    * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner
      * As opposed to burst refresh (long pause time)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
      * Profile the rows
        * High temperature changes the retention time: need online profiling
  * Bloom filter (sketch after this list)
    * Represents set membership
    * Approximate
    * Can contain false positives
      * Better/more hash functions help reduce them
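A minimal Bloom filter sketch (the two hash functions and the 1024-bit array are assumptions of this example). It never gives false negatives, only occasional false positives, which is the property RAIDR relies on for its retention-time bins:

<code c>
#include <stdint.h>

#define M 1024                          /* filter size in bits */
static uint8_t bits[M / 8];

static unsigned h1(uint32_t x) { return (x * 2654435761u) % M; }
static unsigned h2(uint32_t x) { return ((x ^ (x >> 16)) * 40503u) % M; }

void bf_insert(uint32_t key)
{
    bits[h1(key) / 8] |= 1u << (h1(key) % 8);
    bits[h2(key) / 8] |= 1u << (h2(key) % 8);
}

/* 0 = definitely not in the set; 1 = probably in the set. */
int bf_maybe_contains(uint32_t key)
{
    return ((bits[h1(key) / 8] >> (h1(key) % 8)) & 1) &&
           ((bits[h2(key) / 8] >> (h2(key) % 8)) & 1);
}
</code>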
===== Lecture 24 (03/30 Mon.) =====

  * Simulation
    * Drawbacks of RTL simulations
      * Time consuming
      * Complex to develop
      * Hard to perform design exploration
    * Explore the design space quickly
    * Match the behavior of existing systems
    * Tradeoffs: speed, accuracy, flexibility
    * High-level simulation vs. detailed simulation
      * High-level simulation is faster, but has lower accuracy
  * Controllers that work on multiple types of cores
    * Design problem: how can the controller find a good scheduling policy on its own?
    * Self-optimizing memory controller: using machine learning
      * Can adapt to the applications
      * The complexity is very high
  * Tolerating latency can be costly
    * Instruction window is complex
      * Benefit also diminishes
    * Designing the buffers can be complex
    * A simpler way to tolerate latency than full out-of-order is desirable
  * Different sources that cause the core to stall in OoO
    * Cache miss
    * Note that the stall happens if the inst. window is full
  * Scaling instruction window size is hard
    * It is better (less complex) to make the window more efficient
  * Runahead execution (schematic sketch after this list)
    * Try to obtain MLP w/o increasing instruction windows
    * Runahead (i.e. execute ahead) when there is a long memory instruction
      * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
      * Execute future instructions to generate accurate prefetches
      * Allow future data to be in the cache
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing on the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
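A schematic of the runahead enter/exit sequence; every function below is a placeholder for a hardware mechanism, not a real API:

<code c>
/* Placeholders (assumptions of this sketch) */
void checkpoint_arch_state(void);
void enter_runahead_mode(void);
int  miss_data_returned(void);
void execute_ahead(void);
void restore_checkpoint(void);

void on_l2_miss_at_rob_head(void)
{
    checkpoint_arch_state();     /* needed to enter runahead mode       */
    enter_runahead_mode();       /* results marked invalid; stores go to
                                    the runahead cache, not to memory   */
    while (!miss_data_returned())
        execute_ahead();         /* future misses become prefetches     */
    restore_checkpoint();        /* resume normal execution from the
                                    blocking load                       */
}
</code>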
===== Lecture 25 (4/1 Wed.) =====

  * More runahead execution
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing on the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
    * Cost and benefit of runahead execution (slide number 27)
    * Runahead can have inefficiencies
      * Runahead periods that are useless
        * Get rid of the useless, inefficient periods
    * What if there is a dependent cache miss
      * Cannot be parallelized in vanilla runahead
      * Can predict the value of the dependent load
        * How to predict the address of the load
          * Delta value information
          * Stride predictor
          * AVD prediction
  * Questions regarding prefetching
    * What to prefetch
    * When to prefetch
    * How do we prefetch
    * Where to prefetch from
  * Prefetching can cause thrashing (evict a useful block)
  * Prefetching can also be useless (not being used)
    * Needs to be efficient
  * Can cause memory bandwidth problems in GPUs
  * Prefetch the whole block, more than one block, or a subblock?
    * Each one of them has pros and cons
    * Big prefetches are more likely to waste bandwidth
    * Commonly done at cache block granularity
  * Prefetch accuracy: fraction of useful prefetches out of all prefetches
  * Prefetchers usually predict based on
    * Past knowledge
    * Compiler hints
  * Prefetchers have to prefetch at the right time
    * A prefetch that is too early might get evicted
      * It might also evict other useful data
    * A prefetch that is too late does not hide the whole memory latency
  * Previous prefetches at the same PC can be used as the history
  * Previous demand requests are also good information to use for prefetches
  * Prefetch buffer
    * Place the prefetched data there to avoid thrashing
      * Can treat demand/prefetch requests separately
      * More complex
  * Generally, a demand block is more important
    * This means eviction should prefer prefetch blocks over demand blocks
  * Tradeoffs between where we place the prefetcher
    * Look at L1 hits and misses
    * Look at L1 misses only
    * Look at L2 misses
    * Different access patterns affect accuracy
      * Tradeoff between handling more requests (seeing L1 hits and misses) and less visibility (only seeing L2 misses)
  * Software vs. hardware vs. execution-based prefetching
    * Software: ISAs provide prefetch instructions; software utilizes them
      * What information is useful
      * How to make sure the prefetch is timely
      * What if you have a pointer-based structure
        * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one timely enough)
          * Can be solved by hinting the next-next and/or next-next-next address
    * Hardware: identify the pattern and prefetch
    * Execution driven: opportunistically try to prefetch (runahead, dual-core execution)
  * Stride prefetcher (sketch after this list)
    * Predict strides, which are common in many programs
    * Cache block based or instruction based
  * Stream buffer design
    * Buffer the stream of accesses (next addresses)
    * Use the information to prefetch
  * What affects prefetcher performance
    * Prefetch distance
      * How far ahead should we prefetch
    * Prefetch degree
      * How many prefetches do we issue
  * Prefetcher performance
    * Coverage
      * Out of the demand misses, how many are covered by prefetch requests
    * Accuracy
      * Out of all the prefetch requests, how many actually get used
    * Timeliness
      * How much memory latency can we hide with the prefetch requests
    * Cache pollution
      * How many demand misses did the prefetcher cause?
        * Hard to quantify
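A PC-indexed stride prefetcher sketch (the table size, confidence threshold, and prefetch distance are assumed values):

<code c>
#include <stdint.h>

typedef struct { uint64_t last_addr; int64_t stride; int conf; } entry_t;
static entry_t table[256];

#define DISTANCE 4   /* prefetch distance: how far ahead we fetch */

/* Called on each load; returns a prefetch address, or 0 for none. */
uint64_t on_load(uint64_t pc, uint64_t addr)
{
    entry_t *e = &table[(pc >> 2) & 255];
    int64_t stride = (int64_t)(addr - e->last_addr);
    if (stride != 0 && stride == e->stride) {
        if (e->conf < 2) e->conf++;         /* same stride again */
    } else {
        e->stride = stride;                 /* retrain */
        e->conf = 0;
    }
    e->last_addr = addr;
    return (e->conf >= 2) ? addr + DISTANCE * e->stride : 0;
}
</code>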
===== Lecture 26 (4/3 Fri.) =====

  * Feedback-directed prefetcher
    * Use the result of the prefetcher as feedback to the prefetcher
      * With accuracy, timeliness, and pollution information
  * Markov prefetcher
    * Prefetch based on the previous history
    * Use a Markov model to predict
    * Pros: can cover arbitrary patterns (easy for linked-list traversals or trees)
    * Downside: high cost; cannot help with compulsory misses (no history)
  * Content-directed prefetching
    * Identify the content in memory that is a pointer (which is used as the address to prefetch)
    * Not very efficient (hard to figure out which block is the pointer)
      * Software can give hints
  * Correlation table
    * Address correlation
  * Execution-based prefetcher
    * Helper thread/speculative thread
      * Use another thread to pre-execute a program
    * Can be software based or hardware based
    * Discover misses before the main program (to prefetch data in a timely manner)
    * How do you construct the helper thread
    * Preexecute instruction (one example of how to initialize a speculative thread), slide 9
    * Thread-based pre-execution
  * Error tolerance
  * Solutions to errors
    * Tolerate errors
      * New interface, new design
    * Eliminate or minimize errors
      * New technology, system-wide rethinking
    * Embrace errors
      * Map data that can tolerate errors to error-prone areas
  * Hybrid memory systems
    * Combining multiple memory technologies together
  * What can emerging technologies help with?
    * Scalability
    * Lowering the cost
    * Energy efficiency
  * Possible solutions to the scaling problem
    * Less-leaky DRAM
    * Heterogeneous DRAM (TL-DRAM, etc.)
    * Add more functionality to DRAM
    * Denser design (3D stacking)
    * Different technology
      * NVM
  * Charge vs. resistive memory
    * How is data written?
    * How is data read?
  * Non-volatile memory
    * Resistive memory
      * PCM
        * Inject current to change the phase
        * Scales better than DRAM
          * Multiple bits per cell
            * Wider resistance range
        * No refresh is needed
        * Downside: latency and write endurance
      * STT-MRAM
        * Inject current to change the polarity
      * Memristor
        * Inject current to change the structure
    * Pros and cons between different technologies
    * Persistency - data stays there even without power
      * Unified memory and storage management (persistent data structures) - single-level store
        * Improves energy and performance
        * Simplifies the programming model
  * Different design options for DRAM + NVM
    * DRAM as a cache
    * Place some data in DRAM and other data in PCM
      * Based on the characteristics
        * Frequently accessed data that needs lower write latency goes in DRAM
===== Lecture 27 (4/6 Mon.) =====
  * Flynn's taxonomy
  * Parallelism
    * Reduces power consumption (P ~ CV^2F)
    * Better cost efficiency and easier to scale
    * Improves dependability (in case the other core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Data level parallelism
    * Task level parallelism
  * Task level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement)
  * Loosely coupled multiprocessors
    * No shared global address space
      * Message passing to communicate between different sources
    * Simple to manage memory
  * Tightly coupled multiprocessors
    * Shared global address space
    * Need to ensure consistency of data
    * Programming issues
  * Hardware-based multithreading
    * Coarse grained
    * Fine grained
    * Simultaneous: dispatch instructions from multiple threads at the same time
  * Parallel speedup
    * Superlinear speedup
  * Utilization, Redundancy, Efficiency
  * Amdahl's law (worked example after this list)
    * Maximum speedup
    * Parallel portion is not perfect
      * Serial bottleneck
      * Synchronization cost
      * Load imbalance
        * Some threads have more work and require more time to reach the sync. point
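A worked Amdahl's law example with assumed numbers. For parallel fraction p on n cores:

<code latex>
\text{speedup}(p, n) = \frac{1}{(1 - p) + p/n}
</code>

With p = 0.9 and n = 16: 1 / (0.1 + 0.9/16) ≈ 6.4. Even with infinite cores, the speedup is capped at 1/(1 - p) = 10, which is the serial bottleneck noted above.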
===== Lecture 28 (4/8 Wed.) =====
  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure the shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Dekker's algorithm (fragment after this list)
    * Inconsistency -- the two processors did NOT see the same order of operations to memory
  * Sequential consistency
    * Multiple correct global orders
    * Two issues:
      * Too conservative/strict
      * Performance limiting
  * Weak consistency: global ordering only at synchronization points
    * Programmer hints where the synchronizations are
    * Memory fence
    * More burden on the programmers
  * Cache coherence
    * Can be done at the software level or hardware level
  * Snoop-based coherence
    * A simple protocol with two states, broadcasting reads/writes on a bus
  * Maintaining coherence
    * Needs to provide 1) write propagation and 2) write serialization
    * Update vs. Invalidate
  * Two cache coherence methods
    * Snoopy bus
      * Bus based, single point of serialization
      * More efficient with a small number of processors
      * Processors snoop other caches' read/write requests to keep the cache block coherent
    * Directory
      * Single point of serialization per block
      * Directory coordinates the coherency
      * More scalable
      * The directory keeps track of where the copies of each block reside
        * Supplies data on a read
        * Invalidates the block on a write
        * Has an exclusive state
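The two-flag fragment at the heart of Dekker's algorithm, reduced to the part discussed in lecture. It is correct only if both processors observe the stores and loads in the same global order (sequential consistency); if the hardware lets the load bypass the earlier store, both threads can enter the critical section:

<code c>
volatile int flag0 = 0, flag1 = 0;

void thread0(void)
{
    flag0 = 1;            /* announce intent */
    if (flag1 == 0) {
        /* critical section: safe only if the store to flag0 is
         * globally visible before the load of flag1 */
    }
}

void thread1(void)
{
    flag1 = 1;
    if (flag0 == 0) {
        /* critical section */
    }
}
</code>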
===== Lecture 29 (4/10 Fri.) =====
  * MSI coherence protocol
    * The problem: unnecessary broadcasts of invalidations
  * MESI coherence protocol
    * Adds the exclusive state to MSI: this is the only cache copy, and it is clean
    * Multiple invalidation tradeoffs
    * Problem: memory can be unnecessarily updated
    * A possible owner state (MOESI)
  * Tradeoffs between snooping and directory-based coherence protocols
    * Slide 31 has a good summary
  * Directory: data structures
    * Bit vectors vs. linked lists
  * Scalability of directories
    * Size? Latency? Thousands of nodes? Best of both snooping and directory?
===== Lecture 30 (4/13 Mon.) =====
  * In-memory computing
  * Design goals of DRAM
  * DRAM structures
    * Banks
    * Capacitors and sense amplifiers
    * Trade-offs b/w number of sense amps and cells
    * Width of bank I/O vs. row size
  * DRAM operations
    * ACTIVATE, READ/WRITE, and PRECHARGE
  * Trade-offs
    * Latency
    * Bandwidth: chip vs. rank vs. bank
      * What's the benefit of having 8 chips?
    * Parallelism
  * RowClone
    * What are the problems?
    * Copying b/w two rows that share the same sense amplifier
    * System software support
  * Bitwise AND/OR
===== Lecture 31 (4/15 Wed.) =====

  * Application slowdown
  * Interference between different applications
    * An application's performance depends on the other applications it runs with
  * Predictable performance
    * Why is it important?
    * Applications that need predictability
    * How to predict the performance?
      * What information is useful?
      * What needs to be guaranteed?
      * How to estimate the performance when running with others?
        * Easy: just measure the performance while it is running.
      * How to estimate the performance when the application is running by itself?
        * Hard if there is no profiling.
      * The relationship between memory service rate and performance (estimate after this list)
        * Key assumption: applications are memory bound
    * Behavior of memory-bound applications
      * With and without interference
  * Memory phase vs. compute phase
  * MISE
    * Estimating slowdown using request service rates
    * Inaccuracy when measuring the request service rate alone
    * Non-memory-bound applications
    * Control slowdown and provide a soft guarantee
  * Taking the shared cache into account
    * MISE model + cache resource management
    * Auxiliary tag store
      * Separate tag store for different cores
    * Cache access rate (alone and shared) as the metric to estimate slowdown
  * Cache partitioning
    * How to determine the partitioning
      * Utility-based cache partitioning
      * Others
  * Maximum slowdown and fairness metric
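The MISE estimate the list refers to, valid under its key assumption (the application is memory bound):

<code latex>
\text{slowdown} = \frac{\text{request service rate}_{\text{alone}}}{\text{request service rate}_{\text{shared}}}
</code>

The shared rate is measured directly; the alone rate is estimated by periodically giving the application the highest priority at the memory controller for short intervals.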
===== Lecture 32 (4/20 Mon.) =====

  * Heterogeneous systems
    * Asymmetric cores: different types of cores on the chip
      * Each of these cores is optimized for different workloads/requirements/goals
      * Multiple special-purpose processors
      * Flexible and can adapt to workload behavior
      * Disadvantages: complex and high overhead
    * Examples: CPU-GPU systems, heterogeneity in execution models
    * Heterogeneous resources
      * Example: reliable and non-reliable DRAM in the same system
  * Key problems in modern systems
    * Memory system
    * Efficiency
    * Predictability
    * Asymmetric design can help solve these problems
  * Serialized code sections
    * Bottleneck in multicore execution
    * Parallelizable vs. serial portion
    * Accelerate critical sections
    * Cache ping-ponging
      * Synchronization latency
    * Symmetric vs. asymmetric design
  * Large cores + small cores
    * Core asymmetry
  * Amdahl's law with heterogeneous cores
  * Parallel bottlenecks
    * Resource contention
      * Depends on what is running
  * Accelerated critical sections
    * Ship critical sections to large cores
    * Small modifications and low overhead
    * False serialization might become the bottleneck
    * Can reduce parallel throughput
    * Effect on private cache misses and shared cache misses
===== Lecture 33 (4/27 Mon.) =====

  * Interconnects
    * Connecting multiple components together
    * Goals: scalability, flexibility, performance, and energy efficiency
  * Metrics: performance, bandwidth, bisection bandwidth, cost, energy efficiency, system performance, contention, latency
    * Saturation point
      * Saturation throughput
  * Topology
    * How to wire components together; affects routing, throughput, latency
    * Bus: all nodes connected to a single shared link
      * Hard to increase frequency and bandwidth; poor scalability, but simple
    * Point-to-point
      * Low contention and potentially low latency. Costly, not scalable, and hard to wire.
    * Crossbar
      * No contention. Concurrent requests from different src/dest pairs can be sent concurrently. Costly.
    * Multistage logarithmic network
      * Indirect network; low contention; multiple requests can be sent concurrently. More scalable compared to a crossbar.
      * Circuit switching
      * Omega network, delta network
      * Butterfly network
      * Intermediate switches between sources and destinations
    * Switching vs. topology
    * Ring
      * Each node connected to two other nodes, forming a ring
      * Low overhead, high latency, not as scalable
      * Unidirectional ring and bi-directional ring
    * Hierarchical rings
      * Layers of rings. More scalable, lower latency.
      * Bridge routers connect multiple rings together
    * Mesh
      * 4 input and output ports
      * More bisection bandwidth and more scalable
      * Easy to lay out
      * Path diversity
      * Routers are more complex
    * Tree
      * Another hierarchical topology
      * Specialized topology
      * Good for local traffic
      * Fat tree: higher levels have more bandwidth
      * CM-5 fat tree
        * Fat tree with 4x2 switches
    * Hypercube
      * N-dimensional cube
      * Caltech Cosmic Cube
      * Very complex
  * Routing algorithm
    * How does a message get sent from source to destination
    * Static or adaptive
    * Handling contention
      * Buffering helps handle contention, but adds complexity
    * Three types of routing algorithms (XY-routing sketch after this list)
      * Deterministic: always takes the same path
      * Oblivious: takes different paths without taking the state of the network into account
        * For example, the Valiant algorithm
      * Adaptive: takes different paths, taking the state of the network into account
        * Non-minimal adaptive routing vs. minimal adaptive routing
      * Minimal path: path that has the minimum number of hops
  * Buffering and flow control
    * How to store within the network
    * Handling oversubscription
    * Source throttling
    * Bufferless vs. buffered crossbars
    * Buffer overflow
    * Bufferless deflection routing
      * Deflect packets when there is contention
      * Hot-potato routing
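A deterministic routing example: dimension-ordered (XY) routing on a 2-D mesh, sketched under the assumption that each router knows its own and the packet's destination coordinates. The same source/destination pair always yields the same path, which is what makes it deterministic:

<code c>
typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

/* Route fully in X first, then in Y: deadlock-free on a mesh,
 * but cannot adapt to congestion. */
port_t route_xy(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return EAST;
    if (cur_x > dst_x) return WEST;
    if (cur_y < dst_y) return NORTH;
    if (cur_y > dst_y) return SOUTH;
    return LOCAL;   /* arrived: deliver to the local node */
}
</code>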