buzzword

    * Develop for image processing (for example, convolution)
  * Stage processing
 +
 +===== Lecture 18 (2/28 Fri.) =====
 +
 +  * Tradeoffs of VLIW
 +    * Why does VLIW require static instruction scheduling?
 +    * Whose job is it?
 +      * Compiler can rearrange basic blocks/instructions
 +  * Basic block
 +    * Benefits of having large basic blocks
 +  * Entry/Exit
 +    * Handling entries/exits
 +  * Trace cache
 +    * How to ensure correctness?
 +    * Profiling
 +    * Fixing up the instruction order to ensure correctness
 +    * Dealing with multiple entries into the block
 +    * Dealing with multiple exits from the block
 +  * Super block
 +    * How to form super blocks?
 +    * Benefit of super block
 +    * Tradeoff between not forming a super block and forming a super block
 +      * Ambiguous branch (after profiling, both taken/not taken are equally likely)
 +      * Cleaning up 
 +  * What scenario would make trace cache/superblock/profiling less effective?
 +  * List scheduling
 +    * Helps figure out which instructions the VLIW machine should fetch/issue together (see the sketch after this list)
 +    * Try to maximize instruction throughput
 +    * How to assign priorities
 +    * What if some instructions take longer than others
 +  * Block structured ISA (BS-ISA)
 +    * Problems with trace scheduling?
 +    * What type of program will benefit from BS-ISA
 +    * How to form blocks in BS-ISA?
 +      * Combining basic blocks
 +      * multiples of merged basic blocks
 +    * How to deal with entries/exits in BS-ISA?
 +      * undo the executed instructions from the entry point, then fetch the new block
 +    * Advantages over trace cache
 +  * Benefit of VLIW + Static instruction scheduling
 +  * Intel IA-64
 +    * Static instruction scheduling and VLIW
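
A minimal list-scheduling sketch in Python, for the list-scheduling bullet above. The dependence graph, instruction latencies, and the 2-wide issue width are made-up examples (not from the lecture): priority is the longest remaining latency path, and each cycle the highest-priority ready instructions fill the VLIW slots.

<code python>
from collections import defaultdict
from functools import lru_cache

# Toy program: instruction -> (latency, list of producer instructions it depends on)
INSTRS = {
    "ld1":  (3, []),
    "ld2":  (3, []),
    "add1": (1, ["ld1", "ld2"]),
    "mul1": (2, ["add1"]),
    "st1":  (1, ["mul1"]),
    "add2": (1, ["ld1"]),
}

@lru_cache(maxsize=None)
def priority(instr):
    """Priority = longest latency path from this instruction to the end of the block."""
    lat, _ = INSTRS[instr]
    succs = [i for i, (_, deps) in INSTRS.items() if instr in deps]
    return lat + (max(priority(s) for s in succs) if succs else 0)

def list_schedule(issue_width=2):
    finish, remaining = {}, set(INSTRS)   # finish[i] = cycle when i's result is ready
    schedule, cycle = defaultdict(list), 0
    while remaining:
        ready = [i for i in remaining
                 if all(d in finish and finish[d] <= cycle for d in INSTRS[i][1])]
        # Fill this cycle's VLIW word with the highest-priority ready instructions.
        for i in sorted(ready, key=priority, reverse=True)[:issue_width]:
            schedule[cycle].append(i)
            finish[i] = cycle + INSTRS[i][0]
            remaining.remove(i)
        cycle += 1
    return schedule

for cyc, word in sorted(list_schedule().items()):
    print(f"cycle {cyc}: {word}")
</code>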
 +
 +===== Lecture 19 (3/19 Wed.) =====
 +
 +  * Ideal cache
 +    * More capacity
 +    * Fast
 +    * Cheap
 +    * High bandwidth
 +  * DRAM cell
 +    * Cheap
 +    * Sense the perturbation through the sense amplifier
 +    * Slow and leaky
 +  * SRAM cell (Cross coupled inverter)
 +    * Expensive
 +    * Fast (easier to sense the value in the cell)
 +  * Memory bank
 +    * Read access sequence
 +      * DRAM: Activate -> Read -> Precharge (if needed)
 +    * What dominates the access latency for DRAM and SRAM?
 +  * Scaling issue
 +    * Hard to scale the cell down to smaller sizes
 +  * Memory hierarchy
 +    * Prefetching
 +    * Caching
 +  * Spatial and temporal locality
 +    * Cache can exploit these
 +    * Recently used data is likely to be accessed
 +    * Nearby data is likely to be accessed
 +  * Caching in a pipeline design
 +  * Cache management
 +    * Manual
 +      * Data movement is managed manually
 +        * Embedded processor
 +        * GPU scratchpad
 +    * Automatic
 +      * HW manage data movements
 +  * Latency analysis
 +    * Based on the hit/miss status, the next level's access time (if miss), and the current level's access time (see the worked example after this list)
 +  * Cache basics
 +    * Set/block (line)/placement/replacement/direct mapped vs. associative cache/etc.
 +  * Cache access
 +    * How to access tag and data (in parallel vs serially)
 +    * How do tag and index get used?
 +    * Modern processors perform serial access for higher level cache (L3 for example) to save power
 +  * Cost and benefit of having more associativity
 +    * Given the associativity, which block should be replaced if the set is full
 +    * Replacement policy
 +      * Random
 +      * Least recently used (LRU)
 +      * Least frequently used
 +      * Least costly to refetch
 +      * etc.
 +  * How to implement LRU (see the sketch after this list)
 +    * How to keep track of access ordering
 +      * Complexity increases rapidly
 +    * Approximate LRU
 +      * Victim and next Victim policy
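
A small worked example for the latency-analysis bullet above (the hit rates and latencies are made-up numbers): the average access time of a level is its own access time plus its miss rate times the average access time of the next level.

<code python>
def amat(levels):
    """levels: list of (access_time_cycles, hit_rate), first level first; the last level always hits."""
    access_time, hit_rate = levels[0]
    if len(levels) == 1:
        return access_time
    return access_time + (1 - hit_rate) * amat(levels[1:])

# Assumed numbers: L1 = 2 cycles / 90% hit rate, L2 = 12 cycles / 80% hit rate, memory = 200 cycles.
print(amat([(2, 0.9), (12, 0.8), (200, 1.0)]))   # 2 + 0.1 * (12 + 0.2 * 200) = 7.2 cycles
</code>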
 +
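
A minimal set-associative cache with true LRU replacement, for the "How to implement LRU" bullet above. The parameters (4 sets, 2 ways, 16-byte blocks) and the address trace are illustrative; the point is that per-set access ordering has to be tracked explicitly, which is what gets expensive as associativity grows.

<code python>
from collections import OrderedDict

NUM_SETS, WAYS, BLOCK_BYTES = 4, 2, 16

class LRUCache:
    def __init__(self):
        # One ordered map per set, ordered from LRU (front) to MRU (back); values unused.
        self.sets = [OrderedDict() for _ in range(NUM_SETS)]

    def access(self, addr):
        block = addr // BLOCK_BYTES           # strip the block offset bits
        index, tag = block % NUM_SETS, block // NUM_SETS
        ways = self.sets[index]
        if tag in ways:
            ways.move_to_end(tag)             # hit: make this way the most recently used
            return "hit"
        if len(ways) == WAYS:
            ways.popitem(last=False)          # miss in a full set: evict the LRU way
        ways[tag] = None
        return "miss"

c = LRUCache()
print([c.access(a) for a in (0x00, 0x40, 0x80, 0x00, 0x40, 0x80)])
# All six accesses miss: with 16B blocks and 4 sets, 0x00/0x40/0x80 all map to set 0
# and thrash the two ways (the set-thrashing case from the next lecture).
</code>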
 +===== Lecture 20 (3/21 Fri.) =====
 +
 +  * Set thrashing
 +    * Working set is bigger than the associativity
 +  * Belady's OPT (see the sketch after this list)
 +    * Is this optimal?
 +    * Complexity?
 +  * Similarity between cache and page table
 +    * Number of blocks vs pages
 +    * Time to find the block/page to replace
 +  * Handling writes
 +    * Write back
 +      * Need a modified (dirty) bit to make sure accesses to the data get the updated data
 +    * Write through
 +      * Simpler, no consistency issues between levels
 +  * Sectored cache
 +    * Use subblocks (sectors)
 +      * Lower bandwidth requirement (no need to transfer the whole block)
 +      * More complex
 +  * Instruction vs data cache
 +    * Where to place instructions
 +      * Unified vs. separated
 +    * In the first level cache
 +  * Cache access
 +    * First level access
 +    * Second level access
 +      * When to start the second level access
 +        * Performance vs. energy
 +  * Address translation
 +  * Homonym and Synonyms
 +    * Homonym: Same VA but maps to different PA
 +      * With multiple processes
 +    * Synonyms: Multiple VAs map to the same PA
 +      * Shared libraries, shared data, copy-on-write
 +      * I/O
 +    * Can these create problems when we have the cache
 +    * How to eliminate these problems?
 +      * Page coloring
 +  * Interaction between cache and TLB
 +    * Virtually indexed vs. physically indexed
 +    * Virtually tagged vs. physically tagged
 +    * Virtually indexed physically tagged (VIPT) (see the example after this list)
 +  * Virtual memory in DRAM
 +    * Control where data is mapped to in channel/rank/bank
 +      * More parallelism
 +      * Reduce interference
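
A minimal sketch of Belady's OPT on a single cache set, for the bullet above (the access trace and the 2-block capacity are made-up): on a miss with a full set, evict the block whose next use is furthest in the future.

<code python>
def belady_misses(trace, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue                      # hit
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]
            # Evict the cached block whose next use is furthest away (or never).
            def next_use(b):
                return future.index(b) if b in future else float("inf")
            cache.remove(max(cache, key=next_use))
        cache.add(block)
    return misses

print(belady_misses(["A", "B", "C", "A", "B", "D", "A", "B"], capacity=2))   # prints 6
</code>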
 +
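
A quick check for the virtually indexed, physically tagged bullet above. It encodes a common rule of thumb (an assumption here, with illustrative cache parameters): if all index + block-offset bits come from the page offset, i.e. cache size / associativity <= page size, the cache can be indexed with the virtual address without synonym problems; otherwise one needs tricks such as page coloring or more ways.

<code python>
import math

def vipt_ok(cache_bytes, ways, block_bytes, page_bytes=4096):
    sets = cache_bytes // (ways * block_bytes)
    index_bits = int(math.log2(sets))
    offset_bits = int(math.log2(block_bytes))
    # Index + offset must fit inside the page offset (bits unchanged by translation).
    return index_bits + offset_bits <= int(math.log2(page_bytes))

print(vipt_ok(32 * 1024, ways=8, block_bytes=64))   # True:  each way is 4KB = one page
print(vipt_ok(32 * 1024, ways=4, block_bytes=64))   # False: each way is 8KB > 4KB page
</code>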
 +===== Lecture 21 (3/24 Mon.) =====
 +
 +
 +
 +  * Different parameters that affect cache miss
 +  * Thrashing
 +  * Different types of cache misses
 +    * Compulsory misses
 +      * Can mitigate with prefetches
 +    * Capacity misses
 +      * Larger cache / utilize the cache space better
 +    * Conflict misses
 +      * More assoc
 +      * Victim cache
 +      * Hashing
 +  * Large block vs. small block
 +  * Subblocks
 +  * Victim cache
 +    * Small, but fully assoc. cache behind the actual cache
 +    * Holds recently evicted (victim) cache blocks
 +    * Prevent ping-ponging
 +  * Pseudo associativity
 +    * Simpler way to implement associative cache
 +  * Skewed assoc. cache
 +    * Different hashing functions for each way
 +  * Restructure data access pattern
 +    * Order of loop traversal
 +    * Blocking (see the sketch after this list)
 +  * Memory level parallelism
 +    * The cost per miss is lower when misses are serviced in parallel than when they are serviced serially
 +  * MSHR
 +    * Keeps track of pending cache misses (see the sketch after this list)
 +      * Think of this as the load/store buffer-ish for cache
 +    * What information goes into the MSHR?
 +    * When do you access the MSHR?
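
A small loop-blocking (tiling) sketch for the "Restructure data access pattern" bullet above (the matrix and tile sizes are illustrative): the tiled transpose touches the data one TILE x TILE tile at a time, so the working set of the inner loops fits in the cache and gets reused before it is evicted.

<code python>
N, TILE = 1024, 32   # assumed matrix dimension and tile size

def transpose_blocked(src):
    dst = [[0] * N for _ in range(N)]
    for ii in range(0, N, TILE):
        for jj in range(0, N, TILE):
            # Work on one TILE x TILE tile at a time so it stays cache-resident.
            for i in range(ii, ii + TILE):
                for j in range(jj, jj + TILE):
                    dst[j][i] = src[i][j]
    return dst
</code>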
 +
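
A minimal MSHR sketch for the bullet above (the entry format and the 16-entry size are assumptions for illustration): a miss either merges into an existing entry for the same block or allocates a new one, and the entry remembers which loads are waiting for the fill.

<code python>
class MSHRFile:
    def __init__(self, num_entries=16):
        self.num_entries = num_entries
        self.entries = {}   # block address -> list of waiting (dest_reg, offset) pairs

    def on_miss(self, block_addr, dest_reg, offset):
        if block_addr in self.entries:             # secondary miss: merge, no new memory request
            self.entries[block_addr].append((dest_reg, offset))
            return "merged"
        if len(self.entries) == self.num_entries:  # all MSHRs busy: the pipeline must stall
            return "stall"
        self.entries[block_addr] = [(dest_reg, offset)]
        return "request sent to next level"

    def on_fill(self, block_addr):
        # Data came back: return the waiting loads so they can be woken up, free the entry.
        return self.entries.pop(block_addr, [])

m = MSHRFile()
print(m.on_miss(0x1000, dest_reg="r1", offset=0))    # request sent to next level
print(m.on_miss(0x1000, dest_reg="r2", offset=8))    # merged (same block, no duplicate request)
print(m.on_fill(0x1000))                             # [('r1', 0), ('r2', 8)]
</code>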
 +
 +===== Lecture 22 (3/26 Wed.) =====
 +
 +
 +
 +  * Multi-porting
 +    * Virtual multi-porting
 +      * Time-share the port, not too scalable but cheap
 +    * True multiporting
 +  * Multiple cache copies
 +  * Banking
 +    * Can have bank conflict
 +    * Extra interconnects across banks
 +    * Address mapping can mitigate bank conflict
 +    * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity)
 +  * Accessing DRAM
 +    * Row bits
 +    * Column bits
 +    * Addressability
 +    * DRAM has its own clock
 +  * DRAM (1T1C) vs. SRAM (6T)
 +    * Cost
 +    * Latency
 +  * Interleaving in DRAM
 +    * Effects from address mapping on memory interleaving
 +    * Effects from memory access patterns from the program on interleaving
 +  * DRAM Bank
 +    * To minimize the cost of interleaving (banks share the data bus and the command bus)
 +  * DRAM Rank
 +    * A bundle of chips operated together; keeps each individual chip narrow and cheap
 +  * DRAM Channel
 +    * An interface to DRAM, each with its own ranks/banks
 +  * DIMM
 +    * More DIMMs add interconnect complexity
 +  * List of commands to read/write data into DRAM
 +    * Activate -> read/write -> precharge
 +    * Activate moves data into the row buffer
 +    * Precharge prepares the bank for the next access
 +  * Row buffer hit
 +  * Row buffer conflict
 +  * Scheduling memory requests to lower row conflicts
 +  * Burst mode of DRAM
 +    * Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits
 +  * Address mapping (see the sketch after this list)
 +    * Row interleaved
 +    * Cache block interleaved
 +  * Memory controller
 +    * Sending DRAM commands
 +    * Periodically send commands to refresh DRAM cells
 +    * Ensure correctness and data integrity
 +    * Where to place the memory controller
 +      * On CPU chip vs. at the main memory
 +        * Higher BW on-chip
 +    * Determine the order of requests that will be serviced in DRAM
 +      * Request queues that hold requests
 +      * Send requests whenever the request can be sent to the bank
 +      * Determine which command (across banks) should be sent to DRAM
 +  * Priority of demand vs. prefetch requests
 +  * Memory scheduling policies
 +    * FCFS
 +    * FR-FCFS
 +      * Capped FR-FCFS: FR-FCFS with a timeout
 +      * Usually this is done at the command level (read/write commands and precharge/activate commands)
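
A sketch of the two address mappings mentioned above. The bit-field sizes are assumptions (64-byte cache blocks, 8 banks, 32 cache blocks per 2KB row): cache-block interleaving spreads consecutive blocks across banks (more bank-level parallelism), while row interleaving keeps a whole row in one bank (better row-buffer locality).

<code python>
def decode(addr, cache_block_interleaved):
    block_off = addr & 0x3F            # 6 bits: byte within a 64B cache block
    rest = addr >> 6
    if cache_block_interleaved:
        # Consecutive cache blocks go to different banks -> more bank parallelism.
        bank = rest & 0x7;  rest >>= 3
        col  = rest & 0x1F; rest >>= 5   # 32 blocks (2KB row / 64B) per row
    else:
        # Row interleaved: a whole 2KB row sits in one bank -> better row-buffer locality.
        col  = rest & 0x1F; rest >>= 5
        bank = rest & 0x7;  rest >>= 3
    return {"row": rest, "bank": bank, "column": col, "offset": block_off}

for a in (0x0000, 0x0040, 0x0080):     # three consecutive cache blocks
    print(hex(a),
          "block-interleaved bank:", decode(a, cache_block_interleaved=True)["bank"],
          "| row-interleaved bank:", decode(a, cache_block_interleaved=False)["bank"])
</code>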
 +
 +===== Lecture 23 (3/28 Fri.) =====
 +
 +  * DRAM design choices
 +    * Cost/density/latency/BW/Yield
 +  * Sense Amplifier
 +    * How do they work
 +  * Dual data rate
 +  * Subarray
 +  * Rowclone
 +    * Moving bulk data from one row to another inside DRAM
 +    * Lower latency and less bandwidth consumption when copying or zeroing out data
 +  * TL-DRAM
 +    * Far segment
 +    * Near segment
 +    * What causes the long latency
 +    * Benefit of TL-DRAM
 +      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
 +
 +===== Lecture 24 (3/31 Mon.) =====
 +
 +  * Memory controller
 +    * Different commands
 +  * Memory scheduler
 +    * Determine the order of requests to be issued to DRAM
 +    * Age/hit-miss status/types (load/store/prefetch/from GPU/from CPU)/criticality
 +  * Row buffer
 +    * hit/conflict
 +    * open/closed row
 +    * Open row policy
 +    * Closed row policy
 +    * Tradeoffs between open and closed row policy
 +      * If the program has high row buffer locality, open row might benefit more
 +      * Closed row services row-miss requests faster
 +  * Bank conflict
 +  * Interference from different applications/​threads
 +    * Different programs/processes/threads interfere with each other
 +      * introduce more row buffer/bank conflicts
 +    * The memory scheduler has to manage this interference
 +    * Memory hog problems
 +  * Interference in the data/​command bus
 +  * FR-FCFS (see the sketch after this list)
 +    * Why does FR-FCFS make sense?
 +      * Row buffer hits have lower latency
 +    * Issues with FR-FCFS
 +      * Unfairness
 +  * STFM
 +    * Fairness issue in memory scheduling
 +    * How does STFM calculate the fairness and slowdown
 +      * How to estimate the slowdown (requires estimating how long the thread would take running alone)
 +    * Definition of fairness (based on STFM; different papers/areas define fairness differently)
 +  * PAR-BS
 +    * Parallelism in programs
 +    * Interference across banks
 +    * How to form a batch
 +    * How to determine ranking between batches/within a batch
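
A minimal FR-FCFS sketch for the bullet above (the request format and the open-row bookkeeping are assumptions for illustration): among pending requests, first prefer the ones that hit the currently open row in their bank (first ready), then the oldest (first come first serve).

<code python>
class FRFCFS:
    def __init__(self):
        self.open_row = {}          # bank -> currently open row (row buffer contents)

    def pick(self, queue):
        """queue: list of dicts like {"arrival": 3, "bank": 0, "row": 17}."""
        if not queue:
            return None
        def key(req):
            row_hit = self.open_row.get(req["bank"]) == req["row"]
            return (not row_hit, req["arrival"])   # row hits first, then oldest
        req = min(queue, key=key)
        queue.remove(req)
        self.open_row[req["bank"]] = req["row"]    # open-row policy: leave the row open
        return req

sched = FRFCFS()
q = [{"arrival": 0, "bank": 0, "row": 5},
     {"arrival": 1, "bank": 0, "row": 7},
     {"arrival": 2, "bank": 0, "row": 5}]
print([sched.pick(q)["row"] for _ in range(3)])   # [5, 5, 7]: the younger row hit bypasses the older miss
</code>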
 +
 +===== Lecture 25 (4/2 Wed.) =====
 +
 +
 +
 +  * Latency sensitivity
 +    * Performance drops a lot when the memory request latency is long
 +  * TCM
 +    * Tradeoff between throughput and fairness
 +    * Latency sensitive cluster (non-intensive cluster)
 +      * Ranking based on memory intensity
 +    * Bandwidth intensive cluster
 +      * Round robin within the cluster
 +    * Generally latency sensitive cluster has more priority
 +    * Provide robust fairness vs. throughput
 +    * Complexity of TCM?
 +  * Different ways to control interference in DRAM
 +    * Partitioning of resource
 +      * Channel partitioning: map applications that interfere with each other to different channels
 +        * Keep track of each application's characteristics
 +        * Dedicating a whole channel might waste bandwidth
 +        * Need OS support to determine the channel bits
 +    * Source throttling
 +      * A controller throttles the cores depending on the performance target
 +      * Example: Fairness via source throttling
 +        * Detect unfairness and throttle the application that is interfering
 +        * How do you estimate slowdown?
 +        * Threshold based solution: hard to configure
 +    * App/thread scheduling
 +      * Critical threads usually stall the progress
 +    * Designing DRAM controller
 +      * Has to handle the normal DRAM operations
 +        * Read/write/refresh/all the timing constraints
 +      * Keep track of resources
 +      * Assign priorities to different requests
 +      * Manage requests to banks
 +    * Self-optimizing controller
 +      * Use machine learning to improve DRAM controller
 +  * DRAM Refresh
 +    * Why does DRAM have to be refreshed every 64 ms?
 +    * Banks are unavailable during refresh
 +      * LPDDR mitigates this by using per-bank refresh
 +    * Refresh takes longer with bigger DRAM
 +    * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner
 +      * As opposed to burst refresh (long pause time)
 +  * RAIDR: Reduce DRAM refresh by profiling and binning
 +      * Some rows do not have to be refreshed very frequently
 +      * Profile the row
 +        * High temperature changes the retention time: need online profiling
 +  * Bloom filter (see the sketch after this list)
 +    * Represents set membership
 +    * Approximate
 +    * Can have false positives (but no false negatives)
 +      * Better/more hash functions help reduce false positives
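
A minimal Bloom filter sketch for the bullet above (the size, the hash construction, and the RAIDR-style row names are illustrative assumptions): k hash functions set k bits per inserted element, and a lookup answers "maybe present" (possible false positive) or "definitely not present" (never a false negative).

<code python>
import hashlib

class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits, self.num_hashes = num_bits, num_hashes
        self.bits = 0                      # bit vector kept in a Python integer

    def _positions(self, item):
        for i in range(self.num_hashes):   # derive k hash functions by salting one hash
            h = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(h, 16) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def maybe_contains(self, item):
        return all(self.bits & (1 << pos) for pos in self._positions(item))

# RAIDR-style use (illustrative): remember which rows need the frequent refresh rate.
weak_rows = BloomFilter()
weak_rows.add("row 0x1A3")
print(weak_rows.maybe_contains("row 0x1A3"))   # True
print(weak_rows.maybe_contains("row 0x2B4"))   # False (or, rarely, a false positive)
</code>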
 +
 +===== Lecture 26 (4/7 Mon.) =====
 +
 +
 +
 +  * Tolerating latency can be costly
 +    * Instruction window is complex
 +      * Benefit also diminishes
 +    * Designing the buffers can be complex
 +    * A simpler way to tolerate out of order is desirable
 +  * Different sources that cause the core to stall in OoO
 +    * Cache miss
 +    * Note that stall happens if the inst. window is full
 +  * Scaling instruction window size is hard
 +    * It is better (less complex) to make the window more efficient
 +  * Runahead execution
 +    * Try to obtain MLP w/o increasing the instruction window
 +    * Runahead (i.e. execute ahead) when there is a long memory instruction
 +      * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
 +      * Execute future instructions to generate accurate prefetches
 +      * Allow future data to be in the cache
 +    * How to support runahead execution?
 +      * Need a way to checkpoint the state when entering runahead mode
 +      * How to make executing in the wrong path useful?
 +      * Need runahead cache to handle load/store in Runahead mode (since they are speculative)
 +    * Cost and benefit of runahead execution (slide number 27)
 +    * Runahead can have inefficiency
 +      * Runahead periods that are useless
 +        * Get rid of useless/inefficient periods
 +    * What if there is a dependent cache miss
 +      * Cannot be parallelized in a vanilla runahead
 +      * Can predict the value of the dependent load
 +        * How to predict the address of the load (see the sketch after this list)
 +          * Delta value information
 +          * Stride predictor
 +          * AVD prediction
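
A minimal stride-based address-predictor sketch, in the spirit of the "Stride predictor" bullet above (the table format and the training rule are assumptions, and this is not the AVD mechanism itself): remember each load's last address and delta, and once the delta repeats, predict the next address.

<code python>
class StridePredictor:
    def __init__(self):
        self.table = {}   # load PC -> (last_addr, last_stride)

    def predict_and_train(self, pc, addr):
        """Return the predicted *next* address for this load, then update the table."""
        prediction = None
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            if stride == last_stride:        # stride confirmed: predict it continues
                prediction = addr + stride
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prediction

p = StridePredictor()
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    pred = p.predict_and_train(pc=0x400123, addr=a)
    print(hex(a), "->", hex(pred) if pred is not None else None)
# The first two accesses train the predictor; later ones predict the current address + 0x40.
</code>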
 +
buzzword.txt · Last modified: 2015/04/27 18:20 by rachata