    * Cached misses cache block
    * Prevent ping-ponging
  * Pseudo associativity
    * Simpler way to implement associative cache
  * Skewed assoc. cache
    * What information goes into the MSHR?
    * When do you access the MSHR?

===== Lecture 22 (3/26 Wed.) =====

  * Multi-porting
    * Virtual multi-porting
      * Time-shares the port; cheap but not very scalable
    * True multi-porting
  * Multiple cache copies
  * Banking
    * Can have bank conflicts
    * Extra interconnect across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
  * DRAM (1T1C) vs. SRAM (6T)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effect of address mapping on memory interleaving
    * Effect of the program's memory access pattern on interleaving
  * DRAM Bank
    * Minimizes the cost of interleaving (banks share the data bus and the command bus)
  * DRAM Rank
    * A bundle of chips operated together (minimizes the cost per chip)
  * DRAM Channel
    * An interface to DRAM, each with its own ranks/banks
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping (see the C sketch after this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order of requests that will be serviced in DRAM
      * Request queues hold pending requests
      * A request is sent whenever it can be issued to its bank
      * Determines which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
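
A minimal sketch in C of the two address mappings above. The bit widths (block offset, column, bank) are made-up illustrative values, not the parameters from lecture; the point is only which address bits select the bank, and therefore whether consecutive cache blocks land in the same bank (row interleaved) or in different banks (cache block interleaved).

<code c>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry, for illustration only. */
#define BLOCK_BITS 6   /* 64 B cache block */
#define BANK_BITS  3   /* 8 banks          */
#define COL_BITS   8   /* column bits above the block offset */

/* Row interleaved: row | bank | column | block offset.
 * Consecutive cache blocks stay in the same row of the same bank. */
static void map_row_interleaved(uint64_t addr,
                                uint64_t *row, uint64_t *bank, uint64_t *col)
{
    uint64_t a = addr >> BLOCK_BITS;
    *col  = a & ((1u << COL_BITS) - 1);   a >>= COL_BITS;
    *bank = a & ((1u << BANK_BITS) - 1);  a >>= BANK_BITS;
    *row  = a;
}

/* Cache block interleaved: row | column | bank | block offset.
 * Consecutive cache blocks go to different banks (more bank-level parallelism). */
static void map_block_interleaved(uint64_t addr,
                                  uint64_t *row, uint64_t *bank, uint64_t *col)
{
    uint64_t a = addr >> BLOCK_BITS;
    *bank = a & ((1u << BANK_BITS) - 1);  a >>= BANK_BITS;
    *col  = a & ((1u << COL_BITS) - 1);   a >>= COL_BITS;
    *row  = a;
}

int main(void)
{
    uint64_t row, bank, col;
    for (uint64_t addr = 0; addr < 4 * 64; addr += 64) {   /* 4 consecutive blocks */
        map_row_interleaved(addr, &row, &bank, &col);
        printf("row-interleaved   addr %4llu -> row %llu bank %llu col %llu\n",
               (unsigned long long)addr, (unsigned long long)row,
               (unsigned long long)bank, (unsigned long long)col);
        map_block_interleaved(addr, &row, &bank, &col);
        printf("block-interleaved addr %4llu -> row %llu bank %llu col %llu\n",
               (unsigned long long)addr, (unsigned long long)row,
               (unsigned long long)bank, (unsigned long long)col);
    }
    return 0;
}
</code>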
===== Lecture 23 (3/28 Fri.) =====

  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifier
    * How do they work?
  * Double data rate (DDR)
  * Subarray
  * RowClone
    * Moves bulk data from one row to another
    * Lower latency and bandwidth cost when copying or zeroing out data
  * TL-DRAM (see the latency example after this list)
    * Far segment
    * Near segment
    * What causes the long latency?
    * Benefit of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
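
A small back-of-the-envelope calculation of why the near segment helps: expected access latency as a function of the near-segment hit rate. The latencies below are made-up numbers for illustration, not the values from the TL-DRAM work.

<code c>
#include <stdio.h>

/* Hypothetical latencies (ns), illustrative only. */
#define T_NEAR 8.0    /* near segment: short bitline, low latency   */
#define T_FAR  14.0   /* far segment: long bitline, higher latency  */

/* Expected access latency if a fraction p_near of accesses hit the
 * near segment (e.g., because hot rows are kept/cached there). */
static double expected_latency(double p_near)
{
    return p_near * T_NEAR + (1.0 - p_near) * T_FAR;
}

int main(void)
{
    for (double p = 0.0; p <= 1.0; p += 0.25)
        printf("near-segment hit rate %.2f -> avg latency %.2f ns\n",
               p, expected_latency(p));
    return 0;
}
</code>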
===== Lecture 24 (3/31 Mon.) =====

  * Memory controller
    * Different commands
  * Memory scheduler
    * Determines the order of requests to be issued to DRAM
    * Age/hit-miss status/type (load/store/prefetch/from GPU/from CPU)/criticality
  * Row buffer
    * Hit/conflict
    * Open/closed row
    * Open row policy
    * Closed row policy
    * Tradeoffs between open and closed row policies
      * If the program has high row buffer locality, an open-row policy might benefit more
      * A closed-row policy services row misses faster
  * Bank conflict
  * Interference from different applications/threads
    * Different programs/processes/threads interfere with each other
      * Introduces more row buffer/bank conflicts
    * The memory scheduler has to manage this interference
    * Memory hog problems
  * Interference in the data/command bus
  * FR-FCFS (see the sketch after this list)
    * Why does FR-FCFS make sense?
      * A row buffer hit has lower latency
    * Issues with FR-FCFS
      * Unfairness
  * STFM
    * Fairness issues in memory scheduling
    * How does STFM calculate fairness and slowdown?
      * How to estimate the slowdown relative to running alone
    * Definition of fairness (based on STFM; different papers/areas define fairness differently)
  * PAR-BS
    * Parallelism in programs
    * Interference across banks
    * How to form a batch
    * How to determine ranking between batches/within a batch
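
A minimal sketch of the FR-FCFS picking rule mentioned above: prefer row buffer hits, then the oldest request. The queue size, bank count, and request fields are simplified assumptions; a real scheduler works at the DRAM command level and also respects timing constraints.

<code c>
#include <stdint.h>
#include <stdio.h>

/* Toy request queue entry: which bank/row it needs and when it arrived. */
struct request {
    int      bank;
    uint64_t row;
    uint64_t arrival_time;   /* smaller = older */
    int      valid;
};

#define NBANKS 8
#define QSIZE  16

/* FR-FCFS: among queued requests, first prefer row buffer hits
 * (the requested row is already open in its bank), then the oldest. */
static int frfcfs_pick(struct request q[QSIZE],
                       const uint64_t open_row[NBANKS],
                       const int row_valid[NBANKS])
{
    int best = -1, best_hit = 0;
    for (int i = 0; i < QSIZE; i++) {
        if (!q[i].valid)
            continue;
        int hit = row_valid[q[i].bank] && open_row[q[i].bank] == q[i].row;
        if (best < 0 ||
            hit > best_hit ||                                   /* prefer hits  */
            (hit == best_hit &&
             q[i].arrival_time < q[best].arrival_time)) {       /* then oldest  */
            best = i;
            best_hit = hit;
        }
    }
    return best;   /* index of the request to service, or -1 if the queue is empty */
}

int main(void)
{
    struct request q[QSIZE] = {
        { .bank = 0, .row = 7, .arrival_time = 1, .valid = 1 },
        { .bank = 0, .row = 3, .arrival_time = 2, .valid = 1 },  /* row 3 is open */
    };
    uint64_t open_row[NBANKS]  = { 3 };
    int      row_valid[NBANKS] = { 1 };
    /* Prints 1: the younger row hit is picked over the older row miss. */
    printf("picked request %d\n", frfcfs_pick(q, open_row, row_valid));
    return 0;
}
</code>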
===== Lecture 25 (4/2 Wed.) =====

  * Latency sensitivity
    * Performance drops a lot when the memory request latency is long
  * TCM
    * Tradeoff between throughput and fairness
    * Latency-sensitive cluster (non-intensive cluster)
      * Ranking based on memory intensity
    * Bandwidth-intensive cluster
      * Round robin within the cluster
    * Generally, the latency-sensitive cluster has higher priority
    * Provides robust fairness and throughput
    * Complexity of TCM?
  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keeps track of each application's characteristics
        * Dedicating a channel might waste bandwidth
        * Needs OS support to determine the channel bits
    * Source throttling
      * A controller throttles the cores depending on the performance target
      * Example: Fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold-based solution: hard to configure
    * App/thread scheduling
      * Critical threads usually stall overall progress
    * Designing a DRAM controller
      * Has to handle the normal DRAM operations
        * Read/write/refresh/all the timing constraints
      * Keep track of resources
      * Assign priorities to different requests
      * Manage requests to banks
    * Self-optimizing controller
      * Uses machine learning to improve the DRAM controller
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Refresh takes longer with bigger DRAM
    * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner
      * As opposed to burst refresh (long pause time)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
      * Profile the rows
        * High temperature changes the retention time: needs online profiling
  * Bloom filter (see the sketch after this list)
    * Represents set membership
    * Approximate
    * Can return false positives (but no false negatives)
      * More/better hash functions help reduce false positives
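
A toy Bloom filter sketch, e.g. for remembering which rows need frequent refresh as in RAIDR. The filter size and the two hash functions are arbitrary choices for illustration, not anything from the lecture or paper.

<code c>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Tiny Bloom filter: m = 1024 bits, k = 2 hash functions (arbitrary parameters). */
#define M_BITS 1024
static uint8_t filter[M_BITS / 8];

static uint32_t hash1(uint64_t x) { x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33; return (uint32_t)(x % M_BITS); }
static uint32_t hash2(uint64_t x) { x ^= x >> 29; x *= 0xc4ceb9fe1a85ec53ULL; x ^= x >> 32; return (uint32_t)(x % M_BITS); }

static void set_bit(uint32_t b) { filter[b / 8] |= (uint8_t)(1u << (b % 8)); }
static int  get_bit(uint32_t b) { return (filter[b / 8] >> (b % 8)) & 1; }

static void insert(uint64_t row) { set_bit(hash1(row)); set_bit(hash2(row)); }

/* 1 = "maybe in the set" (could be a false positive),
 * 0 = "definitely not in the set" (no false negatives). */
static int maybe_member(uint64_t row) { return get_bit(hash1(row)) && get_bit(hash2(row)); }

int main(void)
{
    memset(filter, 0, sizeof filter);
    insert(42);                                  /* e.g., "row 42 needs frequent refresh" */
    printf("row 42: %d\n", maybe_member(42));    /* 1 */
    printf("row 77: %d\n", maybe_member(77));    /* almost certainly 0 */
    return 0;
}
</code>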
===== Lecture 26 (4/7 Mon.) =====

  * Tolerating latency can be costly
    * The instruction window is complex
      * The benefit also diminishes
    * Designing the buffers can be complex
    * A simpler way to tolerate memory latency is desirable
  * Different sources that cause the core to stall in OoO
    * Cache misses
    * Note that a stall happens if the inst. window is full
  * Scaling the instruction window size is hard
    * It is better (less complex) to make the window more efficient
  * Runahead execution (see the toy model after this list)
    * Tries to obtain MLP w/o increasing the instruction window
    * Run ahead (i.e., execute ahead) when there is a long memory instruction
      * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
      * Execute future instructions to generate accurate prefetches
      * Allows future data to be in the cache
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing on the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
    * Cost and benefit of runahead execution (slide 27)
    * Runahead can have inefficiencies
      * Runahead periods that are useless
        * Get rid of useless/inefficient periods
    * What if there is a dependent cache miss?
      * Cannot be parallelized in vanilla runahead
      * Can predict the value of the dependent load
        * How to predict the address of the load
          * Delta value information
          * Stride predictor
          * AVD prediction
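
A toy model of why runahead helps, not a description of the actual hardware: it only counts cycles for a made-up miss pattern and assumes every miss discovered while running ahead turns into a timely prefetch (it ignores dependent misses, runahead overheads, and wrong-path effects).

<code c>
#include <stdio.h>
#include <stdbool.h>

#define N        16       /* accesses in the toy trace  */
#define MISS_LAT 200      /* cycles for a cache miss    */
#define HIT_LAT  1

/* Toy trace: every 4th access misses in the cache. */
static bool misses(int i) { return (i % 4) == 0; }

int main(void)
{
    /* Baseline in-order behavior: the core stalls for every miss. */
    long baseline = 0;
    for (int i = 0; i < N; i++)
        baseline += misses(i) ? MISS_LAT : HIT_LAT;

    /* Runahead: while a miss is outstanding, the core pre-executes the
     * following instructions and prefetches their misses, so those later
     * misses overlap with the current one instead of adding full latency. */
    bool prefetched[N] = { false };
    long runahead = 0;
    for (int i = 0; i < N; i++) {
        if (misses(i) && !prefetched[i]) {
            runahead += MISS_LAT;
            /* During the MISS_LAT-cycle stall we assume the core can run
             * ahead ~MISS_LAT instructions and prefetch every miss it finds. */
            for (int j = i + 1; j < N && j <= i + MISS_LAT; j++)
                if (misses(j))
                    prefetched[j] = true;
        } else {
            runahead += HIT_LAT;
        }
    }

    printf("baseline: %ld cycles, runahead: %ld cycles\n", baseline, runahead);
    return 0;
}
</code>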
===== Lecture 27 (4/9 Wed.) =====

  * Questions regarding prefetching
    * What to prefetch
    * When to prefetch
    * How do we prefetch
    * Where to prefetch from
  * Prefetching can cause thrashing (evict a useful block)
  * Prefetching can also be useless (not being used)
    * Needs to be efficient
  * Can cause memory bandwidth problems in GPUs
  * Prefetch the whole block, more than one block, or a subblock?
    * Each of these has pros and cons
    * Bigger prefetches are more likely to waste bandwidth
    * Commonly done at cache block granularity
  * Prefetch accuracy: fraction of useful prefetches out of all prefetches
  * Prefetchers usually predict based on
    * Past knowledge
    * Compiler hints
  * The prefetcher has to prefetch at the right time
    * A prefetch that is too early might get evicted before use
      * It might also evict other useful data
    * A prefetch that is too late does not hide the whole memory latency
  * Previous prefetches at the same PC can be used as history
  * Previous demand requests are also good information to use for prefetching
  * Prefetch buffer
    * Place prefetched data there to avoid thrashing
      * Can treat demand/prefetch requests separately
      * More complex
  * Generally, demand blocks are more important
    * This means eviction should prefer prefetched blocks over demand blocks
  * Tradeoffs in where we place the prefetcher
    * Look at L1 hits and misses
    * Look at L1 misses only
    * Look at L2 misses
    * Different access patterns affect accuracy
      * Tradeoff between handling more requests (seeing L1 hits and misses) and less visibility (seeing only L2 misses)
  * Software vs. hardware vs. execution-based prefetching
    * Software: the ISA provides prefetch instructions; software utilizes them
      * What information is useful
      * How to make sure the prefetch is timely
      * What if you have a pointer-based structure
        * Not easy to prefetch pointer chasing (in many cases the work between prefetches is too short to issue the next prefetch early enough)
          * Can be mitigated by hinting the next-next and/or next-next-next address
    * Hardware: identify the pattern and prefetch
    * Execution driven: opportunistically try to prefetch (runahead, dual-core execution)
  * Stride prefetcher (see the sketch after this list)
    * Predicts strides, which are common in many programs
    * Cache block based or instruction based
  * Stream buffer design
    * Buffers the stream of accesses (next addresses)
    * Uses that information to prefetch
  * What affects prefetcher performance
    * Prefetch distance
      * How far ahead should we prefetch?
    * Prefetch degree
      * How many prefetches do we issue?
  * Prefetcher performance
    * Coverage
      * Out of all demand misses, how many are covered by prefetches
    * Accuracy
      * Out of all prefetch requests, how many are actually used
    * Timeliness
      * How much of the memory latency do the prefetches hide
  * Feedback-directed prefetcher
    * Uses the outcome of the prefetcher as feedback to the prefetcher
      * Using accuracy, timeliness, and pollution information
  * Markov prefetcher
    * Prefetches based on previous history
    * Uses a Markov model to predict
    * Pros: can cover arbitrary patterns (easy for linked list or tree traversals)
    * Downside: high cost; cannot help with compulsory misses (no history)
  * Content-directed prefetching
    * Identify pointers in the fetched content (which are used as addresses to prefetch)
    * Not very efficient (hard to figure out which words are pointers)
      * Software can give hints
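
A minimal PC-indexed stride prefetcher sketch. The table size, confidence threshold, and prefetch degree are made-up parameters; instead of driving a cache it just prints the addresses it would prefetch once the stride looks stable.

<code c>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE  64
#define CONF_MAX     3
#define CONF_THRESH  2
#define DEGREE       2        /* how many blocks ahead to prefetch */

struct entry {
    uint64_t pc;
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;
    int      valid;
};

static struct entry table[TABLE_SIZE];

/* Called on every demand access from instruction 'pc' to address 'addr'. */
static void train_and_prefetch(uint64_t pc, uint64_t addr)
{
    struct entry *e = &table[pc % TABLE_SIZE];

    if (!e->valid || e->pc != pc) {                 /* allocate a new entry */
        *e = (struct entry){ .pc = pc, .last_addr = addr, .valid = 1 };
        return;
    }

    int64_t stride = (int64_t)addr - (int64_t)e->last_addr;
    if (stride == e->stride && stride != 0) {
        if (e->confidence < CONF_MAX) e->confidence++;
    } else {                                        /* stride changed: retrain */
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;

    if (e->confidence >= CONF_THRESH)               /* stride looks stable */
        for (int d = 1; d <= DEGREE; d++)
            printf("prefetch 0x%llx\n",
                   (unsigned long long)(addr + (uint64_t)(d * e->stride)));
}

int main(void)
{
    /* Streaming loads from one (hypothetical) PC with a 64 B stride. */
    for (uint64_t a = 0x1000; a < 0x1000 + 8 * 64; a += 64)
        train_and_prefetch(0x400123, a);
    return 0;
}
</code>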
===== Lecture 28 (4/14 Mon.) =====

  * Execution-based prefetchers
    * Helper thread/speculative thread
      * Use another thread to pre-execute a program
    * Can be software based or hardware based
    * Discover misses before the main program does (to prefetch data in a timely manner)
    * How do you construct the helper thread?
    * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9
  * Benefits of multiprocessors
    * Improve performance without significantly increasing power consumption
    * Better cost efficiency and easier to scale
    * Improve dependability (in case one core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Data level parallelism
    * Task level parallelism
  * Task level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but no single task gets a speedup; also simpler to implement)
  * Loosely coupled multiprocessors
    * No shared global address space
      * Message passing to communicate between processors
    * Simple to manage memory
  * Tightly coupled multiprocessors
    * Shared global address space
    * Need to ensure consistency of data
  * Switch-on-event multithreading
    * Switch to another context when there is an event (for example, a cache miss)
  * Simultaneous multithreading
    * Dispatch instructions from multiple threads at the same time
  * Amdahl's law (see the worked example after this list)
    * Maximum speedup
    * The parallel portion is not perfect
      * Serial bottleneck
      * Synchronization cost
      * Load imbalance
        * Some threads have more work and take longer to reach the sync. point
  * Issues in parallel programming
    * Correctness
    * Synchronization
    * Consistency
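
Amdahl's law as a worked example (the fraction p and the core count are made-up numbers, only to show how the serial part bounds the speedup):

<code latex>
% Speedup with N cores when a fraction p of the work is parallelizable
\mathrm{Speedup}(N) = \frac{1}{(1 - p) + \frac{p}{N}}

% Example: p = 0.95, N = 16
\mathrm{Speedup}(16) = \frac{1}{0.05 + 0.95/16} \approx 9.1

% Even with infinitely many cores, the serial fraction caps the speedup:
\lim_{N \to \infty} \mathrm{Speedup}(N) = \frac{1}{1 - p} = 20
</code>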
===== Lecture 29 (4/16 Wed.) =====

  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Weak consistency: global ordering only at synchronization points
    * The programmer hints where the synchronization points are
  * Total store order model: global ordering only for stores
  * Cache coherence
    * Can be done at the software level or the hardware level
  * Coherence protocol
    * Needs to ensure that all processors see and update the correct state of a cache block
    * Needs to make sure that writes get propagated and serialized
    * Simple protocols are not scalable (one point of synchronization)
  * Update vs. invalidate
    * For invalidate, a write removes all other copies, so only the writer retains a valid copy
      * Can lead to ping-ponging (tons of reads/writes from several processors)
    * For update, the bus becomes the bottleneck
  * Snoopy bus
    * Bus based, single point of serialization
    * More efficient with a small number of processors
    * All caches snoop other caches' read/write requests to keep cache blocks coherent
  * Directory based
    * Single point of serialization per block
    * The directory coordinates coherence
    * More scalable
    * The directory keeps track of where the copies of each block reside
      * Supplies data on a read
      * Invalidates copies of the block on a write
      * Has an exclusive state
  * MSI coherence protocol (see the sketch after this list)
    * Slides 56-57
    * Consumes bus bandwidth (hence the need for an "exclusive" state)
  * MESI coherence protocol
    * Adds an exclusive state (this is the only cached copy, and it is clean) to MSI
  * Tradeoffs between snooping and directory based
    * Slide 71 has a good summary of this
  * MOESI
    * Improvement over the MESI protocol
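
A minimal sketch of the MSI state transitions for one block in one cache. Only the state machine is shown; data movement, write-backs, and the actual bus or directory messages that real hardware exchanges are omitted.

<code c>
#include <stdio.h>

enum msi { INVALID, SHARED, MODIFIED };

enum event {
    PR_RD,     /* processor read                           */
    PR_WR,     /* processor write                          */
    BUS_RD,    /* another cache reads the block (snooped)  */
    BUS_RDX    /* another cache writes the block (snooped) */
};

static enum msi next_state(enum msi s, enum event e)
{
    switch (s) {
    case INVALID:
        if (e == PR_RD) return SHARED;      /* fetch block; others may share   */
        if (e == PR_WR) return MODIFIED;    /* issue BusRdX to invalidate others */
        return INVALID;
    case SHARED:
        if (e == PR_WR)   return MODIFIED;  /* upgrade: invalidate other copies */
        if (e == BUS_RDX) return INVALID;   /* someone else wants to write      */
        return SHARED;                      /* PR_RD / BUS_RD: stay shared      */
    case MODIFIED:
        if (e == BUS_RD)  return SHARED;    /* write back, then share           */
        if (e == BUS_RDX) return INVALID;   /* write back, give up the block    */
        return MODIFIED;                    /* local reads/writes hit           */
    }
    return INVALID;
}

int main(void)
{
    const char *name[] = { "I", "S", "M" };
    enum msi s = INVALID;
    enum event trace[] = { PR_RD, PR_WR, BUS_RD, BUS_RDX, PR_RD };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: %s\n", i, name[s]);
    }
    return 0;
}
</code>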
===== Lecture 30 (4/18 Fri.) =====

  * Interference
  * Complexity of the memory scheduler
    * Ranking/prioritization has a cost
    * A complex scheduler has higher latency
  * Performance metrics for multicore/multithreaded applications
    * Speedup
    * Slowdown
    * Harmonic vs. weighted speedup
  * Fairness metric
    * Maximum slowdown
      * Why does it make sense?
      * Any scenario where it does not make sense?
  * Predictable performance
    * Why is it important?
      * In a server environment, different jobs run on the same server
      * In a mobile environment, there are multiple sources that can slow down other sources
    * How to relate slowdown to request service rate
    * MISE: soft slowdown guarantee
  * BDI
    * Memory wall
      * What is the concern regarding the memory wall?
    * Size of the cache on the die (CPU die)
    * One possible solution: cache compression
      * What are the problems with existing cache compression mechanisms?
        * Some are too complex
        * Decompression is on the critical path
          * Need to decompress when reading the data -> decompression should not be on the critical path
          * An important factor for performance
    * Software compression is not good enough to compress everything
    * Zero value compression
      * Simple
      * Good compression ratio
      * What if the data does not have many zeroes?
    * Frequent value compression
      * Some values appear frequently
      * Simple and good compression ratio
      * Has to profile
      * Decompression is complex
    * Frequent pattern compression
      * Still too complex in terms of decompression
    * Base-delta compression (see the sketch after this list)
      * Easy to decompress but retains the benefit of compression
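
A sketch of the base+delta idea behind BDI: represent a cache line as one base value plus narrow per-word deltas. Real BDI tries several base and delta sizes (and an implicit zero base); this sketch checks only one configuration, with a made-up example line of clustered values.

<code c>
#include <stdint.h>
#include <stdio.h>

#define WORDS 8   /* 8 x 8 B words = one 64 B cache line */

struct compressed {
    uint64_t base;
    int8_t   delta[WORDS];
};

/* Returns 1 and fills 'out' if every word fits in base + signed 8-bit delta
 * (~16 B instead of 64 B); returns 0 if the line must stay uncompressed. */
static int compress_base_delta(const uint64_t line[WORDS], struct compressed *out)
{
    out->base = line[0];
    for (int i = 0; i < WORDS; i++) {
        int64_t d = (int64_t)(line[i] - out->base);
        if (d < INT8_MIN || d > INT8_MAX)
            return 0;                           /* deltas too large for this config */
        out->delta[i] = (int8_t)d;
    }
    return 1;
}

/* Decompression is one addition per word: cheap, which is the whole point. */
static uint64_t decompress_word(const struct compressed *c, int i)
{
    return c->base + (uint64_t)(int64_t)c->delta[i];
}

int main(void)
{
    /* e.g., nearby pointers: values cluster around a common base */
    uint64_t line[WORDS] = { 0x7fff1000, 0x7fff1008, 0x7fff1010, 0x7fff1018,
                             0x7fff1020, 0x7fff1028, 0x7fff1030, 0x7fff1038 };
    struct compressed c;
    if (compress_base_delta(line, &c))
        printf("compressible; word 5 decompresses to 0x%llx\n",
               (unsigned long long)decompress_word(&c, 5));
    return 0;
}
</code>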
===== Lecture 31 (4/28 Mon.) =====

  * Directory-based cache coherence
    * Each directory has to handle validation/invalidation
    * Extra cost of synchronization
    * Need to ensure race conditions are resolved
  * Interconnection
    * Topology
      * Bus
      * Mesh
        * Torus
      * Tree
      * Butterfly
      * Ring
        * Bi-directional ring
          * More scalable
        * Hierarchical ring
          * Even more scalable
          * More complex
      * Crossbar
      * etc.
    * Circuit switching
    * Multistage networks
      * Butterfly
      * Delta network
    * Handling contention
      * Buffering vs. dropping/deflection (no buffering)
    * Routing algorithms
      * Handling deadlock
      * X-Y routing (see the sketch after this list)
        * Turn model (to avoid deadlocks)
      * Add more buffering for an escape path
      * Oblivious routing
        * Can take different paths
          * DOR between each intermediate location
        * Balances network load
      * Adaptive routing
        * Uses the state of the network to determine the route
          * Aware of local and/or global congestion
        * Non-minimal adaptive routing can have livelocks
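
A minimal sketch of X-Y (dimension-order) routing on a 2D mesh: always finish moving in X before moving in Y. Forbidding the Y-to-X turn removes the cyclic channel dependences that could cause deadlock. Coordinates and the example source/destination are arbitrary.

<code c>
#include <stdio.h>

enum dir { LOCAL, EAST, WEST, NORTH, SOUTH };

/* Route computation at one router: X dimension first, then Y. */
static enum dir xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return EAST;
    if (cur_x > dst_x) return WEST;
    if (cur_y < dst_y) return NORTH;
    if (cur_y > dst_y) return SOUTH;
    return LOCAL;                     /* arrived: eject to the local node */
}

int main(void)
{
    const char *name[] = { "local", "east", "west", "north", "south" };
    int x = 0, y = 0;                 /* route a packet from (0,0) to (2,1) */
    for (;;) {
        enum dir d = xy_route(x, y, 2, 1);
        printf("at (%d,%d) -> %s\n", x, y, name[d]);
        if (d == LOCAL) break;
        if (d == EAST)  x++;
        if (d == WEST)  x--;
        if (d == NORTH) y++;
        if (d == SOUTH) y--;
    }
    return 0;
}
</code>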