    * Cached misses cache block
    * Prevent ping-ponging
  * Pseudo associativity
    * Simpler way to implement associative cache
  * Skewed assoc. cache
    * What information goes into the MSHR?
    * When do you access the MSHR?

===== Lecture 22 (3/26 Wed.) =====

  * Multi-porting
    * Virtual multi-porting
      * Time-shares the port; cheap but not very scalable
    * True multi-porting
  * Multiple cache copies
  * Banking
    * Can have bank conflicts
    * Extra interconnect across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regFile in GPUs is also banked, but mainly for the purpose of reducing complexity)
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
  * DRAM (1T1C) vs. SRAM (6T)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effect of address mapping on memory interleaving
    * Effect of the program's memory access pattern on interleaving
  * DRAM Bank
    * Minimizes the cost of interleaving (banks share the data bus and the command bus)
  * DRAM Rank
    * A bundle of chips operated together (minimizes the cost per chip)
  * DRAM Channel
    * An interface to DRAM, each with its own ranks/banks
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data into DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits
  * Address mapping (see the C sketch after this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order of requests that will be serviced in DRAM
      * Request queues hold pending requests
      * A request is sent whenever it can be issued to its bank
      * Determines which command (across banks) should be sent to DRAM
  * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS
      * Capped FR-FCFS: FR-FCFS with a timeout
      * Usually this is done at the command level (read/write commands and precharge/activate commands)
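
A minimal sketch in C of the two address mappings above. The bit widths (block offset, column, bank) are made-up illustrative values, not the parameters from lecture; the point is only which address bits select the bank, and therefore whether consecutive cache blocks land in the same bank (row interleaved) or in different banks (cache block interleaved).

<code c>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical geometry, for illustration only. */
#define BLOCK_BITS 6   /* 64 B cache block */
#define BANK_BITS  3   /* 8 banks          */
#define COL_BITS   8   /* column bits above the block offset */

/* Row interleaved: row | bank | column | block offset.
 * Consecutive cache blocks stay in the same row of the same bank. */
static void map_row_interleaved(uint64_t addr,
                                uint64_t *row, uint64_t *bank, uint64_t *col)
{
    uint64_t a = addr >> BLOCK_BITS;
    *col  = a & ((1u << COL_BITS) - 1);   a >>= COL_BITS;
    *bank = a & ((1u << BANK_BITS) - 1);  a >>= BANK_BITS;
    *row  = a;
}

/* Cache block interleaved: row | column | bank | block offset.
 * Consecutive cache blocks go to different banks (more bank-level parallelism). */
static void map_block_interleaved(uint64_t addr,
                                  uint64_t *row, uint64_t *bank, uint64_t *col)
{
    uint64_t a = addr >> BLOCK_BITS;
    *bank = a & ((1u << BANK_BITS) - 1);  a >>= BANK_BITS;
    *col  = a & ((1u << COL_BITS) - 1);   a >>= COL_BITS;
    *row  = a;
}

int main(void)
{
    uint64_t row, bank, col;
    for (uint64_t addr = 0; addr < 4 * 64; addr += 64) {   /* 4 consecutive blocks */
        map_row_interleaved(addr, &row, &bank, &col);
        printf("row-interleaved   addr %4llu -> row %llu bank %llu col %llu\n",
               (unsigned long long)addr, (unsigned long long)row,
               (unsigned long long)bank, (unsigned long long)col);
        map_block_interleaved(addr, &row, &bank, &col);
        printf("block-interleaved addr %4llu -> row %llu bank %llu col %llu\n",
               (unsigned long long)addr, (unsigned long long)row,
               (unsigned long long)bank, (unsigned long long)col);
    }
    return 0;
}
</code>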
===== Lecture 23 (3/28 Fri.) =====

  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifier
    * How do they work?
  * Double data rate (DDR)
  * Subarray
  * RowClone
    * Moves bulk data from one row to another
    * Lower latency and bandwidth cost when copying or zeroing out data
  * TL-DRAM (see the latency example after this list)
    * Far segment
    * Near segment
    * What causes the long latency?
    * Benefit of TL-DRAM
      * TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
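
A small back-of-the-envelope calculation of why the near segment helps: expected access latency as a function of the near-segment hit rate. The latencies below are made-up numbers for illustration, not the values from the TL-DRAM work.

<code c>
#include <stdio.h>

/* Hypothetical latencies (ns), illustrative only. */
#define T_NEAR 8.0    /* near segment: short bitline, low latency   */
#define T_FAR  14.0   /* far segment: long bitline, higher latency  */

/* Expected access latency if a fraction p_near of accesses hit the
 * near segment (e.g., because hot rows are kept/cached there). */
static double expected_latency(double p_near)
{
    return p_near * T_NEAR + (1.0 - p_near) * T_FAR;
}

int main(void)
{
    for (double p = 0.0; p <= 1.0; p += 0.25)
        printf("near-segment hit rate %.2f -> avg latency %.2f ns\n",
               p, expected_latency(p));
    return 0;
}
</code>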
===== Lecture 24 (3/31 Mon.) =====

  * Memory controller
    * Different commands
  * Memory scheduler
    * Determines the order of requests to be issued to DRAM
    * Age/hit-miss status/type (load/store/prefetch/from GPU/from CPU)/criticality
  * Row buffer
    * Hit/conflict
    * Open/closed row
    * Open row policy
    * Closed row policy
    * Tradeoffs between open and closed row policies
      * If the program has high row buffer locality, an open-row policy might benefit more
      * A closed-row policy services row misses faster
  * Bank conflict
  * Interference from different applications/threads
    * Different programs/processes/threads interfere with each other
      * Introduces more row buffer/bank conflicts
    * The memory scheduler has to manage this interference
    * Memory hog problems
  * Interference in the data/command bus
  * FR-FCFS (see the sketch after this list)
    * Why does FR-FCFS make sense?
      * A row buffer hit has lower latency
    * Issues with FR-FCFS
      * Unfairness
  * STFM
    * Fairness issues in memory scheduling
    * How does STFM calculate fairness and slowdown?
      * How to estimate the slowdown relative to running alone
    * Definition of fairness (based on STFM; different papers/areas define fairness differently)
  * PAR-BS
    * Parallelism in programs
    * Interference across banks
    * How to form a batch
    * How to determine ranking between batches/within a batch
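
A minimal sketch of the FR-FCFS picking rule mentioned above: prefer row buffer hits, then the oldest request. The queue size, bank count, and request fields are simplified assumptions; a real scheduler works at the DRAM command level and also respects timing constraints.

<code c>
#include <stdint.h>
#include <stdio.h>

/* Toy request queue entry: which bank/row it needs and when it arrived. */
struct request {
    int      bank;
    uint64_t row;
    uint64_t arrival_time;   /* smaller = older */
    int      valid;
};

#define NBANKS 8
#define QSIZE  16

/* FR-FCFS: among queued requests, first prefer row buffer hits
 * (the requested row is already open in its bank), then the oldest. */
static int frfcfs_pick(struct request q[QSIZE],
                       const uint64_t open_row[NBANKS],
                       const int row_valid[NBANKS])
{
    int best = -1, best_hit = 0;
    for (int i = 0; i < QSIZE; i++) {
        if (!q[i].valid)
            continue;
        int hit = row_valid[q[i].bank] && open_row[q[i].bank] == q[i].row;
        if (best < 0 ||
            hit > best_hit ||                                   /* prefer hits  */
            (hit == best_hit &&
             q[i].arrival_time < q[best].arrival_time)) {       /* then oldest  */
            best = i;
            best_hit = hit;
        }
    }
    return best;   /* index of the request to service, or -1 if the queue is empty */
}

int main(void)
{
    struct request q[QSIZE] = {
        { .bank = 0, .row = 7, .arrival_time = 1, .valid = 1 },
        { .bank = 0, .row = 3, .arrival_time = 2, .valid = 1 },  /* row 3 is open */
    };
    uint64_t open_row[NBANKS]  = { 3 };
    int      row_valid[NBANKS] = { 1 };
    /* Prints 1: the younger row hit is picked over the older row miss. */
    printf("picked request %d\n", frfcfs_pick(q, open_row, row_valid));
    return 0;
}
</code>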
===== Lecture 25 (4/2 Wed.) =====

  * Latency sensitivity
    * Performance drops a lot when the memory request latency is long
  * TCM
    * Tradeoff between throughput and fairness
    * Latency-sensitive cluster (non-intensive cluster)
      * Ranking based on memory intensity
    * Bandwidth-intensive cluster
      * Round robin within the cluster
    * Generally, the latency-sensitive cluster has higher priority
    * Provides robust fairness and throughput
    * Complexity of TCM?
  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keeps track of each application's characteristics
        * Dedicating a channel might waste bandwidth
        * Needs OS support to determine the channel bits
    * Source throttling
      * A controller throttles the cores depending on the performance target
      * Example: Fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold-based solution: hard to configure
    * App/thread scheduling
      * Critical threads usually stall overall progress
    * Designing a DRAM controller
      * Has to handle the normal DRAM operations
        * Read/write/refresh/all the timing constraints
      * Keep track of resources
      * Assign priorities to different requests
      * Manage requests to banks
    * Self-optimizing controller
      * Uses machine learning to improve the DRAM controller
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Refresh takes longer with bigger DRAM
    * Distributed refresh: stagger refreshes over the 64 ms window in a distributed manner
      * As opposed to burst refresh (long pause time)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
      * Profile the rows
        * High temperature changes the retention time: needs online profiling
  * Bloom filter (see the sketch after this list)
    * Represents set membership
    * Approximate
    * Can return false positives (but no false negatives)
      * More/better hash functions help reduce false positives
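
A toy Bloom filter sketch, e.g. for remembering which rows need frequent refresh as in RAIDR. The filter size and the two hash functions are arbitrary choices for illustration, not anything from the lecture or paper.

<code c>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Tiny Bloom filter: m = 1024 bits, k = 2 hash functions (arbitrary parameters). */
#define M_BITS 1024
static uint8_t filter[M_BITS / 8];

static uint32_t hash1(uint64_t x) { x ^= x >> 33; x *= 0xff51afd7ed558ccdULL; x ^= x >> 33; return (uint32_t)(x % M_BITS); }
static uint32_t hash2(uint64_t x) { x ^= x >> 29; x *= 0xc4ceb9fe1a85ec53ULL; x ^= x >> 32; return (uint32_t)(x % M_BITS); }

static void set_bit(uint32_t b) { filter[b / 8] |= (uint8_t)(1u << (b % 8)); }
static int  get_bit(uint32_t b) { return (filter[b / 8] >> (b % 8)) & 1; }

static void insert(uint64_t row) { set_bit(hash1(row)); set_bit(hash2(row)); }

/* 1 = "maybe in the set" (could be a false positive),
 * 0 = "definitely not in the set" (no false negatives). */
static int maybe_member(uint64_t row) { return get_bit(hash1(row)) && get_bit(hash2(row)); }

int main(void)
{
    memset(filter, 0, sizeof filter);
    insert(42);                                  /* e.g., "row 42 needs frequent refresh" */
    printf("row 42: %d\n", maybe_member(42));    /* 1 */
    printf("row 77: %d\n", maybe_member(77));    /* almost certainly 0 */
    return 0;
}
</code>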
===== Lecture 26 (4/7 Mon.) =====

  * Tolerating latency can be costly
    * The instruction window is complex
      * The benefit also diminishes
    * Designing the buffers can be complex
    * A simpler way to tolerate memory latency is desirable
  * Different sources that cause the core to stall in OoO
    * Cache misses
    * Note that a stall happens if the inst. window is full
  * Scaling the instruction window size is hard
    * It is better (less complex) to make the window more efficient
  * Runahead execution (see the toy model after this list)
    * Tries to obtain MLP w/o increasing the instruction window
    * Run ahead (i.e., execute ahead) when there is a long memory instruction
      * A long memory instruction stalls the processor for a while anyway, so it's better to make use of that time
      * Execute future instructions to generate accurate prefetches
      * Allows future data to be in the cache
    * How to support runahead execution?
      * Need a way to checkpoint the state when entering runahead mode
      * How to make executing on the wrong path useful?
      * Need a runahead cache to handle loads/stores in runahead mode (since they are speculative)
    * Cost and benefit of runahead execution (slide 27)
    * Runahead can have inefficiencies
      * Runahead periods that are useless
        * Get rid of useless/inefficient periods
    * What if there is a dependent cache miss?
      * Cannot be parallelized in vanilla runahead
      * Can predict the value of the dependent load
        * How to predict the address of the load
          * Delta value information
          * Stride predictor
          * AVD prediction
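
A toy model of why runahead helps, not a description of the actual hardware: it only counts cycles for a made-up miss pattern and assumes every miss discovered while running ahead turns into a timely prefetch (it ignores dependent misses, runahead overheads, and wrong-path effects).

<code c>
#include <stdio.h>
#include <stdbool.h>

#define N        16       /* accesses in the toy trace  */
#define MISS_LAT 200      /* cycles for a cache miss    */
#define HIT_LAT  1

/* Toy trace: every 4th access misses in the cache. */
static bool misses(int i) { return (i % 4) == 0; }

int main(void)
{
    /* Baseline in-order behavior: the core stalls for every miss. */
    long baseline = 0;
    for (int i = 0; i < N; i++)
        baseline += misses(i) ? MISS_LAT : HIT_LAT;

    /* Runahead: while a miss is outstanding, the core pre-executes the
     * following instructions and prefetches their misses, so those later
     * misses overlap with the current one instead of adding full latency. */
    bool prefetched[N] = { false };
    long runahead = 0;
    for (int i = 0; i < N; i++) {
        if (misses(i) && !prefetched[i]) {
            runahead += MISS_LAT;
            /* During the MISS_LAT-cycle stall we assume the core can run
             * ahead ~MISS_LAT instructions and prefetch every miss it finds. */
            for (int j = i + 1; j < N && j <= i + MISS_LAT; j++)
                if (misses(j))
                    prefetched[j] = true;
        } else {
            runahead += HIT_LAT;
        }
    }

    printf("baseline: %ld cycles, runahead: %ld cycles\n", baseline, runahead);
    return 0;
}
</code>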
===== Lecture 27 (4/9 Wed.) =====

  * Questions regarding prefetching
    * What to prefetch
    * When to prefetch
    * How do we prefetch
    * Where to prefetch from
  * Prefetching can cause thrashing (evict a useful block)
  * Prefetching can also be useless (not being used)
    * Needs to be efficient
  * Can cause memory bandwidth problems in GPUs
  * Prefetch the whole block, more than one block, or a subblock?
    * Each of these has pros and cons
    * Bigger prefetches are more likely to waste bandwidth
    * Commonly done at cache block granularity
  * Prefetch accuracy: fraction of useful prefetches out of all prefetches
  * Prefetchers usually predict based on
    * Past knowledge
    * Compiler hints
  * The prefetcher has to prefetch at the right time
    * A prefetch that is too early might get evicted before use
      * It might also evict other useful data
    * A prefetch that is too late does not hide the whole memory latency
  * Previous prefetches at the same PC can be used as history
  * Previous demand requests are also good information to use for prefetching
  * Prefetch buffer
    * Place prefetched data there to avoid thrashing
      * Can treat demand/prefetch requests separately
      * More complex
  * Generally, demand blocks are more important
    * This means eviction should prefer prefetched blocks over demand blocks
  * Tradeoffs in where we place the prefetcher
    * Look at L1 hits and misses
    * Look at L1 misses only
    * Look at L2 misses
    * Different access patterns affect accuracy
      * Tradeoff between handling more requests (seeing L1 hits and misses) and less visibility (seeing only L2 misses)
  * Software vs. hardware vs. execution-based prefetching
    * Software: the ISA provides prefetch instructions; software utilizes them
      * What information is useful
      * How to make sure the prefetch is timely
      * What if you have a pointer-based structure
        * Not easy to prefetch pointer chasing (in many cases the work between prefetches is too short to issue the next prefetch early enough)
          * Can be mitigated by hinting the next-next and/or next-next-next address
    * Hardware: identify the pattern and prefetch
    * Execution driven: opportunistically try to prefetch (runahead, dual-core execution)
  * Stride prefetcher (see the sketch after this list)
    * Predicts strides, which are common in many programs
    * Cache block based or instruction based
  * Stream buffer design
    * Buffers the stream of accesses (next addresses)
    * Uses that information to prefetch
  * What affects prefetcher performance
    * Prefetch distance
      * How far ahead should we prefetch?
    * Prefetch degree
      * How many prefetches do we issue?
  * Prefetcher performance
    * Coverage
      * Out of all demand misses, how many are covered by prefetches
    * Accuracy
      * Out of all prefetch requests, how many are actually used
    * Timeliness
      * How much of the memory latency do the prefetches hide
  * Feedback-directed prefetcher
    * Uses the outcome of the prefetcher as feedback to the prefetcher
      * Using accuracy, timeliness, and pollution information
  * Markov prefetcher
    * Prefetches based on previous history
    * Uses a Markov model to predict
    * Pros: can cover arbitrary patterns (easy for linked list or tree traversals)
    * Downside: high cost; cannot help with compulsory misses (no history)
  * Content-directed prefetching
    * Identify pointers in the fetched content (which are used as addresses to prefetch)
    * Not very efficient (hard to figure out which words are pointers)
      * Software can give hints
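
A minimal PC-indexed stride prefetcher sketch. The table size, confidence threshold, and prefetch degree are made-up parameters; instead of driving a cache it just prints the addresses it would prefetch once the stride looks stable.

<code c>
#include <stdint.h>
#include <stdio.h>

#define TABLE_SIZE  64
#define CONF_MAX     3
#define CONF_THRESH  2
#define DEGREE       2        /* how many blocks ahead to prefetch */

struct entry {
    uint64_t pc;
    uint64_t last_addr;
    int64_t  stride;
    int      confidence;
    int      valid;
};

static struct entry table[TABLE_SIZE];

/* Called on every demand access from instruction 'pc' to address 'addr'. */
static void train_and_prefetch(uint64_t pc, uint64_t addr)
{
    struct entry *e = &table[pc % TABLE_SIZE];

    if (!e->valid || e->pc != pc) {                 /* allocate a new entry */
        *e = (struct entry){ .pc = pc, .last_addr = addr, .valid = 1 };
        return;
    }

    int64_t stride = (int64_t)addr - (int64_t)e->last_addr;
    if (stride == e->stride && stride != 0) {
        if (e->confidence < CONF_MAX) e->confidence++;
    } else {                                        /* stride changed: retrain */
        e->stride = stride;
        e->confidence = 0;
    }
    e->last_addr = addr;

    if (e->confidence >= CONF_THRESH)               /* stride looks stable */
        for (int d = 1; d <= DEGREE; d++)
            printf("prefetch 0x%llx\n",
                   (unsigned long long)(addr + (uint64_t)(d * e->stride)));
}

int main(void)
{
    /* Streaming loads from one (hypothetical) PC with a 64 B stride. */
    for (uint64_t a = 0x1000; a < 0x1000 + 8 * 64; a += 64)
        train_and_prefetch(0x400123, a);
    return 0;
}
</code>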
===== Lecture 28 (4/14 Mon.) =====

  * Execution-based prefetchers
    * Helper thread/speculative thread
      * Use another thread to pre-execute a program
    * Can be software based or hardware based
    * Discover misses before the main program does (to prefetch data in a timely manner)
    * How do you construct the helper thread?
    * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9
  * Benefits of multiprocessors
    * Improve performance without significantly increasing power consumption
    * Better cost efficiency and easier to scale
    * Improve dependability (in case one core is faulty)
  * Different types of parallelism
    * Instruction level parallelism
    * Data level parallelism
    * Task level parallelism
  * Task level parallelism
    * Partition a single, potentially big, task into multiple parallel sub-tasks
      * Can be done explicitly (parallel programming by the programmer)
      * Or implicitly (hardware partitions a single thread speculatively)
    * Or, run multiple independent tasks (still improves throughput, but no single task gets a speedup; also simpler to implement)
  * Loosely coupled multiprocessors
    * No shared global address space
      * Message passing to communicate between processors
    * Simple to manage memory
  * Tightly coupled multiprocessors
    * Shared global address space
    * Need to ensure consistency of data
  * Switch-on-event multithreading
    * Switch to another context when there is an event (for example, a cache miss)
  * Simultaneous multithreading
    * Dispatch instructions from multiple threads at the same time
  * Amdahl's law (see the worked example after this list)
    * Maximum speedup
    * The parallel portion is not perfect
      * Serial bottleneck
      * Synchronization cost
      * Load imbalance
        * Some threads have more work and take longer to reach the sync. point
  * Issues in parallel programming
    * Correctness
    * Synchronization
    * Consistency
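
Amdahl's law as a worked example (the fraction p and the core count are made-up numbers, only to show how the serial part bounds the speedup):

<code latex>
% Speedup with N cores when a fraction p of the work is parallelizable
\mathrm{Speedup}(N) = \frac{1}{(1 - p) + \frac{p}{N}}

% Example: p = 0.95, N = 16
\mathrm{Speedup}(16) = \frac{1}{0.05 + 0.95/16} \approx 9.1

% Even with infinitely many cores, the serial fraction caps the speedup:
\lim_{N \to \infty} \mathrm{Speedup}(N) = \frac{1}{1 - p} = 20
</code>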
===== Lecture 29 (4/16 Wed.) =====

  * Ordering of instructions
    * Maintaining memory consistency when there are multiple threads and shared memory
    * Need to ensure the semantics are not changed
    * Making sure shared data is properly locked when used
      * Support mutual exclusion
    * Ordering depends on when each processor executes
    * Debugging is also difficult (non-deterministic behavior)
  * Weak consistency: global ordering only at synchronization points
    * The programmer hints where the synchronization points are
  * Total store order model: global ordering only for stores
  * Cache coherence
    * Can be done at the software level or the hardware level
  * Coherence protocol
    * Needs to ensure that all processors see and update the correct state of a cache block
    * Needs to make sure that writes get propagated and serialized
    * Simple protocols are not scalable (one point of synchronization)
  * Update vs. invalidate
    * For invalidate, a write removes all other copies, so only the writer retains a valid copy
      * Can lead to ping-ponging (tons of reads/writes from several processors)
    * For update, the bus becomes the bottleneck
  * Snoopy bus
    * Bus based, single point of serialization
    * More efficient with a small number of processors
    * All caches snoop other caches' read/write requests to keep cache blocks coherent
  * Directory based
    * Single point of serialization per block
    * The directory coordinates coherence
    * More scalable
    * The directory keeps track of where the copies of each block reside
      * Supplies data on a read
      * Invalidates copies of the block on a write
      * Has an exclusive state
  * MSI coherence protocol (see the sketch after this list)
    * Slides 56-57
    * Consumes bus bandwidth (hence the need for an "exclusive" state)
  * MESI coherence protocol
    * Adds an exclusive state (this is the only cached copy, and it is clean) to MSI
  * Tradeoffs between snooping and directory based
    * Slide 71 has a good summary of this
  * MOESI
    * Improvement over the MESI protocol
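
A minimal sketch of the MSI state transitions for one block in one cache. Only the state machine is shown; data movement, write-backs, and the actual bus or directory messages that real hardware exchanges are omitted.

<code c>
#include <stdio.h>

enum msi { INVALID, SHARED, MODIFIED };

enum event {
    PR_RD,     /* processor read                           */
    PR_WR,     /* processor write                          */
    BUS_RD,    /* another cache reads the block (snooped)  */
    BUS_RDX    /* another cache writes the block (snooped) */
};

static enum msi next_state(enum msi s, enum event e)
{
    switch (s) {
    case INVALID:
        if (e == PR_RD) return SHARED;      /* fetch block; others may share   */
        if (e == PR_WR) return MODIFIED;    /* issue BusRdX to invalidate others */
        return INVALID;
    case SHARED:
        if (e == PR_WR)   return MODIFIED;  /* upgrade: invalidate other copies */
        if (e == BUS_RDX) return INVALID;   /* someone else wants to write      */
        return SHARED;                      /* PR_RD / BUS_RD: stay shared      */
    case MODIFIED:
        if (e == BUS_RD)  return SHARED;    /* write back, then share           */
        if (e == BUS_RDX) return INVALID;   /* write back, give up the block    */
        return MODIFIED;                    /* local reads/writes hit           */
    }
    return INVALID;
}

int main(void)
{
    const char *name[] = { "I", "S", "M" };
    enum msi s = INVALID;
    enum event trace[] = { PR_RD, PR_WR, BUS_RD, BUS_RDX, PR_RD };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        s = next_state(s, trace[i]);
        printf("after event %u: %s\n", i, name[s]);
    }
    return 0;
}
</code>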
===== Lecture 30 (4/18 Fri.) =====

  * Interference
  * Complexity of the memory scheduler
    * Ranking/prioritization has a cost
    * A complex scheduler has higher latency
  * Performance metrics for multicore/multithreaded applications
    * Speedup
    * Slowdown
    * Harmonic vs. weighted speedup
  * Fairness metric
    * Maximum slowdown
      * Why does it make sense?
      * Any scenario where it does not make sense?
  * Predictable performance
    * Why is it important?
      * In a server environment, different jobs run on the same server
      * In a mobile environment, there are multiple sources that can slow down other sources
    * How to relate slowdown to request service rate
    * MISE: soft slowdown guarantee
  * BDI
    * Memory wall
      * What is the concern regarding the memory wall?
    * Size of the cache on the die (CPU die)
    * One possible solution: cache compression
      * What are the problems with existing cache compression mechanisms?
        * Some are too complex
        * Decompression is on the critical path
          * Need to decompress when reading the data -> decompression should not be on the critical path
          * An important factor for performance
    * Software compression is not good enough to compress everything
    * Zero value compression
      * Simple
      * Good compression ratio
      * What if the data does not have many zeroes?
    * Frequent value compression
      * Some values appear frequently
      * Simple and good compression ratio
      * Has to profile
      * Decompression is complex
    * Frequent pattern compression
      * Still too complex in terms of decompression
    * Base-delta compression (see the sketch after this list)
      * Easy to decompress but retains the benefit of compression
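
A sketch of the base+delta idea behind BDI: represent a cache line as one base value plus narrow per-word deltas. Real BDI tries several base and delta sizes (and an implicit zero base); this sketch checks only one configuration, with a made-up example line of clustered values.

<code c>
#include <stdint.h>
#include <stdio.h>

#define WORDS 8   /* 8 x 8 B words = one 64 B cache line */

struct compressed {
    uint64_t base;
    int8_t   delta[WORDS];
};

/* Returns 1 and fills 'out' if every word fits in base + signed 8-bit delta
 * (~16 B instead of 64 B); returns 0 if the line must stay uncompressed. */
static int compress_base_delta(const uint64_t line[WORDS], struct compressed *out)
{
    out->base = line[0];
    for (int i = 0; i < WORDS; i++) {
        int64_t d = (int64_t)(line[i] - out->base);
        if (d < INT8_MIN || d > INT8_MAX)
            return 0;                           /* deltas too large for this config */
        out->delta[i] = (int8_t)d;
    }
    return 1;
}

/* Decompression is one addition per word: cheap, which is the whole point. */
static uint64_t decompress_word(const struct compressed *c, int i)
{
    return c->base + (uint64_t)(int64_t)c->delta[i];
}

int main(void)
{
    /* e.g., nearby pointers: values cluster around a common base */
    uint64_t line[WORDS] = { 0x7fff1000, 0x7fff1008, 0x7fff1010, 0x7fff1018,
                             0x7fff1020, 0x7fff1028, 0x7fff1030, 0x7fff1038 };
    struct compressed c;
    if (compress_base_delta(line, &c))
        printf("compressible; word 5 decompresses to 0x%llx\n",
               (unsigned long long)decompress_word(&c, 5));
    return 0;
}
</code>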
===== Lecture 31 (4/28 Mon.) =====

  * Directory-based cache coherence
    * Each directory has to handle validation/invalidation
    * Extra cost of synchronization
    * Need to ensure race conditions are resolved
  * Interconnection
    * Topology
      * Bus
      * Mesh
        * Torus
      * Tree
      * Butterfly
      * Ring
        * Bi-directional ring
          * More scalable
        * Hierarchical ring
          * Even more scalable
          * More complex
      * Crossbar
      * etc.
    * Circuit switching
    * Multistage networks
      * Butterfly
      * Delta network
    * Handling contention
      * Buffering vs. dropping/deflection (no buffering)
    * Routing algorithms
      * Handling deadlock
      * X-Y routing (see the sketch after this list)
        * Turn model (to avoid deadlocks)
      * Add more buffering for an escape path
      * Oblivious routing
        * Can take different paths
          * DOR between each intermediate location
        * Balances network load
      * Adaptive routing
        * Uses the state of the network to determine the route
          * Aware of local and/or global congestion
        * Non-minimal adaptive routing can have livelocks
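
A minimal sketch of X-Y (dimension-order) routing on a 2D mesh: always finish moving in X before moving in Y. Forbidding the Y-to-X turn removes the cyclic channel dependences that could cause deadlock. Coordinates and the example source/destination are arbitrary.

<code c>
#include <stdio.h>

enum dir { LOCAL, EAST, WEST, NORTH, SOUTH };

/* Route computation at one router: X dimension first, then Y. */
static enum dir xy_route(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return EAST;
    if (cur_x > dst_x) return WEST;
    if (cur_y < dst_y) return NORTH;
    if (cur_y > dst_y) return SOUTH;
    return LOCAL;                     /* arrived: eject to the local node */
}

int main(void)
{
    const char *name[] = { "local", "east", "west", "north", "south" };
    int x = 0, y = 0;                 /* route a packet from (0,0) to (2,1) */
    for (;;) {
        enum dir d = xy_route(x, y, 2, 1);
        printf("at (%d,%d) -> %s\n", x, y, name[d]);
        if (d == LOCAL) break;
        if (d == EAST)  x++;
        if (d == WEST)  x--;
        if (d == NORTH) y++;
        if (d == SOUTH) y--;
    }
    return 0;
}
</code>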