buzzword [2015/03/04 19:21] kevincha [Lecture 19 (03/02 Mon.)] |
buzzword [2015/04/27 18:20] (current) rachata |
Line 45: | Line 45: | ||
* Memory wall (a part of scaling issue) | * Memory wall (a part of scaling issue) | ||
* Scaling issue | * Scaling issue | ||
- | * Transister are getting smaller | + | * Transistors are getting smaller |
* Key components of a computer | * Key components of a computer | ||
* Design points | * Design points | ||
Line 54: | Line 54: | ||
* Reliability problems that cause errors | * Reliability problems that cause errors | ||
* Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) | * Analogies from Kuhn's "The Structure of Scientific Revolutions" (Recommended book) | ||
- | * Pre paradigm science | + | * Pre-paradigm science |
* Normal science | * Normal science | ||
- | * Revolutionalry science | + | * Revolutionary science |
* Components of a computer | * Components of a computer | ||
* Computation | * Computation | ||
Line 76: | Line 76: | ||
* Operands | * Operands | ||
* Live-outs/Live-ins | * Live-outs/Live-ins | ||
- | * DIfferent types of data flow nodes (conditional/relational/barrier) | + | * Different types of data flow nodes (conditional/relational/barrier) |
* How to handle transactions in dataflow? | * How to handle transactions in dataflow? ||
* Example: bank transactions | * Example: bank transactions | ||
Line 121: | Line 121: | ||
* Tradeoffs between 0,1,2,3 address machines | * Tradeoffs between 0,1,2,3 address machines | ||
* Postfix notation | * Postfix notation | ||
- | * Instructions/Opcode/Operade specifiers (i.e. addressing modes) | + | * Instructions/Opcode/Operand specifiers (i.e. addressing modes) |
* Simple vs. complex data types (and their tradeoffs) | * Simple vs. complex data types (and their tradeoffs) ||
* Semantic gap and level | * Semantic gap and level | ||
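The 0-address-machine and postfix-notation bullets above can be made concrete with a small sketch: a 0-address (stack) machine evaluates postfix directly off an operand stack. The operator set and token format here are illustrative choices, not from the notes.

```python
def eval_postfix(tokens):
    """Evaluate a postfix (RPN) expression with an operand stack,
    the way a 0-address (stack) machine would execute it."""
    ops = {"+": lambda a, b: a + b,
           "-": lambda a, b: a - b,
           "*": lambda a, b: a * b}
    stack = []
    for t in tokens:
        if t in ops:
            b = stack.pop()            # top of stack = second operand
            a = stack.pop()
            stack.append(ops[t](a, b))
        else:
            stack.append(int(t))
    return stack.pop()

# (2 + 3) * 4 written in postfix:
print(eval_postfix("2 3 + 4 *".split()))  # -> 20
```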
Line 236: | Line 236: | ||
* Variable latency memory | * Variable latency memory | ||
* Handling interrupts | * Handling interrupts | ||
- | * Difference betwen interrupts and exceptions | + | * Difference between interrupts and exceptions |
* Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | * Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | ||
* Updating machine behavior | * Updating machine behavior | ||
Line 257: | Line 257: | ||
* Idle resources | * Idle resources | ||
* Throughput of a pipelined design | * Throughput of a pipelined design | ||
- | * What dictacts the throughput of a pipelined design? | + | * What dictates the throughput of a pipelined design? |
* Latency of the pipelined design | * Latency of the pipelined design | ||
* Dependency | * Dependency | ||
Line 336: | Line 336: | ||
* Inst. from different threads can fill in the bubbles | * Inst. from different threads can fill in the bubbles ||
* Cost? | * Cost? | ||
- | * Simulteneuos multithreading | + | * Simultaneous multithreading |
* Branch prediction | * Branch prediction | ||
* Guess what to fetch next. | * Guess what to fetch next. | ||
Line 345: | Line 345: | ||
* Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute the cost coming from branch mispredictions. | * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute the cost coming from branch mispredictions. ||
* How many extra instructions are being fetched? | * How many extra instructions are being fetched? | ||
- | * What is the performance degredation? | + | * What is the performance degradation? |
* How to reduce the miss penalty? | * How to reduce the miss penalty? | ||
* Predicting the next address (non PC+4 address) | * Predicting the next address (non PC+4 address) | ||
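The misprediction-cost computation described above is simple arithmetic; a sketch with made-up example numbers (instruction count, branch fraction, accuracy, and penalty are all illustrative):

```python
def mispredict_cycles(num_insts, branch_frac, accuracy, penalty):
    """Extra cycles lost to branch mispredictions."""
    branches = num_insts * branch_frac          # dynamic branch count
    mispredicted = branches * (1 - accuracy)    # branches predicted wrong
    return mispredicted * penalty               # flushed cycles

# Example: 1B instructions, 20% branches, 95% accuracy, 20-cycle penalty.
extra = mispredict_cycles(1_000_000_000, 0.20, 0.95, 20)
print(extra)   # ~200 million extra cycles
```

The same numbers also answer "how many extra instructions are fetched": multiply mispredictions by the fetch width times the flush depth instead of the cycle penalty.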
Line 352: | Line 352: | ||
* Global branch history - for directions | * Global branch history - for directions | ||
* Can use compiler to profile and get more info | * Can use compiler to profile and get more info | ||
- | * Input set dictacts the accuracy | + | * Input set dictates the accuracy |
* Add time to compilation | * Add time to compilation | ||
* Heuristics that are common and don't require profiling. | * Heuristics that are common and don't require profiling. ||
Line 358: | Line 358: | ||
* Does not require profiling | * Does not require profiling | ||
* Static branch prediction | * Static branch prediction | ||
- | * Pregrammer provides pragmas, hinting the likelihood of taken/not taken branch | + | * Programmer provides pragmas, hinting the likelihood of taken/not taken branch |
* For example, x86 has the hint bit | * For example, x86 has the hint bit | ||
* Dynamic branch prediction | * Dynamic branch prediction | ||
Line 369: | Line 369: | ||
* Why are they very important? | * Why are they very important? | ||
* Differences between 99% accuracy and 98% accuracy | * Differences between 99% accuracy and 98% accuracy | ||
- | * Cost of a misprediction when the pipeline is veryd eep | + | * Cost of a misprediction when the pipeline is very deep |
* Global branch correlation | * Global branch correlation | ||
* Some branches are correlated | * Some branches are correlated | ||
Line 405: | Line 405: | ||
* Use branch prediction on easy-to-predict code | * Use branch prediction on easy-to-predict code ||
* Use predicated execution on hard-to-predict code | * Use predicated execution on hard-to-predict code ||
- | * Compiler can be more aggressive in optimimzing the code | + | * Compiler can be more aggressive in optimizing the code |
* What are the tradeoffs (slide# 47) | * What are the tradeoffs (slide# 47) | ||
* Multi-path execution | * Multi-path execution | ||
Line 411: | Line 411: | ||
* Can lead to wasted work | * Can lead to wasted work | ||
* VLIW | * VLIW | ||
- | * SuperScalar | + | * Superscalar |
Line 434: | Line 434: | ||
* Reorder buffer | * Reorder buffer | ||
* Reorder results before they are visible to the arch. state | * Reorder results before they are visible to the arch. state | ||
- | * Need to presearve the sequential sematic and data | + | * Need to preserve the sequential semantics and data |
- | * What are the informatinos in the ROB entry | + | * What information is in an ROB entry |
* Where to get the value from (forwarding path? reorder buffer?) | * Where to get the value from (forwarding path? reorder buffer?) | ||
* Extra logic to check where the youngest instructions/value is | * Extra logic to check where the youngest instructions/value is | ||
- | * Content addressible search (CAM) | + | * Content addressable search (CAM) |
* A lot of comparators | * A lot of comparators | ||
* Different ways to simplify the reorder buffer | * Different ways to simplify the reorder buffer | ||
Line 450: | Line 450: | ||
* An updated value (Speculative), called future file | * An updated value (Speculative), called future file | ||
* A backup value (to restore the state quickly) | * A backup value (to restore the state quickly) ||
- | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone) | + | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressable memory (compared to ROB alone) |
* Branch misprediction resembles Exception | * Branch misprediction resembles Exception | ||
* The difference is that branch misprediction is not visible to the software | * The difference is that branch misprediction is not visible to the software | ||
Line 486: | Line 486: | ||
* Slides 28 --> register renaming | * Slides 28 --> register renaming | ||
* Slides 30-35 --> Exercise (also on the board) | * Slides 30-35 --> Exercise (also on the board) | ||
- | * This will be usefull for the midterm | + | * This will be useful for the midterm |
* Register aliasing table | * Register aliasing table | ||
* Broadcasting tags | * Broadcasting tags | ||
Line 503: | Line 503: | ||
* Still visible as a Von Neumann model | * Still visible as a Von Neumann model | ||
* Where does the efficiency come from? | * Where does the efficiency come from? | ||
- | * Size of the scheduling windors/reorder buffer. Tradeoffs? What make sense? | + | * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense? |
* Load/store handling | * Load/store handling | ||
* Would like to schedule them out of order, but make them visible in-order | * Would like to schedule them out of order, but make them visible in-order | ||
Line 538: | Line 538: | ||
* But the program needs to be very parallel | * But the program needs to be very parallel | ||
* Memory can be the bottleneck (due to very high MLP) | * Memory can be the bottleneck (due to very high MLP) | ||
- | * What does the functional units look like? Deep pipelin and simpler control. | + | * What do the functional units look like? Deep pipeline and simpler control. |
* CRAY-I is one of the examples of vector processor | * CRAY-I is one of the examples of vector processor | ||
* Memory access pattern in a vector processor | * Memory access pattern in a vector processor | ||
Line 549: | Line 549: | ||
* What if there are multiple memory ports? | * What if there are multiple memory ports? | ||
* Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | ||
- | * Helps handling sparse metrices | + | * Helps handling sparse matrices |
* Conditional operation | * Conditional operation | ||
* Structure of vector units | * Structure of vector units | ||
Line 579: | Line 579: | ||
* Lower SIMD efficiency | * Lower SIMD efficiency | ||
* What if you have layers of branches? | * What if you have layers of branches? | ||
- | * Dynamic wrap formation | + | * Dynamic warp formation |
* Combining threads from different warps to increase SIMD utilization | * Combining threads from different warps to increase SIMD utilization | ||
* This can cause memory divergence | * This can cause memory divergence | ||
Line 593: | Line 593: | ||
* How to steer the instruction (determine dependency/stalling)? | * How to steer the instruction (determine dependency/stalling)? ||
* Instruction scheduling techniques (static vs. dynamic) | * Instruction scheduling techniques (static vs. dynamic) | ||
- | * Systoric arrays | + | * Systolic arrays |
* Processing elements transform data in chains | * Processing elements transform data in chains | ||
* Developed for image processing (for example, convolution) | * Developed for image processing (for example, convolution) ||
Line 601: | Line 601: | ||
===== Lecture 16 (2/23 Mon.) ===== | ===== Lecture 16 (2/23 Mon.) ===== | ||
- | * Systoric arrays | + | * Systolic arrays |
* Processing elements transform data in chains | * Processing elements transform data in chains | ||
* Can be arrays of multi-dimensional processing elements | * Can be arrays of multi-dimensional processing elements | ||
Line 622: | Line 622: | ||
* Side entrance | * Side entrance | ||
* Fix-up code | * Fix-up code ||
- | * How scheudling is done | + | * How scheduling is done |
* Instruction scheduling | * Instruction scheduling | ||
* Prioritization heuristics | * Prioritization heuristics | ||
Line 629: | Line 629: | ||
* Hyperblock | * Hyperblock | ||
* BS-ISA | * BS-ISA | ||
- | * Tradeoffs betwwen trace cache/Hyperblock/Superblock/BS-ISA | + | * Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA |
| | ||
===== Lecture 17 (2/25 Wed.) ===== | ===== Lecture 17 (2/25 Wed.) ===== | ||
Line 640: | Line 640: | ||
* Non-faulting loads and exception propagation | * Non-faulting loads and exception propagation | ||
* Aggressive ST-LD reordering | * Aggressive ST-LD reordering | ||
- | * Phyiscal memory system | + | * Physical memory system |
* Ideal pipelines | * Ideal pipelines | ||
* Ideal cache | * Ideal cache | ||
Line 649: | Line 649: | ||
* DRAM cell | * DRAM cell | ||
* Cheap | * Cheap | ||
- | * Sense the purturbation through sense amplifier | + | * Sense the perturbation through sense amplifier |
* Slow and leaky | * Slow and leaky | ||
* SRAM cell (Cross coupled inverter) | * SRAM cell (Cross coupled inverter) | ||
Line 657: | Line 657: | ||
* Read access sequence | * Read access sequence | ||
* DRAM: Activate -> Read -> Precharge (if needed) | * DRAM: Activate -> Read -> Precharge (if needed) | ||
- | * What dominate the access laatency for DRAM and SRAM | + | * What dominates the access latency for DRAM and SRAM |
* Scaling issue | * Scaling issue | ||
* Hard to scale the cell size down | * Hard to scale the cell size down ||
Line 685: | Line 685: | ||
* Cost and benefit of having more associativity | * Cost and benefit of having more associativity | ||
* Given the associativity, which block should be replaced if it is full | * Given the associativity, which block should be replaced if it is full ||
- | * Replacement poligy | + | * Replacement policy |
* Random | * Random | ||
* Least recently used (LRU) | * Least recently used (LRU) | ||
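The LRU replacement policy listed above can be sketched for a single cache set; an `OrderedDict` keeps blocks in recency order (the class name and trace are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One set of an associative cache with LRU replacement (a sketch)."""
    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()   # oldest (LRU) first, newest (MRU) last

    def access(self, tag):
        """Return True on a hit; on a miss, insert and evict the LRU block."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)       # promote to MRU
            return True
        if len(self.blocks) == self.ways:
            self.blocks.popitem(last=False)    # evict LRU block
        self.blocks[tag] = True
        return False

s = LRUSet(2)
hits = [s.access(t) for t in ["A", "B", "A", "C", "B"]]
print(hits)  # [False, False, True, False, False]
```

Accessing C evicts B (the LRU block), so the final access to B misses again, the classic LRU behavior on a working set one block too large.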
Line 749: | Line 749: | ||
* Caches missed cache blocks | * Caches missed cache blocks ||
* Prevent ping-ponging | * Prevent ping-ponging | ||
- | * Pseudo associtivity | + | * Pseudo associativity |
* Simpler way to implement associative cache | * Simpler way to implement associative cache | ||
* Skewed assoc. cache | * Skewed assoc. cache | ||
Line 799: | Line 799: | ||
* Page coloring | * Page coloring | ||
* Interaction between cache and TLB | * Interaction between cache and TLB | ||
+ | |||
+ | |||
+ | ===== Lecture 21 (03/23 Mon.) ===== | ||
+ | |||
+ | * DRAM scaling problem | ||
+ | * Demands/trends affecting the main memory | ||
+ | * More capacity | ||
+ | * Low energy | ||
+ | * More bandwidth | ||
+ | * QoS | ||
+ | * ECC in DRAM | ||
+ | * Multi-porting | ||
+ | * Virtual multi-porting | ||
+ | * Time-share the port, not too scalable but cheap | ||
+ | * True multiporting | ||
+ | * Multiple cache copies | ||
+ | * Alignment | ||
+ | * Banking | ||
+ | * Can have bank conflict | ||
+ | * Extra interconnects across banks | ||
+ | * Address mapping can mitigate bank conflict | ||
+ | * Common in main memory (note that regFile in GPU is also banked, but mainly for the purpose of reducing complexity) | ||
+ | * Bank mapping | ||
+ | * How to avoid bank conflicts? | ||
+ | * Channel mapping | ||
+ | * Address mapping to minimize bank conflict | ||
+ | * Page coloring | ||
+ | * Virtual to physical mapping that can help reduce conflicts | ||
+ | * Accessing DRAM | ||
+ | * Row bits | ||
+ | * Column bits | ||
+ | * Addressability | ||
+ | * DRAM has its own clock | ||
+ | * Sense amplifier | ||
+ | * Bit lines | ||
+ | * Word lines | ||
+ | * DRAM (2T) vs. SRAM (6T) | ||
+ | * Cost | ||
+ | * Latency | ||
+ | * Interleaving in DRAM | ||
+ | * Effects from address mapping on memory interleaving | ||
+ | * Effects from memory access patterns from the program on interleaving | ||
+ | * DRAM Bank | ||
+ | * To minimize the cost of interleaving (shares the data bus and the command bus) | ||
+ | * DRAM Rank | ||
+ | * Minimize the cost of the chip (a bundle of chips operated together) | ||
+ | * DRAM Channel | ||
+ | * An interface to DRAM, each with its own ranks/banks | ||
+ | * DRAM Chip | ||
+ | * DIMM | ||
+ | * More DIMM adds the interconnect complexity | ||
+ | * List of commands to read/write data into DRAM | ||
+ | * Activate -> read/write -> precharge | ||
+ | * Activate moves data into the row buffer | ||
+ | * Precharge prepares the bank for the next access | ||
+ | * Row buffer hit | ||
+ | * Row buffer conflict | ||
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
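The two address mappings above differ only in where the bank bits sit. A sketch with hypothetical sizes (64 B cache blocks, 8 banks, 1 KB rows; none of these numbers come from the notes) shows the effect on consecutive cache blocks:

```python
BLOCK = 64      # cache-block size in bytes (assumed)
BANKS = 8       # number of banks (assumed)
ROW   = 1024    # bytes per row (assumed)

def bank_row_interleaved(addr):
    """Row:Bank:Column -- bank bits sit above the column bits, so
    consecutive blocks stay in the same row of the same bank."""
    return (addr // ROW) % BANKS

def bank_block_interleaved(addr):
    """...:Bank:BlockOffset -- bank bits sit just above the block
    offset, so consecutive blocks spread across banks."""
    return (addr // BLOCK) % BANKS

addrs = [i * BLOCK for i in range(4)]             # 4 consecutive cache blocks
print([bank_row_interleaved(a) for a in addrs])   # [0, 0, 0, 0]  same bank (row-buffer locality)
print([bank_block_interleaved(a) for a in addrs]) # [0, 1, 2, 3]  across banks (bank-level parallelism)
```

Row interleaving favors row-buffer hits for streaming accesses; cache-block interleaving favors bank-level parallelism, which is the tradeoff the bullets describe.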
+ | * Memory controller | ||
+ | * Sending DRAM commands | ||
+ | * Periodically send commands to refresh DRAM cells | ||
+ | * Ensure correctness and data integrity | ||
+ | * Where to place the memory controller | ||
+ | * On CPU chip vs. at the main memory | ||
+ | * Higher BW on-chip | ||
+ | * Determine the order of requests that will be serviced in DRAM | ||
+ | * Request queues that hold requests | ||
+ | * Send requests whenever the request can be sent to the bank | ||
+ | * Determine which command (across banks) should be sent to DRAM | ||
+ | |||
+ | ===== Lecture 22 (03/25 Wed.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Flash controller | ||
+ | * Flash memory | ||
+ | * Garbage collection in flash | ||
+ | * Overhead in flash memory | ||
+ | * Erase (off the critical path, but takes a long time) | ||
+ | * Different types of DRAM | ||
+ | * DRAM design choices | ||
+ | * Cost/density/latency/BW/Yield | ||
+ | * Sense Amplifier | ||
+ | * How do they work | ||
+ | * Dual data rate | ||
+ | * Subarray | ||
+ | * Rowclone | ||
+ | * Moving bulk of data from one row to others | ||
+ | * Lower latency and BW usage when copying/zeroing out data | ||
+ | * TL-DRAM | ||
+ | * Far segment | ||
+ | * Near segment | ||
+ | * What causes the long latency | ||
+ | * Benefit of TL-DRAM | ||
+ | * TL-DRAM vs. DRAM cache (adding a small cache in DRAM) | ||
+ | * List of commands to read/write data into DRAM | ||
+ | * Activate -> read/write -> precharge | ||
+ | * Activate moves data into the row buffer | ||
+ | * Precharge prepares the bank for the next access | ||
+ | * Row buffer hit | ||
+ | * Row buffer conflict | ||
+ | * Scheduling memory requests to lower row conflicts | ||
+ | * Burst mode of DRAM | ||
+ | * Prefetch 32 bits over an 8-bit interface if DRAM needs to read 32 bits | ||
+ | * Address mapping | ||
+ | * Row interleaved | ||
+ | * Cache block interleaved | ||
+ | * Memory controller | ||
+ | * Sending DRAM commands | ||
+ | * Periodically send commands to refresh DRAM cells | ||
+ | * Ensure correctness and data integrity | ||
+ | * Where to place the memory controller | ||
+ | * On CPU chip vs. at the main memory | ||
+ | * Higher BW on-chip | ||
+ | * Determine the order of requests that will be serviced in DRAM | ||
+ | * Request queues that hold requests | ||
+ | * Send requests whenever the request can be sent to the bank | ||
+ | * Determine which command (across banks) should be sent to DRAM | ||
+ | * Priority of demand vs. prefetch requests | ||
+ | * Memory scheduling policies | ||
+ | * FCFS | ||
+ | * FR-FCFS | ||
+ | * Try to maximize row buffer hit rate | ||
+ | * Capped FR-FCFS: FR-FCFS with a timeout | ||
+ | * Usually this is done in a command level (read/write commands and precharge/activate commands) | ||
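The FR-FCFS policy above can be sketched for a single bank: among queued requests, the oldest row-buffer hit is serviced first, and plain FCFS is the fallback. The request format is an assumption made for illustration.

```python
def fr_fcfs(queue, open_row):
    """FR-FCFS for one bank (a sketch).

    queue    : list of (arrival_order, row) tuples, oldest first
    open_row : row currently held in the bank's row buffer
    """
    for req in queue:               # scan oldest -> youngest ("first ready")
        if req[1] == open_row:      # row-buffer hit: no activate needed
            return req
    return queue[0]                 # no hit anywhere: oldest request (FCFS)

q = [(0, 5), (1, 9), (2, 5)]
print(fr_fcfs(q, open_row=9))   # (1, 9): younger request, but a row hit
print(fr_fcfs(q, open_row=7))   # (0, 5): no hits, so plain FCFS wins
```

Capped FR-FCFS would add a timeout so an old row-miss request cannot be starved by a stream of younger row hits.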
+ | * PAR-BS | ||
+ | * Key benefits | ||
+ | * stall time | ||
+ | * shortest job first | ||
+ | * STFM | ||
+ | * ATLAS | ||
+ | * TCM | ||
+ | * Key benefits | ||
+ | * Configurability | ||
+ | * Fairness + performance at the same time | ||
+ | * Robustness issues | ||
+ | * Open row policy | ||
+ | * Closed row policy | ||
+ | * QoS | ||
+ | * QoS issues in memory scheduling | ||
+ | * Fairness | ||
+ | * Performance guarantee | ||
+ | |||
+ | |||
+ | ===== Lecture 23 (03/27 Fri.) ===== | ||
+ | |||
+ | * Different ways to control interference in DRAM | ||
+ | * Partitioning of resource | ||
+ | * Channel partitioning: map applications that interfere with each other in a different channel | ||
+ | * Keep track of applications' characteristics | ||
+ | * Dedicating a channel might waste bandwidth | ||
+ | * Need OS support to determine the channel bits | ||
+ | * Source throttling | ||
+ | * A controller throttles the cores depending on the performance target | ||
+ | * Example: Fairness via source throttling | ||
+ | * Detect unfairness and throttle the application that is interfering | ||
+ | * How do you estimate slowdown? | ||
+ | * Threshold based solution: hard to configure | ||
+ | * App/thread scheduling | ||
+ | * Critical threads usually stall the progress | ||
+ | * Designing DRAM controller | ||
+ | * Has to handle the normal DRAM operations | ||
+ | * Read/write/refresh/all the timing constraints | ||
+ | * Keep track of resources | ||
+ | * Assign priorities to different requests | ||
+ | * Manage requests to banks | ||
+ | * Self-optimizing controller | ||
+ | * Use machine learning to improve DRAM controller | ||
+ | * A-DRM | ||
+ | * Architecture aware DRAM | ||
+ | * Multithreading | ||
+ | * Synchronization | ||
+ | * Pipeline programs | ||
+ | * Producer consumer model | ||
+ | * Critical path | ||
+ | * Limiter threads | ||
+ | * Prioritization between threads | ||
+ | * Different power mode in DRAM | ||
+ | * DRAM Refresh | ||
+ | * Why does DRAM have to refresh every 64ms | ||
+ | * Banks are unavailable during refresh | ||
+ | * LPDDR mitigates this by using a per-bank refresh | ||
+ | * Has to spend longer time with bigger DRAM | ||
+ | * Distributed refresh: stagger refresh every 64 ms in a distributed manner | ||
+ | * As opposed to burst refresh (long pause time) | ||
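The distributed-refresh idea above is just scheduling arithmetic: spread the refresh commands evenly over the 64 ms retention window. The 8192-commands-per-window figure is assumed here for illustration (a common DDRx-era value), not taken from the notes.

```python
# Distributed refresh: one refresh command every tREFI instead of a burst.
RETENTION_MS = 64       # retention window from the notes
REFRESH_CMDS = 8192     # refresh commands per window (assumed typical value)

interval_us = RETENTION_MS * 1000 / REFRESH_CMDS
print(interval_us)   # 7.8125: one refresh command every ~7.8 microseconds
```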
+ | * RAIDR: Reduce DRAM refresh by profiling and binning | ||
+ | * Some rows do not have to be refreshed very frequently | ||
+ | * Profile the row | ||
+ | * High temperature changes the retention time: need online profiling | ||
+ | * Bloom filter | ||
+ | * Represent set membership | ||
+ | * Approximated | ||
+ | * Can contain false positive | ||
+ | * Better/more hash function helps eliminate this | ||
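The Bloom filter bullets above (approximate membership, false positives possible, no false negatives) can be sketched in a few lines; the size, hash count, and salted-hash scheme are illustrative choices:

```python
class BloomFilter:
    """Approximate set membership (a minimal sketch).
    No false negatives; false positives are possible, and
    more/better hash functions make them rarer."""
    def __init__(self, size=64, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = 0                  # bit vector stored as an int

    def _positions(self, item):
        # Salted hashes stand in for independent hash functions.
        return [hash((salt, item)) % self.size
                for salt in range(self.num_hashes)]

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def maybe_contains(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("row-137")
print(bf.maybe_contains("row-137"))   # True: an added item is always found
```

In RAIDR's use, each retention-time bin is one Bloom filter over row addresses; a false positive only means a row gets refreshed more often than strictly needed, which is safe.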
+ | | ||
+ | ===== Lecture 24 (03/30 Mon.) ===== | ||
+ | |||
+ | * Simulation | ||
+ | * Drawbacks of RTL simulations | ||
+ | * Time consuming | ||
+ | * Complex to develop | ||
+ | * Hard to perform design explorations | ||
+ | * Explore the design space quickly | ||
+ | * Match the behavior of existing systems | ||
+ | * Tradeoffs: speed, accuracy, flexibility | ||
+ | * High-level simulation vs. detailed simulation | ||
+ | * High-level simulation is faster, but lower accuracy | ||
+ | * Controllers that work on multiple types of cores | ||
+ | * Design problems: how to find a good scheduling policy on its own? | ||
+ | * Self-optimizing memory controller: using machine learning | ||
+ | * Can adapt to the applications | ||
+ | * The complexity is very high | ||
+ | * Tolerating latency can be costly | ||
+ | * Instruction window is complex | ||
+ | * Benefit also diminishes | ||
+ | * Designing the buffers can be complex | ||
+ | * A simpler way to tolerate out of order is desirable | ||
+ | * Different sources that cause the core to stall in OoO | ||
+ | * Cache miss | ||
+ | * Note that stall happens if the inst. window is full | ||
+ | * Scaling instruction window size is hard | ||
+ | * It is better (less complex) to make the window more efficient | ||
+ | * Runahead execution | ||
+ | * Try to obtain MLP w/o increasing the instruction window | ||
+ | * Runahead (i.e. execute ahead) when there is a long memory instruction | ||
+ | * Long memory instructions stall the processor for a while anyway, so it's better to make use of that time | ||
+ | * Execute future instructions to generate accurate prefetches | ||
+ | * Allow future data to be in the cache | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoint the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | |||
+ | |||
+ | ===== Lecture 25 (4/1 Wed.) ===== | ||
+ | |||
+ | * More Runahead executions | ||
+ | * How to support runahead execution? | ||
+ | * Need a way to checkpoint the state when entering runahead mode | ||
+ | * How to make executing in the wrong path useful? | ||
+ | * Need runahead cache to handle load/store in Runahead mode (since they are speculative) | ||
+ | * Cost and benefit of runahead execution (slide number 27) | ||
+ | * Runahead can have inefficiency | ||
+ | * Runahead periods that are useless | ||
+ | * Get rid of useless/inefficient periods | ||
+ | * What if there is a dependent cache miss | ||
+ | * Cannot be parallelized in a vanilla runahead | ||
+ | * Can predict the value of the dependent load | ||
+ | * How to predict the address of the load | ||
+ | * Delta value information | ||
+ | * Stride predictor | ||
+ | * AVD prediction | ||
+ | * Questions regarding prefetching | ||
+ | * What to prefetch | ||
+ | * When to prefetch | ||
+ | * How do we prefetch | ||
+ | * Where to prefetch from | ||
+ | * Prefetching can cause thrashing (evict a useful block) | ||
+ | * Prefetching can also be useless (not being used) | ||
+ | * Need to be efficient | ||
+ | * Can cause memory bandwidth problem in GPU | ||
+ | * Prefetch the whole block, more than one block, or subblock? | ||
+ | * Each one of them has pros and cons | ||
+ | * Bigger prefetches are more likely to waste bandwidth | ||
+ | * Commonly done in a cache block granularity | ||
+ | * Prefetch accuracy: fraction of useful prefetches out of all the prefetches | ||
+ | * Prefetcher usually predict based on | ||
+ | * Past knowledge | ||
+ | * Compiler hints | ||
+ | * Prefetcher has to prefetch at the right time | ||
+ | * A prefetch that is too early might get evicted | ||
+ | * It might also evict other useful data | ||
+ | * Prefetching too late does not hide the whole memory latency | ||
+ | * Previous prefetches at the same PC can be used as the history | ||
+ | * Previous demand requests are also good information to use for prefetches | ||
+ | * Prefetch buffer | ||
+ | * Place the prefetch data to avoid thrashing | ||
+ | * Can treat demand/prefetch requests separately | ||
+ | * More complex | ||
+ | * Generally, demand block is more important | ||
+ | * This means eviction should prefer prefetched blocks as opposed to demand blocks | ||
+ | * Tradeoffs between where do we place the prefetcher | ||
+ | * Look at L1 hits and misses | ||
+ | * Look at L1 misses only | ||
+ | * Look at L2 misses | ||
+ | * Different access patterns affect accuracy | ||
+ | * Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only see L2 miss) | ||
+ | * Software vs. hardware vs. execution based prefetching | ||
+ | * Software: ISA provides prefetch instructions; software utilizes them | ||
+ | * What information are useful | ||
+ | * How to make sure the prefetch is timely | ||
+ | * What if you have a pointer based structure | ||
+ | * Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one timely enough) | ||
+ | * Can be solved by hinting the next-next and/or next-next-next address | ||
+ | * Hardware: Identify the pattern and prefetch | ||
+ | * Execution driven: Opportunistically try to prefetch (runahead, dual-core execution) | ||
+ | * Stride prefetcher | ||
+ | * Predict strides, which is common in many programs | ||
+ | * Cache block based or instruction based | ||
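The instruction-based (PC-indexed) variant of the stride prefetcher above can be sketched as a small table keyed by load PC; the confirm-before-prefetch rule and table format are illustrative simplifications:

```python
class StridePrefetcher:
    """PC-indexed stride prefetcher (a sketch): track the last address
    and stride per load PC; prefetch once the same stride repeats."""
    def __init__(self):
        self.table = {}   # pc -> (last_addr, stride)

    def access(self, pc, addr):
        """Return an address to prefetch, or None."""
        prefetch = None
        if pc in self.table:
            last, stride = self.table[pc]
            new_stride = addr - last
            if stride is not None and new_stride == stride:
                prefetch = addr + stride       # stride confirmed -> prefetch
            self.table[pc] = (addr, new_stride)
        else:
            self.table[pc] = (addr, None)      # first sight: just record
        return prefetch

p = StridePrefetcher()
print(p.access(0x40, 100))  # None (first access: no history)
print(p.access(0x40, 116))  # None (stride 16 seen once, not yet confirmed)
print(p.access(0x40, 132))  # 148  (stride 16 confirmed -> prefetch next)
```

A cache-block-based stride prefetcher works the same way but indexes by block address region instead of PC.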
+ | * Stream buffer design | ||
+ | * Buffer the stream of accesses (next address) | ||
+ | * Use the information to prefetch | ||
+ | * What affects prefetcher performance | ||
+ | * Prefetch distance | ||
+ | * How far ahead should we prefetch | ||
+ | * Prefetch degree | ||
+ | * How many prefetches do we issue | ||
+ | * Prefetcher performance | ||
+ | * Coverage | ||
+ | * Out of all demand misses, how many are covered by prefetch requests | ||
+ | * Accuracy | ||
+ | * Out of all the prefetch requests, how many are actually getting used | ||
+ | * Timeliness | ||
+ | * How much memory latency can we hide from the prefetch requests | ||
+ | * Cache pollution | ||
+ | * How many extra demand misses did the prefetcher cause? | ||
+ | * Hard to quantify | ||
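The coverage and accuracy metrics above are simple ratios over simulation counts; a sketch (the counts are hypothetical example numbers):

```python
def prefetcher_metrics(useful_prefetches, total_prefetches, demand_misses_no_pf):
    """Accuracy: useful prefetches / all prefetches issued.
    Coverage: fraction of would-be demand misses removed by prefetching."""
    accuracy = useful_prefetches / total_prefetches
    coverage = useful_prefetches / demand_misses_no_pf
    return accuracy, coverage

acc, cov = prefetcher_metrics(useful_prefetches=80,
                              total_prefetches=200,
                              demand_misses_no_pf=400)
print(acc, cov)  # 0.4 0.2
```

Timeliness and pollution need per-request latency and eviction tracking, which is why the notes call pollution hard to quantify.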
+ | |||
+ | |||
+ | ===== Lecture 26 (4/3 Fri.) ===== | ||
+ | |||
+ | * Feedback directed prefetcher | ||
+ | * Use the result of the prefetcher as a feedback to the prefetcher | ||
+ | * with accuracy, timeliness, and pollution information | ||
+ | * Markov prefetcher | ||
+ | * Prefetch based on the previous history | ||
+ | * Use a Markov model to predict | ||
+ | * Pros: Can cover arbitrary patterns (easy for linked-list traversal or trees) | ||
+ | * Downside: High cost, cannot help with compulsory misses (no history) | ||
+ | * Content directed prefetching | ||
+ | * Identify pointers in the memory content (which are used as the addresses to prefetch) | ||
+ | * Not very efficient (hard to figure out which block is the pointer) | ||
+ | * Software can give hints | ||
+ | * Correlation table | ||
+ | * Address correlation | ||
+ | * Execution based prefetcher | ||
+ | * Helper thread/speculative thread | ||
+ | * Use another thread to pre-execute a program | ||
+ | * Can be a software based or hardware based | ||
+ | * Discover misses before the main program (to prefetch data in a timely manner) | ||
+ | * How do you construct the helper thread | ||
+ | * Pre-execute instructions (one example of how to initialize a speculative thread), slide 9 | ||
+ | * Thread-based pre-execution | ||
+ | * Error tolerance | ||
+ | * Solution to errors | ||
+ | * Tolerate errors | ||
+ | * New interface, new design | ||
+ | * Eliminate or minimize errors | ||
+ | * New technology, system-wide rethinking | ||
+ | * Embrace errors | ||
+ | * Map data that can tolerate errors to error-prone area | ||
+ | * Hybrid memory systems | ||
+ | * Combining multiple memory technology together | ||
+ | * What can emerging technology help? | ||
+ | * Scalability | ||
+ | * Lower the cost | ||
+ | * Energy efficiency | ||
+ | * Possible solutions to the scaling problem | ||
+ | * Less leakage DRAM | ||
+ | * Heterogeneous DRAM (TL-DRAM, etc.) | ||
+ | * Add more functionality to DRAM | ||
+ | * Denser design (3D stack) | ||
+ | * Different technology | ||
+ | * NVM | ||
+ | * Charge vs. resistive memory | ||
+ | * How data is written? | ||
+ | * How to read the data? | ||
+ | * Non volatile memory | ||
+ | * Resistive memory | ||
+ | * PCM | ||
+ | * Inject current to change the phase | ||
+ | * Scales better than DRAM | ||
+ | * Multiple bits per cell | ||
+ | * Wider resistance range | ||
+ | * No refresh is needed | ||
+ | * Downside: Latency and write endurance | ||
+ | * STT-MRAM | ||
+ | * Inject current to change the polarity | ||
+ | * Memristor | ||
+ | * Inject current to change the structure | ||
+ | * Pros and cons between different technologies | ||
+ | * Persistency - data stay there even without power | ||
+ | * Unified memory and storage management (persistent data structure) - Single level store | ||
+ | * Improve energy and performance | ||
+ | * Simplify programming model | ||
+ | * Different design options for DRAM + NVM | ||
+ | * DRAM as a cache | ||
+ | * Place some data in DRAM and other in PCM | ||
+ | * Based on the characteristics | ||
+ | * Frequently accessed data that need lower write latency in DRAM | ||
+ | | ||
+ | |||
+ | ===== Lecture 27 (4/6 Mon.) ===== | ||
+ | * Flynn's taxonomy | ||
+ | * Parallelism | ||
+ | * Reduces power consumption (P ~ CV^2F) | ||
+ | * Better cost efficiency and easier to scale | ||
+ | * Improves dependability (in case the other core is faulty) | ||
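The power claim above (P ~ CV^2F) can be checked with normalized numbers: replacing one fast core with two slower, lower-voltage cores keeps throughput (if the work parallelizes perfectly) while cutting dynamic power. The 0.8x/0.5x scaling factors are illustrative, not from the notes.

```python
def dynamic_power(c, v, f):
    """Dynamic (switching) power: P = C * V^2 * f."""
    return c * v * v * f

# One core at full voltage/frequency vs. two cores at 0.8x V and 0.5x f
# (two cores at half frequency give the same total instruction rate).
one_core  = dynamic_power(1.0, 1.0, 1.0)
two_cores = 2 * dynamic_power(1.0, 0.8, 0.5)
print(one_core, round(two_cores, 2))   # 1.0 0.64 -> parallelism saves power
```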
+ | * Different types of parallelism | ||
+ | * Instruction level parallelism | ||
+ | * Data level parallelism | ||
+ | * Task level parallelism | ||
+ | * Task level parallelism | ||
+ | * Partition a single, potentially big, task into multiple parallel sub-tasks | ||
+ | * Can be done explicitly (parallel programming by the programmer) | ||
+ | * Or implicitly (hardware partitions a single thread speculatively) | ||
+ | * Or, run multiple independent tasks (still improves throughput, but the speedup of any single task is not better; also simpler to implement) | ||
+ | * Loosely coupled multiprocessor | ||
+ | * No shared global address space | ||
+ | * Message passing to communicate between processors | ||
+ | * Simple to manage memory | ||
+ | * Tightly coupled multiprocessor | ||
+ | * Shared global address space | ||
+ | * Need to ensure consistency of data | ||
+ | * Programming issues | ||
+ | * Hardware-based multithreading | ||
+ | * Coarse grained | ||
+ | * Fine grained | ||
+ | * Simultaneous: Dispatch instruction from multiple threads at the same time | ||
+ | * Parallel speedup | ||
+ | * Superlinear speedup | ||
+ | * Utilization, Redundancy, Efficiency | ||
+ | * Amdahl's law | ||
+ | * Maximum speedup | ||
+ | * Parallel portion is not perfect | ||
+ | * Serial bottleneck | ||
+ | * Synchronization cost | ||
+ | * Load balance | ||
+ | * Some threads have more work and require more time to hit the sync. point | ||
+ | * Critical sections | ||
+ | * Enforce mutually exclusive access to shared data | ||
+ | * Issues in parallel programming | ||
+ | * Correctness | ||
+ | * Synchronization | ||
+ | * Consistency | ||
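The Amdahl's law bullets above (maximum speedup, imperfect parallel portion) can be made concrete with a short calculation; the parallel fraction and core counts here are illustrative assumptions:

```python
# Amdahl's law: speedup on n processors when a fraction p of the
# execution is parallelizable and the rest is a serial bottleneck.
def amdahl_speedup(p, n):
    """Speedup = 1 / ((1 - p) + p / n)."""
    return 1.0 / ((1.0 - p) + p / n)

# Even a 5% serial portion caps the maximum speedup at 20x,
# no matter how many cores are added.
print(round(amdahl_speedup(0.95, 16), 2))     # 16 cores
print(round(amdahl_speedup(0.95, 10**9), 2))  # effectively infinite cores
```

This ignores synchronization cost and load imbalance, which the lecture lists as further reasons the parallel portion is not perfect.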
+ | |||
+ | |||
+ | ===== Lecture 28 (4/8 Wed.) ===== | ||
+ | * Ordering of instructions | ||
+ | * Maintaining memory consistency when there are multiple threads and shared memory | ||
+ | * Need to ensure the semantics are not changed | ||
+ | * Making sure the shared data is properly locked when used | ||
+ | * Support mutual exclusion | ||
+ | * Ordering depends on when each processor executes | ||
+ | * Debugging is also difficult (non-deterministic behavior) | ||
+ | * Dekker's algorithm | ||
+ | * Inconsistency -- the two processors did NOT see the same order of operations to memory | ||
+ | * Sequential consistency | ||
+ | * Multiple correct global orders | ||
+ | * Two issues: | ||
+ | * Too conservative/strict | ||
+ | * Performance limiting | ||
+ | * Weak consistency: global ordering only at synchronization points | ||
+ | * Programmer hints where the synchronization points are | ||
+ | * Memory fence | ||
+ | * More burden on the programmers | ||
+ | * Cache coherence | ||
+ | * Can be done in the software level or hardware level | ||
+ | * Snoop-based coherence | ||
+ | * A simple protocol with two states by broadcasting reads/writes on a bus | ||
+ | * Maintaining coherence | ||
+ | * Needs to provide 1) write propagation and 2) write serialization | ||
+ | * Update vs. Invalidate | ||
+ | * Two cache coherence methods | ||
+ | * Snoopy bus | ||
+ | * Bus based, single point of serialization | ||
+ | * More efficient with small number of processors | ||
+ | * Processors snoop other caches' read/write requests to keep the cache block coherent | ||
+ | * Directory | ||
+ | * Single point of serialization per block | ||
+ | * Directory coordinates the coherency | ||
+ | * More scalable | ||
+ | * The directory keeps track of where the copies of each block reside | ||
+ | * Supplies data on a read | ||
+ | * Invalidates the block on a write | ||
+ | * Has an exclusive state | ||
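The Dekker-style inconsistency above can be checked by brute force: under sequential consistency, every legal interleaving preserves each thread's program order, so the outcome where both processors read 0 is impossible. A small sketch (thread and variable names are made up for illustration):

```python
from itertools import permutations

# Dekker-style example under sequential consistency:
#   P0: flag0 = 1 ; r0 = flag1        P1: flag1 = 1 ; r1 = flag0
# Enumerate every interleaving that preserves program order within
# each thread and collect the (r0, r1) outcomes.
def outcomes():
    results = set()
    ops0 = [("w", "flag0"), ("r", "flag1")]   # P0 in program order
    ops1 = [("w", "flag1"), ("r", "flag0")]   # P1 in program order
    for order in set(permutations([0, 0, 1, 1])):  # merge the two sequences
        mem = {"flag0": 0, "flag1": 0}
        idx = [0, 0]
        regs = {}
        for t in order:
            op, var = (ops0, ops1)[t][idx[t]]
            idx[t] += 1
            if op == "w":
                mem[var] = 1
            else:
                regs[t] = mem[var]
        results.add((regs[0], regs[1]))
    return results

print(sorted(outcomes()))   # (0, 0) never appears
```

On real hardware with weaker consistency, (0, 0) can occur unless memory fences are inserted, which is exactly why the lecture stresses fences and programmer burden.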
+ | |||
+ | ===== Lecture 29 (4/10 Fri.) ===== | ||
+ | * MSI coherence protocol | ||
+ | * The problem: unnecessary broadcasts of invalidations | ||
+ | * MESI coherence protocol | ||
+ | * Adds the exclusive state to MSI: this is the only cached copy and it is clean | ||
+ | * Multiple invalidation tradeoffs | ||
+ | * Problem: memory can be unnecessarily updated | ||
+ | * A possible owner state (MOESI) | ||
+ | * Tradeoffs between snooping and directory based coherence protocols | ||
+ | * Slide 31 has a good summary | ||
+ | * Directory: data structures | ||
+ | * Bit vectors vs. linked lists | ||
+ | * Scalability of directories | ||
+ | * Size? Latency? Thousands of nodes? Best of both snooping and directory? | ||
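The MESI transitions discussed above can be sketched as a toy state table for a single cache line. This is only an illustration of the main transitions (event names are made up); a real protocol also handles data responses, writebacks, and bus races:

```python
# Toy MESI state machine for one cache line in one cache.
M, E, S, I = "M", "E", "S", "I"

def mesi_next(state, event):
    """Next state for processor-side or snooped-bus events."""
    table = {
        # Processor-side events
        (I, "PrRd_no_sharers"): E,   # read miss, no other copy -> Exclusive
        (I, "PrRd_sharers"):    S,   # read miss, others have it -> Shared
        (I, "PrWr"):            M,   # write miss -> read-for-ownership
        (E, "PrWr"):            M,   # silent upgrade: no invalidation broadcast
        (S, "PrWr"):            M,   # must first invalidate the other sharers
        # Snooped bus events from other caches
        (M, "BusRd"):           S,   # supply dirty data, downgrade
        (E, "BusRd"):           S,
        (M, "BusRdX"):          I,   # another writer invalidates this copy
        (E, "BusRdX"):          I,
        (S, "BusRdX"):          I,
    }
    return table.get((state, event), state)  # otherwise stay in place

# The E state is the point of MESI: a lone reader can later write
# without any bus traffic (no unnecessary invalidation broadcast).
print(mesi_next(mesi_next(I, "PrRd_no_sharers"), "PrWr"))
```

Adding an owner state on top of this table yields MOESI, which avoids the unnecessary memory updates mentioned above.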
+ | |||
+ | | ||
+ | ===== Lecture 30 (4/13 Mon.) ===== | ||
+ | * In-memory computing | ||
+ | * Design goals of DRAM | ||
+ | * DRAM structures | ||
+ | * Banks | ||
+ | * Capacitors and sense amplifiers | ||
+ | * Trade-offs b/w number of sense amps and cells | ||
+ | * Width of bank I/O vs. row size | ||
+ | * DRAM operations | ||
+ | * ACTIVATE, READ/WRITE, and PRECHARGE | ||
+ | * Trade-offs | ||
+ | * Latency | ||
+ | * Bandwidth: Chip vs. rank vs. bank | ||
+ | * What's the benefit of having 8 chips? | ||
+ | * Parallelism | ||
+ | * RowClone | ||
+ | * What are the problems? | ||
+ | * Copying b/w two rows that share the same sense amplifier | ||
+ | * System software support | ||
+ | * Bitwise AND/OR | ||
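The ACTIVATE/READ/PRECHARGE trade-off above can be sketched as a toy single-bank model: a row-buffer hit needs only a READ, while a conflict pays PRECHARGE + ACTIVATE + READ. The timing numbers are illustrative assumptions, not from any datasheet:

```python
# Toy DRAM bank latency model (all timings in ns, assumed values).
T_ACTIVATE, T_READ, T_PRECHARGE = 15, 15, 15

class Bank:
    def __init__(self):
        self.open_row = None          # row currently latched in the sense amps

    def access(self, row):
        """Return the latency in ns to read one column of `row`."""
        if self.open_row == row:                  # row-buffer hit
            return T_READ
        latency = T_READ + T_ACTIVATE             # must activate the new row
        if self.open_row is not None:             # conflict: close the old row
            latency += T_PRECHARGE
        self.open_row = row
        return latency

bank = Bank()
print(bank.access(3))   # bank idle: ACTIVATE + READ = 30
print(bank.access(3))   # row-buffer hit: READ = 15
print(bank.access(7))   # conflict: PRECHARGE + ACTIVATE + READ = 45
```

Multiple banks exist precisely so that such latencies can overlap across independent accesses (bank-level parallelism).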
+ | |||
+ | ===== Lecture 31 (4/15 Wed.) ===== | ||
+ | |||
+ | * Application slowdown | ||
+ | * Interference between different applications | ||
+ | * Applications' performance depends on other applications that they are running with | ||
+ | * Predictable performance | ||
+ | * Why are they important? | ||
+ | * Applications that need predictability | ||
+ | * How to predict the performance? | ||
+ | * What information is useful? | ||
+ | * What needs to be guaranteed? | ||
+ | * How to estimate the performance when running with others? | ||
+ | * Easy, just measure the performance while it is running. | ||
+ | * How to estimate the performance when the application is running by itself? | ||
+ | * Hard if there is no profiling. | ||
+ | * The relationship between memory service rate and the performance. | ||
+ | * Key assumption: applications are memory bound | ||
+ | * Behavior of memory-bound applications | ||
+ | * With and without interference | ||
+ | * Memory phase vs. compute phase | ||
+ | * MISE | ||
+ | * Estimating slowdown using request service rate | ||
+ | * Inaccuracy when measuring request service rate alone | ||
+ | * Non-memory-bound applications | ||
+ | * Control slowdown and provide soft guarantee | ||
+ | * Taking the shared cache into account | ||
+ | * MISE model + cache resource management | ||
+ | * Auxiliary tag store | ||
+ | * Separate tag store for different cores | ||
+ | * Cache access rate (alone vs. shared) as the metric to estimate slowdown | ||
+ | * Cache partitioning | ||
+ | * How to determine partitioning | ||
+ | * Utility based cache partitioning | ||
+ | * Others | ||
+ | * Maximum slowdown and fairness metric | ||
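The core MISE idea above — for a memory-bound application, slowdown tracks the ratio of memory request service rates — reduces to a one-line estimate. The rates below are made-up illustrative values, and this omits the non-memory-bound correction the lecture mentions:

```python
# MISE-style slowdown estimate for a memory-bound application:
#   slowdown ~= request service rate alone / request service rate shared
# rate_shared is measured while co-running; rate_alone is estimated by
# periodically giving the application highest memory-scheduler priority.
def estimated_slowdown(rate_alone, rate_shared):
    return rate_alone / rate_shared

print(estimated_slowdown(rate_alone=4.0e9, rate_shared=2.5e9))  # -> 1.6
```

Controlling this estimate per application is what enables the soft slowdown guarantees listed above.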
+ | | ||
+ | |||
+ | |||
+ | ===== Lecture 32 (4/20 Mon.) ===== | ||
+ | |||
+ | * Heterogeneous systems | ||
+ | * Asymmetric cores: different types of cores on the chip | ||
+ | * Each of these cores is optimized for different workloads/requirements/goals | ||
+ | * Multiple special purpose processors | ||
+ | * Flexible and can adapt to workload behavior | ||
+ | * Disadvantages: complex and high overhead | ||
+ | * Examples: CPU-GPU systems, heterogeneity in execution models | ||
+ | * Heterogeneous resources | ||
+ | * Example: reliable and non-reliable DRAM in the same system | ||
+ | * Key problems in modern systems | ||
+ | * Memory system | ||
+ | * Efficiency | ||
+ | * Predictability | ||
+ | * Asymmetric design can help solve these problems | ||
+ | * Serialized code sections | ||
+ | * Bottleneck in multicore execution | ||
+ | * Parallelizable vs. serial portion | ||
+ | * Accelerate critical section | ||
+ | * Cache ping-ponging | ||
+ | * Synchronization latency | ||
+ | * Symmetric vs. asymmetric design | ||
+ | * Large cores + small cores | ||
+ | * Core asymmetry | ||
+ | * Amdahl's law with heterogeneous cores | ||
+ | * Parallel bottlenecks | ||
+ | * Resource contention | ||
+ | * Depends on what applications are running | ||
+ | * Accelerated critical section | ||
+ | * Ship critical sections to large cores | ||
+ | * Small modifications and low overhead | ||
+ | * False serialization might become the bottleneck | ||
+ | * Can reduce parallel throughput | ||
+ | * Effect on private cache misses and shared cache misses | ||
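"Amdahl's law with heterogeneous cores" above can be sketched in the spirit of the Hill-and-Marty-style model: spend r base-core-equivalents (BCEs) on one big core whose performance is assumed to grow as sqrt(r), run the serial part on it, and run the parallel part on the big core plus the remaining small cores. All numbers here are illustrative assumptions:

```python
import math

# Asymmetric-multicore Amdahl's law sketch.
#   f = parallel fraction, n = total BCE budget, r = BCEs in the big core.
# Assumption: a big core built from r BCEs has performance sqrt(r).
def asymmetric_speedup(f, n, r):
    perf_big = math.sqrt(r)
    serial = (1 - f) / perf_big           # serial part on the big core
    parallel = f / (perf_big + (n - r))   # big core + (n - r) small cores
    return 1.0 / (serial + parallel)

# With f = 0.9 and 16 BCEs: all-small symmetric (r = 1) vs. one 4-BCE big core.
print(round(asymmetric_speedup(0.9, 16, 1), 2))
print(round(asymmetric_speedup(0.9, 16, 4), 2))
```

The asymmetric design wins because the big core accelerates the serial bottleneck, the same reason accelerated critical sections are shipped to a large core.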
+ | | ||
+ | | ||
+ | ===== Lecture 33 (4/27 Mon.) ===== | ||
+ | |||
+ | * Interconnects | ||
+ | * Connecting multiple components together | ||
+ | * Goal: Scalability, flexibility, performance and energy efficiency | ||
+ | * Metrics: Performance, bandwidth, bisection bandwidth, cost, energy efficiency, system performance, contention, latency | ||
+ | * Saturation point | ||
+ | * Saturation throughput | ||
+ | * Topology | ||
+ | * How to wire components together, affects routing, throughput, latency | ||
+ | * Bus: All nodes connected to a single shared bus | ||
+ | * Hard to increase frequency, bandwidth, poor scalability but simple | ||
+ | * Point-to-point | ||
+ | * Low contention and potentially low latency. Costly, not scalable and hard to wire. | ||
+ | * Crossbar | ||
+ | * No contention. Requests from different src/dest pairs can be sent concurrently. Costly. | ||
+ | * Multistage logarithmic network | ||
+ | * Indirect network, low contention, multiple requests can be sent concurrently. More scalable compared to crossbar. | ||
+ | * Circuit switch | ||
+ | * Omega network, delta network. | ||
+ | * Butterfly network | ||
+ | * Intermediate switch between sources and destinations | ||
+ | * Switching vs. topology | ||
+ | * Ring | ||
+ | * Each node connected to two other nodes, forming a ring | ||
+ | * Low overhead, high latency, not as scalable. | ||
+ | * Unidirectional ring and bi-directional ring | ||
+ | * Hierarchical Rings | ||
+ | * Layers of rings. More scalable, lower latency. | ||
+ | * Bridge routers connect multiple rings together | ||
+ | * Mesh | ||
+ | * 4 input and output ports | ||
+ | * More bisection bandwidth and more scalable | ||
+ | * Easy to layout | ||
+ | * Path diversity | ||
+ | * Routers are more complex | ||
+ | * Tree | ||
+ | * Another hierarchical topology | ||
+ | * Specialized topology | ||
+ | * Good for local traffic | ||
+ | * Fat tree: higher levels have more bandwidth | ||
+ | * CM-5 Fat tree | ||
+ | * Fat tree with 4x2 switches | ||
+ | * Hypercube | ||
+ | * N-Dimensional cubes | ||
+ | * Caltech cosmic cube | ||
+ | * Very complex | ||
+ | * Routing algorithm | ||
+ | * How does message get sent from source to destination | ||
+ | * Static or adaptive | ||
+ | * Handling contention | ||
+ | * Buffering helps handling contention, but adds complexity | ||
+ | * Three types of routing algorithms | ||
+ | * Deterministic: always takes the same path | ||
+ | * Oblivious: takes different paths without taking the state of the network into account | ||
+ | * For example, Valiant algorithm | ||
+ | * Adaptive: takes different paths, taking the state of the network into account | ||
+ | * Non-minimal adaptive routing vs. minimal adaptive routing | ||
+ | * Minimal path: path that has minimum number of hops | ||
+ | * Buffering and flow control | ||
+ | * How to store within the network | ||
+ | * Handling oversubscription | ||
+ | * Source throttling | ||
+ | * Bufferless vs. buffered crossbars | ||
+ | * Buffer overflow | ||
+ | * Bufferless deflection routing | ||
+ | * Deflect packets when there is contention | ||
+ | * Hot-potato routing | ||
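The "deterministic" routing class above is commonly illustrated by XY (dimension-order) routing on a 2D mesh: route fully in X first, then in Y. A minimal sketch (coordinate convention assumed):

```python
# Deterministic XY routing on a 2D mesh. Always takes the same minimal
# path (X dimension first, then Y), is deadlock-free on a mesh, but
# cannot route around congestion the way adaptive routing can.
def xy_route(src, dst):
    """Return the list of (x, y) hops from src to dst, inclusive."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # resolve the X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

print(xy_route((0, 0), (2, 1)))
# -> [(0, 0), (1, 0), (2, 0), (2, 1)], a minimal path of 3 hops
```

Oblivious schemes like Valiant instead route via a random intermediate node, and adaptive schemes pick among paths based on network state.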
+ | |||
+ | |||
+ | |