  * Advantages?

===== Lecture 14 (2/19 Wed.) =====

  * Predictors (branch predictor, cache line predictor, ...)
  * Power budget (and its importance)
  * Architectural state, precise state
  * Memory dependences are known only dynamically
  * Register state is not shared across threads/processors
  * Memory state is shared across threads/processors
    * How to maintain speculative memory state
    * Write buffers (help simplify the process of checking the reorder buffer)
  * Overall OoO mechanism
  * What are other ways of eliminating dispatch stalls?
    * Dispatch when the sources are ready
    * Retired instructions make the sources available
  * Register renaming (a minimal sketch follows this list)
  * Reservation stations
    * What goes into a reservation station
    * Tags required in a reservation station
  * Tomasulo's algorithm
  * Without precise exceptions, OoO is hard to debug
  * Arch. register ID
  * Examples in the slides
    * Slide 28 --> register renaming
    * Slides 30-35 --> exercise (also on the board)
      * This will be useful for the midterm
  * Register alias table (RAT)

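The register renaming bullet above lends itself to a tiny code sketch. This is a hypothetical, simplified model (the class name and tag scheme are made up, not from the slides): a register alias table (RAT) maps each architectural register to the tag of its latest in-flight producer, which is how false WAR/WAW dependences disappear.

<code python>
# Minimal register-renaming sketch (illustrative only).
class Renamer:
    def __init__(self, num_arch_regs):
        # None means the value lives in the architectural register file
        self.rat = {r: None for r in range(num_arch_regs)}
        self.next_tag = 0

    def rename(self, dest, src1, src2):
        # Sources read the current mapping: a producer tag or the reg file
        s1, s2 = self.rat[src1], self.rat[src2]
        # The destination gets a fresh tag; later readers wait on that tag
        tag = self.next_tag
        self.next_tag += 1
        self.rat[dest] = tag
        return tag, s1, s2

r = Renamer(8)
print(r.rename(dest=1, src1=2, src2=3))  # (0, None, None)
# The next write to r1 gets a new tag, so the WAW hazard on r1 is gone,
# and the read of r1 is linked to tag 0 (its producer), not the reg file.
print(r.rename(dest=1, src1=1, src2=4))  # (1, 0, None)
</code>
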
===== Lecture 15 (2/21 Fri.) =====

  * OoO --> restricted dataflow
    * Extracting parallelism
    * What are the bottlenecks?
      * Issue width
      * Dispatch width
      * Parallelism in the program
    * More examples on slide #10
  * What does it mean to be restricted dataflow?
    * Still visible as a Von Neumann model
  * Where does the efficiency come from?
  * Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
  * Load/store handling
    * We would like to schedule them out of order, but make them visible in order
    * When do you schedule the load/store instructions?
    * Can we predict whether loads/stores are dependent?
      * This is one of the most complex parts of load/store handling
      * What information can be used to predict these load/store optimizations?
  * Note: IPC = 1/CPI (see the worked example after this list)
  * Centralized vs. distributed? What are the tradeoffs?
  * How to handle a misprediction/recovery
  * Token dataflow arch.
    * What are tokens?
    * How to match tokens
    * Tagged token dataflow arch.
    * What are the tradeoffs?
    * Difficulties?

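A quick worked example for the IPC = 1/CPI note above, with made-up numbers:

<code python>
# If a program needs 8e9 cycles to execute 4e9 instructions (assumed
# numbers), then CPI = cycles/instructions and IPC is its reciprocal.
instructions = 4e9
cycles = 8e9
cpi = cycles / instructions   # 2.0 cycles per instruction
ipc = 1 / cpi                 # 0.5 instructions per cycle
print(cpi, ipc)
</code>
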
===== Lecture 16 (2/24 Mon.) =====

  * SISD/SIMD/MISD/MIMD
  * Array processor
  * Vector processor
  * Data parallelism
    * Where does the concurrency arise?
  * Differences between an array processor and a vector processor
  * VLIW
  * Compactness of an array processor
  * A vector processor operates on a vector of data (rather than a single datum (scalar))
    * Vector length (also applies to an array processor)
    * No dependences within a vector --> can have a deep pipeline
    * Highly parallel (both instruction level (ILP) and memory level (MLP))
    * But the program needs to be very parallel
    * Memory can be the bottleneck (due to very high MLP)
  * What do the functional units look like? Deep pipelines and simpler control.
  * CRAY-1 is a classic example of a vector processor
  * Memory access patterns in a vector processor
    * How do the memory accesses benefit the memory bandwidth?
    * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a brief explanation of memory level parallelism
  * Stride length vs. the number of banks
    * The stride length should be relatively prime to the number of banks (see the toy model after this list)
  * Tradeoffs between row major and column major --> how can the vector processor deal with the two?
  * How to calculate the efficiency and performance of vector processors
  * What if there are multiple memory ports?
  * Gather/scatter allows a vector processor to be a lot more programmable (i.e., gather data for parallelism)
    * Helps handle sparse matrices
  * Conditional operation
  * Structure of vector units
  * How to automatically parallelize code through the compiler?
    * This is a hard problem. The compiler does not know the memory addresses.
  * What do we need to ensure for both vector and array processors?
  * Sequential bottleneck
    * Amdahl's law
  * Intel MMX --> an example of Intel's approach to SIMD
    * No VLEN register; the opcode defines the vector length
    * Stride is always 1 in MMX
  * Intel SSE --> modern version of MMX

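A toy model of the stride-vs-banks bullet above (the bank count and number of accesses are arbitrary assumptions): if element i maps to bank (i * stride) mod num_banks, a stride sharing a factor with the bank count concentrates accesses on a few banks, while a relatively prime stride spreads them across all of them.

<code python>
# Which banks does a strided vector access pattern touch?
def banks_touched(stride, num_banks, num_accesses=16):
    return {(i * stride) % num_banks for i in range(num_accesses)}

print(banks_touched(stride=8, num_banks=8))  # {0}: every access conflicts
print(banks_touched(stride=7, num_banks=8))  # all 8 banks: accesses overlap
</code>
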
===== Lecture 17 (2/26 Wed.) =====

  * GPU
    * Warp/wavefront
      * A bunch of threads sharing the same PC
    * SIMT
    * Lanes
    * FGMT + massively parallel
      * Tolerates long latency
  * Warp-based SIMD vs. traditional SIMD
  * SPMD (programming model)
    * A single program operates on multiple data
      * Can have synchronization points
    * Many scientific applications are programmed in this manner
  * Control flow problem (branch divergence)
    * Masking (in a branch, mask off the threads that should not execute that path; see the sketch after this list)
    * Lowers SIMD efficiency
    * What if you have layers of branches?
  * Dynamic warp formation
    * Combining threads from different warps to increase SIMD utilization
    * This can cause memory divergence
  * VLIW
    * Wide fetch
    * IA-64
    * Tradeoffs
      * Simple hardware (no dynamic scheduling, no dependency checking within a VLIW instruction)
      * Puts a heavy load on the compiler
  * Decoupled access/execute
    * A limited form of OoO
    * Tradeoffs
    * How to steer the instructions (determine dependences/stalling)?
  * Instruction scheduling techniques (static vs. dynamic)
  * Systolic arrays
    * Processing elements transform data in chains
    * Developed for image processing (for example, convolution)
    * Staged processing

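A minimal sketch of the masking bullet above (pure illustration, not any real GPU ISA): all threads in a warp share one PC, so on a divergent branch both paths are executed, and a per-thread mask disables the threads that are not on the current path.

<code python>
# Execute an if/else over a 4-thread warp using masks.
def simt_branch(values):
    taken = [v > 0 for v in values]   # per-thread branch outcome (the mask)
    result = list(values)
    for i in range(len(values)):      # "if" path: only masked-in threads commit
        if taken[i]:
            result[i] = values[i] * 2
    for i in range(len(values)):      # "else" path: the complementary mask
        if not taken[i]:
            result[i] = 0
    return result

# Both paths run over the whole warp, so SIMD efficiency is 50% here.
print(simt_branch([3, -1, 5, -2]))    # [6, 0, 10, 0]
</code>
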
===== Lecture 18 (2/28 Fri.) =====

  * Tradeoffs of VLIW
  * Why does VLIW require static instruction scheduling?
    * Whose job is it?
    * The compiler can rearrange basic blocks/instructions
  * Basic block
    * Benefits of having large basic blocks
    * Entry/exit
    * Handling entries/exits
  * Trace cache
    * How to ensure correctness?
    * Profiling
    * Fixing up the instruction order to ensure correctness
    * Dealing with multiple entries into the block
    * Dealing with multiple exits from the block
  * Superblock
    * How to form superblocks?
    * Benefits of a superblock
    * Tradeoff between forming a superblock and not forming one
      * Ambiguous branch (after profiling, taken/not-taken are equally likely)
    * Cleaning up
    * What scenarios would make trace cache/superblock/profiling less effective?
  * List scheduling (a small sketch follows this list)
    * Helps figure out which instructions the VLIW machine should fetch
    * Tries to maximize instruction throughput
    * How to assign priorities
    * What if some instructions take longer than others?
  * Block-structured ISA (BS-ISA)
    * Problems with trace scheduling?
    * What types of programs will benefit from BS-ISA?
    * How to form blocks in BS-ISA?
      * Combining basic blocks
      * Multiple versions of merged basic blocks
    * How to deal with entries/exits in BS-ISA?
      * Undo the executed instructions from the entry point, then fetch the new block
    * Advantages over a trace cache
  * Benefits of VLIW + static instruction scheduling
    * Intel IA-64
      * Static instruction scheduling and VLIW

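A small list-scheduling sketch for the bullet above (the priority function and issue width are assumptions; critical-path height is one common priority choice): each cycle, pick up to the issue width of ready instructions, highest priority first.

<code python>
# Dependence graph: instruction -> the instructions it reads from.
deps = {"a": [], "b": [], "c": ["a"], "d": ["a", "b"], "e": ["c", "d"]}

def height(n):
    # Priority: length of the longest dependence chain starting at n
    users = [m for m, srcs in deps.items() if n in srcs]
    return 1 + max((height(u) for u in users), default=0)

done, schedule, width = set(), [], 2
while len(done) < len(deps):
    ready = [n for n in deps
             if n not in done and all(s in done for s in deps[n])]
    picked = sorted(ready, key=height, reverse=True)[:width]
    schedule.append(picked)
    done.update(picked)

print(schedule)  # [['a', 'b'], ['c', 'd'], ['e']]: one list per VLIW cycle
</code>
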
===== Lecture 19 (3/19 Wed.) =====

  * Ideal cache
    * More capacity
    * Fast
    * Cheap
    * High bandwidth
  * DRAM cell
    * Cheap
    * Senses the perturbation through a sense amplifier
    * Slow and leaky
  * SRAM cell (cross-coupled inverters)
    * Expensive
    * Fast (easier to sense the value in the cell)
  * Memory bank
    * Read access sequence
    * DRAM: Activate -> Read -> Precharge (if needed)
    * What dominates the access latency for DRAM and SRAM?
  * Scaling issue
    * Hard to scale the cells down to smaller sizes
  * Memory hierarchy
    * Prefetching
    * Caching
  * Spatial and temporal locality
    * A cache can exploit these
    * Recently used data is likely to be accessed again
    * Nearby data is likely to be accessed
  * Caching in a pipelined design
  * Cache management
    * Manual
      * Data movement is managed manually
        * Embedded processors
        * GPU scratchpad
    * Automatic
      * HW manages data movement
  * Latency analysis
    * Based on the hit/miss status, the next level's access time (on a miss), and the current level's access time
  * Cache basics
    * Set/block (line)/placement/replacement/direct-mapped vs. associative caches/etc.
  * Cache access
    * How to access tag and data (in parallel vs. serially)
    * How do the tag and index get used? (see the sketch after this list)
    * Modern processors perform serial access for higher-level caches (L3, for example) to save power
  * Cost and benefit of having more associativity
  * Given the associativity, which block should be replaced if the set is full?
  * Replacement policy
    * Random
    * Least recently used (LRU)
    * Least frequently used
    * Least costly to refetch
    * etc.
  * How to implement LRU
    * How to keep track of access ordering
      * Complexity increases rapidly with associativity
    * Approximate LRU
      * Victim/next-victim policy

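To make the tag/index question above concrete, here is a minimal address-decomposition sketch (the 64 B block size and 128 sets are assumed parameters, not from the lecture):

<code python>
BLOCK_BITS = 6   # 64-byte blocks -> low 6 bits are the byte-in-block offset
INDEX_BITS = 7   # 128 sets      -> next 7 bits select the set

def split(addr):
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

# The index picks the set; the tag is compared against the tags stored
# in that set to detect a hit.
print(split(0x12345))  # (9, 13, 5)
</code>
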
===== Lecture 20 (3/21 Fri.) =====

  * Set thrashing
    * The working set is bigger than the associativity
  * Belady's OPT (see the sketch after this list)
    * Is this optimal?
    * Complexity?
  * Similarity between caches and page tables
    * Number of blocks vs. pages
    * Time to find the block/page to replace
  * Handling writes
    * Write-through
      * Simpler, no consistency issues between levels
    * Write-back
      * Needs a modified (dirty) bit to make sure accesses to the data get the updated copy
  * Sectored caches
    * Use subblocks
    * Lower bandwidth requirement
    * More complex
  * Instruction vs. data caches
    * Where to place instructions
      * Unified vs. separate
      * In the first-level cache
  * Cache access
    * First-level access
    * Second-level access
      * When to start the second-level access
      * Performance vs. energy
  * Address translation
  * Homonyms and synonyms
    * Homonym: the same VA maps to different PAs
      * With multiple processes
    * Synonyms: multiple VAs map to the same PA
      * Shared libraries, shared data, copy-on-write
      * I/O
    * Can these create problems when we have caches?
    * How to eliminate these problems?
      * Page coloring
  * Interaction between the cache and the TLB
    * Virtually indexed vs. physically indexed
    * Virtually tagged vs. physically tagged
    * Virtually indexed, physically tagged
  * Virtual memory in DRAM
    * Control where data is mapped to in channel/rank/bank
      * More parallelism
      * Reduce interference

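A small sketch of Belady's OPT for the bullets above (illustrative): evict the resident block whose next use is farthest in the future. It minimizes misses but needs the whole future access trace, which is why it serves as a bound rather than a practical policy.

<code python>
def opt_misses(trace, capacity):
    cache, misses = set(), 0
    for i, block in enumerate(trace):
        if block in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            future = trace[i + 1:]
            # Evict the block reused farthest in the future (or never)
            victim = max(cache, key=lambda b: future.index(b)
                         if b in future else len(future) + 1)
            cache.remove(victim)
        cache.add(block)
    return misses

print(opt_misses(["a", "b", "c", "a", "b", "d", "a"], capacity=2))  # 5
</code>
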
===== Lecture 21 (3/24 Mon.) =====

  * Different parameters that affect cache misses
    * Thrashing
  * Different types of cache misses
    * Compulsory misses
      * Can be mitigated with prefetching
    * Capacity misses
      * More associativity
      * Victim cache
    * Conflict misses
      * Hashing
  * Large blocks vs. small blocks
    * Subblocks
  * Victim cache
    * A small but fully associative cache behind the actual cache
    * Holds recently evicted (victim) blocks
    * Prevents ping-ponging
  * Pseudo-associativity
    * A simpler way to implement an associative cache
  * Skewed associative cache
    * A different hashing function for each way
  * Restructuring data access patterns
    * Order of loop traversal
    * Blocking (see the tiling sketch after this list)
  * Memory level parallelism
    * The cost per miss of parallel cache misses is lower than that of serialized misses
  * MSHR
    * Keeps track of pending cache misses
    * Think of it as the load/store-buffer-ish structure for the cache
    * What information goes into the MSHR?
    * When do you access the MSHR?

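A minimal loop-blocking (tiling) sketch for the bullet above (the matrix size and tile size are arbitrary): traversing the matrix tile by tile keeps each tile cache-resident while it is reused, which cuts capacity misses.

<code python>
N, B = 8, 4                           # matrix dimension and tile size
A = [[i * N + j for j in range(N)] for i in range(N)]

total = 0
for ii in range(0, N, B):             # tile row
    for jj in range(0, N, B):         # tile column
        for i in range(ii, ii + B):   # all accesses below stay in one tile
            for j in range(jj, jj + B):
                total += A[i][j]

print(total)  # same sum as the untiled loops, but with better locality
</code>
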
===== Lecture 22 (3/26 Wed.) =====

  * Multiporting
    * Virtual multiporting
      * Time-share the port; not too scalable, but cheap
    * True multiporting
    * Multiple cache copies
  * Banking
    * Can have bank conflicts
    * Extra interconnect across banks
    * Address mapping can mitigate bank conflicts
    * Common in main memory (note that the regfile in a GPU is also banked, but mainly for the purpose of reducing complexity)
  * Accessing DRAM
    * Row bits
    * Column bits
    * Addressability
    * DRAM has its own clock
  * DRAM (1T-1C cell) vs. SRAM (6T cell)
    * Cost
    * Latency
  * Interleaving in DRAM
    * Effects of the address mapping on memory interleaving
    * Effects of the program's memory access patterns on interleaving
  * DRAM bank
    * Minimizes the cost of interleaving (banks share the data bus and the command bus)
  * DRAM rank
    * Minimizes the cost per chip (a bundle of chips operated together)
  * DRAM channel
    * An interface to DRAM, each with its own ranks/banks
  * DIMM
    * More DIMMs add interconnect complexity
  * List of commands to read/write data in DRAM
    * Activate -> read/write -> precharge
    * Activate moves data into the row buffer
    * Precharge prepares the bank for the next access
  * Row buffer hit
  * Row buffer conflict
  * Scheduling memory requests to lower row conflicts
  * Burst mode of DRAM
    * Prefetch 32 bits over an 8-bit interface when DRAM needs to read 32 bits
  * Address mapping (see the sketch after this list)
    * Row interleaved
    * Cache block interleaved
  * Memory controller
    * Sends DRAM commands
    * Periodically sends commands to refresh DRAM cells
    * Ensures correctness and data integrity
    * Where to place the memory controller
      * On the CPU chip vs. at the main memory
        * Higher BW on-chip
    * Determines the order in which requests are serviced in DRAM
      * Request queues that hold requests
      * Send requests whenever a request can be sent to the bank
      * Determine which command (across banks) should be sent to DRAM
      * Priority of demand vs. prefetch requests
  * Memory scheduling policies
    * FCFS
    * FR-FCFS
    * Capped FR-FCFS: FR-FCFS with a timeout
    * Usually this is done at the command level (read/write commands and precharge/activate commands)

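A toy sketch of the two address mappings above (all field widths are assumptions): the same physical address is sliced two ways. Row interleaving keeps consecutive addresses in one bank's row (more row buffer hits); cache block interleaving sends consecutive blocks to different banks (more bank-level parallelism).

<code python>
def fields(addr, layout):
    # Slice addr into named fields, lowest-order field first.
    out = {}
    for name, width in layout:
        out[name] = addr & ((1 << width) - 1)
        addr >>= width
    return out

row_interleaved   = [("offset", 6), ("column", 8), ("bank", 3), ("row", 15)]
block_interleaved = [("offset", 6), ("bank", 3), ("column", 8), ("row", 15)]

addr = 0x1234ABC
print(fields(addr, row_interleaved))    # consecutive blocks -> same bank
print(fields(addr, block_interleaved))  # consecutive blocks -> spread over banks
</code>
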
===== Lecture 23 (3/28 Fri.) =====

  * DRAM design choices
    * Cost/density/latency/BW/yield
  * Sense amplifiers
    * How do they work?
  * Dual data rate
  * Subarray
  * RowClone
    * Moves bulk data from one row to another
    * Lower latency and higher BW when copying or zeroing out data
  * TL-DRAM
    * Far segment
    * Near segment
    * What causes the long latency?
    * Benefits of TL-DRAM
    * TL-DRAM vs. a DRAM cache (adding a small cache in DRAM)

===== Lecture 24 (3/31 Mon.) =====

  * Memory controller
    * Different commands
  * Memory scheduler
    * Determines the order of requests to be issued to DRAM
    * Age / hit-miss status / type (load/store/prefetch, from GPU or CPU) / criticality
  * Row buffer
    * Hit/conflict
    * Open/closed row
    * Open-row policy
    * Closed-row policy
    * Tradeoffs between open- and closed-row policies
      * If the program has high row buffer locality, an open-row policy might benefit more
      * A closed-row policy services row-miss requests faster
  * Bank conflicts
  * Interference from different applications/threads
    * Different programs/processes/threads interfere with each other
      * Introduces more row buffer/bank conflicts
    * The memory scheduler has to manage this interference
    * The memory hog problem
    * Interference on the data/command bus
  * FR-FCFS (see the sketch after this list)
    * Why does FR-FCFS make sense?
      * The row buffer has lower latency
    * Issues with FR-FCFS
      * Unfairness
  * STFM
    * Fairness issues in memory scheduling
    * How does STFM calculate fairness and slowdown?
      * How to estimate the time a thread would take if it were running alone
    * Definition of fairness (based on STFM; different papers/areas define fairness differently)
  * PAR-BS
    * Parallelism in programs
    * Interference across banks
    * How to form a batch
    * How to determine ranking between batches/within a batch

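A minimal FR-FCFS sketch for the bullets above (the request format and open-row state are made up): among queued requests, first prefer row buffer hits (the open row has lower latency), then the oldest request.

<code python>
requests = [(0, 0, 7), (1, 0, 3), (2, 1, 5), (3, 0, 7)]  # (age, bank, row)
open_row = {0: 7, 1: 2}          # currently open row in each bank

def fr_fcfs_pick(reqs):
    # Sort key: row hits before row misses, then older before younger.
    return min(reqs, key=lambda r: (open_row.get(r[1]) != r[2], r[0]))

# (0, 0, 7) wins: it is a row hit and older than the other hit (3, 0, 7).
# This preference is also where the unfairness comes from: applications
# with high row buffer locality keep winning over the others.
print(fr_fcfs_pick(requests))
</code>
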
===== Lecture 25 (4/2 Wed.) =====

  * Latency sensitivity
    * Performance drops a lot when the memory request latency is long
  * TCM
    * Tradeoff between throughput and fairness
    * Latency-sensitive cluster (non-intensive cluster)
      * Ranking based on memory intensity
    * Bandwidth-intensive cluster
      * Round-robin within the cluster
    * Generally the latency-sensitive cluster gets higher priority
    * Provides robust fairness vs. throughput
    * Complexity of TCM?
  * Different ways to control interference in DRAM
    * Partitioning of resources
      * Channel partitioning: map applications that interfere with each other to different channels
        * Keep track of applications' characteristics
        * Dedicating a channel might waste bandwidth
        * Needs OS support to determine the channel bits
    * Source throttling
      * A controller throttles the cores depending on the performance target
      * Example: fairness via source throttling
        * Detect unfairness and throttle the application that is interfering
        * How do you estimate slowdown?
        * Threshold-based solutions are hard to configure
    * App/thread scheduling
      * Critical threads usually stall the progress
  * Designing a DRAM controller
    * Has to handle the normal DRAM operations
      * Read/write/refresh/all the timing constraints
    * Keeps track of resources
    * Assigns priorities to different requests
    * Manages requests to banks
  * Self-optimizing controller
    * Uses machine learning to improve the DRAM controller
  * DRAM refresh
    * Why does DRAM have to refresh every 64 ms?
    * Banks are unavailable during refresh
      * LPDDR mitigates this by using per-bank refresh
    * Refresh takes longer with bigger DRAM
    * Distributed refresh: stagger refreshes within the 64 ms window in a distributed manner
      * As opposed to burst refresh (long pause times)
  * RAIDR: reduce DRAM refreshes by profiling and binning
    * Some rows do not have to be refreshed very frequently
    * Profile the rows
      * High temperature changes the retention time: need online profiling
  * Bloom filter (a minimal sketch follows this list)
    * Represents set membership
    * Approximate
    * Can return false positives
      * Better/more hash functions help reduce them
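
A minimal Bloom filter sketch for the bullets above (the bit-array size and hash count are arbitrary choices): k hash functions set k bits per inserted element, so membership tests can return false positives but never false negatives, which is what lets a mechanism like RAIDR track its refresh bins compactly.

<code python>
import hashlib

class BloomFilter:
    def __init__(self, num_bits=64, num_hashes=3):
        self.m, self.k = num_bits, num_hashes
        self.bits = 0

    def _positions(self, item):
        # Derive k bit positions by hashing the item with k different salts.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        for p in self._positions(item):
            self.bits |= 1 << p

    def maybe_contains(self, item):
        return all((self.bits >> p) & 1 for p in self._positions(item))

bf = BloomFilter()
bf.add("row 42")
print(bf.maybe_contains("row 42"))  # True: inserted items are always found
print(bf.maybe_contains("row 7"))   # usually False; True = false positive
</code>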