  * Tradeoffs between control-driven and data-driven
    * Which are easier to program?
    * Which are easier to compile?
    * Which are more parallel (does that mean they are faster?)
    * Which machines are more complex to design?
  * In control flow, when a program stops, there is a pointer to the current state (precise state).
  * ISA vs. Microarchitecture
    * Semantics in the ISA
    * uArch should obey the ISA
    * Changing the ISA is costly and can affect compatibility.
    * Instruction pointers
    * uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
  * What is not part of the ISA? (what goes inside: uArch techniques)
    * Things that are not supposed to be visible to the programmer/software but typically make the processor faster and/or consume less power and/or less complex
+ | |||
+ | ===== Lecture 3 (1/17 Fri.) ===== | ||
+ | |||
+ | * Design tradeoffs | ||
+ | * Macro Architectures | ||
+ | * Reconfiguribility vs. specialized designs | ||
+ | * Parallelism (instructions, data parallel) | ||
+ | * Uniform decode (Example: Alpha) | ||
+ | * Steering bits (Sub-opcode) | ||
+ | * 0,1,2,3 address machines | ||
+ | * Stack machine | ||
+ | * Accumulator machine | ||
+ | * 2-operand machine | ||
+ | * 3-operand machine | ||
+ | * Tradeoffs between 0,1,2,3 address machines | ||
+ | * Instructions/Opcode/Operade specifiers (i.e. addressing modes) | ||
+ | * Simply vs. complex data type (and their tradeoffs) | ||
+ | * Semantic gap | ||
+ | * Translation layer | ||
+ | * Addressability | ||
+ | * Byte/bit addressable machines | ||
+ | * Virtual memory | ||
+ | * Big/little endian | ||
+ | * Benefits of having registers (data locality) | ||
+ | * Programmer visible (Architectural) state | ||
+ | * Programmers can access this directly | ||
+ | * What are the benefits? | ||
+ | * Microarchitectural state | ||
+ | * Programmers cannot access this directly | ||
+ | * Evolution of registers (from accumulators to registers) | ||
+ | * Different types of instructions | ||
+ | * Control instructions | ||
+ | * Data instructions | ||
+ | * Operation instructions | ||
+ | * Addressing modes | ||
+ | * Tradeoffs (complexity, flexibility, etc.) | ||
+ | * Orthogonal ISA | ||
+ | * Addressing modes that are orthogonal to instructino types | ||
+ | * Vectors vs. non vectored interrupts | ||
+ | * Complex vs. simple instructions | ||
+ | * Tradeoffs | ||
+ | * RISC vs. CISC | ||
+ | * Tradeoff | ||
+ | * Backward compatibility | ||
+ | * Performance | ||
+ | * Optimization opportunity | ||
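
As a concrete reference for the stack machine bullet above, here is a minimal sketch (with an invented instruction-tuple format and invented values) of how a 0-address machine evaluates a = b + c * d entirely through pushes and pops:

<code python>
# Minimal 0-address (stack) machine: arithmetic operands are implicit on a
# stack. Evaluates a = b + c * d as the postfix program
# "push b, push c, push d, MUL, ADD, pop a". Values are hypothetical.

def run_stack_machine(program, memory):
    stack = []
    for op, *arg in program:
        if op == "PUSH":             # one explicit operand: a memory location
            stack.append(memory[arg[0]])
        elif op == "POP":
            memory[arg[0]] = stack.pop()
        elif op == "ADD":            # zero explicit operands: both from the stack
            stack.append(stack.pop() + stack.pop())
        elif op == "MUL":
            stack.append(stack.pop() * stack.pop())
    return memory

memory = {"b": 2, "c": 3, "d": 4}
program = [("PUSH", "b"), ("PUSH", "c"), ("PUSH", "d"),
           ("MUL",), ("ADD",), ("POP", "a")]
print(run_stack_machine(program, memory))   # {'b': 2, 'c': 3, 'd': 4, 'a': 14}
</code>

A 3-operand machine would express the same computation as two instructions (e.g. MUL t, c, d then ADD a, b, t), trading stack traffic for larger operand specifiers.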
+ | |||
+ | ===== Lecture 4 (1/22 Wed.) ===== | ||
+ | |||
+ | * Semantic gap | ||
+ | * Small vs. Large semantic gap (CISC vs. RISC) | ||
+ | * Benefit of RISC vs. CISC | ||
+ | * Micro operations/microcode | ||
+ | * Translate complex instructions into smaller instructions | ||
+ | * Parallelism (motivation for RISC) | ||
+ | * Compiler optimization | ||
+ | * Code optimization through translation | ||
+ | * VLIW | ||
+ | * Fixed vs. variable length instructions | ||
+ | * Tradeoffs | ||
+ | * Alignment issues? (fetch/decode) | ||
+ | * Decoding issues? | ||
+ | * Code size? | ||
+ | * Adding additional instructions? | ||
+ | * Memory bandwidth and cache utilization? | ||
+ | * Energy? | ||
+ | * Encoding in variable length instructions | ||
+ | * Structure of Alpha instructions and other uniform decode instructions | ||
+ | * Different type of instructions | ||
+ | * Benefit of knowing what type of instructions | ||
+ | * Speculatively operate future instructions | ||
+ | * x86 and other non-uniform decode instructions | ||
+ | * Tradeoff vs. uniform decode | ||
+ | * Tradeoffs for different number of registers | ||
+ | * Spilling into memory if the number of registers is small | ||
+ | * Compiler optimization on how to manage which value to keep/spill | ||
+ | * Addressing modes | ||
+ | * Benefits? | ||
+ | * Types? | ||
+ | * Different uses of addressing modes? | ||
+ | * Various ISA-level tradeoffs | ||
+ | * Virtual memory | ||
+ | * Unalign memory access/aligned memory access | ||
+ | * Cost vs. benefit of unaligned access | ||
+ | * ISA specification | ||
+ | * Things you have to obey/specifie in the ISA specification | ||
+ | * Architectural states | ||
+ | * Microarchitecture implements how arch. state A transformed to the next arch. state A' | ||
+ | * Single cycle machines | ||
+ | * Critical path in the single cycle machine | ||
+ | * Multi cycle machines | ||
+ | * Functional units | ||
+ | * Performance metrics | ||
+ | * CPI/IPC | ||
+ | * CPI of a single cycle microarchitecture | ||
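
The CPI/IPC bullets reduce to one equation: execution time = instruction count x CPI x clock period. A toy comparison (all numbers invented for illustration) of a single-cycle design against a multi-cycle one:

<code python>
# Iron law of performance: time = instructions * CPI * clock_period.
# All numbers below are invented for illustration.

def exec_time(insts, cpi, clock_ghz):
    return insts * cpi / (clock_ghz * 1e9)    # seconds

insts = 1_000_000
# Single-cycle machine: CPI = 1, but the cycle must fit the slowest
# instruction, so the clock is slow.
t_single = exec_time(insts, cpi=1.0, clock_ghz=0.5)
# Multi-cycle machine: more cycles per instruction, but the clock is set
# by the slowest *stage*, not the slowest instruction.
t_multi = exec_time(insts, cpi=4.2, clock_ghz=3.0)

print(f"single-cycle: {t_single*1e3:.2f} ms, IPC = {1/1.0:.2f}")
print(f"multi-cycle : {t_multi*1e3:.2f} ms, IPC = {1/4.2:.2f}")
</code>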
+ | |||
+ | ===== Lecture 5 (1/24 Fri.) ===== | ||
+ | |||
+ | * Instruction processing | ||
+ | * Fetch | ||
+ | * Decode | ||
+ | * Execute | ||
+ | * Memory fetch | ||
+ | * Writeback | ||
+ | * Datapath & Control logic in microprocessors | ||
+ | * Different types of instructions (I-type, R-type, etc.) | ||
+ | * Control flow instructions | ||
+ | * Non-control flow instructions | ||
+ | * Delayed slot/Delayed branch | ||
+ | * Single cycle control logic | ||
+ | * Lockstep | ||
+ | * Critical path analysis | ||
+ | * Critical path of a single cycle processor | ||
+ | * Combinational logic & Sequential logic | ||
+ | * Control store | ||
+ | * Tradeoffs of a single cycle uarch | ||
+ | * Dynamic power/Static power | ||
+ | * Speedup calculation | ||
+ | * Parallelism | ||
+ | * Serial bottleneck | ||
+ | * Amdahl's bottleneck | ||
+ | * Design principles | ||
+ | * Common case design | ||
+ | * Critical path design | ||
+ | * Balanced designs | ||
+ | * Multi cycle design | ||
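
A worked version of the speedup bullets, assuming the textbook form of Amdahl's law (a fraction p of the work is sped up by a factor s):

<code python>
# Amdahl's law: if a fraction p of the work is sped up by factor s,
# overall speedup = 1 / ((1 - p) + p / s). The serial fraction (1 - p)
# bounds the speedup no matter how large s grows.

def amdahl_speedup(p, s):
    return 1.0 / ((1.0 - p) + p / s)

for s in (2, 10, 100, 1e6):
    print(f"p=0.9, s={s:>9}: speedup = {amdahl_speedup(0.9, s):.2f}")
# Even as s -> infinity, speedup is capped at 1 / (1 - 0.9) = 10.
</code>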
+ | |||
+ | ===== Lecture 6 (1/27 Mon.) ===== | ||
+ | |||
+ | * Microcoded/Microprogrammed machines | ||
+ | * States | ||
+ | * Microinstructions | ||
+ | * Microsequencing | ||
+ | * Control store - Product control signals | ||
+ | * Microsequencer | ||
+ | * Control signal | ||
+ | * What do they have to control? | ||
+ | * Instruction processing cycle | ||
+ | * Latch signals | ||
+ | * State machine | ||
+ | * State variables | ||
+ | * Condition code | ||
+ | * Steering bits | ||
+ | * Branch enable logic | ||
+ | * Difference between gating and loading? (write enable vs. driving the bus) | ||
+ | * Memory mapped I/O | ||
+ | * Hardwired logic | ||
+ | * What control signals come from hardwired logic? | ||
+ | * Variable latency memory | ||
+ | * Handling interrupts | ||
+ | * Difference betwen interrupts and exceptions | ||
+ | * Emulator (i.e. uCode allots minimal datapath to emulate the ISA) | ||
+ | * Updating machine behavior | ||
+ | * Horizontal microcode | ||
+ | * Vertical microcode | ||
+ | * Primitives | ||
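
A toy microsequencer sketch, with invented microstates and control signals: the control store maps each state to the control signals to assert plus the next state, which is the essence of the control store and microsequencer bullets above.

<code python>
# Toy microprogrammed control: the control store maps each microstate to
# (control signals to assert, next state). States/signals are invented.
CONTROL_STORE = {
    "FETCH":   ({"MAR<-PC", "MDR<-Mem[MAR]", "IR<-MDR", "PC<-PC+4"}, "DECODE"),
    "DECODE":  (set(), "EXECUTE"),   # in a real machine, next state depends on IR opcode
    "EXECUTE": ({"ALU", "RegWrite"}, "FETCH"),
}

state = "FETCH"
for _ in range(6):                   # run a few microcycles
    signals, next_state = CONTROL_STORE[state]
    print(f"{state:8s} asserts {sorted(signals)}")
    state = next_state               # the microsequencer picks the next microstate
</code>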
+ | |||
+ | ===== Lecture 7 (1/29 Wed.) ===== | ||
+ | |||
+ | * Pipelining | ||
+ | * Limitations of the multi-programmed design | ||
+ | * Idle resources | ||
+ | * Throughput of a pipelined design | ||
+ | * What dictacts the throughput of a pipelined design? | ||
+ | * Latency of the pipelined design | ||
+ | * Dependency | ||
+ | * Overhead of pipelining | ||
+ | * Latch cost? | ||
+ | * Data forwarding/bypassing | ||
+ | * What are the ideal pipeline? | ||
+ | * External fragmentation | ||
+ | * Issues in pipeline designs | ||
+ | * Stalling | ||
+ | * Dependency (Hazard) | ||
+ | * Flow dependence | ||
+ | * Output dependence | ||
+ | * Anti dependence | ||
+ | * How to handle them? | ||
+ | * Resource contention | ||
+ | * Keeping the pipeline full | ||
+ | * Handling exception/interrupts | ||
+ | * Pipeline flush | ||
+ | * Speculation | ||
+ | * Interlocking | ||
+ | * Multipath execution | ||
+ | * Fine grain multithreading | ||
+ | * No-op (Bubbles in the pipeline) | ||
+ | * Valid bits in the instructions | ||
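
A sketch of what dictates pipelined throughput and latency, assuming perfectly balanced stages and an invented latch overhead:

<code python>
# Throughput of an ideal pipeline: the cycle time is the slowest stage
# plus latch overhead. Numbers are invented: a 2000 ps datapath cut into
# k perfectly balanced stages, 50 ps of latch delay per stage.

def cycle_time(total_ps, k, latch_ps=50):
    return total_ps / k + latch_ps

base = cycle_time(2000, 1)
for k in (1, 2, 4, 8, 16):
    c = cycle_time(2000, k)
    speedup = base / c                  # ideal throughput gain (no stalls)
    latency = k * c                     # per-instruction latency grows
    print(f"k={k:2d}: cycle={c:6.1f} ps  throughput x{speedup:5.2f}  latency={latency:5.0f} ps")
# Latch overhead is why 16 stages give ~11.7x, not 16x, while latency rises.
</code>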
+ | |||
+ | ===== Lecture 8 (1/31 Fri.) ===== | ||
+ | * Branch prediction | ||
+ | * Different types of data dependence | ||
+ | * Pipeline stalls | ||
+ | * bubbles | ||
+ | * How to handle stalls | ||
+ | * Stall conditions | ||
+ | * Stall signals | ||
+ | * Dependences | ||
+ | * Distant between dependences | ||
+ | * Data forwarding/bypassing | ||
+ | * Maintaining the correct dataflow | ||
+ | * Different ways to design data forwarding path/logic | ||
+ | * Different techniques to handle interlockings | ||
+ | * SW based | ||
+ | * HW based | ||
+ | * Profiling | ||
+ | * Static profiling | ||
+ | * Helps from the software (compiler) | ||
+ | * Superblock optimization | ||
+ | * Analyzing basic blocks | ||
+ | * How to deal with branches? | ||
+ | * Branch prediction | ||
+ | * Delayed branching (branch delay slot) | ||
+ | * Forward control flow/backward control flow | ||
+ | * Branch prediction accuracy | ||
+ | * Profile guided code positioning | ||
+ | * Based on the profile info. position the code based on it | ||
+ | * Try to make the next sequential instruction be the next inst. to be executed | ||
+ | * Trace cache | ||
+ | * Predicate combining (combine predicate for a branch instruction) | ||
+ | * Predicated execution (control dependence becomes data dependence) | ||
+ | * Definition of basic blocks | ||
+ | * Control flow graph | ||
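
A minimal classifier for the three dependence types listed above, using an invented instruction encoding (one destination register, a list of sources):

<code python>
# Classifying data dependences between an older and a younger instruction.
# Register names and the {dst, srcs} encoding are invented.

def classify(old, new):
    deps = []
    if old["dst"] in new["srcs"]:
        deps.append("flow (RAW) - true dependence")
    if new["dst"] in old["srcs"]:
        deps.append("anti (WAR) - name dependence")
    if new["dst"] == old["dst"]:
        deps.append("output (WAW) - name dependence")
    return deps

i1 = {"dst": "r1", "srcs": ["r2", "r3"]}   # r1 <- r2 + r3
i2 = {"dst": "r2", "srcs": ["r1", "r4"]}   # r2 <- r1 + r4
print(classify(i1, i2))   # flow on r1, anti on r2
</code>

Only the flow (RAW) dependence is fundamental; the two name dependences can be removed by renaming, which comes back in Lecture 14.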
+ | |||
+ | ===== Lecture 9 (2/3 Mon.) ===== | ||
+ | * Delayed branching | ||
+ | * benefit? | ||
+ | * What does it eliminates? | ||
+ | * downside? | ||
+ | * Delayed branching in SPARC (with squashing) | ||
+ | * Backward compatibility with the delayed slot | ||
+ | * What should be filled in the delayed slot | ||
+ | * How to ensure correctness | ||
+ | * Fine-grained multithreading | ||
+ | * fetch from different threads | ||
+ | * What are the issues (what if the program doesn't have many threads) | ||
+ | * CDC 6000 | ||
+ | * Denelcor HEP | ||
+ | * No dependency checking | ||
+ | * Inst. from different thread can fill-in the bubbles | ||
+ | * Cost? | ||
+ | * Simulteneuos multithreading | ||
+ | * Branch prediction | ||
+ | * Guess what to fetch next. | ||
+ | * Misprediction penalty | ||
+ | * Need to guess the direction and target | ||
+ | * How to perform the performance analysis? | ||
+ | * Given the branch prediction accuracy and penalty cost, how to compute a cost of a branch misprediction. | ||
+ | * Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions. | ||
+ | * How many extra instructions are being fetched? | ||
+ | * What is the performance degredation? | ||
+ | * How to reduce the miss penalty? | ||
+ | * Predicting the next address (non PC+4 address) | ||
+ | * Branch target buffer (BTB) | ||
+ | * Predicting the address of the branch | ||
+ | * Global branch history - for directions | ||
+ | * Can use compiler to profile and get more info | ||
+ | * Input set dictacts the accuracy | ||
+ | * Add time to compilation | ||
+ | * Heuristics that are common and doesn't require profiling. | ||
+ | * Might be inaccurate | ||
+ | * Does not require profiling | ||
+ | * Programmer can tell the hardware (via pragmas (hints)) | ||
+ | * For example, x86 has the hint bit | ||
+ | * Dynamic branch prediction | ||
+ | * Last time predictor | ||
+ | * Two bits counter based prediction | ||
+ | * One more bit for hysteresis | ||
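
A sketch of the two-bit counter predictor: the hysteresis bit means a single anomalous outcome (e.g. a loop exit) costs one misprediction instead of the two a last-time predictor would pay. The branch outcome sequence below is invented.

<code python>
# Two-bit saturating counter: states 0-1 predict not-taken, 2-3 predict
# taken; one wrong outcome nudges the counter instead of flipping the
# prediction outright (the hysteresis bit).

class TwoBitCounter:
    def __init__(self):
        self.state = 2                        # start weakly taken

    def predict(self):
        return self.state >= 2                # True = predict taken

    def update(self, taken):
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

bp = TwoBitCounter()
outcomes = [True] * 9 + [False] + [True] * 9  # loop branch with one exit
correct = sum(bp.predict() == t or bp.update(t) for t in []) # (placeholder removed)
correct = 0
for t in outcomes:
    correct += (bp.predict() == t)
    bp.update(t)
print(f"accuracy = {correct}/{len(outcomes)}")  # 18/19: only the exit mispredicts
</code>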
+ | |||
+ | |||
+ | ===== Lecture 10 (2/5 Wed.) ===== | ||
+ | |||
+ | * Branch prediction accuracy | ||
+ | * Why are they very important? | ||
+ | * Differences between 99% accuracy and 98% accuracy | ||
+ | * Cost of a misprediction when the pipeline is veryd eep | ||
+ | * Value prediction | ||
+ | * Global branch correlation | ||
+ | * Some branches are correlated | ||
+ | * Local branch correlation | ||
+ | * Some branches can depend on the result of past branches | ||
+ | * Pattern history table | ||
+ | * Record global taken/not taken results. | ||
+ | * Cost vs. accuracy (What to record, do you record PC? Just taken/not taken info.?) | ||
+ | * One-level branch predictor | ||
+ | * What information are used | ||
+ | * Two-level branch prediction | ||
+ | * What entries do you keep in the glocal history? | ||
+ | * What entries do you keep in the local history? | ||
+ | * How many table? | ||
+ | * Cost when training a table | ||
+ | * What are the purposes of each table? | ||
+ | * Potential problems of a two-level history | ||
+ | * GShare predictor | ||
+ | * Global history predictor is hashed with the PC | ||
+ | * Store both GHP and PC in one combined information | ||
+ | * How do you use the information? Why does the XOR result still usable? | ||
+ | * Slides (page 16-18) for a good overview of one- and two-level predictors | ||
+ | * Warmup cost of the branch predictor | ||
+ | * Hybrid solution? Fast warmup is used first, then switch to the slower one. | ||
+ | * Tournament predictor (Alpha 21264) | ||
+ | * Other types of branch predictor | ||
+ | * Using machine learning? | ||
+ | * Geometric history length | ||
+ | * Look at branches far behind (but using geometric step) | ||
+ | * Predicated execution - eliminate branches | ||
+ | * What are the tradeoffs | ||
+ | * What is the block is big (can lead to execution a lot of useless work) | ||
+ | * Allows easier code optimization | ||
+ | * From the compiler PoV, predicated execution combine multiple basic blocks into one bigger basic block | ||
+ | * Reduce control dependences | ||
+ | * Need ISA support | ||
+ | * Wish branches | ||
+ | * Compiler generate both predicated and non-predicated codes | ||
+ | * HW design which one to use | ||
+ | * Use branch prediction on an easy to predict code | ||
+ | * Use predicated execution on a hard to predict code | ||
+ | * Compiler can be more aggressive in optimimzing the code | ||
+ | * What are the tradeoffs (slide# 47) | ||
+ | * Multi-path execution | ||
+ | * Execute both paths | ||
+ | * Can lead to wasted work | ||
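
A minimal gshare sketch (table size, history length, and the branch traffic are all invented): the global history register is XORed with PC bits to index a table of two-bit counters, folding branch identity and history into one index.

<code python>
# Gshare: index a pattern history table of two-bit counters with
# (global history XOR PC bits). Sizes and the branch pattern are invented.

class Gshare:
    def __init__(self, bits=10):
        self.mask = (1 << bits) - 1
        self.ghr = 0                              # global history register
        self.pht = [2] * (1 << bits)              # 2-bit counters, weakly taken

    def _index(self, pc):
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.pht[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)                       # index BEFORE shifting history
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask

bp = Gshare()
hits = 0
for trial in range(1000):                         # branch taken every 3rd time
    taken = (trial % 3 == 0)
    hits += (bp.predict(0x4000) == taken)
    bp.update(0x4000, taken)
print(f"accuracy = {hits/1000:.2f}")  # history makes the periodic pattern predictable
</code>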
+ | |||
+ | |||
+ | ===== Lecture 11 (2/12 Wed.) ===== | ||
+ | |||
+ | * Call and return prediction | ||
+ | * Direct call is easy to predict | ||
+ | * Retun is harder (indirect branches) | ||
+ | * Nested calls make return easier to predict | ||
+ | * Can use stack to predict the return | ||
+ | * Indirect branch prediction | ||
+ | * These branches have multiple targets | ||
+ | * For switch-case, virtual function calls, jump tables, interface calls | ||
+ | * BTB to predict the target address - low accuracy | ||
+ | * History based: BTB + GHR | ||
+ | * Virtual program counter prediction | ||
+ | * Complications in superscalar processors | ||
+ | * Fetch? What if multiple branches are fetched at the same time? | ||
+ | * Logic requires to ensure correctness? | ||
+ | * Multi-cycle executions (Different functional units take different number of cycles) | ||
+ | * Instructions can retire out-of-order | ||
+ | * How to deal with this case? Stall? Throw exceptions if there are problems? | ||
+ | * Exceptions and Interrupts | ||
+ | * When they are handled? | ||
+ | * Why are some interrupts should be handled right away? | ||
+ | * Precise exception | ||
+ | * arch. state should be consistent before handling the exception/interrupts | ||
+ | * Easier to debug (you see the sequential flow when the interrupt occurs) | ||
+ | * Deterministic | ||
+ | * Easier to recover from the exception | ||
+ | * Easier to restart the processes | ||
+ | * How to ensure precise exception? | ||
+ | * Tradeoffs between each method | ||
+ | * Reorder buffer | ||
+ | * Reorder results before they are visible to the arch. state | ||
+ | * Need to presearve the sequential sematic and data | ||
+ | * What are the informatinos in the ROB entry | ||
+ | * Where to get the value from (forwarding path? reorder buffer?) | ||
+ | * Extra logic to check where the youngest instructions/value is | ||
+ | * Content addressible search | ||
+ | * A lot of comparators | ||
+ | * Different ways to simplify the reorder buffer | ||
+ | * Register renaming | ||
+ | * Same register refers to independent values (lacks of registers) | ||
+ | * Where does the exception happen (after retire) | ||
+ | * History buffer | ||
+ | * Update the register file when the instruction complete. Unroll if there is an exception. | ||
+ | * Future file (commonly used, along with reorder buffer) | ||
+ | * Keep two set of register files | ||
+ | * An updated value (Speculative), called fiture file | ||
+ | * A backup value (to restore the state quickly | ||
+ | * Double the cost of the regfile, but reduce the area as you don't have to use a content addressible memory (compared to ROB alone) | ||
+ | * Branch misprediction resembles Exception | ||
+ | * The difference is that branch misprediction is not visible to the software | ||
+ | * Also much more common (say, divide by zero vs. a mispredicted branch) | ||
+ | * Recovery is similar to exception handling | ||
+ | * Latency of the state recovery | ||
+ | * What to do during the state recovery | ||
+ | * Checkpointing | ||
+ | * Advantages? | ||
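
A sketch of return prediction with a return address stack; the addresses and the call-instruction size are invented:

<code python>
# Return address stack (RAS): calls and returns nest, so pushing the
# fall-through address at each call and popping at each return predicts
# the return target even though a return is an indirect branch.

ras = []

def on_call(pc, call_size=4):
    ras.append(pc + call_size)            # push the fall-through address

def predict_return():
    return ras.pop() if ras else None     # top of stack = predicted target

on_call(0x1000)                 # main calls f
on_call(0x2000)                 # f calls g (nested)
print(hex(predict_return()))    # 0x2004 -> back into f
print(hex(predict_return()))    # 0x1004 -> back into main
</code>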
+ | | ||

===== Lecture 14 (2/19 Wed.) =====

  * Predictors (branch predictor, cache line predictor, ...)
  * Power budget (and its importance)
  * Architectural state, precise state
  * Memory dependences are known dynamically
  * Register state is not shared across threads/processors
  * Memory state is shared across threads/processors
  * How to maintain speculative memory states
  * Write buffers (help simplify the process of checking the reorder buffer)
  * Overall OoO mechanism
    * What are other ways of eliminating dispatch stalls
    * Dispatch when the sources are ready
    * Retired instructions make the sources available
    * Register renaming (see the sketch after this list)
    * Reservation stations
      * What goes into the reservation station
      * Tags required in the reservation station
  * Tomasulo's algorithm
    * Without precise exceptions, OoO is hard to debug
  * Arch. register IDs
  * Examples in the slides
    * Slide 28 --> register renaming
    * Slides 30-35 --> exercise (also on the board)
      * This will be useful for the midterm
  * Register alias table
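
A minimal register-renaming sketch with an invented register alias table (RAT): every destination gets a fresh physical tag, which removes WAR/WAW (name) dependences, while sources read the latest mapping, preserving true RAW dependences.

<code python>
# Register renaming with a register alias table (RAT). The arch/physical
# register naming scheme below is invented for illustration.

rat = {f"r{i}": f"p{i}" for i in range(4)}   # arch reg -> current physical tag
next_phys = 4

def rename(dst, srcs):
    global next_phys
    new_srcs = [rat[s] for s in srcs]        # read current mappings (keeps RAW)
    rat[dst] = f"p{next_phys}"               # fresh tag for the dest (kills WAR/WAW)
    next_phys += 1
    return rat[dst], new_srcs

# The WAW on r1 (insts 1 and 3) and the WAR on r2 disappear after renaming:
for dst, srcs in [("r1", ["r2", "r3"]), ("r2", ["r1", "r0"]), ("r1", ["r2", "r3"])]:
    d, s = rename(dst, srcs)
    print(f"{dst} <- {srcs}   renamed to   {d} <- {s}")
</code>

Tomasulo's algorithm does the same job with reservation-station tags instead of an explicit physical register file.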
+ | |||
+ | ===== Lecture 15 (2/21 Fri.) ===== | ||
+ | |||
+ | * OoO --> Restricted Dataflow | ||
+ | * Extracting parallelism | ||
+ | * What are the bottlenecks? | ||
+ | * Issue width | ||
+ | * Dispatch width | ||
+ | * Parallelism in the program | ||
+ | * More example on slide #10 | ||
+ | * What does it mean to be restricted data flow | ||
+ | * Still visible as a Von Neumann model | ||
+ | * Where does the efficiency come from? | ||
+ | * Size of the scheduling windors/reorder buffer. Tradeoffs? What make sense? | ||
+ | * Load/store handling | ||
+ | * Would like to schedule them out of order, but make them visible in-order | ||
+ | * When do you schedule the load/store instructions? | ||
+ | * Can we predict if load/store are dependent? | ||
+ | * This is one of the most complex structure of the load/store handling | ||
+ | * What information can be used to predict these load/store optimization? | ||
+ | * Note: IPC = 1/CPI | ||
+ | * Centralized vs. distributed? What are the tradeoffs? | ||
+ | * How to handle when there is a misprediction/recovery | ||
+ | * Token dataflow arch. | ||
+ | * What are tokens? | ||
+ | * How to match tokens | ||
+ | * Tagged token dataflow arch. | ||
+ | * What are the tradeoffs? | ||
+ | * Difficulties? | ||
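
A miniature token-dataflow sketch (the graph and node names are invented): a node fires as soon as all of its input tokens have arrived, with no program counter ordering anything. This is the unrestricted version of the firing rule that OoO scheduling applies inside a window.

<code python>
# Token-based dataflow in miniature: the tiny graph below computes
# (a + b) * (a - b); a node fires once all its input tokens are matched.

nodes = {
    "add": {"ins": {"a", "b"}, "op": lambda v: v["a"] + v["b"]},
    "sub": {"ins": {"a", "b"}, "op": lambda v: v["a"] - v["b"]},
    "mul": {"ins": {"add", "sub"}, "op": lambda v: v["add"] * v["sub"]},
}

tokens = {"a": 7, "b": 3}                 # initial tokens injected into the graph
fired = set()
while len(fired) < len(nodes):
    for name, node in nodes.items():      # "match" tokens against waiting nodes
        if name not in fired and node["ins"] <= tokens.keys():
            tokens[name] = node["op"]({i: tokens[i] for i in node["ins"]})
            fired.add(name)
            print(f"{name} fires -> {tokens[name]}")
print(tokens["mul"])                      # 40
</code>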
+ | |||
+ | ===== Lecture 16 (2/24 Mon.) ===== | ||
+ | |||
+ | * SISD/SIMD/MISD/MIMD | ||
+ | * Array processor | ||
+ | * Vector processor | ||
+ | * Data parallelism | ||
+ | * Where does the concurrency arise? | ||
+ | * Differences between array processor vs. vector processor | ||
+ | * VLIW | ||
+ | * Compactness of an array processor | ||
+ | * Vector operates on a vector of data (rather than a single datum (scalar)) | ||
+ | * Vector length (also applies to array processor) | ||
+ | * No dependency within a vector --> can have a deep pipeline | ||
+ | * Highly parallel (both instruction level (ILP) and memory level (MLP)) | ||
+ | * But the program needs to be very parallel | ||
+ | * Memory can be the bottleneck (due to very high MLP) | ||
+ | * What does the functional units look like? Deep pipelin and simpler control. | ||
+ | * CRAY-I is one of the examples of vector processor | ||
+ | * Memory access pattern in a vector processor | ||
+ | * How do the memory accesses benefit the memory bandwidth? | ||
+ | * Please refer to slides 73-74 in http://www.ece.cmu.edu/~ece447/s13/lib/exe/fetch.php?media=onur-447-spring13-lecture25-mainmemory-afterlecture.pdf for a breif explanation of memory level parallelism | ||
+ | * Stride length vs. the number of banks | ||
+ | * stride length should be relatively prime to the number of banks | ||
+ | * Tradeoffs between row major and column major --> How can the vector processor deals with the two | ||
+ | * How to calculate the efficiency and performance of vector processors | ||
+ | * What if there are multiple memory ports? | ||
+ | * Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism) | ||
+ | * Helps handling sparse metrices | ||
+ | * Conditional operation | ||
+ | * Structure of vector units | ||
+ | * How to automatically parallelize code through the compiler? | ||
+ | * This is a hard problem. Compiler does not know the memory address. | ||
+ | * What do we need to ensure for both vector and array processor? | ||
+ | * Sequential bottleneck | ||
+ | * Amdahl's law | ||
+ | * Intel MMX --> An example of Intel's approach to SIMD | ||
+ | * No VLEN, use OpCode to define the length | ||
+ | * Stride is one in MMX | ||
+ | * Intel SSE --> Modern version of MMX | ||
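
A quick check of the stride-vs.-banks bullet, assuming element i of a strided access goes to bank (i x stride) mod num_banks:

<code python>
# Banked memory and stride: if the stride shares a factor with the number
# of banks, accesses pile onto a few banks; coprime strides spread out.

from math import gcd

def banks_touched(stride, num_banks, n=16):
    return sorted({(i * stride) % num_banks for i in range(n)})

for stride in (1, 2, 4, 7):
    b = banks_touched(stride, num_banks=8)
    note = "conflict-free" if gcd(stride, 8) == 1 else f"only {len(b)}/8 banks"
    print(f"stride {stride}: banks {b} ({note})")
# stride 4 hits only banks {0, 4}; stride 7 (coprime with 8) hits all 8,
# which is why column-major walks over a power-of-two row size can hurt.
</code>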
+ | |||
+ | ===== Lecture 17 (2/26 Wed.) ===== | ||
+ | |||
+ | * GPU | ||
+ | * Warp/Wavefront | ||
+ | * A bunch of threads sharing the same PC | ||
+ | * SIMT | ||
+ | * Lanes | ||
+ | * FGMT + massively parallel | ||
+ | * Tolerate long latency | ||
+ | * Warp based SIMD vs. traditional SIMD | ||
+ | * SPMD (Programming model) | ||
+ | * Single program operates on multiple data | ||
+ | * can have synchronization point | ||
+ | * Many scientific applications are programmed in this manner | ||
+ | * Control flow problem (branch divergence) | ||
+ | * Masking (in a branch, mask threads that should not execute that path) | ||
+ | * Lower SIMD efficiency | ||
+ | * What if you have layers of branches? | ||
+ | * Dynamic wrap formation | ||
+ | * Combining threads from different warps to increase SIMD utilization | ||
+ | * This can cause memory divergence | ||
+ | * VLIW | ||
+ | * Wide fetch | ||
+ | * IA-64 | ||
+ | * Tradeoffs | ||
+ | * Simple hardware (no dynamic scheduling, no dependency checking within VLIW) | ||
+ | * A lot of loads at the compiler level | ||
+ | * Decoupled access/execute | ||
+ | * Limited form of OoO | ||
+ | * Tradeoffs | ||
+ | * How to street the instruction (determine dependency/stalling)? | ||
+ | * Instruction scheduling techniques (static vs. dynamic) | ||
+ | * Systoric arrays | ||
+ | * Processing elements transform data in chains | ||
+ | * Develop for image processing (for example, convolution) | ||
+ | * Stage processing | ||
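
A sketch of divergence masking in one warp (lane data and the two paths are invented): both sides of the branch are serialized, each under a lane mask, which is exactly where SIMD efficiency is lost.

<code python>
# SIMT branch divergence in miniature: one warp of 8 lanes shares a PC;
# at a branch, a mask selects which lanes execute each path.

data = [5, -2, 7, -1, 0, 3, -8, 6]           # one element per lane (invented)
mask_then = [x < 0 for x in data]            # lanes taking the "then" path
mask_else = [not m for m in mask_then]

out = list(data)
# Pass 1: "then" path (negate) only for masked-on lanes.
out = [-x if m else x for x, m in zip(out, mask_then)]
# Pass 2: "else" path (double) for the complementary lanes.
out = [2 * x if m else x for x, m in zip(out, mask_else)]
print(out)                                   # negatives flipped, the rest doubled

useful = sum(mask_then) + sum(mask_else)     # 8 useful lane-operations
issued = 2 * len(data)                       # 16 lane slots over two serialized passes
print(f"SIMD utilization across the divergent region: {useful/issued:.0%}")  # 50%
</code>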
+ | |||
+ | ===== Lecture 18 (2/28 Fri.) ===== | ||
+ | |||
+ | * Tradeoffs of VLIW | ||
+ | * Why does VLIW required static instruction scheduling | ||
+ | * Whose job it is? | ||
+ | * Compiler can rearrange basic blocks/instruction | ||
+ | * Basic block | ||
+ | * Benefits of having large basic block | ||
+ | * Entry/Exit | ||
+ | * Handling entries/exits | ||
+ | * Trace cache | ||
+ | * How to ensure correctness? | ||
+ | * Profiling | ||
+ | * Fixing up the instruction order to ensure correctness | ||
+ | * Dealing with multiple entries into the block | ||
+ | * Dealing with multiple exits into the block | ||
+ | * Super block | ||
+ | * How to form super blocks? | ||
+ | * Benefit of super block | ||
+ | * Tradeoff between not forming a super block and forming a super block | ||
+ | * Ambiguous branch (after profiling, both taken/not taken are equally likely) | ||
+ | * Cleaning up | ||
+ | * What scenario would make trace cache/superblock/profiling less effective? | ||
+ | * List scheduling | ||
+ | * Help figuring out which instructions VLIW should fetch | ||
+ | * Try to maximize instruction throughput | ||
+ | * How to assign priorities | ||
+ | * What if some instructions take longer than others | ||
+ | * Block structured ISA (BS-ISA) | ||
+ | * Problems with trace scheduling? | ||
+ | * What type of program will benefit from BS-ISA | ||
+ | * How to form blocks in BS-ISA? | ||
+ | * Combining basic blocks | ||
+ | * multiples of merged basic blocks | ||
+ | * How to deal with entries/exits in BS-ISA? | ||
+ | * undo the executed instructions from the entry point, then fetch the new block | ||
+ | * Advantages over trace cache | ||
+ | * Benefit of VLIW + Static instruction scheduling | ||
+ | * Intel IA-64 | ||
+ | * Static instruction scheduling and VLIW | ||
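
A small list-scheduling sketch, assuming an invented dependence DAG, invented latencies, and a 2-wide issue: priority is the longest latency path to the end of the block, and each cycle the highest-priority ready operations are issued.

<code python>
# List scheduling sketch: ops become ready when all predecessors have
# finished; ready ops are issued highest-priority-first onto `width` slots.

deps = {"a": [], "b": [], "c": ["a", "b"], "d": ["a"], "e": ["c", "d"]}
latency = {"a": 2, "b": 1, "c": 1, "d": 1, "e": 1}
succs = {n: [m for m in deps if n in deps[m]] for n in deps}

def priority(n):                      # critical-path length from n to the end
    return latency[n] + max((priority(s) for s in succs[n]), default=0)

done, finish, cycle, width = set(), {}, 0, 2
while len(done) < len(deps):
    ready = [n for n in deps if n not in done
             and all(p in done and finish[p] <= cycle for p in deps[n])]
    for n in sorted(ready, key=priority, reverse=True)[:width]:
        done.add(n)
        finish[n] = cycle + latency[n]
        print(f"cycle {cycle}: issue {n} (priority {priority(n)})")
    cycle += 1
print(f"finished at cycle {max(finish.values())}")
</code>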
+ | |||
+ | ===== Lecture 19 (3/19 Wed.) ===== | ||
+ | |||
+ | * Ideal cache | ||
+ | * More capacity | ||
+ | * Fast | ||
+ | * Cheap | ||
+ | * High bandwidth | ||
+ | * DRAM cell | ||
+ | * Cheap | ||
+ | * Sense the purturbation through sense amplifier | ||
+ | * Slow and leaky | ||
+ | * SRAM cell (Cross coupled inverter) | ||
+ | * Expensice | ||
+ | * Fast (easier to sense the value in the cell) | ||
+ | * Memory bank | ||
+ | * Read access sequence | ||
+ | * DRAM: Activate -> Read -> Precharge (if needed) | ||
+ | * What dominate the access laatency for DRAM and SRAM | ||
+ | * Scaling issue | ||
+ | * Hard to scale the scale to be small | ||
+ | * Memory hierarchy | ||
+ | * Prefetching | ||
+ | * Caching | ||
+ | * Spatial and temporal locality | ||
+ | * Cache can exploit these | ||
+ | * Recently used data is likely to be accessed | ||
+ | * Nearby data is likely to be accessed | ||
+ | * Caching in a pipeline design | ||
+ | * Cache management | ||
+ | * Manual | ||
+ | * Data movement is managed manually | ||
+ | * Embedded processor | ||
+ | * GPU scratchpad | ||
+ | * Automatic | ||
+ | * HW manage data movements | ||
+ | * Latency analysis | ||
+ | * Based on the hit and miss status, next level access time (if miss), and the current level access time | ||
+ | * Cache basics | ||
+ | * Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc. | ||
+ | * Cache access | ||
+ | * How to access tag and data (in parallel vs serially) | ||
+ | * How do tag and index get used? | ||
+ | * Modern processors perform serial access for higher level cache (L3 for example) to save power | ||
+ | * Cost and benefit of having more associativity | ||
+ | * Given the associativity, which block should be replace if it is full | ||
+ | * Replacement poligy | ||
+ | * Random | ||
+ | * Least recently used (LRU) | ||
+ | * Least frequently used | ||
+ | * Least costly to refetch | ||
+ | * etc. | ||
+ | * How to implement LRU | ||
+ | * How to keep track of access ordering | ||
+ | * Complexity increases rapidly | ||
+ | * Approximate LRU | ||
+ | * Victim and next Victim policy | ||
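
A sketch of how the tag and index get used, assuming an invented geometry (32 KB, 4-way, 64 B blocks, so 128 sets, 6 offset bits, and 7 index bits):

<code python>
# Splitting an address into tag / index / offset for a set-associative
# cache. Geometry is invented: 32 KB, 4-way, 64 B blocks.

CACHE_BYTES, WAYS, BLOCK = 32 * 1024, 4, 64
SETS = CACHE_BYTES // (WAYS * BLOCK)          # 128 sets
OFFSET_BITS = BLOCK.bit_length() - 1          # 6 (block is a power of two)
INDEX_BITS = SETS.bit_length() - 1            # 7

def split(addr):
    offset = addr & (BLOCK - 1)
    index = (addr >> OFFSET_BITS) & (SETS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split(0x12345678)
print(f"tag={tag:#x} set={index} offset={offset}")
# Two addresses with the same index but different tags compete for the
# same set - the source of conflict misses.
</code>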
+ | |||
+ | ===== Lecture 20 (3/21 Fri.) ===== | ||
+ | |||
+ | * Set thrashing | ||
+ | * Working set is bigger than the associativity | ||
+ | * Belady's OPT | ||
+ | * Is this optimal? | ||
+ | * Complexity? | ||
+ | * Similarity between cache and page table | ||
+ | * Number of blocks vs pages | ||
+ | * Time to find the block/page to replace | ||
+ | * Handling writes | ||
+ | * Write through | ||
+ | * Need a modified bit to make sure accesses to data got the updated data | ||
+ | * Write back | ||
+ | * Simpler, no consistency issues | ||
+ | * Sectored cache | ||
+ | * Use subblock | ||
+ | * lower bandwidth | ||
+ | * more complex | ||
+ | * Instruction vs data cache | ||
+ | * Where to place instructions | ||
+ | * Unified vs. separated | ||
+ | * In the first level cache | ||
+ | * Cache access | ||
+ | * First level access | ||
+ | * Second level access | ||
+ | * When to start the second level access | ||
+ | * Performance vs. energy | ||
+ | * Address translation | ||
+ | * Homonym and Synonyms | ||
+ | * Homonym: Same VA but maps to different PA | ||
+ | * With multiple processes | ||
+ | * Synonyms: Multiple VAs map to the same PA | ||
+ | * Shared libraries, shared data, copy-on-write | ||
+ | * I/O | ||
+ | * Can these create problems when we have the cache | ||
+ | * How to eliminate these problems? | ||
+ | * Page coloring | ||
+ | * Interaction between cache and TLB | ||
+ | * Virtually indexed vs. physically indexed | ||
+ | * Virtually tagged vs. physically tagged | ||
+ | * Virtually indexed physically tagged | ||
+ | * Virtual memory in DRAM | ||
+ | * Control where data is mapped to in channel/rank/bank | ||
+ | * More parallelism | ||
+ | * Reduce interference | ||
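
The VIPT bullet hides a sizing constraint worth making concrete: the index and block-offset bits must come from the untranslated page-offset bits, so each way can be at most one page. A quick calculation assuming 4 KB pages:

<code python>
# Virtually-indexed, physically-tagged (VIPT) lookup overlaps TLB access
# with the set-index read. That works only if index + offset bits fit in
# the page offset, capping the size of one way at the page size.

PAGE = 4096                               # assumed 4 KB pages

for ways in (1, 2, 4, 8, 16):
    max_bytes = ways * PAGE               # sets * block <= PAGE per way
    print(f"{ways:2d}-way: largest alias-free VIPT cache = {max_bytes // 1024} KB")
# E.g. a 32 KB L1 with 4 KB pages needs at least 8 ways; otherwise you
# must handle synonyms some other way, such as page coloring.
</code>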
+ | |||
+ | ===== Lecture 21 (3/24 Mon.) ===== | ||
+ | |||
+ | |||
+ | |||
+ | * Different parameters that affect cache miss | ||
+ | * Thrashing | ||
+ | * Different types of cache misses | ||
+ | * Compulsory misses | ||
+ | * Can mitigate with prefetches | ||
+ | * Capacity misses | ||
+ | * More assoc | ||
+ | * Victim cache | ||
+ | * Conflict misses | ||
+ | * Hashing | ||
+ | * Large block vs. small block | ||
+ | * Subblocks | ||
+ | * Victim cache | ||
+ | * Small, but fully assoc. cache behind the actual cache | ||
+ | * Cached misses cache block | ||
+ | * Prevent ping-ponging | ||
+ | * Pseudo associtivity | ||
+ | * Simpler way to implement associative cache | ||
+ | * Skewed assoc. cache | ||
+ | * Different hashing functions for each way | ||
+ | * Restructure data access pattern | ||
+ | * Order of loop traversal | ||
+ | * Blocking | ||
+ | * Memory level parallelism | ||
+ | * Cost per miss of a parallel cache miss is less costly compared to serial misses | ||
+ | * MSHR | ||
+ | * Keep track of pending cache | ||
+ | * Think of this as the load/store buffer-ish for cache | ||
+ | * What information goes into the MSHR? | ||
+ | * When do you access the MSHR? |
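
A sketch of the blocking bullet, with an invented matrix and tile size: the blocked traversal computes the same column sums while keeping accesses inside one tile at a time, which is the cache-friendly part.

<code python>
# Blocking (tiling): restructure loop traversal so the working set of the
# inner loops fits in the cache. Column sums over a row-major matrix:
# the naive order makes stride-N accesses; the blocked order stays in a tile.

N, TILE = 8, 4
A = [[i * N + j for j in range(N)] for i in range(N)]

# Naive: for each column, touch one element per row -> stride-N accesses.
naive = [sum(A[i][j] for i in range(N)) for j in range(N)]

# Blocked: process TILE x TILE sub-matrices before moving on.
blocked = [0] * N
for ii in range(0, N, TILE):
    for jj in range(0, N, TILE):
        for i in range(ii, min(ii + TILE, N)):
            for j in range(jj, min(jj + TILE, N)):
                blocked[j] += A[i][j]

print(naive == blocked)   # True: same result, cache-friendlier traversal
</code>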