Buzzwords

Buzzwords are terms that are mentioned during lecture which are particularly important to understand thoroughly. This page tracks the buzzwords for each of the lectures and can be used as a reference for finding gaps in your understanding of course material.

Lecture 1 (1/12 Mon.)

Level of transformation
- Algorithm
- System software
- Compiler
Cross abstraction layers
Tradeoffs
Caches
DRAM/memory controller
DRAM banks
Row buffer hit/miss
Row buffer locality
Unfairness
Memory performance hog
Shared DRAM memory system
Streaming access vs. random access
Memory scheduling policies
Scheduling priority
Retention time of DRAM
Process variation
Retention time profile
Power consumption
Bloom filter
Hamming code
Hamming distance
DRAM row hammer

Lecture 2 (1/14 Wed.)

Moore's Law
Algorithm –> step-by-step procedure to solve a problem
in-order execution
out-of-order execution
technologies that are available on cellphones
new applications that are made available through new computer architecture techniques
- more data mining (genomics/medical areas)
lower power (cellphones)
smaller cores (cellphones/computers)
etc.
Performance bottlenecks in single-thread/single-core processors
- multi-core as an alternative
Memory wall (a part of scaling issue)
Scaling issue
- Transistors are getting smaller
Key components of a computer
Design points
- Design processors to meet the design points
Software stack
Design decisions
Datacenters
Reliability problems that cause errors
Analogies from Kuhn's “The Structure of Scientific Revolutions” (Recommended book)
- Pre-paradigm science
- Normal science
- Revolutionary science
Components of a computer
- Computation
- Communication
- Storage
  - DRAM
  - NVRAM (Non-volatile memory): PCM, STT-MRAM
  - Storage (Flash/Harddrive)
Von Neumann Model (Control flow model)
- Stored program computer
  - Properties of Von Neumann Model: Stored program, sequential instruction processing
  - Unified memory
    - When is a value in memory interpreted as an instruction (as opposed to a datum)?
  - Program counter
  - Examples: x86, ARM, Alpha, IBM Power series, SPARC, MIPS
Data flow model
- Data flow machine
  - Data flow graph
- Operands
- Live-outs/Live-ins
  - Different types of data flow nodes (conditional/relational/barrier)
- How to do transactions in dataflow?
  - Example: bank transactions
Tradeoffs between control-driven and data-driven
- Which is easier to program?
- Which is easier to compile?
- Which is more parallel (does that mean it is faster?)
- Which machines are more complex to design?
- In control flow, when a program is stopped, there is a pointer to the current state (precise state).
ISA vs. Microarchitecture
- Semantics in the ISA
  - uArch should obey the ISA
  - Changing ISA is costly, can affect compatibility.
Instruction pointers
uArch techniques: common and powerful techniques break the Von Neumann model if done at the ISA level
- Conceptual techniques
  - Pipelining
  - Multiple instructions at a time
  - Out-of-order executions
  - etc.
- Design techniques
  - Adder implementation (Bit serial, ripple carry, carry lookahead)
  - Connection machine (an example of a machine that uses bit-serial processing to trade off latency for more parallelism)
Microprocessor: ISA + uArch + circuits
What is part of the ISA? Instructions, memory, etc.
- Things that are visible to the programmer/software
What is not part of the ISA? (what goes inside: uArch techniques)
- Things that are not supposed to be visible to the programmer/software but typically make the processor faster, lower power, and/or less complex

Lecture 3 (1/16 Fri.)

Microarchitecture
Three major tradeoffs of computer architecture
Macro-architecture
LC-3b ISA
Unused instructions
Bit steering
Instruction processing style
0,1,2,3 address machines
Stack machine
Accumulator machine
2-operand machine
3-operand machine
Tradeoffs between 0,1,2,3 address machines
Postfix notation
Instructions/Opcode/Operand specifiers (i.e. addressing modes)
Simple vs. complex data types (and their tradeoffs)
Semantic gap and level
Translation layer
Addressability
Byte/bit addressable machines
Virtual memory
Big/little endian
Benefits of having registers (data locality)
Programmer visible (Architectural) state
- Programmers can access this directly
- What are the benefits?
Microarchitectural state
- Programmers cannot access this directly
Evolution of registers (from accumulators to registers)
Different types of instructions
- Control instructions
- Data instructions
- Operation instructions
Addressing modes
- Tradeoffs (complexity, flexibility, etc.)
Orthogonal ISA
- Addressing modes that are orthogonal to instruction types
I/O devices
Vectored vs. non-vectored interrupts
Complex vs. simple instructions
- Tradeoffs
RISC vs. CISC
- Tradeoffs
- Backward compatibility
- Performance
- Optimization opportunity
- Translation

Lecture 4 (1/21 Wed.)

Fixed vs. variable length instruction
Huffman encoding
Uniform vs. non-uniform decode
Registers
- Tradeoffs between number of registers
Alignments
- How does MIPS load words across the alignment boundary?

Lecture 5 (1/26 Mon.)

Tradeoffs in ISA: Instruction length
- Uniform vs. non-uniform
Design point/Use cases
- What dictates the design point?
Architectural states
uArch
- How to implement the ISA in the uArch
Different stages in the uArch
Clock cycles
Multi-cycle machine
Datapath and control logic
- Control signals
Execution time of instructions/program
- Metrics and what they mean
Instruction processing
- Fetch
- Decode
- Execute
- Memory fetch
- Writeback
Encoding and semantics
Different types of instructions (I-type, R-type, etc.)
Control flow instructions
Non-control flow instructions
Delay slot/Delayed branch
Single cycle control logic
Lockstep
Critical path analysis
- Critical path of a single cycle processor
What is in the control signals?
- Combinational logic & Sequential logic
Control store
Tradeoffs of a single cycle uarch
Design principles
- Common case design
- Critical path design
- Balanced designs
- Dynamic power/Static power
  - Increases in power due to frequency

Lecture 6 (1/28 Wed.)

Design principles
- Common case design
- Critical path design
- Balanced designs
Multi cycle design
Microcoded/Microprogrammed machines
- States
- Transition from one state to another
- Microinstructions
- Microsequencing
- Control store - Produces control signals
- Microsequencer
- Control signal
  - What do they have to control?
Instruction processing cycle
Latch signals
State machine
State variables
Condition code
Steering bits
Branch enable logic
Difference between gating and loading? (write enable vs. driving the bus)
Memory mapped I/O
Hardwired logic
- What control signals come from hardwired logic?
Variable latency memory
Handling interrupts
Difference between interrupts and exceptions
Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
Updating machine behavior
Horizontal microcode
Vertical microcode
Primitives

Lecture 7 (1/30 Fri.)

Emulator (i.e. uCode allows a minimal datapath to emulate the ISA)
Updating machine behavior
Horizontal microcode
Vertical microcode
Primitives
nanocode and millicode
- what are the differences between nano/milli/microcode
microprogrammed vs. hardwire control
Pipelining
Limitations of the multi-cycle design
- Idle resources
Throughput of a pipelined design
- What dictates the throughput of a pipelined design?
Latency of the pipelined design
Dependency
Overhead of pipelining
- Latch cost?
Data forwarding/bypassing
What is the ideal pipeline?
External fragmentation
Issues in pipeline designs
- Stalling
  - Dependency (Hazard)
    - Flow dependence
    - Output dependence
    - Anti dependence
    - How to handle them?
- Resource contention
- Keeping the pipeline full
- Handling exception/interrupts
- Pipeline flush
- Speculation
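
The three dependence types listed under stalling can be told apart from the registers two instructions read and write. A minimal Python sketch, assuming each instruction is described by a hypothetical (dest, sources) pair; the name `classify_dependences` is illustrative and not from the lecture:

```python
def classify_dependences(first, second):
    """Return the dependences from `first` (older) to `second` (younger).

    Each instruction is a (dest, sources) pair: dest is a register name or
    None, sources is a tuple of register names. This encoding is an
    assumption made for illustration.
    """
    first_dest, first_srcs = first
    second_dest, second_srcs = second
    found = set()
    if first_dest is not None and first_dest in second_srcs:
        found.add("flow")    # read-after-write: second reads what first writes
    if first_dest is not None and first_dest == second_dest:
        found.add("output")  # write-after-write: both write the same register
    if second_dest is not None and second_dest in first_srcs:
        found.add("anti")    # write-after-read: second overwrites a source of first
    return found


# r3 <- r1 + r2 followed by r5 <- r3 + r4
print(classify_dependences(("r3", ("r1", "r2")), ("r5", ("r3", "r4"))))
```

Only the flow dependence carries a value between the instructions; output and anti dependences come from reusing a register name.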

Lecture 8 (2/2 Mon.)

Interlocking
Multipath execution
Fine grain multithreading
No-op (Bubbles in the pipeline)
Valid bits in the instructions
Branch prediction
Different types of data dependence
Pipeline stalls
- bubbles
- How to handle stalls
- Stall conditions
- Stall signals
- Dependences
  - Distance between dependences
- Data forwarding/bypassing
- Maintaining the correct dataflow
Different ways to design data forwarding path/logic
Different techniques to handle interlocking
- SW based
- HW based
Profiling
- Static profiling
- Help from the software (compiler)
  - Superblock optimization
  - Analyzing basic blocks
How to deal with branches?
- Branch prediction
- Delayed branching (branch delay slot)
- Forward control flow/backward control flow
- Branch prediction accuracy
Profile guided code positioning
- Position the code based on the profile info
- Try to make the next sequential instruction be the next instruction to be executed
Predicate combining (combine predicate for a branch instruction)
Predicated execution (control dependence becomes data dependence)

Lecture 9 (2/4 Wed.)

Definition of basic blocks
Control flow graph
Delayed branching
- benefit?
- What does it eliminate?
- downside?
- Delayed branching in SPARC (with squashing)
- Backward compatibility with the delay slot
- What should be filled in the delay slot
- How to ensure correctness
Fine-grained multithreading
- fetch from different threads
- What are the issues (what if the program doesn't have many threads)
- CDC 6000
- Denelcor HEP
- No dependency checking
- Instructions from different threads can fill in the bubbles
- Cost?
Simultaneous multithreading
Branch prediction
- Guess what to fetch next.
- Misprediction penalty
- Need to guess the direction and target
- How to perform the performance analysis?
  - Given the branch prediction accuracy and penalty cost, how to compute a cost of a branch misprediction.
  - Given the program/number of instructions, percent of branches, branch prediction accuracy and penalty cost, how to compute a cost coming from branch mispredictions.
    - How many extra instructions are being fetched?
    - What is the performance degradation?
- How to reduce the miss penalty?
- Predicting the next address (non PC+4 address)
- Branch target buffer (BTB)
  - Predicting the address of the branch
- Global branch history - for directions
- Can use compiler to profile and get more info
  - Input set dictates the accuracy
  - Add time to compilation
- Heuristics that are common and don't require profiling.
  - Might be inaccurate
- Static branch prediction
  - Programmer provides pragmas, hinting the likelihood of taken/not taken branch
  - For example, x86 has the hint bit
- Dynamic branch prediction
  - Last time predictor
  - Two-bit counter based prediction
    - One more bit for hysteresis
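
The two-bit counter above can be sketched as a saturating counter. This is a minimal Python illustration; the class and function names, and the choice to start in the weakly-not-taken state, are assumptions rather than anything fixed by the lecture:

```python
class TwoBitCounter:
    """Saturating counter: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state=1):
        self.state = state  # 1 = weakly not taken (arbitrary starting point)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, counter):
    """Run the predictor over a list of booleans (True = taken)."""
    misses = 0
    for taken in outcomes:
        if counter.predict() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch taken three times and then not taken, executed three times over.
loop_branch = [True, True, True, False] * 3
print(count_mispredictions(loop_branch, TwoBitCounter()))
```

The extra bit means a single loop exit does not flip the prediction, so re-entering the loop is still predicted taken.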

Lecture 10 (2/6 Fri.)

Branch prediction accuracy
- Why are they very important?
  - Differences between 99% accuracy and 98% accuracy
  - Cost of a misprediction when the pipeline is very deep
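
To see why one percentage point of accuracy matters, a small Python sketch of the usual stall-cycle estimate. It assumes every misprediction costs a fixed number of cycles and that nothing else stalls; the parameter names and the example numbers (20% branches, 20-cycle penalty) are illustrative assumptions:

```python
def cpi_with_branches(base_cpi, branch_fraction, accuracy, penalty_cycles):
    """CPI = base CPI + (branches per instruction) * (misprediction rate) * penalty."""
    return base_cpi + branch_fraction * (1.0 - accuracy) * penalty_cycles


for accuracy in (0.99, 0.98):
    print(accuracy, cpi_with_branches(1.0, 0.20, accuracy, 20))
```

Going from 99% to 98% doubles the misprediction rate, so the stall cycles added by branches double as well, and the effect grows with the penalty of a deeper pipeline.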
Global branch correlation
- Some branches are correlated
Local branch correlation
- Some branches can depend on the result of past branches
Pattern history table
- Record global taken/not taken results.
- Cost vs. accuracy (What to record, do you record PC? Just taken/not taken info.?)
One-level branch predictor
- What information is used
Two-level branch prediction
- What entries do you keep in the global history?
- What entries do you keep in the local history?
- How many tables?
- Cost when training a table
- What are the purposes of each table?
- Potential problems of a two-level history
GShare predictor
- Global history register (GHR) is hashed with the PC
- Store both GHR and PC in one combined piece of information
- How do you use the information? Why is the XOR result still usable?
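
A minimal Python sketch of a gshare-style predictor, assuming a table of two-bit counters indexed by the XOR of the PC and the global history. The table size, the `pc >> 2` word alignment, and all names are assumptions for illustration:

```python
class Gshare:
    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.ghr = 0                           # global history, newest outcome in bit 0
        self.table = [1] * (1 << index_bits)   # two-bit counters, weakly not taken

    def _index(self, pc):
        # XOR folds PC and history into one index. The same (PC, history) pair
        # always maps to the same counter; different pairs may alias.
        return ((pc >> 2) ^ self.ghr) & self.mask

    def predict(self, pc):
        return self.table[self._index(pc)] >= 2

    def update(self, pc, taken):
        i = self._index(pc)
        if taken:
            self.table[i] = min(3, self.table[i] + 1)
        else:
            self.table[i] = max(0, self.table[i] - 1)
        # Shift the outcome into the history register.
        self.ghr = ((self.ghr << 1) | int(taken)) & self.mask
```

The XOR result stays usable because the predictor only needs a repeatable index, not the original PC or history; the price is that unrelated branches can collide on the same counter.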
Warmup cost of the branch predictor
- Hybrid solution? Fast warmup is used first, then switch to the slower one.
Tournament predictor (Alpha 21264)
Predicated execution - eliminate branches
- What are the tradeoffs
- What if the block is big (can lead to executing a lot of useless work)
- Allows easier code optimization
  - From the compiler PoV, predicated execution combines multiple basic blocks into one bigger basic block
  - Reduce control dependences
- Need ISA support
Wish branches
- Compiler generates both predicated and non-predicated code
- HW decides which one to use
  - Use branch prediction on easy-to-predict code
  - Use predicated execution on hard-to-predict code
  - Compiler can be more aggressive in optimizing the code
- What are the tradeoffs (slide# 47)
Multi-path execution
- Execute both paths
- Can lead to wasted work
- VLIW
- Superscalar

Lecture 11 (2/11 Wed.)

Geometric GHR length for branch prediction
Perceptron branch predictor
Multi-cycle executions (Different functional units take different number of cycles)
- Instructions can retire out-of-order
  - How to deal with this case? Stall? Throw exceptions if there are problems?
Exceptions and Interrupts
- When are they handled?
- Why should some interrupts be handled right away?
Precise exception
- arch. state should be consistent before handling the exception/interrupts
  - Easier to debug (you see the sequential flow when the interrupt occurs)
    - Deterministic
  - Easier to recover from the exception
  - Easier to restart the processes
- How to ensure precise exception?
- Tradeoffs between each method
Reorder buffer
- Reorder results before they are visible to the arch. state
  - Need to preserve the sequential semantic and data
- What information is in a ROB entry?
- Where to get the value from (forwarding path? reorder buffer?)
  - Extra logic to check where the youngest instruction/value is
  - Content addressable search (CAM)
    - A lot of comparators
- Different ways to simplify the reorder buffer
- Register renaming
  - Same register refers to independent values (lack of registers)
- Where does the exception happen (after retire)
History buffer
- Update the register file when the instruction completes. Unroll if there is an exception.
Future file (commonly used, along with reorder buffer)
- Keep two register files
  - An updated value (speculative), called the future file
  - A backup value (to restore the state quickly)
- Doubles the cost of the regfile, but reduces the area as you don't have to use a content addressable memory (compared to ROB alone)
Branch misprediction resembles Exception
- The difference is that branch misprediction is not visible to the software
  - Also much more common (say, divide by zero vs. a mispredicted branch)
- Recovery is similar to exception handling
Latency of the state recovery
What to do during the state recovery
Checkpointing
- Advantages?

Lecture 12 (2/13 Fri.)

Renaming
Register renaming table
Predictor (branch predictor, cache line predictor …)
Power budget (and its importance)
Architectural state, precise state
Memory dependence is known dynamically
Register state is not shared across threads/processors
Memory state is shared across threads/processors
How to maintain speculative memory states
Write buffers (helps simplify the process of checking the reorder buffer)
Overall OoO mechanism
- What are other ways of eliminating dispatch stalls
- Dispatch when the sources are ready
- Retired instructions make the source available
- Register renaming
- Reservation station
  - What goes into the reservation station
  - Tags required in the reservation station
- Tomasulo's algorithm
- Without precise exception, OoO is hard to debug
- Arch. register ID
- Examples in the slides
  - Slides 28 –> register renaming
  - Slides 30-35 –> Exercise (also on the board)
    - This will be useful for the midterm
- Register aliasing table
- Broadcasting tags
- Using dataflow

Lecture 13 (2/16 Mon.)

OoO –> Restricted Dataflow
- Extracting parallelism
- What are the bottlenecks?
  - Issue width
  - Dispatch width
  - Parallelism in the program
- What does it mean to be restricted data flow
  - Still visible as a Von Neumann model
- Where does the efficiency come from?
- Size of the scheduling window/reorder buffer. Tradeoffs? What makes sense?
Load/store handling
- Would like to schedule them out of order, but make them visible in-order
- When do you schedule the load/store instructions?
- Can we predict if load/store are dependent?
- This is one of the most complex structures in load/store handling
- What information can be used to predict load/store dependences?
Centralized vs. distributed? What are the tradeoffs?
How to handle when there is a misprediction/recovery
- OoO + branch prediction?
- Speculatively update the history register
  - When do you update the GHR?
Token dataflow arch.
- What are tokens?
- How to match tokens
- Tagged token dataflow arch.
- What are the tradeoffs?
- Difficulties?

Lecture 14 (2/18 Wed.)

SISD/SIMD/MISD/MIMD
Array processor
Vector processor
Data parallelism
- Where does the concurrency arise?
Differences between array processor vs. vector processor
VLIW
Compactness of an array processor
Vector operates on a vector of data (rather than a single datum (scalar))
- Vector length (also applies to array processor)
- No dependency within a vector –> can have a deep pipeline
- Highly parallel (both instruction level (ILP) and memory level (MLP))
- But the program needs to be very parallel
- Memory can be the bottleneck (due to very high MLP)
- What do the functional units look like? Deep pipeline and simpler control.
- CRAY-1 is one of the examples of a vector processor
- Memory access pattern in a vector processor
  - How do the memory accesses benefit the memory bandwidth?
  - Memory level parallelism
  - Stride length vs. the number of banks
    - stride length should be relatively prime to the number of banks
  - Tradeoffs between row major and column major –> How can the vector processor deal with the two?
- How to calculate the efficiency and performance of vector processors
- What if there are multiple memory ports?
- Gather/Scatter allows vector processor to be a lot more programmable (i.e. gather data for parallelism)
  - Helps handling sparse matrices
- Conditional operation
- Structure of vector units
- How to automatically parallelize code through the compiler?
  - This is a hard problem. Compiler does not know the memory address.
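
On the stride vs. number-of-banks item above: a minimal Python sketch, assuming word-interleaved banks (bank = word address mod number of banks) and a base address of 0. `banks_touched` is a hypothetical helper name. It counts how many distinct banks a strided access stream reaches:

```python
from math import gcd


def banks_touched(stride, num_banks, num_accesses):
    """Distinct banks hit by addresses 0, stride, 2*stride, ..."""
    return len({(i * stride) % num_banks for i in range(num_accesses)})


for stride in (1, 2, 3, 8):
    # With enough accesses, the count equals num_banks // gcd(stride, num_banks).
    print(stride, banks_touched(stride, 8, 64), 8 // gcd(stride, 8))
```

A stride that shares a factor with the bank count keeps returning to a subset of the banks, which serializes accesses; a relatively prime stride spreads them over all banks.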
What do we need to ensure for both vector and array processor?
Sequential bottleneck
- Amdahl's law
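
Amdahl's law as a short Python function. Here `parallel_fraction` is the fraction of the original execution time that can be sped up and `n` is the speedup applied to that fraction (for example, the number of processors); both names are illustrative:

```python
def amdahl_speedup(parallel_fraction, n):
    """Overall speedup = 1 / ((1 - f) + f / n)."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / n)


for n in (2, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.9, n), 2))
```

With 90% of the program parallelizable, the speedup can never reach 10 no matter how large `n` gets; the sequential 10% is the bottleneck.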
Intel MMX –> An example of Intel's approach to SIMD
- No VLEN, use OpCode to define the length
- Stride is one in MMX
- Intel SSE –> Modern version of MMX

Lecture 15 (2/20 Fri.)

GPU
- Warp/Wavefront
  - A bunch of threads sharing the same PC
- SIMT
- Lanes
- FGMT + massively parallel
  - Tolerate long latency
- Warp based SIMD vs. traditional SIMD
SPMD (Programming model)
- Single program operates on multiple data
  - can have synchronization point
- Many scientific applications are programmed in this manner
Control flow problem (branch divergence)
- Masking (in a branch, mask threads that should not execute that path)
- Lower SIMD efficiency
- What if you have layers of branches?
Dynamic warp formation
- Combining threads from different warps to increase SIMD utilization
- This can cause memory divergence
VLIW
- Wide fetch
- IA-64
- Tradeoffs
  - Simple hardware (no dynamic scheduling, no dependency checking within VLIW)
  - A lot of burden on the compiler
Decoupled access/execute
- Limited form of OoO
- Tradeoffs
- How to steer the instruction (determine dependency/stalling)?
- Instruction scheduling techniques (static vs. dynamic)
Systolic arrays
- Processing elements transform data in chains
- Developed for image processing (for example, convolution)
Stage processing

Lecture 16 (2/23 Mon.)

Systolic arrays
- Processing elements transform data in chains
- Can be arrays of multi-dimensional processing elements
- Developed for image processing (for example, convolution)
- Can be used to break stages in pipeline programs, using a set of queues and processing elements
- Can enable high concurrency and good for regular programs
- Very special purpose
- The warp computer
Static instruction scheduling
- How do we find the next instruction to execute?
Live-in and live-out
Basic blocks
- Rearranging instructions in the basic block
- Code movement from one basic block to another
Straight line code
Independent instructions
- How to identify independent instructions
Atomicity
Trace scheduling
- Side entrance
- Fix-up code
- How scheduling is done
Instruction scheduling
- Prioritization heuristics
Superblock
- Traces with no side-entrance
Hyperblock
BS-ISA
Tradeoffs between trace cache/Hyperblock/Superblock/BS-ISA

Lecture 17 (2/25 Wed.)

IA-64
- EPIC
IA-64 instruction bundle
- Multiple instructions in the bundle along with the template bits
- Template bits
- Stop bits
- Non-faulting loads and exception propagation
Aggressive ST-LD reordering
Physical memory system
Ideal pipelines
Ideal cache
- More capacity
- Fast
- Cheap
- High bandwidth
DRAM cell
- Cheap
- Sense the perturbation through sense amplifier
- Slow and leaky
SRAM cell (Cross coupled inverter)
- Expensive
- Fast (easier to sense the value in the cell)
Memory bank
- Read access sequence
- DRAM: Activate → Read → Precharge (if needed)
- What dominates the access latency for DRAM and SRAM
Scaling issue
- Hard to scale the cell to be small
Memory hierarchy
- Prefetching
- Caching
Spatial and temporal locality
- Cache can exploit these
- Recently used data is likely to be accessed
- Nearby data is likely to be accessed
Caching in a pipeline design
Cache management
- Manual
  - Data movement is managed manually
    - Embedded processor
    - GPU scratchpad
- Automatic
  - HW manages data movement
Latency analysis
- Based on the hit and miss status, next level access time (if miss), and the current level access time
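
The latency analysis above is the average memory access time calculation. A minimal Python sketch, assuming a miss pays the current level's access time plus the next level's latency; function and parameter names are illustrative:

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average access time = hit time + miss rate * miss penalty."""
    return hit_time + miss_rate * miss_penalty


def amat_two_level(l1_hit, l1_miss_rate, l2_hit, l2_miss_rate, mem_latency):
    """The L1 miss penalty is itself the average access time of the L2."""
    l2_amat = amat(l2_hit, l2_miss_rate, mem_latency)
    return amat(l1_hit, l1_miss_rate, l2_amat)


# Illustrative numbers only: latencies in cycles, miss rates as fractions.
print(amat_two_level(2, 0.1, 10, 0.5, 100))
```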
Cache basics
- Set/block (line)/Placement/replacement/direct mapped vs. associative cache/etc.
Cache access
- How to access tag and data (in parallel vs serially)
- How do tag and index get used?
- Modern processors perform serial access for higher level cache (L3 for example) to save power
Cost and benefit of having more associativity
- Given the associativity, which block should be replaced if the set is full
- Replacement policy
  - Random
  - Least recently used (LRU)
  - Least frequently used
  - Least costly to refetch
  - etc.
How to implement LRU
- How to keep track of access ordering
  - Complexity increases rapidly
- Approximate LRU
  - Victim and next Victim policy
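
A minimal Python sketch of true LRU for one cache set, using an ordered dictionary to keep the access ordering. The class name and the (hit, evicted_tag) return convention are assumptions; real hardware keeps the ordering in a few state bits per set instead:

```python
from collections import OrderedDict


class LRUSet:
    """One set of a set-associative cache with true LRU replacement."""

    def __init__(self, ways):
        self.ways = ways
        self.blocks = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag):
        """Return (hit, evicted_tag); evicted_tag is None if nothing was evicted."""
        if tag in self.blocks:
            self.blocks.move_to_end(tag)  # promote to most recently used
            return True, None
        evicted = None
        if len(self.blocks) == self.ways:
            evicted, _ = self.blocks.popitem(last=False)  # drop the LRU block
        self.blocks[tag] = None
        return False, evicted


s = LRUSet(ways=2)
for tag in ["A", "B", "A", "C", "B"]:
    print(tag, s.access(tag))
```

Tracking the full ordering is what gets expensive as associativity grows, since the number of possible orderings is the factorial of the number of ways; approximate policies track less.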

Lecture 18 (2/27 Fri.)

Tag store and data store
Cache hit rate
Average memory access time (AMAT)
AMAT vs. Stall time
Cache basics
- Direct mapped vs. associative cache
- Set/block (line)/Placement/replacement
- How do tag and index get used?
Full associativity
Set associative cache
- insertion, promotion, eviction (replacement)
Various replacement policies
How to implement LRU
- How to keep track of access ordering
  - Complexity increases rapidly
- Approximate LRU
  - Victim and next Victim policy
Set thrashing
- Working set is bigger than the associativity
Belady's OPT
- Is this optimal?
- Complexity?
DRAM as a cache for disk
Handling writes
- Write through
  - Simpler, no consistency issues
- Write back
  - Need a modified bit to make sure accesses to data get the updated data
Sectored cache
- Use subblock
  - lower bandwidth
  - more complex
Instruction vs data cache
- Where to place instructions
  - Unified vs. separated
- In the first level cache
Cache access
- First level access
- Second level access
  - When to start the second level access
Cache performance
- capacity
- block size
- associativity
Classification of cache misses

Lecture 19 (03/02 Mon.)

Subblocks
Victim cache
- Small, but fully assoc. cache behind the actual cache
- Caches blocks recently evicted from the actual cache
- Prevent ping-ponging
Pseudo associativity
- Simpler way to implement associative cache
Skewed assoc. cache
- Different hashing functions for each way
Restructure data access pattern
- Order of loop traversal
- Blocking
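
Blocking (tiling) can be sketched with matrix multiplication. This minimal Python version only shows the loop restructuring; `tile` is an assumed tuning parameter that would be chosen so a tile's working set fits in the cache, and pure Python will not show the speedup itself:

```python
def matmul_naive(a, b, n):
    """Reference triple loop: walks a full column of b for every element of c."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, tile):
    """Same computation, restructured to work on tile x tile sub-blocks."""
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # Everything touched in here stays within three small tiles.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        acc = c[i][j]
                        for k in range(kk, min(kk + tile, n)):
                            acc += a[i][k] * b[k][j]
                        c[i][j] = acc
    return c
```

Both functions compute the same result; only the order in which elements are visited changes, which is what improves reuse in the cache.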
Memory level parallelism
- The cost per miss of parallel cache misses is lower than that of serial misses
MSHR
- Keeps track of pending cache misses
  - Think of this as the load/store buffer-ish for cache
- What information goes into the MSHR?
- When do you access the MSHR?
Memory banks
Shared caches in multi-core processors

Lecture 20 (03/04 Wed.)

Virtual vs. physical memory
System's management on memory
- Benefits
Problem: physical memory has limited size
Mechanisms: indirection, virtual addresses, and translation
Demand paging
Physical memory as a cache
Tasks of system SW for VM
Serving a page fault
Address translation
Page table
- PTE (page table entry)
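
A minimal Python sketch of address translation through a single-level page table. The 4 KB page size, the dictionary-based table, and the `valid`/`pfn` field names are assumptions for illustration; a real PTE also carries protection and status bits:

```python
PAGE_SIZE = 4096  # assumed 4 KB pages


def translate(vaddr, page_table):
    """Map a virtual address to a physical address, or signal a page fault."""
    vpn, offset = divmod(vaddr, PAGE_SIZE)  # virtual page number, page offset
    pte = page_table.get(vpn)
    if pte is None or not pte["valid"]:
        raise LookupError(f"page fault at VPN {vpn}")
    # The offset within the page is unchanged by translation.
    return pte["pfn"] * PAGE_SIZE + offset


page_table = {2: {"valid": True, "pfn": 7}}
print(hex(translate(2 * PAGE_SIZE + 5, page_table)))
```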
Page replacement algorithm
- CLOCK algo.
- Inverted page table
Page size trade-offs
Protection
Multi-level page tables
x86 implementation of page table
TLB
- Handling misses
When to do address translation?
Homonym and Synonyms
- Homonym: Same VA but maps to different PA with multiple processes
- Synonyms: Multiple VAs map to the same PA
  - Shared libraries, shared data, copy-on-write
Virtually indexed vs. physically indexed
Virtually tagged vs. physically tagged
Virtually indexed physically tagged
Can these create problems when we have the cache?
How to eliminate these problems?
Page coloring
Interaction between cache and TLB

Lecture 21 (03/23 Mon.)

DRAM scaling problem
Demands/trends affecting the main memory
- More capacity
- Low energy
- More bandwidth
- QoS
ECC in DRAM
Multi-porting
- Virtual multi-porting
  - Time-share the port, not too scalable but cheap
- True multiporting
Multiple cache copies
Alignment
Banking
- Can have bank conflict
- Extra interconnects across banks
- Address mapping can mitigate bank conflict
- Common in main memory (note that regFile in GPU is also banked, but mainly for the purpose of reducing complexity)
Bank mapping
- How to avoid bank conflicts?
Channel mapping
- Address mapping to minimize bank conflict
- Page coloring
  - Virtual to physical mapping that can help reduce conflicts
Accessing DRAM
- Row bits
- Column bits
- Addressability
- DRAM has its own clock
- Sense amplifier
- Bit lines
- Word lines
DRAM (1T-1C) vs. SRAM (6T)
- Cost
- Latency
Interleaving in DRAM
- Effects from address mapping on memory interleaving
- Effects from memory access patterns from the program on interleaving
DRAM Bank
- To minimize the cost of interleaving (banks share the data bus and the command bus)
DRAM Rank
- Minimize the cost of the chip (a bundle of chips operated together)
DRAM Channel
- An interface to DRAM, each with its own ranks/banks
DRAM Chip
DIMM
- More DIMMs add interconnect complexity
List of commands to read/write data into DRAM
- Activate → read/write → precharge
- Activate moves data into the row buffer
- Precharge prepares the bank for the next access
Row buffer hit
Row buffer conflict
Scheduling memory requests to lower row conflicts
Burst mode of DRAM
- Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits
Address mapping
- Row interleaved
- Cache block interleaved
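
The two address mappings can be compared with a small Python sketch. The geometry (64-byte cache blocks, 8 banks, 128 blocks per row, single channel and rank) and the function names are assumptions chosen for illustration, not a real DRAM configuration:

```python
BLOCK_BYTES = 64      # assumed cache block size
NUM_BANKS = 8         # assumed banks
BLOCKS_PER_ROW = 128  # assumed row size of 8 KB


def row_interleaved(addr):
    """Consecutive cache blocks fill one row of one bank before moving on."""
    block = addr // BLOCK_BYTES
    column = block % BLOCKS_PER_ROW
    bank = (block // BLOCKS_PER_ROW) % NUM_BANKS
    row = block // (BLOCKS_PER_ROW * NUM_BANKS)
    return row, bank, column


def block_interleaved(addr):
    """Consecutive cache blocks go to consecutive banks."""
    block = addr // BLOCK_BYTES
    bank = block % NUM_BANKS
    column = (block // NUM_BANKS) % BLOCKS_PER_ROW
    row = block // (NUM_BANKS * BLOCKS_PER_ROW)
    return row, bank, column


for addr in range(0, 4 * BLOCK_BYTES, BLOCK_BYTES):
    print(addr, row_interleaved(addr), block_interleaved(addr))
```

Under these assumptions a streaming access pattern gets row buffer hits in a single bank with row interleaving, and bank-level parallelism with cache block interleaving.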
Memory controller
- Sending DRAM commands
- Periodically send commands to refresh DRAM cells
- Ensure correctness and data integrity
- Where to place the memory controller
  - On CPU chip vs. at the main memory
    - Higher BW on-chip
- Determine the order of requests that will be serviced in DRAM
  - Request queues that hold requests
  - Send requests whenever the request can be sent to the bank
  - Determine which command (across banks) should be sent to DRAM

Lecture 22 (03/25 Wed.)

Flash controller
Flash memory
Garbage collection in flash
Overhead in flash memory
- Erase (off the critical path, but takes a long time)
Different types of DRAM
DRAM design choices
- Cost/density/latency/BW/Yield
Sense Amplifier
- How do they work
Double data rate
Subarray
RowClone
- Moving data in bulk from one row to another
- Lower latency and BW consumption when copying or zeroing out data
TL-DRAM
- Far segment
- Near segment
- What causes the long latency
- Benefit of TL-DRAM
  - TL-DRAM vs. DRAM cache (adding a small cache in DRAM)
List of commands to read/write data into DRAM
- Activate → read/write → precharge
- Activate moves data into the row buffer
- Precharge prepares the bank for the next access
Row buffer hit
Row buffer conflict
Scheduling memory requests to lower row conflicts
Burst mode of DRAM
- Prefetch 32-bits from an 8-bit interface if DRAM needs to read 32 bits
Address mapping
- Row interleaved
- Cache block interleaved
Memory controller
- Sending DRAM commands
- Periodically send commands to refresh DRAM cells
- Ensure correctness and data integrity
- Where to place the memory controller
  - On CPU chip vs. at the main memory
    - Higher BW on-chip
- Determine the order of requests that will be serviced in DRAM
  - Request queues that hold requests
  - Send requests whenever the request can be sent to the bank
  - Determine which command (across banks) should be sent to DRAM
Priority of demand vs. prefetch requests
Memory scheduling policies
- FCFS
- FR-FCFS
  - Try to maximize row buffer hit rate
  - Capped FR-FCFS: FR-FCFS with a timeout
  - Usually this is done at the command level (read/write commands and precharge/activate commands)
- PAR-BS
  - Key benefits
  - stall time
  - shortest job first
- STFM
- ATLAS
- TCM
  - Key benefits
  - Configurability
  - Fairness + performance at the same time
  - Robustness issues
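
A minimal Python sketch contrasting FCFS with FR-FCFS from the list above. It works at the request level for simplicity, and the request tuple layout and the `open_rows` dictionary are assumptions for illustration:

```python
def fcfs_pick(queue):
    """Oldest request first. Each request is (arrival_time, bank, row)."""
    return min(queue, key=lambda req: req[0])


def fr_fcfs_pick(queue, open_rows):
    """Row buffer hits first, then oldest first.

    open_rows maps a bank to the row currently held in its row buffer.
    """
    def priority(req):
        arrival, bank, row = req
        row_hit = open_rows.get(bank) == row
        return (not row_hit, arrival)  # False sorts before True, so hits win
    return min(queue, key=priority)


queue = [(0, 0, 5), (1, 0, 7), (2, 1, 3)]
open_rows = {0: 7, 1: 9}
print(fcfs_pick(queue), fr_fcfs_pick(queue, open_rows))
```

Prioritizing row hits raises the row buffer hit rate but lets a streaming application keep overtaking older requests, which is why a capped variant and the fairness-oriented schedulers exist.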
Open row policy
Closed row policy
QoS
- QoS issues in memory scheduling
- Fairness
- Performance guarantee

Lecture 23 (03/27 Fri.)

Different ways to control interference in DRAM
- Partitioning of resource
  - Channel partitioning: map applications that interfere with each other to different channels
    - Keep track of application's characteristics
    - Dedicating a channel might waste bandwidth
    - Need OS support to determine the channel bits
- Source throttling
  - A controller throttles the cores depending on the performance target
  - Example: Fairness via source throttling
    - Detect unfairness and throttle the application that is interfering
    - How do you estimate slowdown?
    - Threshold based solution: hard to configure
- App/thread scheduling
  - Critical threads usually stall the progress
- Designing DRAM controller
  - Has to handle the normal DRAM operations
    - Read/write/refresh/all the timing constraints
  - Keep track of resources
  - Assign priorities to different requests
  - Manage requests to banks
- Self-optimizing controller
  - Use machine learning to improve DRAM controller
- A-DRM
  - Architecture-aware distributed resource management
Multithread
- synchronization
- Pipeline programs
  - Producer consumer model
- Critical path
- Limiter threads
- Prioritization between threads
Different power mode in DRAM
DRAM Refresh
- Why does DRAM have to refresh every 64 ms?
- Banks are unavailable during refresh
  - LPDDR mitigates this by using per-bank refresh
- Refresh takes longer with bigger DRAM
- Distributed refresh: stagger refreshes across the 64 ms window in a distributed manner
  - As opposed to burst refresh (long pause time)
RAIDR: Reduce DRAM refresh by profiling and binning
- Some rows do not have to be refreshed very frequently
  - Profile the row
    - High temperature changes the retention time: need online profiling
Bloom filter
- Represent set membership
- Approximated
- Can produce false positives
  - Better/more hash functions help reduce this
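
A minimal Python sketch of a Bloom filter. The bit-array size, the number of hash functions, and the use of seeded SHA-256 as a stand-in for the hash functions are all assumptions; a hardware design would use much simpler hashes:

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [False] * num_bits

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def might_contain(self, item):
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[pos] for pos in self._positions(item))


weak_rows = BloomFilter()
weak_rows.add("row-17")
print(weak_rows.might_contain("row-17"))
```

Inserted items are never reported absent, which is what makes the structure safe for tracking rows that need frequent refresh: a false positive only costs an unnecessary refresh.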

Lecture 24 (03/30 Mon.)

Simulation
- Drawbacks of RTL simulations
  - Time consuming
  - Complex to develop
  - Hard to perform design explorations
- Explore the design space quickly
- Match the behavior of existing systems
- Tradeoffs: speed, accuracy, flexibility
- High-level simulation vs. detailed simulation
  - High-level simulation is faster, but lower accuracy
Controllers that work on multiple types of cores
- Design problems: how to find a good scheduling policy on its own?
- Self-optimizing memory controller: using machine learning
  - Can adapt to the applications
  - The complexity is very high
Tolerating latency can be costly
- Instruction window is complex
  - Benefit also diminishes
- Designing the buffers can be complex
- A simpler way to tolerate latency than a large out-of-order window is desirable
Different sources that cause the core to stall in OoO
- Cache miss
- Note that the stall happens when the instruction window is full
Scaling instruction window size is hard
- It is better (less complex) to make the window more efficient
Runahead execution
- Try to obtain MLP (memory-level parallelism) without increasing the instruction window size
- Runahead (i.e. execute ahead) when there is a long-latency memory instruction
  - A long-latency memory instruction stalls the processor for a while anyway, so it is better to make use of that time
  - Execute future instructions to generate accurate prefetches
  - Allow future data to be in the cache
- How to support runahead execution?
  - Need a way to checkpoint the state when entering runahead mode
  - How to make executing in the wrong path useful?
  - Need runahead cache to handle load/store in Runahead mode (since they are speculative)

Lecture 25 (4/1 Wed.)

More on runahead execution
- How to support runahead execution?
  - Need a way to checkpoint the state when entering runahead mode
  - How to make executing in the wrong path useful?
  - Need runahead cache to handle load/store in Runahead mode (since they are speculative)
- Cost and benefit of runahead execution (slide number 27)
- Runahead can have inefficiency
  - Runahead periods that are useless
    - Get rid of useless, inefficient periods
- What if there is a dependent cache miss?
  - Cannot be parallelized in a vanilla runahead
  - Can predict the value of the dependent load
    - How to predict the address of the load
      - Delta value information
      - Stride predictor
      - AVD prediction
Questions regarding prefetching
- What to prefetch
- When to prefetch
- How do we prefetch
- Where to prefetch from
Prefetching can cause thrashing (evict a useful block)
Prefetching can also be useless (not being used)
- Need to be efficient
Can cause memory bandwidth problem in GPU
Prefetch the whole block, more than one block, or subblock?
- Each one of them has pros and cons
- Big prefetch is more likely to waste bandwidth
- Commonly done in a cache block granularity
Prefetch accuracy: fraction of useful prefetches out of all the prefetches
Prefetchers usually predict based on
- Past knowledge
- Compiler hints
Prefetcher has to prefetch at the right time
- Prefetch that is too early might get evicted
  - It might also evict other useful data
- A prefetch that is too late does not hide the whole memory latency
Previous prefetches at the same PC can be used as the history
Previous demand requests are also good information to use for prefetches
Prefetch buffer
- Place the prefetched data in a separate buffer to avoid thrashing
  - Can treat demand/prefetch requests separately
  - More complex
Generally, demand block is more important
- This means eviction should prefer prefetched blocks as opposed to demand blocks
Tradeoffs in where to place the prefetcher
- Look at L1 hits and misses
- Look at L1 misses only
- Look at L2 misses
- Different access patterns affect accuracy
  - Tradeoffs between handling more requests (seeing L1 hits and misses) and less visibility (only seeing L2 misses)
Software vs. hardware vs. execution based prefetching
- Software: the ISA provides prefetch instructions, and software uses them
  - What information is useful
  - How to make sure the prefetch is timely
  - What if you have a pointer based structure
    - Not easy to prefetch pointer chasing (because in many cases the work between prefetches is short, so you cannot predict the next one early enough)
      - Can be solved by hinting the next-next and/or next-next-next address
- Hardware: Identify the pattern and prefetch
- Execution driven: Opportunistically try to prefetch (runahead, dual-core execution)
Stride prefetcher
- Predict strides, which is common in many programs
- Cache block based or instruction based
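A minimal sketch of an instruction based stride prefetcher, assuming a table indexed by load PC that stores the last address and last stride. The class name, the dictionary table, and the "same stride seen twice" confidence rule are illustrative assumptions, not the specific design from lecture.

```python
class StridePrefetcher:
    """Instruction (PC) indexed stride prefetcher sketch."""

    def __init__(self, degree=1):
        self.degree = degree  # how many blocks ahead to request per trigger
        self.table = {}       # pc -> (last_addr, last_stride)

    def access(self, pc, addr):
        """Record a load and return the list of addresses to prefetch."""
        prefetches = []
        if pc in self.table:
            last_addr, last_stride = self.table[pc]
            stride = addr - last_addr
            # Prefetch only when the same non-zero stride is seen twice in a row.
            if stride != 0 and stride == last_stride:
                prefetches = [addr + stride * i for i in range(1, self.degree + 1)]
            self.table[pc] = (addr, stride)
        else:
            self.table[pc] = (addr, 0)
        return prefetches
```

For example, a load at one PC touching addresses 100, 164, 228 with `degree=2` triggers prefetches for 292 and 356 on the third access, once the stride of 64 has repeated.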
Stream buffer design
- Buffer the stream of accesses (next address)
- Use the information to prefetch
What affects prefetcher performance
- Prefetch distance
  - How far ahead should we prefetch
- Prefetch degree
  - How many prefetches do we issue at a time
Prefetcher performance
- Coverage
  - Out of the demand requests that would have missed, how many are covered by prefetch requests
- Accuracy
  - Out of all the prefetch requests, how many are actually getting used
- Timeliness
  - How much memory latency can we hide from the prefetch requests
- Cache pollution
  - How many extra demand misses did the prefetcher cause?
    - Hard to quantify
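The accuracy and coverage definitions above can be written as two small ratios. This is a sketch under one common reading: coverage is taken as useful prefetches divided by the misses that would have occurred without prefetching (useful prefetches plus the demand misses that remain). The function and parameter names are hypothetical.

```python
def prefetch_accuracy(useful_prefetches, issued_prefetches):
    """Fraction of issued prefetches that were later used by a demand request."""
    if issued_prefetches == 0:
        return 0.0
    return useful_prefetches / issued_prefetches


def prefetch_coverage(useful_prefetches, remaining_demand_misses):
    """Fraction of would-be demand misses that prefetching eliminated."""
    total = useful_prefetches + remaining_demand_misses
    if total == 0:
        return 0.0
    return useful_prefetches / total
```

A prefetcher that issues 100 prefetches, 60 of them useful, while 40 demand misses remain has accuracy 0.6 and coverage 0.6. Timeliness and pollution need extra bookkeeping and are not captured by these two counters.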

Lecture 26 (4/3 Fri.)

Feedback directed prefetcher
- Use the result of the prefetcher as a feedback to the prefetcher
  - With accuracy, timeliness, and pollution information
Markov prefetcher
- Prefetch based on the previous history
- Use a Markov model to predict
- Pros: Can cover arbitrary patterns (e.g., linked list or tree traversals)
- Downside: High cost, cannot help with compulsory misses (no history)
Content directed prefetching
- Identify pointers in the content of memory (which are used as the addresses to prefetch)
- Not very efficient (hard to figure out which values are pointers)
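A minimal Markov prefetcher sketch, assuming a correlation table that counts which miss address followed each miss address. The class name, the unbounded dictionary, and the `width` parameter are illustrative assumptions; a real table is finite, which is where the high cost comes from.

```python
from collections import Counter, defaultdict


class MarkovPrefetcher:
    """Predict the next miss addresses from what followed this miss before."""

    def __init__(self, width=2):
        self.width = width                      # max predictions per miss
        self.successors = defaultdict(Counter)  # miss addr -> counts of next miss
        self.last_miss = None

    def on_miss(self, addr):
        """Record a miss and return addresses to prefetch (most likely first)."""
        if self.last_miss is not None:
            self.successors[self.last_miss][addr] += 1
        self.last_miss = addr
        history = self.successors.get(addr)
        if not history:
            return []  # first time this address misses: no history to use
        return [nxt for nxt, _ in history.most_common(self.width)]
```

With the miss sequence 0x10, 0x80, 0x24, 0x10, the fourth miss returns [0x80] because 0x80 followed 0x10 last time. The first three misses return nothing, which is the "no history" downside noted above.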
  - Software can give hints
Correlation table
- Address correlation
Execution based prefetcher
- Helper thread/speculative thread
  - Use another thread to pre-execute a program
- Can be software based or hardware based
- Discover misses before the main program (to prefetch data in a timely manner)
- How do you construct the helper thread
- Preexecute instruction (one example of how to initialize a speculative thread), slide 9
- Thread-based pre-execution
Error tolerance
Solution to errors
- Tolerate errors
  - New interface, new design
- Eliminate or minimize errors
  - New technology, system-wide rethinking
- Embrace errors
  - Map data that can tolerate errors to error-prone area
Hybrid memory systems
- Combining multiple memory technologies together
What can emerging technologies help with?
- Scalability
- Lower the cost
- Energy efficiency
Possible solutions to the scaling problem
- Less leakage DRAM
- Heterogeneous DRAM (TL-DRAM, etc.)
- Add more functionality to DRAM
- Denser design (3D stack)
- Different technology
  - NVM
Charge vs. resistive memory
- How is data written?
- How to read the data?
Non-volatile memory
- Resistive memory
  - PCM
    - Inject current to change the phase
    - Scales better than DRAM
      - Multiple bits per cell
        - Wider resistance range
    - No refresh is needed
    - Downside: Latency and write endurance
  - STT-MRAM
    - Inject current to change the polarity
  - Memristor
    - Inject current to change the structure
- Pros and cons between different technologies
- Persistency: data stays there even without power
  - Unified memory and storage management (persistent data structure) - Single level store
    - Improve energy and performance
    - Simplify programming model
Different design options for DRAM + NVM
- DRAM as a cache
- Place some data in DRAM and other in PCM
  - Based on the characteristics
    - Place frequently accessed data that needs lower write latency in DRAM

Lecture 27 (4/6 Mon.)

Flynn's taxonomy
Parallelism
- Reduces power consumption (P ~ CV^2F)
- Better cost efficiency and easier to scale
- Improves dependability (in case another core is faulty)
Different types of parallelism
- Instruction level parallelism
- Data level parallelism
- Task level parallelism
Task level parallelism
- Partition a single, potentially big, task into multiple parallel sub-tasks
  - Can be done explicitly (parallel programming by the programmer)
  - Or implicitly (hardware partitions a single thread speculatively)
- Or, run multiple independent tasks (still improves throughput, but does not speed up any single task; also simpler to implement)
Loosely coupled multiprocessor
- No shared global address space
  - Message passing to communicate between different processors
- Simple to manage memory
Tightly coupled multiprocessor
- Shared global address space
- Need to ensure consistency of data
- Programming issues
Hardware-based multithreading
- Coarse grained
- Fine grained
- Simultaneous: Dispatch instructions from multiple threads at the same time
Parallel speedup
- Superlinear speedup
Utilization, Redundancy, Efficiency
Amdahl's law
- Maximum speedup
- Parallel portion is not perfect
  - Serial bottleneck
  - Synchronization cost
  - Load balance
    - Some threads have more work and need more time to reach the sync. point
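Amdahl's law can be written as a one-line formula. In the sketch below, `parallel_fraction` is the fraction of the original execution time that can be parallelized and `num_processors` is the number of processors; both names are chosen here for illustration. The formula assumes the parallel portion scales perfectly, which the points above (serial bottleneck, synchronization cost, load imbalance) say is optimistic.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Speedup = 1 / ((1 - f) + f / N), assuming perfect parallel scaling."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / num_processors)


def amdahl_max_speedup(parallel_fraction):
    """Upper bound as N grows without limit: 1 / (1 - f)."""
    if parallel_fraction >= 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)
```

For instance, with 90% of the work parallelizable, 10 processors give a speedup of roughly 5.3, and no number of processors can exceed roughly 10.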
Critical sections
- Enforce mutually exclusive access to shared data
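A small Python sketch of a critical section, assuming a shared counter protected by a lock from the standard `threading` module. The names `counter`, `add_many`, and `run_counter` are hypothetical.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times):
    """Increment the shared counter; the lock makes each update exclusive."""
    global counter
    for _ in range(times):
        with counter_lock:  # critical section: one thread at a time
            counter += 1


def run_counter(num_threads=4, times=10_000):
    """Run several threads against the shared counter and return the total."""
    global counter
    counter = 0
    threads = [threading.Thread(target=add_many, args=(times,))
               for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

With the lock in place the result is always `num_threads * times`. The lock also serializes the threads, which is the kind of synchronization cost that limits parallel speedup.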
Issues in parallel programming
- Correctness
- Synchronization
- Consistency

Lecture 28 (4/8 Wed.)

Ordering of instructions
- Maintaining memory consistency when there are multiple threads and shared memory
- Need to ensure the semantics are not changed
- Making sure the shared data is properly locked when used
  - Support mutual exclusion
- Ordering depends on when each processor executes
- Debugging is also difficult (non-deterministic behavior)
Dekker's algorithm
- Inconsistency – the two processors did NOT see the same order of operations to memory
Sequential consistency
- Multiple correct global orders
- Two issues:
  - Too conservative/strict
  - Performance limiting
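The "multiple correct global orders" point can be checked by brute force. The sketch below assumes the usual two-processor flag test (P1 stores A then loads B, P2 stores B then loads A, both locations initially 0) and enumerates every interleaving that respects program order. The encoding of operations as tuples is an illustrative choice.

```python
P1 = [("store", "A", 1), ("load", "B", "r1")]
P2 = [("store", "B", 1), ("load", "A", "r2")]


def interleavings(a, b):
    """Yield every merge of a and b that keeps each list's program order."""
    if not a:
        yield list(b)
        return
    if not b:
        yield list(a)
        return
    for rest in interleavings(a[1:], b):
        yield [a[0]] + rest
    for rest in interleavings(a, b[1:]):
        yield [b[0]] + rest


def execute(order):
    """Run one global order against a single shared memory; return (r1, r2)."""
    memory = {"A": 0, "B": 0}
    regs = {}
    for op, location, arg in order:
        if op == "store":
            memory[location] = arg
        else:
            regs[arg] = memory[location]
    return regs["r1"], regs["r2"]


outcomes = {execute(order) for order in interleavings(P1, P2)}
```

Six global orders are possible and they yield three distinct results, all of them valid under sequential consistency. The result (0, 0), where both processors read the old values, never appears; seeing it on real hardware means the two processors did not observe the same order of memory operations.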
Weak consistency: global ordering only at synchronization points
- Programmer hints where the synchronizations are
- Memory fence
- More burden on the programmers
Cache coherence
- Can be done in the software level or hardware level
Snoop-based coherence
- A simple protocol with two states by broadcasting reads/writes on a bus
Maintaining coherence
- Needs to provide 1) write propagation and 2) write serialization
- Update vs. Invalidate
Two cache coherence methods
- Snoopy bus
  - Bus based, single point of serialization
  - More efficient with small number of processors
  - Processors snoop other caches' read/write requests to keep the cache block coherent
- Directory
  - Single point of serialization per block
  - Directory coordinates the coherency
  - More scalable
  - The directory keeps track of where the copies of each block reside
    - Supplies data on a read
    - Invalidates the block on a write
    - Has an exclusive state
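A minimal sketch of the directory bookkeeping described above, assuming one presence bit per cache (a bit vector stored as a Python integer) plus an exclusive flag per block. The class and method names are hypothetical, and data transfer and message latencies are left out.

```python
class Directory:
    """Tracks which caches hold each block; the single ordering point per block."""

    def __init__(self, num_caches):
        self.num_caches = num_caches
        self.sharers = {}    # block -> bit vector, bit i set => cache i has a copy
        self.exclusive = {}  # block -> True if one cache holds a writable copy

    def _holders(self, block):
        vector = self.sharers.get(block, 0)
        return [c for c in range(self.num_caches) if vector & (1 << c)]

    def read(self, block, cache_id):
        """Handle a read miss; return caches that must give up write permission."""
        downgraded = []
        if self.exclusive.get(block):
            downgraded = self._holders(block)
            self.exclusive[block] = False
        self.sharers[block] = self.sharers.get(block, 0) | (1 << cache_id)
        return downgraded

    def write(self, block, cache_id):
        """Handle a write; return caches whose copies are invalidated."""
        invalidated = [c for c in self._holders(block) if c != cache_id]
        self.sharers[block] = 1 << cache_id
        self.exclusive[block] = True
        return invalidated
```

If caches 0 and 2 read a block and cache 1 then writes it, the directory invalidates only caches 0 and 2 rather than broadcasting to everyone, which is why this approach scales better than a snoopy bus.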

Lecture 29 (4/13 Mon.)

MSI coherence protocol
- The problem: unnecessary broadcasts of invalidations
MESI coherence protocol
- Add the exclusive state to MSI: this is the only cached copy and it is clean
- Multiple invalidation tradeoffs
- Problem: memory can be unnecessarily updated
- A possible owner state (MOESI)
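A simplified MESI transition sketch for one block in one cache. The event names (`PrRd`, `PrWr` for processor requests, `BusRd`, `BusRdX` for snooped requests) and the returned bus actions are textbook style labels assumed here for illustration; real protocols differ in details such as cache to cache transfers and upgrade requests.

```python
def mesi_step(state, event, others_have_copy=False):
    """Return (next_state, bus_action) for one cache block in one cache."""
    if event == "PrRd":
        if state == "I":
            # Read miss: Exclusive if no other cache has the block, else Shared.
            return ("S" if others_have_copy else "E", "BusRd")
        return (state, None)
    if event == "PrWr":
        if state == "I":
            return ("M", "BusRdX")
        if state == "S":
            return ("M", "Invalidate")
        return ("M", None)  # E or M: silent upgrade, no broadcast needed
    if event == "BusRd":
        if state == "M":
            return ("S", "Flush")  # dirty data is written back
        if state == "E":
            return ("S", None)
        return (state, None)
    if event == "BusRdX":
        if state == "M":
            return ("I", "Flush")
        return ("I", None)
    raise ValueError(f"unknown event: {event}")
```

The write in state E needs no bus action, which is what the exclusive state buys over MSI. The `Flush` on a snooped read in state M is the memory update that the owner state in MOESI is meant to avoid.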
Tradeoffs between snooping and directory based coherence protocols
- Slide 31 has a good summary
Directory: data structures
- Bit vectors vs. linked lists
Scalability of directories
- Size? Latency? Thousands of nodes? Best of both snooping and directory?

Lecture 30 (4/15 Wed.)

Application slowdown
Interference between different applications
- An application's performance depends on the other applications it is running with
Predictable performance
- Why is it important?
- Applications that need predictability
- How to predict the performance?
  - What information is useful?
  - What needs to be guaranteed?
  - How to estimate the performance when running with others?
    - Easy, just measure the performance while it is running.
  - How to estimate the performance when the application is running by itself?
    - Hard if there is no profiling.
  - The relationship between memory service rate and the performance.
    - Key assumption: applications are memory bound
- Behavior of memory-bound applications
  - With and without interference
Memory phase vs. compute phase
MISE
- Estimating slowdown using request service rate
- Inaccuracy when measuring request service rate alone
- Non-memory-bound applications
- Control slowdown and provide soft guarantees
Taking the shared cache into account
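A sketch of slowdown estimation from request service rates. It assumes the commonly cited form of the model: for a memory bound application, slowdown is the alone request service rate divided by the shared request service rate, and for other applications only the memory phase fraction of time is scaled by that ratio. The function and parameter names are hypothetical, and how the alone rate is measured at run time is not modeled.

```python
def mise_slowdown(alone_rsr, shared_rsr, memory_phase_fraction=1.0):
    """Estimate slowdown from request service rates (requests per unit time).

    memory_phase_fraction is the share of execution time spent in the
    memory phase; 1.0 models a fully memory bound application.
    """
    alpha = memory_phase_fraction
    return (1.0 - alpha) + alpha * (alone_rsr / shared_rsr)
```

An application served twice as fast when alone has an estimated slowdown of 2.0 if fully memory bound, but only 1.5 if it spends half of its time in the compute phase.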
- MISE model + cache resource management
- Auxiliary tag store
  - Separate tag store for different cores
- Use the cache access rates when running alone and when sharing as the metric to estimate slowdown
Cache partitioning
- How to determine partitioning
  - Utility based cache partitioning
  - Others
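A sketch of utility based partitioning using a simple greedy rule: give each way to the application that gains the most hits from one more way. The greedy rule, the function name, and the hit curve format are assumptions for illustration and may differ from the allocation algorithm presented in lecture.

```python
def partition_ways(hit_curves, total_ways):
    """Split cache ways among applications by marginal utility.

    hit_curves[i][w] is the number of hits application i gets with w ways,
    for w = 0 .. total_ways (for example, from a per-core monitor).
    """
    allocation = [0] * len(hit_curves)
    for _ in range(total_ways):
        # Extra hits each application would gain from one more way.
        gains = [curve[ways + 1] - curve[ways]
                 for curve, ways in zip(hit_curves, allocation)]
        winner = max(range(len(gains)), key=gains.__getitem__)
        allocation[winner] += 1
    return allocation
```

For hit curves [0, 50, 90, 100, 105] and [0, 30, 40, 45, 47] with 4 ways, the first application receives 3 ways and the second receives 1, since the first keeps benefiting from extra ways for longer.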
Maximum slowdown and fairness metric

18-447 Introduction to Computer Architecture – Spring 2015

