Table of Contents
Buzzwords
Lecture 1
- Architecture of Parallel Computers
- Fundamentals and Tradeoffs
- Static and Dynamic Scheduling
- Parallel Task Assignment
  - Static/Dynamic
  - Task Queues
  - Task Stealing
Lecture 2
- Parallel Computer
  - SISD, SIMD, MISD, MIMD
- Performance
- Power consumption
- Cost efficiency
- Scalability
- Complexity
- Dependability
- Instruction Level Parallelism
- Data Parallelism
- Task Level Parallelism
- Parallel programming
- Thread level speculation
- Loosely/Tightly coupled multiprocessors
- Shared memory synchronization
- Cache consistency
- Ordering of memory operations
- Hardware-based Multithreading
  - Coarse-grained
  - Fine-grained
  - Simultaneous
- Amdahl’s Law
  - Serial bottleneck
  - Synchronization overhead
  - Load imbalance overhead
  - Resource sharing overhead
- Superlinear Speedup
  - Unfair comparisons
  - Memory/cache effect
- Utilization, Redundancy, Efficiency
- Parallel Programming
- Parallel and Serial Bottlenecks
Lecture 3
- Programming Models vs. Architectures
- Shared memory programming model
- Message passing programming model
- Shared memory hardware
- Message passing hardware
- Communication abstraction
- Generic Parallel Machine
- Data Flow Graph
- Synchronization
- Application Binary Interface (ABI)
- Data parallel programming model
- Data parallel hardware
- Connection Machine
- Data flow programming model
- Data flow hardware
- Scalability
- Interconnection Schemes
- Uniform Memory/Cache Access (UMA/UCA)
- Memory latency
- Memory bandwidth
- Symmetric multiprocessing (SMP)
- Data placement
- Non-Uniform Memory/Cache Access (NUMA/NUCA)
  - Local and remote memories
  - Critical path of memory access
Lecture 4
- Multi-Core Processors
- Technology scaling
- Transistors and die area
- Large Superscalar
- Single-thread performance
- Instruction issue queue
- Multi-ported register file
- Loop-level parallelism
- Multiprogramming
- Bigger caches
- Multithreading
- Thread-level parallelism
- Resource sharing
- Integrating platform components
- Clustered superscalar processor
- Inter-cluster bypass
- Traditional symmetric multiprocessors
Lecture 5
- Chip Multiprocessor (CMP)
- Workload Characteristics
- Instruction Level Parallelism (ILP)
- Piranha CMP
  - Processing Node
  - Coherence Protocol Engine
  - I/O Node
- Sun Niagara (UltraSPARC T1)
  - Niagara Core
- Sun Niagara II (UltraSPARC T2)
- Chip Multithreading (CMT)
- Sun Rock
- Runahead Execution
- Memory Level Parallelism (MLP)
- IBM POWER4
- IBM POWER5
- IBM POWER6
- IBM POWER7
- Large vs. Small Cores
- Tile-Large vs. Tile-Small
- Asymmetric Chip Multiprocessor (ACMP)
- Serial Bottlenecks
- Amdahl's Law
- Asymmetric vs. Symmetric Cores
- Frequency Boosting
- EPI Throttling
- Dynamic voltage frequency scaling (DVFS)
Lecture 6
- EPI Throttling
- Asymmetric Chip Multiprocessor (ACMP)
- Energy Efficiency
- Programmer effort
- Shared Resource Management
- Serialized Code Sections
- Accelerated Critical Sections (ACS)
- Bottleneck Identification and Scheduling (BIS)
Lecture 7
- Main Memory
- Memory Capacity
- Memory Latency
- Memory Bandwidth
- Memory Energy/Power
- Technology Scaling
- DRAM Scaling
- Charge Memory
- Resistive Memory
- Non-volatile Memory
- Phase Change Memory (PCM)
- Hybrid Memory
- Write Filtering
- Row-Locality Aware Data Placement
- Tags in Memory
- Dynamic Data Transfer Granularity
- Memory Security
Lecture 8
- Barriers
- Thread Waiting
- Bottleneck Acceleration
- False Serialization
- Starvation
- Preemptive Acceleration
- Staged Execution Model
- Segment Spawning
- Inter-segment data
- Generator instruction
- Data Marshaling
- Pipeline Parallelism
- Coverage, Accuracy, Timeliness
Lecture 9
- Memory Scheduling
- Fairness-Throughput Tradeoff
- Thread cluster
- Memory intensity
- CPU-GPU Systems
- Heterogeneous Memory Systems
- Thread
- Multitasking
- Thread context
- Hardware Multithreading
- Latency tolerance
- Fine-grained Multithreading
- Pipeline utilization
- Coarse-grained Multithreading
- Stall events
- Thread Switching Urgency
- Fairness
Lecture 10
- Fine-grained Multithreading
- Coarse-grained Multithreading
- Fairness and throughput
- Thread Switching Urgency
- Simultaneous Multithreading
- Functional Unit Utilization
- Superscalar Out-of-Order Pipeline
- SMT Pipeline
- SMT Scalability
- SMT Fetch Policy
- Long Latency Loads
- Memory-Level Parallelism (MLP)
- Runahead Threads
- Thread Priority Support
- Thread Throttling
Lecture 11
- Utility cache partitioning
- Cache capacity
- Cache data compression
- Frequent value compression
- Frequent pattern compression
- Low dynamic range
- Base+Delta encoding
- Main memory compression
- IBM MXT
- Linearly compressed pages
Lecture 13
- Fault and Error
- Fault Detection
- Fault Tolerance
- Transient Fault
- Permanent Fault
- Space redundancy
- Time redundancy
- Lockstepping
- Simultaneous Redundant Threading (SRT)
- Sphere of Replication
- Input Replication
- Output Comparison
- Branch Outcome Queue
- Line Prediction Queue
- Chip-Level Redundant Threading (CRT)
- Exception Handling
- Helper Threading for Prefetching
- Thread-Based Pre-Execution
Lecture 15
- Slipstreaming
- Instruction Removal
- Dual Core Execution
- Thread Level Speculation
- Conflict Detection
- Speculative Parallelization
- Inter-Thread Communication
- Data Dependences and Versioning
- Speculative Memory State
- Multiscalar Processor
Lecture 16
- Multiscalar Processor
- Multiscalar Tasks
- Register Forwarding
- Task Sequencing
- Inter-Task Dependences
- Address Resolution Buffer
- Memory Dependence Prediction
- Store-Load Dependencies
- Memory Disambiguation
- Speculative Lock Elision
- Atomicity
- Speculative Parallelization
- Accelerating Critical Sections
- Transactional Lock Removal
Lecture 17
- Interconnection Network
- Network Topology
  - Bus
  - Crossbar
  - Ring
  - Mesh
  - Torus
  - Tree
  - Hypercube
  - Multistage Logarithmic Network
- Circuit vs. Packet Switching
- Flow Control
- Head of Line Blocking
- Virtual Channel Flow Control
- Communicating Buffer Availability
Lecture 18
- Routing
- Deadlock
- Router Design
- Router Pipeline Optimizations
- Interconnection Network Performance
- Packet Scheduling
- Bufferless Deflection Routing
- Livelock
- Packet Reassembly
- Golden Packet
- Minimally-Buffered Deflection Routing
- Side Buffer
- Heterogeneous Adaptive Throttling
- Application-Aware Source Throttling
- Dynamic Throttling Rate Adjustment
Lecture 20
- Locks vs. Transactions
- Transactional Memory
  - Logging/buffering
  - Conflict detection
  - Abort/rollback
  - Commit
- Routing
  - Deterministic
  - Oblivious
  - Adaptive
- Deadlock
Lecture 21
- Packet Scheduling
- Stall Time Criticality
- Memory Level Parallelism
- Shortest Job First Principle
- Application Aware
- Packet Ranking and Batching
- Slack of Packets
- Packet Prioritizing using Slack
- Starvation Avoidance
- 2-D Mesh, Concentration, Replication
- Flattened Butterfly
- Multidrop Express Channels (MECS)
- Kilo-NoC
- Network-on-Chip (NoC) Quality of Service (QoS)
- Topology-Aware QoS
Lecture 22
- Data Flow
- Data Flow Nodes
- Data Flow Graphs
- Control Flow vs. Data Flow
- Static Data Flow
- Reentrant code (Function calls, Loops)
- Dynamic Data Flow
- Frame Pointer
- Tagging
- Data Structures
- I-Structure
- MIT Tagged Token Data Flow Architecture
- Manchester Data Flow Machine
- Combining Data Flow and Control Flow
Lecture 23
- Combining Data Flow and Control Flow
- Macro Dataflow
- Restricted Data Flow
- Systolic Architecture
- Systolic Computation
- Pipeline Parallelism
Lecture 24
- Resource Sharing
- Shared Resource Management and QoS
- Resource Sharing vs. Partitioning
- Multi-core Caching
- Shared Cache Management
- Sharing in Main Memory
- Memory Controller
- Inter-Thread Interference
- QoS-Aware Memory Scheduling
- Stall-Time Fairness
- Bank Parallelism-Awareness
- Request Batching
- Shortest Stall-Time First Ranking
- Memory Episode Lengths
- Least Attained Service
Lecture 25
- QoS-Aware Memory Request Scheduling
- Smart/Dumb Resources
- Throughput vs. Fairness
- Thread Cluster Memory Scheduling
- Clustering Threads
- CPU-GPU Systems
- Staged Memory Scheduling
- Parallel Application Memory QoS
Lecture 26
- QoS-Aware Memory Systems
- Smart vs. Dumb Resources
- Memory Channel Partitioning
- Application-Awareness
- Multiple Channels
- Memory Intensity
- Row Buffer Locality
- Preferred Channel
- Integrated Memory Partitioning and Scheduling
- Fairness via Source Throttling
- Dynamic Request Throttling
- Estimating System Unfairness
- Inter-Core Interference
- Row Buffer Interference
- Memory Interference-induced Slowdown Estimation
- Shared Memory Performance Predictability
- Shared Resource Interference
- Memory Phase Fraction
- Alone Request Service Rate
- Shared Request Service Rate
- “Soft” Slowdown Guarantees
Lecture 27
- CPU-GPU Memory Scheduling
- Batch Formation
- Batch Scheduler
- DRAM Command Scheduler
- Prefetcher Accuracy
- Feedback-Directed Prefetching
- Hierarchical Prefetcher Aggressiveness Control
- Inter-Core Cache Pollution
- Global Control