=====Buzzwords=====

=====Lecture 1=====
  * Architecture of Parallel Computers
  * Fundamentals and Tradeoffs
  * Static and Dynamic Scheduling
  * Parallel Task Assignment
  * Static/Dynamic
  * Task Queues
  * Task Stealing

=====Lecture 2=====
  * Parallel Computer
  * SISD, SIMD, MISD, MIMD
  * Performance
  * Power consumption
  * Cost efficiency
  * Scalability
  * Complexity
  * Dependability
  * Instruction Level Parallelism
  * Data Parallelism
  * Task Level Parallelism
  * Parallel programming
  * Thread level speculation
  * Loosely/Tightly coupled multiprocessors
  * Shared memory synchronization
  * Cache consistency
  * Ordering of memory operations
  * Hardware-based Multithreading
  * Coarse grained
  * Fine grained
  * Simultaneous
  * Amdahl's Law
  * Serial bottleneck
  * Synchronization overhead
  * Load imbalance overhead
  * Resource sharing overhead
  * Superlinear Speedup
  * Unfair comparisons
  * Memory/cache effect
  * Utilization, Redundancy, Efficiency
  * Parallel Programming
  * Parallel and Serial Bottlenecks

=====Lecture 3=====
  * Programming Models vs. Architectures
  * Shared memory programming model
  * Message passing programming model
  * Shared memory hardware
  * Message passing hardware
  * Communication abstraction
  * Generic Parallel Machine
  * Data Flow Graph
  * Synchronization
  * Application Binary Interface (ABI)
  * Data parallel programming model
  * Data parallel hardware
  * Connection Machine
  * Data flow programming model
  * Data flow hardware
  * Scalability
  * Interconnection Schemes
  * Uniform Memory/Cache Access (UMA/UCA)
  * Memory latency
  * Memory bandwidth
  * Symmetric multiprocessing (SMP)
  * Data placement
  * Non-Uniform Memory/Cache Access (NUMA/NUCA)
  * Local and remote memories
  * Critical path of memory access

=====Lecture 4=====
  * Multi-Core Processors
  * Technology scaling
  * Transistors and die area
  * Large Superscalar
  * Single-thread performance
  * Instruction issue queue
  * Multi-ported register file
  * Loop-level parallelism
  * Multiprogramming
  * Bigger caches
  * Multithreading
  * Thread-level parallelism
  * Resource sharing
  * Integrating platform components
  * Clustered superscalar processor
  * Inter-cluster bypass
  * Traditional symmetric multiprocessors

=====Lecture 5=====
  * Chip Multiprocessor (CMP)
  * Workload Characteristics
  * Instruction Level Parallelism (ILP)
  * Piranha CMP
  * Processing Node
  * Coherence Protocol Engine
  * I/O Node
  * Sun Niagara (UltraSPARC T1)
  * Niagara Core
  * Sun Niagara II (UltraSPARC T2)
  * Chip Multithreading (CMT)
  * Sun Rock
  * Runahead Execution
  * Memory Level Parallelism (MLP)
  * IBM POWER4
  * IBM POWER5
  * IBM POWER6
  * IBM POWER7
  * Large vs. Small Cores
  * Tile-Large vs. Tile-Small
  * Asymmetric Chip Multiprocessor (ACMP)
  * Serial Bottlenecks
  * Amdahl's Law
  * Asymmetric vs. Symmetric Cores
  * Frequency Boosting
  * EPI Throttling
  * Dynamic voltage frequency scaling (DVFS)

=====Lecture 6=====
  * EPI Throttling
  * Asymmetric Chip Multiprocessor (ACMP)
  * Energy Efficiency
  * Programmer effort
  * Shared Resource Management
  * Serialized Code Sections
  * Accelerated Critical Sections (ACS)
  * Bottleneck Identification and Scheduling (BIS)

=====Lecture 7=====
  * Main Memory
  * Memory Capacity
  * Memory Latency
  * Memory Bandwidth
  * Memory Energy/Power
  * Technology Scaling
  * DRAM Scaling
  * Charge Memory
  * Resistive Memory
  * Non-volatile Memory
  * Phase Change Memory (PCM)
  * Hybrid Memory
  * Write Filtering
  * Row-Locality Aware Data Placement
  * Tags in Memory
  * Dynamic Data Transfer Granularity
  * Memory Security

=====Lecture 8=====
  * Barriers
  * Thread Waiting
  * Bottleneck Acceleration
  * False Serialization
  * Starvation
  * Preemptive Acceleration
  * Staged Execution Model
  * Segment Spawning
  * Inter-segment data
  * Generator instruction
  * Data Marshaling
  * Pipeline Parallelism
  * Coverage, Accuracy, Timeliness

=====Lecture 9=====
  * Memory Scheduling
  * Fairness-Throughput
  * Thread cluster
  * Memory intensity
  * CPU-GPU Systems
  * Heterogeneous Memory Systems
  * Thread
  * Multitasking
  * Thread context
  * Hardware Multithreading
  * Latency tolerance
  * Fine-grained Multithreading
  * Pipeline utilization
  * Coarse-grained Multithreading
  * Stall events
  * Thread Switching Urgency
  * Fairness

=====Lecture 10=====
  * Fine-grained Multithreading
  * Coarse-grained Multithreading
  * Fairness and throughput
  * Thread Switching Urgency
  * Simultaneous Multithreading
  * Functional Unit Utilization
  * Superscalar Out-of-Order Pipeline
  * SMT Pipeline
  * SMT Scalability
  * SMT Fetch Policy
  * Long Latency Loads
  * Memory-Level Parallelism (MLP)
  * Runahead Threads
  * Thread Priority Support
  * Thread Throttling

=====Lecture 11=====
  * Utility cache partitioning
  * Cache capacity
  * Cache data compression
  * Frequent value compression
  * Frequent pattern compression
  * Low dynamic range
  * Base+Delta encoding
  * Main memory compression
  * IBM MXT
  * Linearly compressed pages

=====Lecture 13=====
  * Fault and Error
  * Fault Detection
  * Fault Tolerance
  * Transient Fault
  * Permanent Fault
  * Space redundancy
  * Time redundancy
  * Lockstepping
  * Simultaneous Redundant Threading (SRT)
  * Sphere of Replication
  * Input Replication
  * Output Comparison
  * Branch Outcome Queue
  * Line Prediction Queue
  * Chip Level Redundant Threading
  * Exception Handling
  * Helper Threading for Prefetching
  * Thread-Based Pre-Execution

=====Lecture 15=====
  * Slipstreaming
  * Instruction Removal
  * Dual Core Execution
  * Thread Level Speculation
  * Conflict Detection
  * Speculative Parallelization
  * Inter-Thread Communication
  * Data Dependences and Versioning
  * Speculative Memory State
  * Multiscalar Processor

=====Lecture 16=====
  * Multiscalar Processor
  * Multiscalar Tasks
  * Register Forwarding
  * Task Sequencing
  * Inter-Task Dependences
  * Address Resolution Buffer
  * Memory Dependence Prediction
  * Store-Load Dependencies
  * Memory Disambiguation
  * Speculative Lock Elision
  * Atomicity
  * Speculative Parallelization
  * Accelerating Critical Section
  * Transactional Lock Removal

=====Lecture 17=====
  * Interconnection Network
  * Network Topology
  * Bus
  * Crossbar
  * Ring
  * Mesh
  * Torus
  * Tree
  * Hypercube
  * Multistage Logarithmic Network
  * Circuit vs. Packet Switching
  * Flow Control
  * Head of Line Blocking
  * Virtual Channel Flow Control
  * Communicating Buffer Availability

=====Lecture 18=====
  * Routing
  * Deadlock
  * Router Design
  * Router Pipeline Optimizations
  * Interconnection Network Performance
  * Packet Scheduling
  * Bufferless Deflection Routing
  * Livelock
  * Packet Reassembly
  * Golden Packet
  * Minimally-Buffered Deflection Routing
  * Side Buffer
  * Heterogeneous Adaptive Throttling
  * Application-Aware Source Throttling
  * Dynamic Throttling Rate Adjustment

=====Lecture 20=====
  * Locks vs. Transactions
  * Transactional Memory
  * Logging/buffering
  * Conflict detection
  * Abort/rollback
  * Commit
  * Routing
  * Deterministic
  * Oblivious
  * Adaptive
  * Deadlock

=====Lecture 21=====
  * Packet Scheduling
  * Stall Time Criticality
  * Memory Level Parallelism
  * Shortest Job First Principle
  * Application Aware
  * Packet Ranking and Batching
  * Slack of Packets
  * Packet Prioritizing using Slack
  * Starvation Avoidance
  * 2-D Mesh, Concentration, Replication
  * Flattened Butterfly
  * Multidrop Express Channels (MECS)
  * Kilo-NoC
  * Network-on-Chip (NoC) Quality of Service (QoS)
  * Topology-Aware QoS

=====Lecture 22=====
  * Data Flow
  * Data Flow Nodes
  * Data Flow Graphs
  * Control Flow vs. Data Flow
  * Static Data Flow
  * Reentrant code (Function calls, Loops)
  * Dynamic Data Flow
  * Frame Pointer
  * Tagging
  * Data Structures
  * I-Structure
  * MIT Tagged Token Data Flow Architecture
  * Manchester Data Flow Machine
  * Combining Data Flow and Control Flow

=====Lecture 23=====
  * Combining Data Flow and Control Flow
  * Macro Dataflow
  * Restricted Data Flow
  * Systolic Architecture
  * Systolic Computation
  * Pipeline Parallelism

=====Lecture 24=====
  * Resource Sharing
  * Shared Resource Management and QoS
  * Resource Sharing vs. Partitioning
  * Multi-core Caching
  * Shared Cache Management
  * Sharing in Main Memory
  * Memory Controller
  * Inter-Thread Interference
  * QoS-Aware Memory Scheduling
  * Stall-Time Fairness
  * Bank Parallelism-Awareness
  * Request Batching
  * Shortest Stall-Time First Ranking
  * Memory Episode Lengths
  * Least Attained Service

=====Lecture 25=====
  * QoS-Aware Memory Request Scheduling
  * Smart/Dumb Resources
  * Throughput vs. Fairness
  * Thread Cluster Memory Scheduling
  * Clustering Threads
  * CPU-GPU Systems
  * Staged Memory Scheduling
  * Parallel Application Memory QoS

=====Lecture 26=====
  * QoS-Aware Memory Systems
  * Smart vs. Dumb Resources
  * Memory Channel Partitioning
  * Application-Awareness
  * Multiple Channels
  * Memory Intensity
  * Row Buffer Locality
  * Preferred Channel
  * Integrated Memory Partitioning and Scheduling
  * Fairness via Source Throttling
  * Dynamic Request Throttling
  * Estimating System Unfairness
  * Inter-Core Interference
  * Row Buffer Interference
  * Memory Interference-induced Slowdown Estimation
  * Shared Memory Performance Predictability
  * Shared Resource Interference
  * Memory Phase Fraction
  * Alone Request Service Rate
  * Shared Request Service Rate
  * "Soft" Slowdown Guarantees

=====Lecture 27=====
  * CPU-GPU Memory Scheduling
  * Batch Formation
  * Batch Scheduler
  * DRAM Command Scheduler
  * Prefetcher Accuracy
  * Feedback-Directed Prefetching
  * Hierarchical Prefetcher Aggressiveness Control
  * Inter-Core Cache Pollution
  * Global Control
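
Several recurring entries above (Amdahl's Law, serial bottleneck, superlinear speedup in Lectures 2 and 5) hinge on one formula. A minimal sketch for reference — the helper below is not from the course materials, just an illustration of the standard formulation Speedup(n) = 1 / ((1 - p) + p/n), where p is the parallelizable fraction and n the number of processors:

```python
# Hypothetical helper illustrating Amdahl's Law: the serial fraction
# (1 - p) caps speedup no matter how many processors are added.

def amdahl_speedup(parallel_fraction: float, n_processors: int) -> float:
    """Speedup = 1 / ((1 - p) + p / n)."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_processors)

# With 95% parallel code, even unlimited cores cannot exceed 1 / 0.05 = 20x
# speedup -- the "serial bottleneck" listed under Lecture 2.
for n in (2, 8, 64, 1024):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

As n grows, the p/n term vanishes and the speedup saturates at 1/(1 - p); this limit is what asymmetric designs such as the ACMP (Lecture 5) attack by running the serial portion on one large core.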