18742: Reading List and Course Plan

(Required readings are marked with *)

Part I: Parallel Computer Architectures

Course Intro, Architecture Review, Amdahl's Law

Syllabus

*Cramming More Components onto Integrated Circuits (AKA: Moore's Law)

*Parallel Architectures (AKA: Flynn's Taxonomy)

*Validity of the single processor approach to achieving large scale computing capabilities (AKA: Amdahl's Law)

Design of Ion-Implanted MOSFET's with Very Small Physical Dimensions (AKA: The 1974 Dennard Scaling Paper)

Parallel Architectures

*Multiscalar processors

*The Case for a Single-chip Multiprocessor


Parallel Execution Strategies

Dataflow Architecture

*WaveScalar

*An Evaluation of the TRIPS computer system

Dataflow execution of sequential imperative programs on multicore architectures

Writing and Executing Parallel Programs

Cache Coherence and Memory Consistency

*Why On-chip Cache Coherence is here to stay

*Token Coherence: Decoupling Performance and Correctness

Memory consistency and event ordering in scalable shared-memory multiprocessors

Memory Consistency Models (Optional)

Foundations of the C++ concurrency Memory Model

x86-TSO: a rigorous and usable programmer’s model for x86 multiprocessors

Synchronization and Transactional Memory

Optimizing Synchronization and Transactional Memory

*Speculative lock elision: enabling highly concurrent multithreaded execution

Inferential queueing and speculative push for reducing critical communication latencies

*Transactional Memory

*(quick skim only) Performance evaluation of Intel® transactional synchronization extensions for high-performance computing

Evaluation of AMD's advanced synchronization facility within a complete transactional memory stack

Making the fast case common and the uncommon case simple in unbounded transactional memory

Software Transactional Memory (Optional)

Software Transactional Memory

Software Transactional Memory: Why is it only a research toy?

Synthesis Lectures on Transactional Memory (AKA: the TM Book)


Memory Consistency Enforcement Mechanisms

Data-race-free and Speculative Models

*DeNovo: Rethinking the Memory Hierarchy for Disciplined Parallelism

*Transactional Memory Coherence and Consistency

*DRFx: a simple and efficient memory model for concurrent programming languages

BulkSC: bulk enforcement of sequential consistency

SARC Coherence: Scaling Directory Cache Coherence in Performance and Power

Memory Consistency Exceptions

*Conflict Exceptions: simplifying concurrent language semantics with precise hardware exceptions for data-races

Valor: efficient, software-only region conflict exceptions

Architecture Support for Concurrent Software Reliability

Detecting and Avoiding Concurrency Bugs (Optional)

Learning from mistakes: a comprehensive study on real world concurrency bug characteristics

A Case for an interleaving constrained shared-memory multi-processor

AVIO: detecting atomicity violations via access interleaving invariants

Cooperative, Empirical Failure Avoidance for Multithreaded Programs

Finding Concurrency Bugs with Context-aware Communication Graphs

Flexible, Hardware Acceleration for Instruction-Grain Lifeguards

Atom-aid: detecting and surviving atomicity violations

Deterministic Execution

*A "flight data recorder" for enabling full-system multiprocessor deterministic replay

*DMP: deterministic shared memory multiprocessing

Grace: safe multithreaded programming for C/C++

CoreDet: a compiler and runtime system for deterministic multithreaded execution


The End of Moore's Law and the Beginning of the Era of Dark Silicon

Power, Energy, and Dark Silicon

*Amdahl's Law in the Multicore Era (paper-pdf)

*Dark Silicon and the End of Multicore Scaling (paper-pdf)

*Power: A First-class Architectural Design Constraint (skim) (paper-pdf)

Power struggles: Revisiting the RISC vs. CISC debate on contemporary ARM and x86 architectures (paper-pdf)

Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors (paper-pdf)


Part II: Heterogeneity, Specialization, and Acceleration

Fused and Composable Heterogeneous Cores

*Core-fusion: accommodating software diversity in chip multiprocessors (paper-pdf)

*Composable, light-weight processors (paper-pdf)


Specialization

Accelerators for Everything

*Conservation cores: reducing the energy of mature computations (paper-pdf)

*QsCores: Trading Dark Silicon for Scalable Energy with Quasi-specific Cores (paper-pdf)

Database and Genomics Accelerators

*Q100: The Architecture and Design of a Database Processing Unit (paper-pdf)

*Darwin: A Genomics Co-processor Provides up to 15,000X Acceleration on Long Read Assembly (paper-pdf)

Hardware support for fine-grained event-driven computation in Anton 2 (paper-pdf)

Machine Learning and Inference Accelerators

*Eyeriss: a spatial architecture for energy-efficient dataflow for convolutional neural networks (paper-pdf)

*In-datacenter Performance Analysis of a Tensor Processing Unit (paper-pdf)

EIE: efficient inference engine on compressed deep neural network (paper-pdf)

DaDianNao: A Machine Learning Supercomputer (paper-pdf)

Reconfigurable Accelerators

*A reconfigurable fabric for accelerating large-scale datacenter services (AKA: The Catapult Paper) (paper-pdf)

*Single-Chip Heterogeneous Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs? (skim) (paper-pdf)

LEAP scratchpads: automatic memory and cache management for reconfigurable logic (paper-pdf)

CoRAM: an in-fabric memory architecture for FPGA-based computing (paper-pdf)

Reconfigurable Dataflow Processors

*RipTide: A programmable, energy-minimal dataflow compiler and architecture (paper-pdf)

*Stream-Dataflow Acceleration (paper-pdf)

Tiled Architectures

*Evaluation of the RAW Microprocessor: An Exposed Wire-delay Architecture for ILP and Streams (paper-pdf)

*A scalable architecture for ordered parallelism (paper-pdf)

Accelerating Irregular Computations

*P-OPT: Practical Optimal Cache Replacement for Graph Analytics (paper-pdf)

*Fifer: Practical Acceleration of Irregular Applications on Reconfigurable Architectures (paper-pdf)

When is Graph Reordering an Optimization? Studying the Effect of Lightweight Graph Reordering Across Applications and Input Graphs (paper-pdf)

Graphicionado: A high-performance accelerator for graph analytics (paper-pdf)

Part III: Emerging Topics

Encrypted Computing