Readings
Lecture 1
Required:
- Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560 in Readings in Computer Architecture. pdf
- Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in Readings in Computer Architecture. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
- Culler and Singh, Chapter 1
- Hamming, “You and Your Research,” Bell Communications Research Colloquium Seminar, 7 March 1986. here
Optional:
- Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. pdf
- Kumar et al., “Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors,” ISCA 2007. pdf
Supplementary Readings on Research, Writing, Reviews:
Lecture 2
Required:
- Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. pdf
- Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
- Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. pdf
- Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. pdf
Optional:
- Mike Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966. pdf
- Thornton, “Design of a Computer: The Control Data 6600,” 1970. pdf
- Burton Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
- Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. pdf
- Eyerman and Eeckhout, “Modeling critical sections in Amdahl's law and its implications for multicore design,” ISCA 2010. pdf
- Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. pdf
Lecture 3
Required:
- Hillis and Tucker, “The CM-5 Connection Machine: a scalable supercomputer,” CACM 1993. pdf
- Seitz, “The Cosmic Cube,” CACM 1985. pdf
Optional:
Lecture 4
Optional:
- Moore, “Cramming more components onto integrated circuits,” Electronics, 1965. pdf
- Stark et al., “On pipelining dynamic instruction scheduling logic,” MICRO 2000. pdf
- Olukotun et al., “The Case for a Single-Chip Multiprocessor,” ASPLOS 1996. pdf
- Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. pdf
- Palacharla et al., “Complexity-effective superscalar processors,” ISCA 1997. pdf
Lecture 5
Optional:
- Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
- Barroso et al., “Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing,” ISCA 2000. pdf
- Barroso et al., “Memory system characterization of commercial workloads,” ISCA 1998. pdf
- Ranganathan et al., “Performance of database workloads on shared-memory systems with out-of-order processors,” ASPLOS 1998. pdf
- Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005. pdf
- Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005. pdf
- Chaudhry et al., “Rock: A High-Performance Sparc CMT Processor,” IEEE Micro, 2009. pdf
- Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor,” ISCA 2009. pdf
- Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003. pdf
- Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro Jan/Feb 2006. pdf
- Tendler et al., “POWER4 system microarchitecture,” IBM J R&D, 2002. pdf
- Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
- Le et al., “IBM POWER6 Microarchitecture,” IBM J R&D, 2007. pdf
- Kalla et al., “Power7: IBM’s Next-Generation Server Processor,” IEEE Micro 2010. pdf
- Grochowski et al., “Best of both Latency and Throughput,” ICCD 2004. pdf
- Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. pdf
- Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
Lecture 6
Recommended:
- Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. pdf
- Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. pdf
Optional:
- Kumar et al., “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction,” MICRO 2003. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. pdf
- Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. pdf
- Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. pdf
- Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. pdf
- Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. pdf
- Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
- Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. pdf
- Nychis et al., “Next generation on-chip networks: what kind of congestion control do we need?,” HotNets 2010. pdf
- Das et al., “Application-aware prioritization mechanisms for on-chip networks,” MICRO 2009. pdf
- Das et al., “Aérgia: exploiting packet latency slack in on-chip networks,” ISCA 2010. pdf
- Das et al., “Aérgia: A Network-on-Chip Exploiting Packet Latency Slack,” IEEE Micro 2011. pdf
- Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
- Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. pdf
- Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. pdf
- Morad et al., “Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors,” IEEE CAL 2006. pdf
- Suleman et al., “ACMP: Balancing Hardware Efficiency and Programmer Efficiency,” HPS Technical Report 2007. pdf
- Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. pdf
- Suleman, “An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs,” PhD thesis 2010. pdf
Lecture 7
Optional:
- Lefurgy et al., “Energy Management for Commercial Servers,” IEEE Computer 2003. pdf
- Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009. pdf
- Lee et al., “Phase-Change Technology and the Future of Main Memory,” IEEE Micro 2010. pdf
- Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. pdf
- Dhiman et al., “PDRAM: a hybrid PRAM and DRAM main memory system,” DAC 2009. pdf
- Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
- Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. pdf
Lecture 8
Optional:
- Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
- Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. pdf
- Chakraborty et al., “Computation Spreading: Employing Hardware Migration to Specialize CMP Cores on-the-fly,” ASPLOS 2006. pdf
- Rangan et al., “Thread Motion: Fine-Grained Power Management for Multi-Core Systems,” ISCA 2009. pdf
Lecture 9
Required:
- Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. pdf
- Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
- Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. pdf
- Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf
Recommended:
- Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. pdf
- Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
- Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. pdf
- Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. pdf
Optional:
- Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. pdf
- Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. pdf
- Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. pdf
- Ebrahimi et al., “Parallel Application Memory Scheduling,” MICRO 2011. pdf
- Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. pdf
- Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. pdf
- Thornton, “Design of a Computer: The Control Data 6600,” 1970. pdf
- Thornton, “Parallel Operation in the Control Data 6600,” AFIPS 1964. pdf
- McNairy and Bhatia, “Montecito: A Dual-Core, Dual-Thread Itanium Processor,” IEEE Micro 2005. pdf
Lecture 10
Required:
- Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. pdf
- Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. pdf
- Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. pdf
- Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. pdf
Recommended:
- Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. pdf
- Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. pdf
- Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. pdf
- Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. pdf
Optional:
- Yamamoto et al., “Performance Estimation of Multistreamed, Superscalar Processors,” HICSS 1994. pdf
- Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995. pdf
- Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. pdf
- Jacobsen et al., “Assigning confidence to conditional branch predictions,” MICRO 1996. pdf
- Tullsen and Brown, “Handling Long-latency Loads in a Simultaneous Multithreading Processor,” MICRO 2001. pdf
- El-Moursy and Albonesi, “Front-End Policies for Improved Issue Efficiency in SMT Processors,” HPCA 2003. pdf
- Raasch and Reinhardt, “The Impact of Resource Partitioning on SMT Processors,” PACT 2003. pdf
- Ramirez et al., “Runahead Threads to Improve SMT Performance,” HPCA 2008. pdf
- Van Craeynest et al., “MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor,” HiPEAC 2009. pdf
- Lebeck et al., “A Large, Fast Instruction Window for Tolerating Cache Misses,” ISCA 2002. pdf
- Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal 2002. pdf
Lecture 11
Optional:
- Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. pdf
- Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. pdf
- Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. pdf
- Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. pdf
- Dusser et al., “Zero-Content Augmented Caches,” ICS 2009. pdf
- Islam and Stenstrom, “Zero-Value Caches: Cancelling Loads that Return Zero,” PACT 2009. pdf
- Yang et al., “Frequent Value Compression in Data Caches,” MICRO 2000. pdf
- Alameldeen and Wood, “Adaptive Cache Compression for High-Performance Processors,” ISCA 2004. pdf
- Thoziyoor et al., “A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies,” ISCA 2008. pdf
- Ekman and Stenstrom, “A Robust Main-Memory Compression Scheme,” ISCA 2005. pdf
- Pekhimenko et al., “Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches,” PACT 2012. pdf
- Ubal et al., “Multi2Sim: A Simulation Framework for CPU-GPU Computing,” PACT 2012. pdf
- Chen et al., “C-Pack: A High-Performance Microprocessor Cache Compression Algorithm,” VLSI 2010. pdf
- Magnusson et al., “Simics: A full system simulation platform,” Computer 2002. pdf
- Tremaine et al., “Pinnacle: IBM MXT in a memory controller chip,” IEEE Micro 2001. pdf
Lecture 12
Optional:
- Johnson and Hwu, “Run-Time Adaptive Cache Hierarchy Management via Reference Analysis,” ISCA 1997. pdf
- Piquet et al., “Exploiting single-usage for effective memory management,” ACSAC 2007. pdf
- Wu et al., “SHiP: Signature-based hit predictor for high performance caching,” MICRO 2011. pdf
- Qureshi et al., “Adaptive insertion policies for high performance caching,” ISCA 2007. pdf
- Jaleel et al., “Adaptive insertion policies for managing shared caches,” PACT 2008. pdf
- Jaleel et al., “High performance cache replacement using re-reference interval prediction,” ISCA 2010. pdf
- Xie and Loh, “PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches,” ISCA 2009. pdf
- Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. pdf
- Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. pdf
Lecture 13
Optional:
- Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000. pdf
- Rotenberg, “AR-SMT: a microarchitectural approach to fault tolerance in microprocessors,” Fault-Tolerant Computing 1999. pdf
- Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002. pdf
- Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. pdf
- Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999. pdf
- Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005. pdf
- Zilles et al., “The use of multithreading for exception handling,” MICRO 1999. pdf
- Dubois and Song, “Assisted Execution,” USC Tech Report 1998. pdf
- Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999. pdf
- Chappell et al., “Difficult-path branch prediction using subordinate microthreads,” ISCA 2002. pdf
- Zilles and Sohi, “Execution-based Prediction Using Speculative Slices,” ISCA 2001. pdf
Lecture 15
Required:
- Sohi et al., “Multiscalar Processors,” ISCA 1995. pdf
- Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf
Recommended:
- Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
- Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. pdf
- Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. pdf
Optional:
- Luk, “Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors,” ISCA 2001. pdf
- Sundaramoorthy et al., “Slipstream Processors: Improving both Performance and Fault Tolerance,” ASPLOS 2000. pdf
- Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” PACT 2005. pdf
- Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. pdf
- Gopal et al., “Speculative Versioning Cache,” HPCA 1998. pdf
- Franklin and Sohi, “The expandable split window paradigm for exploiting fine-grain parallelism,” ISCA 1992. pdf
Lecture 16
Required:
- Sohi et al., “Multiscalar Processors,” ISCA 1995. pdf
- Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. pdf
Recommended:
- Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. pdf
- Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. pdf
- Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. pdf
Optional:
- Franklin and Sohi, “ARB: A hardware mechanism for dynamic reordering of memory references,” IEEE TC 1996. pdf
- Vijaykumar and Sohi, “Task selection for a multiscalar processor,” MICRO 1998. pdf
- Moshovos et al., “Dynamic Speculation and Synchronization of Data Dependences,” ISCA 1997. pdf
- Chrysos and Emer, “Memory Dependence Prediction using Store Sets,” ISCA 1998. pdf
- Martinez and Torrellas, “Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications,” ASPLOS 2002. pdf
- Rajwar and Goodman, “Transactional Lock-Free Execution of Lock-Based Programs,” ASPLOS 2002. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. pdf
- Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. pdf
- Shavit and Touitou, “Software transactional memory,” PODC 1995. pdf