=====Readings=====


=====Lecture 1=====
Required:
  * Hill, Jouppi, Sohi, “Multiprocessors and Multicomputers,” pp. 551-560 in Readings in Computer Architecture. {{:reading_hill_551_560.pdf|pdf}}
  * Hill, Jouppi, Sohi, “Dataflow and Multithreading,” pp. 309-314 in Readings in Computer Architecture. {{:reading_hill_309_314.pdf|pdf}}
  * Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. {{:suleman09-acs.pdf|pdf}}
  * Culler & Singh, Chapter 1
  * Hamming, “You and Your Research,” Bell Communications Research Colloquium Seminar, 7 March 1986. {{http://www.cs.virginia.edu/~robins/YouAndYourResearch.html|here}}

Optional:
  * Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. {{:suleman_feedpipe10.pdf|pdf}}
  * Kumar et al., “Carbon: Architectural Support for Fine-Grained Parallelism on Chip Multiprocessors,” ISCA 2007. {{:kumar07-carbon.pdf|pdf}}

Supplementary Readings on Research, Writing, Reviews:
  * Levin and Redell, “How (and how not) to write a good systems paper,” OSR 1983. {{:systemspaper_levin.pdf|pdf}}
  * Smith, “The Task of the Referee,” IEEE Computer 1990. {{:smith90-referee.pdf|pdf}}
  * SP Jones, “How to Write a Great Research Paper”. {{:jones04-writing-a-paper-slides.pdf|pdf}}
  * Fong, “How to Write a CS Research Paper: A Bibliography”. {{:fong06-writing-papers.pdf|pdf}}

=====Lecture 2=====
Required:
  * Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. {{:hill08_amdahl.pdf|pdf}}
  * Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. {{:annavaram05_amdahl.pdf|pdf}}
  * Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. {{:suleman09-acs.pdf|pdf}}
  * Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. {{:joao12-bottleneck.pdf|pdf}}
  * Ipek et al., “Core Fusion: Accommodating Software Diversity in Chip Multiprocessors,” ISCA 2007. {{:ipek07-fusion.pdf|pdf}}

Optional:
  * Flynn, “Very High-Speed Computing Systems,” Proc. of IEEE, 1966. {{:flynn66_computing.pdf|pdf}}
  * Thornton, “Design of a Computer: The Control Data 6600,” 1970. {{:thornton_cdc6600.pdf|pdf}}
  * Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. {{:smith78_hep.pdf|pdf}}
  * Amdahl, “Validity of the single processor approach to achieving large scale computing capabilities,” AFIPS 1967. {{:amdahl67_singleproc.pdf|pdf}}
  * Eyerman and Eeckhout, “Modeling critical sections in Amdahl's law and its implications for multicore design,” ISCA 2010. {{:eyerman_critsectamdahl.pdf|pdf}}
  * Suleman et al., “Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs,” ASPLOS 2008. {{:suleman_feedback.pdf|pdf}}

=====Lecture 3=====
Required:
  * Hillis and Tucker, "The CM-5 Connection Machine: a scalable supercomputer," CACM 1993. {{:hillis_cm5.pdf|pdf}}
  * Seitz, "The Cosmic Cube," CACM 1985. {{:seitz_cosmiccube.pdf|pdf}}

Optional:
  * Li and Hudak, "Memory Coherence in Shared Virtual Memory Systems," ACM TOCS 1989. {{:li_coherencesharedmem.pdf|pdf}}
  * Batcher, "Architecture of a massively parallel processor," ISCA 1980. {{:batcher_massparproc.pdf|pdf}}
  * Tucker and Robertson, "Architecture and Applications of the Connection Machine," IEEE Computer 1988. {{:tucker_connection.pdf|pdf}}

=====Lecture 4=====
Optional:
  * Moore, "Cramming more components onto integrated circuits," Electronics, 1965. {{:r1_moore.pdf|pdf}}
  * Stark et al., "On pipelining dynamic instruction scheduling logic," MICRO 2000. {{:stark00-scheduling.pdf|pdf}}
  * Olukotun et al., "The Case for a Single-Chip Multiprocessor," ASPLOS 1996. {{:olukutun96_cmp.pdf|pdf}}
  * Kessler, "The Alpha 21264 Microprocessor," IEEE Micro 1999. {{:kessler99-alpha21264.pdf|pdf}}
  * Palacharla et al., "Complexity-effective superscalar processors," ISCA 1997. {{:palacharla97-complexity.pdf|pdf}}

=====Lecture 5=====
Optional:
  * Smith, "A pipelined, shared resource MIMD computer," ICPP 1978. {{:smith78_hep.pdf|pdf}}
  * Barroso et al., "Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing," ISCA 2000. {{:barroso00_piranha.pdf|pdf}}
  * Barroso et al., "Memory system characterization of commercial workloads," ISCA 1998. {{:barroso98-workloads.pdf|pdf}}
  * Ranganathan et al., "Performance of database workloads on shared-memory systems with out-of-order processors," ASPLOS 1998. {{:ranganathan98-workloads.pdf|pdf}}
  * Kongetira et al., “Niagara: A 32-Way Multithreaded SPARC Processor,” IEEE Micro 2005. {{:kongetira05_niagara.pdf|pdf}}
  * Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session, 2005. {{:spracklen05_mt.pdf|pdf}}
  * Chaudhry et al., “Rock: A High-Performance Sparc CMT Processor,” IEEE Micro, 2009. {{:chaudhry_rock.pdf|pdf}}
  * Chaudhry et al., “Simultaneous Speculative Threading: A Novel Pipeline Architecture Implemented in Sun's ROCK Processor,” ISCA 2009. {{:chaudhry_specthread.pdf|pdf}}
  * Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,” HPCA 2003. {{:mutlu_runahead.pdf|pdf}}
  * Mutlu et al., “Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,” IEEE Micro Jan/Feb 2006. {{:mutlu06_efficient.pdf|pdf}}
  * Tendler et al., "POWER4 system microarchitecture," IBM J R&D, 2002. {{:tendler_power4.pdf|pdf}}
  * Kalla et al., "IBM Power5 Chip: A Dual-Core Multithreaded Processor," IEEE Micro 2004. {{:kalla04_power5.pdf|pdf}}
  * Le et al., "IBM POWER6 Microarchitecture," IBM J R&D, 2007. {{:le_power6.pdf|pdf}}
  * Kalla et al., "Power7: IBM’s Next-Generation Server Processor," IEEE Micro 2010. {{:kalla_power7.pdf|pdf}}
  * Grochowski et al., "Best of both Latency and Throughput," ICCD 2004. {{:grochowski_latthrough.pdf|pdf}}
  * Hill and Marty, “Amdahl’s Law in the Multi-Core Era,” IEEE Computer 2008. {{:hill08_amdahl.pdf|pdf}}
  * Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. {{:annavaram05_amdahl.pdf|pdf}}

=====Lecture 6=====
Recommended:
  * Ipek et al., "Core Fusion: Accommodating Software Diversity in Chip Multiprocessors," ISCA 2007. {{:ipek07-fusion.pdf|pdf}}
  * Ausavarungnirun et al., "Staged memory scheduling: achieving high performance and scalability in heterogeneous systems," ISCA 2012. {{:ausavarungnirun12-sms.pdf|pdf}}

Optional:
  * Kumar et al., “Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction,” MICRO 2003. {{:kumar_singleisaheterog.pdf|pdf}}
  * Suleman et al., "Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures," ASPLOS 2009. {{:suleman09-acs.pdf|pdf}}
  * Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. {{:suleman10-acs.pdf|pdf}}
  * Suleman et al., "Data marshaling for multi-core architectures," ISCA 2010. {{:suleman10-marshaling.pdf|pdf}}
  * Suleman et al., "Data Marshaling for Multicore Systems," IEEE Micro 2011. {{:suleman11-marshaling.pdf|pdf}}
  * Joao et al., “Bottleneck Identification and Scheduling in Multithreaded Applications,” ASPLOS 2012. {{:joao12-bottleneck.pdf|pdf}}
  * Kim et al., "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers," HPCA 2010. {{:kim10-atlas.pdf|pdf}}
  * Kim et al., "Thread Cluster Memory Scheduling," MICRO 2010. {{:kim10-tcm.pdf|pdf}}
  * Kim et al., "Thread Cluster Memory Scheduling," IEEE Micro 2011. {{:kim11-tcm.pdf|pdf}}
  * Nychis et al., "Next generation on-chip networks: what kind of congestion control do we need?," HotNets 2010. {{:nychis10-congestion.pdf|pdf}}
  * Das et al., "Application-aware prioritization mechanisms for on-chip networks," MICRO 2009. {{:das09-prioritization.pdf|pdf}}
  * Das et al., "Aérgia: exploiting packet latency slack in on-chip networks," ISCA 2010. {{:das10-aergia.pdf|pdf}}
  * Das et al., "Aérgia: A Network-on-Chip Exploiting Packet Latency Slack," IEEE Micro 2011. {{:das11-aergia.pdf|pdf}}
  * Meza et al., "Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," IEEE CAL 2012. {{:meza12-timber.pdf|pdf}}
  * Suleman et al., "Feedback-driven threading: power-efficient and high-performance execution of multi-threaded workloads on CMPs," ASPLOS 2008. {{:suleman_feedback.pdf|pdf}}
  * Annavaram et al., “Mitigating Amdahl’s Law Through EPI Throttling,” ISCA 2005. {{:annavaram05_amdahl.pdf|pdf}}
  * Morad et al., "Performance, Power Efficiency and Scalability of Asymmetric Cluster Chip Multiprocessors," IEEE CAL 2006. {{:morad_jul05.pdf|pdf}}
  * Suleman et al., "ACMP: Balancing Hardware Efficiency and Programmer Efficiency," HPS Technical Report 2007. {{:TR-HPS-2007-001.pdf|pdf}}
  * Suleman et al., “Feedback-directed pipeline parallelism,” PACT 2010. {{:suleman_feedpipe10.pdf|pdf}}
  * Suleman, "An Asymmetric Multi-core Architecture for Efficiently Accelerating Critical Paths in Multithreaded Programs," PhD thesis 2010. {{:TR-HPS-2010-003.pdf|pdf}}

=====Lecture 7=====
Optional:
  * Lefurgy et al., "Energy Management for Commercial Servers," IEEE Computer 2003. {{:lefurgy03-energy.pdf|pdf}}
  * Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009. {{:lee09-pcm.pdf|pdf}}
  * Lee et al., "Phase-Change Technology and the Future of Main Memory," IEEE Micro 2010. {{:lee10-pcm.pdf|pdf}}
  * Qureshi et al., “Scalable high performance main memory system using phase-change memory technology,” ISCA 2009. {{:qureshi09-pcm.pdf|pdf}}
  * Dhiman et al., "PDRAM: a hybrid PRAM and DRAM main memory system," DAC 2009. {{:dhiman09-pdram.pdf|pdf}}
  * Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. {{:meza12-timber.pdf|pdf}}
  * Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. {{:yoon12-rbla.pdf|pdf}}

=====Lecture 8=====
Optional:
  * Suleman et al., “Data marshaling for multi-core architectures,” ISCA 2010. {{:suleman10-marshaling.pdf|pdf}}
  * Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. {{:suleman09-acs.pdf|pdf}}
  * Suleman et al., “Data Marshaling for Multicore Systems,” IEEE Micro 2011. {{:suleman11-marshaling.pdf|pdf}}
  * Chakraborty et al., "Computation Spreading: Employing Hardware Migration to Specialize CMP Cores on-the-fly," ASPLOS 2006. {{:chakraborty_compspread06.pdf|pdf}}
  * Rangan et al., "Thread Motion: Fine-Grained Power Management for Multi-Core Systems," ISCA 2009. {{:rangan_threadmotion09.pdf|pdf}}

=====Lecture 9=====
Required:
  * Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. {{:spracklen05_mt.pdf|pdf}}
  * Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. {{:kalla04_power5.pdf|pdf}}
  * Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. {{:tullsen96_smt.pdf|pdf}}
  * Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. {{:eyerman07_mlp.pdf|pdf}}

Recommended:
  * Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. {{:hirata92_smt.pdf|pdf}}
  * Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. {{:smith78_hep.pdf|pdf}}
  * Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. {{:gabor_fairthru06.pdf|pdf}}
  * Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. {{:agarwal90-april.pdf|pdf}}

Optional:
  * Kim et al., “Thread Cluster Memory Scheduling,” MICRO 2010. {{:kim10-tcm.pdf|pdf}}
  * Kim et al., “Thread Cluster Memory Scheduling,” IEEE Micro 2011. {{:kim11-tcm.pdf|pdf}}
  * Ausavarungnirun et al., “Staged memory scheduling: achieving high performance and scalability in heterogeneous systems,” ISCA 2012. {{:ausavarungnirun12-sms.pdf|pdf}}
  * Ebrahimi et al., “Parallel Application Memory Scheduling,” MICRO 2011. {{:ebrahimi11-parallel.pdf|pdf}}
  * Meza et al., “Enabling Efficient and Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management,” IEEE CAL 2012. {{:meza12-timber.pdf|pdf}}
  * Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,” ICCD 2012. {{:yoon12-rbla.pdf|pdf}}
  * Thornton, "Design of a Computer: The Control Data 6600," 1970. {{:thornton_cdc6600.pdf|pdf}}
  * Thornton, "Parallel Operation in the Control Data 6600," AFIPS 1964. {{:thornton_parallelcd64.pdf|pdf}}
  * McNairy and Bhatia, “Montecito: A Dual-Core, Dual-Thread Itanium Processor,” IEEE Micro 2005. {{:mcnairy_montecito05.pdf|pdf}}

=====Lecture 10=====
Required:
  * Spracklen and Abraham, “Chip Multithreading: Opportunities and Challenges,” HPCA Industrial Session 2005. {{:spracklen05_mt.pdf|pdf}}
  * Kalla et al., “IBM Power5 Chip: A Dual-Core Multithreaded Processor,” IEEE Micro 2004. {{:kalla04_power5.pdf|pdf}}
  * Tullsen et al., “Exploiting choice: instruction fetch and issue on an implementable simultaneous multithreading processor,” ISCA 1996. {{:tullsen96_smt.pdf|pdf}}
  * Eyerman and Eeckhout, “A Memory-Level Parallelism Aware Fetch Policy for SMT Processors,” HPCA 2007. {{:eyerman07_mlp.pdf|pdf}}

Recommended:
  * Hirata et al., “An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads,” ISCA 1992. {{:hirata92_smt.pdf|pdf}}
  * Smith, “A pipelined, shared resource MIMD computer,” ICPP 1978. {{:smith78_hep.pdf|pdf}}
  * Gabor et al., “Fairness and Throughput in Switch on Event Multithreading,” MICRO 2006. {{:gabor_fairthru06.pdf|pdf}}
  * Agarwal et al., “APRIL: A Processor Architecture for Multiprocessing,” ISCA 1990. {{:agarwal90-april.pdf|pdf}}

Optional:
  * Yamamoto et al., “Performance Estimation of Multistreamed, Superscalar Processors,” HICSS 1994. {{:yamamoto_perfest94.pdf|pdf}}
  * Tullsen et al., “Simultaneous Multithreading: Maximizing On-Chip Parallelism,” ISCA 1995. {{:tullsen_simulmthd95.pdf|pdf}}
  * Snavely and Tullsen, "Symbiotic Jobscheduling for a Simultaneous Multithreading Processor," ASPLOS 2000. {{:snavely_symbioticsched00.pdf|pdf}}
  * Jacobsen et al., "Assigning confidence to conditional branch predictions," MICRO 1996. {{:jacobsen96-confidence.pdf|pdf}}
  * Brown and Tullsen, “Handling Long-latency Loads in a Simultaneous Multithreading Processor,” MICRO 2001. {{:brown_longlatsmt01.pdf|pdf}}
  * El-Moursy and Albonesi, “Front-End Policies for Improved Issue Efficiency in SMT Processors,” HPCA 2003. {{:elmoursy_frontendsmt03.pdf|pdf}}
  * Raasch and Reinhardt, “The Impact of Resource Partitioning on SMT Processors,” PACT 2003. {{:raasch_resourcepartsmt03.pdf|pdf}}
  * Ramirez et al., “Runahead Threads to Improve SMT Performance,” HPCA 2008. {{:ramirez_runaheadsmt08.pdf|pdf}}
  * Van Craeynest et al., "MLP-Aware Runahead Threads in a Simultaneous Multithreading Processor," HiPEAC 2009. {{:vancraeynest09-mlprunahead.pdf|pdf}}
  * Lebeck et al., "A Large, Fast Instruction Window for Tolerating Cache Misses," ISCA 2002. {{:lebeck02-wib.pdf|pdf}}
  * Marr et al., “Hyper-Threading Technology Architecture and Microarchitecture,” Intel Technology Journal 2002. {{:marr_hyperthread02.pdf|pdf}}

=====Lecture 11=====
Optional:
  * Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. {{:qureshi06-UCP.pdf|pdf}}
  * Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. {{:suh02-partitioning.pdf|pdf}}
  * Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. {{:kim04-faircache.pdf|pdf}}
  * Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. {{:qureshi09-asr.pdf|pdf}}
  * Dusser et al., "Zero-Content Augmented Caches," ICS 2009. {{:dusser09-zerocontent.pdf|pdf}}
  * Islam and Stenstrom, "Zero-Value Caches: Cancelling Loads that Return Zero," PACT 2009. {{:islam09-zerovalue.pdf|pdf}}
  * Yang et al., "Frequent Value Compression in Data Caches," MICRO 2000. {{:yang00-compression.pdf|pdf}}
  * Alameldeen and Wood, "Adaptive Cache Compression for High-Performance Processors," ISCA 2004. {{:alameldeen04-cachecompression.pdf|pdf}}
  * Thoziyoor et al., "A Comprehensive Memory Modeling Tool and Its Application to the Design and Analysis of Future Memory Hierarchies," ISCA 2008. {{:thoziyoor08-modeling.pdf|pdf}}
  * Ekman and Stenstrom, "A Robust Main-Memory Compression Scheme," ISCA 2005. {{:ekman05-memcompression.pdf|pdf}}
  * Pekhimenko et al., "Base-Delta-Immediate Compression: Practical Data Compression for On-Chip Caches," PACT 2012. {{:pekhimenko12-bdi.pdf|pdf}}
  * Ubal et al., "Multi2Sim: A Simulation Framework for CPU-GPU Computing," PACT 2012. {{:ubal12-multi2sim.pdf|pdf}}
  * Chen et al., "C-Pack: A High-Performance Microprocessor Cache Compression Algorithm," VLSI 2010. {{:chen10-cpack.pdf|pdf}}
  * Magnusson et al., "Simics: A full system simulation platform," IEEE Computer 2002. {{:magnusson02-simics.pdf|pdf}}
  * Tremaine et al., "Pinnacle: IBM MXT in a memory controller chip," IEEE Micro 2001. {{:tremaine01-mxt.pdf|pdf}}

=====Lecture 12=====
Optional:
  * Johnson and Hwu, "Run-Time Adaptive Cache Hierarchy Management via Reference Analysis," ISCA 1997. {{:johnson97-reference.pdf|pdf}}
  * Piquet et al., "Exploiting single-usage for effective memory management," ACSAC 2007. {{:piquet07-singleusage.pdf|pdf}}
  * Wu et al., "SHIP: Signature-based hit predictor for high performance caching," MICRO 2011. {{:wu11-ship.pdf|pdf}}
  * Qureshi et al., "Adaptive insertion policies for high performance caching," ISCA 2007. {{:qureshi07-dip.pdf|pdf}}
  * Jaleel et al., "Adaptive insertion policies for managing shared caches," PACT 2008. {{:jaleel08-tadip.pdf|pdf}}
  * Jaleel et al., "High performance cache replacement using re-reference interval prediction," ISCA 2010. {{:jaleel10-rrip.pdf|pdf}}
  * Xie and Loh, "PIPP: Promotion/insertion pseudo-partitioning of multi-core shared caches," ISCA 2009. {{:xie09-pipp.pdf|pdf}}
  * Cho and Jin, “Managing Distributed, Shared L2 Caches through OS-Level Page Allocation,” MICRO 2006. {{:cho06-coloring.pdf|pdf}}
  * Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. {{:lin08-partitioning.pdf|pdf}}

=====Lecture 13=====
Optional:
  * Reinhardt and Mukherjee, “Transient Fault Detection via Simultaneous Multithreading,” ISCA 2000. {{:reinhardt_faultdetectsmt00.pdf|pdf}}
  * Rotenberg, "AR-SMT: a microarchitectural approach to fault tolerance in microprocessors," Fault-Tolerant Computing 1999. {{:rotenberg99-ar-smt.pdf|pdf}}
  * Mukherjee et al., “Detailed Design and Evaluation of Redundant Multithreading Alternatives,” ISCA 2002. {{:mukherjee_redunmt02.pdf|pdf}}
  * Kessler, “The Alpha 21264 Microprocessor,” IEEE Micro 1999. {{:kessler99-alpha21264.pdf|pdf}}
  * Austin, “DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design,” MICRO 1999. {{:austin_diva99.pdf|pdf}}
  * Qureshi et al., “Microarchitecture-Based Introspection: A Technique for Transient-Fault Tolerance in Microprocessors,” DSN 2005. {{:qureshi_faulttol05.pdf|pdf}}
  * Zilles et al., “The use of multithreading for exception handling,” MICRO 1999. {{:zilles_exception99.pdf|pdf}}
  * Dubois and Song, “Assisted Execution,” USC Tech Report 1998. {{:dubois_assisted98.pdf|pdf}}
  * Chappell et al., “Simultaneous Subordinate Microthreading (SSMT),” ISCA 1999. {{:chappell_ssmt99.pdf|pdf}}
  * Chappell et al., "Difficult-path branch prediction using subordinate microthreads," ISCA 2002. {{:chappell02-prediction.pdf|pdf}}
  * Zilles and Sohi, “Execution-based Prediction Using Speculative Slices,” ISCA 2001. {{:zilles_specprediction01.pdf|pdf}}

=====Lecture 15=====
Required:
  * Sohi et al., “Multiscalar Processors,” ISCA 1995. {{:sohi95.pdf|pdf}}
  * Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. {{:herlihy93.pdf|pdf}}

Recommended:
  * Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. {{:rajwar01.pdf|pdf}}
  * Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. {{:colohan00.pdf|pdf}}
  * Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. {{:akkary_dynmthread98.pdf|pdf}}

Optional:
  * Luk, "Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors," ISCA 2001. {{:luk01-preexecution.pdf|pdf}}
  * Sundaramoorthy et al., “Slipstream Processors: Improving both Performance and Fault Tolerance,” ASPLOS 2000. {{:sundaramoorthy_slipstream00.pdf|pdf}}
  * Zhou, “Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window,” PACT 2005. {{:zhou_scaleinstwindow00.pdf|pdf}}
  * Snavely and Tullsen, “Symbiotic Jobscheduling for a Simultaneous Multithreading Processor,” ASPLOS 2000. {{:snavely_symbioticsched00.pdf|pdf}}
  * Gopal et al., “Speculative Versioning Cache,” HPCA 1998. {{:gopal_speculativeversioning98.pdf|pdf}}
  * Franklin and Sohi, “The expandable split window paradigm for exploiting fine-grain parallelism,” ISCA 1992. {{:franklin_windowpar92.pdf|pdf}}

=====Lecture 16=====
Required:
  * Sohi et al., “Multiscalar Processors,” ISCA 1995. {{:sohi95.pdf|pdf}}
  * Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. {{:herlihy93.pdf|pdf}}

Recommended:
  * Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. {{:rajwar01.pdf|pdf}}
  * Steffan et al., “A Scalable Approach to Thread-Level Speculation,” ISCA 2000. {{:colohan00.pdf|pdf}}
  * Akkary and Driscoll, “A dynamic multithreading processor,” MICRO 1998. {{:akkary_dynmthread98.pdf|pdf}}

Optional:
  * Franklin and Sohi, “ARB: A hardware mechanism for dynamic reordering of memory references,” IEEE TC 1996. {{:franklin_arb96.pdf|pdf}}
  * Vijaykumar and Sohi, "Task selection for a multiscalar processor," MICRO 1998. {{:vijaykumar98-selection.pdf|pdf}}
  * Moshovos et al., “Dynamic Speculation and Synchronization of Data Dependences,” ISCA 1997. {{:moshovos_datadep97.pdf|pdf}}
  * Chrysos and Emer, “Memory Dependence Prediction using Store Sets,” ISCA 1998. {{:chrysos_memorydependence98.pdf|pdf}}
  * Martinez and Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," ASPLOS 2002. {{:martinez_specsync02.pdf|pdf}}
  * Rajwar and Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," ASPLOS 2002. {{:rajwar_tlr02.pdf|pdf}}
  * Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multi-Core Architectures,” ASPLOS 2009. {{:suleman09-acs.pdf|pdf}}
  * Suleman et al., “Accelerating Critical Section Execution with Asymmetric Multicore Architectures,” IEEE Micro 2010. {{:suleman10-acs.pdf|pdf}}
  * Shavit and Touitou, "Software transactional memory," PODC 1995. {{:shavit95-swtm.pdf|pdf}}

=====Lecture 17=====
Required:
  * Dally, “Virtual Channel Flow Control,” ISCA 1990. {{:dally90-vcflow.pdf|pdf}}
  * Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004. {{:mullins04.pdf|pdf}}
  * Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. {{:moscibroda09.pdf|pdf}}
  * Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007. {{:wentzlaff07.pdf|pdf}}
  * Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. {{:patel_procmeminterconnect79.pdf|pdf}}

Recommended:
  * Fallin et al., "CHIPPER: A Low-Complexity, Bufferless Deflection Router," HPCA 2011. {{:fallin_chipper11.pdf|pdf}}
  * Fallin et al., "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect," NOCS 2012. {{:fallin12-minbd.pdf|pdf}}
  * Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006. {{:bjerregaard_nocsurvey.pdf|pdf}}

Optional:
  * Hillis and Tucker, "The CM-5 Connection Machine: a scalable supercomputer," CACM 1993. {{:hillis_cm5.pdf|pdf}}
  * Das et al., "Design and Evaluation of a Hierarchical On-Chip Interconnect for Next-Generation CMPs," HPCA 2009. {{:das09-hierarchical.pdf|pdf}}
  * Seitz, “The Cosmic Cube,” CACM 1985. {{:seitz_cosmiccube.pdf|pdf}}
  * Gottlieb et al., “The NYU Ultracomputer-designing a MIMD, shared-memory parallel machine,” ISCA 1982. {{:gottlieb_ultracomputer82.pdf|pdf}}

=====Lecture 18=====
Required:
  * Dally, “Virtual Channel Flow Control,” ISCA 1990. {{:dally90-vcflow.pdf|pdf}}
  * Mullins et al., “Low-Latency Virtual-Channel Routers for On-Chip Networks,” ISCA 2004. {{:mullins04.pdf|pdf}}
  * Wentzlaff et al., “On-Chip Interconnection Architecture of the Tile Processor,” IEEE Micro 2007. {{:wentzlaff07.pdf|pdf}}
  * Fallin et al., "CHIPPER: A Low-Complexity, Bufferless Deflection Router," HPCA 2011. {{:fallin_chipper11.pdf|pdf}}
  * Fallin et al., "MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect," NOCS 2012. {{:fallin12-minbd.pdf|pdf}}
  * Patel, “Processor-Memory Interconnections for Multiprocessors,” ISCA 1979. {{:patel_procmeminterconnect79.pdf|pdf}}

Recommended:
  * Moscibroda and Mutlu, “A Case for Bufferless Routing in On-Chip Networks,” ISCA 2009. {{:moscibroda09.pdf|pdf}}
  * Bjerregaard and Mahadevan, “A Survey of Research and Practices of Network-on-Chip,” ACM Computing Surveys (CSUR) 2006. {{:bjerregaard_nocsurvey.pdf|pdf}}
  * Chang et al., "HAT: Heterogeneous Adaptive Throttling for On-Chip Networks," SBAC-PAD 2012. {{:chang12-hat.pdf|pdf}}

Optional:
  * Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. {{:glass_turnmodel92.pdf|pdf}}
  * Galles, “Spider: A High-Speed Network Interconnect,” IEEE Micro 1997. {{:galles97-spider.pdf|pdf}}

=====Lecture 20=====
Optional:
  * Gurd et al., "The Manchester prototype dataflow computer," CACM 1985. {{:gurd95.pdf|pdf}}
  * Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994. {{:lee_dataflow94.pdf|pdf}}
  * Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985. {{:patt85.pdf|pdf}}
  * Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985. {{:patt85-hpsissues.pdf|pdf}}
  * Herlihy and Moss, “Transactional Memory: Architectural Support for Lock-Free Data Structures,” ISCA 1993. {{:herlihy93.pdf|pdf}}
  * Rajwar and Goodman, “Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution,” MICRO 2001. {{:rajwar01.pdf|pdf}}
  * Martinez and Torrellas, "Speculative Synchronization: Applying Thread-Level Speculation to Explicitly Parallel Applications," ASPLOS 2002. {{:martinez_specsync02.pdf|pdf}}
  * Rajwar and Goodman, "Transactional Lock-Free Execution of Lock-Based Programs," ASPLOS 2002. {{:rajwar_tlr02.pdf|pdf}}
  * Shavit and Touitou, "Software transactional memory," PODC 1995. {{:shavit95-swtm.pdf|pdf}}
  * Dice et al., "Early experience with a commercial hardware transactional memory implementation," ASPLOS 2009. {{:dice09-transactional.pdf|pdf}}
  * Wang et al., "Evaluation of Blue Gene/Q hardware support for transactional memories," PACT 2012. {{:wang12-transactional.pdf|pdf}}
  * Glass and Ni, “The Turn Model for Adaptive Routing,” ISCA 1992. {{:glass_turnmodel92.pdf|pdf}}

=====Lecture 21=====
Optional:
  * Gurd et al., "The Manchester prototype dataflow computer," CACM 1985. {{:gurd95.pdf|pdf}}
  * Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994. {{:lee_dataflow94.pdf|pdf}}
  * Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985. {{:patt85.pdf|pdf}}
  * Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985. {{:patt85-hpsissues.pdf|pdf}}
  * Sankaralingam et al., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture,” ISCA 2003. {{:sankaralingam_itdlp03.pdf|pdf}}
  * Burger et al., “Scaling to the End of Silicon with EDGE Architectures,” IEEE Computer 2004. {{:burger_edge04.pdf|pdf}}
  * Das et al., "Application-aware prioritization mechanisms for on-chip networks," MICRO 2009. {{:das09-prioritization.pdf|pdf}}
  * Das et al., "Aérgia: exploiting packet latency slack in on-chip networks," ISCA 2010. {{:das10-aergia.pdf|pdf}}
  * Grot et al., "Express Cube Topologies for On-Chip Interconnects," HPCA 2009. {{:grot_expresscube09.pdf|pdf}}
  * Grot et al., “Kilo-NOC: A Heterogeneous Network-on-Chip Architecture for Scalability and Service Guarantees,” ISCA 2011. {{:grot11-kilonoc.pdf|pdf}}
  * Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. {{:grot09-pvc.pdf|pdf}}

=====Lecture 22=====
Optional:
  * Gurd et al., "The Manchester prototype dataflow computer," CACM 1985. {{:gurd95.pdf|pdf}}
  * Lee and Hurson, "Dataflow Architectures and Multithreading," IEEE Computer 1994. {{:lee_dataflow94.pdf|pdf}}
  * Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985. {{:patt85.pdf|pdf}}
  * Patt et al., "Critical issues regarding HPS, a high performance microarchitecture," MICRO 1985. {{:patt85-hpsissues.pdf|pdf}}
  * Sankaralingam et al., “Exploiting ILP, TLP and DLP with the Polymorphous TRIPS Architecture,” ISCA 2003. {{:sankaralingam_itdlp03.pdf|pdf}}
  * Burger et al., “Scaling to the End of Silicon with EDGE Architectures,” IEEE Computer 2004. {{:burger_edge04.pdf|pdf}}
  * Dennis and Misunas, "A preliminary architecture for a basic data flow processor," ISCA 1974. {{:dennis74.pdf|pdf}}
  * Treleaven et al., “Data-Driven and Demand-Driven Computer Architecture,” ACM Computing Surveys 1982. {{:treleaven82.pdf|pdf}}
  * Veen, “Dataflow Machine Architecture,” ACM Computing Surveys 1986. {{:veen86.pdf|pdf}}
  * Arvind and Nikhil, "Executing a program on the MIT tagged-token dataflow architecture," IEEE TC 1990. {{:arvind90.pdf|pdf}}
  * Hwu and Patt, “HPSm, a high performance restricted data flow architecture having minimal functionality,” ISCA 1986. {{:hwu86-hpsm.pdf|pdf}}

=====Lecture 23=====
Optional:
  * Sakai et al., “An Architecture of a Dataflow Single Chip Processor,” ISCA 1989. {{:sakai_dataflow89.pdf|pdf}}
  * Patt et al., "HPS, a new microarchitecture: rationale and introduction," MICRO 1985. {{:patt85.pdf|pdf}}
  * Colwell, "The Pentium Chronicles," Wiley-IEEE Computer Society Press 2005.
  * Kung, “Why Systolic Architectures?,” IEEE Computer 1982. {{:kung_systolic82.pdf|pdf}}
  * Annaratone et al., “Warp Architecture and Implementation,” ISCA 1986. {{:annaratone_warparch86.pdf|pdf}}
  * Annaratone et al., “The Warp Computer: Architecture, Implementation, and Performance,” IEEE TC 1987. {{:annaratone_warpperf87.pdf|pdf}}

=====Lecture 24=====
Required:
  * Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. {{:mph_usenix_security07.pdf|pdf}}
  * Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. {{:mutlu07.pdf|pdf}}
  * Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. {{:kim10-atlas.pdf|pdf}}
  * Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. {{:mcp_micro2011.pdf|pdf}}
  * Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. {{:sms_isca12.pdf|pdf}}
  * Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. {{:lee_prefetchdram08.pdf|pdf}}
  * Qureshi and Patt, “Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches,” MICRO 2006. {{:qureshi06-ucp.pdf|pdf}}
  * Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. {{:kim04-faircache.pdf|pdf}}
  * Qureshi, “Adaptive Spill-Receive for Robust High-Performance Caching in CMPs,” HPCA 2009. {{:qureshi09-asr.pdf|pdf}}
  * Hardavellas et al., “Reactive NUCA: Near-Optimal Block Placement and Replication in Distributed Caches,” ISCA 2009. {{:hardavellas09_rnuca.pdf|pdf}}

Recommended:
  * Rixner et al., “Memory Access Scheduling,” ISCA 2000. {{:rixner00.pdf|pdf}}
  * Zheng et al., “Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency,” MICRO 2008. {{:zheng08.pdf|pdf}}
  * Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008. {{:ipek08-selfoptimizing.pdf|pdf}}
  * Kim et al., “An Adaptive, Non-Uniform Cache Structure for Wire-Delay Dominated On-Chip Caches,” ASPLOS 2002. {{:kim02_nuca.pdf|pdf}}
  * Qureshi et al., “Adaptive Insertion Policies for High-Performance Caching,” ISCA 2007. {{:qureshi07_adaptive.pdf|pdf}}
  * Lin et al., “Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems,” HPCA 2008. {{:lin08-partitioning.pdf|pdf}}

Optional:
  * Suh et al., “A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning,” HPCA 2002. {{:suh02-partitioning.pdf|pdf}}
  * Grot et al., “Preemptive Virtual Clock: A Flexible, Efficient, and Cost-effective QOS Scheme for Networks-on-Chip,” MICRO 2009. {{:grot09-pvc.pdf|pdf}}

=====Lecture 25=====
Required:
  * Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. {{:mph_usenix_security07.pdf|pdf}}
  * Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. {{:mutlu07.pdf|pdf}}
  * Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. {{:kim10-atlas.pdf|pdf}}
  * Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. {{:mcp_micro2011.pdf|pdf}}
  * Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. {{:sms_isca12.pdf|pdf}}
  * Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. {{:lee_prefetchdram08.pdf|pdf}}

Recommended:
  * Rixner et al., “Memory Access Scheduling,” ISCA 2000. {{:rixner00.pdf|pdf}}
  * Zheng et al., “Mini-Rank: Adaptive DRAM Architecture for Improving Memory Power Efficiency,” MICRO 2008. {{:zheng08.pdf|pdf}}
  * Ipek et al., “Self Optimizing Memory Controllers: A Reinforcement Learning Approach,” ISCA 2008. {{:ipek08-selfoptimizing.pdf|pdf}}

Optional:
  * Moscibroda and Mutlu, “Distributed Order Scheduling and Its Application to Multi-Core DRAM Controllers,” PODC 2008. {{:moscibroda08-order.pdf|pdf}}
  * Waldspurger and Weihl, “Lottery Scheduling: Flexible Proportional-Share Resource Management,” OSDI 1994. {{:waldspurger94-lottery.pdf|pdf}}

=====Lecture 26=====
Required:
  * Muralidhara et al., “Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning,” MICRO 2011. {{:mcp_micro2011.pdf|pdf}}
  * Ebrahimi et al., “Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems,” ASPLOS 2010. {{:ebrahimi_throttle10.pdf|pdf}}
  * Subramanian et al., “MISE: Providing Performance Predictability in Shared Main Memory Systems,” HPCA 2013.

Recommended:
  * Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” MICRO 2010. {{:kim10-tcm.pdf|pdf}}
  * Rixner et al., “Memory Access Scheduling,” ISCA 2000. {{:rixner00.pdf|pdf}}
  * Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. {{:kim10-atlas.pdf|pdf}}
  * Kim et al., “Fair Cache Sharing and Partitioning in a Chip Multiprocessor Architecture,” PACT 2004. {{:kim04-faircache.pdf|pdf}}
  * Mutlu and Moscibroda, “Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems,” ISCA 2008. {{:mutlu08-parbs.pdf|pdf}}
  * Moscibroda and Mutlu, “Memory Performance Attacks,” USENIX Security 2007. {{:mph_usenix_security07.pdf|pdf}}
  * Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. {{:mutlu07.pdf|pdf}}

=====Lecture 27=====
Required:
  * Ausavarungnirun et al., “Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems,” ISCA 2012. {{:sms_isca12.pdf|pdf}}
  * Ebrahimi et al., “Coordinated Control of Multiple Prefetchers in Multi-Core Systems,” HPCA 2009. {{:ebrahimi09-prefetchers.pdf|pdf}}

Recommended:
  * Rixner et al., “Memory Access Scheduling,” ISCA 2000. {{:rixner00.pdf|pdf}}
  * Kim et al., “ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers,” HPCA 2010. {{:kim10-atlas.pdf|pdf}}
  * Kim et al., “Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior,” MICRO 2010. {{:kim10-tcm.pdf|pdf}}
  * Mutlu and Moscibroda, “Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors,” MICRO 2007. {{:mutlu07.pdf|pdf}}
  * Srinath et al., “Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers,” HPCA 2007. {{:srinath07-fdp.pdf|pdf}}
  * Zhuang and Lee, “A Hardware-based Cache Pollution Filtering Mechanism for Aggressive Prefetches,” ICPP 2003. {{:zhuang03-prefetch.pdf|pdf}}
  * Lee et al., “Prefetch-Aware DRAM Controllers,” MICRO 2008. {{:lee_prefetchdram08.pdf|pdf}}