Differences

This shows you the differences between two versions of the page.

readings [2010/10/06 20:38]
lsubrama
readings [2010/12/04 06:00] (current)
vseshadr
Line 74: Line 74:
  * {{anefficientalgorithmforexploitingmultiplearithmeticunits.pdf|Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967}}   * {{anefficientalgorithmforexploitingmultiplearithmeticunits.pdf|Tomasulo, “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” IBM Journal of R&D, Jan. 1967}}
-===== For Lecture 8 =====+===== For Lecture 9 =====
== Required Readings == == Required Readings ==
  * {{improvingdirectmappedcacheperformance.pdf|Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990}}   * {{improvingdirectmappedcacheperformance.pdf|Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990}}
  * {{acaseformlpawarecachereplacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006}}   * {{acaseformlpawarecachereplacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006}}
  * {{slavememoriesanddynamicstorageallocation.pdf|Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965}}   * {{slavememoriesanddynamicstorageallocation.pdf|Wilkes, “Slave Memories and Dynamic Storage Allocation,” IEEE Trans. on Electronic Computers, 1965}}
 +
 +===== For Lecture 10 =====
 +== Required Readings ==
 +  * {{runaheadexecution.pdf|Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," HPCA 2003}}
 +  * {{dualcoreexecution.pdf|Zhou, "Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window," PACT 2005}}
 +  * {{efficientrunaheadexecution.pdf|Mutlu et al., "Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance," IEEE Micro Top Picks 2006}}
 +
 +== Suggested Readings ==
 +
 +  * {{memorydependenceprediction.pdf|Chrysos and Emer, "Memory Dependence Prediction Using Store Sets," ISCA 1998}}
 +
 +
 +===== For Lecture 11 =====
 +== Required Readings ==
 +  * Hennessy and Patterson, Appendix C.1-C.3
 +  * {{improvingdirectmappedcacheperformance.pdf|Jouppi, “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990}}
 +  * {{acaseformlpawarecachereplacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006}}
 +
 +== Suggested Readings ==
 +  * {{twowayskewedassociativecaches.pdf|Seznec, "A Case for Two-way Skewed Associative Caches," ISCA 1993}}
 +  * {{cacheconsciousstructuredefinition.pdf|Chilimbi et al., "Cache-conscious Structure Definition," PLDI 1999}}
 +  * {{cacheconsciousstructurelayout.pdf|Chilimbi et al., "Cache-conscious Structure Layout," PLDI 1999}}
 +
 +
 +===== For Lecture 13 =====
 +== Required Readings ==
 +  * {{codetransformationsformlp.pdf|Pai et al., "Code Transformations to Improve Memory Parallelism," MICRO 1999}}
 +
 +== Recommended Readings ==
 +  * {{datacachesforsuperscalar.pdf|Juan et al., "Data Caches for Superscalar Processors," ICS 1997}}
 +
 +===== For Lecture 14 =====
 +== Required Readings ==
 +  * {{markovpredictors.pdf|Joseph and Grunwald, "Prefetching using Markov Predictors," ISCA 1997}}
 +
 +== Recommended Readings ==
 +  * {{compileralgorithmforprefetching.pdf|Mowry et al., "Design and Evaluation of a Compiler Algorithm for Prefetching," ASPLOS 1992}}
 +  * {{feedbackdirectedprefetching.pdf|Srinath et al., "Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers," HPCA 2007}}
 +  * {{runaheadexecution.pdf|Mutlu et al., "Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors," HPCA 2003}}
 +
 +===== For Lecture 15 =====
 +Same as previous lecture
 +
 +===== For Lecture 16 =====
 +== Recommended Readings ==
 +  * {{statelesscontentdirectedprefetching.pdf|Cooksey et al., "A stateless, content-directed data prefetching mechanism," ASPLOS 2002}}
 +  * {{bandwidthefficientprefetching.pdf|Ebrahimi et al., "Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems," HPCA 2009}}
 +  * {{softwarecontrolledpreexecution.pdf|Luk, "Tolerating Memory Latency through Software-Controlled Pre-Execution in Simultaneous Multithreading Processors," ISCA 2001}}
 +
 +
 +===== Guest Lecture by Thomas Moscibroda =====
 +== Recommended Readings ==
 +  * {{bless.pdf|Moscibroda and Mutlu, "A Case for Bufferless Routing in On-Chip Networks," ISCA 2009}}
 +  * {{appawareprioritizationmechanismfornocs.pdf|Das et al., "Application-Aware Prioritization Mechanism for On-Chip Networks," MICRO 2009}}
 +  * {{aergia.pdf|Das et al., "Aergia: Exploiting Packet-Latency Slack in On-Chip Networks," ISCA 2010}}
 +  * {{nextgenerationnoc.pdf|Nychis et al., "Next Generation On-Chip Networks: What Kind of Congestion Control do we Need?," HotNets 2010}}
 +
 +
 +===== For Lecture 17 =====
 +== Recommended Readings ==
 +  * {{coordinatedprefetchermanagement.pdf|Ebrahimi et al., "Coordinated Management of Multiple Prefetchers in Multi-Core Systems," MICRO 2009}}
 +
 +===== For Lecture 18 =====
 +== Required Readings ==
 +  * {{utilitybasedcachepartitioning.pdf|Qureshi and Patt, "Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches," MICRO 2006}}
 +
 +== Recommended Readings ==
 +  * {{gaininginsightsintocachepartitioning.pdf|Lin et al., "Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems," HPCA 2008}}
 +  * {{adaptiveinsertionpolicies.pdf|Qureshi et al., "Adaptive Insertion Policies for High-Performance Caching," ISCA 2007}}
 +
 +
 +===== For Lecture 19 =====
 +== Required Readings ==
 +  * {{parbs.pdf|Mutlu and Moscibroda, "Parallelism-Aware Batch Scheduling: Enabling High-Performance and Fair Memory Controllers," IEEE Micro Top Picks 2009}}
 +  * {{stalltimefairmemoryscheduling.pdf|Mutlu and Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors," MICRO 2007}}
 +
 +== Recommended Readings ==
 +  * {{permutationbasedpageinterleaving.pdf|Zhang et al., "A Permutation-based Page Interleaving Scheme to Reduce Row-buffer Conflicts and Exploit Data Locality," MICRO 2000}}
 +  * {{prefetchawaredramcontrollers.pdf|Lee et al., "Prefetch-Aware DRAM Controllers," MICRO 2008}}
 +  * {{memoryaccessscheduling.pdf|Rixner et al., "Memory Access Scheduling," ISCA 2000}}
 +
 +===== For Lecture 20 =====
 +Same as previous lecture
 +
 +===== For Lecture 21 =====
 +== Required Readings ==
 +  * {{evaluationoftracecachefetchmechanisms.pdf|Patel et al., "Evaluation of design options for the trace cache fetch mechanism," IEEE TC 1999}}
 +  * {{complexityeffectivesuperscalar.pdf|Palacharla et al., "Complexity Effective Superscalar Processors," ISCA 1997}}
 +
 +== Required Readings (old) ==
 +  * {{microarchitectureofsuperscalar.pdf|Smith and Sohi, "The Microarchitecture of Superscalar Processors," Proc IEEE 1995}}
 +  * {{onpipeliningdynamicinstructionschedulinglogic.pdf|Stark, Brown, Patt, "On pipelining dynamic instruction scheduling logic," MICRO 2000}}
 +  * {{Themicroarchitectureofthepentium4processor.pdf|Boggs et al., "The microarchitecture of the Pentium 4 processor," Intel Technology Journal, 2001}}
 +  * {{21264microprocessor.pdf|Kessler, "The Alpha 21264 microprocessor," IEEE Micro, March-April 1999}}
 +
 +== Recommended Readings ==
 +  * {{tracecache.pdf|Rotenberg et al., "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching," MICRO 1996}}
 + 
 +===== For Lecture 21 =====
 +Same as previous lecture
 +
 +===== For Lecture 22 =====
 +Same as previous lecture
 +
 +===== For Lecture 23 =====
 +Same as previous lecture
 +
 +===== For Lecture 24 =====
 +== Required Readings ==
 +  * {{conbiningbranchpredictors.pdf|McFarling, "Combining Branch Predictors," DEC WRL TR, 1993}}
 +  * {{increasingprocessorperformance.pdf|Sprangle and Carmean, "Increasing Processor Performance by Implementing Deeper Pipelines," ISCA 2002}}
 +
 +== Recommended Readings ==
 +  * {{analysisofcorrelationandpredictability.pdf|Evers et al., "An Analysis of Correlation and Predictability: What Makes Two-Level Branch Predictors Work," ISCA 1998}}
 +  * {{alternativeimplementationoftwolevelbp.pdf|Yeh and Patt, "Alternative Implementations of Two-Level Adaptive Branch Prediction," ISCA 1992}}
 +  * {{availableilpforsuperscalar.pdf|Jouppi and Wall, "Available instruction-level parallelism for superscalar and superpipelined machines," ASPLOS 1989}}
 +  * {{divergemergeprocessors.pdf|Kim et al., "Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths," MICRO 2006}}
 +  * {{dynamicbranchpredictionwithperceptrons.pdf|Jimenez and Lin, "Dynamic Branch Prediction with Perceptrons," HPCA 2001}}
 +
 +===== For Lecture 25 =====
 +Same as previous lecture
 +
 +===== For Lecture 26 =====
 +
 +=== Control Flow III ===
 +
 +== Recommended Readings ==
 +  * {{wishbranches.pdf|Kim et al., "Wish Branches: Enabling Adaptive and Aggressive Predicated Execution," IEEE Micro Top Picks, Jan/Feb 2006}}
 +  * {{divergemergeprocessors.pdf|Kim et al., "Diverge-Merge Processor: Generalized and Energy-Efficient Dynamic Predication," IEEE Micro Top Picks, Jan/Feb 2007}}
 + 
 +=== Alternative Approaches to Concurrency ===
 +== Required Readings ==
 +  * {{vliweli.pdf|Fisher, "Very Long Instruction Word architectures and the ELI-512," ISCA 1983}}
 +  * {{introducingia64.pdf|Huck et al., "Introducing the IA-64 Architecture," IEEE Micro 2000}}
 +
 +== Recommended Readings ==
 +  * {{cray1computersystem.pdf|Russell, "The CRAY-1 computer system," CACM 1978}}
 +  * {{ilpprocessing.pdf|Rau and Fisher, "Instruction-level parallel processing: history, overview, and perspective," Journal of Supercomputing, 1993}}
 +  * {{instructionschedulingforilpprocessors.pdf|Faraboschi et al., "Instruction Scheduling for Instruction Level Parallel Processors," Proc. IEEE, Nov. 2001}}
 +
 +===== For Lecture 26 =====
 +Same as previous lecture (Alternative Approaches to Concurrency)
 +
 +===== For Lecture 27 =====
 +== Required Readings ==
 +  * {{nvidiatesla.pdf|Lindholm et al., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 2008}}
 +  * {{cray1computersystem.pdf|Russell, "The CRAY-1 computer system," CACM 1978}}
 +
 +== Recommended Readings ==
 +  * {{dynamicwarpformation.pdf|Fung et al., "Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow," MICRO 2007}}
 +  * {{qilin.pdf|Luk et al., "Qilin: Exploiting Parallelism on Heterogeneous Multiprocessors with Adaptive Mapping," MICRO 2009}}
