readings

Differences

This shows you the differences between two versions of the page.

readings [2014/03/28 20:23]
rachata
readings [2014/05/20 01:46]
rachata
Line 1: Line 1:
-1====== Readings ======
+====== Readings ======
  
   * **P&P** stands for Patt & Patel's //Introduction to Computing Systems: From Bits and Gates to C and Beyond//
Line 223: Line 223:
   * {{p335-ebrahimi.pdf|Ebrahimi, E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}}
   * {{p362-ebrahimi.pdf|Ebrahimi, E., Miftakhutdinov, R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +
 +===== Lecture 25 (4/2 Wed.) =====
 +
 +** Mentioned during lecture: **
 +  * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}}
 +  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}}
 +  * {{4299a065.pdf|Kim,​ Y., Papamichael,​ M., Mutlu, O., & Harchol-Balter,​ M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara,​ S. P., Subramanian,​ L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{p335-ebrahimi.pdf|Ebrahimi,​ E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}}
 +  * {{p362-ebrahimi.pdf|Ebrahimi,​ E., Miftakhutdinov,​ R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{isca08_ipek.pdf|Ipek, E., Mutlu, O., Martinez, J., & Caruana, R. (2008). Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. Proceedings of the 35th Annual International Symposium on Computer Architecture.}}
 +
 +
 +===== Lecture 26 (4/7 Mon.) =====
 +** Required: **
 +  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}}
 +  * {{04147648.pdf|Srinath,​ S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching:​ Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}}
 +
 +** Recommended:​ **
 +  * {{24400233.pdf|Mutlu,​ O., Kim, H., & Patt, Y. N. (2005). Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{01603492.pdf|Mutlu,​ O., Kim, H., & Patt, Y. N. (2006). Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance. IEEE Micro.}}
 +  * {{21260119.pdf|Armstrong,​ D. N., Kim, H., Mutlu, O., & Patt, Y. N. (2004). Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery. Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture.}}
 +
 +===== Lecture 27 (4/9 Wed.) =====
 +** Required: **
 +  * None
 +** Mentioned during lecture: **
 +  * {{p176-baer.pdf|Baer,​ J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}}
 +  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi,​ N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}
 +  * {{mowry_lam_gupta_-_1992_-_design_and_evaluation_of_a_compiler_algorithm_for_prefetching.pdf|Mowry,​ T. C., Lam, M. S., & Gupta, A. (1992). Design and evaluation of a compiler algorithm for prefetching. Proceedings of the fifth international conference on Architectural support for programming languages and operating systems.}}
 +
 +
 +===== Lecture 28 (4/14 Mon.) =====
 +** Required: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport,​ L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}}
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]]
 +  * P&H, Chapter 5.8
 +** Recommended:​ **
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_309_314.pdf|Hill,​ Jouppi, Sohi. "​Multiprocessors and Multicomputers,"​ pp. 551-560 in Readings in Computer Architecture.]]
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_551_560.pdf|Hill,​ Jouppi, Sohi. "​Dataflow and Multithreading,"​ pp. 309-314 in Readings in Computer Architecture.]]
 +  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}}
 +** Mentioned during lecture: **
 +  * {{p176-baer.pdf|Baer,​ J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}}
 +  * {{04147648.pdf|Srinath,​ S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching:​ Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}}
 +  * {{joseph_grunwald_-_1997_-_prefetching_using_markov_predictors.pdf|Joseph,​ D., & Grunwald, D. (1997). Prefetching using Markov predictors. Proceedings of the 24th annual international symposium on Computer architecture.}}
 +  * {{p279-cooksey.pdf|Cooksey,​ R., Jourdan, S., & Grunwald, D. (2002). A stateless, content-directed data prefetching mechanism. Proceedings of the 10th international conference on Architectural support for programming languages and operating systems.}}
 +  * {{04798232.pdf|Ebrahimi,​ E., Mutlu, O., & Patt, Y. N. (2009). Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. High Performance Computer Architecture,​ 2009.}}
 +  * {{p186-chappell.pdf|Chappell,​ R. S., Stark, J., Kim, S. P., Reinhardt, S. K., & Patt, Y. N. (1999). Simultaneous subordinate microthreading (SSMT). Proceedings of the 26th annual international symposium on Computer architecture.}}
 +  * {{p2-zilles.pdf|Zilles,​ C., & Sohi, G. (2001). Execution-based prediction using speculative slices. Proceedings of the 28th annual international symposium on Computer architecture.}}
 +  * {{p40-luk.pdf|Luk,​ C.-K. (2001). Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. Proceedings of the 28th annual international symposium on Computer architecture.}}
 +  * {{p172-zilles.pdf|Zilles,​ C. B., & Sohi, G. S. (2000). Understanding the backward slices of performance degrading instructions. Proceedings of the 27th annual international symposium on Computer architecture.}}
 +  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}}
 +  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi,​ N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}
 +
 +
 +===== Lecture 29 (4/16 Wed.) =====
 +** Required: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport,​ L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}}
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]]
 +  * P&H, Chapter 5.8
 +
 +===== Lecture 30 (4/18 Fri.) =====
 +** Required: **
 +  * {{LCP.pdf|Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency,” MICRO 2013.}}
 +  * {{bdi-compression_pact12.pdf|Pekhimenko et al., "​Base-Delta-Immediate Compression:​ Practical Data Compression for On-Chip Caches,"​ PACT 2012.}}
 +  * {{mise-predictable_memory_performance-hpca13.pdf|Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013.}} ​
 +
 +===== Lecture 31 (4/28 Mon.) =====
 +** Required: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport,​ L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}}
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]]
 +  * P&H, Chapter 5.8
 +** Recommended:​ **
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_309_314.pdf|Hill,​ Jouppi, Sohi. "​Multiprocessors and Multicomputers,"​ pp. 551-560 in Readings in Computer Architecture.]]
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_551_560.pdf|Hill,​ Jouppi, Sohi. "​Dataflow and Multithreading,"​ pp. 309-314 in Readings in Computer Architecture.]]
 +  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}}
 +** Mentioned during lecture: **
 +  * {{p168-patel.pdf|Patel,​ J. H. (1979). Processor-memory interconnections for multiprocessors. Proceedings of the 6th annual symposium on Computer architecture.}}
 +  * {{p196-moscibroda.pdf|Moscibroda,​ T., & Mutlu, O. (2009). A case for bufferless routing in on-chip networks. Proceedings of the 36th annual international symposium on Computer architecture.}}
 +  * {{p27-gottlieb.pdf|Gottlieb,​ A., Grishman, R., Kruskal, C. P., McAuliffe, K. P., Rudolph, L., & Snir, M. (1982). The NYU Ultracomputer -- designing a MIMD, shared-memory parallel machine (Extended Abstract). Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p22-seitz.pdf|Seitz,​ C. L. (1985). The cosmic cube. Commun. ACM.}}
 +  * {{p278-glass.pdf|Glass,​ C. J., & Ni, L. M. (1992). The turn model for adaptive routing. Proceedings of the 19th annual international symposium on Computer architecture.}}
 +
 +===== Lecture 32 (4/30 Wed.) =====
 +** Required: **
 +  * None
 +
 +** Mentioned during lecture: **
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}}
 +  * {{grochowski_et_al._-_2004_-_best_of_both_latency_and_throughput.pdf|Grochowski,​ E., Ronen, R., Shen, J., & Wang, H. (2004). Best of Both Latency and Throughput. Proceedings of the IEEE International Conference on Computer Design (pp. 236–243).}}
 +  * {{tendler_et_al._-_2002_-_power4_system_microarchitecture.pdf|Tendler,​ J. M., Dodson, J. S., Fields, J. S., Le, H., & Sinharoy, B. (2002). POWER4 system microarchitecture. IBM J. Res. Dev.}}
 +  * {{01289290.pdf|Kalla,​ R., Sinharoy, B., & Tendler, J. M. (2004). IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro.}}
 +  * {{kongetira_aingaran_olukotun_-_2005_-_niagara_a_32-way_multithreaded_sparc_processor.pdf|Kongetira,​ P., Aingaran, K., & Olukotun, K. (2005). Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro.}}
 +  * {{p253-suleman.pdf|Suleman,​ M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}}
 +  * {{p441-suleman.pdf|Suleman,​ M. A., Mutlu, O., Joao, J. A., Khubaib, & Patt, Y. N. (2010). Data marshaling for multi-core architectures. Proceedings of the 37th annual international symposium on Computer architecture.}}
 +  * {{p223-joao.pdf|Joao,​ J. A., Suleman, M. A., Mutlu, O., & Patt, Y. N. (2012). Bottleneck identification and scheduling in multithreaded applications. Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems.}}
 +
 +===== Lecture 33 (5/2 Fri.) =====
 +** Required: **
 +  * None
 +
 +** Mentioned during lecture: **
 +  * Liu, Jaiyen, Veras, Mutlu, “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.
 +  * Kim, Seshadri, Lee+, “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.
 +  * Lee+, “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013.
 +  * Liu+, “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.
 +  * Seshadri+, “RowClone:​ Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.
 +  * Pekhimenko+,​ “Linearly Compressed Pages: A Main Memory Compression Framework,​” MICRO 2013.
 +  * Chang+, “Improving DRAM Performance by Parallelizing Refreshes with Accesses,​” HPCA 2014.
 +  * Khan+, “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” SIGMETRICS 2014.
 +  * Luo+, “Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost,” DSN 2014.
 +  * Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.
 +  * Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,​” ISCA 2009, CACM 2010, Top Picks 2010.
 +  * Meza, Chang, Yoon, Mutlu, Ranganathan,​ “Enabling Efficient and Scalable Hybrid Memories,​” IEEE Comp. Arch. Letters 2012.
 +  * Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,​” ICCD 2012.
 +  * Kultursay+, “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,​” ISPASS 2013. 
 +  * Meza+, “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.
 +  * Lee, Ipek, Mutlu, Burger, “Architecting Phase Change Memory as a Scalable DRAM Alternative,​” ISCA 2009.
 +  * Meza+, “Enabling Efficient and Scalable Hybrid Memories,​” IEEE Comp. Arch. Letters, 2012.
 +  * Yoon, Meza et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,​” ICCD 2012 Best Paper Award.
readings.txt · Last modified: 2015/04/13 19:31 by kevincha