User Tools

Site Tools


readings

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Next revision
Previous revision
readings [2014/01/14 03:25]
rachata created
readings [2015/04/13 19:31] (current)
kevincha
Line 2: Line 2:
  
   * **P&P** stands for Patt & Patel'​s //​Introduction to Computing Systems: From Bits and Gates to C and Beyond//   * **P&P** stands for Patt & Patel'​s //​Introduction to Computing Systems: From Bits and Gates to C and Beyond//
 +    * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap1.pdf|P&​P Chapter 1 (Fundamentals)]]
 +    * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap4.pdf|P&​P Chapter 4 (The von Neumann Model)]]
 +    * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixa.pdf|P&​P Appendix A (The LC-3b ISA)]]
 +    * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]]
   * **P&H** stands for Patterson & Hennessy'​s //Computer Organization and Design: The Hardware/​Software Interface//   * **P&H** stands for Patterson & Hennessy'​s //Computer Organization and Design: The Hardware/​Software Interface//
  
-===== Lecture 1 (1/14 Mon.) =====+====== Guides on how to review papers critically ====== 
 +  * Lecture slides: {{onur-447-s15-how-to-do-the-paper-reviews.pdf | pdf}} {{onur-447-s15-how-to-do-the-paper-reviews.ppt | Slides ppt}} 
 +  * Example reviews on "Main Memory Scaling: Challenges and Solution Directions"​ (link to the paper) 
 +      * {{review-chapter.pdf | Review 1}} 
 +      * {{review-chapter-2.pdf | Review 2}} 
 +  * Example review on "​Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems"​ (link to the paper) 
 +      * {{review-sms.pdf | Review 1}} 
 + 
 + 
 +===== Lecture 1 (1/12 Mon.) =====
 **Required:​** **Required:​**
-  * None+  * For HW1: {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}}
  
 **Mentioned during lecture:** **Mentioned during lecture:**
- 
   * {{bstj29-2-147.pdf|Hamming,​ R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2).}}   * {{bstj29-2-147.pdf|Hamming,​ R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2).}}
   * {{youandyourresearch.pdf|Hamming,​ R. W. (1986). You and Your Research. Transcription of the Bell Communications Research Colloquium Seminar.}}   * {{youandyourresearch.pdf|Hamming,​ R. W. (1986). You and Your Research. Transcription of the Bell Communications Research Colloquium Seminar.}}
     * [[http://​www.youtube.com/​watch?​v=a1zDuOPkMSw|youtube]]     * [[http://​www.youtube.com/​watch?​v=a1zDuOPkMSw|youtube]]
-  * {{05392210.pdf|Amdahl,​ G. M., Blaauw, G. A., & Brooks, F. P. (1964). Architecture of the IBM system/360. IBM J. Res. Dev., 8(2).}} 
   * {{p128-rixner.pdf|Rixner,​ S., Dally, W. J., Kapasi, U. J., Mattson, P., & Owens, J. D. (2000). Memory access scheduling. Proceedings of the 27th annual international symposium on Computer architecture.}}   * {{p128-rixner.pdf|Rixner,​ S., Dally, W. J., Kapasi, U. J., Mattson, P., & Owens, J. D. (2000). Memory access scheduling. Proceedings of the 27th annual international symposium on Computer architecture.}}
-  * {{us5630096.pdf|William KZuravleff, & RobinsonT. (1997). Controller for a synchronous DRAM that maximizes throughput by allowing ​memory ​requests and commands to be issued out of order.}} +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​mph_usenix_security07.pdf|Moscibroda, T., & MutluO. (2007). Memory performance attacks: denial of memory ​service in multi-core systems. Proceedings ​of 16th USENIX Security Symposium.}} 
-  * {{00964437.pdf|PattY(2001)Requirementsbottlenecks, and good fortuneagents for microprocessor evolutionProceedings of the IEEE.}} +   ​* {{http://​research.microsoft.com/​pubs/​79625/​MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda"​Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors",​ MICRO 2007}} 
-  * {{moscibroda.pdf|MoscibrodaT., Mutlu, ​O(2007)Memory performance attacksdenial of memory service in multi-core systemsProceedings ​of 16th USENIX Security Symposium.}}+   * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​memory-channel-partitioning-micro11.pdf|Sai Prashanth MuralidharaLavanya Subramanian,​ Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "​Reducing Memory Interference in Multicore Systems via Application-Aware  
 +   * Memory Channel Partitioning",​ MICRO 2011.}} 
 +   * {{http://users.ece.cmu.edu/​~omutlu/​pub/​raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} 
 +   ​* {{http://​users.ece.cmu.edu/​~omutlu/​pub/​memory-scaling_memcon13.pdf|Onur Mutlu"​Memory Scaling: A Systems Architecture Perspective"​ Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}} 
 +   * {{http://​users.ece.cmu.edu/​~kevincha/​papers/​chang_hpca2014.pdf|Kevin ChangDonghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu, ​"​Improving DRAM Performance by Parallelizing Refreshes with Accesses",​ In HPCA 2014, Orlando, Feb2014.}} 
 +   * {{http://​users.ece.cmu.edu/​~yoonguk/​papers/​kim-isca14.pdf | Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, Onur Mutlu, "​Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors",​ In ISCA-41, 2014.}}
  
-===== Lecture 2 (1/16 Wed.) =====+===== Lecture 2 (1/14 Wed.) =====
 **Required:​** **Required:​**
   * {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}}   * {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}}
Line 25: Line 41:
   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap1.pdf|P&​P Chapter 1 (Fundamentals)]]   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap1.pdf|P&​P Chapter 1 (Fundamentals)]]
   * P&H Chapters 1 and 2 (Intro, Abstractions,​ ISA, MIPS)   * P&H Chapters 1 and 2 (Intro, Abstractions,​ ISA, MIPS)
-  * + 
 + 
 +**Mentioned during lecture:​** 
 +  
 +===== Lecture 3 (1/16 Fri.) ===== 
 +**Required:​** 
 +  * Note that you should familiarize yourself with these manuals. Please briefly skim through these manuals as you will probably need to refer to them while working on labs and homework 
 +  * MIPS Architecture Reference Manual 
 +    * {{mips_r4000_users_manual.pdf |Manual (the instruction set reference starts on pg.469)}} 
 +  * Intel® 64 and IA-32 Architectures Software Developer Manual (2013) 
 +    * [[http://​download.intel.com/​products/​processor/​manual/​325462.pdf|(15MB) Combined Volumes 1-3]]3 
 + 
 +**Mentioned during lecture:​** 
 +  * {{paper_aklaiber_19jan00.pdf | Klaiber, "The Technology Behind CrusoeTM Processors",​ Transmeta White Paper, 2000}} 
 + 
 +===== Lecture 4 (1/21 Wed.) ===== 
 +**Required:​** 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap4.pdf|P&​P Chapter 4 (The von Neumann Model)]] 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixa.pdf|P&​P Appendix A (The LC-3b ISA)]] 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] 
 + 
 +**Mentioned during lecture:​** 
 + 
 +===== Lecture 5 (1/23 Fri.) ===== 
 +**Required** 
 +  * None 
 + 
 +===== Lecture 6 (1/26 Mon.) ===== 
 +**Required:​** 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] 
 +  * P&H Appendix D (Mapping Control to Hardware) 
 +**Optional:​** 
 +  * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} 
 +**Mentioned during lecture:​** 
 + 
 +===== Lecture 7 (1/28 Wed.) ===== 
 +**Required:​** 
 +  * None 
 + 
 +**Mentioned during lecture:​** 
 +  * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} 
 +  * [[http://​research.microsoft.com/​pubs/​68221/​acrobat.pdf |Butler W. Lampson, “Hints for Computer System Design,” ACM Operating Systems Review, 1983.]] 
 + 
 +===== Lecture 8 (2/2 Mon.) ===== 
 +**Required:​** 
 +  * P&H Sections 4.9-4.11 
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} 
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} 
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} 
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} 
 + 
 +**Mentioned during lecture:​** 
 +  * {{p16-pettis.pdf|Pettis,​ K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}} 
 + 
 +===== Lecture 9 (2/4 Wed.) ===== 
 +**Required:​** 
 +  * P&H Sections 4.9-4.11 
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} 
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} 
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} 
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} 
 + 
 +**Mentioned during lecture:​** 
 +  * {{flash-memory-data-retention_hpca15.pdf|Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization,​ Optimization and Recovery, HPCA 2015.}} 
 +  * {{adaptive-latency-dram_hpca15.pdf|Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu, Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,​ HPCA 2015.}} 
 +  * {{compression-aware-cache-management_hpca15.pdf|Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip P. Gibbons, Michael A. Kozuch, and Todd C. Mowry, Exploiting Compressed Block Size as an Indicator of Future Reuse, HPCA 2015.}} 
 + 
 + 
 +===== Lecture 10 (2/6 Fri.) ===== 
 +**Required:​** 
 +  * P&H Sections 4.9-4.11 
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} 
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} 
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} 
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} 
 + 
 +**Mentioned in the Lecture:​** 
 +  * {{p300-ball.pdf|Ball,​ T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}} 
 +  * {{p135-smith.pdf|Smith,​ J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}} 
 +  * {{yeh_patt_-_1991_-_two-level_adaptive_training_branch_prediction.pdf|Yeh,​ T.-Y., & Patt, Y. N. (1991). Two-level adaptive training branch prediction. Proceedings of the 24th annual international symposium on Microarchitecture.}} 
 +  * {{p22-chang.pdf|Chang,​ P.-Y., Hao, E., Yeh, T.-Y., & Patt, Y. (1994). Branch classification:​ a new mechanism for improving branch predictor performance. Proceedings of the 27th annual international symposium on Microarchitecture.}} 
 +  * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '​01)}} 
 +  * {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}} 
 +  * {{p274-chang.pdf|Po-Yung Chang, Eric Hao, and Yale N. Patt. 1997. Target prediction for indirect jumps. ISCA'​97.}} 
 +  * {{kim_isca07.pdf|Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. ISCA'​07}} 
 + 
 +===== Lecture 11 (2/11 Wed.) ===== 
 +**Required:​** 
 + 
 +**Mentioned in the Lecture:​** 
 +  * {{p18-hwu.pdf|Hwu and Patt (1987). Checkpoint Repair for Out-of-order Execution Machines.}} 
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} 
 +  * {{ogehl.pdf | Seznec (2005). Analysis of the O-GEometric History Length Branch Predictor. ISCA}} 
 +  * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '​01)}} 
 + 
 +===== Lecture 12 (2/13 Fri.) ===== 
 +**Required:​** 
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} 
 + 
 +===== Lecture 13 (2/16 Mon.) ===== 
 +**Required:​** 
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} 
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} 
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} 
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} 
 + 
 +===== Lecture 14 (2/18 Wed.) ===== 
 +** Required ** 
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} 
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} 
 + 
 +**Mentioned during lecture:​** 
 +  * {{30470407.pdf|Fung,​ W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}} 
 +  * {{p253-suleman.pdf |Suleman, M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}} 
 +  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}} 
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} 
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} 
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}} 
 +  * {{p199-smith.pdf|Smith,​ J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}} 
 +  * {{00030730.pdf|Smith,​ J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}} 
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}} 
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} 
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}} 
 + 
 +===== Lecture 15 (2/20 Fri.) ===== 
 +** Required ** 
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} 
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} 
 + 
 +**Mentioned during lecture:​** 
 +  * {{30470407.pdf|Fung,​ W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}} 
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} 
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} 
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}} 
 +  * {{p199-smith.pdf|Smith,​ J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}} 
 +  * {{00030730.pdf|Smith,​ J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}} 
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}} 
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} 
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}} 
 +  * {{jog_orchestrated.pdf|Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. ISCA '​13}} 
 +  * {{large-gpu-warps_micro11|Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov,​ Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling.MICRO-44}} 
 + 
 +===== Lecture 16 (2/23 Mon.) ===== 
 +**Mentioned during lecture:​** 
 +  * {{:​mise-predictable_memory_performance-hpca13.pdf|Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013}} 
 +  * [[http://​users.ece.cmu.edu/​~omutlu/​pub/​mph_usenix_security07.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.]] 
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}} 
 +  * {{01675827.pdf|Fisher,​ J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}} 
 +  * {{2fbf01205185.pdf|Hwu,​ W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}} 
 +  * {{p45-mahlke.pdf|Mahlke,​ S. A., Lin, D. C., Chen, W. Y., Hank, R. E., & Bringmann, R. A. (1992). Effective compiler support for predicated execution using the hyperblock. Proceedings of the 25th annual international symposium on Microarchitecture.}} 
 +  * {{melvin_patt_-_1995_-_enhancing_instruction_scheduling_with_a_block-structured_isa.pdf|Melvin,​ S., & Patt, Y. (1995). Enhancing instruction scheduling with a block-structured ISA. Int. J. Parallel Program.}} 
 +  * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao,​ E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}} 
 +  * {{00877947.pdf|Huck,​ J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}} 
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} 
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}} 
 +  *  {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} 
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} 
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}} 
 +  * {{:​ilp_history_overview_perspective.pdf|Rau and Fisher, “Instruction-level parallel processing:​ history,​ overview, and perspective,​” Journal of Supercomputing,​ 1993.}} 
 +  * {{:​ieee_proceedings_2001_-_compiler_techniques.pdf|Faraboschi et al., “Instruction Scheduling for Instruction Level Parallel Processors,​” Proc. IEEE, Nov. 2001. 
 +}} 
 + 
 +===== Lecture 17 (2/25 Wed.) ===== 
 +**Required:​** 
 +  * {{00877947.pdf|Huck,​ J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}} 
 +  * P&H Chapters 5.1-5.3 (cache chapters) 
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/​memory chapters) 
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes,​ M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} 
 +  * {{:​liptay68.pdf|Liptay,​ “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968. 
 +}} 
 + 
 +===== Lecture 18 (2/27 Fri.) ===== 
 +**Required:​** 
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes,​ M. V. (1964). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} 
 +  * {{A_Case_For_MLP_Aware_Cache_Replacement.pdf|nQureshi et al., “A Case for MLP-Aware Cache Replacement,​“ ISCA 2006.}} 
 +  * P&H Chapters 5.1-5.3 (cache chapters) 
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/​memory chapters) 
 + 
 +**Mentioned During Lecture:​** 
 +  * {{A_Study_of_replacement_algorithms_for_a_virtual-storage_computer.pdf|qBelady,​ “A study of replacement algorithms for a virtual-storage computer,​” IBM Systems Journal, 1966.}} 
 + 
 +===== Lecture 19 (3/2 Mon.) ===== 
 +**Required:​** 
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes,​ M. V. (1964). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} 
 +  * {{A_Case_For_MLP_Aware_Cache_Replacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,​“ ISCA 2006.}} 
 +  * P&H Chapters 5.1-5.3 (cache chapters) 
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/​memory chapters) 
 + 
 +**Mentioned During Lecture:​** 
 +  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi,​ N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}} 
 +  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}} 
 +  * {{seznec_a_case_for_two_way_skewed_associative_caches.pdf|Seznec. A Case for Two-Way Skewed-Associative Caches. ISCA 1993.}} 
 +  * {{seznec_a_case_for_two_way_skewed_associative_caches.pdf|Kroft. Lockup-Free Instruction Fetch/​Prefetch Cache Organization. ISCA 1981.}} 
 +  * {{qureshi_utility_based_cache_partitioning.pdf|Qureshi and Patt. Utility-Based Cache Partitioning:​ A Low-Overhead,​ High-Performance,​ Runtime Mechanism to Partition Shared Caches. MICRO 2006.}} 
 +  * {{suh_new_memory_monitoring_scheme_for_memory_aware_scheduling_and_partitioning.pdf|Suh et al. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. HPCA 2002.}} 
 + 
 +===== Lecture 20 (3/4 Wed.) ===== 
 +**Required:​** 
 +  * Section 5.4 in P&H 
 +**Mentioned During Lecture:​** 
 +  * Section 8.8 in Hamacher et al. 
 +  * {{megiddo.pdf|Megiddo and Modha, “ARC: A Self-Tuning,​ Low Overhead Replacement Cache,” FAST 2003.}} 
 + 
 +===== Lecture 21 (3/23 Mon.) ===== 
 +**Required:​** 
 +  * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013. (Sections 1 and 2)}} 
 +  * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} 
 +  * {{raidr-isca12.pdf| Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. (Sections 1 and 2)}} 
 +  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian,​ "The Main Memory System: Challenges and Opportunities,"​ Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} 
 +**Mentioned During Lecture:​** 
 +  * {{flipping_kim-isca14.pdf| Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.}} 
 +  * {{flash-memory-data-retention_hpca15.pdf| Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, 
 +"Data Retention in MLC NAND Flash Memory: Characterization,​ Optimization and Recovery,"​  
 +Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015. }} 
 +  * {{coarchitecting-kang.pdf| Kang+, "​Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling"​}} 
 + 
 +===== Lecture 22 (3/25 Wed.) ===== 
 +**Required:​** 
 +  * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013. (Sections 1 and 2)}} 
 +  * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} 
 +   * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} 
 +  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian,​ "The Main Memory System: Challenges and Opportunities,"​ Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} 
 +**Mentioned During Lecture:​** 
 +  * {{moscibroda.pdf| Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,​”USENIX Security 2007}} 
 +  * {{moscibroda2007.pdf| Onur Mutluand Thomas Moscibroda, "​Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors"​ ISCA 2008}} 
 +  * {{isca08.pdf| Onur Mutlu and Thomas Moscibroda, "​Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM," ISCA 2008}} 
 +  * {{thread_cluster_mem_sched.pdf| Yoongu Kim, Michael Papamichael,​ Onur Mutlu, and Mor Harchol-Balter,​ "​Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior"​ 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010}} 
 +  * {{ATLAS.pdf| Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter,​ "​ATLAS:​ A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers"​}} 
 + 
 +===== Lecture 23 (3/27 Fri.) ===== 
 +**Required:​** 
 +  * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013. (Sections 1 and 2)}} 
 +  * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} 
 +   * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} 
 +  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian,​ "The Main Memory System: Challenges and Opportunities,"​ Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} 
 + 
 +**Mentioned During Lecture:​** 
 +  * {{moscibroda.pdf| Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,​”USENIX Security 2007}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​memory-channel-partitioning-micro11.pdf|Sai Prashanth Muralidhara,​ Lavanya Subramanian,​ Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "​Reducing Memory Interference in Multicore Systems via Application-Awareness"​}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​architecture-aware-distributed-resource-management_vee15.pdf | Wang et al., “A-DRM: Architecture-aware Distributed Resource Management of Virtualized Clusters,​” VEE 2015.}} 
 +  * {{p362-ebrahimi.pdf|Ebrahimi,​ E., Miftakhutdinov,​ R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} 
 +  * {{p335-ebrahimi.pdf|Ebrahimi,​ E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} 
 +  * {{bloom1970.pdf|Bloom,​ “Space/​Time Trade-offs in Hash Coding with Allowable Errors”, CACM 1970}} 
 +  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} 
 +   * {{http://​users.ece.cmu.edu/​~kevincha/​papers/​chang_hpca2014.pdf|Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu, "​Improving DRAM Performance by Parallelizing Refreshes with Accesses",​ In HPCA 2014, Orlando, Feb. 2014.}} 
 + 
 +===== Lecture 24 (3/30 Mon.) ===== 
 +**Required:​** 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​mutlu_hpca03.pdf | Mutlu et al., “Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-order Processors,​” HPCA 2003.}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​TR-HPS-2006-006.pdf|Srinathet al., “Feedback directed prefetching”,​ HPCA 2007.}} 
 + 
 +**Mentioned During Lecture:​** 
 +  * {{ramulator.pdf|Kim et al., “Ramulator:​ A Fast and Extensible DRAM Simulator,​” IEEE Computer Architecture Letters 2015.}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​eaf-cache_pact12.pdf|Seshadri et al., “The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,​”PACT 2012.}} 
 +  * {{http://​hps.ece.utexas.edu/​pub/​TR-HPS-2010-002.pdf|Lee et al., “DRAM-Aware Last-Level Cache Writeback: Reducing Write-Caused Interference in Memory Systems,​”HPS Technical Report, April 2010.}} 
 +  * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013. (Sections 1 and 2)}} 
 +  * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​rlmc_isca08.pdf|Ipek et al., “Self Optimizing Memory Controllers:​ A Reinforcement Learning Approach,​”ISCA 2008}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​mutlu_ieee_micro06.pdf|Mutlu et al., "​Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance,"​ ISCA 2005, IEEE Micro Top Picks 2006.}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​mutlu_micro05.pdf|Mutlu et al., "​Address-Value Delta (AVD) Prediction,"​ MICRO 2005.}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​armstrong_micro04.pdf|Armstrong et al., "Wrong Path Events,"​ MICRO 2004.}} 
 + 
 +===== Lecture 25 (4/1 Wed.) ===== 
 +**Required:​** 
 +  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian,​ "The Main Memory System: Challenges and Opportunities,"​ Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} 
 + 
 +**Mentioned During Lecture:​** 
 +  * {{jouppi1990.pdf|Jouppi,​ “Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers,” ISCA 1990.}} 
 +  * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​TR-HPS-2006-006.pdf|Srinathet al., “Feedback directed prefetching”,​ HPCA 2007.}} 
 + 
 +===== Lecture 26 (4/3 Fri.) ===== 
 +** Required: ** 
 +  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian,​ "The Main Memory System: Challenges and Opportunities,"​ Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} 
 + 
 +** Mentioned during lecture: ** 
 +  * {{raidr-isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} 
 +  * {{2012_isca_salp.pdf|Kim et al., “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.}} 
 +  * {{TLDRAM-Lee.pdf|Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013.}} 
 +  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} 
 +  * {{rowclone_micro13.pdf|Seshadri et al., “RowClone:​ Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.}} 
 +  * {{LCP.pdf|Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework,​” MICRO 2013.}} 
 +  * {{|Chang et al., “Improving DRAM Performance by Parallelizing Refreshes with Accesses,​” HPCA 2014.}} 
 +  * {{error-mitigation-for-intermittent-dram-failures_sigmetrics14.pdf|Khan et al., “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” SIGMETRICS 2014.}} 
 +  * {{luo_dsn14.pdf|Luo et al., “Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost,” DSN 2014.}} 
 +  * {{yoongu2014.pdf|Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.}} 
 +  * {{meza_cal12.pdf|Meza et al., “Enabling Efficient and Scalable Hybrid Memories,​” IEEE Comp. Arch. Letters 2012.}} 
 +  * {{rowbuffer-aware-caching_iccd12.pdf|Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,​” ICCD 2012.}} 
 +  * {{sttram_ispass13.pdf|Kultursay ​ et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,​” ISPASS 2013. }} 
 +  * {{meza_weed13.pdf|Meza ​ et al., “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.}} 
 +  * {{ISCA09.pdf|Lee ​ et al. “Architecting Phase Change Memory as a Scalable DRAM Alternative,​” ISCA 2009.}} 
 +  * Meza+, “Revisiting Memory Errors in Large-Scale Production Data Centers: Analysis and Modeling of New Trends from the Field,” DSN 2015. 
 +  * Qureshi+, “AVATAR: A Variable-Retention-Time (VRT) Aware Refresh for DRAM Systems,” DSN 2015. 
 +  * {{ramulator.pdf|Kim et al., “Ramulator:​ A Fast and Extensible DRAM Simulator,​” IEEE Computer Architecture Letters 2015.}} 
 +  * {{dirty-block-index_isca14.pdf|Seshadri+,​ “The Dirty-Block Index,” ISCA 2014.}} 
 +  * {{bdi-compression_pact12.pdf|Pekhimenko+,​ “Base-Delta-Immediate Compression:​ Practical Data Compression for On-Chip Caches,” PACT 2012.}} 
 +  * {{eaf-cache_pact12.pdf|Seshadri+,​ “The Evicted-Address Filter: A Unified Mechanism to Address Both Cache Pollution and Thrashing,​” PACT 2012.}} 
 +  * {{zhao-micro2014.pdf|Zhao+,​ “FIRM: Fair and High-Performance Memory Control for Persistent Memory Systems,” MICRO 2014.}} 
 +  * {{a40-yoon.pdf|Yoon,​ Meza+, “Efficient Data Mapping and Buffering Techniques for Multi-Level Cell Phase-Change Memories,​” ACM TACO 2014.}} 
 +  * {{compression-aware-cache-management_hpca15.pdf|Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip P. Gibbons, Michael A. Kozuch, and Todd C. Mowry, Exploiting Compressed Block Size as an Indicator of Future Reuse, HPCA 2015.}} 
 +  * {{06974684.pdf|Lu+,​ “Loose Ordering Consistency for Persistent Memory,” ICCD 2014.}} 
 +  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}} 
 +  * {{cooksey.pdf|Cooksey et al., “A stateless, content-directed data prefetching mechanism,​” ASPLOS 2002.}} 
 +  * {{zilles-2001.pdf|Zilles and Sohi, “Execution-based Prediction Using Speculative Slices”, ISCA 2001.}} 
 +  * {{zilles-2000.pdf|Zilles and Sohi, ”Understanding the backward slices of performance degrading instructions,​” ISCA 2000.}} 
 +  * {{http://​www.amazon.com/​Inside-AS-400-Second-Edition/​dp/​1882419669|Frank Soltis,"​Inside the AS/​400"​}} 
 + 
 +===== Lecture 27 (4/6 Mon.) ===== 
 +** Required: ** 
 +  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} 
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport,​ L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]] 
 +  * P&H, Chapter 5.8 
 +** Recommended:​ ** 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_309_314.pdf|Hill,​ Jouppi, Sohi. "​Multiprocessors and Multicomputers,"​ pp. 551-560 in Readings in Computer Architecture.]] 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_551_560.pdf|Hill,​ Jouppi, Sohi. "​Dataflow and Multithreading,"​ pp. 309-314 in Readings in Computer Architecture.]] 
 +  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}} 
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}} 
 +** Mentioned during lecture: ** 
 +  * {{horner-1819.pdf|Horner (1819). A new method of solving numerical equations of all orders, by continuous approximation. Philosophical Transactions of the Royal Society}} 
 + 
 +===== Lecture 28 (4/8 Wed.) ===== 
 +** Required: ** 
 +  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport,​ L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} 
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}} 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]] 
 +  * P&H, Chapter 5.8 
 +** Recommended:​ ** 
 +  * {{10.1.1.17.8112.pdf|Gharachorloo et al. (1990). Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors.}} 
 +  * {{10.1.1.89.3693.pdf|Gharachorloo et al. (1991). Two Techniques to Enhance the Performance of Memory Consistency Models.}} 
 +  * {{isca07_bulksc.pdf|Ceze et al. (2007). BulkSC: Bulk Enforcement of Sequential Consistency.}} 
 +  * {{censier.pdf|Censier et al. (1978). A new solution to coherence problems in multicache systems.}} 
 +  * {{goodman-snoopyprotocol.pdf|Goodman (1983). Using cache memory to reduce processor-memory traffic.}} 
 +  * {{isca123.pdf|Laudon et al. (1997). The SGI Origin: a ccNUMA highly scalable server.}} 
 +  * {{isca03_token_coherence.pdf|Martin et al. (2003). Token coherence: decoupling performance and correctness.}} 
 +  * {{p73-baer.pdf|Baer et al. (1988). On the inclusion properties for multi-level cache hierarchies.}} 
 +** Mentioned during lecture: ** 
 +  * (HTML) [[http://​www.cs.utexas.edu/​users/​EWD/​transcriptions/​EWD01xx/​EWD123.html|Dijkstra (1965) Cooperating Sequential Processes.]] 
 + 
 +===== Lecture 29 (4/10 Fri.) ===== 
 +** Required: ** 
 +  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]] 
 +  * P&H, Chapter 5.8 
 +  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}} 
 +** Recommended:​ ** 
 +  * {{censier.pdf|Censier et al. (1978). A new solution to coherence problems in multicache systems.}} 
 +  * {{goodman-snoopyprotocol.pdf|Goodman (1983). Using cache memory to reduce processor-memory traffic.}} 
 +  * {{isca123.pdf|Laudon et al. (1997). The SGI Origin: a ccNUMA highly scalable server.}} 
 +  * {{isca03_token_coherence.pdf|Martin et al. (2003). Token coherence: decoupling performance and correctness.}} 
 +  * {{p73-baer.pdf|Baer et al. (1988). On the inclusion properties for multi-level cache hierarchies.}} 
 + 
 +===== Lecture 30 (4/13 Mon.) ===== 
 +** Required: ** 
 +  * {{rowclone_micro13.pdf|Seshadri et al., “RowClone:​ Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.}} 
readings.1389669904.txt.gz · Last modified: 2014/12/11 00:09 (external edit)