readings

Differences

This shows you the differences between two versions of the page.

readings [2014/12/11 00:09]
127.0.0.1 external edit
readings [2015/03/25 21:15]
albert
Line 8: Line 8:
   * **P&H** stands for Patterson & Hennessy'​s //Computer Organization and Design: The Hardware/​Software Interface//   * **P&H** stands for Patterson & Hennessy'​s //Computer Organization and Design: The Hardware/​Software Interface//
  
-===== Lecture 1 (1/13 Mon.) =====+====== Guides on how to review papers critically ====== 
 +  * Lecture slides: {{onur-447-s15-how-to-do-the-paper-reviews.pdf | pdf}} {{onur-447-s15-how-to-do-the-paper-reviews.ppt | Slides ppt}} 
 +  * Example reviews on "Main Memory Scaling: Challenges and Solution Directions"​ (link to the paper) 
 +      * {{review-chapter.pdf | Review 1}} 
 +      * {{review-chapter-2.pdf | Review 2}} 
 +  * Example review on "​Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems"​ (link to the paper) 
 +      * {{review-sms.pdf | Review 1}} 
 + 
 + 
 +===== Lecture 1 (1/12 Mon.) =====
 **Required:​** **Required:​**
-  * None+  * For HW1: {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}}
  
 **Mentioned during lecture:** **Mentioned during lecture:**
- 
   * {{bstj29-2-147.pdf|Hamming,​ R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2).}}   * {{bstj29-2-147.pdf|Hamming,​ R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2).}}
   * {{youandyourresearch.pdf|Hamming,​ R. W. (1986). You and Your Research. Transcription of the Bell Communications Research Colloquium Seminar.}}   * {{youandyourresearch.pdf|Hamming,​ R. W. (1986). You and Your Research. Transcription of the Bell Communications Research Colloquium Seminar.}}
     * [[http://​www.youtube.com/​watch?​v=a1zDuOPkMSw|youtube]]     * [[http://​www.youtube.com/​watch?​v=a1zDuOPkMSw|youtube]]
-  * {{05392210.pdf|Amdahl,​ G. M., Blaauw, G. A., & Brooks, F. P. (1964). Architecture of the IBM system/360. IBM J. Res. Dev., 8(2).}} 
   * {{p128-rixner.pdf|Rixner,​ S., Dally, W. J., Kapasi, U. J., Mattson, P., & Owens, J. D. (2000). Memory access scheduling. Proceedings of the 27th annual international symposium on Computer architecture.}}   * {{p128-rixner.pdf|Rixner,​ S., Dally, W. J., Kapasi, U. J., Mattson, P., & Owens, J. D. (2000). Memory access scheduling. Proceedings of the 27th annual international symposium on Computer architecture.}}
-  * {{us5630096.pdf|William K. Zuravleff, & Robinson, T. (1997). Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order.}} 
-  * {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}} 
   * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​mph_usenix_security07.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}}   * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​mph_usenix_security07.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}}
    * {{http://​research.microsoft.com/​pubs/​79625/​MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda, "​Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors",​ MICRO 2007. }}    * {{http://​research.microsoft.com/​pubs/​79625/​MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda, "​Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors",​ MICRO 2007. }}
Line 27: Line 32:
    * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}}    * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}}
    * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​memory-scaling_memcon13.pdf|Onur Mutlu, "​Memory Scaling: A Systems Architecture Perspective"​ Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}}    * {{http://​users.ece.cmu.edu/​~omutlu/​pub/​memory-scaling_memcon13.pdf|Onur Mutlu, "​Memory Scaling: A Systems Architecture Perspective"​ Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}}
 +   * {{http://​users.ece.cmu.edu/​~kevincha/​papers/​chang_hpca2014.pdf|Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu, "​Improving DRAM Performance by Parallelizing Refreshes with Accesses",​ In HPCA 2014, Orlando, Feb. 2014.}}
 +   * {{http://​users.ece.cmu.edu/​~yoonguk/​papers/​kim-isca14.pdf | Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, Onur Mutlu, "​Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors",​ In ISCA-41, 2014.}}
  
-===== Lecture 2 (1/15 Wed.) =====+===== Lecture 2 (1/14 Wed.) =====
 **Required:​** **Required:​**
   * {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}}   * {{00964437.pdf|Patt,​ Y. (2001). Requirements,​ bottlenecks,​ and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}}
Line 34: Line 41:
   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap1.pdf|P&​P Chapter 1 (Fundamentals)]]   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap1.pdf|P&​P Chapter 1 (Fundamentals)]]
   * P&H Chapters 1 and 2 (Intro, Abstractions,​ ISA, MIPS)   * P&H Chapters 1 and 2 (Intro, Abstractions,​ ISA, MIPS)
 +
  
 **Mentioned during lecture:** **Mentioned during lecture:**
-  * {{gordon_moore_1965_article.pdf|Moore,​ G. E. (1965). Cramming More Components onto Integrated Circuits. Electronics,​ 38(8).}} +  
-  * {{bab6286.0001.001.pdf|Burks,​ A. W., Goldstine, H. H., & Neumann, J. von. (1946). Preliminary discussion of the logical design of an electronic computing instrument.}} +===== Lecture 3 (1/16 Fri.) =====
-  * {{p126-dennis.pdf|Dennis,​ J. B., & Misunas, D. P. (1975). A preliminary architecture for a basic data-flow processor. Proceedings of the 2nd annual symposium on Computer architecture.}} +
-  * {{p34-gurd.pdf|Gurd,​ J. R., Kirkham, C. C., & Watson, I. (1985). The Manchester prototype dataflow computer. Commun. ACM, 28(1).}} +
-  * Kuhn, T. S. (1962). The Structure of Scientific Revolutions. +
-  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap4.pdf|P&​P Chapter 4 (The von Neumann Model)]] +
- +
-===== Lecture 3 (1/17 Fri.) =====+
 **Required:​** **Required:​**
   * Note that you should familiarize yourself with these manuals. Please briefly skim through these manuals as you will probably need to refer to them while working on labs and homework   * Note that you should familiarize yourself with these manuals. Please briefly skim through these manuals as you will probably need to refer to them while working on labs and homework
-  * ARM Architecture Reference Manual +  * MIPS Architecture Reference Manual 
-    * [[https://​www.scss.tcd.ie/​~waldroj/​3d1/​arm_arm.pdf|Manual (5MB)]] +    * {{mips_r4000_users_manual.pdf |Manual ​(the instruction set reference starts on pg.469)}}
-  * ARM Architecture Instruction Quick Reference +
-    * {{arm-instructionset.pdf|Quick Ref (.5MB)}}+
   * Intel® 64 and IA-32 Architectures Software Developer Manual (2013)   * Intel® 64 and IA-32 Architectures Software Developer Manual (2013)
     * [[http://download.intel.com/products/processor/manual/325462.pdf|(15MB) Combined Volumes 1-3]]     * [[http://download.intel.com/products/processor/manual/325462.pdf|(15MB) Combined Volumes 1-3]]
  
 **Mentioned during lecture:** **Mentioned during lecture:**
-  * P&H Chapter 4, Sections 4.1-4.4. +  * {{paper_aklaiber_19jan00.pdf | Klaiber, "The Technology Behind Crusoe™ Processors," Transmeta White Paper, 2000}}
-  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] +
-  * P&P Chapter 5 (The LC3) +
-  * {{p25-patterson.pdf|Patterson, D. A., & Ditzel, D. R. (1980). The case for the reduced instruction set computer. SIGARCH Comput. Archit. News, 8(6).}} +
-  * [[http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html | Koopman, P. (1989) Stack Computers: The New Wave.]] +
-  * {{chapter9.pdf|Levy,​ H. (1984). Capability-Based Computer Systems. Chapter 9. The Intel iAPX 432.}} +
-  * {{p489-wilner.pdf|Wilner,​ W. T. (1972). Design of the Burroughs B1700. Proceedings of the December 5-7, 1972, fall joint computer conference, part I. }}+
  
- +===== Lecture 4 (1/21 Wed.) ===== 
-===== Lecture 4 (1/22 Wed.) ===== +**Required:**
-**Required**+
   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap4.pdf|P&​P Chapter 4 (The von Neumann Model)]]   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​PP_Chap4.pdf|P&​P Chapter 4 (The von Neumann Model)]]
   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixa.pdf|P&​P Appendix A (The LC-3b ISA)]]   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixa.pdf|P&​P Appendix A (The LC-3b ISA)]]
   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]]   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]]
  
-===== Lecture 5 (1/24 Fri.) =====+**Mentioned during lecture:​** 
 + 
 +===== Lecture 5 (1/23 Fri.) =====
 **Required** **Required**
   * None   * None
  
-===== Lecture 6 (1/27 Mon.) =====+===== Lecture 6 (1/26 Mon.) =====
 **Required:​** **Required:​**
   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]]   * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​pp-appendixc.pdf|P&​P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]]
Line 80: Line 75:
   * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}}   * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}}
 **Mentioned during lecture:** **Mentioned during lecture:**
-  * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} 
  
-===== Lecture 7 (1/29 Wed.) =====+===== Lecture 7 (1/28 Wed.) =====
 **Required:​** **Required:​**
   * None   * None
  
 **Mentioned during lecture:** **Mentioned during lecture:**
-  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] +  * {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}}
 +  * [[http://research.microsoft.com/pubs/68221/acrobat.pdf |Butler W. Lampson, “Hints for Computer System Design,” ACM Operating Systems Review, 1983.]]
  
-===== Lecture 8 (1/31 Fri.) =====+===== Lecture 8 (2/2 Mon.) =====
 **Required:​** **Required:​**
-  * None+  * P&H Sections 4.9-4.11 
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} 
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} 
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} 
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} 
 + 
 +**Mentioned during lecture:​** 
 +  * {{p16-pettis.pdf|Pettis,​ K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}}
  
-===== Lecture 9 (2/3 Mon.) =====+===== Lecture 9 (2/4 Wed.) =====
 **Required:​** **Required:​**
   * P&H Sections 4.9-4.11   * P&H Sections 4.9-4.11
   * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}   * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}}
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
  
 **Mentioned during lecture:** **Mentioned during lecture:**
-  * {{p177-allen.pdf|Allen, J. R., Kennedy, K., Porterfield, C., & Warren, J. (1983). Conversion of control dependence to data dependence. Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages.}} +  * {{flash-memory-data-retention_hpca15.pdf|Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery, HPCA 2015.}}
-  * {{24400043.pdf|Kim, H., Mutlu, O., Stark, J., & Patt, Y. N. (2005). Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}} +  * {{adaptive-latency-dram_hpca15.pdf|Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu, Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case, HPCA 2015.}}
-  * {{thornton_-_1964_-_parallel_operation_in_the_control_data_6600.pdf|Thornton, J. E. (1964). Parallel Operation in the Control Data 6600. Proceedings of the Fall Joint Computer Conference.}} +  * {{compression-aware-cache-management_hpca15.pdf|Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip P. Gibbons, Michael A. Kozuch, and Todd C. Mowry, Exploiting Compressed Block Size as an Indicator of Future Reuse, HPCA 2015.}}
-  * {{smith78_hep.pdf|Smith, B. J. (1978). A pipelined, shared resource MIMD computer. International Conference on Parallel Processing.}} +
-  * {{p16-pettis.pdf|Pettis, K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}}
  
-===== Lecture 10 (2/5 Wed.) ===== 
  
 +===== Lecture 10 (2/6 Fri.) =====
 **Required:​** **Required:​**
 +  * P&H Sections 4.9-4.11
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
   * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}}   * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}}
   * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}   * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
-**Mentioned ​during lecture:**+ 
 +**Mentioned ​in the Lecture:**
   * {{p300-ball.pdf|Ball,​ T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}}   * {{p300-ball.pdf|Ball,​ T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}}
   * {{p135-smith.pdf|Smith,​ J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}}   * {{p135-smith.pdf|Smith,​ J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}}
Line 117: Line 124:
   * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}}   * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}}
   * {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}}   * {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}}
- 
-===== Lecture 11 (2/12 Wed.) ===== 
-** Required ** 
-  * None 
- 
-** Mentioned during the lecture ** 
   * {{p274-chang.pdf|Po-Yung Chang, Eric Hao, and Yale N. Patt. 1997. Target prediction for indirect jumps. ISCA'​97.}}   * {{p274-chang.pdf|Po-Yung Chang, Eric Hao, and Yale N. Patt. 1997. Target prediction for indirect jumps. ISCA'​97.}}
   * {{kim_isca07.pdf|Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. ISCA'​07}}   * {{kim_isca07.pdf|Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. ISCA'​07}}
  
-===== Lecture ​12 (2/14 Fri.) ===== +===== Lecture ​11 (2/11 Wed.) ===== 
-** Required ** +**Required:** 
-  * P&H Sections 4.9-4.11 +
-  * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} +**Mentioned in the Lecture:**
 +  * {{p18-hwu.pdf|Hwu and Patt (1987). Checkpoint Repair for Out-of-order Execution Machines.}}
   * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}   * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
 +  * {{ogehl.pdf | Seznec (2005). Analysis of the O-GEometric History Length Branch Predictor. ISCA}}
 +  * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}}
  
-===== Lecture ​13 (2/17 Mon.) ===== +===== Lecture ​12 (2/13 Fri.) ===== 
-** Required ** +**Required:** 
-  * none+  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
  
-===== Lecture ​14 (2/19 Wed.) ===== +===== Lecture ​13 (2/16 Mon.) ===== 
-** Required ** +**Required:** 
-  * {{p18-hwu.pdf|Hwu, W. W., & Patt, Y. N. (1987). Checkpoint repair for out-of-order execution machines. Proceedings of the 14th annual international symposium on Computer architecture.}} +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
   * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}   * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
-  * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} +  * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}}
  
-===== Lecture ​15 (2/21 Fri.) =====+===== Lecture ​14 (2/18 Wed.) =====
 ** Required ** ** Required **
   * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}   * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
Line 161: Line 165:
   * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}   * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}
  
-===== Lecture ​18 (2/28 Fri.) =====+===== Lecture ​15 (2/20 Fri.) ===== 
 +** Required ** 
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} 
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} 
 **Mentioned during lecture:** **Mentioned during lecture:**
 +  * {{30470407.pdf|Fung,​ W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}}
 +  * {{p199-smith.pdf|Smith,​ J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}}
 +  * {{00030730.pdf|Smith,​ J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}}
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}}
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}
 +  * {{jog_orchestrated.pdf|Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. ISCA '13}}
 +  * {{large-gpu-warps_micro11|Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling. MICRO-44}}
 +
 +===== Lecture 16 (2/23 Mon.) =====
 +**Mentioned during lecture:**
 +  * {{:​mise-predictable_memory_performance-hpca13.pdf|Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013}}
 +  * [[http://​users.ece.cmu.edu/​~omutlu/​pub/​mph_usenix_security07.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.]]
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}}
   * {{01675827.pdf|Fisher,​ J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}}   * {{01675827.pdf|Fisher,​ J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}}
   * {{2fbf01205185.pdf|Hwu,​ W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}}   * {{2fbf01205185.pdf|Hwu,​ W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}}
Line 169: Line 194:
   * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao,​ E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}}   * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao,​ E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}}
   * {{00877947.pdf|Huck,​ J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}}   * {{00877947.pdf|Huck,​ J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}}
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}
 +  *  {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}}
 +  * {{:​ilp_history_overview_perspective.pdf|Rau and Fisher, “Instruction-level parallel processing:​ history,​ overview, and perspective,​” Journal of Supercomputing,​ 1993.}}
 +  * {{:ieee_proceedings_2001_-_compiler_techniques.pdf|Faraboschi et al., “Instruction Scheduling for Instruction Level Parallel Processors,” Proc. IEEE, Nov. 2001.}}
  
-===== Lecture ​19 (3/19 Wed.) =====+===== Lecture ​17 (2/25 Wed.) =====
 **Required:​** **Required:​**
 +  * {{00877947.pdf|Huck,​ J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}}
   * P&H Chapters 5.1-5.3 (cache chapters)   * P&H Chapters 5.1-5.3 (cache chapters)
   * Hamacher et al. Chapters 8.1-8.7 (cache/​memory chapters)   * Hamacher et al. Chapters 8.1-8.7 (cache/​memory chapters)
   * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes,​ M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}   * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes,​ M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
 +  * {{:liptay68.pdf|Liptay, “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968.}}
  
-===== Lecture ​20 (3/21 Fri.) ===== +===== Lecture ​18 (2/27 Fri.) ===== 
-** Mentioned in the Lecture** +**Required:** 
-  * {{26080167.pdf|Qureshi, M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}} +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
-  * {{05388441.pdf|Belady, L. A. (1966). A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.}} +  * {{A_Case_For_MLP_Aware_Cache_Replacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.}}
 +  * P&H Chapters 5.1-5.3 (cache chapters)
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters)
  
-===== Lecture 21 (3/24 Mon.) ===== +**Mentioned During Lecture:** 
-** Required ** +  * {{A_Study_of_replacement_algorithms_for_a_virtual-storage_computer.pdf|Belady, “A study of replacement algorithms for a virtual-storage computer,” IBM Systems Journal, 1966.}}
-  * {{26080167.pdf|Qureshi,​ M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}} +
-  * {{05388441.pdf|Belady, L. A. (1966). A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.}} +
- +
- +
-===== Lecture 22 (3/26 Wed.) ===== +
-** Recommended:​ ** +
-  * {{p6-bell.pdf|Bell,​ G., & Strecker, W. D. (1998). Retrospective:​ what have we learned from the PDP-11&​mdash;​what we have learned from VAX and Alpha. 25 years of the international symposia on Computer architecture (selected papers).}} +
-  * {{p1-bell.pdf|Bell,​ G., & Strecker, W. D. (1976). Computer structures: What have we learned from the PDP-11? Proceedings of the 3rd annual symposium on Computer architecture.}} +
- +
-** Mentioned during lecture: ** +
-  * {{TLDRAM-Lee.pdf|Lee et al., Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​ HPCA 2013.}} +
-  * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}} +
-  * {{2012_isca_salp.pdf|Kim et al., “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.}} +
-  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} +
-  * {{moscibroda.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} +
-  * {{30470146.pdf|Mutlu,​ O., & Moscibroda, T. (2007). Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 146–160).}} +
-  * {{3174a063.pdf|Mutlu,​ O., & Moscibroda, T. (2008). Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. Proceedings of the 35th Annual International Symposium on Computer Architecture.}} +
-  * {{4299a065.pdf|Kim, Y., Papamichael, M., Mutlu, O., & Harchol-Balter, M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara,​ S. P., Subramanian,​ L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{p335-ebrahimi.pdf|Ebrahimi,​ E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} +
-  * {{p362-ebrahimi.pdf|Ebrahimi,​ E., Miftakhutdinov,​ R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} +
- +
-===== Lecture 24 (3/31 Mon.) ===== +
- +
-** Recommended:​ ** +
-  * {{p6-bell.pdf|Bell,​ G., & Strecker, W. D. (1998). Retrospective:​ what have we learned from the PDP-11&​mdash;​what we have learned from VAX and Alpha. 25 years of the international symposia on Computer architecture (selected papers).}} +
-  * {{p1-bell.pdf|Bell,​ G., & Strecker, W. D. (1976). Computer structures: What have we learned from the PDP-11? Proceedings of the 3rd annual symposium on Computer architecture.}} +
- +
-** Mentioned during lecture: ** +
-  * {{TLDRAM-Lee.pdf|Lee et al., Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​ HPCA 2013.}} +
-  * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}} +
-  * {{2012_isca_salp.pdf|Kim et al., “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.}} +
-  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} +
-  * {{moscibroda.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} +
-  * {{30470146.pdf|Mutlu,​ O., & Moscibroda, T. (2007). Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 146–160).}} +
-  * {{3174a063.pdf|Mutlu,​ O., & Moscibroda, T. (2008). Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. Proceedings of the 35th Annual International Symposium on Computer Architecture.}} +
-  * {{4299a065.pdf|Kim,​ Y., Papamichael,​ M., Mutlu, O., & Harchol-Balter,​ M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara,​ S. P., Subramanian,​ L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{p335-ebrahimi.pdf|Ebrahimi,​ E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} +
-  * {{p362-ebrahimi.pdf|Ebrahimi,​ E., Miftakhutdinov,​ R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} +
- +
-===== Lecture 25 (4/2 Wed.) ===== +
- +
-** Mentioned during lecture: ** +
-  * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}} +
-  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} +
-  * {{4299a065.pdf|Kim,​ Y., Papamichael,​ M., Mutlu, O., & Harchol-Balter,​ M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara,​ S. P., Subramanian,​ L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{p335-ebrahimi.pdf|Ebrahimi,​ E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} +
-  * {{p362-ebrahimi.pdf|Ebrahimi,​ E., Miftakhutdinov,​ R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{isca08_ipek.pdf|Ipek, E., Mutlu, O., Martinez, J., Caruana, R. (2008). Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. Proceedings of the 35th Annual International Symposium on Computer Architecture.}}+
  
 +===== Lecture 19 (3/2 Mon.) =====
 +**Required:​**
 +  * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}}
 +  * {{A_Case_For_MLP_Aware_Cache_Replacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,” ISCA 2006.}}
 +  * P&H Chapters 5.1-5.3 (cache chapters)
 +  * Hamacher et al. Chapters 8.1-8.7 (cache/​memory chapters)
  
-===== Lecture 25 (4/7 Mon.) ===== +**Mentioned During ​Lecture:**
-** Required: ** +
-  * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}} +
-  * {{04147648.pdf|Srinath,​ S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching:​ Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}} +
- +
-** Recommended:​ ** +
-  * {{24400233.pdf|Mutlu,​ O., Kim, H., & Patt, Y. N. (2005). Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}} +
-  * {{01603492.pdf|Mutlu,​ O., Kim, H., & Patt, Y. N. (2006). Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance. IEEE Micro.}} +
-  * {{21260119.pdf|Armstrong,​ D. N., Kim, H., Mutlu, O., & Patt, Y. N. (2004). Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery. Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture.}} +
- +
-===== Lecture ​27 (4/8 Wed.) ===== +
-** Required: ** +
-  * None +
-** Mentioned during lecture: ** +
-  * {{p176-baer.pdf|Baer,​ J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}}+
   * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi,​ N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}   * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi,​ N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}
-  * {{mowry_lam_gupta_-_1992_-_design_and_evaluation_of_a_compiler_algorithm_for_prefetching.pdf|Mowry,​ T. C., Lam, M. S., & Gupta, A. (1992). Design and evaluation of a compiler algorithm for prefetching. Proceedings of the fifth international conference on Architectural support for programming languages and operating systems.}} 
- 
- 
-===== Lecture 28 (4/14 Mon.) ===== 
-** Required: ** 
-  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} 
-  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport,​ L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} 
-  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​culler-mesi.pdf|C&​S,​ Chapters 5.1 & 5.3]] 
-  * P&H, Chapter 5.8 
-** Recommended:​ ** 
-  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_309_314.pdf|Hill,​ Jouppi, Sohi. "​Multiprocessors and Multicomputers,"​ pp. 551-560 in Readings in Computer Architecture.]] 
-  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_551_560.pdf|Hill,​ Jouppi, Sohi. "​Dataflow and Multithreading,"​ pp. 309-314 in Readings in Computer Architecture.]] 
-  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}} 
-  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}} 
-** Mentioned during lecture: ** 
-  * {{p176-baer.pdf|Baer,​ J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}} 
-  * {{04147648.pdf|Srinath,​ S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching:​ Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}} 
-  * {{joseph_grunwald_-_1997_-_prefetching_using_markov_predictors.pdf|Joseph,​ D., & Grunwald, D. (1997). Prefetching using Markov predictors. Proceedings of the 24th annual international symposium on Computer architecture.}} 
-  * {{p279-cooksey.pdf|Cooksey,​ R., Jourdan, S., & Grunwald, D. (2002). A stateless, content-directed data prefetching mechanism. Proceedings of the 10th international conference on Architectural support for programming languages and operating systems.}} 
-  * {{04798232.pdf|Ebrahimi,​ E., Mutlu, O., & Patt, Y. N. (2009). Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. High Performance Computer Architecture,​ 2009.}} 
-  * {{p186-chappell.pdf|Chappell,​ R. S., Stark, J., Kim, S. P., Reinhardt, S. K., & Patt, Y. N. (1999). Simultaneous subordinate microthreading (SSMT). Proceedings of the 26th annual international symposium on Computer architecture.}} 
-  * {{p2-zilles.pdf|Zilles,​ C., & Sohi, G. (2001). Execution-based prediction using speculative slices. Proceedings of the 28th annual international symposium on Computer architecture.}} 
-  * {{p40-luk.pdf|Luk,​ C.-K. (2001). Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. Proceedings of the 28th annual international symposium on Computer architecture.}} 
-  * {{p172-zilles.pdf|Zilles,​ C. B., & Sohi, G. S. (2000). Understanding the backward slices of performance degrading instructions. Proceedings of the 27th annual international symposium on Computer architecture.}} 
   * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}}   * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu,​ O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}}
-  * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}}+  * {{seznec_a_case_for_two_way_skewed_associative_caches.pdf|Seznec. A Case for Two-Way Skewed-Associative Caches. ISCA 1993.}} 
 +  * {{seznec_a_case_for_two_way_skewed_associative_caches.pdf|Kroft. Lockup-Free Instruction Fetch/​Prefetch Cache Organization. ISCA 1981.}} 
 +  * {{qureshi_utility_based_cache_partitioning.pdf|Qureshi and Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. MICRO 2006.}} 
 +  * {{suh_new_memory_monitoring_scheme_for_memory_aware_scheduling_and_partitioning.pdf|Suh et al. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. HPCA 2002.}}
  
 +===== Lecture 20 (3/4 Wed.) =====
 +**Required:​**
 +  * Section 5.4 in P&H
 +**Mentioned During Lecture:**
 +  * Section 8.8 in Hamacher et al.
 +  * {{megiddo.pdf|Megiddo and Modha, “ARC: A Self-Tuning,​ Low Overhead Replacement Cache,” FAST 2003.}}
  
-===== Lecture ​29 (4/16 Wed.) ===== +===== Lecture ​21 (3/23 Mon.) ===== 
-** Required: ** +**Required:​** 
-  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} +  * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013. (Sections 1 and 2)}} 
-  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} +  * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} 
-  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]] +  * {{raidr-isca12.pdf| Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. (Sections 1 and 2)}} 
-  * P&H, Chapter 5.8+  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian,​ "The Main Memory System: Challenges and Opportunities,"​ Invited Article in Communications of the Korean Institute of Information Scientists and Engineers ​(KIISE), 2015.}} 
 +**Mentioned During Lecture:** 
 +  * {{flipping_kim-isca14.pdf| Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.}} 
 +  * {{flash-memory-data-retention_hpca15.pdf| Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, 
 +"Data Retention in MLC NAND Flash Memory: Characterization,​ Optimization and Recovery,"​  
 +Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015}} 
 +  * {{coarchitecting-kang.pdf| Kang+, "​Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling"​}}
  
-===== Lecture ​30 (4/18 Fri.) ===== +===== Lecture ​22 (3/25 Wed.) ===== 
-** Required: ** +**Required:​** 
-  * {{LCP.pdf|Pekhimenko ​et al., “Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity ​and Low Latency,” MICRO 2013.}} +  * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency ​and Low Cost DRAM Architecture,” HPCA 2013. (Sections 1 and 2)}} 
-  * {{bdi-compression_pact12.pdf|Pekhimenko ​et al., "​Base-Delta-Immediate Compression:​ Practical Data Compression ​for On-Chip Caches," PACT 2012.}} +  * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} 
-  * {{mise-predictable_memory_performance-hpca13.pdf|Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013.}}  +  * {{raidr_isca12.pdf| Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. (Sections 1 and 2)}} 
- +  * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian, "The Main Memory System: Challenges and Opportunities," Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}}
-===== Lecture 31 (4/28 Mon.) ===== +
-** Required: ** +
-  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} +
-  * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} +
-  * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]] +
-  * P&H, Chapter 5.8 +
-** Recommended:​ ** +
-  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_309_314.pdf|Hill,​ Jouppi, Sohi. "​Multiprocessors ​and Multicomputers," ​pp. 551-560 in Readings in Computer Architecture.]] +
-  * (CMU WebISO) [[http://​www.ece.cmu.edu/​~ece447/​cmu_only/​hill_551_560.pdf|Hill,​ Jouppi, Sohi. "​Dataflow ​and Multithreading," ​pp. 309-314 ​in Readings in Computer Architecture.]] +
-  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings ​of the IEEE.}} +
-  * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos,​ M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings ​of the 11th annual international symposium on Computer architecture.}} +
-** Mentioned during lecture: ** +
-  * {{p168-patel.pdf|Patel,​ J. H. (1979). Processor-memory interconnections for multiprocessors. Proceedings of the 6th annual symposium on Computer architecture.}} +
-  * {{p196-moscibroda.pdf|Moscibroda,​ T., & Mutlu, O. (2009). A case for bufferless routing in on-chip networks. Proceedings of the 36th annual international symposium on Computer architecture.}} +
-  * {{p27-gottlieb.pdf|Gottlieb,​ A., Grishman, R., Kruskal, C. P., McAuliffe, K. P., Rudolph, L., & Snir, M. (1982). The NYU Ultracomputer -- designing a MIMD, shared-memory parallel machine (Extended Abstract). Proceedings of the 9th annual symposium on Computer Architecture.}} +
-  * {{p22-seitz.pdf|Seitz,​ C. L. (1985). The cosmic cube. Commun. ACM.}} +
-  * {{p278-glass.pdf|Glass,​ C. J., & Ni, L. M. (1992). The turn model for adaptive routing. Proceedings of the 19th annual international symposium on Computer architecture.}} +
- +
-===== Lecture 32 (4/30 Wed.) ===== +
-** Required: ** +
-  * None +
- +
-** Mentioned during lecture: ** +
-  * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl,​ G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} +
-  * {{grochowski_et_al._-_2004_-_best_of_both_latency_and_throughput.pdf|Grochowski,​ E., Ronen, R., Shen, J., & Wang, H. (2004). Best of Both Latency ​and Throughput. Proceedings of the IEEE International Conference on Computer Design ​(pp. 236–243).}} +
-  * {{tendler_et_al._-_2002_-_power4_system_microarchitecture.pdf|Tendler, J. M., Dodson, J. S., Fields, J. S., Le, H., & Sinharoy, B. (2002). POWER4 system microarchitecture. IBM J. Res. Dev.}} +
-  * {{01289290.pdf|Kalla,​ R., Sinharoy, B., & Tendler, J. M. (2004). IBM Power5 Chip: A Dual-Core Multithreaded Processor. IEEE Micro.}} +
-  * {{kongetira_aingaran_olukotun_-_2005_-_niagara_a_32-way_multithreaded_sparc_processor.pdf|Kongetira,​ P., Aingaran, K., & Olukotun, K. (2005). Niagara: A 32-Way Multithreaded Sparc Processor. IEEE Micro.}} +
-  * {{p253-suleman.pdf|Suleman,​ M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}} +
-  * {{p441-suleman.pdf|Suleman,​ M. A., Mutlu, O., Joao, J. A., Khubaib, & Patt, Y. N. (2010). Data marshaling for multi-core architectures. Proceedings of the 37th annual international symposium on Computer architecture.}} +
-  * {{p223-joao.pdf|Joao,​ J. A., Suleman, M. A., Mutlu, O., & Patt, Y. N. (2012). Bottleneck identification and scheduling in multithreaded applications. Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems.}} +
- +
-===== Lecture 33 (5/2 Fri.) ===== +
-** Required: ** +
-  * None+
  
-** Mentioned during lecture: ** 
-  * {{raidr-isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} 
-  * {{2012_isca_salp.pdf|Kim et al., “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.}} 
-  * {{TLDRAM-Lee.pdf|Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,​” HPCA 2013.}} 
-  * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} 
-  * {{rowclone_micro13.pdf|Seshadri et al., “RowClone:​ Fast and Efficient In-DRAM Copy and Initialization of Bulk Data,” MICRO 2013.}} 
-  * {{LCP.pdf|Pekhimenko et al., “Linearly Compressed Pages: A Main Memory Compression Framework,​” MICRO 2013.}} 
-  * {{|Chang et al., “Improving DRAM Performance by Parallelizing Refreshes with Accesses,​” HPCA 2014.}} 
-  * {{error-mitigation-for-intermittent-dram-failures_sigmetrics14.pdf|Khan et al., “The Efficacy of Error Mitigation Techniques for DRAM Retention Failures: A Comparative Experimental Study,” SIGMETRICS 2014.}} 
-  * {{luo_dsn14.pdf|Luo et al., “Characterizing Application Memory Error Vulnerability to Optimize Data Center Cost,” DSN 2014.}} 
-  * Kim et al., “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014. 
-  * {{meza_cal12.pdf|Meza et al., “Enabling Efficient and Scalable Hybrid Memories,​” IEEE Comp. Arch. Letters 2012.}} 
-  * {{rowbuffer-aware-caching_iccd12.pdf|Yoon et al., “Row Buffer Locality Aware Caching Policies for Hybrid Memories,​” ICCD 2012.}} 
-  * {{sttram_ispass13.pdf|Kultursay et al., “Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative,” ISPASS 2013.}}
-  * {{meza_weed13.pdf|Meza et al., “A Case for Efficient Hardware-Software Cooperative Management of Storage and Memory,” WEED 2013.}}
-  * {{ISCA09.pdf|Lee et al., “Architecting Phase Change Memory as a Scalable DRAM Alternative,” ISCA 2009.}}
readings.txt · Last modified: 2015/04/13 19:31 by kevincha