This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision Next revision Both sides next revision | ||
readings [2014/02/12 00:18] rachata |
readings [2015/03/27 20:10] kevincha |
||
---|---|---|---|
Line 8: | Line 8: | ||
* **P&H** stands for Patterson & Hennessy's //Computer Organization and Design: The Hardware/Software Interface// | * **P&H** stands for Patterson & Hennessy's //Computer Organization and Design: The Hardware/Software Interface// | ||
- | ===== Lecture 1 (1/13 Mon.) ===== | + | ====== Guides on how to review papers critically ====== |
+ | * Lecture slides: {{onur-447-s15-how-to-do-the-paper-reviews.pdf | pdf}} {{onur-447-s15-how-to-do-the-paper-reviews.ppt | Slides ppt}} | ||
+ | * Example reviews on "Main Memory Scaling: Challenges and Solution Directions" (link to the paper) | ||
+ | * {{review-chapter.pdf | Review 1}} | ||
+ | * {{review-chapter-2.pdf | Review 2}} | ||
+ | * Example review on "Staged memory scheduling: Achieving high performance and scalability in heterogeneous systems" (link to the paper) | ||
+ | * {{review-sms.pdf | Review 1}} | ||
+ | |||
+ | |||
+ | ===== Lecture 1 (1/12 Mon.) ===== | ||
**Required:** | **Required:** | ||
- | * None | + | * For HW1: {{00964437.pdf|Patt, Y. (2001). Requirements, bottlenecks, and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}} |
**Mentioned during lecture:** | **Mentioned during lecture:** | ||
- | |||
* {{bstj29-2-147.pdf|Hamming, R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2).}} | * {{bstj29-2-147.pdf|Hamming, R. W. (1950). Error Detecting and Error Correcting Codes. Bell System Technical Journal, 29(2).}} | ||
* {{youandyourresearch.pdf|Hamming, R. W. (1986). You and Your Research. Transcription of the Bell Communications Research Colloquium Seminar.}} | * {{youandyourresearch.pdf|Hamming, R. W. (1986). You and Your Research. Transcription of the Bell Communications Research Colloquium Seminar.}} | ||
* [[http://www.youtube.com/watch?v=a1zDuOPkMSw|youtube]] | * [[http://www.youtube.com/watch?v=a1zDuOPkMSw|youtube]] | ||
- | * {{05392210.pdf|Amdahl, G. M., Blaauw, G. A., & Brooks, F. P. (1964). Architecture of the IBM system/360. IBM J. Res. Dev., 8(2).}} | ||
* {{p128-rixner.pdf|Rixner, S., Dally, W. J., Kapasi, U. J., Mattson, P., & Owens, J. D. (2000). Memory access scheduling. Proceedings of the 27th annual international symposium on Computer architecture.}} | * {{p128-rixner.pdf|Rixner, S., Dally, W. J., Kapasi, U. J., Mattson, P., & Owens, J. D. (2000). Memory access scheduling. Proceedings of the 27th annual international symposium on Computer architecture.}} | ||
- | * {{us5630096.pdf|William K. Zuravleff, & Robinson, T. (1997). Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order.}} | ||
- | * {{00964437.pdf|Patt, Y. (2001). Requirements, bottlenecks, and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}} | ||
* {{http://users.ece.cmu.edu/~omutlu/pub/mph_usenix_security07.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} | * {{http://users.ece.cmu.edu/~omutlu/pub/mph_usenix_security07.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} | ||
* {{http://research.microsoft.com/pubs/79625/MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors", MICRO 2007. }} | * {{http://research.microsoft.com/pubs/79625/MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors", MICRO 2007. }} | ||
- | * {{http://users.ece.cmu.edu/~omutlu/pub/memory-channel-partitioning-micro11.pdf|Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning", MICRO 2011.}} | + | * {{http://users.ece.cmu.edu/~omutlu/pub/memory-channel-partitioning-micro11.pdf|Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware |
+ | * Memory Channel Partitioning", MICRO 2011.}} | ||
* {{http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} | * {{http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} | ||
* {{http://users.ece.cmu.edu/~omutlu/pub/memory-scaling_memcon13.pdf|Onur Mutlu, "Memory Scaling: A Systems Architecture Perspective" Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}} | * {{http://users.ece.cmu.edu/~omutlu/pub/memory-scaling_memcon13.pdf|Onur Mutlu, "Memory Scaling: A Systems Architecture Perspective" Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}} | ||
+ | * {{http://users.ece.cmu.edu/~kevincha/papers/chang_hpca2014.pdf|Kevin Chang, Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu, "Improving DRAM Performance by Parallelizing Refreshes with Accesses", In HPCA 2014, Orlando, Feb. 2014.}} | ||
+ | * {{http://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf | Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, Onur Mutlu, "Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors", In ISCA-41, 2014.}} | ||
- | ===== Lecture 2 (1/15 Wed.) ===== | + | ===== Lecture 2 (1/14 Wed.) ===== |
**Required:** | **Required:** | ||
* {{00964437.pdf|Patt, Y. (2001). Requirements, bottlenecks, and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}} | * {{00964437.pdf|Patt, Y. (2001). Requirements, bottlenecks, and good fortune: agents for microprocessor evolution. Proceedings of the IEEE.}} | ||
Line 33: | Line 41: | ||
* (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/PP_Chap1.pdf|P&P Chapter 1 (Fundamentals)]] | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/PP_Chap1.pdf|P&P Chapter 1 (Fundamentals)]] | ||
* P&H Chapters 1 and 2 (Intro, Abstractions, ISA, MIPS) | * P&H Chapters 1 and 2 (Intro, Abstractions, ISA, MIPS) | ||
+ | |||
**Mentioned during lecture:** | **Mentioned during lecture:** | ||
- | * {{gordon_moore_1965_article.pdf|Moore, G. E. (1965). Cramming More Components onto Integrated Circuits. Electronics, 38(8).}} | + | |
- | * {{bab6286.0001.001.pdf|Burks, A. W., Goldstine, H. H., & Neumann, J. von. (1946). Preliminary discussion of the logical design of an electronic computing instrument.}} | + | ===== Lecture 3 (1/16 Fri.) ===== |
- | * {{p126-dennis.pdf|Dennis, J. B., & Misunas, D. P. (1975). A preliminary architecture for a basic data-flow processor. Proceedings of the 2nd annual symposium on Computer architecture.}} | + | |
- | * {{p34-gurd.pdf|Gurd, J. R., Kirkham, C. C., & Watson, I. (1985). The Manchester prototype dataflow computer. Commun. ACM, 28(1).}} | + | |
- | * Kuhn, T. S. (1962). The Structure of Scientific Revolutions. | + | |
- | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/PP_Chap4.pdf|P&P Chapter 4 (The von Neumann Model)]] | + | |
- | + | ||
- | ===== Lecture 3 (1/17 Fri.) ===== | + | |
**Required:** | **Required:** | ||
* Note that you should familiarize yourself with these manuals. Please briefly skim through these manuals as you will probably need to refer to them while working on labs and homework | * Note that you should familiarize yourself with these manuals. Please briefly skim through these manuals as you will probably need to refer to them while working on labs and homework | ||
- | * ARM Architecture Reference Manual | + | * MIPS Architecture Reference Manual |
- | * [[https://www.scss.tcd.ie/~waldroj/3d1/arm_arm.pdf|Manual (5MB)]] | + | * {{mips_r4000_users_manual.pdf |Manual (the instruction set reference starts on pg.469)}} |
- | * ARM Architecture Instruction Quick Reference | + | |
- | * {{arm-instructionset.pdf|Quick Ref (.5MB)}} | + | |
* Intel® 64 and IA-32 Architectures Software Developer Manual (2013) | * Intel® 64 and IA-32 Architectures Software Developer Manual (2013) | ||
* [[http://download.intel.com/products/processor/manual/325462.pdf|(15MB) Combined Volumes 1-3]]3 | * [[http://download.intel.com/products/processor/manual/325462.pdf|(15MB) Combined Volumes 1-3]]3 | ||
**Mentioned during lecture:** | **Mentioned during lecture:** | ||
- | * P&H Chapter 4, Sections 4.1-4.4. | + | * {{paper_aklaiber_19jan00.pdf | Klaiber, "The Technology Behind CrusoeTM Processors", Transmeta White Paper, 2000}} |
- | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | + | |
- | * P&P Chapter 5 (The LC3) | + | |
- | * {{p25-patterson.pdf|Patterson, D. A., & Ditzel, D. R. (1980). The case for the reduced instruction set computer. SIGARCH Comput. Archit. News, 8(6).}} | + | |
- | * [[http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html | Koopman, P. (1989) Stack Computers: The New Wave.]] | + | |
- | * {{chapter9.pdf|Levy, H. (1984). Capability-Based Computer Systems. Chapter 9. The Intel iAPX 432.}} | + | |
- | * {{p489-wilner.pdf|Wilner, W. T. (1972). Design of the Burroughs B1700. Proceedings of the December 5-7, 1972, fall joint computer conference, part I. }} | + | |
- | + | ===== Lecture 4 (1/21 Wed.) ===== | |
- | ===== Lecture 4 (1/22 Wed.) ===== | + | **Required:** |
- | **Required** | + | |
* (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/PP_Chap4.pdf|P&P Chapter 4 (The von Neumann Model)]] | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/PP_Chap4.pdf|P&P Chapter 4 (The von Neumann Model)]] | ||
* (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixa.pdf|P&P Appendix A (The LC-3b ISA)]] | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixa.pdf|P&P Appendix A (The LC-3b ISA)]] | ||
* (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | ||
- | ===== Lecture 5 (1/24 Fri.) ===== | + | **Mentioned during lecture:** |
+ | |||
+ | ===== Lecture 5 (1/23 Fri.) ===== | ||
**Required** | **Required** | ||
* None | * None | ||
- | ===== Lecture 6 (1/27 Mon.) ===== | + | ===== Lecture 6 (1/26 Mon.) ===== |
**Required:** | **Required:** | ||
* (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | ||
Line 79: | Line 75: | ||
* {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} | * {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} | ||
**Mentioned during lecture:** | **Mentioned during lecture:** | ||
- | * {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} | ||
- | ===== Lecture 7 (1/29 Wed.) ===== | + | ===== Lecture 7 (1/28 Wed.) ===== |
**Required:** | **Required:** | ||
* None | * None | ||
**Mentioned during lecture:** | **Mentioned during lecture:** | ||
- | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | + | * {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} |
+ | * [[http://research.microsoft.com/pubs/68221/acrobat.pdf |Butler W. Lampson, “Hints for Computer System Design,” ACM Operating Systems Review, 1983.]] | ||
- | ===== Lecture 8 (1/31 Fri.) ===== | + | ===== Lecture 8 (2/2 Mon.) ===== |
**Required:** | **Required:** | ||
- | * None | + | * P&H Sections 4.9-4.11 |
+ | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | ||
+ | * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling, S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} | ||
+ | * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | ||
- | ===== Lecture 9 (2/3 Mon.) ===== | + | **Mentioned during lecture:** |
+ | * {{p16-pettis.pdf|Pettis, K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}} | ||
+ | |||
+ | ===== Lecture 9 (2/4 Wed.) ===== | ||
**Required:** | **Required:** | ||
* P&H Sections 4.9-4.11 | * P&H Sections 4.9-4.11 | ||
* {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | ||
+ | * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling, S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} | ||
+ | * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | ||
**Mentioned during lecture:** | **Mentioned during lecture:** | ||
- | * {{p177-allen.pdf|Allen, J. R., Kennedy, K., Porterfield, C., & Warren, J. (1983). Conversion of control dependence to data dependence. Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages.}} | + | * {{flash-memory-data-retention_hpca15.pdf|Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery, HPCA 2015.}} |
- | * {{24400043.pdf|Kim, H., Mutlu, O., Stark, J., & Patt, Y. N. (2005). Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}} | + | * {{adaptive-latency-dram_hpca15.pdf|Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu, Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case, HPCA 2015.}} |
- | * {{thornton_-_1964_-_parallel_operation_in_the_control_data_6600.pdf|Thornton, J. E. (1964). Parallel Operation in the Control Data 6600. Proceedings of the Fall Joint Computer Conference.}} | + | * {{compression-aware-cache-management_hpca15.pdf|Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip P. Gibbons, Michael A. Kozuch, and Todd C. Mowry, Exploiting Compressed Block Size as an Indicator of Future Reuse, HPCA 2015.}} |
- | * {{smith78_hep.pdf|Smith, B. J. (1978). A pipelined, shared resource MIMD computer. International Conference on Parallel Processing.}} | + | |
- | * {{p16-pettis.pdf|Pettis, K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}} | + | |
- | ===== Lecture 10 (2/5 Wed.) ===== | ||
+ | ===== Lecture 10 (2/6 Fri.) ===== | ||
**Required:** | **Required:** | ||
+ | * P&H Sections 4.9-4.11 | ||
+ | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | ||
* {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling, S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} | * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling, S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} | ||
* {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | ||
- | **Mentioned during lecture:** | + | |
+ | **Mentioned in the Lecture:** | ||
* {{p300-ball.pdf|Ball, T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}} | * {{p300-ball.pdf|Ball, T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}} | ||
* {{p135-smith.pdf|Smith, J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}} | * {{p135-smith.pdf|Smith, J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}} | ||
Line 116: | Line 124: | ||
* {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}} | * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}} | ||
* {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}} | * {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}} | ||
+ | * {{p274-chang.pdf|Po-Yung Chang, Eric Hao, and Yale N. Patt. 1997. Target prediction for indirect jumps. ISCA'97.}} | ||
+ | * {{kim_isca07.pdf|Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. ISCA'07}} | ||
- | ===== Lecture 11 (2/12 Wed.) ===== | + | ===== Lecture 11 (2/11 Wed.) ===== |
+ | **Required:** | ||
+ | |||
+ | **Mentioned in the Lecture:** | ||
+ | * {{p18-hwu.pdf|Hwu and Patt (1987). Checkpoint Repair for Out-of-order Execution Machines.}} | ||
+ | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | ||
+ | * {{ogehl.pdf | Seznec (2005). Analysis of the O-GEometric History Length Branch Predictor. ISCA}} | ||
+ | * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}} | ||
+ | |||
+ | ===== Lecture 12 (2/13 Fri.) ===== | ||
+ | **Required:** | ||
+ | * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | ||
+ | |||
+ | ===== Lecture 13 (2/16 Mon.) ===== | ||
+ | **Required:** | ||
+ | * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | ||
+ | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} | ||
+ | * {{p50-fatahalian.pdf|Fatahalian, K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} | ||
+ | |||
+ | ===== Lecture 14 (2/18 Wed.) ===== | ||
** Required ** | ** Required ** | ||
- | * None | + | * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} |
+ | * {{p50-fatahalian.pdf|Fatahalian, K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} | ||
- | ===== Lecture 12 (2/14 Fri.) ===== | + | **Mentioned during lecture:** |
+ | * {{30470407.pdf|Fung, W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{p253-suleman.pdf |Suleman, M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}} | ||
+ | * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}} | ||
+ | * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} | ||
+ | * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith, J. E. (1982). Decoupled access/execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} | ||
+ | * {{p289-smith.pdf|Smith, J. E. (1984). Decoupled access/execute computer architectures. ACM Trans. Comput. Syst.}} | ||
+ | * {{p199-smith.pdf|Smith, J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}} | ||
+ | * {{00030730.pdf|Smith, J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}} | ||
+ | * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung, H. T. (1982). Why Systolic Architectures? IEEE Computer.}} | ||
+ | * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} | ||
+ | * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers.}} | ||
+ | |||
+ | ===== Lecture 15 (2/20 Fri.) ===== | ||
** Required ** | ** Required ** | ||
- | * P&H Sections 4.9-4.11 | + | * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} |
- | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | + | * {{p50-fatahalian.pdf|Fatahalian, K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} |
- | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | + | |
+ | **Mentioned during lecture:** | ||
+ | * {{30470407.pdf|Fung, W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} | ||
+ | * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith, J. E. (1982). Decoupled access/execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} | ||
+ | * {{p289-smith.pdf|Smith, J. E. (1984). Decoupled access/execute computer architectures. ACM Trans. Comput. Syst.}} | ||
+ | * {{p199-smith.pdf|Smith, J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}} | ||
+ | * {{00030730.pdf|Smith, J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}} | ||
+ | * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung, H. T. (1982). Why Systolic Architectures? IEEE Computer.}} | ||
+ | * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} | ||
+ | * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers.}} | ||
+ | * {{jog_orchestrated.pdf|Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. ISCA '13}} | ||
+ | * {{large-gpu-warps_micro11|Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov, Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling.MICRO-44}} | ||
+ | |||
+ | ===== Lecture 16 (2/23 Mon.) ===== | ||
+ | **Mentioned during lecture:** | ||
+ | * {{:mise-predictable_memory_performance-hpca13.pdf|Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013}} | ||
+ | * [[http://users.ece.cmu.edu/~omutlu/pub/mph_usenix_security07.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.]] | ||
+ | * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung, H. T. (1982). Why Systolic Architectures? IEEE Computer.}} | ||
+ | * {{01675827.pdf|Fisher, J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}} | ||
+ | * {{2fbf01205185.pdf|Hwu, W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}} | ||
+ | * {{p45-mahlke.pdf|Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., & Bringmann, R. A. (1992). Effective compiler support for predicated execution using the hyperblock. Proceedings of the 25th annual international symposium on Microarchitecture.}} | ||
+ | * {{melvin_patt_-_1995_-_enhancing_instruction_scheduling_with_a_block-structured_isa.pdf|Melvin, S., & Patt, Y. (1995). Enhancing instruction scheduling with a block-structured ISA. Int. J. Parallel Program.}} | ||
+ | * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao, E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}} | ||
+ | * {{00877947.pdf|Huck, J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}} | ||
+ | * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} | ||
+ | * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers.}} | ||
+ | * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} | ||
+ | * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith, J. E. (1982). Decoupled access/execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} | ||
+ | * {{p289-smith.pdf|Smith, J. E. (1984). Decoupled access/execute computer architectures. ACM Trans. Comput. Syst.}} | ||
+ | * {{:ilp_history_overview_perspective.pdf|Rau and Fisher, “Instruction-level parallel processing: history, overview, and perspective,” Journal of Supercomputing, 1993.}} | ||
+ | * {{:ieee_proceedings_2001_-_compiler_techniques.pdf|Faraboschi et al., “Instruction Scheduling for Instruction Level Parallel Processors,” Proc. IEEE, Nov. 2001. | ||
+ | }} | ||
+ | |||
+ | ===== Lecture 17 (2/25 Wed.) ===== | ||
+ | **Required:** | ||
+ | * {{00877947.pdf|Huck, J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}} | ||
+ | * P&H Chapters 5.1-5.3 (cache chapters) | ||
+ | * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters) | ||
+ | * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} | ||
+ | * {{:liptay68.pdf|Liptay, “Structural aspects of the System/360 Model 85 II: the cache,” IBM Systems Journal, 1968. | ||
+ | }} | ||
+ | |||
+ | ===== Lecture 18 (2/27 Fri.) ===== | ||
+ | **Required:** | ||
+ | * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1964). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} | ||
+ | * {{A_Case_For_MLP_Aware_Cache_Replacement.pdf|nQureshi et al., “A Case for MLP-Aware Cache Replacement,“ ISCA 2006.}} | ||
+ | * P&H Chapters 5.1-5.3 (cache chapters) | ||
+ | * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters) | ||
+ | |||
+ | **Mentioned During Lecture:** | ||
+ | * {{A_Study_of_replacement_algorithms_for_a_virtual-storage_computer.pdf|qBelady, “A study of replacement algorithms for a virtual-storage computer,” IBM Systems Journal, 1966.}} | ||
+ | |||
+ | ===== Lecture 19 (3/2 Mon.) ===== | ||
+ | **Required:** | ||
+ | * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1964). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} | ||
+ | * {{A_Case_For_MLP_Aware_Cache_Replacement.pdf|Qureshi et al., “A Case for MLP-Aware Cache Replacement,“ ISCA 2006.}} | ||
+ | * P&H Chapters 5.1-5.3 (cache chapters) | ||
+ | * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters) | ||
+ | |||
+ | **Mentioned During Lecture:** | ||
+ | * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}} | ||
+ | * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu, O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}} | ||
+ | * {{seznec_a_case_for_two_way_skewed_associative_caches.pdf|Seznec. A Case for Two-Way Skewed-Associative Caches. ISCA 1993.}} | ||
+ | * {{seznec_a_case_for_two_way_skewed_associative_caches.pdf|Kroft. Lockup-Free Instruction Fetch/Prefetch Cache Organization. ISCA 1981.}} | ||
+ | * {{qureshi_utility_based_cache_partitioning.pdf|Qureshi and Patt. Utility-Based Cache Partitioning: A Low-Overhead, High-Performance, Runtime Mechanism to Partition Shared Caches. MICRO 2006.}} | ||
+ | * {{suh_new_memory_monitoring_scheme_for_memory_aware_scheduling_and_partitioning.pdf|Suh et al. A New Memory Monitoring Scheme for Memory-Aware Scheduling and Partitioning. HPCA 2002.}} | ||
+ | |||
+ | ===== Lecture 20 (3/4 Wed.) ===== | ||
+ | **Required:** | ||
+ | * Section 5.4 in P&H | ||
+ | **Mentioned During Lecture:** | ||
+ | * Section 8.8 in Hamacher et al. | ||
+ | * {{megiddo.pdf|Megiddo and Modha, “ARC: A Self-Tuning, Low Overhead Replacement Cache,” FAST 2003.}} | ||
+ | |||
+ | ===== Lecture 21 (3/23 Mon.) ===== | ||
+ | **Required:** | ||
+ | * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013. (Sections 1 and 2)}} | ||
+ | * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} | ||
+ | * {{raidr-isca12.pdf| Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. (Sections 1 and 2)}} | ||
+ | * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian, "The Main Memory System: Challenges and Opportunities," Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} | ||
+ | **Mentioned During Lecture:** | ||
+ | * {{flipping_kim-isca14.pdf| Kim+, “Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors,” ISCA 2014.}} | ||
+ | * {{flash-memory-data-retention_hpca15.pdf| Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, | ||
+ | "Data Retention in MLC NAND Flash Memory: Characterization, Optimization and Recovery," | ||
+ | Proceedings of the 21st International Symposium on High-Performance Computer Architecture (HPCA), Bay Area, CA, February 2015. }} | ||
+ | * {{coarchitecting-kang.pdf| Kang+, "Co-Architecting Controllers and DRAM to Enhance DRAM Process Scaling"}} | ||
+ | |||
+ | ===== Lecture 22 (3/25 Wed.) ===== | ||
+ | **Required:** | ||
+ | * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013. (Sections 1 and 2)}} | ||
+ | * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} | ||
+ | * {{raidr_isca12.pdf| Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. (Sections 1 and 2)}} | ||
+ | * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian, "The Main Memory System: Challenges and Opportunities," Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} | ||
+ | **Mentioned During Lecture:** | ||
+ | * {{moscibroda.pdf| Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,”USENIX Security 2007}} | ||
+ | * {{moscibroda2007.pdf| Onur Mutluand Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors" ISCA 2008}} | ||
+ | * {{isca08.pdf| Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM," ISCA 2008}} | ||
+ | * {{thread_cluster_mem_sched.pdf| Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior" 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010}} | ||
+ | * {{ATLAS.pdf| Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter, "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers"}} | ||
+ | |||
+ | ===== Lecture 22 (3/25 Wed.) ===== | ||
+ | **Required:** | ||
+ | * {{tldram-lee.pdf| Lee et al., “Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture,” HPCA 2013. (Sections 1 and 2)}} | ||
+ | * {{2012_isca_salp.pdf| Kim et al., “A Case for Subarray-Level Parallelism (SALP) in DRAM,” ISCA 2012. (Sections 1 and 2)}} | ||
+ | * {{raidr_isca12.pdf| Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012. (Sections 1 and 2)}} | ||
+ | * {{main-memory-system_kiise15.pdf| Onur Mutlu, Justin Meza, and Lavanya Subramanian, "The Main Memory System: Challenges and Opportunities," Invited Article in Communications of the Korean Institute of Information Scientists and Engineers (KIISE), 2015.}} | ||
+ | **Mentioned During Lecture:** | ||
+ | * {{moscibroda.pdf| Moscibroda and Mutlu, “Memory performance attacks: Denial of memory service in multi-core systems,”USENIX Security 2007}} | ||
+ | * {{moscibroda2007.pdf| Onur Mutluand Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors" ISCA 2008}} | ||
+ | * {{isca08.pdf| Onur Mutlu and Thomas Moscibroda, "Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM," ISCA 2008}} | ||
+ | * {{thread_cluster_mem_sched.pdf| Yoongu Kim, Michael Papamichael, Onur Mutlu, and Mor Harchol-Balter, "Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior" 43rd International Symposium on Microarchitecture (MICRO), pages 65-76, Atlanta, GA, December 2010}} | ||
+ | * {{ATLAS.pdf| Yoongu Kim, Dongsu Han, Onur Mutlu, Mor Harchol-Balter, "ATLAS: A Scalable and High-Performance Scheduling Algorithm for Multiple Memory Controllers"}} |