This shows you the differences between two versions of the page.
readings [2014/01/15 20:59] rachata |
readings [2014/04/16 14:12] rachata |
||
---|---|---|---|
Line 23: | Line 23: | ||
* {{http://users.ece.cmu.edu/~omutlu/pub/mph_usenix_security07.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} | * {{http://users.ece.cmu.edu/~omutlu/pub/mph_usenix_security07.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} | ||
* {{http://research.microsoft.com/pubs/79625/MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors", MICRO 2007. }} | * {{http://research.microsoft.com/pubs/79625/MICRO2007.pdf|Onur Mutlu and Thomas Moscibroda, "Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors", MICRO 2007. }} | ||
- | * {{http://users.ece.cmu.edu/~omutlu/pub/memory-channel-partitioning-micro11.pdf|Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning", MICRO 2011.}} | + | * {{http://users.ece.cmu.edu/~omutlu/pub/memory-channel-partitioning-micro11.pdf|Sai Prashanth Muralidhara, Lavanya Subramanian, Onur Mutlu, Mahmut Kandemir, and Thomas Moscibroda, "Reducing Memory Interference in Multicore Systems via Application-Aware Memory Channel Partitioning", MICRO 2011.}} | ||
* {{http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} | * {{http://users.ece.cmu.edu/~omutlu/pub/raidr-dram-refresh_isca12.pdf|Liu et al., “RAIDR: Retention-Aware Intelligent DRAM Refresh,” ISCA 2012.}} | ||
* {{http://users.ece.cmu.edu/~omutlu/pub/memory-scaling_memcon13.pdf|Onur Mutlu, "Memory Scaling: A Systems Architecture Perspective" Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}} | * {{http://users.ece.cmu.edu/~omutlu/pub/memory-scaling_memcon13.pdf|Onur Mutlu, "Memory Scaling: A Systems Architecture Perspective" Technical talk at MemCon 2013 (MEMCON), Santa Clara, CA, August 2013.}} | ||
Line 50: | Line 51: | ||
* {{arm-instructionset.pdf|Quick Ref (.5MB)}} | * {{arm-instructionset.pdf|Quick Ref (.5MB)}} | ||
* Intel® 64 and IA-32 Architectures Software Developer Manual (2013) | * Intel® 64 and IA-32 Architectures Software Developer Manual (2013) | ||
- | * [[http://download.intel.com/products/processor/manual/325462.pdf|(15MB) Combined Volumes 1-3]] | + | * [[http://download.intel.com/products/processor/manual/325462.pdf|(15MB) Combined Volumes 1-3]] |
+ | |||
+ | **Mentioned during lecture:** | ||
+ | * P&H Chapter 4, Sections 4.1-4.4. | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | ||
+ | * P&P Chapter 5 (The LC3) | ||
+ | * {{p25-patterson.pdf|Patterson, D. A., & Ditzel, D. R. (1980). The case for the reduced instruction set computer. SIGARCH Comput. Archit. News, 8(6).}} | ||
+ | * [[http://www.ece.cmu.edu/~koopman/stack_computers/sec3_2.html | Koopman, P. (1989) Stack Computers: The New Wave.]] | ||
+ | * {{chapter9.pdf|Levy, H. (1984). Capability-Based Computer Systems. Chapter 9. The Intel iAPX 432.}} | ||
+ | * {{p489-wilner.pdf|Wilner, W. T. (1972). Design of the Burroughs B1700. Proceedings of the December 5-7, 1972, fall joint computer conference, part I. }} | ||
+ | |||
+ | |||
+ | ===== Lecture 4 (1/22 Wed.) ===== | ||
+ | **Required** | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/PP_Chap4.pdf|P&P Chapter 4 (The von Neumann Model)]] | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixa.pdf|P&P Appendix A (The LC-3b ISA)]] | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | ||
+ | |||
+ | ===== Lecture 5 (1/24 Fri.) ===== | ||
+ | **Required** | ||
+ | * None | ||
+ | |||
+ | ===== Lecture 6 (1/27 Mon.) ===== | ||
+ | **Required:** | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | ||
+ | * P&H Appendix D (Mapping Control to Hardware) | ||
+ | **Optional:** | ||
+ | * {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} | ||
+ | **Mentioned during lecture:** | ||
+ | * {{bestway.pdf|Wilkes, M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}} | ||
+ | |||
+ | ===== Lecture 7 (1/29 Wed.) ===== | ||
+ | **Required:** | ||
+ | * None | ||
+ | |||
+ | **Mentioned during lecture:** | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/pp-appendixc.pdf|P&P Appendix C (The Microarchitecture of the LC-3b, Basic Machine)]] | ||
+ | |||
+ | ===== Lecture 8 (1/31 Fri.) ===== | ||
+ | **Required:** | ||
+ | * None | ||
+ | |||
+ | ===== Lecture 9 (2/3 Mon.) ===== | ||
+ | **Required:** | ||
+ | * P&H Sections 4.9-4.11 | ||
+ | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | |||
+ | **Mentioned during lecture:** | ||
+ | * {{p177-allen.pdf|Allen, J. R., Kennedy, K., Porterfield, C., & Warren, J. (1983). Conversion of control dependence to data dependence. Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages.}} | ||
+ | * {{24400043.pdf|Kim, H., Mutlu, O., Stark, J., & Patt, Y. N. (2005). Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{thornton_-_1964_-_parallel_operation_in_the_control_data_6600.pdf|Thornton, J. E. (1964). Parallel Operation in the Control Data 6600. Proceedings of the Fall Joint Computer Conference.}} | ||
+ | * {{smith78_hep.pdf|Smith, B. J. (1978). A pipelined, shared resource MIMD computer. International Conference on Parallel Processing.}} | ||
+ | * {{p16-pettis.pdf|Pettis, K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}} | ||
+ | |||
+ | ===== Lecture 10 (2/5 Wed.) ===== | ||
+ | |||
+ | **Required:** | ||
+ | * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|McFarling, S. (1993). Combining branch predictors. WRL Technical Note TN-36.}} | ||
+ | * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler, R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}} | ||
+ | **Mentioned during lecture:** | ||
+ | * {{p300-ball.pdf|Ball, T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}} | ||
+ | * {{p135-smith.pdf|Smith, J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}} | ||
+ | * {{yeh_patt_-_1991_-_two-level_adaptive_training_branch_prediction.pdf|Yeh, T.-Y., & Patt, Y. N. (1991). Two-level adaptive training branch prediction. Proceedings of the 24th annual international symposium on Microarchitecture.}} | ||
+ | * {{p22-chang.pdf|Chang, P.-Y., Hao, E., Yeh, T.-Y., & Patt, Y. (1994). Branch classification: a new mechanism for improving branch predictor performance. Proceedings of the 27th annual international symposium on Microarchitecture.}} | ||
+ | * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}} | ||
+ | * {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}} | ||
+ | |||
+ | ===== Lecture 11 (2/12 Wed.) ===== | ||
+ | ** Required ** | ||
+ | * None | ||
+ | |||
+ | ** Mentioned during the lecture ** | ||
+ | * {{p274-chang.pdf|Po-Yung Chang, Eric Hao, and Yale N. Patt. 1997. Target prediction for indirect jumps. ISCA'97.}} | ||
+ | * {{kim_isca07.pdf|Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. ISCA'07}} | ||
+ | |||
+ | ===== Lecture 12 (2/14 Fri.) ===== | ||
+ | ** Required ** | ||
+ | * P&H Sections 4.9-4.11 | ||
+ | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | ||
+ | |||
+ | ===== Lecture 13 (2/17 Mon.) ===== | ||
+ | ** Required ** | ||
+ | * None | ||
+ | |||
+ | ===== Lecture 14 (2/19 Wed.) ===== | ||
+ | ** Required ** | ||
+ | * {{p18-hwu.pdf|Hwu, W. W., & Patt, Y. N. (1987). Checkpoint repair for out-of-order execution machines. Proceedings of the 14th annual international symposium on Computer architecture.}} | ||
+ | * {{00476078.pdf|Smith, J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}} | ||
+ | * {{00004607.pdf|Smith, J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}} | ||
+ | |||
+ | |||
+ | ===== Lecture 15 (2/21 Fri.) ===== | ||
+ | ** Required ** | ||
+ | * {{04523358.pdf|Lindholm, E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}} | ||
+ | * {{p50-fatahalian.pdf|Fatahalian, K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}} | ||
+ | |||
+ | **Mentioned during lecture:** | ||
+ | * {{30470407.pdf|Fung, W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{p253-suleman.pdf |Suleman, M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}} | ||
+ | * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}} | ||
+ | * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher, J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}} | ||
+ | * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith, J. E. (1982). Decoupled access/execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}} | ||
+ | * {{p289-smith.pdf|Smith, J. E. (1984). Decoupled access/execute computer architectures. ACM Trans. Comput. Syst.}} | ||
+ | * {{p199-smith.pdf|Smith, J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectural support for programming languages and operating systems.}} | ||
+ | * {{00030730.pdf|Smith, J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}} | ||
+ | * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung, H. T. (1982). Why Systolic Architectures? IEEE Computer.}} | ||
+ | * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}} | ||
+ | * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone, M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture, implementation, and performance. IEEE Transactions on Computers.}} | ||
+ | |||
+ | ===== Lecture 18 (2/28 Fri.) ===== | ||
+ | **Mentioned during lecture:** | ||
+ | * {{01675827.pdf|Fisher, J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}} | ||
+ | * {{2fbf01205185.pdf|Hwu, W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}} | ||
+ | * {{p45-mahlke.pdf|Mahlke, S. A., Lin, D. C., Chen, W. Y., Hank, R. E., & Bringmann, R. A. (1992). Effective compiler support for predicated execution using the hyperblock. Proceedings of the 25th annual international symposium on Microarchitecture.}} | ||
+ | * {{melvin_patt_-_1995_-_enhancing_instruction_scheduling_with_a_block-structured_isa.pdf|Melvin, S., & Patt, Y. (1995). Enhancing instruction scheduling with a block-structured ISA. Int. J. Parallel Program.}} | ||
+ | * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao, E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}} | ||
+ | * {{00877947.pdf|Huck, J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}} | ||
+ | |||
+ | ===== Lecture 19 (3/19 Wed.) ===== | ||
+ | **Required:** | ||
+ | * P&H Chapters 5.1-5.3 (cache chapters) | ||
+ | * Hamacher et al. Chapters 8.1-8.7 (cache/memory chapters) | ||
+ | * {{wilkes_-_1965_-_slave_memories_and_dynamic_storage_allocation.pdf|Wilkes, M. V. (1965). Slave Memories and Dynamic Storage Allocation. IEEE Transactions on Electronic Computers.}} | ||
+ | |||
+ | ===== Lecture 20 (3/21 Fri.) ===== | ||
+ | ** Mentioned in the Lecture** | ||
+ | * {{26080167.pdf|Qureshi, M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}} | ||
+ | * {{05388441.pdf|Belady, L. A. (1966). A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.}} | ||
+ | |||
+ | ===== Lecture 21 (3/24 Mon.) ===== | ||
+ | ** Required ** | ||
+ | * {{26080167.pdf|Qureshi, M. K., Lynch, D. N., Mutlu, O., & Patt, Y. N. (2006). A Case for MLP-Aware Cache Replacement. Proceedings of the 33rd annual international symposium on Computer Architecture.}} | ||
+ | * {{05388441.pdf|Belady, L. A. (1966). A study of replacement algorithms for a virtual-storage computer. IBM Syst. J.}} | ||
+ | |||
+ | |||
+ | ===== Lecture 22 (3/26 Wed.) ===== | ||
+ | ** Recommended: ** | ||
+ | * {{p6-bell.pdf|Bell, G., & Strecker, W. D. (1998). Retrospective: what have we learned from the PDP-11—what we have learned from VAX and Alpha. 25 years of the international symposia on Computer architecture (selected papers).}} | ||
+ | * {{p1-bell.pdf|Bell, G., & Strecker, W. D. (1976). Computer structures: What have we learned from the PDP-11? Proceedings of the 3rd annual symposium on Computer architecture.}} | ||
+ | |||
+ | ** Mentioned during lecture: ** | ||
+ | * {{TLDRAM-Lee.pdf|Lee et al., Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture, HPCA 2013.}} | ||
+ | * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}} | ||
+ | * {{2012_isca_salp.pdf|Kim et al., “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.}} | ||
+ | * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} | ||
+ | * {{moscibroda.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} | ||
+ | * {{30470146.pdf|Mutlu, O., & Moscibroda, T. (2007). Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 146–160).}} | ||
+ | * {{3174a063.pdf|Mutlu, O., & Moscibroda, T. (2008). Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. Proceedings of the 35th Annual International Symposium on Computer Architecture.}} | ||
+ | * {{4299a065.pdf|Kim, Y., Papamichael, M., Mutlu, O., & Harchol-Balter, M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara, S. P., Subramanian, L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{p335-ebrahimi.pdf|Ebrahimi, E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} | ||
+ | * {{p362-ebrahimi.pdf|Ebrahimi, E., Miftakhutdinov, R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | |||
+ | ===== Lecture 24 (3/31 Mon.) ===== | ||
+ | |||
+ | ** Recommended: ** | ||
+ | * {{p6-bell.pdf|Bell, G., & Strecker, W. D. (1998). Retrospective: what have we learned from the PDP-11—what we have learned from VAX and Alpha. 25 years of the international symposia on Computer architecture (selected papers).}} | ||
+ | * {{p1-bell.pdf|Bell, G., & Strecker, W. D. (1976). Computer structures: What have we learned from the PDP-11? Proceedings of the 3rd annual symposium on Computer architecture.}} | ||
+ | |||
+ | ** Mentioned during lecture: ** | ||
+ | * {{TLDRAM-Lee.pdf|Lee et al., Tiered-Latency DRAM: A Low Latency and Low Cost DRAM Architecture, HPCA 2013.}} | ||
+ | * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}} | ||
+ | * {{2012_isca_salp.pdf|Kim et al., “A Case for Exploiting Subarray-Level Parallelism in DRAM,” ISCA 2012.}} | ||
+ | * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} | ||
+ | * {{moscibroda.pdf|Moscibroda, T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.}} | ||
+ | * {{30470146.pdf|Mutlu, O., & Moscibroda, T. (2007). Stall-Time Fair Memory Access Scheduling for Chip Multiprocessors. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture (pp. 146–160).}} | ||
+ | * {{3174a063.pdf|Mutlu, O., & Moscibroda, T. (2008). Parallelism-Aware Batch Scheduling: Enhancing both Performance and Fairness of Shared DRAM Systems. Proceedings of the 35th Annual International Symposium on Computer Architecture.}} | ||
+ | * {{4299a065.pdf|Kim, Y., Papamichael, M., Mutlu, O., & Harchol-Balter, M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara, S. P., Subramanian, L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{p335-ebrahimi.pdf|Ebrahimi, E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} | ||
+ | * {{p362-ebrahimi.pdf|Ebrahimi, E., Miftakhutdinov, R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | |||
+ | ===== Lecture 25 (4/2 Wed.) ===== | ||
+ | |||
+ | ** Mentioned during lecture: ** | ||
+ | * {{raidr-isca12.pdf|Liu et al., RAIDR: Retention-Aware Intelligent DRAM Refresh, ISCA 2012.}} | ||
+ | * {{p60-liu.pdf|Liu et al., “An Experimental Study of Data Retention Behavior in Modern DRAM Devices,” ISCA 2013.}} | ||
+ | * {{4299a065.pdf|Kim, Y., Papamichael, M., Mutlu, O., & Harchol-Balter, M. (2010). Thread Cluster Memory Scheduling: Exploiting Differences in Memory Access Behavior. Proceedings of the 2010 43rd Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{muralidhara_et_al._-_2011_-_reducing_memory_interference_in_multicore_systems_via_application-aware_memory_channel_partitioning.pdf|Muralidhara, S. P., Subramanian, L., Mutlu, O., Kandemir, M., & Moscibroda, T. (2011). Reducing memory interference in multicore systems via application-aware memory channel partitioning. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{p335-ebrahimi.pdf|Ebrahimi, E., Lee, C. J., Mutlu, O., & Patt, Y. N. (2010). Fairness via source throttling: a configurable and high-performance fairness substrate for multi-core memory systems. Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems.}} | ||
+ | * {{p362-ebrahimi.pdf|Ebrahimi, E., Miftakhutdinov, R., Fallin, C., Lee, C. J., Joao, J. A., Mutlu, O., & Patt, Y. N. (2011). Parallel application memory scheduling. Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{isca08_ipek.pdf|Ipek, E., Mutlu, O., Martinez, J., & Caruana, R. (2008). Self-Optimizing Memory Controllers: A Reinforcement Learning Approach. Proceedings of the 35th Annual International Symposium on Computer Architecture.}} | ||
+ | |||
+ | |||
+ | ===== Lecture 26 (4/7 Mon.) ===== | ||
+ | ** Required: ** | ||
+ | * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu, O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}} | ||
+ | * {{04147648.pdf|Srinath, S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}} | ||
+ | |||
+ | ** Recommended: ** | ||
+ | * {{24400233.pdf|Mutlu, O., Kim, H., & Patt, Y. N. (2005). Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns. Proceedings of the 38th annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | * {{01603492.pdf|Mutlu, O., Kim, H., & Patt, Y. N. (2006). Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance. IEEE Micro.}} | ||
+ | * {{21260119.pdf|Armstrong, D. N., Kim, H., Mutlu, O., & Patt, Y. N. (2004). Wrong Path Events: Exploiting Unusual and Illegal Program Behavior for Early Misprediction Detection and Recovery. Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture.}} | ||
+ | |||
+ | ===== Lecture 27 (4/9 Wed.) ===== | ||
+ | ** Required: ** | ||
+ | * None | ||
+ | ** Mentioned during lecture: ** | ||
+ | * {{p176-baer.pdf|Baer, J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}} | ||
+ | * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}} | ||
+ | * {{mowry_lam_gupta_-_1992_-_design_and_evaluation_of_a_compiler_algorithm_for_prefetching.pdf|Mowry, T. C., Lam, M. S., & Gupta, A. (1992). Design and evaluation of a compiler algorithm for prefetching. Proceedings of the fifth international conference on Architectural support for programming languages and operating systems.}} | ||
+ | |||
+ | |||
+ | ===== Lecture 28 (4/14 Mon.) ===== | ||
+ | ** Required: ** | ||
+ | * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} | ||
+ | * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]] | ||
+ | * P&H, Chapter 5.8 | ||
+ | ** Recommended: ** | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_551_560.pdf|Hill, Jouppi, Sohi. "Multiprocessors and Multicomputers," pp. 551-560 in Readings in Computer Architecture.]] | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/hill_309_314.pdf|Hill, Jouppi, Sohi. "Dataflow and Multithreading," pp. 309-314 in Readings in Computer Architecture.]] | ||
+ | * {{01447203.pdf|Flynn, M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}} | ||
+ | * {{papamarcos_patel_-_1984_-_a_low-overhead_coherence_solution_for_multiprocessors_with_private_cache_memories.pdf|Papamarcos, M. S., & Patel, J. H. (1984). A low-overhead coherence solution for multiprocessors with private cache memories. Proceedings of the 11th annual international symposium on Computer architecture.}} | ||
+ | ** Mentioned during lecture: ** | ||
+ | * {{p176-baer.pdf|Baer, J.-L., & Chen, T.-F. (1991). An effective on-chip preloading scheme to reduce data access penalty. Proceedings of the 1991 ACM/IEEE conference on Supercomputing.}} | ||
+ | * {{04147648.pdf|Srinath, S., Mutlu, O., Kim, H., & Patt, Y. N. (2007). Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers. Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture.}} | ||
+ | * {{joseph_grunwald_-_1997_-_prefetching_using_markov_predictors.pdf|Joseph, D., & Grunwald, D. (1997). Prefetching using Markov predictors. Proceedings of the 24th annual international symposium on Computer architecture.}} | ||
+ | * {{p279-cooksey.pdf|Cooksey, R., Jourdan, S., & Grunwald, D. (2002). A stateless, content-directed data prefetching mechanism. Proceedings of the 10th international conference on Architectural support for programming languages and operating systems.}} | ||
+ | * {{04798232.pdf|Ebrahimi, E., Mutlu, O., & Patt, Y. N. (2009). Techniques for bandwidth-efficient prefetching of linked data structures in hybrid prefetching systems. High Performance Computer Architecture, 2009.}} | ||
+ | * {{p186-chappell.pdf|Chappell, R. S., Stark, J., Kim, S. P., Reinhardt, S. K., & Patt, Y. N. (1999). Simultaneous subordinate microthreading (SSMT). Proceedings of the 26th annual international symposium on Computer architecture.}} | ||
+ | * {{p2-zilles.pdf|Zilles, C., & Sohi, G. (2001). Execution-based prediction using speculative slices. Proceedings of the 28th annual international symposium on Computer architecture.}} | ||
+ | * {{p40-luk.pdf|Luk, C.-K. (2001). Tolerating memory latency through software-controlled pre-execution in simultaneous multithreading processors. Proceedings of the 28th annual international symposium on Computer architecture.}} | ||
+ | * {{p172-zilles.pdf|Zilles, C. B., & Sohi, G. S. (2000). Understanding the backward slices of performance degrading instructions. Proceedings of the 27th annual international symposium on Computer architecture.}} | ||
+ | * {{mutlu_et_al._-_2003_-_runahead_execution_an_alternative_to_very_large_instruction_windows_for_out-of-order_processors.pdf|Mutlu, O., Stark, J., Wilkerson, C., & Patt, Y. N. (2003). Runahead Execution: An Alternative to Very Large Instruction Windows for Out-of-Order Processors. Proceedings of the 9th International Symposium on High-Performance Computer Architecture.}} | ||
+ | * {{jouppi_-_1990_-_improving_direct-mapped_cache_performance_by_the_addition_of_a_small_fully-associative_cache_and_prefetch_buffers.pdf|Jouppi, N. P. (1990). Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers. Proceedings of the 17th annual international symposium on Computer Architecture.}} | ||
+ | |||
+ | |||
+ | ===== Lecture 29 (4/16 Wed.) ===== | ||
+ | ** Required: ** | ||
+ | * {{amdahl_-_1967_-_validity_of_the_single_processor_approach_to_achieving_large_scale_computing_capabilities.pdf|Amdahl, G. M. (1967). Validity of the single processor approach to achieving large scale computing capabilities. Proceedings of the April 18-20, 1967, spring joint computer conference.}} | ||
+ | * {{lamport_-_1979_-_how_to_make_a_multiprocessor_computer_that_correctly_executes_multiprocess_programs.pdf|Lamport, L. (1979). How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs.}} | ||
+ | * (CMU WebISO) [[http://www.ece.cmu.edu/~ece447/cmu_only/culler-mesi.pdf|C&S, Chapters 5.1 & 5.3]] | ||
+ | * P&H, Chapter 5.8 |