User Tools

Site Tools


readings

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
Next revision Both sides next revision
readings [2015/02/02 21:06]
rachata
readings [2015/02/24 18:12]
kevincha
Line 80: Line 80:
   * None   * None
  
-**Mentioned during lecture:*+**Mentioned during lecture:**
   * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}}   * {{bestway.pdf|Wilkes,​ M. V. (1951). The best way to design an automatic calculating machine. Manchester University Computer Inaugural Conference.}}
   * [[http://​research.microsoft.com/​pubs/​68221/​acrobat.pdf |Butler W. Lampson, “Hints for Computer System Design,” ACM Operating Systems Review, 1983.]]   * [[http://​research.microsoft.com/​pubs/​68221/​acrobat.pdf |Butler W. Lampson, “Hints for Computer System Design,” ACM Operating Systems Review, 1983.]]
Line 94: Line 94:
 **Mentioned during lecture:** **Mentioned during lecture:**
   * {{p16-pettis.pdf|Pettis,​ K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}}   * {{p16-pettis.pdf|Pettis,​ K., & Hansen, R. C. (1990). Profile guided code positioning. Proceedings of the ACM SIGPLAN 1990 conference on Programming language design and implementation.}}
 +
 +===== Lecture 9 (2/4 Wed.) =====
 +**Required:​**
 +  * P&H Sections 4.9-4.11
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}}
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
 +
 +**Mentioned during lecture:**
 +  * {{flash-memory-data-retention_hpca15.pdf|Yu Cai, Yixin Luo, Erich F. Haratsch, Ken Mai, and Onur Mutlu, Data Retention in MLC NAND Flash Memory: Characterization,​ Optimization and Recovery, HPCA 2015.}}
 +  * {{adaptive-latency-dram_hpca15.pdf|Donghyuk Lee, Yoongu Kim, Gennady Pekhimenko, Samira Khan, Vivek Seshadri, Kevin Chang, and Onur Mutlu, Adaptive-Latency DRAM: Optimizing DRAM Timing for the Common-Case,​ HPCA 2015.}}
 +  * {{compression-aware-cache-management_hpca15.pdf|Gennady Pekhimenko, Tyler Huberty, Rui Cai, Onur Mutlu, Phillip P. Gibbons, Michael A. Kozuch, and Todd C. Mowry, Exploiting Compressed Block Size as an Indicator of Future Reuse, HPCA 2015.}}
 +
 +
 +===== Lecture 10 (2/6 Fri.) =====
 +**Required:​**
 +  * P&H Sections 4.9-4.11
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
 +  * {{mcfarling_-_1993_-_combining_branch_predictors.pdf|Mcfarling,​ S. (1993). Combining branch predictors. WRL Technical Note TN-36.}}
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
 +
 +**Mentioned in the Lecture:**
 +  * {{p300-ball.pdf|Ball,​ T., & Larus, J. R. (1993). Branch prediction for free. Proceedings of the ACM SIGPLAN 1993 conference on Programming language design and implementation.}}
 +  * {{p135-smith.pdf|Smith,​ J. E. (1981). A study of branch prediction strategies. Proceedings of the 8th annual symposium on Computer Architecture.}}
 +  * {{yeh_patt_-_1991_-_two-level_adaptive_training_branch_prediction.pdf|Yeh,​ T.-Y., & Patt, Y. N. (1991). Two-level adaptive training branch prediction. Proceedings of the 24th annual international symposium on Microarchitecture.}}
 +  * {{p22-chang.pdf|Chang,​ P.-Y., Hao, E., Yeh, T.-Y., & Patt, Y. (1994). Branch classification:​ a new mechanism for improving branch predictor performance. Proceedings of the 27th annual international symposium on Microarchitecture.}}
 +  * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}}
 +  * {{Riseman.1972.TC.pdf|E. M. Riseman and C. C. Foster. 1972. The Inhibition of Potential Parallelism by Conditional Jumps. IEEE Trans. Comput. 21, 12 (December 1972)}}
 +  * {{p274-chang.pdf|Po-Yung Chang, Eric Hao, and Yale N. Patt. 1997. Target prediction for indirect jumps. ISCA'​97.}}
 +  * {{kim_isca07.pdf|Hyesoon Kim, José A. Joao, Onur Mutlu, Chang Joo Lee, Yale N. Patt, and Robert Cohn. 2007. VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization. ISCA'​07}}
 +
 +===== Lecture 11 (2/11 Wed.) =====
 +**Required:​**
 +
 +**Mentioned in the Lecture:**
 +  * {{p18-hwu.pdf|Hwu and Patt (1987). Checkpoint Repair for Out-of-order Execution Machines.}}
 +  * {{00004607.pdf|Smith,​ J. E., & Pleszkun, A. R. (1988). Implementing precise interrupts in pipelined processors. Computers, IEEE Transactions on.}}
 +  * {{ogehl.pdf | Seznec (2005). Analysis of the O-GEometric History Length Branch Predictor. ISCA}}
 +  * {{hpca01.pdf|Daniel A. Jimenez and Calvin Lin. 2001. Dynamic Branch Prediction with Perceptrons. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture (HPCA '01)}}
 +
 +===== Lecture 12 (2/13 Fri.) =====
 +**Required:​**
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
 +
 +===== Lecture 13 (2/16 Mon.) =====
 +**Required:​**
 +  * {{kessler_-_1999_-_the_alpha_21264_microprocessor.pdf|Kessler,​ R. E. (1999). The Alpha 21264 Microprocessor. IEEE Micro.}}
 +  * {{00476078.pdf|Smith,​ J. E., & Sohi, G. S. (1995). The microarchitecture of superscalar processors. Proceedings of the IEEE.}}
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}}
 +
 +===== Lecture 14 (2/18 Wed.) =====
 +** Required **
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}}
 +
 +**Mentioned during lecture:**
 +  * {{30470407.pdf|Fung,​ W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{p253-suleman.pdf |Suleman, M. A., Mutlu, O., Qureshi, M. K., & Patt, Y. N. (2009). Accelerating critical section execution with asymmetric multi-core architectures. Proceedings of the 14th international conference on Architectural support for programming languages and operating systems.}}
 +  * {{01447203.pdf|Flynn,​ M. J. (1966). Very high-speed computing systems. Proceedings of the IEEE.}}
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}}
 +  * {{p199-smith.pdf|Smith,​ J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}}
 +  * {{00030730.pdf|Smith,​ J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}}
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}}
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}
 +
 +===== Lecture 15 (2/20 Fri.) =====
 +** Required **
 +  * {{04523358.pdf|Lindholm,​ E., Nickolls, J., Oberman, S., & Montrym, J. (2008). NVIDIA Tesla: A Unified Graphics and Computing Architecture. Micro, IEEE.}}
 +  * {{p50-fatahalian.pdf|Fatahalian,​ K., & Houston, M. (2008). A closer look at GPUs. Commun. ACM.}}
 +
 +**Mentioned during lecture:**
 +  * {{30470407.pdf|Fung,​ W. W. L., Sham, I., Yuan, G., & Aamodt, T. M. (2007). Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture.}}
 +  * {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}}
 +  * {{p199-smith.pdf|Smith,​ J. E., Dermer, G. E., Vanderwarn, B. D., Klinger, S. D., & Rozewski, C. M. (1987). The ZS-1 central processor. Proceedings of the second international conference on Architectual support for programming languages and operating systems.}}
 +  * {{00030730.pdf|Smith,​ J. E. (1989). Dynamic instruction scheduling and the Astronautics ZS-1. IEEE Computer.}}
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}}
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}
 +  * {{jog_orchestrated.pdf|Adwait Jog, Onur Kayiran, Asit K. Mishra, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, and Chita R. Das. 2013. Orchestrated scheduling and prefetching for GPGPUs. ISCA '13}}
 +  * {{large-gpu-warps_micro11|Veynu Narasiman, Michael Shebanow, Chang Joo Lee, Rustam Miftakhutdinov,​ Onur Mutlu, and Yale N. Patt. 2011. Improving GPU performance via large warps and two-level warp scheduling.MICRO-44}}
 +
 +===== Lecture 16 (2/23 Mon.) =====
 +**Mentioned during lecture:**
 +  * {{:​mise-predictable_memory_performance-hpca13.pdf|Subramanian et al., “MISE: Providing Performance Predictability and Improving Fairness in Shared Main Memory Systems,” HPCA 2013}}
 +  * [[http://​users.ece.cmu.edu/​~omutlu/​pub/​mph_usenix_security07.pdf|Moscibroda,​ T., & Mutlu, O. (2007). Memory performance attacks: denial of memory service in multi-core systems. Proceedings of 16th USENIX Security Symposium.]]
 +  * {{kung_-_1982_-_why_systolic_architectures.pdf|Kung,​ H. T. (1982). Why Systolic Architectures?​ IEEE Computer.}}
 +  * {{01675827.pdf|Fisher,​ J. A. (1981). Trace Scheduling: A Technique for Global Microcode Compaction. IEEE Trans. Comput.}}
 +  * {{2fbf01205185.pdf|Hwu,​ W.-M. W., Mahlke, S. A., Chen, W. Y., Chang, P. P., Warter, N. J., Bringmann, R. A., Ouellette, R. G., et al. (1993). The superblock: an effective technique for VLIW and superscalar compilation. J. Supercomput.}}
 +  * {{p45-mahlke.pdf|Mahlke,​ S. A., Lin, D. C., Chen, W. Y., Hank, R. E., & Bringmann, R. A. (1992). Effective compiler support for predicated execution using the hyperblock. Proceedings of the 25th annual international symposium on Microarchitecture.}}
 +  * {{melvin_patt_-_1995_-_enhancing_instruction_scheduling_with_a_block-structured_isa.pdf|Melvin,​ S., & Patt, Y. (1995). Enhancing instruction scheduling with a block-structured ISA. Int. J. Parallel Program.}}
 +  * {{hao_et_al._-_1996_-_increasing_the_instruction_fetch_rate_via_block-structured_instruction_set_architectures.pdf|Hao,​ E., Chang, P.-Y., Evers, M., & Patt, Y. N. (1996). Increasing the instruction fetch rate via block-structured instruction set architectures. Proceedings of the 29th annual ACM/IEEE international symposium on Microarchitecture.}}
 +  * {{00877947.pdf|Huck,​ J., Morris, D., Ross, J., Knies, A., Mulder, H., & Zahir, R. (2000). Introducing the IA-64 architecture. IEEE Micro.}}
 +  * {{annaratone_et_al._-_1986_-_warp_architecture_and_implementation.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. S. (1986). Warp architecture and implementation. Proceedings of the 13th annual international symposium on Computer architecture.}}
 +  * {{annaratone_et_al._-_1987_-_the_warp_computer_architecture_implementation_and_performance.pdf|Annaratone,​ M., Arnould, E., Gross, T., Kung, H. T., & Lam, M. (1987). The warp computer: Architecture,​ implementation,​ and performance. IEEE Transactions on Computers.}}
 +  *  {{fisher_-_1983_-_very_long_instruction_word_architectures_and_the_eli-512.pdf|Fisher,​ J. A. (1983). Very Long Instruction Word architectures and the ELI-512. Proceedings of the 10th annual international symposium on Computer architecture.}}
 +  * {{Smith-1982-Decoupled-Access-Execute-Computer-Architectures.pdf|Smith,​ J. E. (1982). Decoupled access/​execute computer architectures. Proceedings of the 9th annual symposium on Computer Architecture.}}
 +  * {{p289-smith.pdf|Smith,​ J. E. (1984). Decoupled access/​execute computer architectures. ACM Trans. Comput. Syst.}}
 +  * {{:​ilp_history_overview_perspective.pdf|Rau and Fisher, “Instruction-level parallel processing:​ history,​ overview, and perspective,​” Journal of Supercomputing,​ 1993.}}
 +  * {{:​ieee_proceedings_2001_-_compiler_techniques.pdf|Faraboschi et al., “Instruction Scheduling for Instruction Level Parallel Processors,​” Proc. IEEE, Nov. 2001.
 +}}
readings.txt · Last modified: 2015/04/13 19:31 by kevincha