18-845: Internet Services
Carnegie Mellon University, Spring 2024

Syllabus (pdf) | Critiques | Group Project (GP)

1. Instructor

Prof. David O'Hallaron, droh@andrew.cmu.edu, GHC 7517
Office hours: Fri 3:00-4:30pm (or by appt.)

2. Organization

Class times: Mon and Wed, 2:00-3:20pm, WeH 4623


Web page: http://www.ece.cmu.edu/~ece845
Class mailing list: 18-845@cs.cmu.edu (Now working)

Canvas: We will not be using Canvas.
Piazza: We will not be using Piazza.
Course directory: /afs/ece/class/ece845

3. Reference material

There is no required textbook for 18-845. The following are standard references for Linux programming and network programming:
  • Michael Kerrisk, The Linux Programming Interface: A Linux and UNIX System Programming Handbook, No Starch Press, 2010.
  • W. Richard Stevens, Bill Fenner, Andrew M. Rudoff Unix Network Programming: The Sockets Networking API, Volume 1 (3rd Edition), Prentice Hall, 2003.
The CS:APP3e text, which is on permanent reserve in the Engineering library, covers system-level programming topics such as dynamic linking, process control, Unix I/O, the sockets interface, writing Web servers, and application level concurrency and synchronization:

4. Linux cluster resources

  • Andrew cluster: linux.andrew.cmu.edu
    • RHEL, 64-bit, login using your Andrew credentials
  • SCS Gates cluster: ghc{26..86}.ghc.andrew.cmu.edu.
    • RHEL, 64-bit, login using your Andrew credentials
    • Machines ghc{26..46} contain NVIDIA GeForc GTX 1080 GPUs. The Wikipedia entry for GeForce 10 GPUs provides useful information about this model of GPU. They support CUDA compute capability 6.1.
  • ECE cluster: ece{000-031}.ece.local.cmu.edu
    • SuSE, 64 bit, login using your ECE credentials
    • See here for details. Contact help@ece.cmu.edu for help with accounts.

5. Course schedule (final)

Legend: GP: group project

n
Class Date Day Topic Projects Discussion Leader
1 01/17 Wed Intro and welcome Dave O'Hallaron
2 01/22 Mon System design principles Dave O'Hallaron
3 01/24 Wed Server design: basics Dave O'Hallaron
4 01/290 Mon Server design: advanced Noah Champagne
5 01/31 Wed Motivating application: Google Search Eshita Shrawan
6 02/05 Mon Stragglers: The tail at scale Chenfei (Mike) Lou
7 02/07 Wed Graceful degration: Defcon Nathalie Jeans
8 02/12 Mon Consistent hashing: Chord Stephen Dai
9 02/14 Wed Google file system (GFS)/Colossus Mitchel Fream
10 02/19 Mon Data processing: MapReduce Meixuan (Lucy) Li
11 02/21 Wed Classical replication: Paxos Neha Tarakad
12 02/22 Mon Modern replication: Raft Jiani Li
13 02/28Wed Modern replication: Aegean Simon Spivey
14 03/04 Mon No class - Spring break ---
15 03/08 Wed No class - Spring break ---
16 03/11 Mon Lock services: Chubby Anisha Nilakantan
17 03/13 Wed Distributed stores: BigTable GP abstracts due, 11:59pm Yifan Guang
18 03/11 Mon Distributed stores: Spanner Sachit Goyal
19 03/20 Wed Distributed stores: memcached Vidya Prabhakar
20 03/25 Mon Distributed stores: Dynamo Rohit Madhusudhana
21 03/27 Wed DRAM-based storage: RAMCloud Jim Shao
22 04/01 Mon Datacenter management: Borg GP oral mid-term reports due, in class Zilin Bai
23 04/03 Wed Warehouse-scale computing Dave O'Hallaron
24 04/08 Mon Virtual machines: VMWare Nina Duan
25 04/10 Wed Virtual machines: Xen Simon Men
26 04/15 Mon No class - GP prep ---
27 04/18 Thu No class - GP prep GP final reports due, Thur, Apr 18, 11:59pm ---
28 04/22 Mon No class - GP prep GP critiques due, 11:59pm ---
29 04/24 Wed No class - GP prep ---
04/28 Sun GP camera-ready reports due Sun Apr 28, 11:59pm ---

6. Detailed course schedule (final)

Students who are not leading the discussion for a particular class should prepare a single 1-page critique. Unless explictly noted, the critique should cover all papers with a "*". Please bring a hardcopy of your critique with you to class. We will not accept emailed critiques.

Class 1: Welcome and intro

Class 2: System design principles

  • Note: Your critique should list three other examples (not discussed by the authors) of end-to-end arguments in system design.
  • *J. Saltzer, D. Reed, and D. Clark, End-to-End Arguments in System Design, ACM Transactions on Computer Systems, Vol 2, No 4, Nov, 1984. (pdf)

Class 3: Server design: Basics

  • *V. Pai, P. Druschel, and W. Zwaenepoel, Flash: An efficient and portable Web server, Proceedings of the USENIX 1999 Annual Technical Conference, 1999. (pdf)

Class 4: Server design: Advanced

  • Note: Please write a single critique covering both papers.
  • *Gaurav Banga, Jeff Mogul and Peter Druschel, A scalable and explicit event delivery mechanism for UNIX, in the Proceedings of the USENIX 1999 Technical Conference, June 1999. (pdf)
  • *Tim Brecht, David Pariag, and Louay Gammo, accept()able Strategies for Improving Web Server Performance, Proceedings of the USENIX 2004 Annual Technical Conference, June, 2004. (pdf)

Class 5: Motivating application: Google search

  • *Sergey Brin and Larry Page, The Anatomy of a Large-Scale Hypertextual Web Search Engine, Seventh International World Wide Web Conference / Computer Networks 30(1-7): 107-117. 1998. (pdf)
  • For background only. No need to critique.
  • Ian Rogers, The Google Pagerank Algorithm and How It Works, May, 2002. (html) No need to critique this, but you might find the specific examples to be helpful.

Class 6: Stragglers: Tail at scale

  • *Jeffrey Dean and Luiz Andre Barroso, The Tail at Scale, in Communications of the ACM, Feb. 2013, (pdf)

Class 7: Graceful degradation: Defcon

  • *Justin Meza et al, Defcon: Preventing Overload with Graceful Feature Degradation, in Usenix OSDI, June 2023 (pdf)

Class 8 Consistent hashing: Chord

  • *I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, H. Balakrishnan, Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, in SIGCOMM01, Aug. 2001, (pdf)

Class 9: Google file system (GFS)/Colossus

  • Note: Please write a single critique covering both papers.
  • *Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, in Proceedings of the 19th ACM Symposium on Operating Systems Principles, October, 2003. (pdf)
  • *Kirk McKusick and Sean Quinlan, GFS: Evolution on Fast-Forward, CACM, March, 2010. (html)

Class 10: Data processing: MapReduce

  • *J. Dean, and S. Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, in Proceedings of Sixth Symposium on Operating System Design and Implementation, December, 2004. (pdf)
  • Note: No need to critique, but please come to class prepared to discuss
  • J. Summers, The Friendship that Made Google Huge, New Yorker, Dec 3, 2018. Beautiful article about the 20-year friendship between Jeff Dean and Sanjay Ghemawat and the unique pair programming approach they've used to build some of Google's most important systems. (pdf)
  • For background only. No need to critique.
  • M. Zaharia, M. Chowdurey, M. Franklin, S. Shenkar, and I Stoica, Spark: Cluster computing with working sets, HotCloud10, 2010. (pdf)

Class 11: Classical replicaton: Paxos

  • Note: Please write a single critique covering both papers.
  • *Michael Swift, "Paxos, Agreement, Consensus", Lecture notes for CS 739, Spring 2012, Univ of Wisc, A clear and concise description, with psuedo-code, of the Paxos algorithm and its behavior under various scenarios. (pdf)
  • *Tushar Chandra, Robert Griesemer, Joshua Redstone, Paxos Made Live - An Engineering Perspective, in ACM Symposium on Principles of Distributed Computing, Aug, 2007. (pdf)
  • For background only. No need to critique
  • Angus MacDonald, Paxos by Example, Web post, 2018. (html). Helpful step-by-step example with multiple leaders.
  • Leslie Lamport, Paxos Made Simple, ACM SIGACT News (Distributed Computing Column) 32, 4 (December 2001) 51-58. (pdf). Maybe not so simple :-)

Class 12: Modern replication: Raft

  • *Diego Ongaro and John Ousterhout, In Search of an Understandable Consensus Algorithm, USENIX, 2014. (pdf)

Class 13: Modern replication: Aegean

  • *R. Aksoy and M. Dapritsos, Aegean: Replication beyond the client-server model, in SOSP19, October, 2019. (pdf)

Class 14: No class - Spring break

Class 15: No class - Spring break

Class 16: Lock services: Chubby

  • *M. Burrows, The Chubby Lock Service for Loosely-Coupled Distributed Systems, in Proceedings of the Seventh Symposium on Operating System Design and Implementation (OSDI 06), December, 2006. (pdf)

Class 17: Distributed Stores: BigTable

  • *F. Chang, J. Dean, S. Ghemawat, W.C. Hsieh, D.A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, Bigtable: A Distributed Storage System for Structured Data, in Proceedings of the Seventh Symposium on Operating System Design and Implementation (OSDI 06), December, 2006. (pdf)

Class 18: Distributed stores: Spanner

  • *J. Corbett, J. Dean, M. Epstein, A. Fikes, C. Frost, J. Furman, S. Ghemawat, A. Gubarev, C. Heiser, P. Hochschild, W. Hsieh, S. Kanthak, E. Kogan, H. Li, A. Lloyd, S. Melnik, D. Mwaura, D. Nagle, S. Quinlan, R. Rao, L. Rolig, Y. Saito, M. Szymaniak, C. Taylor, R. Wang, and D. Woodford, Spanner: Googles Globally-Distributed Database, OSDI 12, 2012, Jay Lepreau Best Paper Award. (pdf)

Class 19: Distributed stores: memcached

  • *R. Nishtala et al, Scaling Memcache at Facebook, NSDI 13. (pdf)

Class 20: Distributed stores: Dynamo

  • *G. DeCandia, D. Hastorun, M. Jampani, G. Kakulapati, A. Lakshman, A. Pilchin, S. Sivasubramnian, P. Vosshal, and W. Vogels, Dynamo: Amazon's Highly Available Key-value Store, SOSP 07, Oct. 2007, (pdf)

Class 21: DRAM-based storage: RAMCloud

  • *D. Ongaro, S. Rumble, R. Stutsman, J. Ousterhout, and M. Rosenblum, Fast Crash Recovery in RAMCloud, SOSP'11, Oct. 2011 (pdf)
  • For background only. No need to critique
  • Stephen M. Rumble, Ankita Kejriwal, and John Ousterhout, Log-structured Memory for DRAM-based Storage, FAST'14. Awarded best paper. (pdf)
  • John Ousterhout, Parag Agrawal, David Erickson, Christos Kozyrakis, Jacob Leverich, David Mazieres, Subhasish Mitra, Aravind Narayanan, Diego Ongaro, Guru Parulkar, Mendel Rosenblum, Stephen M. Rumble, Eric Stratmann, and Ryan Stutsman, The Case for RAMCloud, CACM, July, 2011. (pdf)

Class 22: Datacenter management

  • *Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes, Large-scale cluster management at Google with Borg, EuroSys 2015, Bordeaux, France. (pdf)

Class 23: Warehouse-scale computing

  • Note: Write a single critique covering Chapters 1-3, and 5.
  • *Luiz Andre Barrosa, Urs Holzle, and Parthasarathy Ranganathan, The Datacenter as a Computer: Designing Warehouse-scale Machines, Third Edition, Morgan & Claypool, 2018. (pdf)

Class 24: Virtual machines: VMWare

  • *K. Adams, and O. Agesen, A Comparison of Software and Hardware Techniques for x86 Virtualization, In Proceedings of the 12th international conference on Architectural support for programming languages and operating systems (ASPLOS'06), 2006. (pdf)
  • For background only. Do not critique
  • G. Neiger, A. Santoni, F. Leung, D. Rodgers, R. Uhlig, "Intel Virtualization Technology: Hardware Support for Efficient Processor Virtualization", Intel Technology Journal, Aug, 2006. Please skip all discussion of the Itaniums VT-i (pdf)
  • Ole Agesen, Alex Garthwaite, Jeffrey Sheldon, Pratap Subrahmanyam, The Evolution of an x86 Virtual Machine Monitor, ACM SIGOPS Operating Systems Review archive Volume 44 Issue 4, December 2010. (pdf)
  • Mendel Rosenblum and Tal Garfinkel, Virtual Machine Monitors: Current Technology and Future Trends, IEEE Computer, May, 2005. (pdf)
  • Wes Felter, Alexandre Ferreira, Ram Rajamony, Juan Rubio, Updated Performance Comparison of Virtual Machines and Linux Containers, IBM Research Report, RC25482 (AUS1407-001) July 21, 2014 (pdf)

Class 25: Virtual machines: Xen

  • *P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, A. Warfiel, Xen and the Art of Virtualization, In Proceedings of the 19th ACM Symposium on Operating Systems Principles, October, 2003. (pdf)

Class 26: No class

Class 27: No class

Class 28: No class

Class 29: No class