Schedule
Course Schedule, Slides, and Homeworks
Week | Topics | Readings | Slides | Projects |
---|---|---|---|---|
1 | Introduction. Class policy. Project Infrastructure. Networking refresh. | Beej’s Guide, Why are Distributed Systems so Hard | Intro | |
2 | Introduction continued… | Network refresh | ||
3 | Time in distributed systems (Lamport clocks, vector clocks, NTP). Global states and distributed snapshots. Failure detectors. | Event Ordering | ||
4 | Consensus: synchronous systems, asynchronous systems, byzantine failures. | Consensus | ||
5 | Process Groups: Leader election, membership, reliable multicast, virtual synchrony. | Leader Election, Multicast, … | ||
6 | Distributed commit (2PC and 3PC) | 2 and 3 phase commit | ||
7 | Quorums. Paxos. Viewstamped replication. BFT. | Quorums, Paxos, View-stamped Replication, RAFT, BFT | Project 1 | |
8 | continued… | |||
9 | Peer-to-peer overlays. Gossip protocols. Distributed Hash Tables | P2P, Overlsys DHT | ||
10 | Application of distributed systems concepts to real systems | Dynamo*, Cassandra | ||
11 | continued… | GFS, BigTable, Spanner | ||
12 | continued… | ACMS, DNS | Project 2 RAFT author’s slides | |
13 | November Break | |||
14 | Class summary | Project 3 | ||
15 | Finals week | Project 4 |
Resources
C
Python
- Socket Programming HOWTO
- TCP echo server (1st example)
- UDP echo server (nice design, but needs to be converted to Python3)
- Examples from Week 2 lecture: udpecho.py, udpecho-timer.py, udpecho-wo-select.py, udpecho-with-select.py
Java
- Distributed Systems text: Chapters 3 and 4
- TCP echo server
- UDP echo server: Example 1 ,Example 2
Must-read articles in distributed systems.
- Why Do Computers Stop and What can be done about it? J. Gray. 1985.
- End to end arguments in System Design. Saltzer, Reed, Clark. TOCS 1990.
- Why do Internet services fail, and what can be done about it? 2003. D. Oppenheimer, A.Ganapathi and D. A. Patterson.
- Time, Clocks, and the Ordering of Events in a Distributed System, L. Lamport 1978, SIGOPS Hall of Fame.
- Virtual Time and Global States of Distributed Systems", Mattern, F. 1988.
- Distributed Snapshots: Determining Global States of Distributed Systems. K. M. Chandy and L. Lamport,, 1985, SIGOPS Hall of Fame.
- Unreliable Failure Detectors for Reliable Distributed Systems, T. Chandra and S. Toueg. , 1996.
- Knowledge and Common Knowledge in a Distributed Environment, J. Halpern and Y. Moses , E.W. Dijkstra Prize 2009.
- Impossibility of Distributed Consensus with One Faulty Process. M.J.Fischer, N.A.Lynch and M.S. Paterson. , 1983. E.W. Dijkstra Prize, 2001.
- The Byzantine Generals Problem, L. Lamport, R. Shostak, and M. Pease, 1982.
- Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocol. M. Ben-Or. 1983.
- Exploiting virtual synchrony in distributed systems. K. P. Birman and T. A. Joseph, 1987.
- Extended Virtual Synchrony, L. E. Moser, Y. Amir, P. M. Melliar-Smith, D. A. Agarwal,1994.
- Distributed Recovery, Bernstein, Goodman and Hadzilakos.
- Non-blocking Commit Protocols, D. Skeen.
- Determining the Last Process to Fail, D. Skeen.
- The State Machine Approach. F.B. Schneider. , SIGOPS Hall of Fame.
- Hypervisor-based Fault-Tolerance, T. Bressoud and F.B. Schneider
- A Survey of Rollback Recovery Protocols in Message Passing Systems, E. Elnozahy, L. Alvisi, Y.M.Wang, and D.B. Johnson.
- Paxos Made Simple, L. Lamport.
- The Part-Time Parliament L. Lamport , SIGOS Hall of Fame
- Paxos for System Builders, J. Kirsch and Y. Amir (the technical report) .
- Viewstamped Replication Revisited, B. Liskov and J. Cowling
- From Viewstamped replication to Byzantine replication. B Liskov.
- Bimodal Multicast, K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky
- Byzantine Quorum Systems, D. Malkhi and M. Reiter
- Practical Byzantine Fault-Tolerance, M. Castro and B. Liskov
- Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, H. Balakrishnan, , 2001.
- Google File System. S, Ghemawat, H. Gobioff and S.-T. Leung. SOSP 2003.
- The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI 2006 1, Bigtable: A Distributed Storage System for Structured Data. 2008. ACM Trans. Comput. Syst. 26, 2 (Jun. 2008), 1-26
- Spanner, Google’s globally distributed database. OSDI 2012.
- MapReduce: Simplified Data Processing on Large Clusters OSDI 2004
- Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, best paper
- Apache Hadoop YARN: Yet Another Resource Negotiator SOCC 2013 (best paper)
- Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012
- Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010
- TensorFlow: A System for Large-Scale Machine Learning OSDI 2016
- Hyperledger fabric: a distributed operating system for permissioned blockchains, EuroSys 2018.
- Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014
- Bitcoin: A Peer-to-Peer Electronic Cash System, Satoshi Nakamoto
- Majority is not Enough: Bitcoin Mining is Vulnerable Ittay Eyal, and Emin Gün Sirer <>