Schedule

Course Schedule, Slides, and Homeworks

Week	Topics	Readings	Slides	Projects
1	Introduction. Class policy. Project Infrastructure. Networking refresh.	Beej’s Guide, Why are Distributed Systems so Hard	Intro
2	Introduction continued…		Network refresh
3	Time in distributed systems (Lamport clocks, vector clocks, NTP). Global states and distributed snapshots. Failure detectors.		Event Ordering
4	Consensus: synchronous systems, asynchronous systems, byzantine failures.		Consensus
5	Process Groups: Leader election, membership, reliable multicast, virtual synchrony.		Leader Election, Multicast, …
6	Distributed commit (2PC and 3PC)		2 and 3 phase commit
7	Quorums. Paxos. Viewstamped replication. BFT.		Quorums, Paxos, View-stamped Replication, RAFT, BFT	Project 1
8	continued…
9	Peer-to-peer overlays. Gossip protocols. Distributed Hash Tables		P2P, Overlsys DHT
10	Application of distributed systems concepts to real systems		Dynamo*, Cassandra
11	continued…		GFS, BigTable, Spanner
12	continued…		ACMS, DNS	Project 2 RAFT author’s slides
13	November Break
14	Class summary			Project 3
15	Finals week			Project 4

Resources

C

Python

Socket Programming HOWTO
TCP echo server (1st example)
UDP echo server (nice design, but needs to be converted to Python3)
Examples from Week 2 lecture: udpecho.py, udpecho-timer.py, udpecho-wo-select.py, udpecho-with-select.py

Java

Distributed Systems text: Chapters 3 and 4
TCP echo server
UDP echo server: Example 1 ,Example 2

Must-read articles in distributed systems.

Why Do Computers Stop and What can be done about it? J. Gray. 1985.
End to end arguments in System Design. Saltzer, Reed, Clark. TOCS 1990.
Why do Internet services fail, and what can be done about it? 2003. D. Oppenheimer, A.Ganapathi and D. A. Patterson.
Time, Clocks, and the Ordering of Events in a Distributed System, L. Lamport 1978, SIGOPS Hall of Fame.
Virtual Time and Global States of Distributed Systems", Mattern, F. 1988.
Distributed Snapshots: Determining Global States of Distributed Systems. K. M. Chandy and L. Lamport,, 1985, SIGOPS Hall of Fame.
Unreliable Failure Detectors for Reliable Distributed Systems, T. Chandra and S. Toueg. , 1996.
Knowledge and Common Knowledge in a Distributed Environment, J. Halpern and Y. Moses , E.W. Dijkstra Prize 2009.
Impossibility of Distributed Consensus with One Faulty Process. M.J.Fischer, N.A.Lynch and M.S. Paterson. , 1983. E.W. Dijkstra Prize, 2001.
The Byzantine Generals Problem, L. Lamport, R. Shostak, and M. Pease, 1982.
Another advantage of free choice (Extended Abstract): Completely asynchronous agreement protocol. M. Ben-Or. 1983.
Exploiting virtual synchrony in distributed systems. K. P. Birman and T. A. Joseph, 1987.
Extended Virtual Synchrony, L. E. Moser, Y. Amir, P. M. Melliar-Smith, D. A. Agarwal,1994.
Distributed Recovery, Bernstein, Goodman and Hadzilakos.
Non-blocking Commit Protocols, D. Skeen.
Determining the Last Process to Fail, D. Skeen.
The State Machine Approach. F.B. Schneider. , SIGOPS Hall of Fame.
Hypervisor-based Fault-Tolerance, T. Bressoud and F.B. Schneider
A Survey of Rollback Recovery Protocols in Message Passing Systems, E. Elnozahy, L. Alvisi, Y.M.Wang, and D.B. Johnson.
Paxos Made Simple, L. Lamport.
The Part-Time Parliament L. Lamport , SIGOS Hall of Fame
Paxos for System Builders, J. Kirsch and Y. Amir (the technical report) .
Viewstamped Replication Revisited, B. Liskov and J. Cowling
From Viewstamped replication to Byzantine replication. B Liskov.
Bimodal Multicast, K.P. Birman, M. Hayden, O. Ozkasap, Z. Xiao, M. Budiu, and Y. Minsky
Byzantine Quorum Systems, D. Malkhi and M. Reiter
Practical Byzantine Fault-Tolerance, M. Castro and B. Liskov
Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, Robert Morris, David Karger, M. Frans Kaashoek, H. Balakrishnan, , 2001.
Google File System. S, Ghemawat, H. Gobioff and S.-T. Leung. SOSP 2003.
The Chubby Lock Service for Loosely-Coupled Distributed Systems. Mike Burrows, OSDI 2006 1, Bigtable: A Distributed Storage System for Structured Data. 2008. ACM Trans. Comput. Syst. 26, 2 (Jun. 2008), 1-26
Spanner, Google’s globally distributed database. OSDI 2012.
MapReduce: Simplified Data Processing on Large Clusters OSDI 2004
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center, NSDI 2011
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing, NSDI 2012, best paper
Apache Hadoop YARN: Yet Another Resource Negotiator SOCC 2013 (best paper)
Distributed GraphLab: A Framework for Machine Learning and Data Mining in the Cloud, VLDB 2012
Pregel: A System for Large-Scale Graph Processing, SIGMOD 2010
TensorFlow: A System for Large-Scale Machine Learning OSDI 2016
Hyperledger fabric: a distributed operating system for permissioned blockchains, EuroSys 2018.
Scaling Distributed Machine Learning with the Parameter Server, OSDI 2014
Bitcoin: A Peer-to-Peer Electronic Cash System, Satoshi Nakamoto
Majority is not Enough: Bitcoin Mining is Vulnerable Ittay Eyal, and Emin Gün Sirer <>

Schedule

Resources#

C#

Python#

Java#

Must-read articles in distributed systems.#

Resources

C

Python

Java

Must-read articles in distributed systems.