Ultra-large-scale systems (e.g., Google's Gmail) pose new challenges for software engineers and operators. These systems require near-perfect uptime while supporting millions of concurrent connections and operations. Failures and errors in such systems can have financial and reputational repercussions. It is therefore of great interest to learn how to engineer (develop, maintain and operate) such ultra-large-scale systems effectively and efficiently.

This seminar course explores leading research in the engineering of ultra-large-scale systems, discusses challenges associated with developing, maintaining and operating such systems, highlights industrial engineering practice, and outlines future research directions. Students will acquire advanced knowledge of the engineering of ultra-large-scale systems. On completing the course, students should be able to conduct research on topics related to the engineering of ultra-large-scale systems and to apply the techniques they have learnt in other research or practice related to systems and software engineering.


Classes are held on TBD at 156 Barrie Street.

Each class, students will present and discuss around three papers. A detailed schedule is available here. Each class will cover papers along one of the following themes:

  • Performance engineering
  • System anomaly and failure detection
  • Monitoring of ultra-large-scale systems
  • Log engineering
  • Debugging ultra-large-scale systems
  • Performance and power
  • System configuration

Students are expected to have some background in software development and software engineering. Knowledge of ultra-large-scale systems is beneficial but not required.

Students will be evaluated using the following breakdown:

1. Classroom participation (10%):
Students are expected to read all papers covered in a week, come to class prepared to discuss their thoughts, and take part in the classroom discussions.

2. Paper presentation and discussion (20%):
Each paper will be assigned to one student who will act as both presenter and discussant. The presentation is strictly limited to 15 minutes, and the discussion will last 15-20 minutes. Each student should upload the slides to the course account before class.

  • Role of presenter: As a presenter, you should not simply repeat the paper's content (remember you only have 15 minutes). Instead, you should point out the main findings of the work. You should highlight any novel contributions, any surprises, and other possible applications of the proposed techniques. You should check the authors' other work related to the presented paper. Finally, you should place the work relative to other papers covered in the course (especially the papers covered in that particular week).
  • Role of discussant: As a discussant, you should take an adversarial position by pointing out weak and controversial positions in the paper. You should present a short rebuttal of the paper. You should come prepared with problems and counterexamples for the presented work.
Your presentation should have:
  • one slide that lists the main contributions of the paper.
  • one slide that places the paper relative to any recent work done by the authors of the paper.
  • one slide that places the paper relative to other papers presented that week.
  • as the final slide, a listing of at least three technical points that you liked and three areas that should be improved.

3. Weekly critique (10%):
Each week, each student should pick one of the papers for that week and submit by email a one-page critique of the paper before the start of class. The critique should offer a brief summary of the paper, points in favor, points against, and comments for improvement. You do not need to submit a critique if you are presenting that week. Additional advice for critiquing papers is here.

The document should have your name at the top. The document name should follow this template: Week#_Paper#_YourName.
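
As an illustration of the naming template, the short sketch below fills in Week#_Paper#_YourName for a given week, paper number and student name. The function name `critique_filename` is hypothetical, and removing spaces from the name is an assumption; the course does not specify how multi-word names or file extensions should be handled.

```python
def critique_filename(week: int, paper: int, name: str) -> str:
    """Fill in the Week#_Paper#_YourName template (hypothetical helper)."""
    # Assumption: whitespace in the name is removed so the result has no spaces.
    compact_name = "".join(name.split())
    return f"Week{week}_Paper{paper}_{compact_name}"


print(critique_filename(3, 2, "Jane Doe"))  # Week3_Paper2_JaneDoe
```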

4. Assignment (20%):
One assignment, done in a group of 3 or 4 students. More details will be given in class. The assignment will involve using JMeter and PerfMon on an open-source system.

5. Project (40%):
One original project (10 pages, IEEE format), done alone or in a group of 2 or 3 students. The project will explore one or more of the themes covered in the course.
You need to submit a project proposal (2 pages, IEEE format) around 1.5 months before the end of term. The proposal should provide a brief motivation for the project, a detailed discussion of the data and systems that will be used in the project, a timeline of milestones, and the expected outcome. Make sure that you cite at least 3 papers in your proposal. Additional advice for project proposals will be discussed in class.
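
The five evaluation components carry weights of 10%, 20%, 10%, 20% and 40%, which sum to 100%. The sketch below shows how such a breakdown could be combined into a final mark. It assumes each component is scored on a 0-100 scale and combined as a plain weighted average; the course does not state the exact calculation, and the names `WEIGHTS` and `final_grade` as well as the sample scores are purely illustrative.

```python
# Component weights in percent, taken from the evaluation breakdown.
WEIGHTS = {
    "participation": 10,
    "presentation": 20,
    "critiques": 10,
    "assignment": 20,
    "project": 40,
}


def final_grade(scores: dict) -> float:
    """Weighted average of component scores (each assumed to be 0-100)."""
    # Sanity check: the weights must cover the whole grade.
    assert sum(WEIGHTS.values()) == 100
    return sum(WEIGHTS[part] * scores[part] for part in WEIGHTS) / 100


# Made-up scores, used only to show the arithmetic.
example = {
    "participation": 90,
    "presentation": 80,
    "critiques": 70,
    "assignment": 85,
    "project": 75,
}
print(final_grade(example))  # 79.0
```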