"How do I learn to build big distributed systems?"

According to Stack Overflow's 2020 developer survey, the best-paid engineering roles, like Site Reliability Engineering and Backend Engineering, require distributed systems expertise.

That comes as no surprise as modern applications are distributed systems - when was the last time you worked on software that runs entirely on a single machine without external dependencies?

But learning to build distributed systems is hard, let alone large-scale ones. It's not that there is a lack of information out there - you can find academic papers, engineering blogs explaining the inner workings of large-scale Internet services, and even books on the subject. The problem is that the available information is spread out all over the place, and if you were to put it on a spectrum from theory to practice, you would find that there is a lot of material at the two ends, but not much in the middle.

When I first started learning about distributed systems, I spent hours connecting the missing dots between theory and practice. I was looking for an accessible and pragmatic introduction that would guide me through the maze of information, stitch the theory together with the practice, and teach me everything I needed to become a practitioner. But there was nothing like that available.

This is why I decided to write a book to teach the fundamentals of distributed systems so that you don't have to spend countless hours scratching your head to understand how everything fits together. The book covers all aspects of the topic: network fundamentals, the theory underpinning distributed systems, architectural patterns of scalable systems, stability patterns that harden systems against failures, and operational best practices for maintaining large-scale systems with a small team.

It's the kind of book I wished existed when I first started out, and it's based on my experience building large-scale distributed systems that scale to millions of requests per second and billions of devices. But no matter the scale of the systems you work on today, the core principles are universal.

After reading the book, you are not going to look at a network call the same way. And you will apply your newly gained knowledge from day one at your job and on personal projects. Armed with an understanding of the fundamentals, you will have the tools to design distributed systems of your own, grok technical whitepapers, and nail interviews.

Who should read this book?

If you develop the back-end of web or mobile applications, or are on-call for it, this book is for you. When building distributed systems, you need to be familiar with the network stack, data consistency models, architectural patterns that allow your applications to scale, self-healing mechanisms to protect your applications from falling down at the first sign of trouble, and much more.

Although you can build applications without knowing any of that, you will end up spending hours debugging and re-designing your architecture, learning lessons that you could have acquired in a much faster and less painful way. Even if you are an experienced engineer, this book will help you fill gaps in your knowledge and become a better practitioner and systems architect.

The book also makes for a great study companion for the system design interview if you want to land a job at a company that runs large-scale distributed systems, like Amazon, Google, Facebook, or Microsoft. If you are interviewing for a senior role, you are expected to be able to design complex networked services and deep dive into any vertical. You can be a world champion at balancing trees, but if you fail the design round, you are out. And if you just meet the bar, don't be surprised when your offer is well below what you expected, even if you aced everything else.

The traditional way to prepare for the system design interview is to practice with tutorials on how to design Twitter, Instagram, or other large-scale web applications. These tutorials focus mostly on connecting boxes with arrows - but that's just one part of the equation and not the most challenging one. The tricky part is understanding failure modes, trade-offs, and costs, which is what skilled interviewers focus on. This is where the book comes in - it will teach you the fundamentals, and the right mindset, to approach any problem in the distributed systems space, giving you the confidence to succeed in an interview.

Table of Contents

Communication
Transmission Control Protocol
User Datagram Protocol
Transport Layer Security
Domain Name System
Application Programming Interfaces
01

Communication

A solid understanding of the network stack is essential, as you can't build a distributed system without it. Even though each network protocol builds on top of another, sometimes the abstractions leak. If you don’t know how the stack works under the hood, you will have a hard time troubleshooting why your system is down or degraded for no apparent reason. On top of that, there is a lot you can learn from the design of the core protocols that can be applied to any distributed system, like TCP’s backpressure mechanisms.
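To give a feel for what backpressure means, here is a minimal Python sketch. It is only an analogy, not TCP itself: in TCP the receiver advertises how much buffer space it has left, and the sender has to stop when that space runs out. The sketch models the same idea with a bounded in-memory queue. The names `buffer` and `try_send` are made up for illustration.

```python
import queue

# A bounded buffer stands in for the receiver's limited buffer space
# (assumption: capacity of 2 items, chosen only to keep the example short).
buffer = queue.Queue(maxsize=2)


def try_send(item):
    """Hypothetical sender: returns False instead of overwhelming the receiver."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        # The receiver is not keeping up, so the sender must back off.
        return False


# The third send is rejected because the buffer is full.
results = [try_send(i) for i in range(3)]

# Once the receiver consumes an item, the sender can make progress again.
consumed = buffer.get_nowait()
resumed = try_send(3)
```

The point of the design is that the slow side sets the pace: the sender learns about the receiver's capacity and slows down, rather than the receiver dropping work or running out of memory.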

Coordination
Failure Detection
Time
Leader Election
Replication
Consistency Models
Transactions
02

Coordination

Imagine some code that assigns a value to a variable and then reads it right after, only to find out that the write had no effect! Madness! But with eventual consistency, this is exactly what can happen when one machine writes a value to a store and another machine, or even the same one, reads it.

This is where consistency guarantees come in, which define what can and can’t happen. Strong consistency guarantees make our lives easier. But to provide these guarantees, we need to find a way to make networked machines cooperate in harmony. In this chapter, we will explore how to achieve that by solving consensus.
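The stale read described above can be reproduced with a toy model. The Python sketch below is an illustration under simplifying assumptions, not how any real store is implemented: there are just two in-memory replicas, writes go to a leader, reads go to a follower, and replication happens only when `replicate` is called explicitly. All class and method names are hypothetical.

```python
class Replica:
    """A hypothetical in-memory key-value replica."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy store: writes hit the leader, reads hit the follower by default."""

    def __init__(self):
        self.leader = Replica()
        self.follower = Replica()
        self.pending = []  # writes not yet copied to the follower

    def write(self, key, value):
        self.leader.data[key] = value
        self.pending.append((key, value))

    def read(self, key, from_follower=True):
        replica = self.follower if from_follower else self.leader
        return replica.data.get(key)

    def replicate(self):
        # Stand-in for asynchronous replication catching up.
        for key, value in self.pending:
            self.follower.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("x", 42)
stale = store.read("x")   # the follower has not seen the write yet
store.replicate()
fresh = store.read("x")   # the replicas have now converged
```

Reading from the leader (`from_follower=False`) would return the new value right away, which hints at the trade-off: stronger guarantees constrain where reads and writes are allowed to go.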

Scalability Patterns
Microservices
Partitioning
Replication
Caching
Load Balancing
Messaging
03

Scalability Patterns

Now that we know how to make a set of nodes cooperate, we can dive into the patterns and architectures used to create horizontally scalable systems. We will start with the basics of sharding and replication and slowly transition into more advanced topics such as the implementation of load balancers, content delivery networks, and asynchronous messaging.
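As a taste of the basics, here is a minimal Python sketch of hash-based sharding: each key is hashed, and the hash decides which shard stores it. This is one simple scheme among several, and the shard count, the `shard_for`, `put`, and `get` names, and the in-memory dictionaries are all assumptions made for illustration.

```python
import hashlib

NUM_SHARDS = 4  # assumption: a fixed number of shards

# Each shard is modeled as a plain dictionary; in practice it would be a node.
shards = [dict() for _ in range(NUM_SHARDS)]


def shard_for(key, num_shards=NUM_SHARDS):
    """Map a key to a shard index using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def put(key, value):
    shards[shard_for(key)][key] = value


def get(key):
    return shards[shard_for(key)].get(key)


for user_id in range(100):
    put(f"user:{user_id}", {"id": user_id})

profile = get("user:7")
```

A stable hash such as SHA-256 is used here instead of Python's built-in `hash`, which is randomized per process for strings and would send the same key to different shards across restarts. Note that with plain modulo hashing, changing the number of shards remaps most keys.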

Resiliency Patterns
Timeouts
Retry
Circuit Breaker
Load Shedding
Load Leveling
Rate-Limiting
Health Endpoint
Watchdog
04

Resiliency Patterns

At scale, anything that can go wrong will go wrong. Writing distributed code is different from writing code that runs on a single machine. If you thought multi-threading was hard, think again.

The systems you build need to be robust against failures and unexpected events. Think of spikes of incoming requests and failing downstream dependencies. In this chapter, we will look into self-healing mechanisms that guard our systems against these agents of chaos.
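To make one such mechanism concrete, here is a minimal Python sketch of a retry with exponential backoff and jitter, which is one common way to cope with a downstream dependency that fails transiently. It is an illustration, not the book's implementation: the `retry` function, its parameters, and the choice of `ConnectionError` as the retryable error are assumptions.

```python
import random
import time


def retry(operation, max_attempts=3, base_delay=0.1, max_delay=2.0, sleep=time.sleep):
    """Call operation(), retrying on ConnectionError with capped, jittered backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # The cap doubles with every attempt, up to max_delay.
            cap = min(max_delay, base_delay * 2 ** attempt)
            # Random jitter keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, cap))


# Usage: a hypothetical dependency that fails twice before recovering.
attempts = {"count": 0}


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("dependency unavailable")
    return "ok"


result = retry(flaky_call, max_attempts=5, sleep=lambda delay: None)
```

The `sleep` parameter is injectable so the example runs instantly. Bounding the number of attempts matters as much as the backoff itself: retrying forever against a struggling dependency only adds to its load.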

Operational Patterns
Metrics
Logs
Alerts
SLOs
Prober
Chaos Engineering
05

Operational Patterns

You don't want your system to fall down in the middle of the night and find out about it the next morning through a Reddit post. No matter how elegant your design is, if the system lacks monitoring and logging, it’s doomed to fail. Nobody wants to be on call for a black box. In this chapter, you will learn best practices for instrumenting and operating large-scale systems.


The Author

Roberto Vitillo

Hi! My name is Roberto Vitillo. I have over 10 years of experience in the tech industry as a software engineer, tech lead, and manager.

In 2017 I joined Microsoft to work on an internal data platform as a SaaS product. Since then, I have helped launch two public SaaS products, Product Insights and PlayFab. The data pipeline I am responsible for is one of the largest in the world. It processes millions of events per second from billions of devices worldwide.

Before that, I worked for Mozilla, where I wore different hats, from performance engineer to data platform engineer. What I am most proud of is having set the direction of the data platform from its very early days and built a large part of it, including the team.

After getting my master's degree in computer science, I worked on scientific computing applications at the Berkeley Lab. The software I have contributed to is used to this day by the ATLAS experiment at the Large Hadron Collider.

Get the book

The book is constantly updated as I write new chapters. You can buy the early access release for $29 and start taking advantage of the content long before the book is completed. As of now, the book includes the chapters on network fundamentals, distributed algorithms, and resiliency patterns.

If you buy the book now, you will have access to all future updates for free, and you can request a refund within 45 days - no hard feelings.


Format: PDF
Book Status: 50% Complete
Pages: 121
Last Updated: 2020-10-17

Frequently Asked Questions

When will the book be completed?

I plan to finish writing the book by the end of the year. So far, I have been releasing a new chapter nearly every month.

How do I contact you if I have a question?

E-mail me at roberto@systemdesignmanual.com.