Book review: Understanding Distributed Systems

August 17, 2022 Utsav

In today's post, I will talk about the book Understanding Distributed Systems by Roberto Vitillo, and why it should be the first book you read on distributed systems or systems design in general.

I often find that most books dive right into the heart of complex topics in distributed systems, which can overwhelm you if you aren't experienced enough. A book that people commonly read is Designing Data-Intensive Applications (DDIA) by Martin Kleppmann. While the "red book" is an amazing book in its own right, the concepts are too advanced for someone new to distributed systems. Advanced consistency models and quorum algorithms for distributed transactions can be overwhelming even for folks who have worked in the field for a while.

This book bridges that gap between "I've never worked outside a monolithic application" and "I am quite seasoned at scalable systems" perfectly. It is simple and concise enough that you can finish the whole book on a flight from Seattle to New York, but at the same time, it also provides a wealth of information that you can not only digest easily, but also use as a stepping stone to build upon with other, more advanced books.

Understanding Distributed Systems is divided into five parts: Communication, Coordination, Scalability, Resiliency and Testing. I love this ordering because this is almost the exact order you follow when creating a new distributed application.

The first step is to understand how distributed systems work, or why we even need them. At a very basic level, they are a set of geographically separated nodes that are responsible for sharing information over the network. So it is critical to understand how communication over a network works -- what the network stack is and what happens at each level. Once you understand that, you can start building systems that communicate effectively over a network.

However, as you add more and more nodes to your system, you need to coordinate that communication. And that is exactly what part 2 is about. In this section, the author explains how you detect failures in your system, how you handle replication, algorithms for selecting the primary node, various consensus algorithms and finally how you can perform transactions over a distributed system. This leads into the third part of the book: scalability.

This is arguably the most important part of building a distributed system -- handling scale. In this section, the author talks about patterns for achieving scalability: microservices, APIs, messaging, partitioning, replication, caching and load balancing. However, a system that scales but falls over when something fails isn't any good. It needs to scale reliably. And that is the fourth section of this book -- resiliency.

In this section, the author talks about single points of failure, circuit breakers, retry patterns, rate-limiting and much more -- basically what we refer to as upstream and downstream resiliency.
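To make one of these patterns a little more concrete, here is a minimal sketch of a retry with exponential backoff and jitter. This is my own illustration and not code from the book; the function name `retry_with_backoff`, its parameters and the choice of `ConnectionError` as the retryable error are all assumptions made for the example.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    Hypothetical helper for illustration: only ConnectionError is treated
    as a transient, retryable failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Exponential backoff capped at max_delay, with random jitter
            # so that many clients do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, delay))
```

The `sleep` parameter is injected only so the waiting can be swapped out in tests. In a real system you would probably pair something like this with a circuit breaker, so that a dependency that stays down stops receiving retries altogether.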

Finally, the last part of the book is all about testing and operations. Here you learn how testing differs for distributed applications, and also about continuous integration and delivery as well as monitoring, which is critical for distributed applications.

All in all, this is an amazing book. It gives you just enough information to get a decent idea of what the distributed systems world is all about, and a foundation you can use to build your knowledge in the field. I highly recommend this book.