Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems
Data is at the center of many challenges in system design today. Difficult issues need to be figured out, such as scalability, consistency, reliability, efficiency, and maintainability. In addition, we have an overwhelming variety of tools, including relational databases, NoSQL datastores, stream or batch processors, and message brokers. What are the right choices for your application? How do you make sense of all these buzzwords?
In this practical and comprehensive guide, author Martin Kleppmann helps you navigate this diverse landscape by examining the pros and cons of various technologies for processing and storing data. The software keeps changing, but the fundamental principles remain the same. With this book, software engineers and architects will learn how to apply those ideas in practice, and how to make full use of data in modern applications.
- Peer under the hood of the systems you already use and learn how to use and operate them more effectively;
- Make informed decisions by identifying the strengths and weaknesses of different tools;
- Navigate the trade-offs around consistency, scalability, fault tolerance, and complexity;
- Understand the distributed systems research upon which modern databases are built;
- Peek behind the scenes of major online services, and learn from their architectures.
Who Should Read the Designing Data-Intensive Applications Book?
If you develop applications that have some kind of server/backend for storing or processing data, and your applications use the internet (e.g., web applications, mobile apps, or internet-connected sensors), then this book is for you.
This book is for software engineers, software architects, and technical managers who love to code. It is especially relevant if you need to make decisions about the architecture of the systems you work on—for example, if you need to choose tools for solving a given problem and figure out how best to apply them. But even if you have no choice over your tools, this book will help you better understand their strengths and weaknesses.
You should have some experience building web-based applications or network services, and you should be familiar with relational databases and SQL. Any non-relational databases and other data-related tools you know are a bonus, but not required. A general understanding of common network protocols like TCP and HTTP is helpful. Your choice of programming language or framework makes no difference for this book.
If any of the following are true for you, you’ll find this book valuable:
- You want to learn how to make data systems scalable, for example, to support web or mobile apps with millions of users.
- You need to make applications highly available (minimizing downtime) and operationally robust.
- You are looking for ways of making systems easier to maintain in the long run, even as they grow and as requirements and technologies change.
- You have a natural curiosity for the way things work and want to know what goes on inside major websites and online services. This book breaks down the internals of various databases and data processing systems, and it’s great fun to explore the bright thinking that went into their design.
Sometimes, when discussing scalable data systems, people make comments along the lines of, ‘You’re not Google or Amazon. Stop worrying about scale and just use a relational database’. There is truth in that statement: building for scale that you don’t need is wasted effort and may lock you into an inflexible design. In effect, it is a form of premature optimization. However, it’s also important to choose the right tool for the job, and different technologies each have their own strengths and weaknesses. As we shall see, relational databases are important but not the final word on dealing with data.
Scope of the Designing Data-Intensive Applications eBook
This book does not attempt to give detailed instructions on how to install or use specific software packages or APIs, since there is already plenty of documentation for those things. Instead, we discuss the various principles and trade-offs that are fundamental to data systems, and we explore the different design decisions taken by different products.
We look primarily at the architecture of data systems and the ways they are integrated into data-intensive applications. This book doesn’t have space to cover deployment, operations, security, management, and other areas—those are complex and important topics, and we wouldn’t do them justice by making them superficial side notes in this book. They deserve books of their own.
Many of the technologies described in this book fall within the realm of the Big Data buzzword. However, the term ‘Big Data’ is so overused and underdefined that it is not useful in a serious engineering discussion. This book uses less ambiguous terms, such as single-node versus distributed systems, or online/interactive versus offline/batch processing systems.
This book has a bias toward free and open-source software (FOSS) because reading, modifying, and executing source code is a great way to understand how something works in detail. Open platforms also reduce the risk of vendor lock-in. However, where appropriate, we also discuss proprietary software (closed-source software, software as a service, or companies’ in-house software that is only described in literature but not released publicly).
About the author:
Martin Kleppmann is a researcher in distributed systems and security at the University of Cambridge, and author of Designing Data-Intensive Applications (O’Reilly Media, 2017). Previously he was a software engineer and entrepreneur at Internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. He is now working on TRVE DATA, a project that aims to bring end-to-end encryption and decentralization to a wide range of applications.
Reviews of the Designing Data-Intensive Applications ebook:
- Emre Sevinç:
I consider this book a mini-encyclopedia of modern data engineering. Like a specialized encyclopedia, it covers a broad field in considerable detail. But it is not a practical guide or a cookbook for a particular Big Data, NoSQL, or NewSQL product. What the author does is lay down the principles of current distributed big data systems, and he does a very fine job of it.

If you are after the obscure details of a particular product, or some tutorials and “how-to”s, go elsewhere. But if you want to understand the main principles, issues, and challenges of data-intensive and distributed systems, you’ve come to the right place.

Martin Kleppmann starts out by solidly giving the reader the conceptual framework in the first chapter: what does reliability mean? How is it defined? What is the difference between “fault” and “failure”? How do you describe load on a data-intensive system? How do you talk about performance and scalability in a meaningful way? What does it mean to have a “maintainable” system?
The second chapter gives a brief overview of different data models and shows their suitability for different use cases, using modern challenges that companies such as Twitter have faced. This chapter is a solid foundation for understanding the differences between the relational, document, and graph data models, as well as the languages used for processing data stored using these models.
The third chapter goes into a lot of detail regarding the building blocks of different types of database systems: it describes the data structures and algorithms used by the systems shown in the previous chapter; you get to know hash indexes, SSTables (Sorted String Tables), Log-Structured Merge trees (LSM-trees), B-trees, and other data structures. Following this chapter, you are introduced to column-oriented databases and the underlying principles and structures behind them.
Following these, the book describes the methods of data encoding, starting from the venerable XML & JSON, and going into the details of formats such as Avro, Thrift, and Protocol Buffers, showing the trade-offs between these choices.
Following the building blocks and foundations comes “Part II”, and this is where things start to get really interesting, because now the reader starts to learn about the challenging topic of distributed systems: how to use the basic building blocks in a setting where anything can go wrong in the most unexpected ways. Part II is the most complex part of the book: you learn how to replicate your data, what happens when replication lags behind, how you provide a consistent picture to the end user or the end programmer, what algorithms are used for leader election in consensus systems, and how leaderless replication works.
One of the primary purposes of using a distributed system is to gain an advantage over a single, central system, and that advantage is providing better service, meaning a more resilient service with an acceptable level of responsiveness. This means you need to distribute the load and your data, and there are a lot of schemes for partitioning your data. Chapter 6 of Part II provides a lot of detail on partitioning, keys, indexes, secondary indexes, and how to handle data queries when your data is partitioned using various methods.
No data systems book can be complete without touching on the topic of transactions, and this book is no exception to the rule. You learn about the fuzziness surrounding the definition of ACID, isolation levels, and serializability.
The remaining two chapters of Part II, Chapters 8 and 9, are probably the most interesting part of the book. You are now ready to learn the gory details of how to deal with all kinds of network and other faults to keep your data system in a usable and consistent state, the problems with the CAP theorem, version vectors (and why they are not vector clocks), Byzantine faults, how to have a sense of causality and ordering in a distributed system, why algorithms such as Paxos, Raft, and ZAB (used in ZooKeeper) exist, distributed transactions, and many more topics.
The rest of the book, that is Part III, is dedicated to batch and stream processing. The author describes the famous MapReduce batch processing model in detail and briefly touches upon modern frameworks for distributed data processing, such as Apache Spark. The final chapter discusses event streams, messaging systems, and the challenges that arise when trying to process this “data in motion”. You might not be in the business of building the next-generation streaming system, but you’ll definitely need a handle on these topics, because you’ll encounter the described issues in the practical stream processing systems that you deal with daily as a data engineer.
As I said in the opening of this review, consider this a mini-encyclopedia for the modern data engineer, and also don’t be surprised if you see more than 100 references at the end of some chapters; if the author had tried to include most of them in the text itself, the book would have gone well beyond 2,000 pages!
At the time of my writing, the book is 90% complete; according to its official site, there’s only one more chapter to be added (Chapter 12: Materialized Views and Caching). So it is safe to say that I recommend this book to anyone working with distributed big data systems, NoSQL and NewSQL databases, document stores, column-oriented data stores, or streaming and messaging systems. As for me, it’ll definitely be my go-to reference on these topics for the upcoming years.
- Nilendu Misra:
In Silicon Valley, the “ability to code” is now the uber-metric to track. From how engineers are interviewed, to actual hands-on work (due to processes that overemphasize “do” over “think”, e.g., daily stand-ups that require you to say what concrete thing you did yesterday), to evaluation of work (“move fast and break things”), to overemphasizing downstream “fixes” (prod-ops culture, 24/7 firefighting heroism), the top echelon of technology gravitated toward things it can see, feel, and measure. What often gets neglected in this “code be all” culture is a deep understanding of fundamental concepts, and how most newer “innovations” are indeed built on a handful of time-honored principles.

Nowhere, perhaps, is this more prominent than in the data space, which up-levels libraries and frameworks as the conversation starter. That gets in the way of success. It is indeed impossible to model Cassandra’s “tables” without understanding, at least, quorums, compaction, and log-structured merge data structures. Due to the way present-day solutions are built (“fits one use case perfectly well”), if these solutions are not applied well to the particular domain, failure is just a release away.
- Yevgeniy Brikman:
A must-read for every programmer. This is the best overview of data storage and distributed systems—two key concepts for building almost any piece of software today—that I’ve seen anywhere. Martin does a wonderful job of taking a massive body of research and distilling complicated concepts and difficult trade-offs down to a level where anyone can understand them.

I learned a lot about replication, partitioning, linearizability, locking, write skew, phantoms, transactions, event logs, and more. I’m also a big fan of the final chapter, The Future of Data Systems, which covers ideas such as “unbundling the database” (i.e., using an event log as the primary data store, and handling all other aspects of the “database”, such as secondary indexes, materialized views, and replication, in separate “derived” data systems), end-to-end event streams, and an important discussion on ethics in programming and data systems.

The only thing missing is a set of summary tables. I’d love to see a list of all common data systems and how they fare across many dimensions: e.g., support for locking, replication, transactions, consistency levels, and so on. That would be very handy for deciding what system to pick for my next project.
As always, I’ve saved a few of my favorite quotes from the book:
“Document databases are sometimes called schemaless, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not enforced by the database. A more accurate term is schema-on-read (the structure of the data is implicit, and only interpreted when the data is read), in contrast with schema-on-write (the traditional approach of relational databases, where the schema is explicit and the database ensures all written data conforms to it). Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking.”
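The schema-on-read/schema-on-write distinction in that quote can be made concrete with a toy Python sketch (illustrative only; real document and relational databases enforce this very differently, and the names here are made up). In the first case, the store accepts anything and the implicit schema lives in the reading code; in the second, an explicit schema is checked before any data is accepted.

```python
# Schema-on-read: the store accepts any document; reading code assumes structure.
documents = []
documents.append({"name": "Ada", "born": 1815})
documents.append({"name": "Grace"})   # missing field? the store doesn't care

def full_bio(doc):
    # The implicit schema lives here, in the reading code.
    return f"{doc['name']} (born {doc.get('born', 'unknown')})"

# Schema-on-write: an explicit schema is enforced before data is stored.
SCHEMA = {"name": str, "born": int}

def insert_row(table, row):
    for field, ftype in SCHEMA.items():
        if field not in row or not isinstance(row[field], ftype):
            raise ValueError(f"row violates schema at field {field!r}")
    table.append(row)

rows = []
insert_row(rows, {"name": "Ada", "born": 1815})   # accepted
# insert_row(rows, {"name": "Grace"})             # would raise ValueError
```

The parallel the quote draws holds here too: `full_bio` fails only when the assumption is violated at read time (like a runtime type error), while `insert_row` rejects bad data up front (like a compile-time type error).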
“For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred. People sometimes make a connection between this principle and the special theory of relativity in physics, which introduced the idea that information cannot travel faster than the speed of light. Consequently, two events that occur some distance apart cannot possibly affect each other if the time between the events is shorter than the time it takes light to travel the distance between them.”
“A node in the network cannot know anything for sure—it can only make guesses based on the messages it receives (or doesn’t receive) via the network.”
“The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees.”
“CAP is sometimes presented as Consistency, Availability, Partition tolerance: pick 2 out of 3. Unfortunately, putting it this way is misleading because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not. At times when the network is working correctly, a system can provide both consistency (linearizability) and total availability. When a network fault occurs, you have to choose between either linearizability or total availability. Thus, a better way of phrasing CAP would be either Consistent or Available when Partitioned.”
“The traditional approach to database and schema design is based on the fallacy that data must be written in the same form as it will be queried. Debates about normalization and denormalization (see “Many-to-One and Many-to-Many Relationships”) become largely irrelevant if you can translate data from a write-optimized event log to read-optimized application state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation process gives you a mechanism for keeping it consistent with the event log.”
“As algorithmic decision-making becomes more widespread, someone who has (accurately or falsely) been labeled as risky by some algorithm may suffer a large number of those “no” decisions. Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial services, and other key aspects of society is such a large constraint of the individual’s freedom that it has been called “algorithmic prison”. In countries that respect human rights, the criminal justice system presumes innocence until proven guilty; on the other hand, automated systems can systematically and arbitrarily exclude a person from participating in society without any proof of guilt, and with little chance of appeal.”
“Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they codify that discrimination. If we want the future to be better than the past, moral imagination is required, and that’s something only humans can provide.”
I’m only 3 chapters into this book and I think it deserves 5 stars already. If you are interested in distributed systems or scalability, this book is a must-read for you. It gives you a high-level understanding of different techniques, including the ideas behind them, their pros and cons, and the problems they are trying to solve. A great book for practitioners who want to learn all the essential concepts quickly.

I didn’t come from a traditional CS background, but I did have some basic knowledge of hardware and data structures. You will need some of that, such as hard disks vs. SSDs and AVL trees, to understand the material. If you are completely new to backend work or distributed systems, you may want to start with another book, “Web Scalability for Startup Engineers.” After that book, you can read the free article “Distributed Systems for Fun and Profit” and you are good to go for this amazing book 😀
- Sebastian Gebski:
Honestly, this one took me much more time than I expected.
Plus, it’s definitely one of the best technical books I’ve read in years – but still, that doesn’t mean you should run straight away to your bookshop – read to the end of the review first.

I’ll risk the statement that this book’s content will not be 100% directly applicable to your work, BUT it will make you a better engineer in general. It’s like reading books about Haskell – most likely you’ll never use the language for any practical project/product development, but understanding Haskell (and the principles behind its design) will improve your functional-fu.

In this case, Martin (a true expert, one of the people behind Kafka at LinkedIn – if I remember correctly) doesn’t try to rediscover EAI patterns or feed you CAP basics -> instead he dives deep into the low-level technical differences between practical implementations of message brokers and relational & non-relational databases. He discusses various aspects of distribution, but he doesn’t stop at theory. This book is all about practical differences, based on actual implementations in popular technologies.
No, 95% of us will not write stuff I tend to call “truly infrastructural”. No, 95% of us will never get down to the implementation of tombstones or dynamic shard re-balancing in Cassandra. But still, even reading about how those practical problems were solved will make us better engineers & will add more options (/ideas) to our palette. For some younger engineers, it will prove that there’s no mysterious magic behind the technology they use – it’s just good, solid, pragmatic engineering after all.
Great book. Truly recommended. Try it yourself & enjoy how it tastes 🙂 I would give 6 freaking stars if I could.
- Greg Watson:
DDIA is easily one of the best tech books (possibly of this decade) and is destined to become a classic. The Designing Data-Intensive Applications book deals with all the stuff that happens around data engineering: storage, models, structures, access patterns, encoding, replication, partitioning, distributed systems, batch & stream processing, and the future of data systems (don’t expect ML, because it is a different beast).

Kleppmann has coherently blended the relevant computer science theory with modern use cases and applications. The focus is primarily on the core principles and thought processes that one must apply when it comes to building data services. Design concepts don’t go out of date soon, so the book has a very long shelf life.

The high point of this book is the author’s lucid prose, which indicates mastery of the subject matter and clarity of thought. Conceptualizing reality is an art, and the author really shines here. You’ll find that whenever you have a question after reading a particular sentence, the answer will be found in the upcoming sentences. It’s like mind-reading.
Also kudos to the author for those nice diagrams and interesting maps (and for avoiding mathematical formulas with Greek symbols). The bibliography at the end of each chapter is thorough enough for unending personal research.
If you are working on or interviewing for big data engineering, systems design, cloud consulting, or DevOps/SRE, then this book is a keeper for a long-long time.
(5.0) An excellent summary/foundation/set of recommendations for distributed systems development. It covers a lot of the use cases for data-intensive (vs. compute-intensive) apps/services. I recommend it to anyone doing service development. Recommendations are well-reasoned, citations are helpful, and they are leading me to do a lot more reading.
- Mj Pickles:
I have spent most of my career either as a DBA or as an integration specialist. Designing Data-Intensive Applications is by far the best book I have read covering those subjects. His descriptions of the theoretical underpinnings of the different technologies involved are perfectly pitched. He manages to describe complex things simply, which demonstrates a real mastery of the subject. His use of simple UNIX commands to illustrate what, for example, a database is doing is again very impressive. This promises to be a very influential read for me.
- David Bjelland:
Like you’d expect of a technical book with such a broad scope, there are sections that most readers in the target audience will probably find either too foundational or too esoteric to justify writing about at this kind of length, but still – at its best, I shudder to think of the time wasted groping in the dark for an ad hoc understanding of concepts it explains holistically in just a few unfussy, lucid pages and a diagram or two. Definitely a book I see myself reaching for as a reference or memory jogger for years to come.
Just finished reading the Designing Data-Intensive Applications ebook. Stunned by how good it is, surpassing even “Release It!” by Michael Nygard, which I was blown away by nearly 10 years ago. This book is insightful, informative, impartial, extensively researched, mind-expanding, precise, and even, in the end, philosophical. I’d regard the book as required reading for anyone involved in software engineering. I recently asked my manager to buy copies for 15 of my peers in my team (which he did). Buying this book is a no-brainer with respect to personal ROI.
- Sameer Rahmani:
Designing Data-Intensive Applications is a really great book. The author is well known in the field and is one of the creators of Apache Samza. In this book, he explains even the smallest challenges in creating a distributed data-intensive system.
- Andrew Powell:
This seems to be a very knowledgeable ebook, to be honest, quite a lot of it goes over my head and makes me very afraid of distributed systems, as it sounds like you need a whole department to work on it, so not something I could easily have a go at. There is so much in this that it is worth reading a few times which I should really do, but now near the end, I’m not sure I know where to start 🙂
You almost had me till the very end, Martin Kleppmann, but I will not let that ruin my experience of reading this little book of yours.

Going in, I thought I would be reading something like the classic System Design prep GitHub repos, with a lot of information told very quickly. You should know that this is purely about the data part: Kleppmann goes in depth on databases, message brokers, and batch processing, from the perspective of how the pieces of data are affected. There is less on pure infrastructure, testing, or CI/CD beyond what strictly pertains to the data.

I liked that the structure of the book meant that chapters built upon each other. The start is a standalone database on one machine; then we move on to multiple machines, we move them to different data centers, and we make the latency requirements and the throughput more ambitious. You can follow along with minimal experience in the domain, and it doesn’t shy away from making a certain generalization, which is ultimately the point of this whole book: that most systems nowadays look at getting data from point A to point B while enriching it along the way.
I learned how to think more critically when it comes to data quality, analysis, and its general flow through the system, particularly:
– Random additional latency will always remain non-deterministic, because you cannot account for context switches to background processes, packet loss and retransmission, garbage collection pauses, or page faults;
– Physical clocks and their perils;
– A 1-second slowdown in responses can dramatically reduce customer satisfaction (even by over 15%!);
– Ways of structuring a database under the hood and what this means for updates coming in, handling in place vs. via a log;
– What you might still keep in disk depending on size, even though it may cause random I/O;
– Caching techniques, such as storing recently evicted data somewhere and loading it back into memory later;
– Ordering events in a replicated and partitioned environment;
– Data cubes and star schemas;
– When a manual failover is actually more appropriate;
– Completely leader-less techniques for conflict resolution;
– Anti-entropy processes;
– Multi-indexed databases and how this runs under the hood, including concatenated indexes.
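The leaderless techniques in that list rest on a quorum argument that fits in a few lines. This is a sketch of the general rule only, not any particular database's implementation: with n replicas, if every write is acknowledged by w nodes and every read consults r nodes, then w + r > n guarantees the read set and the latest write set share at least one node, so the read sees at least one up-to-date copy.

```python
def quorums_overlap(n, w, r):
    """Quorum condition for leaderless (Dynamo-style) replication.

    Returns True when any write quorum of size w and any read quorum
    of size r, drawn from n replicas, must intersect.
    """
    return w + r > n

def min_shared_replicas(n, w, r):
    # Pigeonhole principle: two node sets of sizes w and r drawn from
    # n nodes must share at least w + r - n members (when positive).
    return max(0, w + r - n)
```

A common configuration is n = 3, w = 2, r = 2, which tolerates one unavailable node on either path while still guaranteeing one overlapping, up-to-date replica.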
The ending I thought was very meh, though. It had the air of ‘let’s summarise everything we’ve learned’, with Kleppmann dedicating a full chapter to his vision of how data-oriented systems will evolve in the future. Ultimately, it takes all the information from the previous chapters and looks at what’s missing and what would improve the current state of things. And then there are around 20 pages or so about the importance of data auditing and its associations with surveillance, and the responsibility that the software engineer has in the process. Interesting, sure, but nothing groundbreaking that you wouldn’t have heard at election time, perhaps related to Facebook or Palantir, so I ended up skimming that chapter because I thought it was written in a style I considered too pompous.
- J. K. Barnett:
This book is astonishingly good. I’ll leave you to read the many other 5-star reviews, as they speak volumes. I’ll just add that Kleppmann is a first-class technical writer whose knowledge of his subject is truly elemental. This book teaches and informs the reader by bringing a truly deep understanding of the subjects at hand without ever being academic. The result is that your newly gained knowledge about data-centric systems will help you better understand both traditional technologies (which you probably take for granted) and the many emerging technologies of today (which are often presented as revolutionary). A majestic work of truly great insight.
- Ahmad hosseini:
Designing Data-Intensive Applications Ebook changed my view of designing applications!
What is the meaning of Data-Intensive?
We call an application data-intensive if data is its primary challenge: the quality of data, the complexity of data, or the speed at which it is changing.
Who should read this book?
I think that all developers should read this book. If you develop applications that have some kind of server/backend for storing or processing data, and your application uses the internet, then this book is for you.
Why should you, as an application developer, care how the database handles storage and retrieval internally?
You do need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune storage to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood.
What is the scope of this book?
This book compares several different data models and query languages, turns to the internals of storage engines such as those used by Cassandra, Redis, and MongoDB, and looks at how databases lay out data on disk. It also pays attention to distributed data and distributed systems and their challenges, such as consistency, scalability, fault tolerance, and complexity.
- Piotr Kafel:
A book every software engineer should read! I do not have a single complaint, so be prepared to read a review full of praise… The book is split into 3 parts. Each one of them is incredibly packed with information.
1. Foundations of Data Systems
Here Mr. Kleppmann describes the basics of how databases, indices, and different encodings work. This part is essential for understanding the rest of the book. Even though it might sound like an appetizer, you can already find plenty of meat here about schema (and format) evolution and data structures. Highly insightful.
2. Distributed Data
This is my favorite part of the book and I think the hardest to digest. The amount of new information was overwhelming. I’m pretty sure that this part of the book I will re-read many times in the upcoming years.
The chapter focuses on database properties and introduces a lot of nomenclature that allows for effective and precise communication when discussing storage solutions. It starts with different types of databases, the guarantees they provide, and their properties. It ends with the mind-blowing discussion of how total order broadcast is equivalent to consensus… Just by saying those words, I feel 10x smarter.
3. Derived data
In this chapter, the author looks mostly into integrations and how modern systems deal with derived data. It focuses mostly on batch jobs and streams. You can find here a great explanation of MapReduce, change data capture (CDC), and dataflow as a tool for deriving data. The chapter ends by looking into the future at what might await us.
Have I already mentioned that this book is amazing? This is probably one of the longest reviews I have ever written, so I already do not remember what was at the beginning… The book is great. If you are a software engineer and haven’t read it, do it. Be warned, however, that this is not a book you will quickly finish if you want to really understand everything in it. What if you do not have time for such a heavy, time-consuming book? Well, read it anyway. You will at least realize how much you do not know, and maybe you will actually make time to read it the right way.
Just in case I haven’t pointed this out. This book is pure gold!
The Designing Data-Intensive Applications book is monumental. It explains many aspects of designing data applications in a very approachable way. It has everything, from the high-level differences between SQL and NoSQL to the low-level details of how databases work. The explanations are clear and accompanied by code samples, diagrams, and examples of data engines that work that way.

Part I of the book covers the fundamentals (e.g., how to handle data on a single machine). Part II covers distributed data: how to handle it and the issues you’ll face. Part III covers generating derived data (batch and stream processing) and the author’s opinion on the future of data systems.

If you’ve ever wondered how a database stores data, what the difference is between transaction isolation levels, what to do when the data doesn’t fit on a single machine, what the heck a data lake is, what the difference is between a star schema and a snowflake schema, or how MapReduce works (or a hundred other data-related questions), then this book is for you. I’ve learned so much from this book, and I think it’s recommended reading for anyone working in IT.