Databases power nearly all modern applications. They’re behind your Angry Birds mobile game as much as they’re behind the space shuttle. In the beginning, databases were hosted on a single physical machine. Basically, it was a computer running only one program: the database. Then we moved to running databases on virtual machines, where resources are shared among multiple operating systems and applications. In recent years we moved to running databases in the cloud. And we no longer use a single database instance to store the data. Modern database systems are spread across multiple computers or nodes, which work together to store, manage, and access the data.
This post is about distributed database architecture. We’ll cover what a distributed database is, what types exist, their benefits and drawbacks, and how to design one.
What Is a Distributed Database?
As stated above, a distributed database is a database design that comprises several nodes working together. A node is basically a computing instance (a physical server, a virtual machine, or a container) that’s running the database. Each node holds its own copy of the data, or a portion of it, and the nodes communicate with each other to keep that data in sync.
Why Switch From a Single Node to a Multi-node Setup?
In the past, when data was measured in megabytes and database users were measured in dozens, a single database node could do the work. A typical scenario for this kind of architecture was hosting the database on an on-premises mainframe machine. Developers connected to the database, ran queries, received the output, and disconnected. A single system administrator or database administrator took care of the system in terms of availability, performance, and upgrades.
Now take Netflix as an example of a modern database architecture. Hundreds of millions of users all over the world use the application from different devices. Millions use the system at the same time. It has to be available 24/7. In this scenario, Netflix couldn’t possibly rely on a single computer running a single database application. If that computer goes down, millions of users will suffer a service disruption. In addition, storing all the data in one place is neither economically beneficial nor practical.
Imagine saving all the user data in one database instance running on a single PC. The database back end should grow automatically as more subscribers join the service. Thus, a single on-premises database is simply not practical in terms of availability, scalability, and fault tolerance.
Benefits of Distributed Database Architecture
As mentioned above, distributed databases offer many benefits over traditional single-server databases, including improved scalability, availability, performance, and fault tolerance.
Compared to a single database that can only scale vertically, distributed databases can scale horizontally. In other words, if you have a single database, the only way to scale it so it can handle more load is to add CPU, memory, or storage to the machine it runs on. With a distributed database, you can add more nodes.
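One common way to make adding nodes cheap is consistent hashing, where each key maps to a point on a ring and a new node takes over only a slice of the existing keys. The sketch below is an illustration under assumptions, not how any particular database does it: the `HashRing` class, the node names, and the choice of 100 virtual nodes per server are all hypothetical.

```python
import bisect
import hashlib


def _hash(value):
    """Map a string to a 64-bit integer position on the ring."""
    digest = hashlib.sha256(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class HashRing:
    """Hypothetical consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node_name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node owns several points on the ring to even out the load.
        for i in range(self.vnodes):
            bisect.insort(self._ring, (_hash(f"{node}#{i}"), node))

    def node_for(self, key):
        # Walk clockwise to the first ring point at or after the key's position.
        idx = bisect.bisect(self._ring, (_hash(key), "")) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # scale out by one node
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved to the new node")
```

Only the keys that now belong to the new node change owner; everything else stays where it was, which is what keeps scaling out from turning into a full data reshuffle.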
Availability and Fault Tolerance
If you have only one database and it goes down, the application goes down with it. But with a distributed database, losing a node doesn’t have to impact the whole application: the remaining nodes keep serving requests, and the service continues to function.
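As a rough illustration of that idea, here is a minimal sketch of a client that falls back to another replica when one node is unreachable. The `Node` class and `read_with_failover` function are hypothetical in-memory stand-ins, not part of any real database driver.

```python
class Node:
    """Hypothetical in-memory stand-in for a database node."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True

    def get(self, key):
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return self.data[key]


def read_with_failover(nodes, key):
    """Try each replica in turn and return the first successful answer."""
    for node in nodes:
        try:
            return node.get(key)
        except ConnectionError:
            continue  # this replica is unreachable; try the next one
    raise RuntimeError("no replica available")


# Three replicas holding the same record.
replicas = [Node(f"node-{i}", {"user:1": "alice"}) for i in range(3)]
replicas[0].up = False  # simulate losing one node

result = read_with_failover(replicas, "user:1")
print(result)
```

The read still succeeds because the second replica answers. Only when every replica is down does the caller see an error.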
You can split data across multiple nodes. Therefore, if a node is breached, most of the application’s data will remain secure. The same goes for data corruption. If node data was corrupted due to a server or software error, it won’t affect other nodes.
Reduced Network Traffic
Distributed databases can reduce network traffic by storing data closer to where it will be used, reducing the need to transmit data over the network.
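A minimal sketch of that routing decision, assuming the client knows a rough latency to each region. The region names and latency figures below are made up for illustration, and `nearest_replica` is a hypothetical helper.

```python
# Illustrative round-trip latencies in milliseconds (made-up numbers).
LATENCY_MS = {
    "us-east": {"us-east": 5, "eu-west": 80, "ap-south": 190},
    "eu-west": {"us-east": 80, "eu-west": 5, "ap-south": 120},
}


def nearest_replica(client_region, replica_regions):
    """Pick the replica region with the lowest latency from the client."""
    return min(replica_regions, key=lambda region: LATENCY_MS[client_region][region])


replicas = ["us-east", "eu-west", "ap-south"]
print(nearest_replica("eu-west", replicas))
print(nearest_replica("us-east", ["eu-west", "ap-south"]))
```

A client in `eu-west` is served locally, while a client in `us-east` with no local replica is sent to the closest remaining one.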
Drawbacks of Distributed Databases
Designing and implementing a single database instance is much easier than designing and implementing a distributed database architecture. The same applies to monitoring, troubleshooting, maintaining, and upgrading. A distributed database requires thorough planning, the right database vendor, the right architecture, and so forth. In addition to the increased complexity, there’s also higher cost, as it often requires more hardware, software, and skilled personnel. Lastly, there are consistency and coordination issues. Ensuring consistency across all nodes in a distributed database can be challenging, especially in systems with high concurrency or large amounts of data.
Types of Distributed Database Architecture
There are several types of distributed database architectures. Each has its own strengths and weaknesses, and the choice of architecture depends on the application’s specific needs.
In master-slave architecture, there’s a single master database that handles all write operations, while one or more slave databases replicate the data from the master and serve read operations. So all writes (inserts, updates, and deletes) go to one node, and reads are distributed across the other nodes. This setup is ideal for read-intensive applications.
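The read/write split can be sketched in a few lines. This is a toy model under assumptions: `MasterSlaveCluster` is a hypothetical class, plain dictionaries stand in for database nodes, and replication is done synchronously on every write, whereas real systems often replicate asynchronously.

```python
import itertools


class MasterSlaveCluster:
    """Hypothetical cluster: one master for writes, several slaves for reads."""

    def __init__(self, slave_count):
        self.master = {}
        self.slaves = [{} for _ in range(slave_count)]
        self._next_slave = itertools.cycle(range(slave_count))

    def write(self, key, value):
        # All writes go to the master first...
        self.master[key] = value
        # ...and are then copied to every slave (synchronously, for simplicity).
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        # Reads are spread across the slaves in round-robin order.
        index = next(self._next_slave)
        return index, self.slaves[index].get(key)


cluster = MasterSlaveCluster(slave_count=2)
cluster.write("title:42", "Stranger Things")

first = cluster.read("title:42")
second = cluster.read("title:42")
print(first, second)
```

Two consecutive reads land on different slaves but return the same value, which is why adding slaves increases read capacity without touching the write path.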
With multi-master replication, every node acts as both master and slave: each one accepts reads and writes and replicates its changes to the other nodes.
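Because two masters can accept conflicting writes to the same key, a multi-master setup needs some conflict resolution rule. The sketch below uses last-write-wins, which is only one of several possible strategies. The `MasterNode` class is hypothetical, and the timestamps are passed in by hand rather than taken from a real clock.

```python
class MasterNode:
    """Hypothetical master that resolves conflicts with last-write-wins."""

    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> ((timestamp, origin_node), value)

    def _apply(self, key, version, value):
        current = self.store.get(key)
        # Keep the write with the highest (timestamp, origin) pair;
        # the origin name breaks ties so all nodes pick the same winner.
        if current is None or version > current[0]:
            self.store[key] = (version, value)

    def write(self, key, value, timestamp):
        self._apply(key, (timestamp, self.name), value)

    def sync_from(self, other):
        # Pull every record from the other master and merge it locally.
        for key, (version, value) in other.store.items():
            self._apply(key, version, value)

    def read(self, key):
        return self.store[key][1]


a, b = MasterNode("a"), MasterNode("b")
a.write("cart", ["book"], timestamp=1)         # write accepted by node a
b.write("cart", ["book", "pen"], timestamp=2)  # later write accepted by node b

a.sync_from(b)
b.sync_from(a)
print(a.read("cart"), b.read("cart"))
```

After both nodes sync, they agree on the later write. The trade-off is that the earlier write is silently discarded, which is acceptable for some data and not for other data.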
In shared-nothing architecture, data is sharded rather than shared: it’s split across nodes, and each node is responsible for reading and writing only its own portion of the data.
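A minimal sketch of that split, assuming keys are assigned to nodes by hashing. The `ShardedStore` class and `shard_for` function are hypothetical, and dictionaries stand in for the nodes.

```python
import hashlib


def shard_for(key, shard_count):
    """Map a key to a shard index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


class ShardedStore:
    """Hypothetical store where each shard owns a disjoint part of the data."""

    def __init__(self, shard_count):
        self.shards = [{} for _ in range(shard_count)]

    def put(self, key, value):
        # The write goes only to the shard that owns this key.
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key):
        # The read goes to the same shard; no other node is involved.
        return self.shards[shard_for(key, len(self.shards))].get(key)


store = ShardedStore(shard_count=4)
for i in range(20):
    store.put(f"user:{i}", {"id": i})

print([len(shard) for shard in store.shards])
print(store.get("user:7"))
```

Every key lives on exactly one shard, so the shard sizes add up to the number of keys written. Note that a plain modulo like this reshuffles most keys when the shard count changes.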
Federated Database Architecture
In a federated database architecture, there are several independent databases (and even several database types) organized as one meta-database. Basically, what you have here is a unified virtual database that you can query. The queries are distributed internally by the virtual database manager.
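To make the idea concrete, here is a small sketch that fans one query out to two independent databases and merges the results. It uses Python’s built-in `sqlite3` module with in-memory databases. The `customers` table, the sample rows, and the `federated_query` helper are assumptions for illustration; a real federation layer also has to translate queries between different database types, which this sketch doesn’t attempt.

```python
import sqlite3


def make_db(rows):
    """Create an independent in-memory database with a customers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    conn.executemany("INSERT INTO customers VALUES (?, ?)", rows)
    return conn


def federated_query(connections, sql, params=()):
    """Run the same query on every member database and combine the rows."""
    results = []
    for conn in connections:
        results.extend(conn.execute(sql, params).fetchall())
    return results


eu_db = make_db([(1, "Anna"), (2, "Bjorn")])
us_db = make_db([(3, "Carol")])

rows = federated_query([eu_db, us_db], "SELECT id, name FROM customers ORDER BY id")
print(rows)
```

The caller sees one result set and doesn’t need to know which member database each row came from.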
Examples of Distributed Databases
Many vendors offer databases that you can deploy as a distributed architecture. The following are some of the most popular:
- MongoDB, a popular NoSQL document database that you can distribute across multiple servers. It stores data in collections rather than tables and in documents rather than rows.
- Apache Cassandra, a highly scalable, distributed database system that’s designed for managing large volumes of structured and unstructured data across multiple data centers.
- Amazon DynamoDB, a fully managed NoSQL database service.
Choosing and Designing Your Distributed Database Architecture
When it’s time to choose which database architecture you should use for your organization or application, there are several things to consider. There are no right or wrong answers here. Each architecture has its use cases, so you should choose an architecture that best fits yours. Consider (among other factors) data partitioning, replication, and consistency. In more detail, here are some of the steps that you should take:
- Identify the data that needs to be stored and accessed in the distributed database. This will help determine the amount of storage, schema design, and so forth.
- Determine your data partitioning strategy. Decide how to split the data across multiple nodes, for example by key range or by a hash of the key.
- Choose your replication strategy. You can choose between master-slave, multi-master, or something else.
- Decide on a consistency model. Choose whether you need your data to be strongly consistent across nodes, eventually consistent, or causally consistent.
This is of course not an exhaustive list. You’ll also need to enlist an experienced architect.
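When weighing replication against consistency, one rule of thumb from quorum-based systems can help: with N replicas, if every write is acknowledged by W nodes and every read consults R nodes, then any read set overlaps any write set whenever R + W > N. The sketch below just encodes that inequality. The function name is hypothetical, and real systems add caveats (failed writes, clock skew, repair) that it ignores.

```python
def quorum_overlaps(n, w, r):
    """Return True when every read quorum must intersect every write quorum."""
    return w + r > n


# Compare a few settings for a cluster with three replicas.
for w, r in [(1, 1), (2, 2), (3, 1)]:
    label = "overlapping" if quorum_overlaps(3, w, r) else "non-overlapping"
    print(f"N=3 W={w} R={r}: {label} quorums")
```

With W=1 and R=1 a read may miss the latest write, which is the eventually consistent end of the spectrum. Raising W or R buys stronger guarantees at the cost of latency and availability.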
Like any other technology, distributed databases have their advantages and drawbacks. However, for modern use cases, their advantages outweigh the drawbacks. There are several types of distributed database architecture, and you should only choose the one that best fits your needs after careful consideration.