Introduction
In recent years, big data has taken the world by storm. Legacy relational databases hosted on premises have given way to complex cloud systems that combine relational and NoSQL databases, ingestion engines, and more. In this post, I'll explain big data architecture, how it differs from legacy database architecture, and the different types of big data architectures.
What Is Big Data?
Legacy Systems
Big data refers to extremely large and complex data sets that cannot be effectively processed or analyzed using traditional data processing tools and techniques. By traditional tools and techniques, we usually mean relational databases hosted on premises in data centers. These are typically Microsoft SQL Server, Oracle, PostgreSQL, or databases from other vendors, hosted on a virtual machine, receiving data from a limited number of sources, accessed by a limited number of people, and holding well-defined data. Think of a COBOL application for an insurance company or a bank.
The Current Situation
On the other hand, big data usually means cloud-based storage systems, with multiple input sources and multiple stakeholders and users accessing the data at the same time. The scale can reach billions of events processed by hundreds of nodes and read by millions of users. In general, big data systems are characterized by the following three Vs:
- Volume: The size of the data, typically measured in terabytes or petabytes.
- Velocity: The speed at which data is generated and processed, typically in real time or near real time.
- Variety: The different types and formats of data and data sources. This includes structured data (rows and columns in a database), semistructured data (XML or JSON), and unstructured data (text, images), coming from sources such as social media, IoT devices, and sensors.
The combination of the three factors above requires new approaches, tools, and architectures to efficiently store and process the data.
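To make the variety dimension concrete, here's a small Python sketch (with invented sample values) showing the three shapes of data a single pipeline may need to accept:

```python
# A minimal illustration of the "variety" dimension: the same pipeline
# often has to handle structured, semistructured, and unstructured input.
import csv
import io
import json

# Structured: rows and columns, as you'd find in a relational table export.
structured = io.StringIO("user_id,amount\n1,9.99\n2,24.50\n")
rows = list(csv.DictReader(structured))

# Semistructured: self-describing but with no fixed schema (JSON, XML).
semistructured = json.loads('{"user_id": 1, "tags": ["mobile", "promo"]}')

# Unstructured: free text (or images, audio) with no inherent schema.
unstructured = "Customer called to complain about a delayed shipment."

print(rows[0]["amount"], semistructured["tags"], len(unstructured.split()))
```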
Aspects of Big Data Architecture
Big data architecture is the proper way of designing, implementing, scaling, and maintaining an architecture that will support our data storage and processing needs. The key aspects we need to think about when designing a big data architecture are as follows:
- Data sources: All the data sources that feed into the big data ecosystem.
- Data ingestion: Responsible for ingesting the data from the various sources.
- Data storage: Responsible for storing the ingested data in a scalable, distributed, and fault-tolerant storage system. Some popular options for big data storage include Hadoop Distributed File System (HDFS), Apache Cassandra, and Amazon S3.
- Data processing: Responsible for processing the data stored in the big data system. This can include batch processing (using tools like Apache Spark and Hadoop MapReduce) and real-time processing (using tools like Apache Storm and Apache Flink).
- Data analysis: Analyzing the processed data to extract insights and generate business value. This can involve tools like Apache HBase, Apache Hive, Tableau, and QlikView.
Overall, these components work together to create an architecture that can handle large volumes of data, process it, store it, and generate insights.
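To make that flow concrete, here's a minimal batch-processing sketch using PySpark, one of the tools mentioned above. The bucket paths and column names are hypothetical; it simply reads raw events from distributed storage, aggregates them, and writes the result back for analysis:

```python
# A minimal ingest -> process -> store sketch with PySpark.
# Paths and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("daily-batch").getOrCreate()

# Ingest: read raw JSON events from distributed storage (HDFS, S3, ...).
events = spark.read.json("s3a://my-bucket/raw/events/2024-01-01/")

# Process: aggregate events per user for the day.
daily_counts = events.groupBy("user_id").agg(F.count("*").alias("events"))

# Store: write the result back in a columnar format for analysis tools.
daily_counts.write.mode("overwrite").parquet("s3a://my-bucket/curated/daily_counts/")
```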
Types of Big Data Architecture
There are several types of big data architecture, each with strengths and weaknesses. Organizations tend to choose the architecture that best meets their specific needs and requirements.
Lambda Architecture
Lambda architecture is a popular big data architecture that uses both batch and real-time processing to handle large volumes of data. This architecture has a "batch layer" and a "speed layer." The batch layer is responsible for processing data in bulk and generating batch views, which are precomputed views of the data that can be used for analysis. The speed layer is intended for real-time data. The outputs from both the batch and speed layers are then combined in a serving layer, which gives a unified view of the data. This architecture provides flexibility in terms of the data it can ingest and the ways to process it. However, it's complex and requires a lot of effort to maintain.
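As a toy illustration of the serving layer's job, here's a Python sketch in which plain dicts stand in for the batch and speed views; a real system would back these with stores such as HDFS-based batch views and an in-memory real-time store:

```python
# A toy sketch of the lambda architecture's serving layer: merge a
# precomputed batch view with recent results from the speed layer.

# Batch layer output: precomputed counts up to the last batch run.
batch_view = {"user_1": 1_000_000, "user_2": 250_000}

# Speed layer output: counts for events that arrived after that run.
speed_view = {"user_1": 42, "user_3": 7}

def serving_layer_count(user_id: str) -> int:
    """Unified view: batch result plus the real-time increment."""
    return batch_view.get(user_id, 0) + speed_view.get(user_id, 0)

print(serving_layer_count("user_1"))  # 1000042
print(serving_layer_count("user_3"))  # 7
```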
Kappa Architecture
Kappa architecture is a variant of the lambda architecture described above. The difference is that it doesn’t include a batch layer. Thus, it focuses solely on real-time processing. It uses a stream processing engine to process data as it arrives and stores it in a data store that can be queried in real time.
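Here's a minimal kappa-style sketch using the kafka-python client. The broker address, topic name, and message format are assumptions; the point is that every record flows through the same streaming path, with no batch layer to reconcile. A real deployment would keep the running state in a queryable store rather than in memory:

```python
# A minimal kappa-style consumer: all data is processed as a stream.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "events",                            # hypothetical topic
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

running_counts: dict[str, int] = {}

# Loops indefinitely, updating state as each event arrives.
for message in consumer:
    event = message.value
    running_counts[event["user_id"]] = running_counts.get(event["user_id"], 0) + 1
```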
Data Lake Architecture
Data lake architecture stores raw, unstructured data in a central repository, allowing organizations to store and analyze large volumes of data quickly. In addition, because it's one central repository, managing access is easy. Furthermore, because we don't split the data across multiple storage systems, we can buy the storage space in advance and benefit from economies of scale, paying less for each consecutive TB of storage. The drawback of this approach is that restricting access to specific subsets of the data according to permissions can be tricky in this architecture.
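As a small illustration, here's how raw data might land in an S3-based data lake using boto3. The bucket name and key layout are hypothetical; the point is that data is stored as-is, in its raw form, and organized (here by source and date) for later processing:

```python
# A minimal sketch of landing raw data in a central data lake on Amazon S3.
import boto3

s3 = boto3.client("s3")

raw_payload = b'{"sensor_id": "a17", "temp_c": 21.4}'  # invented sample record

s3.put_object(
    Bucket="my-data-lake",                       # hypothetical bucket
    Key="raw/iot/2024-01-01/reading-0001.json",  # partitioned by source/date
    Body=raw_payload,
)
```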
Hub and Spoke Architecture
Hub and spoke architecture is based on a centralized hub that connects various spokes, or satellite systems. In this architecture, the hub is responsible for managing and storing the data, while the spokes are responsible for ingesting and processing the data. Once the data is in the hub, we can serve it to end users or send it back to the spokes for additional processing. It's an architecture that is easy to grasp and visualize. On the other hand, this architecture can suffer from network latency as we pass data from the spokes to the hub and vice versa.
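Here's a toy Python sketch of that flow: spokes ingest and process records, then forward them to a central hub that stores them and serves reads. All names are illustrative, and real spokes would be separate services communicating over a network:

```python
# A toy hub-and-spoke flow: spokes ingest/process, the hub stores/serves.

class Hub:
    """Central hub: owns storage and serves unified reads."""
    def __init__(self) -> None:
        self._store: list[dict] = []

    def receive(self, record: dict) -> None:
        self._store.append(record)

    def query(self, source: str) -> list[dict]:
        return [r for r in self._store if r["source"] == source]

class Spoke:
    """Satellite system: ingests and processes before forwarding."""
    def __init__(self, name: str, hub: Hub) -> None:
        self.name, self.hub = name, hub

    def ingest(self, raw: str) -> None:
        # Processing step: normalize, then forward to the hub.
        self.hub.receive({"source": self.name, "value": raw.strip().lower()})

hub = Hub()
Spoke("crm", hub).ingest("  New Lead ")
print(hub.query("crm"))  # [{'source': 'crm', 'value': 'new lead'}]
```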
SQreamDB
SQreamDB has a unique architecture that lets you run your big data workloads on GPUs, which massively increases performance compared to other architectures. SQreamDB compresses the data, and the workload is then split into small chunks that run in parallel on separate GPU compute cores. This results in improved storage, processing, and query times. The downside of this architecture is that it's relatively new, and not many big data architects are familiar with it.
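To illustrate the compress-then-chunk-then-parallelize idea (and only the idea; this is not SQreamDB's actual API or internals), here's a CPU-based Python sketch that emulates with worker processes what SQreamDB does on GPU compute cores:

```python
# A conceptual sketch of the pattern described above: compress the data,
# split the workload into chunks, and process the chunks in parallel.
# NOT SQreamDB's API: CPU processes stand in for GPU compute cores.
import zlib
from multiprocessing import Pool

def process_chunk(compressed_chunk: bytes) -> int:
    # Each worker decompresses and processes its own chunk independently,
    # the way separate GPU cores would.
    data = zlib.decompress(compressed_chunk)
    return sum(data)  # stand-in for a real aggregation

if __name__ == "__main__":
    payload = bytes(range(256)) * 10_000  # invented sample data
    chunk_size = len(payload) // 4
    chunks = [zlib.compress(payload[i:i + chunk_size])
              for i in range(0, len(payload), chunk_size)]
    with Pool() as pool:
        print(sum(pool.map(process_chunk, chunks)))
```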
Choosing the Right Big Data Architecture for Your Organization
As we have seen above, each big data architecture has advantages and drawbacks, and each suits specific use cases. For example, if you have only real-time data, you can use the kappa architecture. On the other hand, if performance is critical, you might choose SQreamDB. In any case, it's better to use the services of an experienced database architect to model and design a solution tailored to your needs.
Conclusion
In this post, we discussed big data and how it differs from traditional databases, looked at the different aspects and components of a big data architecture, and described the various types of big data architecture along with their advantages and drawbacks. Choosing the correct big data architecture can be a complex process that requires a detailed analysis of the data, the infrastructure, the ingestion modes, data security, and so forth. This is why it's better to enlist the help of an experienced big data architect for this task.