Today, the average technology company generates data from dozens of sources, which needs to be consumed by different teams and applications. Each team within an organization has unique requirements, and data engineers need to find solutions to support all use cases.
Backend engineers are used to replicating databases to support high availability, backup, and disaster recovery. For data engineers, things are more complex. Data needs to be searched, analyzed, queried in real-time, merged with different sources, and even returned to the source after being enriched. Supporting such diverse operations increases the need for teams to replicate data between applications, databases, data warehouses, and data lakes.
This article will teach you all you need to know about data replication, including its use cases, most prevalent techniques, and everyday challenges.
What is data replication?
Data replication is the process of copying data from a data storage (source) to another data storage (destination) to serve operations, analytics, or data science purposes. Related processes to data replication include data synchronization (continuous harmonization between the source and destination), data ingestion (collecting data from the source), and data integration (bringing data from disparate sources together in a unified view).
Data replication should not be confused with data migration, which implies that you decommission data from the source after copying. In this case, replication can be used as an intermediate step to support the migration since it keeps the source and destination in sync until the migration is complete. At that point, you may reliably transition to the new data storage.
Data replication can be done in batches or in real-time, in which case the source is often referred to as the publisher and the destination as the subscriber.
Data replication examples
Data can be replicated from one form of storage to another, and data engineers are frequently confronted with many combinations. This section discusses data replication examples between different types of storage like applications, databases, data warehouses, and data lakes.
Database to database replication
There are two database to database replication cases: homogeneous or heterogeneous (often called cross replication). The first refers to replicating data between the same type of database; the second involves different databases.
In the first case, backend teams often replicate a database to another instance of the same type to ensure business continuity or to improve performance by distributing queries among a set of database replicas. For example, you might copy data from a MySQL instance to another MySQL instance in a different region so that users query the database that is closer to them.
Heterogeneous replication may be needed in other circumstances. Often, you require capabilities that your current database cannot offer. For example, you may have an application that stores data in MySQL, but you want to use MongoDB for its map-reduce functionality, in which case you need to apply cross replication between MySQL and MongoDB.
Database to data warehouse replication
Copying data from a database to a data warehouse is one of the most common data replication examples. Data warehouses are ideal for performing analytics and are hence the preferred storage for data analysts and BI teams to query, connect to dashboards, or use as a starting point for data transformations.
Imagine that you have an application that uses MySQL, and the analytics team needs to use that data to generate business reports every day. Querying the backend database directly could degrade the application's performance, and analytical queries tend to run inefficiently on a database tuned for transactions. In this case, replicating the data to a Snowflake, BigQuery, or Redshift data warehouse is a popular option.
Database to data lake replication
Another increasingly common data replication use case is building a data lake. Even though the differences between a data lake and a data warehouse are getting blurrier, there are some key differences. Data in databases and data warehouses is mostly transformed using SQL, so when you need to transform the data with a programming language like Python, it’s common to replicate it to a data lake.
Imagine that your application, which uses a MySQL database, contains relevant information for a prediction model. To make the data available to data scientists and ML engineers, you may use a data storage like Amazon S3, from where they can gather and process the raw data for training their models.
More generally, it is also common to store raw and unstructured data in the data lake before centralizing it into a data warehouse that enforces a schema.
Database to a search engine replication
As data grows and text searches become slower, replicating your operational database to a search engine can prove quite helpful.
As in the other data replication examples, imagine that you have a MySQL database. Because MySQL's FULLTEXT indexes are not suited for text lookups on large datasets, you decide to switch to a dedicated text search engine, like Elasticsearch, to scale up the operations. A common solution is to use Airbyte to fetch data from MySQL and replicate it to Elasticsearch.
Application to data warehouse replication
The source data may often reside in an external application and be exposed via an API. To make data from that external source available to data analysts, you first need to replicate it into a data warehouse.
For example, imagine that your company uses tools like Salesforce to manage interactions with customers, and the data analysts need to combine that with other sources to generate some reports. You may build an ELT pipeline to access the data through Salesforce’s API, load it into your Snowflake data warehouse, and then transform it.
In our fast-paced, hyper-connected world, it is not uncommon for sources to take the form of continuous streams of data. Hence, another common case is to have an application that emits a stream of sensor data that needs to be stored in a data warehouse. In this situation, a solution like Amazon Kinesis can be utilized to write IoT data to an S3 data lake, from where it will be replicated to Redshift in near real-time.
Data warehouse to application replication
There are cases when you need to analyze and enrich data and then replicate it from a data warehouse back to an application. This use case is called “reverse ETL.”
Let’s take the previous example, in which your company is using Salesforce. Imagine a separate application that generates events related to the customers, such as someone opting in to marketing notifications. Now, you’re required to integrate that information with Salesforce, so the marketing team knows which customers have opted in.
Suppose the events data is available in a Snowflake data warehouse. In that case, you may build a reverse ETL pipeline to gather the required data and upload it to Salesforce via their API.
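The reverse ETL flow described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory SQLite database stands in for the warehouse, the `events` table and its columns are hypothetical, and `push_record` is a placeholder for whatever function wraps the application's API. No real Salesforce call is shown.

```python
import sqlite3


def build_updates(warehouse):
    """Select the latest opt-in event per customer from the warehouse."""
    rows = warehouse.execute(
        "SELECT customer_id, MAX(event_time) FROM events "
        "WHERE event_type = 'marketing_opt_in' "
        "GROUP BY customer_id ORDER BY customer_id"
    ).fetchall()
    return [
        {"customer_id": cid, "marketing_opt_in": True, "opted_in_at": ts}
        for cid, ts in rows
    ]


def reverse_etl(warehouse, push_record):
    """Send each update through push_record, a caller-supplied API wrapper."""
    updates = build_updates(warehouse)
    for update in updates:
        push_record(update)
    return len(updates)


# Hypothetical sample data standing in for the warehouse contents.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE events (customer_id TEXT, event_type TEXT, event_time TEXT)"
)
warehouse.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("c-1", "marketing_opt_in", "2024-01-05"),
        ("c-2", "page_view", "2024-01-06"),
        ("c-1", "marketing_opt_in", "2024-02-01"),
    ],
)

sent = []  # stands in for the application; a real push_record would call its API
count = reverse_etl(warehouse, sent.append)
```

Keeping the API call behind a single function makes the selection logic easy to test without touching the external application.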
While all combinations between types of sources and destinations are possible, they are not all as common. Some other cases include replicating data from a data lake to a database to make the results of an ML model available to an application, for example. Today, data teams avoid replicating data directly between applications, as this can easily become hard to maintain when data flows from many applications to many others. Instead, data teams consider it a good investment to centralize all business data in the data warehouse first (ETL/ELT process) and then send it back to the business applications (reverse ETL process).
Data replication techniques
Data can be replicated on demand, in batches on a schedule, or in real-time as it is written, updated, or deleted in the source. Typical data replication patterns used by data engineers are ETL (extract, transform, load) and EL(T) pipelines.
The most popular data replication techniques are full replication, incremental replication, and log-based incremental replication. Full replication and incremental replication allow for batch processing. Meanwhile, log-based incremental replication can be a technique for near real-time replication.
Full table replication copies all records from the source to the destination. You can apply full replication when dealing with smaller amounts of data or when replicating data for the first time. This technique tends to cause problems when the data volume increases, especially if your replication frequency is high.
Advantages of full data replication
- If records are hard-deleted from the source, this technique will reflect them in the target storage.
- It doesn’t require a dedicated incremental column.
Disadvantages of full data replication
- It is less efficient and more resource-intensive than other techniques.
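As a minimal sketch of full table replication, the following Python example uses two in-memory SQLite databases to stand in for the source and the destination. The `customers` table and the `full_replicate` function are hypothetical names chosen for illustration; a real pipeline would also need batching and type mapping.

```python
import sqlite3


def full_replicate(source, destination, table="customers"):
    """Copy every row of `table` to the destination, replacing its old contents."""
    rows = source.execute(f"SELECT id, name FROM {table}").fetchall()
    with destination:  # one transaction: the delete and the reload commit together
        destination.execute(f"DELETE FROM {table}")
        destination.executemany(
            f"INSERT INTO {table} (id, name) VALUES (?, ?)", rows
        )
    return len(rows)


source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for conn in (source, destination):
    conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

source.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
source.commit()
full_replicate(source, destination)

# A hard delete in the source is reflected after the next full sync.
source.execute("DELETE FROM customers WHERE id = 1")
source.commit()
full_replicate(source, destination)
replicated = destination.execute("SELECT id, name FROM customers").fetchall()
```

Every run reads the whole source table, which is why the cost grows with both data volume and sync frequency.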
Incremental replication, also known as key-based incremental loading, copies only data changed since the previous update. You can use this technique to replicate high volumes of data since it’s very efficient. Given it’s more complex than full replication, you need to have a good monitoring solution in place.
Advantages of incremental replication
- It is efficient since only updated data is copied to the destination.
Disadvantages of incremental replication
- You cannot replicate hard-deleted data in the source.
- It’s more challenging to implement than full replication. You need to keep track of what was copied to the target to avoid missing data or inserting duplicate records.
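The key-based approach can be sketched as follows, assuming the source table has an `updated_at` column that increases on every write. The table, the integer timestamps, and the `incremental_replicate` function are hypothetical; the point is the saved cursor and the upsert in the destination.

```python
import sqlite3


def incremental_replicate(source, destination, cursor_value):
    """Copy rows changed after the saved cursor and return the new cursor."""
    rows = source.execute(
        "SELECT id, name, updated_at FROM customers "
        "WHERE updated_at > ? ORDER BY updated_at",
        (cursor_value,),
    ).fetchall()
    with destination:
        # Upsert on the primary key so a repeated sync does not create duplicates.
        destination.executemany(
            "INSERT OR REPLACE INTO customers (id, name, updated_at) VALUES (?, ?, ?)",
            rows,
        )
    # Advance the cursor only to the latest value actually copied.
    return rows[-1][2] if rows else cursor_value


source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
for conn in (source, destination):
    conn.execute(
        "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)"
    )

source.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)", [(1, "Ada", 100), (2, "Grace", 110)]
)
source.commit()
cursor_value = incremental_replicate(source, destination, cursor_value=0)

# One update and one hard delete happen in the source before the next sync.
source.execute("UPDATE customers SET name = 'Ada L.', updated_at = 120 WHERE id = 1")
source.execute("DELETE FROM customers WHERE id = 2")
source.commit()
cursor_value = incremental_replicate(source, destination, cursor_value)

replicated = destination.execute(
    "SELECT id, name FROM customers ORDER BY id"
).fetchall()
```

After the second sync, the update has arrived but the deleted row is still present in the destination, which illustrates the hard-delete limitation listed above.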
Log-based incremental replication
Log-based incremental replication is enabled by log-based Change Data Capture (CDC) and is a replication technique that uses a database's binary log files to identify changes. A log file is a record of events that happen in a database.
Advantages of log-based incremental replication
- It’s more efficient than full or incremental replication.
- Data can be copied in near real-time every time a change in the source data is detected.
Disadvantages of log-based incremental replication
- It has to be supported by the source database.
- It only works with specific database event types, like DELETE, INSERT, and UPDATE.
- If there are structural changes in the source data, manual intervention may be required to reflect the changes.
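To illustrate the consuming side of log-based replication, the sketch below applies a list of change events to a replica in commit order. The event shape (`op`, `id`, `row`) is a hypothetical, already-decoded format and not the binary log format of any particular database; reading the log itself is left to the database or a CDC tool.

```python
def apply_change_events(replica, events):
    """Apply ordered change events to an in-memory replica keyed by primary key."""
    for event in events:
        op, key = event["op"], event["id"]
        if op in ("INSERT", "UPDATE"):
            replica[key] = event["row"]
        elif op == "DELETE":
            replica.pop(key, None)  # hard deletes are visible in the log
        else:
            raise ValueError(f"unsupported event type: {op}")
    return replica


# Hypothetical events, listed in the order they were committed in the source.
events = [
    {"op": "INSERT", "id": 1, "row": {"name": "Ada"}},
    {"op": "INSERT", "id": 2, "row": {"name": "Grace"}},
    {"op": "UPDATE", "id": 1, "row": {"name": "Ada L."}},
    {"op": "DELETE", "id": 2},
]
replica = apply_change_events({}, events)
```

Because deletes appear as events, the replica ends up without row 2, something the key-based technique cannot detect. An event type outside the three handled here raises an error, which mirrors the limitation on supported event types.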
Using a data replication tool to solve challenges
Implementing a data integration solution doesn’t come without challenges. At Airbyte, we have interviewed hundreds of data teams and discovered that most of the issues they confront are universal. To simplify the work of data engineers, we have created an open-source data replication framework.
This section covers common challenges and how a data replication tool like Airbyte can help data engineers overcome them.
Using, maintaining, and creating new data sources
If your goal is to achieve seamless data integration from several sources, the solution is to use a tool with several connectors. Having ready-to-use connectors is essential because you don’t want to create and maintain several custom ETL scripts for extracting and loading your data.
The more connectors a solution provides, the better, as you may be adding sources and destinations over time. Apart from providing a large number of ready-to-use connectors, Airbyte is an open-source data replication tool. Using an open-source tool is essential because you have access to the code, so if a connector’s script breaks, you can fix it yourself.
If you need special connectors that don’t currently exist on Airbyte, you can create them using Airbyte’s CDK, making it easy to address the long tail of integrations.
Reducing data latency
Depending on how up-to-date you need the data to be, the extraction procedure can be run at a lower or higher frequency. The higher the frequency, the more processing resources and the more optimized scripts you need to keep data replication latency low.
Airbyte allows you to replicate data in batches on a schedule, at intervals as short as 5 minutes, and it also supports log-based CDC for several sources like Postgres, MySQL, and MSSQL. As we have seen before, log-based CDC is a method to achieve near real-time data replication.
Increasing data volume
The amount of data extracted has an impact on system design. As the amount of data rises, the solutions for low-volume data do not scale effectively. With vast volumes of data, you may require parallel extraction techniques, which are sophisticated and challenging to maintain from an engineering standpoint.
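One common form of parallel extraction splits the primary-key range into chunks and reads them concurrently. The sketch below shows the idea with Python's standard `ThreadPoolExecutor`; `SOURCE_ROWS` and `extract_range` are hypothetical stand-ins for a large table and a ranged query against it.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a large source table: 1,000 rows keyed by an integer id.
SOURCE_ROWS = {i: f"row-{i}" for i in range(1, 1001)}


def key_ranges(min_id, max_id, chunk_size):
    """Split [min_id, max_id] into inclusive, non-overlapping ranges."""
    return [
        (start, min(start + chunk_size - 1, max_id))
        for start in range(min_id, max_id + 1, chunk_size)
    ]


def extract_range(bounds):
    """Hypothetical extractor; a real one would run a ranged query on the source."""
    low, high = bounds
    return [(i, SOURCE_ROWS[i]) for i in range(low, high + 1) if i in SOURCE_ROWS]


ranges = key_ranges(1, 1000, chunk_size=300)
with ThreadPoolExecutor(max_workers=4) as pool:
    # map() yields results in the order of `ranges`, so chunks stay sorted.
    chunks = list(pool.map(extract_range, ranges))

extracted = [row for chunk in chunks for row in chunk]
```

Even in this small form, the chunk boundaries, worker count, and result ordering are all things that have to be chosen and maintained, which hints at why such setups become hard to operate at scale.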
Airbyte is fully scalable. You can run it locally using Docker or deploy it to any cloud provider if you need to scale it vertically. It can also be scaled horizontally by using Kubernetes.
Handling schema changes
When the data schema changes in the source, data replication may fail if the schema on the destination has not been updated. As your company grows, the schema is frequently modified to reflect changes in business processes. The necessity for schema revisions can result in a waste of engineering hours.
Dealing with schema changes coming from internal databases and external APIs is one of the most difficult challenges to overcome. Data engineers commonly have to drop the data in the destination to do a full data replication after updating the schema.
The most effective way to address this problem is to sync with the source data producers and advise them not to make unnecessary structural updates. Still, this is not always possible, and changes are unavoidable.
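For the simplest kind of change, a column added in the source, a full reload can sometimes be avoided by comparing the two schemas before each sync. The sketch below does this with SQLite's `table_info` pragma; the table and function names are hypothetical, and renamed, dropped, or retyped columns are deliberately not handled because they usually need a human decision.

```python
import sqlite3


def column_types(conn, table):
    """Return {column_name: declared_type} using SQLite's table_info pragma."""
    return {row[1]: row[2] for row in conn.execute(f"PRAGMA table_info({table})")}


def add_missing_columns(source, destination, table):
    """Add columns that exist in the source but not yet in the destination."""
    src = column_types(source, table)
    dst = column_types(destination, table)
    added = [name for name in src if name not in dst]
    for name in added:
        destination.execute(f"ALTER TABLE {table} ADD COLUMN {name} {src[name]}")
    return added


source = sqlite3.connect(":memory:")
destination = sqlite3.connect(":memory:")
# The source has gained a `country` column that the destination lacks.
source.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)"
)
destination.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")

added = add_missing_columns(source, destination, "customers")
destination_columns = column_types(destination, "customers")
```

Existing rows in the destination get NULL for the new column until the next sync fills it in.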
Normalizing and transforming data
A fundamental principle of ELT philosophy is that raw data should always be available. If the destination has an unmodified version of the data, it can be normalized and transformed without syncing data again. But this implies a transformation step needs to be done to have the data in your desired format.
Airbyte integrates with dbt to perform transformation steps using SQL. Airbyte also automatically normalizes your data by creating a schema and tables and converting it to the format of your destination using dbt.
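The ELT principle of keeping raw data available can be illustrated with a small sketch. It is not how Airbyte or dbt implement normalization; it only shows that a typed table can be rebuilt from raw payloads already in the destination without syncing again. The `raw_customers` table, the JSON fields, and the cleaning rules are hypothetical.

```python
import json
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE raw_customers (payload TEXT)")
warehouse.executemany(
    "INSERT INTO raw_customers VALUES (?)",
    [
        (json.dumps({"id": 1, "name": " Ada ", "plan": "PRO"}),),
        (json.dumps({"id": 2, "name": "Grace", "plan": "free"}),),
    ],
)
warehouse.commit()


def normalize(conn):
    """Rebuild the typed table from the raw payloads already in the destination."""
    records = [
        json.loads(payload)
        for (payload,) in conn.execute("SELECT payload FROM raw_customers")
    ]
    with conn:
        conn.execute("DROP TABLE IF EXISTS customers")
        conn.execute(
            "CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, plan TEXT)"
        )
        conn.executemany(
            "INSERT INTO customers VALUES (?, ?, ?)",
            [(r["id"], r["name"].strip(), r["plan"].lower()) for r in records],
        )


normalize(warehouse)
customers = warehouse.execute(
    "SELECT id, name, plan FROM customers ORDER BY id"
).fetchall()
```

If the cleaning rules change later, you can run `normalize` again on the same raw table; the source is never queried a second time.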
Monitoring and observability
Working with several data sources produces overhead and administrative issues. As your number of data sources increases, the data management needs also expand, increasing the demand for monitoring, orchestration, and error handling.
You must monitor your extraction system on multiple levels, including resource consumption, errors, and reliability (has your script run?).
Airbyte’s monitoring provides detailed logs of any errors during data replication so that you can easily debug them or report an issue to the community, where other contributors can help you solve it. If you need more advanced orchestration, you can integrate Airbyte with open-source orchestration tools like Airflow, Prefect, and Dagster.
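If you write your own extraction scripts, a thin wrapper can cover the error and reliability levels. The sketch below logs the duration, row count, and failures of a sync and retries a limited number of times; `run_monitored` and `flaky_sync` are hypothetical names, and resource consumption would need a separate tool.

```python
import logging
import time

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("replication")


def run_monitored(job, name, max_attempts=3):
    """Run `job`, logging duration and failures, retrying up to max_attempts."""
    for attempt in range(1, max_attempts + 1):
        started = time.monotonic()
        try:
            rows = job()
        except Exception:
            logger.exception(
                "sync %s failed (attempt %d/%d)", name, attempt, max_attempts
            )
            continue
        elapsed = time.monotonic() - started
        logger.info("sync %s copied %d rows in %.3fs", name, rows, elapsed)
        return {"name": name, "rows": rows, "attempts": attempt, "ok": True}
    return {"name": name, "rows": 0, "attempts": max_attempts, "ok": False}


calls = {"count": 0}


def flaky_sync():
    """Hypothetical sync that fails once, then reports 42 copied rows."""
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("source temporarily unavailable")
    return 42


result = run_monitored(flaky_sync, "customers")
```

The returned dictionary answers the "has your script run?" question and could feed an alert. This sketch retries immediately; a real job would normally wait between attempts.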
Depending on data engineers
Building custom data replication strategies requires experts. But being a bottleneck for stakeholders is one of the most unpleasant situations a data engineer can experience. On the other hand, stakeholders may feel frustrated by relying on data engineers to set up a data replication infrastructure.
Using a data replication tool can solve most of the challenges described above. A key benefit of employing such tools is that data analysts and BI engineers can become more independent and begin working with data seamlessly and as soon as possible, in many cases, without depending on data engineers.
Airbyte is trying to democratize data replication and make it accessible for data professionals of all technical levels. When working with Airbyte, you can use the user-friendly data replication UI.
In this article, we learned what data replication is, some common examples of replication, and the most popular data replication techniques. We also reviewed the challenges that many data engineers face and, most importantly, how a data replication tool can help data teams better leverage their time and resources.
The benefits of data replication are clear. But as the number of sources and destinations continues to grow, companies need to be prepared for the challenges associated with it. That’s why it’s essential to have a reliable and scalable data replication strategy in place.