How to do Data Replication from MySQL, PostgreSQL, SQL Server, Oracle database using real time change data capture (CDC) to Google BigQuery

When you are dealing with analytic project, you will definitely go through these steps:
1. Deciding your Data Sources
2. Extracting and ingesting your Data from Source to Target
3. Storing your Data in storage/database
4. Transforming your data based on your business needs
5. Consuming your data via explorer/reports/dashboards
6. Activating your data via API, Machine Learning, export, integration

One of the most topic that always been asked is how to ingest data from source to target. There are 2 approach for this, either batch (hourly, daily, etc) or near real time. Based on your needs, both has different approach – you need to check whether the database target able to handle real time data, whether you are able to analyze the data in real time, or if it is in batch mode then how big is the data during the intervals (ex: 10 MB per 30 minutes)

This is the part that will be conducted by Data Engineer. Overall diagram that we would like to achieve in Google end to end platform is like this:

Back to the topic based on title of this post. We will explore how to do database replication from heterogeneous sources such as MySQL, PostgreSQL, SQL Server, Oracle DB, etc using near realtime mechanism via change data capture to Google BigQuery as the platform for datawarehouse from Google which has the capability of realtime ingestion, serverless, and managed service by Google.

There are couple of mechanism provided for you to do this, such as:

1. Use Google Datastream.
Google’s new, serverless CDC and replication service, solves many of these challenges. Datastream synchronizes data across heterogeneous databases, applications, and storage systems with minimal latency. It supports data replication for a variety of use cases, including real-time analytics.

2. Use Cloud Data Fusion.
Google Cloud’s integration service for building ETL and ELT data pipelines. Data Fusion already supports the replication of data from SQL Server and MySQL to BigQuery through an easy-to-use, wizard-driven experience. 

3. Use Google Dataflow.
Dataflow is Google’s fully managed service for unified stream and batch data processing. Dataflow’s integration with Datastream includes the launch of three new templates that replicate data to BigQuery, Cloud Spanner and Cloud SQL for PostgreSQL. You can also extend and customize the Dataflow templates that ingest and process changed data from Datastream sources. 

4. Use 3rd party solution from Google’s partner via Google Marketplace such as FiveTran, Striim, Confluent, Attunity, Informatica, etc

I will write following post to describe and doing Hands On for Datastream, Dataflow, and Datafusion after this.

To summarize, and make it easy for you to decide which are the flow to migrate/replication/change data capture (CDC) from your data sources to target either BigQuery for Analytics or RDBMS for Transactional here is the tree decision diagram for you to follow:

References:
– https://cloud.google.com/dataflow/docs/guides/templates/provided-streaming#datastream-to-bigquery
– https://cloud.google.com/datastream/docs/implementing-datastream-dataflow-analytics
– https://github.com/GoogleCloudPlatform/DataflowTemplates/tree/main/v2/cdc-parent

Leave a Comment

Your email address will not be published.

This site uses Akismet to reduce spam. Learn how your comment data is processed.