Migration and replication data from SQL Server to Snowflake

SQL Server to Snowflake data replication

Migration Goals and Requirements for Business

We live in the digital age when data derived from constant ongoing processes is collected in digital form for further processing and analysis.

Currently, the volume of stored and processed data is continuously growing. According to the statistics, in 2021, 79 Zettabytes of data were created worldwide. By 2025 the expected amount of data to be generated is estimated at 180 Zettabytes. Approximately 90% of it is replicated data.

Large amounts of data become a challenge for classical RDBMS systems, as they increase the load on a databases, slowdown data processing and as result require more computing power and additional licenses.

As long as not all existing systems nowadays are able to cope with such challenges, many companies are in process of moving data out of traditional RDBMS databases to systems that specialize in storing and processing large amounts of data. Therefore, the demand for data migration to such systems is growing more and more.

In this article, we consider issues of data migration and replication using the example of SQL Server to Snowflake data migration.

Why Snowflake over SQL Server?

Snowflake is a data warehouse system, which runs on the top of cloud infrastructure and provides near-unlimited storage and computing power in real time. It is designed and optimized to store, process and analyze big volumes of data.

Snowflake is commonly used in cases where there is a need to offload classical databases by migrating BI/reporting data into it for further processing and analyses, which irons out most of the issues typical for traditional databases. As a result the demand to migrate SQL Server to Snowflake is constantly growing. Nevertheless, delivering data is the challenge itself.

SQL Server to Snowflake migration challenges

There is no simple option to connect SQL Server to Snowflake, nevertheless Snowflake provides various methods to load data. The data can be delivered using SQL command, files, Snowpipe, etc, which allow the load of large data volumes into the database rapidly.

However, the data needs to be extracted, prepared and delivered in a consistent state. Another problem is to keep the Snowflake database in sync, which usually requires delivering the data that was changed since the last refresh. Also, during the migration, such challenges as data type conversion, data transformation and data filtering need to be taken into account. Let’s overview main SQL Server to Snowflake migration challenges in the details.

Data Type Conversion

Before the migration all the necessary tables should be created in Snowflake. Snowflake supports most of the SQL Server’s data types, however, some of them may have limitations or require conversion. For instance, the VARBINARY type in Snowflake is limited to 8MB. The DATATIMEOFFSET type should be converted to TIMESTAMP WITH TIME ZONE type, etc. All these cases require additional attention during the migration process, and should be taken in consideration while planning the migration of schema structure from Microsoft SQL Server to Snowflake.

Initial Data Load

Initial Load is needed to copy the initial set of user data, which usually implies that a large amount of data is to be extracted and delivered into Snowflake for further processing.

Necessity of preliminary data transformation and filtering may significantly complicate the migration process and require additional data processing steps to be implemented.

Another challenge to move data from SQL Server to Snowflake is the necessity to insert data in the consistent state, which either requires keeping the Source system in "read-only" mode at the time of migration or requires all the data changes that have been generated during the migration to be identified and applied after migration is completed.

Ongoing Data Synchronization

Ongoing Data Synchronization (Ongoing Replication) is the process that keeps target database up to date. The complexity of this process is to track and extract the data set, which has been changed on the Source database since the last refresh and based on it the SQL commands should be generated and executed on the Target database.

All the changes should be applied in the same order as they occurred in the Source database. As with Initial Data Load, any transformations, filtering, and data type conversions can also complicate the replication process.

The Easiest Way to Get Data from SQL Server to Snowflake is with Repstance

Repstance is real-time replication tool for SQL Server and Oracle databases, which has highly advanced replication and transformation abilities. As a target system it supports Oracle, SQL Server, MySQL, PostgreSQL, Amazon Aurora, Amazon Redshift, Snowflake and S3.

Repstance is delivered as a Virtual Machine Image and is accessible from both AWS and Azure Marketplaces, using the “1 Click “ download process.

Why Repstance

Repstance is capable to propagate large volumes of data with very high speed and able to transform, filter and enrich the data as a part of the replication process. The data is inserted into Target Database within transaction and in the same sequential order they were executed in the Source Database. That is to say, the Target Database is always in consistent state.

Repstance supports not only DML, but also DDL replication, meaning that any create or delete changes of table structure will be replicated, if desired.

Before migrate data from SQL Server to Snowflake Repstance creates all the necessary tables, taking in account Target database specific, if the appropriate option is selected. This significantly simplifies the process of migration and minimizes database specialists' involvement.

Prerequisites

In this article’s case Repstance only needs the SQL Server version that permits using of CDC (Change Data Capture) and Network Access to both Source and Target databases.

Data Migration and Replication Setup Overview

To replicate data Repstance needs two processes – Capture and Apply to be configured. These processes can be configured either via the Command Line Interface (CLI) or via the Rest API, or via Web User Interface (Web UI).

The Capture Process extracts data from the Source Database and puts it into locally stored Trail Files in the same sequential order as the transactions occurred in the database. These Trail Files are consumed by Apply Process to insert the captured data into the Target Database. The Apply Process can also modify data or/and data definition if the appropriate transformation rules are configured.

Initial Load is the optional part of Capture Process configuration, which is used to copy all existing data before replicating the changes. Alternately, data replication can be started from any desired timestamp assuming that at this timestamp the data has already been synchronized.

In order to successfully run Capture and Apply Processes both Source and Target Databases should be “prepared” for the replication. It can be done by Repstance itself and does not require any manual SQL scripts to be developed and executed.

Let’s walk through the process of setting up data migration and replication process using Repstance.

Prepare Source and Target Databases

As we already mentioned, before running the Capture and Apply Processes both Source and Target Databases need to be prepared meaning that the minimal set of the Repstance’s objects must be installed. Repstance has built-in functionality that allows it to prepare databases by simply running Repstance’s command.

Repstance creates “REPI” schema and installs all the objects into it to separate them from user’s schemas. It automatically enables CDC on Target Database and on the tables that are configured for the replication.

Prepare Capture and Apply Processes

Once both databases are prepared for the replication the Capture and Apply Process need to be configured.

In order to configure Capture Process it’s enough to provide database credentials and list of tables to be included into the replication. It is a good practice to create a separate database user for Repstance. The database user should have read-write access to REPI’s objects and read-only access to CDC and user’s tables.

Capture Process utilizes different parameters to specify tables for DML and DDL replications and Initial Load. Tables can be specified not only by name but also by mask, which eliminates the need to list all of them. Any data and/or data structure transformation rules can also be defined as a part of Capture Process configuration.

The configuration of Capture Process (list of objects included into the replication, transformation rules, etc.) can be modified at any time with further reloading tables, if needed.

Apply process keeps Target Database in sync using the data that has been extracted by the Capture Process. Apply process will load initial set of data and create all the necessary tables if this option is specified. Based on the data change it will construct and run SQL to apply the changes. Apply Process has built-in data loss protection mechanism that allows gracefully restarting the process if a disaster has occurred. Apply Process has various parameters to specify how the data should be inserted, which allows to customize replication configuration to meet specific business requirements.

For the Apply Process configuration it is necessary to provide Target Database credentials and name of the corresponding Capture Process. This name helps the Apply Process to identify data set extracted by the Capture Process.

Repstance allows configuring up to 10 Apply Processes consuming data from one Capture Process to distribute it across different databases implementing individual data processing rules for each database.

Repstance dynamically selects optimal method to load the data which depends on many factors such as type of the Target system, data volume, types of database operations, number of transactions, etc. For Snowflake data can be loaded and processed using “staging” or through the temporary tables or in some other way.

If the real-time replication mode is not required for the Snowflake database, Repstance can be configured to accumulate all the data changes in memory and apply them on schedule.

Maintenance and monitoring the replication

Repstance gives a 360 view into the intricacies of the database replication process with the highest granular health check, with reporting of replication lags (with cause/consequences analysis), detailed DB replication statistics (including, but not limited to transactions count, objects modification count, DDLs count, average transaction rate calculation, etc.).

Conclusion

Instead of giving a lot of technical details on how to migrate data from SQL Server to Snowflake we decided to simply show you how to elegantly and quickly get you started with Repstance in this video.