Wednesday, 6 November 2019

How to implement an Archive and Purge solution for fact and dimension table data in a DW database


BI developers often hear about performance issues in BI applications from customers, and one of the most common causes is data volume. Over time, the underlying databases of BI applications grow to an extent that degrades application performance. Developers rarely implement a solution to purge data from an OLAP system, because the more data there is, the more accurate the predictions will be.

In this article, I will explain how to implement an Archive and Purge solution for a data warehouse. The idea is to have one configuration table that stores the key columns, table details, batch size, retention period, and so on; keep data in the online warehouse database according to the retention period; copy the rest of the data to the archive warehouse database in batches; and then purge the copied data from the online warehouse database. I will attach the scripts as I go over the steps.

The goal is to move the cold (old) data from the current warehouse database to the archive warehouse database based on the retention settings in the configuration table.

I have assumed that the current warehouse database does not have foreign key constraints between the fact and dimension tables. I am using the "AdventureWorksDW2017" database to implement this; since this sample database does have foreign key constraints, I will disable all of them.
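Disabling every constraint in the sample database can be done in one pass. A minimal sketch, using the well-known (but undocumented) sp_MSforeachtable helper:

```sql
USE AdventureWorksDW2017;

-- Disable all foreign key and check constraints in the current database.
-- sp_MSforeachtable is undocumented but widely used for this purpose.
EXEC sp_MSforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL';

-- To re-enable (and re-validate) them later, you could run:
-- EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';
```

The attached script may disable constraints differently; this is just one common way to do it.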


Let's implement it.

1. Download the AdventureWorksDW2017 database from here and restore it to your SQL instance.

2. Copy the following script, "CREATE CONFIGURATION ENTRIES.sql", and execute it in your SQL instance. This script does the following:

  • Clone "AdventureWorksDW2017" to "AdventureWorksDW2017_Archive". (Cloning is not strictly necessary; you can instead create the archive database manually, along with the tables whose data needs to move from the current database to the archive.)
  • Create the ArchiveFactTablesConfig table to store configuration settings such as the retention period, batch size, key columns, filter column, online database, and archive database
  • Add sample entries to the configuration table for a few fact tables (since this is a one-time task, you can add the configuration for all necessary tables manually).
  • Add the AdventureWorksDW_Archive_Log logging table, which tracks data movement between the current and archive data warehouse fact tables.
  • Disable all constraints in the database
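The exact schema is defined in the attached script; a minimal sketch of what the configuration table might look like, with column names inferred from the settings described above, is:

```sql
-- Sketch only: column names and types are my assumptions based on the
-- settings mentioned in this article, not the attached script verbatim.
CREATE TABLE dbo.ArchiveFactTablesConfig
(
    CONFIG_ID        INT IDENTITY(1,1) PRIMARY KEY,
    TABLE_NAME       SYSNAME       NOT NULL,  -- must be unique per row
    KEY_COLUMNS      NVARCHAR(500) NOT NULL,  -- comma-separated key column list
    FILTER_COLUMN    SYSNAME       NOT NULL,  -- date column used for the retention filter
    RETENTION_PERIOD INT           NOT NULL,  -- days of data to keep online
    BATCH_SIZE       INT           NOT NULL,  -- rows moved per batch
    BUCKET_GROUP     INT           NOT NULL,  -- groups executed concurrently
    ONLINE_DATABASE  SYSNAME       NOT NULL,
    ARCHIVE_DATABASE SYSNAME       NOT NULL
);

-- A sample entry for one fact table
INSERT INTO dbo.ArchiveFactTablesConfig
    (TABLE_NAME, KEY_COLUMNS, FILTER_COLUMN, RETENTION_PERIOD,
     BATCH_SIZE, BUCKET_GROUP, ONLINE_DATABASE, ARCHIVE_DATABASE)
VALUES
    ('FactInternetSales', 'SalesOrderNumber,SalesOrderLineNumber', 'OrderDate',
     3225, 10000, 1, 'AdventureWorksDW2017', 'AdventureWorksDW2017_Archive');
```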

3. Let's check the ArchiveFactTablesConfig and logging tables after executing Step 2. Here there are 7 sample tables under 3 bucket groups. The idea is to have a bucket group column in the configuration table so the stored procedure can be executed concurrently for each bucket group.
4. Copy the following script, "sp_PurgeFactTablesData.SQL", and execute it in your SQL instance. This script does the following:

  • Add the Archive schema to the AdventureWorksDW2017 database
  • Create the ARCHIVE.sp_PurgeFactTablesData stored procedure, which accepts TABLE_NAME as a parameter. This procedure will
    • read all settings from the configuration table based on the table name (the table name must be unique)
    • copy the data from the current database to the archive based on the retention settings
    • purge the copied data from the current warehouse database
    • add logging entries to AdventureWorksDW_Archive_Log for tracking purposes.
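The attached script contains the full procedure; the core copy-then-purge loop could be sketched roughly as below. This is a simplified sketch with the table, key columns, and filter column hard-coded for FactInternetSales; the real procedure builds the statements with dynamic SQL from the configuration row, and error handling and logging are omitted here.

```sql
DECLARE @BatchSize INT = 10000, @RetentionDays INT = 3225, @Rows INT = 1;

WHILE @Rows > 0
BEGIN
    BEGIN TRANSACTION;

    -- Copy one batch of cold rows to the archive database,
    -- skipping rows that were already archived in a previous batch
    INSERT INTO AdventureWorksDW2017_Archive.dbo.FactInternetSales
    SELECT TOP (@BatchSize) s.*
    FROM AdventureWorksDW2017.dbo.FactInternetSales AS s
    WHERE DATEDIFF(DD, s.OrderDate, GETDATE()) > @RetentionDays
      AND NOT EXISTS (SELECT 1
                      FROM AdventureWorksDW2017_Archive.dbo.FactInternetSales AS a
                      WHERE a.SalesOrderNumber = s.SalesOrderNumber
                        AND a.SalesOrderLineNumber = s.SalesOrderLineNumber);

    -- Purge the archived rows from the online database
    DELETE s
    FROM AdventureWorksDW2017.dbo.FactInternetSales AS s
    WHERE DATEDIFF(DD, s.OrderDate, GETDATE()) > @RetentionDays
      AND EXISTS (SELECT 1
                  FROM AdventureWorksDW2017_Archive.dbo.FactInternetSales AS a
                  WHERE a.SalesOrderNumber = s.SalesOrderNumber
                    AND a.SalesOrderLineNumber = s.SalesOrderLineNumber);

    SET @Rows = @@ROWCOUNT;
    COMMIT TRANSACTION;
END
```

Wrapping each batch in its own transaction keeps the copy and the purge of the same rows atomic while avoiding one huge long-running transaction.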

5. Now the configuration tables are set up and the stored procedure is available in the current warehouse database. To execute the stored procedure concurrently for each bucket group, an SSIS package is the better approach.

Before going into that approach, let's check the row count of one fact table. I have set RETENTION_PERIOD to 3225 because I would like to keep 3225 days of data in the online warehouse database; the rest of the data should be moved to the archive database and purged from the online database. We can use the same queries to verify the results after the SSIS package executes.

SELECT * FROM AdventureWorksDW2017.dbo.[ArchiveFactTablesConfig]

SELECT * FROM AdventureWorksDW2017.dbo.AdventureWorksDW_Archive_Log

SELECT OrderDate,COUNT(*) AS ROWCNT FROM AdventureWorksDW2017.dbo.FactInternetSales WHERE  DATEDIFF(DD,OrderDate,GETDATE()) > 3225 GROUP BY OrderDate

SELECT OrderDate,COUNT(*) AS ROWCNT FROM AdventureWorksDW2017_Archive.dbo.FactInternetSales WHERE  DATEDIFF(DD,OrderDate,GETDATE()) > 3225 GROUP BY OrderDate

SSIS Approach: Create a job in SQL Server Agent to call the SSIS package and schedule it as needed. Copy the following SSIS package, "Archive Fact Data.dtsx", and change the database connection by updating the "AdventureWorksDWConn" variable value in the package.

Before executing the package, please make sure you have correctly set FILTER_COLUMN, KEY_COLUMNS, RETENTION_PERIOD, BATCH_SIZE, and so on.

Let's execute the SSIS package.

Let's check the status in the configuration table and the row counts in the logging table, using the same queries:
SELECT * FROM AdventureWorksDW2017.dbo.AdventureWorksDW_Archive_Log WHERE TABLE_NAME = 'FactInternetSales'

SELECT OrderDate, COUNT(*) AS ROWCNT FROM AdventureWorksDW2017.dbo.FactInternetSales WHERE DATEDIFF(DD, OrderDate, GETDATE()) > 3225 GROUP BY OrderDate

SELECT OrderDate, COUNT(*) AS ROWCNT FROM AdventureWorksDW2017_Archive.dbo.FactInternetSales WHERE DATEDIFF(DD, OrderDate, GETDATE()) > 3225 GROUP BY OrderDate

With this, we can confirm that the fact table records were copied to the archive database and purged from the online warehouse database.

6. Another approach is to use the code below to call the "sp_PurgeFactTablesData" stored procedure.

SQL Approach:

Create a SQL Agent job to run the T-SQL code below. This approach is helpful if you would like to load the table records in dependency order and need to move/process the records sequentially, one table after another. You can add one more column to the [ArchiveFactTablesConfig] table to store the parent/child table dependency order. You will have to tweak the code below based on your requirements.

USE AdventureWorksDW2017
IF (OBJECT_ID('tempdb..#Tablenames') IS NOT NULL) DROP TABLE #Tablenames
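Continuing that snippet, a sequential driver could look roughly like the sketch below. The parameter name @TABLE_NAME and the ordering by BUCKET_GROUP are my assumptions; BUCKET_GROUP stands in for the dependency-order column you may add.

```sql
-- Build a worklist from the configuration table, then call the
-- procedure for each table in turn.
SELECT TABLE_NAME,
       ROW_NUMBER() OVER (ORDER BY BUCKET_GROUP, TABLE_NAME) AS Seq
INTO   #Tablenames
FROM   dbo.ArchiveFactTablesConfig;

DECLARE @Seq INT = 1, @Max INT, @TableName SYSNAME;
SELECT @Max = MAX(Seq) FROM #Tablenames;

WHILE @Seq <= @Max
BEGIN
    SELECT @TableName = TABLE_NAME FROM #Tablenames WHERE Seq = @Seq;

    -- Archive and purge one table at a time, in order
    EXEC ARCHIVE.sp_PurgeFactTablesData @TABLE_NAME = @TableName;

    SET @Seq = @Seq + 1;
END
```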

7. Please make sure to do a full process of the cubes if BI applications are connected to cubes. If users would like to analyse historical data, they will have to connect to the archive database and process that data when needed.

You can download this code from GitHub too.
Kindly let me know if there is a better way.

