Saturday, December 31, 2011

Change Detection

An efficient way to compare data of two database tables 

Oracle Database, SQL Server Database

No question, the most convenient way to detect changes is CDC (Change Data Capture), which means that you only get the changed data rows. In Oracle this can be implemented using Oracle CDC, Golden Gate, Trigger, Materialized View Logs or in SQL Server there is also a CDC functionality.
But due to company rules this is not always an option and sometimes it's also necessary to synchronise the DWH with the source system when you are missing some changes in the DWH, for example after errors in the ETL process or if CDC is disabled for some reason, e.g. after structural changes in the source system. For small dimension tables comparing the source table with the DWH table is not a big deal. But when we need to compare large tables containing millions of rows, this might be a performance issue.

After some testing, I found the by far fastest approach to compare two tables here: On Injecting and Comparing
As we also need updated values, I added an analytic function to partition the result set by the key columns. 
The structure of the result is similiar to the one CDC would give you. You get the inserted rows (I), deleted rows (D) and updated rows (U) with the new values and also the old values (O) if desired.

SELECT Key1,Key2,Attribute1,Attribute2,Attribute3
       CASE WHEN count = 2 and tbl1 = 1 
            THEN 'U'
            WHEN count = 1 and tbl1 = 1
            THEN 'I'
            WHEN count = 1 and tbl1 = 0 
            THEN 'D'
            ELSE 'O' 
       end dmltype
  FROM (SELECT Key1,Key2,Attribute1,Attribute2, Attribute3,
               min(count1) tbl1,         
             count(*) over (partition by Key1,Key2 order by null) count
          FROM(SELECT Key1,Key2,Attribute1,Attribute2,Attribute3, 
                       count(tbl1) count1
                       ,count(tbl2) count2
                  FROM ( SELECT  Key1,Key2,Attribute1,Attribute2,Attribute3
                               1 tbl1, 
                               to_number(null) tbl2 
                           FROM  source_table                     
                        UNION ALL
                         SELECT  Key1,Key2,Attribute1,Attribute2,Attribute3
                               to_number(null) tbl1, 
                               2 tbl2 
                           FROM  DWH_Table                     
                       )
                 GROUP BY Key1,Key2,Attribute1,Attribute2,Attribute3
                 having count(tbl1) != count(tbl2)
               )
               GROUP BY Key1,Key2,Attribute1,Attribute2,Attribute3 
       )

The performance of this query is incredible compared to other solutions, like using an outer join operation. Thank you Tom Kyte!
Please note that you must cast the source and target columns to the same datatype before you can perform the union. If you are comparing tables from databases with different character sets you also have to convert text columns into the same character set, for example using the oracle convert function (e.g. convert('text','WE8EBCDIC500', 'WE8ISO8859P1'))

The query returns inserted, deleted and updated rows, which can be the source for an usual ETL process, that inserts the SCD (1,2,3) data into the DWH.

No comments:

Post a Comment