Friday, August 3, 2012

Slowly Changing DWH (facts+dimensions)

How to enable a DWH for slowly changing facts
Datawarehouse Architecture, Oracle Database, SQL Server, SAP BusinessObjects

In the classic Kimball DWH we have fact tables and dimension tables, which can have different types of historization. Usually this approach satisfies most customer needs. Hence I was quite surprised when a project required me to track all changes of dimensions as well as of fact tables (SCD Type 2). The facts were, for example, estimated values which get updated frequently. Whether we can really call them facts is, in the end, a theoretical question; the fact is that the customer needs the possibility to view reports exactly as they were at a given point in time, with the appropriate dimension and fact values. Of course, most of the time the customer wants to see the most current values, but from time to time previous report states are needed as well.
I want to break the design explanation down into the 3 points that were necessary to fulfill these requirements:

Change Detection
No question, the most convenient way to detect changes is CDC (Change Data Capture), which means that you only get the changed data rows. But this is not always an option, and sometimes it's also necessary to synchronise the DWH with the source system when changes are missing in the DWH, for example after structural changes in the source system. For small dimension tables, comparing the source table with the DWH table is not a big deal. But when we need to compare fact tables containing millions of rows, this might become a performance issue.
After some testing, I found by far the fastest approach to compare two tables here: On Injecting and Comparing
As we also need updated values, I added an analytical function to partition the result set by the key columns.


select Key1, Key2, Attribute1, Attribute2,
       case when cnt = 2 and tbl1 = 1
            then 'U'   -- new version of a changed row
            when cnt = 1 and tbl1 = 1
            then 'I'   -- row exists only in the source table: insert
            when cnt = 1 and tbl1 = 0
            then 'D'   -- row exists only in the DWH table: delete
            else 'O'   -- old version of a changed row
       end flag
  from (select Key1, Key2, Attribute1, Attribute2,
               count1 tbl1,
               -- number of differing rows per business key: 2 = update, 1 = insert or delete
               count(*) over (partition by Key1, Key2) cnt
          from (select Key1, Key2, Attribute1, Attribute2,
                       count(tbl1) count1,
                       count(tbl2) count2
                  from (select Key1, Key2, Attribute1, Attribute2,
                               1 tbl1,
                               to_number(null) tbl2
                          from source_table
                        union all
                        select Key1, Key2, Attribute1, Attribute2,
                               to_number(null) tbl1,
                               2 tbl2
                          from DWH_Table
                       )
                 group by Key1, Key2, Attribute1, Attribute2
                having count(tbl1) != count(tbl2)
               )
       )

The performance of this query is impressive compared to the other approaches I tested.
The query returns inserted, deleted and updated rows (as well as the old values if you need them), which serve as the source for a usual ETL process that inserts the SCD2 data into the DWH (e.g. How to load a Slowly Changing Dimension Type 2 with one SQL Merge statement in Oracle).
Please check out a more detailed explanation here: Change Detection
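
To give an impression of how the flagged rows can be consumed, here is a minimal two-statement sketch of an SCD2 load (not the single MERGE statement from the linked article); the staging table change_set, the DWH table dwh_table, its columns and the '31/12/9999' end date are assumptions for illustration:

-- Step 1: close the current version of rows flagged as updated or deleted
-- (overlap/gap handling of the validity interval is simplified here)
update dwh_table d
   set d.valid_to = sysdate
 where d.valid_to = to_date('31/12/9999','dd/mm/yyyy')
   and (d.key1, d.key2) in (select key1, key2 from change_set where flag in ('U','D'));

-- Step 2: insert the new version of rows flagged as inserted or updated
insert into dwh_table (key1, key2, attribute1, attribute2, valid_from, valid_to)
select key1, key2, attribute1, attribute2,
       sysdate,
       to_date('31/12/9999','dd/mm/yyyy')
  from change_set
 where flag in ('I','U');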

Data Model
At the beginning this point was the trickiest one. Building a DWH with surrogate keys etc. as we know it gets really complicated here. To store the right surrogate key of a dimension value in the fact table, we actually have to cut the fact data into pieces every time the fact or a referenced dimension changes. When a fact table references plenty of dimensions, we end up after some time with a really huge fact table, a complex ETL process and, due to the large fact table, probably poor performance. Each time we register a change in one of the referenced dimension tables, a new fact row with the new surrogate key of that dimension must be inserted into our fact table. When you consider that dimensions can also reference other dimensions, implementing an efficient ETL process for this requirement is no fun.
Here I have to thank this blog entry: Slowly Changing Facts, which brought me to the useful Kimball Design Tip #74. Even though this is not exactly what I was looking for, it's good to know that there is also a theoretical answer to the problem of slowly changing facts.
We decided in the end to use the business keys to reference the dimension values. Of course this gives you multiple corresponding rows in your join, so you have to ensure in your reporting tool that all tables contain a filter condition on a date, which the user can specify in a prompt (or however it's called in your relational reporting tool). For example, in SAP BusinessObjects I used this expression:


WHERE TO_DATE(NVL(TRIM(@Prompt('Query Date','A',,MONO,FREE)),'31/12/9998 12:00:00'),'dd/mm/yyyy hh24:mi:ss') BETWEEN VALID_FROM AND VALID_TO


If the user doesn't enter a value or enters only a space, the 'Query Date' is replaced by the default date '31/12/9998', which returns the current state of the data. If the user enters any date in the past, the query returns exactly the data from that point in time.
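
To make the business key join concrete, here is a hedged example of how such a report query could look; the tables fact_estimates and dim_product, the business key product_bk and the bind variable :query_date are placeholders for illustration, not the original project objects:

select d.product_name,
       sum(f.estimated_amount) total_amount
  from fact_estimates f
  join dim_product    d
    on d.product_bk = f.product_bk                        -- join on the business key, not a surrogate key
 where :query_date between f.valid_from and f.valid_to    -- fact version valid at the query date
   and :query_date between d.valid_from and d.valid_to    -- dimension version valid at the same date
 group by d.product_name

Without the two date filters the join would return one row per stored version, which is exactly why every table in the report query needs the prompt condition shown above.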

Performance Optimization
Usually the customer wants to see the most current state of the report, which should show up in a few seconds, while viewing a previous report state can take several minutes.
This requirement could be fulfilled perfectly using query-rewrite-enabled materialized views (in SQL Server terms: indexed views). For more information about query rewrite, there are plenty of useful internet sources available (e.g. Query Rewrite).
To get a query-rewrite-enabled materialized view, the query must return deterministic results. That's why we need a constant value to get the results for the most current date. As in most SCD2 implementations, a date in the year '9999' in the 'valid_to' column represents the current data row.
We add the following filter condition to each table of the report query in our materialized view.

WHERE TO_DATE(NVL(TRIM(''),'31/12/9998 12:00:00'),'dd/mm/yyyy hh24:mi:ss') BETWEEN VALID_FROM AND VALID_TO

When the report user enters an empty string '' into the prompt dialog, the query gets automatically rewritten and the precalculated materialized view is used instead of the detail tables. Using this technique, the most current report version is shown instantly. Only when a previous report state is needed and the user enters a past date into the prompt dialog do the detail tables have to be queried, and the report might take a bit longer to show up.
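
A minimal sketch of such a materialized view, reusing the placeholder tables from the example above (report_mv and the column names are assumptions, not the original project objects):

create materialized view report_mv
  build immediate
  refresh complete on demand
  enable query rewrite
as
select d.product_name,
       sum(f.estimated_amount) total_amount
  from fact_estimates f
  join dim_product    d
    on d.product_bk = f.product_bk
 -- the same constant filter as in the report query when the prompt value is empty
 where to_date(nvl(trim(''),'31/12/9998 12:00:00'),'dd/mm/yyyy hh24:mi:ss')
         between f.valid_from and f.valid_to
   and to_date(nvl(trim(''),'31/12/9998 12:00:00'),'dd/mm/yyyy hh24:mi:ss')
         between d.valid_from and d.valid_to
 group by d.product_name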

Note that some prerequisites must be fulfilled to enable query rewrite for materialized views; the most important points are:
  • Create the materialized view with the 'ENABLE QUERY REWRITE' option

  • Set query_rewrite_integrity, e.g. alter system set query_rewrite_integrity=stale_tolerated scope=spfile;
  • Set query_rewrite_enabled, e.g. alter system SET query_rewrite_enabled=FORCE scope=spfile;