Friday, August 3, 2012

Slowly Changing DWH (facts+dimensions)

How to enable a DWH for slowly changing facts
Datawarehouse Architecture, Oracle Database, SQL Server, SAP BusinessObjects

In the classic Kimball DWH we have fact tables and dimension tables, which can have different types of historization. Usually this approach satisfies most customer needs, so I was quite surprised when a project required me to track all changes of dimensions as well as of fact tables (SCD Type 2). The facts were, for example, estimated values that get updated frequently. Whether we can really call them facts is in the end a theoretical question; the practical fact is that the customer needs to be able to view reports exactly as they were at a given point in time, with the appropriate dimension and fact values. Of course, most of the time the customer wants to see the most current values, but from time to time previous report states are needed as well.
I want to break the design explanation down into the three points that were necessary to fulfill these requirements:

Change Detection
No question, the most convenient way to detect changes is CDC, which means that you only get the changed data rows. But this is not always an option, and sometimes it's also necessary to synchronise the DWH with the source system when changes are missing in the DWH, for example after structural changes in the source system. For small dimension tables, comparing the source table with the DWH table is not a big deal. But when we need to compare fact tables containing millions of rows, this can become a performance issue.
After some testing, I found the by far fastest approach to compare two tables here: On Injecting and Comparing
As we also need updated values, I added an analytical function to partition the result set by the key columns.


select Key1, Key2, Attribute1, Attribute2,
       case when cnt = 2 and tbl1 = 1
            then 'U'
            when cnt = 1 and tbl1 = 1
            then 'I'
            when cnt = 1 and tbl1 = 0
            then 'D'
            else 'O'
       end flag
  from (select Key1, Key2, Attribute1, Attribute2,
               min(count1) tbl1,
               count(*) over (partition by Key1, Key2) cnt
          from (select Key1, Key2, Attribute1, Attribute2,
                       count(tbl1) count1,
                       count(tbl2) count2
                  from (select Key1, Key2, Attribute1, Attribute2,
                               1 tbl1,
                               to_number(null) tbl2
                          from source_table
                        union all
                        select Key1, Key2, Attribute1, Attribute2,
                               to_number(null) tbl1,
                               2 tbl2
                          from DWH_Table
                       )
                 group by Key1, Key2, Attribute1, Attribute2
                having count(tbl1) != count(tbl2)
               )
         group by Key1, Key2, Attribute1, Attribute2
       )

The performance of this query is incredible compared to other solutions.
The query returns inserted, deleted and updated rows (as well as the old values if you need them), which is the source for a usual ETL process that inserts the SCD2 data into the DWH (e.g. How to load a Slowly Changing Dimension Type 2 with one SQL Merge statement in Oracle).
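For illustration, here is a minimal sketch of how the flagged delta could feed an SCD2 load. It uses two plain statements instead of the single-MERGE approach from the linked post, and the names CHANGE_SET, VALID_FROM and VALID_TO are just placeholders for your own objects:

-- 1) Close the current version of updated or deleted rows
merge into DWH_Table t
using (select Key1, Key2 from change_set where flag in ('U', 'D')) s
   on (t.Key1 = s.Key1 and t.Key2 = s.Key2
       and t.valid_to = to_date('31.12.9999', 'dd.mm.yyyy'))
 when matched then
      update set t.valid_to = trunc(sysdate) - 1;

-- 2) Insert a new version for inserted and updated rows
insert into DWH_Table (Key1, Key2, Attribute1, Attribute2, valid_from, valid_to)
select Key1, Key2, Attribute1, Attribute2,
       trunc(sysdate),
       to_date('31.12.9999', 'dd.mm.yyyy')
  from change_set
 where flag in ('I', 'U');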
Please check out a more detailed explanation here: Change Detection

Data Model
At the beginning this point was the trickiest one. Building a DWH with surrogate keys etc., as we know it, gets really complicated here. To store the right surrogate key of a dimension value in the fact table, we actually have to cut the fact data into pieces every time a fact changes. When a fact table references plenty of dimensions, we end up after some time with a huge fact table, a complex ETL process and, due to the large fact table, probably poor performance. Each time we register a change in one of the referenced dimension tables, a new fact row with the new surrogate key of the dimension row must be inserted into our fact table. When you consider that dimensions can also reference other dimensions, implementing an efficient ETL process for this requirement is no fun at all.
Here I have to thank this blog entry: Slowly Changing Facts, which pointed me to the useful Kimball Design Tip #74. Even though this is not exactly what I was looking for, it's good to know that there is also a theoretical answer to the problem of slowly changing facts.
In the end we decided to use the business keys to reference the dimension values. Of course this gives you multiple corresponding rows in your join, so you have to ensure in your reporting tool that every table carries a filter condition on a date, which the user can specify in a prompt (or whatever it's called in your relational reporting tool). For example, in SAP Business Objects I used this expression:


WHERE TO_DATE(NVL(TRIM(@Prompt('Query Date','A',,MONO,FREE)),'31/12/9998 12:00:00'),'dd/mm/yyyy HH24:MI:SS') BETWEEN VALID_FROM AND VALID_TO


If the user doesn't enter a value or enters a space, the 'Query Date' is replaced by a date like '31/12/9998', which returns the current state of the data. If the user enters a date in the past, the query returns exactly the data from that point in time.
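To illustrate the business-key join, a minimal sketch of a report query could look like the following; FACT_SALES, DIM_CUSTOMER and their columns are hypothetical names and :query_date stands for the value coming from the prompt:

select f.order_no, d.customer_name, f.amount
  from fact_sales   f
  join dim_customer d
    on d.customer_bk = f.customer_bk                     -- join on the business key, not a surrogate key
 where :query_date between f.valid_from and f.valid_to   -- every table is filtered ...
   and :query_date between d.valid_from and d.valid_to   -- ... by the same query date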

Performance Optimization
Usually the customer wants to see the most current state of the report, which should show up in a few seconds, while viewing a previous report state can take several minutes.
This requirement could be fulfilled perfectly using query rewrite-enabled materialized views (in SQL Server terms: indexed views). For more information about query rewrite there are plenty of useful internet sources available (e.g. Query Rewrite).
To get a query rewrite-enabled materialized view, the query must return deterministic results. That's why we need a constant value to get the results for the most current date. As in most SCD2 implementations, a date in year '9999' in the 'valid_to' column represents the current data row.
We add the following filter condition to each table of the report query in our materialized view.

WHERE TO_DATE(NVL(TRIM(''),'31/12/9998 12:00:00'),'dd/mm/yyyy HH24:MI:SS') BETWEEN VALID_FROM AND VALID_TO

When the report user enters an empty string '' in the prompt dialog, the query gets automatically rewritten and the precalculated materialized view is used instead of the detail tables. Using this technique, the most current report version shows up instantly. Only when a previous report state is needed and the user enters a past date in the prompt dialog do the detail tables have to be queried, and the report might take a bit longer to show up.

Note that some prerequisites must be fulfilled to enable query rewrite for materialized views; the most important points are (a minimal example follows this list):

  • Create the materialized view with the 'ENABLE QUERY REWRITE' option
  • Set query_rewrite_integrity, e.g. alter system set query_rewrite_integrity=stale_tolerated scope=spfile;
  • Set query_rewrite_enabled, e.g. alter system set query_rewrite_enabled=force scope=spfile;
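As a minimal sketch (table and column names are again the hypothetical ones from above), such a materialized view could look roughly like this; the constant-date filter is exactly the expression the report generates when the prompt is left empty, which is what allows the optimizer to match and rewrite the query:

create materialized view mv_sales_current
  build immediate
  refresh complete on demand
  enable query rewrite
as
select f.order_no, d.customer_name, sum(f.amount) amount
  from fact_sales   f
  join dim_customer d
    on d.customer_bk = f.customer_bk
 where to_date(nvl(trim(''),'31/12/9998 12:00:00'),'dd/mm/yyyy HH24:MI:SS') between f.valid_from and f.valid_to
   and to_date(nvl(trim(''),'31/12/9998 12:00:00'),'dd/mm/yyyy HH24:MI:SS') between d.valid_from and d.valid_to
 group by f.order_no, d.customer_name;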

Friday, April 13, 2012

Overcome the define ranges-restriction when using MS Excel as a data source in Informatica PowerCenter

Informatica PowerCenter

When you use an MS Excel spreadsheet as a data source in Informatica PowerCenter, there are, to my mind, some really annoying restrictions. One of them is the fact that you must have defined ranges in your spreadsheet so that PowerCenter can identify them as relational sources. That means that if you frequently receive new MS Excel files, you have to define the ranges manually every time.

In my last project I took a detour to connect from PowerCenter to MS Excel spreadsheets and overcome that restriction. The name of the detour is MS Access.

Just create a new MS Access database, name it something like "Import_Excel" and go to the tab "External Data". Choose MS Excel as the data source; in the following window you can link an Excel table to the Access database. Note that you don't import the Excel data into your Access database; it's a link, so you always get the up-to-date Excel data in your MS Access database.



The good thing is that you can, but don't have to, use the defined ranges from the Excel spreadsheet. You can also use one of the sheets in your Excel file as a table.


Once you have created the link to the MS Excel file, you can read from the created table in your MS Access database just like from a usual database table. You could also define a query using SQL to preprocess the data in your MS Access database for Informatica PowerCenter, but that's up to you.
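Purely as an illustration (the linked table name 'Sheet1' and its columns are hypothetical), such a preprocessing query in MS Access could look like this:

SELECT Trim(CustomerName) AS CustomerName,
       CDate(OrderDate)   AS OrderDate,
       Amount
FROM   Sheet1
WHERE  Amount IS NOT NULL;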

Now you can create an ODBC connection to your MS Access database, which you can use as a data source in Informatica PowerCenter.


Another nice side effect is that you need just one single ODBC connection for multiple MS Excel files.

Again, let me point you to another blog with a detailed explanation of how to connect to an ODBC source from Informatica PowerCenter: http://www.clearpeaks.com/blog/etl/ms-excel-spreadsheets-as-a-data-source-in-informatica-powercenter

And don't forget to set the "Default buffer block size" to a smaller value like "8" instead of "Auto" if you get a "terminated unexpectedly" error.

Thursday, April 5, 2012

Informatica PowerCenter "terminated unexpectedly" when using ODBC

Informatica PowerCenter

There are several resources available on how to connect to an MS Excel spreadsheet or MS Access database using ODBC (e.g. http://www.clearpeaks.com/blog/etl/ms-excel-spreadsheets-as-a-data-source-in-informatica-powercenter).
All I got was a not very helpful "terminated unexpectedly" error message in the Workflow Monitor, and I couldn't find the cause. Finally I found out that a little session property solved my problem.





I set the "Default buffer block size" to a value of 8 (for example) instead of "Auto" and my workflow terminated successfully.

Saturday, February 18, 2012

Engine-based ETL tool vs. code-generating ETL tool vs. stored procedures

In my DWH projects I have used several approaches to implement ETL processes. I think it also makes sense to distinguish engine-based ETL tools from code-generating ETL tools, a distinction I am missing in some discussions.
I want to describe my opinion based on several criteria. I don't claim that this entry is a complete comparison of ETL approaches; I am really interested in your experiences and opinions, so you are very welcome to add a comment.

Usability
ETL tools (engine-based as well as code-generating) provide nice GUIs and claim that this makes development easier. My personal experience is that it's true, they do have fancy GUIs, but that doesn't necessarily mean it's easier to maintain. There are far more SQL coders out there than people with experience in a specific ETL tool. I personally find an error much faster in PL/SQL code than by debugging mappings of ETL tools. But I really think it depends on the ETL process: the GUI gives you a nice overview when you have complex transformations, and it can also serve as documentation of the ETL process.

Development efficiency
In some projects I had to implement the same ETL logic for different source tables (e.g. SCD2). When you have hundreds of tables where you have to do exactly the same stuff, I find it more efficient to write a PL/SQL code generator, also in terms of maintenance. In software engineering it's very bad practice to duplicate code; what you do there is write a function or a sub program, so your logic is stored in just one place. I don't understand why modern ETL tools don't provide such an approach. I know that you can use reusable components or templates, but when you have a different table, with different columns, keys, data types, etc., I don't know how to implement a generic solution with ETL tools, or am I wrong? This is only possible with PL/SQL code generators, or you write your own code generator for your ETL tool, but that is much more difficult. A small sketch of what I mean by a code generator follows below.
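Just to illustrate the idea (this is a sketch, not a production generator), the following procedure reads the column list of a table from the Oracle data dictionary and builds the same simple load statement for any table you pass in; the stg_/src_ naming convention is made up:

create or replace procedure gen_stage_load (p_table in varchar2) as
  l_cols varchar2(32767);
  l_sql  varchar2(32767);
begin
  -- derive the column list from the data dictionary
  select listagg(column_name, ', ') within group (order by column_id)
    into l_cols
    from user_tab_columns
   where table_name = upper(p_table);

  -- build the same load statement for whatever table was passed in
  l_sql := 'insert into stg_' || lower(p_table) || ' (' || l_cols || ') '
        || 'select ' || l_cols || ' from src_' || lower(p_table);

  dbms_output.put_line(l_sql);   -- or: execute immediate l_sql;
end gen_stage_load;
/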

Flexibility
Most operations of modern ETL tools, like lookups, aggregations, branching inserts, conditions, expressions, string & date manipulations, are built into modern databases as well, and they are even faster there. So in my opinion I prefer code-generating ETL tools or stored procedures here.
But flexibility also means database flexibility. This is a clear advantage of ETL tools, as they provide connectivity to almost all common source systems.

Performance
For me there is no doubt that an ELT approach will always outperform an ETL solution, as you have so many performance optimisation techniques in your database and you don't have to transfer your data to your ETL server and back to your DB. I really don't understand the advantage of engine-based ETL tools like Informatica etc.; if someone can explain it to me, you are welcome! The only thing you can't do with ELT is implement an ETL process database-independently, but this is the only point I can think of for using an engine-based ETL tool.

Logging
Monitoring, logging etc. is easier with ETL tools (engine-based and code-generating), as they provide a standardized way to monitor your processes. When writing your own procedures, you have to implement all of that on your own (see the small sketch below for the usual pattern).
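A minimal sketch of such a hand-rolled logging mechanism could look like this (the ETL_LOG table and the procedure are of course made-up names); the autonomous transaction makes sure the log entry survives a rollback of a failing load:

create table etl_log (
  log_time  timestamp default systimestamp,
  process   varchar2(100),
  status    varchar2(10),
  message   varchar2(4000)
);

create or replace procedure log_etl (p_process in varchar2,
                                     p_status  in varchar2,
                                     p_message in varchar2) as
  pragma autonomous_transaction;  -- keep the log entry even if the load rolls back
begin
  insert into etl_log (process, status, message)
  values (p_process, p_status, p_message);
  commit;
end log_etl;
/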

Conclusion
Of course it always depends on the specific requirements, so I can only speak from my own experiences in different projects. As I already mentioned, I really don't get why one should buy a really expensive engine-based ETL tool when all that functionality is already built into the database, where it is even faster. I like some code-generating ETL tools because they can support the development process by providing frequently used ETL logic, they give you a good overview of your workflows and they provide mechanisms like monitoring and logging. To my mind, in some cases it can also be a good solution to write your own ETL procedures, especially to automate the development process or to combine that approach with code-generating ETL tools.

I am really looking forward to your opinions on ETL development!

Saturday, December 31, 2011

Change Detection

An efficient way to compare data of two database tables 

Oracle Database, SQL Server Database

No question, the most convenient way to detect changes is CDC (Change Data Capture), which means that you only get the changed data rows. In Oracle this can be implemented using Oracle CDC, GoldenGate, triggers or materialized view logs; SQL Server also offers CDC functionality.
But due to company rules this is not always an option, and sometimes it's also necessary to synchronise the DWH with the source system when changes are missing in the DWH, for example after errors in the ETL process or if CDC is disabled for some reason, e.g. after structural changes in the source system. For small dimension tables, comparing the source table with the DWH table is not a big deal. But when we need to compare large tables containing millions of rows, this can become a performance issue.

After some testing, I found the by far fastest approach to compare two tables here: On Injecting and Comparing
As we also need updated values, I added an analytic function to partition the result set by the key columns. 
The structure of the result is similar to what CDC would give you. You get the inserted rows (I), deleted rows (D) and updated rows (U) with the new values, and also the old values (O) if desired.

SELECT Key1, Key2, Attribute1, Attribute2, Attribute3,
       CASE WHEN cnt = 2 AND tbl1 = 1
            THEN 'U'
            WHEN cnt = 1 AND tbl1 = 1
            THEN 'I'
            WHEN cnt = 1 AND tbl1 = 0
            THEN 'D'
            ELSE 'O'
       END dmltype
  FROM (SELECT Key1, Key2, Attribute1, Attribute2, Attribute3,
               MIN(count1) tbl1,
               COUNT(*) OVER (PARTITION BY Key1, Key2) cnt
          FROM (SELECT Key1, Key2, Attribute1, Attribute2, Attribute3,
                       COUNT(tbl1) count1,
                       COUNT(tbl2) count2
                  FROM (SELECT Key1, Key2, Attribute1, Attribute2, Attribute3,
                               1 tbl1,
                               TO_NUMBER(NULL) tbl2
                          FROM source_table
                        UNION ALL
                        SELECT Key1, Key2, Attribute1, Attribute2, Attribute3,
                               TO_NUMBER(NULL) tbl1,
                               2 tbl2
                          FROM DWH_Table
                       )
                 GROUP BY Key1, Key2, Attribute1, Attribute2, Attribute3
                HAVING COUNT(tbl1) != COUNT(tbl2)
               )
         GROUP BY Key1, Key2, Attribute1, Attribute2, Attribute3
       )

The performance of this query is incredible compared to other solutions, like using an outer join operation. Thank you Tom Kyte!
Please note that you must cast the source and target columns to the same datatype before you can perform the union. If you are comparing tables from databases with different character sets, you also have to convert text columns into the same character set, for example using the Oracle CONVERT function (e.g. convert('text','WE8EBCDIC500', 'WE8ISO8859P1')).
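For example, a minimal sketch of the aligned union branches (column lengths and types are placeholders) could look like this:

SELECT Key1,
       CAST(Attribute1 AS VARCHAR2(100)) Attribute1,   -- align the datatype with the DWH column
       CAST(Attribute2 AS NUMBER(18,2))  Attribute2,
       1 tbl1,
       TO_NUMBER(NULL) tbl2
  FROM source_table
UNION ALL
SELECT Key1,
       Attribute1,
       Attribute2,
       TO_NUMBER(NULL) tbl1,
       2 tbl2
  FROM DWH_Table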

The query returns inserted, deleted and updated rows, which can be the source for a usual ETL process that inserts the SCD (1, 2, 3) data into the DWH.

Saturday, December 24, 2011

QaaWS and web tier architecture

Connection problems using QaaWS in a web tier architecture

SAP Business Objects XI 3.1

In a recent project it took me some days to get QaaWS working in a BO web tier architecture. I didn't find enough information on the internet, so I want to share my experiences here. The idea was to publish a Query as a Web Service on the web server, which gets its data from a database connection stored on the BO server. A public Xcelsius flash object can then access the data via the web service.

Our web tier architecture consists of
  • 1 BO Server (CMS, Repository)
  • 1 Web Server (Tomcat 5.5) - exposed to the internet
In retrospect there were three main points I had to consider to finally get Xcelsius working via QaaWS in a web tier architecture:

Assign ports
To get the connection between the BO Server and the Web Server working through a firewall, you have to manually assign port numbers to the 'Request Port' in your CMC for all necessary BO services. Uncheck the 'Auto assign' box and enter a free port number. If you don't do that, the services will pick different random port numbers. For example, enter 6401 as your CMS request port, and don't forget to assign port numbers to the other servers as well if you need them (Webi Processing Server, etc.).


After that you can open those ports on your firewall. The Web Server should now be able to communicate with the BO Server.

Adjust dsws.properties
If your QaaWS still doesn't provide any data, you probably have to add your CMS server name to the dsws.properties configuration file. To my mind that's a bug, because I already entered the name of the CMS server during the web tier installation and that connection works, but it does not when using QaaWS. Dswsbobje looks for the CMS on the Web Server and of course can't find it. So open your dsws.properties file and just add/change the line

domain=Server:port (e.g. domain=BOXI31:6400)

Restart your Web Server. Now also QaaWS (the dswsbobje application) knows where your CMS is.

Add crossdomain.xml
Another prerequisite when accessing data externally with Flash is that you need a crossdomain.xml file in your webapps root directory (error codes #2170 or #2048). You can find lots of information about this on the internet. Just add a text file with the name 'crossdomain.xml' to the ROOT folder of your web server (e.g. C:\Program Files (x86)\Business Objects\Tomcat55\webapps\ROOT) with the following content:

<?xml version="1.0"?>
<!DOCTYPE cross-domain-policy SYSTEM "http://www.macromedia.com/xml/dtds/cross-domain-policy.dtd">
<cross-domain-policy>
  <allow-http-request-headers-from domain="*" headers="*" secure="false" />
  <allow-access-from domain="*" secure="false" />
</cross-domain-policy>

Restart your Web Server. Try to open your flash object again.

Friday, December 23, 2011

Wait event asynch descriptor resize

Oracle Create Materialized View hangs and waiting for asynch descriptor resize - event 

Oracle Database 11g Release 2 x64 (11.2.0.1.0), Windows Server 2008 R2 64bit

Recently I wanted to create a materialized view for a usual query, which took about 1 minute to execute, while the creation of the materialized view didn't finish for hours until I killed the process.

Monitoring the session showed me the wait event 'asynch descriptor resize'. The following article explains the event in great detail and suggests contacting Oracle Support if it causes bigger problems: Asynch descriptor resize wait event in Oracle. Upgrading the database could also solve the issue.

As those solutions weren't possible for me, I started to trace the session and saw lots of sequential reads:

"WAIT #6: nam='db file sequential read' ela= 4384 file#=5 block#=2092840 blocks=1 obj#=77661 tim=7422676905762"

After hours of tracing and googling, I tried to simplify my query step by step and came to the conclusion that certain 'special' SQL constructs cause the materialized view creation to never finish.

The causing SQL part was a subquery that returns the latest record of a table (my_table is just a placeholder name):

(select x.a, x.b, x.c, x.day
   from my_table x
  where x.day = (select max(y.day) from my_table y where x.a = y.a))

I could rewrite the query using Oracle analytic functions (Analytic functions by Example) and voilà, my materialized view was created in about 1 minute.
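A sketch of such a rewrite (keeping the placeholder names from above) replaces the correlated subquery with an analytic MAX, which returns the same rows including ties on the latest day:

select a, b, c, day
  from (select x.a, x.b, x.c, x.day,
               max(x.day) over (partition by x.a) max_day   -- latest day per value of a
          from my_table x)
 where day = max_day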

The same problem occurred in a query with 'NOT IN (' followed by a subquery. In that case, too, there are several ways to rewrite and probably tune the query (NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: Oracle).


Another thing to try could be setting the SGA values manually to fixed values instead of using AMM and memory_target.

Sorry that I don't have an explanation for why those query constructs cause the materialized view creation to hang. Obviously it's a bug and probably fixed in later releases. Anyway, if you have the same issue, I hope my post helps you to identify the cause.