Wednesday, September 5, 2007

Data Staging


Data staging is the term which is used among data warehouse developers. It shows the strategy that business intelligence team takes for dealing in ETL process. In fact, it shows how the data should be treated in the ETL process. There are different kinds of staging strategies such as ‘Staging Data from Multiple Sources’, ‘Staggered Staging’, ‘Persisted Staging’, ‘Accumulated Staging’, ‘Chunked Accumulated Staging’ and Other Destination Considerations [1].

Staging Data from Multiple Sources
In this kind of data staging, the data is extracted into staging database using an ETL process. Then, it will be loaded into datawarehouse using another ETL process when it gets cleaned. It is a common approach and is used widely when there are different data sources for a data mart.

Staggered Staging
This kind of staging is suitable when you have error handling in your ETL. It means that, if something get wrong, you do not have to start the ETL process from scratch. Instead, you can continue the last persisted point in the ETL process.


Persisted Staging
In this kind of data staging, a copy of staging data is persisted in a separated database. It is very important when you need to provide some auditing mechanism according to some requirements. It has a huge burden on ETL team, but sometimes it is needed.


Accumulated Staging
This is the most advanced pattern among data staging patterns. In this pattern, you should have capability to recognize changes in source data and handle them in ETL process. We have the same staging solution in the bank that I work. In fact, it is a combination of previous pattern and this pattern, because we also persisted data in the staging data. In this form of data staging, we have triggers on our data sources (the DBMS are Oracle). However, it is not a good idea to have triggers on databases. I personally believe that if application designers predicted such requirements in databases, it would be much better solution than recognizing these changes through triggers.


Chunked Accumulated Staging
In this sort of data staging, the data is loaded into data staging database several times during the day. Then, it will be loaded into data warehouse at once. It is suitable for retail stores.


Other Destination Considerations
There are also other types of data staging patterns. For example, it might be needed that you populate data into some data marts based on geographical distribution. In this case the process would be like this figure:


Kimball Opinion
Last but not least, I like to remind that Kimball does not advocates using structured relational data staging area. “The creation of both normalized structures for staging and dimensional structures for presentation means that the data is extracted, transformed, and loaded twice—once into the normalized database and then again when we load the dimensional model. Obviously, this two-step process requires more time and resources for the development effort, more time for the periodic loading or updating of data, and more capacity to store the multiple copies of the data. At the bottom line, this typically translates into the need for larger development, ongoing support, and hardware platform budgets. Unfortunately, some data warehouse project teams have failed miserably because they focused all their energy and resources on constructing the normalized structures rather than allocating time to development of a presentation area that supports improved business decision making.” [2]

However, there are lot of needs in practice that force BI teams to do that. This is common in creating data warehouses, and Kimball mentioned that “It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured.” [2].


References:
[1] Turley P.,  Cameron S.,  Kasprzak J, Microsoft SQL Server 2005 Integration Services Step by Step, 2007
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition, The Complete Guide to Dimensional Modeling, 2002

2 comments:

Anonymous said...

best..........

Anonymous said...

Excellent!! Well done!!! Much appreciate it!