Data staging is the term which is used among data warehouse
developers. It shows the strategy that business intelligence team takes for
dealing in ETL process. In fact, it shows how the data should be treated in the
ETL process. There are different kinds of staging strategies such as ‘Staging
Data from Multiple Sources’, ‘Staggered Staging’, ‘Persisted Staging’, ‘Accumulated
Staging’, ‘Chunked Accumulated Staging’ and Other Destination Considerations [1].
Staging Data from Multiple Sources
In this kind of data staging, the data is extracted into
staging database using an ETL process. Then, it will be loaded into
datawarehouse using another ETL process when it gets cleaned. It is a common
approach and is used widely when there are different data sources for a data mart.
Staggered Staging
This kind of staging is suitable when you have error
handling in your ETL. It means that, if something get wrong, you do not have to
start the ETL process from scratch. Instead, you can continue the last
persisted point in the ETL process.
Persisted Staging
In this kind of data staging, a copy of staging data is
persisted in a separated database. It is very important when you need to
provide some auditing mechanism according to some requirements. It has a huge
burden on ETL team, but sometimes it is needed.
Accumulated Staging
This is the most advanced pattern among data staging
patterns. In this pattern, you should have capability to recognize changes in
source data and handle them in ETL process. We have the same staging solution in
the bank that I work. In fact, it is a combination of previous pattern and
this pattern, because we also persisted data in the staging data. In this form
of data staging, we have triggers on our data sources (the DBMS are Oracle).
However, it is not a good idea to have triggers on databases. I personally
believe that if application designers predicted such requirements in databases,
it would be much better solution than recognizing these changes through
triggers.
Chunked Accumulated Staging
In this sort of data staging, the data is loaded into data
staging database several times during the day. Then, it will be loaded into
data warehouse at once. It is suitable for retail stores.
Other Destination Considerations
There are also other types of data staging patterns. For example,
it might be needed that you populate data into some data marts based on
geographical distribution. In this case the process would be like this figure:
Last but not least, I like to remind that Kimball does not
advocates using structured relational data staging area. “The creation of both
normalized structures for staging and dimensional structures for presentation
means that the data is extracted, transformed, and loaded twice—once into the
normalized database and then again when we load the dimensional model.
Obviously, this two-step process requires more time and resources for the
development effort, more time for the periodic loading or updating of data, and
more capacity to store the multiple copies of the data. At the bottom line,
this typically translates into the need for larger development, ongoing
support, and hardware platform budgets. Unfortunately, some data warehouse
project teams have failed miserably because they focused all their energy and
resources on constructing the normalized structures rather than allocating time
to development of a presentation area that supports improved business decision
making.” [2]
However, there are lot of needs in practice that force BI
teams to do that. This is common in creating data warehouses, and Kimball mentioned
that “It is acceptable to create a normalized database to support the staging
processes; however, this is not the end goal. The normalized structures must be
off-limits to user queries because they defeat understandability and
performance. As soon as a database supports query and presentation services, it
must be considered part of the data warehouse presentation area. By default,
normalized databases are excluded from the presentation area, which should be strictly
dimensionally structured.” [2].
References:
[1] Turley P.,
Cameron S., Kasprzak J, Microsoft
SQL Server 2005 Integration Services Step by Step, 2007
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition, The Complete Guide to Dimensional Modeling, 2002
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition, The Complete Guide to Dimensional Modeling, 2002
2 comments:
best..........
Excellent!! Well done!!! Much appreciate it!
Post a Comment