Wednesday, September 26, 2007
Late-arriving dimension scenario (inferred members)
Inferred members are dimension members that have not yet arrived from the source at the time the related fact data is ready to be inserted into the fact table.
If you want to insert that fact data anyway, you must add a placeholder row to the dimension and flag it as an “inferred member”; when the appropriate dimension data arrives later, you update that inferred member. Keep in mind that, regardless of whether the attributes are Type 1 or Type 2, you simply update the original record in place and set the “Inferred Member” flag to no afterwards. In other words, an inferred member turns all attributes into Type 1 changes until the member’s details come in from the source.
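A minimal T-SQL sketch of the idea, assuming a hypothetical DimCustomer dimension with a CustomerAlternateKey business key and an IsInferredMember flag (in SSIS, the Slowly Changing Dimension wizard can handle inferred members for you):

-- A fact row arrives for business key 'C-1001' before that customer exists in the dimension.
-- Step 1: add a placeholder row so the fact row can be loaded.
INSERT INTO dbo.DimCustomer (CustomerAlternateKey, CustomerName, IsInferredMember)
VALUES ('C-1001', 'Unknown', 1);

-- Step 2: when the real member data arrives from the source, overwrite the placeholder
-- in place (Type 1 behaviour, even for Type 2 attributes) and clear the flag.
UPDATE dbo.DimCustomer
SET    CustomerName     = 'Contoso Ltd.',
       IsInferredMember = 0
WHERE  CustomerAlternateKey = 'C-1001'
  AND  IsInferredMember     = 1;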
Slowly Changing Dimension (SCD)
There are three common SCD types:
- Type 0 (fixed): the attribute is fixed and cannot be changed.
- Type 1 (changing): also known as restating history; the new value simply overwrites the old one, so no history is kept.
- Type 2 (historical): also known as a historical attribute; used for attributes that change over time when it is important to keep the changes and reflect them in reports.
If you have a historical (Type 2) attribute that changes a lot over a specific period and those changes are not important, you can temporarily switch the attribute to Type 1, apply the changes, and switch it back after it has stabilized.
You can find more information on Wikipedia.
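As a minimal sketch of the Type 2 technique (table and column names such as DimCustomer, RowStartDate, RowEndDate, and IsCurrent are hypothetical), a change in a customer’s City could be handled by expiring the current row and inserting a new one with a new surrogate key; a Type 1 change would simply be an in-place UPDATE of the column:

-- Expire the current row for the member whose attribute changed.
UPDATE dbo.DimCustomer
SET    RowEndDate = GETDATE(),
       IsCurrent  = 0
WHERE  CustomerAlternateKey = 'C-1001'
  AND  IsCurrent = 1;

-- Insert a new row for the same business key; the IDENTITY column assigns a new surrogate key.
INSERT INTO dbo.DimCustomer (CustomerAlternateKey, CustomerName, City, RowStartDate, RowEndDate, IsCurrent)
VALUES ('C-1001', 'Contoso Ltd.', 'Seattle', GETDATE(), NULL, 1);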
Monday, September 24, 2007
Surrogate Key
One of the most important things in dimensional modeling is the surrogate key, but what are its advantages?
- It can improve query performance, because surrogate keys are numeric and take less space than business keys, which are usually strings; this also helps keep the fact table narrow.
- It makes it possible to implement Slowly Changing Dimension Type 2, because it allows the same business key to appear in the dimension table several times, each row with a different surrogate key.
- It keeps the design uniform and makes reporting simpler.
- It is easier to manage than a multi-column business key: if a dimension used a multi-column business key as its primary key, you would be forced to add several columns to each fact table, and managing that situation would be difficult.
You can use the name [dimension name]AlternateKey for business keys; it helps keep the structure consistent.
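As a minimal sketch of this, assuming hypothetical DimCustomer and FactSales tables, the dimension gets a narrow IDENTITY surrogate key as its primary key, keeps the business key as CustomerAlternateKey, and the fact table references only the surrogate key:

CREATE TABLE dbo.DimCustomer
(
    CustomerKey          int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    CustomerAlternateKey nvarchar(20)      NOT NULL,             -- business key from the source system
    CustomerName         nvarchar(100)     NOT NULL,
    City                 nvarchar(50)      NULL
    -- no unique constraint on the business key, so Type 2 can keep several rows per member
);

CREATE TABLE dbo.FactSales
(
    CustomerKey int   NOT NULL REFERENCES dbo.DimCustomer (CustomerKey), -- narrow numeric FK instead of the string business key
    SalesAmount money NOT NULL
);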
Wednesday, September 5, 2007
Data Staging
Data staging is a term used among data warehouse developers. It describes the strategy a business intelligence team takes for handling data in the ETL process; in other words, how the data should be treated as it moves through ETL. There are different kinds of staging strategies, such as ‘Staging Data from Multiple Sources’, ‘Staggered Staging’, ‘Persisted Staging’, ‘Accumulated Staging’, ‘Chunked Accumulated Staging’, and other destination considerations [1].
Staging Data from Multiple Sources
In this kind of data staging, the data is extracted into a staging database using an ETL process. Then, once it has been cleaned, it is loaded into the data warehouse using another ETL process. It is a common approach and is widely used when a data mart has several different data sources.
Staggered Staging
This kind of staging is suitable when you have error handling in your ETL: if something goes wrong, you do not have to start the ETL process from scratch. Instead, you can continue from the last persisted point in the process.
Persisted Staging
In this kind of data staging, a copy of the staging data is persisted in a separate database. It is very important when you need to provide an auditing mechanism to satisfy certain requirements. It puts a heavy burden on the ETL team, but sometimes it is needed.
Accumulated Staging
This is the most advanced pattern among the data staging patterns. In this pattern, you must be able to recognize changes in the source data and handle them in the ETL process. We use a similar staging solution at the bank where I work; in fact, it is a combination of the previous pattern and this one, because we also persist the data in the staging database. In this form of data staging, we have triggers on our data sources (the DBMS is Oracle). However, it is not a good idea to have triggers on databases. I personally believe that if the application designers had anticipated such requirements in the database design, it would have been a much better solution than detecting these changes through triggers.
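The sources described above are Oracle, but just to illustrate the idea in T-SQL terms (all table and column names below are hypothetical), a trigger-based change capture might look roughly like this; changes are written to a log table that the ETL process later reads and empties:

-- Copy every change on the operational table into a change log for the ETL process.
CREATE TRIGGER trg_Customer_Capture
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- inserted holds new/changed rows; deleted holds old/removed rows.
    INSERT INTO staging.CustomerChangeLog (CustomerId, ChangeType, ChangeDate)
    SELECT CustomerId, 'I/U', GETDATE() FROM inserted
    UNION ALL
    SELECT d.CustomerId, 'D', GETDATE()
    FROM deleted AS d
    WHERE NOT EXISTS (SELECT 1 FROM inserted AS i WHERE i.CustomerId = d.CustomerId);
END;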
Chunked Accumulated Staging
In this sort of data staging, the data is loaded into the staging database several times during the day and then loaded into the data warehouse all at once. It is suitable, for example, for retail stores.
Other Destination Considerations
There are also other types of data staging patterns. For example, you might need to populate data into several data marts based on geographical distribution. In that case the process would look like this figure:
Last but not least, I would like to point out that Kimball does not advocate using a normalized relational data staging area: “The creation of both
normalized structures for staging and dimensional structures for presentation
means that the data is extracted, transformed, and loaded twice—once into the
normalized database and then again when we load the dimensional model.
Obviously, this two-step process requires more time and resources for the
development effort, more time for the periodic loading or updating of data, and
more capacity to store the multiple copies of the data. At the bottom line,
this typically translates into the need for larger development, ongoing
support, and hardware platform budgets. Unfortunately, some data warehouse
project teams have failed miserably because they focused all their energy and
resources on constructing the normalized structures rather than allocating time
to development of a presentation area that supports improved business decision
making.” [2]
However, in practice there are many needs that force BI teams to do exactly that. This is common when building data warehouses, and Kimball acknowledges it: “It is acceptable to create a normalized database to support the staging
processes; however, this is not the end goal. The normalized structures must be
off-limits to user queries because they defeat understandability and
performance. As soon as a database supports query and presentation services, it
must be considered part of the data warehouse presentation area. By default,
normalized databases are excluded from the presentation area, which should be strictly
dimensionally structured.” [2].
References:
[1] Turley P., Cameron S., Kasprzak J., Microsoft SQL Server 2005 Integration Services Step by Step, 2007
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition, The Complete Guide to Dimensional Modeling, 2002
FastParse property
The FastParse property of the flat file source is available for date, time, and integer source columns; it is set per column in the source’s Advanced Editor (Input and Output Properties, Output Columns).
It reduces the conversion overhead of each column as it is imported into the data flow buffers and can give substantial performance benefits overall.
In essence, you are telling the connection that the data in the source column can be trusted to be in a standard format.
Saturday, September 1, 2007
Using .net code in SSIS
To do that, follow these steps:
- Generate a strong-name key file for your .dll. This can be done with the sn.exe utility that ships with the .NET Framework, or through the properties of the Visual Studio 2005 project (example: sn -k “C:\MyKeyFile.snk”).
- Reference the key file generated by the sn utility in the class library’s project settings (Signing tab).
- Set the build configuration in the class library’s project settings (Build tab) to Active (Release).
- Use the Global Assembly Cache tool (gacutil.exe), found in {Drive}:\Program Files\Microsoft.NET\SDK\v2.0\Bin, to add the assembly to the Global Assembly Cache (GAC): gacutil.exe /i “Path”. The assembly must also be placed in {Drive}:\WINDOWS\Microsoft.NET\Framework\v2.0.50727.
SSIS requires these custom objects to be placed in the Global Assembly Cache (GAC) on the machine that executes the package, to maximize the performance of the pipeline and tasks.