Wednesday, September 26, 2007

Late-arriving dimension scenario (inferred members)

Inferred members are dimension members whose details have not yet arrived from the source when the fact data that references them is ready to be inserted into the fact table.
If you want to insert that fact data, you must first add a placeholder row to the dimension and flag it as an “Inferred member”; then, when the appropriate dimension data arrives, you update that inferred member. Bear in mind that, regardless of whether the attributes are Type 1 or Type 2, you simply update the original record in place and set the inferred-member flag to “No” after updating it. In other words, an inferred member turns all attributes into Type 1 (changing) until the member’s details come in from the source.
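The load-time logic can be sketched as follows. This is a minimal in-memory Python sketch, not code from any particular ETL tool; the dictionary layout and function names are illustrative assumptions:

```python
# Minimal sketch of inferred-member handling during a fact load.
# The dimension is modeled as a dict keyed by business key; in a real
# warehouse these would be tables and the lookups would be SQL.

def lookup_or_infer(dimension, business_key, next_key):
    """Return the surrogate key for business_key, creating an
    inferred-member placeholder row if the member is not present yet."""
    if business_key not in dimension:
        dimension[business_key] = {
            "surrogate_key": next_key(),
            "name": "Unknown",          # placeholder attribute values
            "inferred_member": True,    # flag for the later update
        }
    return dimension[business_key]["surrogate_key"]

def update_inferred_member(dimension, business_key, attributes):
    """When the real dimension data arrives, update the original row
    in place (Type 1 behaviour) and clear the inferred-member flag."""
    row = dimension[business_key]
    if row["inferred_member"]:
        row.update(attributes)
        row["inferred_member"] = False

# Usage: a fact row arrives before its dimension member does.
dim_customer = {}
counter = iter(range(1, 1000))
sk = lookup_or_infer(dim_customer, "CUST-42", lambda: next(counter))
update_inferred_member(dim_customer, "CUST-42", {"name": "Acme Ltd"})
```

Later fact rows for the same business key reuse the surrogate key that was assigned when the placeholder was created.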

Slowly Changing Dimension (SCD)

There are three common SCD types:

  • Type 0 (fixed): the attribute is fixed and cannot be changed.
  • Type 1 (changing): also known as restating history; a new value simply overwrites the old one, so no history is kept.
  • Type 2 (historical): also known as a historical attribute; used for attributes that change over time when it is important to keep those changes and reflect them in reports. Each change adds a new row to the dimension.

If you have a historical (Type 2) attribute that changes a lot over a specific period, and those changes are not important, you can switch the attribute to Type 1, apply the changes, and switch it back once the attribute has stabilized.
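A Type 2 change can be sketched like this. This is an in-memory Python sketch; the column names (`is_current`, `start_date`, `end_date`) are common conventions, not prescribed by any tool:

```python
from datetime import date

def apply_scd2_change(dim_rows, business_key, new_attrs, change_date, next_key):
    """Expire the current row for business_key and insert a new row,
    preserving history (SCD Type 2)."""
    for row in dim_rows:
        if row["business_key"] == business_key and row["is_current"]:
            row["is_current"] = False
            row["end_date"] = change_date
    dim_rows.append({
        "surrogate_key": next_key(),
        "business_key": business_key,
        **new_attrs,
        "start_date": change_date,
        "end_date": None,
        "is_current": True,
    })

# Usage: the customer moves city, and we keep both versions.
dim = [{"surrogate_key": 1, "business_key": "CUST-42", "city": "Tehran",
        "start_date": date(2007, 1, 1), "end_date": None, "is_current": True}]
apply_scd2_change(dim, "CUST-42", {"city": "Shiraz"}, date(2007, 9, 26),
                  lambda: len(dim) + 1)
```

After the call, the dimension holds two rows for the same business key: the expired Tehran row and a current Shiraz row.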

You can find more information on Wikipedia.

Monday, September 24, 2007

Surrogate Key

One of the most important things in dimensional modeling is the surrogate key, but what are its advantages?

  1. It can improve query performance, because surrogate keys are numeric and take less space than business keys, which are usually strings. This also keeps the fact table rows narrow.
  2. It makes it possible to implement Slowly Changing Dimension Type 2, because it allows multiple rows with the same business key in the dimension table, each with a different surrogate key.
  3. It keeps the design consistent and makes reporting simpler.
  4. It is easier to manage than a multi-column business key. If a dimension used a multi-column business key as its primary key, you would be forced to add multiple columns to every related fact table, and managing that situation would be difficult.

You can use the [dimension name]AlternateKey naming convention for business keys. It helps you keep the structure consistent.
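The Type 2 point can be illustrated concretely: with a numeric surrogate key as the primary key, the same business (alternate) key can appear on several dimension rows. A minimal Python sketch, with names like `CustomerAlternateKey` following the naming convention above:

```python
# Two versions of the same customer share one business key but have
# distinct surrogate keys, which is what makes SCD Type 2 possible.
dim_customer = [
    {"CustomerKey": 1, "CustomerAlternateKey": "CUST-42", "City": "Tehran"},
    {"CustomerKey": 2, "CustomerAlternateKey": "CUST-42", "City": "Shiraz"},
]

surrogate_keys = {row["CustomerKey"] for row in dim_customer}
business_keys = {row["CustomerAlternateKey"] for row in dim_customer}
```

The surrogate key is unique per row, while the business key repeats across historical versions; fact rows point at the surrogate key, so each fact lands on the correct version of the member.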

Wednesday, September 5, 2007

Data Staging

Data staging is a term used among data warehouse developers. It describes the strategy a business intelligence team takes for the ETL process; in fact, it defines how the data should be treated during ETL. There are different kinds of staging strategies, such as ‘Staging Data from Multiple Sources’, ‘Staggered Staging’, ‘Persisted Staging’, ‘Accumulated Staging’, ‘Chunked Accumulated Staging’, and other destination considerations [1].

Staging Data from Multiple Sources
In this kind of data staging, the data is extracted into a staging database by one ETL process. Then, once it has been cleansed, it is loaded into the data warehouse by another ETL process. This is a common approach and is widely used when a data mart draws on several data sources.

Staggered Staging
This kind of staging is suitable when your ETL needs error handling and restartability. If something goes wrong, you do not have to start the ETL process from scratch; instead, you can continue from the last persisted point in the process.

Persisted Staging
In this kind of data staging, a copy of the staging data is persisted in a separate database. This is very important when requirements oblige you to provide an auditing mechanism. It places a heavy burden on the ETL team, but sometimes it is needed.

Accumulated Staging
This is the most advanced of the data staging patterns. In this pattern, you must be able to recognize changes in the source data and handle them in the ETL process. We use this kind of staging solution at the bank where I work; in fact, ours is a combination of this pattern and the previous one, because we also persist data in the staging database. In our implementation, we rely on triggers on our data sources (the DBMS is Oracle). However, putting triggers on source databases is not a good idea. I personally believe that if the application designers had anticipated this requirement, capturing changes inside the applications themselves would have been a much better solution than recognizing them through triggers.
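Where triggers are unavailable or undesirable, snapshot comparison is one alternative way to recognize source changes. A simplified Python sketch, keyed by business key (real change-data-capture tooling is far more involved):

```python
def detect_changes(previous, current):
    """Compare two snapshots keyed by business key and classify each
    row as inserted, updated, or deleted."""
    inserts = {k: v for k, v in current.items() if k not in previous}
    deletes = {k: v for k, v in previous.items() if k not in current}
    updates = {k: v for k, v in current.items()
               if k in previous and previous[k] != v}
    return inserts, updates, deletes

# Usage: compare yesterday's and today's snapshots of a source table.
old = {"CUST-1": {"city": "Tehran"}, "CUST-2": {"city": "Shiraz"}}
new = {"CUST-1": {"city": "Isfahan"}, "CUST-3": {"city": "Tabriz"}}
ins, upd, dels = detect_changes(old, new)
```

The obvious cost is that both snapshots must be held and compared, which is why trigger-based or log-based capture is usually preferred for large sources.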

Chunked Accumulated Staging
In this sort of data staging, the data is loaded into the staging database several times during the day and then loaded into the data warehouse all at once. It is suitable for retail stores.

Other Destination Considerations
There are also other types of data staging patterns. For example, you might need to populate some data marts based on geographic distribution. In that case the process would look like this figure:

Kimball Opinion
Last but not least, I would like to point out that Kimball does not advocate using a normalized relational data staging area. “The creation of both normalized structures for staging and dimensional structures for presentation means that the data is extracted, transformed, and loaded twice—once into the normalized database and then again when we load the dimensional model. Obviously, this two-step process requires more time and resources for the development effort, more time for the periodic loading or updating of data, and more capacity to store the multiple copies of the data. At the bottom line, this typically translates into the need for larger development, ongoing support, and hardware platform budgets. Unfortunately, some data warehouse project teams have failed miserably because they focused all their energy and resources on constructing the normalized structures rather than allocating time to development of a presentation area that supports improved business decision making.” [2]

However, in practice there are many needs that force BI teams to do exactly that. It is common when creating data warehouses, and Kimball acknowledges it: “It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured.” [2]

[1] Turley P., Cameron S., Kasprzak J., Microsoft SQL Server 2005 Integration Services Step by Step, 2007
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition, The Complete Guide to Dimensional Modeling, 2002

FastParse property

The FastParse property of the flat file adapter is available for date, time, and integer source columns.
It removes the conversion overhead of each column as it is imported into the data flow buffers, and overall gives considerable performance benefits.
In essence, you are telling the connection that the data in the source column can be trusted to be in a standard format.
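The trade-off can be illustrated outside SSIS: a general-purpose, locale-aware parse does far more work per value than a fixed-format one that simply trusts the input layout. A Python sketch of the idea (not SSIS code):

```python
from datetime import date, datetime

def fast_parse_date(text):
    """Trusting a fixed YYYYMMDD layout: no locale logic, no format
    scanning, just slicing and integer conversion."""
    return date(int(text[:4]), int(text[4:6]), int(text[6:8]))

def locale_aware_parse(text):
    """General-purpose parse: scans a format string on every call."""
    return datetime.strptime(text, "%Y%m%d").date()
```

Both return the same result for well-formed input; the fast path simply skips the validation and format machinery, which is the same bargain FastParse makes.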

Saturday, September 1, 2007

Using .net code in SSIS

To do that, you must follow these steps:

  1. Generate a strong-name key that will be used to sign your .dll file. This can be done with the sn.exe utility that ships with the .NET Framework, or through the properties of the Visual Studio 2005 project (example: sn –k “C:\MyKeyFile.snk”).
  2. Select the key file generated by the sn utility in the class library’s project settings (Signing tab).
  3. Set the class library’s build configuration (Build tab) to Active (Release).
  4. Use the Global Assembly Cache tool (gacutil.exe), found in {Drive}:\Program Files\Microsoft.NET\SDK\v2.0\Bin, to add the assembly to the Global Assembly Cache (GAC): gacutil.exe /i “Path
    The assembly must also be placed in {Drive}:\WINDOWS\Microsoft.NET\Framework\v2.0.50727.

SSIS requires these custom objects to be placed in the Global Assembly Cache (GAC) on the machine executing the package in order to maximize the performance of the pipeline and tasks.