Sunday, December 30, 2007

Data Analysis Using SQL and Excel

I work as a systems analyst and designer, and also as a database designer. I have a good command of databases and SQL, and I have done some data visualization using Excel. But I never imagined how tightly they could fit together.

I am theoretically familiar with the normal distribution and some other simple statistical techniques, yet I have a big problem with statistical concepts like regression.

When I started to read some books about data mining, this was the main reason I dropped the subject. When I saw this book, I felt the way I did when I first saw the MDX Solutions book: it raised my hopes of overcoming the difficulty of data mining. I took a quick look at it, and it seems very well organized.

I decided to write an overview of this book to help my friend Pedro and anyone else deciding whether to buy it, but keep in mind that I am writing this post based on a quick skim, not a complete reading.

Chapter 1: A Data Miner Looks at SQL
The author introduces the dataflow concept in this chapter.

Chapter 2: What’s In a Table? Getting Started with Data Exploration
This chapter explains how you can explore SQL results with Excel charts. I discovered another point of view just by skimming it.

Chapter 3: How Different Is Different?
This chapter explains basic statistical concepts and how statistics, SQL, and Excel can be combined.
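I have not read the chapter in detail, but the classic tool for questions like "is a 20% response rate really different from 15%, or is it chance?" is the two-proportion z-test, which I assume is close to what the book computes. A sketch in plain Python (the numbers are invented):

```python
import math

def two_proportion_z(x1, n1, x2, n2):
    """z-score for the difference between two observed proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion under the null hypothesis
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Did 200 responders out of 1000 differ from 150 out of 1000 by chance?
z = two_proportion_z(200, 1000, 150, 1000)
# |z| > 1.96 means the difference is unlikely to be chance at the 95% level
```

The nice thing is that both the counts and the proportions can come straight out of a SQL GROUP BY, with Excel doing the arithmetic.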

Chapter 4: Where Is It All Happening? Location, Location, Location
Geography, and what can be done with it using SQL and Excel, plays a primary role in discovering knowledge. You can see this done very well in this chapter.

Chapter 5: It’s a Matter of Time
This chapter does the same as the previous one, but for time.

Chapter 6: How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Nothing sums this chapter up better than the author’s opening sentence: “Survival analysis estimates how long it takes for a particular event to happen. A customer starts; when will that customer stop? By assuming that the future will be similar to the past (the homogeneity assumption), the wealth of data about historical customer behavior can help us understand what will happen and when.”
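As a rough illustration of that idea (not the book's own code), the hazard at each tenure is the share of at-risk customers who stop there, and survival is the running product of (1 - hazard). The data below is invented:

```python
# (tenure, stopped): stopped=True means the customer churned at that tenure,
# False means the customer is still active (censored) at that tenure
customers = [(1, True), (2, True), (2, False), (3, True), (3, False), (3, False)]

survival, s = [], 1.0
for t in range(1, max(tt for tt, _ in customers) + 1):
    at_risk = sum(1 for tt, _ in customers if tt >= t)      # still around at tenure t
    stops = sum(1 for tt, stopped in customers if tt == t and stopped)
    s *= 1 - stops / at_risk                                # cumulative survival
    survival.append((t, s))
# survival now holds the estimated probability of lasting past each tenure
```

Both counts are simple SQL aggregates, which is presumably why the technique fits the book so well.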

Chapter 7: Factors Affecting Survival: The What and Why of Customer Tenure
“This chapter builds on this foundation, by introducing three extensions of basic survival analysis. These extensions solve some common problems faced when applying survival analysis in the real world. They also make it possible to understand the effects of other factors besides tenure on survival.”

Chapter 8: Customer Purchases and Other Repeated Events
This chapter discusses when, where, and how customers behave, with one notable exception: what customers purchase.

Chapter 9: What’s in a Shopping Cart? Market Basket Analysis and Association Rules
“This chapter dives into the detail, looking at the specific products being purchased, to learn both about the customers and the products they are buying. Market basket analysis is the general name for understanding product purchase patterns at the customer level.”

Chapter 10: Data Mining Models in SQL
“This chapter takes an alternative approach that introduces data mining concepts using databases. This perspective presents the important concepts, sidestepping the rigor of theoretical statistics to focus instead on the most important practical aspect: data.”

Chapter 11: The Best-Fit Line: Linear Regression Models

Chapter 12: Building Customer Signatures for Further Analysis
This chapter focuses on data preparation.

Tuesday, December 25, 2007

Data Mining

I would love to know more about data mining, but I have never found a simple book that discusses data mining and statistics practically.

Yesterday, I found out that the library had bought "Data Analysis Using SQL and Excel", which describes data mining with SQL and Excel in a practical way.

I decided to read it in order to learn the fundamental concepts and become ready to read other books.

I hope it will be useful as the author mentioned in the preface.

Friday, December 21, 2007

Process Analysis Services Objects through SSIS

There are two methods for populating data into SSAS; they are mostly used for non-standard data sources:

  1. Using Dimension Processing Data Flow Destination
  2. Using Partition Processing Data Flow Destination

You can map your source data to the dimension or partition and set the update method, which can be Add, Full, or Update.

Saturday, November 17, 2007

Analysis Services Processing Task

This task forces SSAS to process a Database, Cube, Partition, Measure Group, Dimension, Mining Structure, or Mining Model through SSIS.
First, in the connection manager combo box, you must specify the connection that determines which objects can be processed.

Then you must add the objects that you want to be processed.

Next you must set their process options based on the action that you want to happen; note that only some actions are available for each type of object.

When you set the process option to Process Incremental, you will have to configure its settings. The settings include the Measure Group, the Partition, and the Source Data, which can be a table or a query.

You can change the processing order of this batch via the Change Settings button. The Change Settings form has two tabs: the Processing options tab and the Dimension key errors tab.
In the first tab you can set some general settings, such as the transaction type (Sequential or Parallel), the writeback table, and whether to process related objects.

Finally, in the second tab, you can set the action that you want to occur when an error is raised.

Monday, November 12, 2007

Processing Analysis Services Objects through SSIS

There are two ways to process analysis services objects:
  1. Force SSAS to handle the processing via SSIS
    You must use these Control Flow tasks for this kind of operation:
    • Analysis Services Processing Task
      This task lets you configure the SSAS processing options visually.
    • Analysis Services Execute DDL Task
      You can give this task a DDL script to perform your processing via XML/A.

  2. Process those objects through SSIS data flows
    The best situation for these destinations is when you have non-standard sources and you need near-real-time processing.
    You must use these Data Flow destinations for this kind of operation:
    • Dimension Processing Data Flow Destination
    • Partition Processing Data Flow Destination

There are two other ways to handle these processes:

  1. Run your XML/A script via ASCMD.exe
    You can do this by using the Execute Process Task.
  2. Handle the processing with AMO.
    You must write a .NET program in a Script Task.

I will describe each of these tasks separately, later.
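As a rough sketch, the XML/A command that the Execute DDL Task (or ASCMD.exe) would run for a full process looks roughly like this; the database and cube IDs below are placeholders, not real objects:

```xml
<Process xmlns="http://schemas.microsoft.com/analysisservices/2003/engine">
  <Object>
    <DatabaseID>MyOlapDatabase</DatabaseID>
    <CubeID>MyCube</CubeID>
  </Object>
  <Type>ProcessFull</Type>
</Process>
```

The same script can be pasted into the Execute DDL Task's SourceDirect property or saved to a file for ASCMD.exe.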

Tuesday, October 30, 2007

Handling Late-Arriving Facts

When you want to handle this kind of data, you must first retrieve the surrogate key and the start date of the dimension row with a Lookup component, and then check which rows are the late-arriving ones. The rows whose operation dates are earlier than their dimension row’s start date (Dim.StartDate > Fact.OperationDate) are late-arriving facts, and their surrogate keys must be corrected to point to the appropriate historical dimension row. To find the correct dimension row, send these facts to a second Lookup component and correct the surrogate key. Finally, you must union all the rows back together.
The only remaining tip is that you must correct the surrogate-key lookup by modifying the SQL of the Lookup component in its Advanced tab. For example, for an order system the SQL must be set as follows:

Select *
From (Select * from [dbo].[DimProduct]) as refTable
where [refTable].[ProductAlternateKey] = ?
and [refTable].[StartDate] <= ? and [refTable].[EndDate] > ?

The three parameters must be mapped to ProductAlternateKey, OrderDateAlternateKey, and OrderDateAlternateKey (the order date is used for both the start-date and end-date comparisons).
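The same historical lookup can be sketched in Python with SQLite standing in for the warehouse; the table contents and keys below are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimProduct (ProductKey INTEGER, ProductAlternateKey TEXT,
                         StartDate TEXT, EndDate TEXT);
INSERT INTO DimProduct VALUES
  (1, 'P1', '2007-01-01', '2007-06-01'),
  (2, 'P1', '2007-06-01', '9999-12-31');
""")

def historical_key(business_key, operation_date):
    # Pick the dimension row whose validity range covers the fact's operation date
    row = con.execute(
        "SELECT ProductKey FROM DimProduct "
        "WHERE ProductAlternateKey = ? AND StartDate <= ? AND EndDate > ?",
        (business_key, operation_date, operation_date)).fetchone()
    return row[0]

late_fact_key = historical_key('P1', '2007-03-15')  # a late-arriving March fact
```

The late-arriving fact resolves to the expired historical row, not the current one, which is exactly the correction the second Lookup performs.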

Wednesday, October 17, 2007

Handling Missing Dimension Lookups

There are three methods for handling inferred members:

  • Pre-data flow check: you must insert these members as inferred members before adding the fact data to the fact table. The best situation for this method is when only one or two dimensions support inferred members. In this approach you have to pass over the fact data twice.

  • Post-data flow process (best solution): you must insert all fact data except the rows that need inferred members, storing those rows in a staging table; then you must insert the inferred members; finally, you should run the fact load process for the staging table.

  • In-line data flow approach: you must set the second Lookup component’s cache mode to “No Cache” to force it to get the key from the database instead of from a stale cache. Before inserting an inferred member, you must check that it has not already been added, because some fact rows with that key may already exist. For each inferred member you must first check for the record, then add it to the dimension, and finally get its surrogate key; therefore this approach is only suitable when inferred members make up a very small minority of the source records.
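The check-insert-lookup cycle of the in-line approach can be sketched with SQLite standing in for the warehouse; the table and column names here are invented, not from any real schema:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DimCustomer (
    CustomerKey INTEGER PRIMARY KEY AUTOINCREMENT,
    CustomerAlternateKey TEXT UNIQUE,
    Name TEXT,
    InferredMemberFlag INTEGER)""")

def lookup_or_infer(business_key):
    """Return the surrogate key; create an inferred member if the lookup misses."""
    row = con.execute(
        "SELECT CustomerKey FROM DimCustomer WHERE CustomerAlternateKey = ?",
        (business_key,)).fetchone()
    if row:                      # check first: the member may already exist
        return row[0]
    cur = con.execute(
        "INSERT INTO DimCustomer (CustomerAlternateKey, Name, InferredMemberFlag) "
        "VALUES (?, 'Unknown', 1)", (business_key,))
    return cur.lastrowid         # surrogate key of the new inferred member

k1 = lookup_or_infer("C042")
k2 = lookup_or_infer("C042")   # a second fact row reuses the same inferred member
```

The per-row round trip to the database is what makes this approach slow, hence the advice to use it only for a small minority of rows.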

Sunday, October 14, 2007

Identifying changed fact records

If your data source rows contain "created date" and "modified date" columns, you won’t need to match the fact rows with the source rows to identify changed fact records; otherwise, you have to match them using one of these two methods (in both, you must match on the business key and the time key):

  1. Database joins between the source data and the fact table: the source data and the fact data must be on the same server for the join to perform efficiently.
  2. Data correlation within the data flow (using a Lookup to identify changed fact records): you cannot identify deleted rows with this method.
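The join method can be sketched with SQLite standing in for both servers; the schema and values below are invented:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE FactSales (ProductBK TEXT, DateKey INTEGER, Amount REAL);
CREATE TABLE SourceSales (ProductBK TEXT, DateKey INTEGER, Amount REAL);
INSERT INTO FactSales VALUES ('P1', 20071001, 100.0), ('P2', 20071001, 50.0);
INSERT INTO SourceSales VALUES ('P1', 20071001, 120.0), ('P2', 20071001, 50.0);
""")

# Match on business key + time key; keep only rows whose measures differ
changed = con.execute("""
    SELECT s.ProductBK, s.DateKey, s.Amount
    FROM SourceSales s
    JOIN FactSales f ON f.ProductBK = s.ProductBK AND f.DateKey = s.DateKey
    WHERE f.Amount <> s.Amount
""").fetchall()
```

An inner join like this finds changed rows; switching to a left join from the fact side would also surface deletions, which the Lookup method cannot.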

Don’t forget about historical dimension changes, inferred members, and late-arriving facts; you must handle those situations if your business requires them.

If you update or delete rows in a table at the same time that you are inserting rows into it, you may cause locking issues. To prevent this, turn off table locking in the data flow destination. If the problem remains, turn off fast load.

Saturday, October 6, 2007

Index Optimizations

Dimension lookups and updates are row-by-row operations; therefore you must create appropriate indexes to get the best performance. One tip you can follow is creating the dimension table’s clustered index on the business key instead of the surrogate key, and if you need to do better in Type 2 dimensions, you can include the current-indicator flag or the end date as the second column. Keep in mind that index optimization must be balanced between ETL and query usage. Alternatively, you can put the data into a staging table and later use an inner join between it and the dimension to update the rows that exist in the staging table; finally, you must truncate the staging table.
Be cautious about using this method, because in most situations it is not a good idea to push the surrogate key aside and index the business keys: joins between fact tables and dimension tables are based on surrogate keys.
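The staging-table update described above can be sketched with SQLite (table and column names are invented for illustration):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY,
                          CustomerAlternateKey TEXT, City TEXT);
CREATE INDEX IX_DimCustomer_BK ON DimCustomer (CustomerAlternateKey);
CREATE TABLE StageCustomer (CustomerAlternateKey TEXT, City TEXT);
INSERT INTO DimCustomer VALUES (1, 'C001', 'Tehran'), (2, 'C002', 'Paris');
INSERT INTO StageCustomer VALUES ('C002', 'London');
""")

# Set-based update: join the stage table to the dimension on the indexed business key
con.execute("""
UPDATE DimCustomer
SET City = (SELECT s.City FROM StageCustomer s
            WHERE s.CustomerAlternateKey = DimCustomer.CustomerAlternateKey)
WHERE CustomerAlternateKey IN (SELECT CustomerAlternateKey FROM StageCustomer)
""")
con.execute("DELETE FROM StageCustomer")  # truncate the stage table afterwards
```

One set-based join replaces thousands of row-by-row updates, which is the whole point of staging.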

Wednesday, September 26, 2007

Late-arriving dimension scenario (inferred members)

Inferred members are dimension members that are not yet available when the fact data related to them is ready to be inserted into the fact table.
If you want to insert that data into the fact table, you must add a row to the dimension and flag it as an “Inferred Member”; then, when the appropriate dimension data arrives, you can update that inferred member. Keep in mind that, regardless of whether the dimension is Type 1 or Type 2, you must simply update the original record and change the “Inferred Member Flag” to no after updating it. An inferred member turns all the attributes into Type 1 changes until the dimension member’s details come in from the source.

Slowly Changing Dimension (SCD)

There are three common SCD types:

  • Type 0 (fixed): the attribute is fixed and cannot be changed.
  • Type 1 (changing): this technique, also known as restating history, means that a change simply replaces the attribute’s old value.
  • Type 2 (historical): this type, known as a historical attribute, is used for attributes that change over time when it is important to keep the changes and reflect them in reports.

If you have a historical dimension that changes a lot over a specific period, and the changes are not important, you can change the attribute’s type to Type 1, make your changes, and then change the type back after it stabilizes.
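A Type 2 change boils down to "expire the current row, insert a new one with a fresh surrogate key". A minimal sketch with SQLite, using invented table and column names:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE DimProduct (
    ProductKey INTEGER PRIMARY KEY AUTOINCREMENT,
    ProductAlternateKey TEXT, Category TEXT,
    StartDate TEXT, EndDate TEXT)""")
con.execute("INSERT INTO DimProduct (ProductAlternateKey, Category, StartDate, EndDate) "
            "VALUES ('P1', 'Bikes', '2007-01-01', '9999-12-31')")

def scd2_change(business_key, new_category, change_date):
    """Expire the current row, then insert a new row with a new surrogate key."""
    con.execute("UPDATE DimProduct SET EndDate = ? "
                "WHERE ProductAlternateKey = ? AND EndDate = '9999-12-31'",
                (change_date, business_key))
    con.execute("INSERT INTO DimProduct (ProductAlternateKey, Category, StartDate, EndDate) "
                "VALUES (?, ?, ?, '9999-12-31')", (business_key, new_category, change_date))

scd2_change('P1', 'Accessories', '2007-10-01')
rows = con.execute("SELECT Category, EndDate FROM DimProduct ORDER BY ProductKey").fetchall()
```

Old facts keep pointing at the expired row, new facts get the new surrogate key, and history is preserved.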

You can find more information on Wikipedia.

Monday, September 24, 2007

Surrogate Key

One of the most important things in dimensional modeling is the surrogate key, but what are its advantages?

  1. It can improve query performance, because surrogate keys are numeric values and take less space than business keys, which are usually strings. This also keeps the fact table narrow.
  2. It makes it possible to implement Slowly Changing Dimension Type 2, because it allows multiple rows with the same business key in the dimension table, each with a different surrogate key.
  3. It keeps the design normalized and makes reporting simpler.
  4. It is easier to manage than a multi-column business key: when a dimension has a multi-column business key as its primary key, you are forced to add multiple fields to the fact tables, and managing such a situation is difficult.

You can use the name [dimension name]AlternateKey for business keys. It helps you keep the naming consistent.

Wednesday, September 5, 2007

Data Staging

Data staging is a term used among data warehouse developers. It describes the strategy that a business intelligence team takes in the ETL process; in other words, it describes how the data should be treated during ETL. There are different kinds of staging strategies, such as ‘Staging Data from Multiple Sources’, ‘Staggered Staging’, ‘Persisted Staging’, ‘Accumulated Staging’, ‘Chunked Accumulated Staging’, and other destination considerations [1].

Staging Data from Multiple Sources
In this kind of data staging, the data is extracted into a staging database by one ETL process; once it has been cleaned, it is loaded into the data warehouse by another ETL process. It is a common approach and is widely used when there are several different data sources for a data mart.

Staggered Staging
This kind of staging is suitable when you have error handling in your ETL. If something goes wrong, you do not have to start the ETL process from scratch; instead, you can continue from the last persisted point in the ETL process.

Persisted Staging
In this kind of data staging, a copy of the staging data is persisted in a separate database. This is very important when you need to provide an auditing mechanism to satisfy certain requirements. It places a heavy burden on the ETL team, but sometimes it is needed.

Accumulated Staging
This is the most advanced of the data staging patterns. In this pattern, you must be able to recognize changes in the source data and handle them in the ETL process. We have the same staging solution in the bank where I work; in fact, it is a combination of the previous pattern and this one, because we also persist data in the staging database. In our case, we have triggers on our data sources (the DBMS is Oracle). However, it is not a good idea to have triggers on databases; I personally believe that if application designers had anticipated such requirements in their databases, that would have been a much better solution than recognizing these changes through triggers.

Chunked Accumulated Staging
In this sort of data staging, the data is loaded into the staging database several times during the day and then loaded into the data warehouse all at once. It is suitable for retail stores.

Other Destination Considerations
There are also other types of data staging patterns. For example, you might need to populate data into some data marts based on geographical distribution. In that case the process would be as shown in this figure:

Kimball Opinion
Last but not least, I would like to point out that Kimball does not advocate using a normalized relational data staging area. “The creation of both normalized structures for staging and dimensional structures for presentation means that the data is extracted, transformed, and loaded twice—once into the normalized database and then again when we load the dimensional model. Obviously, this two-step process requires more time and resources for the development effort, more time for the periodic loading or updating of data, and more capacity to store the multiple copies of the data. At the bottom line, this typically translates into the need for larger development, ongoing support, and hardware platform budgets. Unfortunately, some data warehouse project teams have failed miserably because they focused all their energy and resources on constructing the normalized structures rather than allocating time to development of a presentation area that supports improved business decision making.” [2]

However, in practice there are many needs that force BI teams to do that. This is common in creating data warehouses, and Kimball mentioned that “It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured.” [2]

[1] Turley P., Cameron S., Kasprzak J., Microsoft SQL Server 2005 Integration Services Step by Step, 2007
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition: The Complete Guide to Dimensional Modeling, 2002

FastParse property

The FastParse property of the flat file source is available for date, time, and integer source columns.
It reduces the conversion overhead of each column as it is imported into the data flow buffers, and overall gives generous performance benefits.
In essence, you are telling the connection that the data can be trusted in the source column to be in a standard format.

Saturday, September 1, 2007

Using .net code in SSIS

To do that, you must follow these steps:

  1. Generate a strong name that will be used for your .dll file. This can be done with the sn.exe utility that ships with the .NET Framework, or through the Visual Studio 2005 project's property pages (for example: sn -k "C:\MyKeyFile.snk").
  2. Use the key file generated by the sn utility in the class library's properties (Signing tab).
  3. Set the class library's build configuration (Build tab) to Active (Release).
  4. Use the Global Assembly Cache tool (gacutil.exe), found in {Drive}:\Program Files\Microsoft.NET\SDK\v2.0\Bin, to add the assembly to the Global Assembly Cache (GAC): gacutil.exe /i "Path"
    The assembly must also be placed in {Drive}:\WINDOWS\Microsoft.NET\Framework\v2.0.50727.

SSIS requires these custom objects to be placed in the Global Assembly Cache (GAC) on the machine that executes the package, to maximize the performance of the pipeline and tasks.

Wednesday, August 15, 2007

Drill Through

When you deploy your BI project, it is very important for users to be able to see the detail rows behind the cells they are analyzing. This kind of action is called drillthrough. It retrieves the first 1000 rows by default, but you can change this behavior by modifying the DefaultDrillthroughMaxRows property in the msmdsrv.ini file. This file is located, by default, in the <drive>:\Program Files\Microsoft SQL Server\<instance>\OLAP\Config folder.

Monday, August 13, 2007

Dashboard Applications and KPIs

Dashboard applications include KPIs, which help business users notice and understand the current state of the business.
KPI stands for Key Performance Indicator; a KPI shows the state of the business as defined by a formula.
KPIs are shown as graphical icons, which makes it easier to recognize the current position.
Microsoft introduced new software called Microsoft Office Business Scorecard Manager 2005, which can leverage the KPI capabilities of Analysis Services 2005.
Each KPI can be evaluated through five properties: Value, Goal, Status, Trend, and Weight. You can assign an MDX expression to each one.

For creating new KPI you must specify the following information:
1. KPI name (mandatory)
2. Associated measure group (mandatory): the measure group that your indicator belongs to.
3. Value expression (mandatory): the value that you want to compare with something else.
4. Goal expression: the value that you expect to reach.
5. Status: the state of your business, shown as one of the icons in Figure 1.
6. Trend: the direction in which your business is moving, shown with the graphical icons in Figure 2.
7. Additional properties: in this section you can categorize where your KPI will be displayed. Moreover, you can create parent-child relationships between KPIs and assign a weight to each child KPI for calculating the parent.

Figure 1

Figure 2

Saturday, August 11, 2007

Kimball method vs. Inmon method

As you know, a data warehouse is a platform for business intelligence.
There are two approaches to build a data warehouse:

1. Kimball approach
In this approach, you store the data in dimensional models, which gives you a data warehouse that is faster, more understandable to users, and more resilient to change. You can use either OLTP or OLAP databases as your storage.

2. Inmon approach
In this approach, you store the data in third normal form, which helps you handle high-volume databases, but the result is not as fast, user-understandable, or resilient to change. Moreover, you cannot use an OLAP database as your ODS storage.

The Business Intelligence Lifecycle

This lifecycle is an iterative approach and follows four primary principles:

1. Focus on the business (Inception)
In this phase, you must gather the requirements, identify the business areas involved in your project scope, and analyze them.

2. Build an information infrastructure (Elaboration)
In this phase, you should design an approach for gathering all the information into a single, integrated, easy-to-understand, and well-structured foundation that covers all the requirements you gathered before.

3. Deliver in meaningful increments (Construction)
In this phase, you build the data warehouse in incremental steps and deliver them on a meaningful schedule.

4. Deliver the entire solution (Transition)
In this phase, you must deliver everything that you have built before, such as reports, ad hoc queries, websites, and documents.