Sunday, December 30, 2007
Data Analysis Using SQL and Excel
I am familiar with the normal distribution and a few other simple statistical techniques in theory, yet I struggle with concepts such as regression.
When I started reading books about data mining, this was the main reason I dropped the subject. When I saw this book, I felt the way I did when I first saw the MDX Solutions book: it raised my hopes of overcoming the difficulty of data mining. I took a quick look at it, and it seems very well organized.
I decided to write an overview of this book to help my friend Pedro and anyone else deciding whether to buy it, but keep in mind that this post is based on a quick skim, not a complete reading.
Chapter 1: A Data Miner Looks at SQL
The author introduces the dataflow concept in this chapter.
Chapter 2: What’s In a Table? Getting Started with Data Exploration
This chapter explains how you can explore SQL results with Excel charts. I discovered another point of view just by skimming it.
Chapter 3: How Different Is Different?
The basic concepts of statistics, and how statistics, SQL, and Excel work together, are explained in this chapter.
Chapter 4: Where Is It All Happening? Location, Location, Location
Geography, and the processing that can be done on it with SQL and Excel, plays a primary role in discovering knowledge. You can see this great work by reading this chapter.
Chapter 5: It’s a Matter of Time
This chapter does the same as the previous one, but for time.
Chapter 6: How Long Will Customers Last? Survival Analysis to Understand Customers and Their Value
Nothing sums it up better than the sentence the author gives at the start: “Survival analysis estimates how long it takes for a particular event to happen. A customer starts; when will that customer stop? By assuming that the future will be similar to the past (the homogeneity assumption), the wealth of data about historical customer behavior can help us understand what will happen and when.”
Chapter 7: Factors Affecting Survival: The What and Why of Customer Tenure
“This chapter builds on this foundation, by introducing three extensions of basic survival analysis. These extensions solve some common problems faced when applying survival analysis in the real world. They also make it possible to understand the effects of other factors besides tenure on survival.”
Chapter 8: Customer Purchases and Other Repeated Events
This chapter discusses everything about customer behavior (when, where, and how) with one notable exception: what customers purchase.
Chapter 9: What’s in a Shopping Cart? Market Basket Analysis and Association Rules
“This chapter dives into the detail, looking at the specific products being purchased, to learn both about the customers and the products they are buying. Market basket analysis is the general name for understanding product purchase patterns at the customer level.”
Chapter 10: Data Mining Models in SQL
“This chapter takes an alternative approach that introduces data mining concepts using databases. This perspective presents the important concepts, sidestepping the rigor of theoretical statistics to focus instead on the most important practical aspect: data.”
Chapter 11: The Best-Fit Line: Linear Regression Models
Chapter 12: Building Customer Signatures for Further Analysis
This chapter focuses on data preparation.
Tuesday, December 25, 2007
Data Mining
I would love to know more about data mining, but I have never found a simple book that discusses data mining and statistics in a practical way.
Yesterday, I noticed that the library had bought "Data Analysis Using SQL and Excel", which covers data mining with SQL and Excel in a practical way.
I decided to read it in order to learn the fundamental concepts and become ready for other books.
I hope it will be as useful as the author promises in the preface.
Friday, December 21, 2007
Process Analysis Services Objects through SSIS
There are two methods for populating data into SSAS, which are used mostly for non-standard data sources:
- Using Dimension Processing Data Flow Destination
- Using Partition Processing Data Flow Destination
You can map your source columns to the dimension or partition, and set the processing method, which can be Add (incremental), Full, or Update.
Saturday, November 17, 2007
Analysis Services Processing Task
This task makes SSAS process a database, cube, partition, measure group, dimension, mining structure, or mining model from within SSIS.
First, in the connection manager combo box, you specify the connection that points to the Analysis Services database whose objects must be processed.
Then you add the objects that you want to process.
Next, you set their process options based on the action you want to happen; only certain options are available for each type of object.
When you set the process option to Process Incremental, you also have to configure its settings: the measure group, the partition, and the source data, which can be a table or a query.
You can change the processing order of this batch via the Change Settings button. The Change Settings dialog has two tabs: Processing options and Dimension key errors.
In the first tab you can set general options such as the processing order and transaction type (Sequential or Parallel), the writeback table option, and whether to process related objects.
Finally, in the second tab you can set the action to take when a dimension key error is raised.
Monday, November 12, 2007
Processing Analysis Services Objects through SSIS
- Force SSAS to handle the processing, using SSIS control flow tasks:
- Analysis Services Processing Task: this task exposes the SSAS objects' processing options visually.
- Analysis Services Execute DDL Task: you can give this task a DDL script that performs your processing via XML/A.
- Process the objects through the SSIS data flow, using data flow destinations:
- Dimension Processing Data Flow Destination
- Partition Processing Data Flow Destination
The best situation for these destinations is when you have non-standard sources and need near real-time processing.
There are two other ways to handle these processes:
- Run your XML/A script via ASCMD.exe, using the Execute Process Task.
- Handle the processing with AMO; you must write a .NET program inside a Script Task.
I will describe each of these tasks separately later.
Tuesday, October 30, 2007
Handling Late-Arriving Facts
When you want to handle this kind of data, you first look up the surrogate key and the start date of the dimension row with a Lookup component, and then check which rows are late-arriving. Rows whose operation date is earlier than the start date of their dimension row (Dim.StartDate > Fact.OperationDate) are late-arriving facts, and their surrogate keys must be corrected to point to the appropriate dimension row. To find the correct dimension row, you send these facts to another Lookup component and fix the surrogate key. Finally, you union all the rows back together.
The only remaining tip is that you must correct the surrogate key lookup by modifying the SQL of the Lookup component in its Advanced tab; for example, for an order system the SQL would be set as follows:
SELECT *
FROM (SELECT * FROM [dbo].[DimProduct]) AS refTable
WHERE [refTable].[ProductAlternateKey] = ?
  AND [refTable].[StartDate] <= ? AND [refTable].[EndDate] > ?
The parameters must be mapped to ProductAlternateKey, OrderDateAlternateKey, and OrderDateAlternateKey (the order date is used for both date comparisons).
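If you prefer to correct the late-arriving rows with a set-based statement after the load instead of a second Lookup, the same logic can be sketched in T-SQL. This is only a minimal sketch; the FactOrders table and its ProductAlternateKey and OperationDate columns are hypothetical, and it assumes the fact table still carries the business key.

-- Re-point late-arriving fact rows at the dimension row whose
-- validity window contains the fact's operation date.
-- FactOrders and its columns are hypothetical.
UPDATE f
SET    f.ProductKey = d.ProductKey
FROM   dbo.FactOrders AS f
JOIN   dbo.DimProduct AS d
  ON   d.ProductAlternateKey = f.ProductAlternateKey
 AND   d.StartDate <= f.OperationDate
 AND   d.EndDate   >  f.OperationDate
WHERE  f.ProductKey <> d.ProductKey;  -- only rows that point at the wrong member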
Wednesday, October 17, 2007
Handling Missing Dimension Lookups
- Pre-data flow check: you insert the missing members as inferred members before adding the fact data to the fact table. The best situation for this method is when only one or two dimensions support inferred members. In this approach you have to process the fact data twice.
- Post-data flow process (usually the best solution): you insert all the fact data except the rows that need inferred members, which you store in a staging table; then you insert the inferred members into the dimension; finally, you run the fact load process again for the staged rows (see the sketch after this list).
- In-line data flow approach: you set the second Lookup's cache mode to “No Cache” to force it to read the key from the database instead of from a stale cache. Before inserting, you must check that the inferred member has not already been added, because some fact rows with that key may already exist. For each inferred member you must first check for the record, then add it to the dimension, and finally get its surrogate key, so this approach is only suitable when inferred members make up a very small minority of the source records.
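As a rough illustration of the post-data flow process, here is a minimal T-SQL sketch; the StageFactSales, DimCustomer, and FactSales tables and their columns are hypothetical.

-- 1. Add inferred members for business keys that are not yet in the dimension.
INSERT INTO dbo.DimCustomer (CustomerAlternateKey, InferredMemberFlag)
SELECT DISTINCT s.CustomerAlternateKey, 1
FROM   dbo.StageFactSales AS s
WHERE  NOT EXISTS (SELECT 1
                   FROM   dbo.DimCustomer AS d
                   WHERE  d.CustomerAlternateKey = s.CustomerAlternateKey);

-- 2. Re-run the fact load for the staged rows, now that every key resolves.
INSERT INTO dbo.FactSales (CustomerKey, OrderDateKey, SalesAmount)
SELECT d.CustomerKey, s.OrderDateKey, s.SalesAmount
FROM   dbo.StageFactSales AS s
JOIN   dbo.DimCustomer    AS d
  ON   d.CustomerAlternateKey = s.CustomerAlternateKey;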
Sunday, October 14, 2007
Identifying changed fact records
If your source rows contain a "created date" and a "modified date", you do not need to match fact rows against source rows to identify changed fact records; otherwise, you have to match them using one of these two methods (in both of them you must match on the business key and the time key):
- Database joins between the source data and the fact table: the source data and fact data must be on the same server to perform the join efficiently (see the sketch after this list).
- Data correlation within the data flow (using a Lookup to identify changed fact records): you cannot identify deleted rows with this method.
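To make the database-join method concrete, here is a minimal sketch; the SourceSales and FactSales tables, with OrderNumber as the business key and OrderDateKey as the time key, are hypothetical.

-- Rows whose measures differ between the source and the fact table
-- are candidates for an update.
SELECT s.OrderNumber, s.OrderDateKey, s.SalesAmount
FROM   dbo.SourceSales AS s
JOIN   dbo.FactSales   AS f
  ON   f.OrderNumber  = s.OrderNumber
 AND   f.OrderDateKey = s.OrderDateKey
WHERE  f.SalesAmount <> s.SalesAmount;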
Don’t forget about historical dimension changes, inferred members, and late-arriving facts; you must handle those situations if your business has them.
If you update or delete rows in a table at the same time that you are inserting rows into it, you may cause locking issues. To prevent this, turn off table locking in the data flow destination. If the problem remains, turn off fast load.
Saturday, October 6, 2007
Index Optimizations
Be cautious about using this method: in most situations it is not a good idea to put the surrogate key aside and index the business keys instead, because joins between fact tables and dimension tables are based on surrogate keys.
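As a reminder of what the usual approach looks like, here is a minimal sketch that indexes the surrogate key columns of a hypothetical FactSales table, since those are the columns used in joins to the dimensions.

-- Index the surrogate keys that join the fact table to its dimensions.
CREATE NONCLUSTERED INDEX IX_FactSales_ProductKey  ON dbo.FactSales (ProductKey);
CREATE NONCLUSTERED INDEX IX_FactSales_CustomerKey ON dbo.FactSales (CustomerKey);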
Wednesday, September 26, 2007
Late-arriving dimension scenario (inferred members)
If you want to insert that data into the fact table, you must add a row to the dimension and flag it as an “inferred member”; later, when the real dimension data arrives, you update that inferred member. Regardless of whether the attributes are Type 1 or Type 2, you just update the original record in place and set the “Inferred Member” flag to no after updating it. In other words, an inferred member turns all the attributes into Type 1 changes until the member's details come in from the source.
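A minimal T-SQL sketch of this pattern, using a hypothetical DimCustomer table and values:

-- 1. Create the placeholder row when the fact arrives before the dimension data.
INSERT INTO dbo.DimCustomer (CustomerAlternateKey, CustomerName, InferredMemberFlag)
VALUES ('C-1042', 'Unknown', 1);

-- 2. When the real member details arrive, update the row in place (Type 1)
--    and clear the flag, regardless of the attributes' normal SCD types.
UPDATE dbo.DimCustomer
SET    CustomerName       = 'Contoso Ltd.',
       InferredMemberFlag = 0
WHERE  CustomerAlternateKey = 'C-1042'
  AND  InferredMemberFlag   = 1;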
Slowly Changing Dimension (SCD)
There are three common SCD types:
- Type 0 (fixed): when a dimension attribute is a fixed attribute, it cannot be changed.
- Type 1 (changing): this technique, also known as restating history, means that changes to the attribute simply overwrite the old value.
- Type 2 (historical): this type, known as a historical attribute, is used for attributes that change over time when it is important to keep the changes and reflect them in reports (see the sketch after this list).
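A minimal T-SQL sketch of a Type 2 change, assuming a hypothetical DimProduct with StartDate/EndDate validity columns and an identity surrogate key:

-- Expire the current row...
UPDATE dbo.DimProduct
SET    EndDate = '20070926'
WHERE  ProductAlternateKey = 'P-100'
  AND  EndDate = '99991231';

-- ...and insert a new row, which gets a new surrogate key, for the changed attributes.
INSERT INTO dbo.DimProduct (ProductAlternateKey, ProductName, StartDate, EndDate)
VALUES ('P-100', 'Widget (new packaging)', '20070926', '99991231');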
If you have a historical dimension attribute that changes a lot over a specific period and those changes are not important, you can switch the attribute to Type 1 while you apply the changes, then switch it back after it stabilizes.
You can find more information on Wikipedia.
Monday, September 24, 2007
Surrogate Key
One of the most important things in dimensional modeling is the surrogate key, but what are its advantages?
- It can improve query performance, because surrogate keys are numeric and take less space than business keys, which are usually strings; this also keeps the fact table narrow.
- It makes it possible to implement Slowly Changing Dimension Type 2, because it allows the dimension table to hold several rows with the same business key, each with a different surrogate key.
- It keeps the design normalized and makes reporting simpler.
- It is easier to manage than a multi-column business key, because when a dimension uses a multi-column business key as its primary key, you are forced to add multiple columns to the fact tables, and managing that situation is difficult.
You can use the [dimension name]AlternateKey naming convention for business keys; it helps keep the structure consistent.
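A minimal sketch of this pattern with a hypothetical DimProduct and FactSales, showing the surrogate key, the AlternateKey business key, and the fact table joining on the surrogate key:

CREATE TABLE dbo.DimProduct (
    ProductKey          int IDENTITY(1,1) PRIMARY KEY,  -- surrogate key
    ProductAlternateKey nvarchar(25) NOT NULL,          -- business key
    ProductName         nvarchar(50) NOT NULL,
    StartDate           datetime NOT NULL,
    EndDate             datetime NOT NULL
);

CREATE TABLE dbo.FactSales (
    ProductKey   int NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    OrderDateKey int NOT NULL,
    SalesAmount  money NOT NULL
);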
Wednesday, September 5, 2007
Data Staging
[2] Ralph Kimball, Margy Ross, The Data Warehouse Toolkit, Second Edition, The Complete Guide to Dimensional Modeling, 2002
FastParse property
It reduces the conversion overhead of each column as it is imported into the data flow buffers, and overall gives generous performance benefits.
In essence, you are telling the connection that the data can be trusted in the source column to be in a standard format.
Saturday, September 1, 2007
Using .net code in SSIS
To do that, you must follow these steps:
- Generate a strong name key that will be used to sign your .dll file. This can be done with the sn.exe utility that ships with the .NET Framework, or through the Visual Studio 2005 project properties (example: sn -k "C:\MyKeyFile.snk").
- Reference the key file generated by the sn utility in the class library's project settings (Signing tab).
- Set the build configuration in the class library's project settings (Build tab) to Active (Release).
- Use the Global Assembly Cache tool (gacutil.exe), found in {Drive}:\Program Files\Microsoft.NET\SDK\v2.0\Bin, to add the assembly to the Global Assembly Cache (GAC): gacutil.exe /i "Path". The assembly must also be placed in {Drive}:\WINDOWS\Microsoft.NET\Framework\v2.0.50727.
SSIS requires these custom objects to be placed in the Global Assembly Cache (GAC) on the machine that executes the package, to maximize the performance of the pipeline and tasks.
Wednesday, August 15, 2007
Drill Through
Monday, August 13, 2007
Dashboard Applications and KPIs
KPI stands for Key Performance Indicator; it shows the state of the business as defined by a formula.
KPIs are shown as graphical icons and make it easier to recognize where the business stands.
Microsoft introduced a new product called Microsoft Office Business Scorecard Manager 2005 that can leverage the KPI capabilities of Analysis Services 2005.
Each KPI is evaluated in five parts: Value, Goal, Status, Trend, and Weight. You can assign an MDX expression to each part.
For creating new KPI you must specify the following information:
1. KPI name (mandatory)
2. Associated measure group (mandatory): the measure group your indicator belongs to.
3. Value expression (mandatory): the value that you want to compare against something else.
4. Goal expression: the value that you expect to reach.
5. Status: the status of your business, which can be shown with one of the graphics in Figure 1.
6. Trend: shows the direction your business is moving, using the graphical icons in Figure 2.
7. Additional properties: in this section you can set where your KPI will be displayed. Moreover, you can create parent-child relationships between KPIs and assign a weight to each child KPI for calculating the parent.
Saturday, August 11, 2007
Kimball method vs. Inmon method
There are two approaches to build a data warehouse:
1. Kimball approach
In this approach, you store the data in dimensional models; this gives you a data warehouse that is faster, understandable to users, and resilient to change. Also, you can use either OLTP or OLAP databases as your storage.
2. Inmon approach
In this approach, you store the data in third normal form; this helps with handling high-volume databases, but the result is not as fast, user-understandable, or resilient to change. Moreover, you cannot use an OLAP database as your ODS storage.
The Business Intelligence Lifecycle
1. Focus on the business (Inception)
In this phase, you gather the requirements, identify the business areas involved in your project scope, and analyze them.
2. Build an information infrastructure (Elaboration)
In this phase, you design an approach for gathering all the information into a single, integrated, easy-to-understand, and well-structured foundation that covers all the requirements you gathered before.
3. Deliver in meaningful increments (Construction)
In this phase, you build the data warehouse in incremental steps and deliver them on a meaningful schedule.
4. Deliver the entire solution (Transition)
In this phase, you deliver everything you have built, such as reports, ad hoc queries, websites, and documents.