
Tuesday, March 23, 2010

Decision Tree

I had a course last quarter about how to use decision trees to make the right decisions. It was fascinating, and I want to reproduce one of the questions from the final exam to show the kind of problem a decision tree can help us solve. This quarter I am taking the second, advanced version of that course, Decision and Risk Analysis.

You are working as a medical advisor at a big hospital, and your department specializes in cardiology. Your manager has asked you to formally analyze the medical condition of atrial fibrillation. In the analysis you will look only at costs for the hospital and not consider other criteria. Atrial fibrillation is a common condition which carries a significant risk of stroke (80%) if left untreated. Treatment with the medicines warfarin or aspirin significantly reduces this risk, but there are side-effects to both treatments. For the average patient (at moderate risk), treatment with aspirin has a slightly smaller risk of side-effects than warfarin, 4.8% as opposed to 7.2%, but aspirin reduces the risk of stroke less effectively than warfarin. The serious side-effects are either a cerebrovascular accident (CVA) during the treatment or a haemorrhage. Of patients treated with aspirin and affected by side-effects, 64.4% suffer a CVA during their treatment, whereas this number is only 50% for patients treated with warfarin and affected by side-effects.

For the patients who do not suffer from immediate side-effects during the aspirin treatment, the risk of CVA is still 50%, whereas the risk of CVA is only 10% for patients treated with warfarin who do not suffer side-effects during the period of treatment. If an average patient (moderate risk) with atrial fibrillation gets a stroke (CVA), regardless of treatment or no treatment, he or she is classified as affected or unaffected. Out of all CVAs that occur, 25% are affected and 75% are unaffected. The transition cost (incurred just once) for the treatment of a patient with CVA is estimated at 76,600 SEK (unaffected patients), whereas the state cost (which remains for the patient's lifetime) for the treatment of a patient with CVA is estimated at 158,000 SEK (affected patients). The treatment of a haemorrhage is estimated at 32,000 SEK, and the cost of a medical treatment (either aspirin or warfarin) is 15,000 SEK.

What would be your recommendation to the hospital in the handling of moderate risk patients with atrial fibrillation and why?

But how can we solve it?

If you draw the decision tree, you will see that warfarin is the best medical treatment.
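
For completeness, here is a minimal sketch in Python of the expected-cost calculation behind that tree. It follows one plausible reading of the numbers above (side-effect patients suffer either a CVA or a haemorrhage, and a CVA costs the 25%/75% mix of the affected and unaffected costs); treat it as a sanity check rather than the official exam solution.

# Expected-cost comparison for the atrial fibrillation decision tree.
# Assumption: a CVA costs the 25%/75% mix of affected/unaffected costs.
P_AFFECTED = 0.25
COST_AFFECTED = 158_000     # SEK, state cost (lifetime)
COST_UNAFFECTED = 76_600    # SEK, transition cost (one-off)
COST_HAEMORRHAGE = 32_000   # SEK
COST_MEDICATION = 15_000    # SEK, either aspirin or warfarin

# Expected cost of one CVA, averaged over affected/unaffected outcomes.
CVA_COST = P_AFFECTED * COST_AFFECTED + (1 - P_AFFECTED) * COST_UNAFFECTED

def treatment_cost(p_side_effect, p_cva_if_side_effect, p_cva_otherwise):
    """Expected cost of one treatment branch of the decision tree."""
    side_effect = p_side_effect * (
        p_cva_if_side_effect * CVA_COST
        + (1 - p_cva_if_side_effect) * COST_HAEMORRHAGE)
    no_side_effect = (1 - p_side_effect) * p_cva_otherwise * CVA_COST
    return COST_MEDICATION + side_effect + no_side_effect

costs = {
    "no treatment": 0.80 * CVA_COST,               # 80% stroke risk
    "aspirin": treatment_cost(0.048, 0.644, 0.50),
    "warfarin": treatment_cost(0.072, 0.500, 0.10),
}
for option, cost in sorted(costs.items(), key=lambda kv: kv[1]):
    print(f"{option:12s} {cost:10,.0f} SEK")

Under this reading, warfarin comes out at roughly 28,600 SEK per patient, aspirin at roughly 64,700 SEK, and no treatment at roughly 77,600 SEK, which is consistent with the conclusion above.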

Wednesday, November 11, 2009

Predictable Changes with Multiple Version Overlays

There is a technique in Kimball's The Data Warehouse Toolkit (2nd edition) that is suggested for "predictable changes with multiple version overlays" situations. It was a bit hard for me to grasp what Kimball meant by that solution, but I understood the technique once I worked through an example of my own ;)

Imagine that we offer a variety of products in different categories. Each product can belong to only one category, and product categories are normally reassigned at the beginning of each year; in addition, the CEO may change a product's category at any time. Slowly Changing Dimension Type 2 (SCD2) looks fine for this example at first glance, but the business requirements below make it unsuitable.

Imagine we need to answer these questions:
  1. What are the sales of a category over a period of time? (This is not the case that rules out SCD2.)
  2. What are the sales, over the next two years, of the category selected in question 1, computed over the products that belonged to it at that time? (This is the real trouble.)

This may be a bit vague, so let me explain the second question. Imagine the sales of category 1 in 2002 are X. Analysts may want to know the sales of category 1 in another period of time, but counted over the products that belonged to it at the time of question 1. Since some products change category over time, how can we reflect this information?


This is the situation where we should follow what Kimball says: add a column to the product dimension table for the category of each year.



Product Dimension
------------------
ProductID
Category 2001
Category 2002
Category 2003

I hope I have explained the situation well. I am eager to hear what you think about this technique as well ;)

Saturday, March 29, 2008

Conformed Dimension

The term Enterprise Data Warehouse sounds very nice at first glance, yet when businesses estimate its cost, they usually prefer to build a few important data marts that drive their most important business goals, in order to break up the enormous cost. However, these data marts can become a huge problem, especially when they cannot be integrated with each other as they are built.
A bus architecture is undoubtedly one of the vital parts of building an enterprise data warehouse over time, for it provides the ability to join the data marts by specifying the conformed dimensions.
Conformed dimensions are the dimensions that are shared between data marts and make drill-across possible. Of course, they cannot be left out of the main features of an enterprise data warehouse.
Finally, the most common problem you may encounter while building conformed dimensions is that different data marts need different levels of grain, so it is recommended that you design the dimension levels correctly, which will also be helpful for drill-through actions.
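
As an illustration, drill-across amounts to querying each fact table separately at the grain of the conformed dimension and then merging the result sets on the conformed key. A minimal Python sketch with hypothetical monthly numbers (in real life these would be two SQL queries against two fact tables):

# Two result sets, each already aggregated to the grain of a
# conformed date dimension (hypothetical data, one row per month).
sales_by_month = {"2008-01": 1000, "2008-02": 1200}
shipments_by_month = {"2008-01": 950, "2008-02": 1100}

# Drill-across: merge the two result sets on the conformed key.
months = sorted(set(sales_by_month) | set(shipments_by_month))
for month in months:
    print(month,
          sales_by_month.get(month, 0),
          shipments_by_month.get(month, 0))

This only works because both data marts agree on what a month (and every other conformed attribute) means.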

Saturday, March 1, 2008

Factless Fact Tables

I have recently become familiar with the concept of “Factless Fact Tables”, and since I found it very interesting, I would like to share it with you, in case you are not already familiar with it.
A “Factless Fact Table” is similar to a fact table except that it has no measures; it contains only the links (foreign keys) to the dimensions. But what is the benefit?
These tables enable you to track events; indeed, they exist for recording events.
Another kind of factless table is known as a “Factless Coverage Table”, which is very useful. Imagine that you have retail stores and each store has its own promotion policy. It would be very complicated to answer this sort of question: “Which products were on promotion but didn't sell?”
The best way to cover these kinds of questions is a coverage table, which records what was on promotion, so the answer falls out as the difference between coverage and sales.
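
A tiny sketch of the idea in Python, with hypothetical store and product keys: the coverage table tells you what was on promotion, the sales fact tells you what actually sold, and the answer is the set difference.

# Factless coverage table: (store, product) pairs on promotion.
promotion_coverage = {("store1", "soda"), ("store1", "chips"),
                      ("store2", "soda")}

# Sales fact, reduced to the same grain: pairs that actually sold.
sold = {("store1", "soda"), ("store2", "soda")}

# "Which products were on promotion but didn't sell?"
print(promotion_coverage - sold)  # {('store1', 'chips')}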
I also recommend reading Ralph Kimball's article “Factless Fact Tables”.

Saturday, October 6, 2007

Index Optimizations

Dimension lookups and updates are row-by-row operations, so you must create appropriate indexes to get the best performance. One tip you can follow is to create the dimension table's clustered index on the business key instead of the surrogate key, and if you need to do better you can include the current-indicator flag or the end date as the second column in Type 2 dimensions. Keep in mind that index optimization must be balanced between ETL and query usage. Alternatively, you can put the data into a staging table, update the dimension rows that exist in the staging table through an inner join between the two, and finally truncate the staging table.
Be cautious about using this method: in most situations it is not a good idea to set the surrogate key aside and index the business keys, because joins between fact tables and dimension tables are based on surrogate keys.
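
Here is a minimal, runnable sketch of the staging-table update pattern described above, using Python's built-in sqlite3 purely so the example is self-contained (against a real warehouse this would simply be set-based SQL):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (sk INTEGER PRIMARY KEY,
                               business_key TEXT, city TEXT);
    CREATE TABLE stage_customer (business_key TEXT, city TEXT);
    INSERT INTO dim_customer VALUES (1, 'C001', 'Stockholm'),
                                    (2, 'C002', 'Gothenburg');
    INSERT INTO stage_customer VALUES ('C002', 'Malmo');
""")

# Set-based update: only rows present in the staging table change.
conn.execute("""
    UPDATE dim_customer
    SET city = (SELECT s.city FROM stage_customer s
                WHERE s.business_key = dim_customer.business_key)
    WHERE business_key IN (SELECT business_key FROM stage_customer)
""")

# Finally, empty the staging table (SQLite has no TRUNCATE).
conn.execute("DELETE FROM stage_customer")
print(conn.execute("SELECT * FROM dim_customer").fetchall())
# [(1, 'C001', 'Stockholm'), (2, 'C002', 'Malmo')]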

Wednesday, September 26, 2007

Late-arriving dimension scenario (inferred members)

Inferred members are dimension members that have not yet arrived from the source when the fact data related to them is ready to be inserted into the fact table.
If you want to insert that fact data anyway, you must add a placeholder row to the dimension and flag it as an “inferred member”; then, when the proper dimension data arrives, you update that inferred member. Note that, regardless of whether the attributes are Type 1 or Type 2, you simply update the original record and set the inferred-member flag to “No” afterwards. In other words, an inferred member turns all the attributes into Type 1 changes until the dimension member's details come in from the source.
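
A schematic sketch of that lookup logic in Python, with hypothetical structures (real ETL tools, such as the SSIS Slowly Changing Dimension wizard, implement this for you):

# In-memory stand-in for a dimension keyed by business key; each row
# carries a surrogate key and an inferred-member flag.
dimension = {}
next_sk = 1

def lookup_or_infer(business_key):
    """Return the surrogate key; create an inferred member if missing."""
    global next_sk
    if business_key not in dimension:
        dimension[business_key] = {"sk": next_sk, "name": "Unknown",
                                   "inferred": True}
        next_sk += 1
    return dimension[business_key]["sk"]

def apply_member_details(business_key, name):
    """When the real details arrive, update the original record in
    place (Type 1 style) and clear the inferred flag."""
    dimension[business_key].update(name=name, inferred=False)

sk = lookup_or_infer("C042")          # fact arrives first: placeholder row
apply_member_details("C042", "Acme")  # later: the member details show up
print(dimension["C042"])              # inferred flag is now False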

Slowly Changing Dimension (SCD)

There are three common SCD types:

  • Type 0 (fixed): a fixed attribute cannot be changed.
  • Type 1 (changing): also known as restating history; when the attribute changes, the old value is simply overwritten.
  • Type 2 (historical): also known as a historical attribute; used for attributes that change over time when it is important to keep the changes and reflect them in reports.

If you have a historical (Type 2) attribute that changes a lot over a specific period, and the changes are not important, you can switch the attribute to Type 1, apply your changes, and then switch it back after it has stabilized.
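
For reference, a bare-bones sketch of a Type 2 change in Python (hypothetical row layout): the current row is expired and a new row with a fresh surrogate key is inserted.

from datetime import date

# Type 2 dimension: several rows per business key, one of them current.
rows = [{"sk": 1, "business_key": "C001", "city": "Stockholm",
         "start": date(2001, 1, 1), "end": None, "current": True}]

def apply_type2_change(business_key, new_city, change_date, new_sk):
    """Expire the current row and insert a new version."""
    for row in rows:
        if row["business_key"] == business_key and row["current"]:
            row["end"] = change_date
            row["current"] = False
    rows.append({"sk": new_sk, "business_key": business_key,
                 "city": new_city, "start": change_date,
                 "end": None, "current": True})

apply_type2_change("C001", "Uppsala", date(2007, 9, 26), new_sk=2)
for row in rows:
    print(row)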


You can find more information on Wikipedia.

Monday, September 24, 2007

Surrogate Key

One of the most important things in dimensional modeling is the surrogate key, but what are its advantages?

  1. It can improve query performance, because surrogate keys are numeric values and take less space than business keys, which are usually strings. This also keeps the fact table narrow.
  2. It makes it possible to implement Slowly Changing Dimension Type 2, because it allows the dimension table to hold multiple rows for the same business key, each with a different surrogate key.
  3. It makes the design consistent and the reporting simpler.
  4. It is easier to manage than a composite business key, because when a dimension has a composite business key as its primary key, you are forced to add multiple columns to the fact tables, and managing such a situation would be difficult.

You can use the name [dimension name]AlternateKey for business keys; it helps keep the structure consistent.
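
As a small illustration of point 2 and of this naming convention (hypothetical names and keys), the sketch below hands out a fresh surrogate key for every inserted row, so the same business key may legally appear several times:

import itertools

# Hypothetical Customer dimension; the business key is stored
# alongside the surrogate key as CustomerAlternateKey.
new_sk = itertools.count(1)
customer_dim = []

def add_customer_version(alternate_key, name):
    """Every row gets its own surrogate key, which is what lets SCD
    Type 2 keep several versions of one business key."""
    row = {"CustomerKey": next(new_sk),
           "CustomerAlternateKey": alternate_key,  # the business key
           "Name": name}
    customer_dim.append(row)
    return row["CustomerKey"]

add_customer_version("CUST-0001", "Acme AB")
add_customer_version("CUST-0001", "Acme Aktiebolag")  # second version
print(customer_dim)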

Wednesday, September 5, 2007

Data Staging


Data staging is a term used among data warehouse developers. It describes the strategy the business intelligence team takes in the ETL process; in other words, it shows how the data should be treated during ETL. There are different kinds of staging strategies, such as ‘Staging Data from Multiple Sources’, ‘Staggered Staging’, ‘Persisted Staging’, ‘Accumulated Staging’, ‘Chunked Accumulated Staging’, and other destination considerations [1].

Staging Data from Multiple Sources
In this kind of data staging, the data is extracted into a staging database using one ETL process. Then, once it has been cleaned, it is loaded into the data warehouse using another ETL process. It is a common approach and is widely used when a data mart has several data sources.

Staggered Staging
This kind of staging is suitable when you need error handling in your ETL. It means that, if something goes wrong, you do not have to start the ETL process from scratch; instead, you can continue from the last persisted point in the process.
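
A minimal sketch of that restart logic in Python, with hypothetical stage names: each stage persists a checkpoint when it finishes, and after a failure the run resumes from the last completed stage.

import json, os

CHECKPOINT = "etl_checkpoint.json"
STAGES = ["extract", "clean", "conform", "load"]

def run_stage(name):
    print(f"running {name} ...")  # the real ETL work would go here

def completed_stages():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return set(json.load(f))
    return set()

def run_pipeline():
    done = completed_stages()
    for stage in STAGES:
        if stage in done:
            continue              # already persisted; skip on restart
        run_stage(stage)
        done.add(stage)
        with open(CHECKPOINT, "w") as f:
            json.dump(sorted(done), f)
    os.remove(CHECKPOINT)         # clean finish: forget the checkpoint

run_pipeline()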


Persisted Staging
In this kind of data staging, a copy of the staging data is persisted in a separate database. This is very important when requirements call for an auditing mechanism. It places a heavy burden on the ETL team, but sometimes it is needed.


Accumulated Staging
This is the most advanced of the data staging patterns. In this pattern, you need the capability to recognize changes in the source data and handle them in the ETL process. We have this staging solution at the bank where I work; in fact, it is a combination of the previous pattern and this one, because we also persist data in the staging database. In our form of data staging, we have triggers on our data sources (the DBMS is Oracle). However, it is not a good idea to have triggers on source databases; I personally believe that if the application designers had anticipated such requirements in the databases, it would have been a much better solution than recognizing these changes through triggers.


Chunked Accumulated Staging
In this sort of data staging, the data is loaded into the staging database several times during the day, and then loaded into the data warehouse all at once. It is suitable for retail stores.


Other Destination Considerations
There are also other types of data staging patterns. For example, you might need to populate several data marts based on geographical distribution.


Kimball Opinion
Last but not least, I would like to point out that Kimball does not advocate using a normalized relational data staging area. “The creation of both normalized structures for staging and dimensional structures for presentation means that the data is extracted, transformed, and loaded twice—once into the normalized database and then again when we load the dimensional model. Obviously, this two-step process requires more time and resources for the development effort, more time for the periodic loading or updating of data, and more capacity to store the multiple copies of the data. At the bottom line, this typically translates into the need for larger development, ongoing support, and hardware platform budgets. Unfortunately, some data warehouse project teams have failed miserably because they focused all their energy and resources on constructing the normalized structures rather than allocating time to development of a presentation area that supports improved business decision making.” [2]

However, in practice there are many needs that force BI teams to do so. This is common when creating data warehouses, and Kimball mentions that “It is acceptable to create a normalized database to support the staging processes; however, this is not the end goal. The normalized structures must be off-limits to user queries because they defeat understandability and performance. As soon as a database supports query and presentation services, it must be considered part of the data warehouse presentation area. By default, normalized databases are excluded from the presentation area, which should be strictly dimensionally structured.” [2]


References:
[1] Turley P., Cameron S., Kasprzak J., Microsoft SQL Server 2005 Integration Services Step by Step, 2007
[2] Kimball R., Ross M., The Data Warehouse Toolkit, Second Edition: The Complete Guide to Dimensional Modeling, 2002

Monday, August 13, 2007

Dashboard Applications and KPIs

Dashboard applications include KPIs, which help business users notice and understand the current situation of the business.
KPI stands for Key Performance Indicator: a measure of the state of the business, defined with a formula.
KPIs are shown as graphical icons and make it easier to recognize where the business stands.
Microsoft introduced a new product called Microsoft Office Business Scorecard Manager 2005 that can leverage the KPI capabilities of Analysis Services 2005.
Each KPI is evaluated through five groups: Value, Goal, Status, Trend, and Weight. You can assign an MDX expression to each group.

To create a new KPI you must specify the following information:
1. KPI name (mandatory)
2. Associated measure group (mandatory): the measure group your indicator belongs to.
3. Value expression (mandatory): the value that you want to compare against something else.
4. Goal expression: the value that you expect to reach.
5. Status: the state of your business, shown as one of the icons in Figure 1.
6. Trend: the movement of your business, shown with the graphical icons in Figure 2.
7. Additional properties: in this section you can categorize where your KPI will be shown. Moreover, you can create parent-child relationships between KPIs and assign a weight to each child KPI for calculating the parent.


Figure 1: Status icons



Figure 2: Trend icons
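
In Analysis Services, the Status and Trend expressions are MDX expressions that conventionally return a number between -1 and 1, which the client then maps to one of these icons. Here is a schematic Python stand-in for that mapping (the 5% tolerance band is my own assumption, purely for illustration):

def status(value, goal, tolerance=0.05):
    """Map value vs. goal to -1 (bad), 0 (acceptable) or 1 (good),
    the range a Status expression is expected to return."""
    if value >= goal:
        return 1
    if value >= goal * (1 - tolerance):   # assumed tolerance band
        return 0
    return -1

def trend(current, previous):
    """-1 falling, 0 flat, 1 rising -- the Trend icon's input."""
    return (current > previous) - (current < previous)

print(status(value=95, goal=100))      # 0: within 5% of the goal
print(trend(current=95, previous=90))  # 1: moving upward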

Saturday, August 11, 2007

The Business Intelligence Lifecycle

This lifecycle is an iterative approach and follows four primary principles:

1. Focus on the business (Inception)
In this phase, you must gather the requirements, identify the business areas involved in your project scope, and analyze them.

2. Build an information infrastructure (Elaboration)
In this phase, you should design an approach for bringing all the information together into a single, integrated, easy-to-understand, and well-structured foundation that covers all the requirements you gathered before.

3. Deliver in meaningful increments (Construction)
In this phase, you build the data warehouse in incremental steps and deliver them on meaningful schedules.

4. Deliver the entire solution (Transition)
In this phase, you deliver everything you have built before, such as reports, ad hoc queries, websites, and documents.