Business Intelligence:

Dimensional modeling is the design concept used by many data warehouse designers to build their data warehouse. Dimensional model is the underlying data model used by many of the commercial OLAP products available today in the market. In this model, all data is contained in two types of tables called Fact Table and Dimension Table.

 

Fact table contains the measurements or metrics or facts of business processes. If your business process is Sales, then a measurement of this business process such as "monthly sales number" is captured in the fact table. In addition to the measurements, the only other things a fact table contains are foreign keys for the dimension tables.

 

Context of the measurements are represented in dimension tables. You can also think of the context of a measurement as the characteristics such as who, what, where, when, how  of a measurement (subject ). In your business process Sales, the characteristics of the 'monthly sales number' measurement can be a Location (Where), Time (When), Product Sold (What).

The Dimension Attributes are the various columns in a dimension table. In the Location dimension, the attributes can be Location Code, State, Country, Zip code. Generally the Dimension Attributes are used in report labels, and query constraints such as where Country='USA'. The dimension attributes also contain one or more hierarchical relationships.

Before designing your data warehouse, you need to decide what this data warehouse contains. Say if you want to build a data warehouse containing monthly sales numbers across multiple store locations, across time and across products then your dimensions are:

Location
Time
Product

Each dimension table contains data for one dimension. In the above example you get all your store location information and put that into one single table called Location. Your store location data may be spanned across multiple tables in your OLTP system (unlike OLAP), but you need to de-normalize all that data into one single table.

In Online Transaction Processing (OLTP), the database is designed to achieve efficient transactions such as INSERT and UPDATE. This is very different from the OLAP design.

Unlike OLAP, normalization is very important to reduce duplicates and also cut down on the size of the data. our OLTP schema may look like this:

Locations Table
 

Field Name

Type

Loc_Id

INTEGER (4)

Loc_Code

VARCHAR (5)

Loc_Name

VARCHAR (30)

State_Id

INTEGER (4)

Country_Id

INTEGER (4)

 

States Table

 

Field Name

Type

Sate_Id

INTEGER (4)

State_Name

VARCHAR (50)

 

 

Countries Table
 

Field Name

Type

Country_Id

INTEGER (4)

Country_Name

VARCHAR (50)


 

In order to query for all locations that are in country 'USA' we will have to join these three tables. The SQL will look like:

SELECT * FROM Locations, States, Countries where Locations.State_Id = States.State_Id AND Locations.Country_id=Countries.Country_Id and Country_Name='USA'

 

Dimensional modeling allows only one table per dimension. But your OLTP data spans across multiple tables as described.  So we need de-normalize the OLTP schema and export into your Dimension Tables.

 

For example, for the location dimension, you achieve this by joining the three OLTP tables and inserting the data into the single Location table.

Your Location Table will look like this:
 


Location Dimension Table Schema
 

Field Name

Type

Dim_Id

INTEGER (4)

Loc_Code

VARCHAR (4)

Name

VARCHAR (50)

State_Name

VARCHAR (20)

Country_Name

VARCHAR (20)

All Dimension tables contain a key column called the dimension key. In this example Dim_Id is our dimension Id. This is the unique key into our Location dimension table.

The actual data in your Location Table may look like this

Location Dimension Table Data

Dim_Id

Loc_Code

Name

State_Name

Country_Name

1001           

IL01

Chicago Loop

Illinois

USA

1002           

IL02

Arlington Hts

Illinois

USA

1003

NY01

Brooklyn

New York

USA

1004

TO01

Toronto

Ontario

Canada

1005

MX01

Mexico City

Distrito Federal

Mexico

You may notice that some of the information is repeated in the above dimension table. The State Name and Country Name are repeated through out the table. You may feel that this is waste of data space and against the normalization principles. But in dimensional modeling this type of design makes the querying very optimized and reduces the query times. Also we will learn later that in a typical data warehouse, the dimension tables make up only 10 to 15 % of the storage as the fact table is by far the largest table and takes up the rest of the storage allocation.

After de-normalization, your Time table will look like this:

Time Dimension Table Schema
 

Field Name

Type

Dim_Id

INTEGER (4)

Month

SMALL INTEGER (2)

Month_Name

VARCHAR (3)

Quarter

SMALL INTEGER (4)

Quarter_Name

VARCHAR (2)

Year

SMALL INTEGER (2)

 

The actual data in your Time Table may will look like this:
 

 


Time Dimension Data
 

TM _Dim_Id

TM _Month

TM _Month_Name

TM _Quarter

TM _Quarter_Name

TM_Year

1001           

1

Jan

1

Q1

2003

1002           

2

Feb

1

Q1

2003

1003

3

Mar

1

Q1

2003

1004

4

Apr

2

Q2

2003

1005

5

May

2

Q2

2003

 

After de-normalization, your Product table will look like this:

Product Dimension Table Schema

Field Name

Type

Dim_Id

INTEGER (4)

SKU

VARCHAR (10)

Name

VARCHAR (30)

Category

VARCHAR (30)

In this table PR_Dim_Id is our dimension Id. This is the unique key into our Product dimension table.
 
The actual data in your Product Table may look like this:

Product Dimension Table Data
 

Dim_Id

SKU

Name

Category

1001

DOVE6K

Dove Soap 6Pk

Sanitary

1002

MLK66F#

Skim Milk 1 Gal

Dairy

1003

SMKSAL55

Smoked Salmon 6oz

Meat

Fact table contains the actual business process measurements or metrics called facts. Usually these facts are numeric.

These facts  are generally Additive.

Some times the facts are semi additive such as balances

Some times they are non additive such as unit price

In the above example where you are building a data warehouse of monthly sales in dollars, your fact table will contain the actual sales numbers, one row per month. In addition to the data itself, you will have the foreign keys for the various dimensions in this row.

Granularity or Grain of Fact Table

The level of detail of the fact table is known as the grain of the fact table. In this example the grain of your fact table is monthly sales  number per location per product.


Your Fact Table will look like this

MonthlySales Fact Table Schema
 

Field Name

Type

TM_Dim_Id

INTEGER (4)

PR_ Dim_Id

INTEGER (4)

LOC_ Dim_Id

INTEGER (4)

Sales

INTEGER (4)

In this table the combination of all three dimension table foreign keys make up the primary key in the fact table. (TM_Dim_Id, PR_ Dim_Id, LOC_ Dim_Id) is our primary key.  This is the unique key into our Sales fact table.


The actual data in your MonthlySales Table will look like this:

MonthlySales Fact Table Data
 

TM_Dim_Id

PR_ Dim_Id

LOC_ Dim_Id

Sales

1001

1001

1003

435677

1002

1002

1001

451121

1003

1001

1003

98765

1001

1004

1001

65432

A fact table may contain one or more Facts. Usually you create one fact table per business process or event. For example if you want to analyze the sales numbers and also advertising spending, they are two separate business processes. So you will create two separate fact tables, one for sales data and one for advertising cost data. On the other hand if you want to track the sales tax in addition to the sales number, you simply create one more fact column in the Sales fact table called Tax.

What Is A Star Schema?

If you carefully look at our new dimensional modeled schema, it will look like this:





You can easily tell this looks like a STAR. Hence it is also known as Star Schema.

Advantages of Star Schema

                          Star Schema is very easy to understand, even for non technical business managers

                          Star Schema provides better performance and smaller query times

                          Star schema is easily extensible and will handle future changes easily

A Typical SQL Query Template for the Sales Schema will look like:

--- Select the measurements that you want to aggregate using SUM clause
SELECT P.Name, SUM(F.Sales)


---
JOIN the FACT table with Dimension Tables
FROM Sales F, Time T, Product P, Location L                         
WHERE F.TM_Dim_Id = T.Dim_Id
AND F.PR_Dim_Id = P.Dim_Id
AND  F.LOC_Dim_Id = L.Dim_Id

--- Constrains the Dimension Attributes
AND  T.Month='Jan' AND T.Year='2003' AND L.Country_Name='USA'

-- finally the 'group by' clause identifies the aggregation level. In this example you are aggregating
-- all sales within a product category.

GROUP BY P.Category

Now if you look at a report  generated from your  dimensional modeled data warehouse, it will look like this:










You can tell from the above report that all your dimensions are the row and column headings in the report.  The facts are the numeric numbers in the report.

 

Advantages of Star Schema

                          Star Schema is very easy to understand, even for non technical business managers

                          Star Schema provides better performance and smaller query times

                          Star schema is easily extensible and will handle future changes easily

If we did not de-normalize our dimensions into one table each, then the schema would have looked like this


 

You can see,  this looks like a snow flake, hence this type of schema is called Snowflake Schema.  General rule of thumb is keep away from snow flake schemas as even though they may save you some space, they will cost a lot in terms of query times.

1. First identify the business process.

In this step you will determine what is your business process that your data warehouse represents. This process will be the source of your metrics or measurements.

2. Identify the Grain

You will determine what does one row of fact table mean. In the previous example you have decided that your grain is 'monthly sales per location per product'.

3. Identify the Dimensions

Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as much as possible and confirm to your grain.

4. Finally Identify the facts

In this step you will identify what are your measurements (or metrics or facts). The facts should be numeric and should confirm to the grain defines in step 2.