"Turn your hidden data into profit"

Building A Data Warehouse

The building of a data warehouse is known as extract, transformation, load (ETL for short).

First we extract the data from its various data sources. These will usually be database tables from various systems within an organisation, but are often supplemented by other files such as spreadsheets and text files.

In the transformation process we have the opportunity to massage the data into what we want for the data warehouse. This is where most of the action takes place. Apart from adding calculations and data from various tables where necessary, this is where we check that the data is consistent. For example dates on different systems in the organisation are often stored in different ways. These differences must be resolved in the transformation process.

In transformation codes must be translated. This is known as data cleansing.

The load process creates the data warehouse. In a dimensional star schema data warehouse this means loading the dimension and fact tables. Often the dimensions do not change much – for example a price list might only change annually – so not all dimensions need to be loaded every day. Even then, the dimensions can often be updated with changes and additions rather than being completely rebuilt each time.

Fact tables – for example your daily sales – need to be updated more often, usually on a daily basis. This involves checking that all records in the fact tables link correctly to the dimension tables, as this is vital for the star schema to join together properly and have any meaning.

Monitoring the build. In most organisations the data warehouse build will need to be monitored to ensure that the data has arrived at its destination and is correct. This monitoring can be done in a number of ways (home-made scripts for example) but should involve some form of automation. A good ETL package will provide tools for monitoring the load such as log files, a shcheduler and reject files for rejected input records. A sophisticated ETL system will be able to send out email alerts when things go wrong, and in some cases correct the problem and reinput the data.

comments

Leave a Reply