Thursday, February 6, 2014

Brief Overview of ETL

In my previous blog post titled ‘Business Intelligence for Beginners’ we learnt basics of BI, Star Schema and defined few related terms.

In this post we will learn what ETL is in more detail. ETL stands for Extract, Transform, Load and is a process of extracting data from various sources, transforming them in a common format and loading the data in Data Warehouse. This data is then used for decision making.


In General, ETL process takes 70% of the DW development time.




Extraction:
Extraction is a process of getting data from various sources and transferring to Data warehouse. It’s not as simple as copy paste, its complex. There are various factors to it, which we will discuss further.

This is the most complex and time consuming part in the entire data warehousing process because of following reasons:

1.    The source system may be poorly designed.
2.    Deciding which data needs to be extracted from every source.
3.    Data needs to be extracted from heterogeneous and disparate system.
4.    Sources may have different Operating systems, DBMS, hardware and communication protocol.

Some Examples of Sources are ERP, CRM, Flat files, Batch File, text files, spreadsheets.
Also, this is not a onetime activity. As the source data gets updated Data warehouse also needs to be periodically (hourly, monthly or yearly) updated for getting up to date Information from raw data.

There are two extraction methods namely, Logical Extraction Method and Physical Extraction Method. We first need to decide how the data can be extracted logically before extracting it physically.

Transformation:
Transformation can be defined as applying set of rules/programs/functions to the data extracted from various sources.

This step is responsible for Data Validation, Accuracy, and Conversion of source data type according to the business rule of the organization.

Also, once the data is transformed data cleansing is done before loading data to target database.
Few forms of transformation:
1.    Simple one-to-one scalar transformation
-       0/1 for M/F of data
2.    One to many element transformation
-       4*20 address field for Street Address/City/Zip/Country
3.    Many to many element transformation
-       Detailed information of every individual

Example: First column in all 3 examples shown below shows the data extracted and the later shows how the data resembles after Transformation.


Loading:
Loading is a process of loading the transformed data to data warehouse in a particular format depending on organisational need. We need to load both facts and dimensions.

Using Following 3 strategies data can be uploaded:
1.    Full data refresh
2.    Incremental data refresh
3.    Continuous data refresh with constant data collection and loading using row level insert.

We will go in more depth in next blog-posts. Stay tuned!


References:




2 comments:

  1. Superb ! Your blog is incredible. I am delighted with it. Thanks for sharing with me.
    Android Training in Chennai |
    Android training |
    Android Course in Chennai

    ReplyDelete
  2. I found your blog on google and read a few of your other posts. I just added you to my Google News Reader. Keep up the great work Look forward to reading more from you in the future.PHP Training in Chennai |
    PHP Course in Chennai |
    PHP Training Chennai

    ReplyDelete