In my previous blog post titled ‘Business
Intelligence for Beginners’ we learnt basics of BI, Star Schema and defined few
related terms.
In this post we will learn what ETL is in
more detail. ETL stands for Extract, Transform, Load and is a process of
extracting data from various sources, transforming them in a common format and
loading the data in Data Warehouse. This data is then used for decision making.
Extraction:
Extraction is a process of getting data
from various sources and transferring to Data warehouse. It’s not as simple as
copy paste, its complex. There are various factors to it, which we will discuss
further.
This is the most complex and time consuming
part in the entire data warehousing process because of following reasons:
1.
The source system may be poorly
designed.
2.
Deciding which data needs to be
extracted from every source.
3.
Data needs to be extracted from
heterogeneous and disparate system.
4.
Sources may have different Operating
systems, DBMS, hardware and communication protocol.
Some Examples of Sources are ERP, CRM, Flat files, Batch File, text files, spreadsheets.
Some Examples of Sources are ERP, CRM, Flat files, Batch File, text files, spreadsheets.
Also, this is not a onetime activity. As
the source data gets updated Data warehouse also needs to be periodically
(hourly, monthly or yearly) updated for getting up to date Information from raw
data.
There are two extraction methods namely,
Logical Extraction Method and Physical Extraction Method. We first need to
decide how the data can be extracted logically before extracting it physically.
Transformation:
Transformation can be defined as applying
set of rules/programs/functions to the data extracted from various sources.
This step is responsible for Data
Validation, Accuracy, and Conversion of source data type according to the
business rule of the organization.
Also, once the data is transformed data
cleansing is done before loading data to target database.
Few forms of transformation:
1.
Simple one-to-one scalar
transformation
-
0/1 for M/F of data
2.
One to many element
transformation
-
4*20 address field for Street
Address/City/Zip/Country
3.
Many to many element
transformation
-
Detailed information of every
individual
Example: First column in all 3
examples shown below shows the data extracted and the later shows how the data resembles
after Transformation.
Loading:
Loading is a process of loading the
transformed data to data warehouse in a particular format depending on
organisational need. We need to load both facts and dimensions.
Using Following 3 strategies data can be
uploaded:
1.
Full data refresh
2.
Incremental data refresh
3.
Continuous data refresh with
constant data collection and loading using row level insert.
We will go in more depth in next blog-posts. Stay tuned!
References:
http://www.zeepedia.com/read.php?issues_of_etl_legacy_data_web_scrapping_data_quality_etl_vs_elt_data_warehousing&b=6&c=18
Superb ! Your blog is incredible. I am delighted with it. Thanks for sharing with me.
ReplyDeleteAndroid Training in Chennai |
Android training |
Android Course in Chennai
I found your blog on google and read a few of your other posts. I just added you to my Google News Reader. Keep up the great work Look forward to reading more from you in the future.PHP Training in Chennai |
ReplyDeletePHP Course in Chennai |
PHP Training Chennai