ETL



What is ETL?

 ETL, if we say it in simpler terms, it is nothing but “Extract, Transform and Load”. We know that testing is an essential part of the design life cycle of any software product. This is also true that for succeeding in any data warehousing projects, testing plays a critical role as one can gain user’s confidence only when the quality of information we provide is up to the mark. 

We have also seen these days that Verification and Validation, when done independently, grabbing business’s attention as they see in it a huge market potential and business gains as well. Beside these, customers have been offered different range of products in terms of service providers, based on the different technologies, processes and solutions, also distributed in many areas. Hence, ETL testing and Data warehouse testing are continuously emerging as the best techniques to provide customers the best of the services and offerings in different areas.


Terms used in ETL Testing: 
1. E (xtraction)- is the operation of selecting data from a source system for further processing. The extracted data is first received in a Database space identified as "Staging" area.

It is of two types:
a) Full Extraction
b) Incremental Extraction

Let’s discuss these in detail:

a)      Full Extraction: It is a type of extraction where whole data from source system is extracted at ones, it replicates data as it is into the target system. In this case, target is always empty or truncated. This happens during initial load of the system and during data migration from one system to another. Data is cumulated at source.

b)      Incremental Extraction: In this type of extraction only additional data is extracted for specific period and then that extracted data is appended or updated in the target system. In this case, target system is not empty prior to load. Data is cumulated at target. Here we define a Load Period Window by ‘Start time’ and ‘End time’. This type of extraction is also known as “Delta Extraction”

2. T (ransformation)- is the change data which will undergo as part of data processing and can be further categorized as follows:

a) Data Cleansing: It means that the data is being cleaned and errors will be fixed.
There are some processes involved to remove the anomalies such as:
·         Unwanted spaces
·         Junk characters
·         Aligning data format such as date format 

b)  Applying business rules: These form the core transformation and are of following types:

·         Business rules such as Commission of agent on the basis of sales
·         Complex equations such as Net Asset values, Net Profit Margin, Gross Profit Margin
·         Apply calculation such as calculating the profit given cost price and selling price

Data Quality Checks that can be applied is broadly classified as :



3. L(oading)-
Loading is the process to take data from the Staging or Landing area to the final destination known as the ‘Base Database’ or ‘Target System’ .
 
There are some important points which must be kept in mind while doing data loading such as:

As we discussed that the Extraction is done in 2 ways similarly Data loading can also be done in 2 ways:

a) History Load: In this case,  the target tables are truncated and entire data is loaded into it. Right utilities should be chosen in this case as it involves huge data load. Hence, there is always some dedicated code written for performing History Loads.

b) Incremental Load: It requires some data warehousing concepts so that it can maintain partial or full history. SCD that is Slowly Changing Dimensions (which is later discussed in the blog) concept is also involved in this case. Also, the concepts of Insert/Update or Delete/Insert is used.
 

No comments:

Post a Comment