ETL — Understanding It and Effectively Using It

Using ETL as an Enabler for Data Warehouses, Data Hubs, and Data Lakes

by Punit Pathak

If you are familiar with databases, data warehouses, data hubs, or data lakes then you have experienced the need for ETL (extract, transform, load) in your overall data flow process.

While there are a number of solutions available, my intent is not to cover individual tools in this post, but focus more on the areas that need to be considered while performing all stages of ETL processing, whether you are developing an automated ETL flow or doing things more manually.

With that being said, if you are looking to build out a Cloud Data Warehouse with a solution such as Snowflake, or have data flowing into a Big Data platform such as Apache Impala or Apache Hive, or are using more traditional database or data warehousing technologies, here are a few links to analysis on the latest ETL tools that you can review (Oct 2018 Review -and- Aug 2018 Analysis.

Keep in mind that if you are leveraging Azure (Data Factory), AWS (Glue), or Google Cloud (Dataprep), each cloud vendor has ETL tools available as well. Finally solutions such as Databricks (Spark), Confluent (Kafka), and Apache NiFi provide varying levels of ETL functionality depending on requirements.

An Introduction to ETL

ETL is a type of data integration process referring to three distinct but interrelated steps (Extract, Transform and Load) and is used to synthesize data from multiple sources many times to build a Data Warehouse, Data Hub, or Data Lake.

The most common mistake and misjudgment made when designing and building an ETL solution is jumping into buying new tools and writing code before having a comprehensive understanding of business requirements/needs.

There are some fundamental things that should be kept in mind before moving forward with implementing an ETL solution and flow.

Why Do We Need ETL?

It is essential to properly format and prepare data in order to load it in the data storage system of your choice. The triple combination of ETL provides crucial functions that are many times combined into a single application or suite of tools that help in the following areas:

Offers deep historical context for business.
Enhances Business Intelligence solutions for decision making.
Enables context and data aggregations so that business can generate higher revenue and/or save money.
Enables a common data repository.
Allows verification of data transformation, aggregation and calculations rules.
Allows sample data comparison between source and target system.
Helps to improve productivity as it codifies and reuses without additional technical skills.

A basic ETL process can be categorized in the below stages:

Data Extraction
Data Cleansing
Transformation
Load

A viable approach should not only match with your organization?s need and business requirements but also performing on all the above stages.

Traversing the Four Stages of ETL ? Pointers to Keep in Mind

Know and understand your data source ? where you need to extract data
Audit your data source
Study your approach for optimal data extraction
Choose a suitable cleansing mechanism according to the extracted data
Once the source data has been cleansed, perform the required transformations accordingly
Know and understand your end destination for the data ? where is it going to ultimately reside
Load the data

The steps above look simple but looks can be deceiving. Let?s now review each step that is required for designing and executing ETL processing and data flows.

Data Extraction and Data Cleaning

Data Source

It is very important to understand the business requirements for ETL processing. The source will be the very first stage to interact with the available data which needs to be extracted. Organizations evaluate data through business intelligence tools which can leverage a diverse range of data types and sources.

The most common of these data types are:

Databases
Flat Files
Web Services
Other Sources such as RSS Feeds

First, analyze how the source data is produced and in what format it needs to be stored. Traditional data sources for BI applications include Oracle, SQL Server, MySql, DB2, Hana, etc.

Evaluate any transactional databases (ERP, HR, CRM, etc.) closely as they store an organization?s daily transactions and can be limiting for BI for two key reasons:

Querying directly in the database for a large amount of data may slow down the source system and prevent the database from recording transactions in real time.
Data in the source system may not be optimized for reporting and analysis.

Usage and Latency

Another consideration is how the data is going to be loaded and how will it be consumed at the destination.

Let?s say the data is going to be used by the BI team for reporting purposes, so you?d certainly want to know how frequently they need the data. Further, if the frequency of retrieving the data is very high but volume is low then a traditional RDBMS might suffice for storing your data as it will be cost effective. If the frequency of retrieving the data is high, and the volume is the same, then a traditional RDBMS could in fact be a bottleneck for your BI team. That type of situation could be well served by a more fit for purpose data warehouse such as Snowflake or Big Data platforms that leverage Hive, Druid, Impala, HBase, etc. in a very efficient manner.

There are many other considerations as well including current tools available in house, SQL compatibility (especially related to end user tools), management overhead, support for a wide variety of data, among other things.

Auditing your Source Data

Data auditing refers to assessing the data quality and utility for a specific purpose. Data auditing also means looking at key metrics, other than quantity, to create a conclusion about the properties of the data set. In short, data audit is dependent on a registry, which is a storage space for data assets.

So, ensure that your data source is analyzed according to your different organization?s fields and then move forward based on prioritizing the fields.

Analyzing Data Extraction

The main objective of the extraction process in ETL is to retrieve all the required data from the source with ease. Therefore, care should be taken to design the extraction process to avoid adverse effects on the source system in terms of performance, response time, and locking.

Steps to Perform Extraction

Push Notification: It?s always nice if the source system is able to provide a notification that the records have been modified and provide the details of changes.
Incremental/Full Extract: Some systems may not provide the push notification service, but may be able to provide the detail of updated records and provide an extract of such records. During further ETL processing, the system needs to identify changes and propagate it down.

There are times where a system may not be able to provide the modified records detail, so in that case, full extraction is the only choice to extract the data. Make sure that full extract requires keeping a copy of the last extracted data in the same format to identify the changes.

While using Full or Incremental Extract, the extracted frequency is critical to keep in mind.

Challenges Faced During the Extraction Process

One of the challenges that we typically face early on with many customers is extracting data from unstructured data sources, e.g. text, emails and web pages and in some cases custom apps are required depending on ETL tool that has been selected by your organization. This can and will increase the overhead cost of maintenance for the ETL process.

Second, the implementation of a CDC (Change Data Capture) strategy is a challenge as it has the potential for disrupting the transaction process during extraction. Many times the extraction schedule would be an incremental extract followed by daily, weekly and monthly to bring the warehouse in sync with the source. Extraction of data from the transactional database has significant overhead as the transactional database is designed for efficient insert and updates rather than reads and executing a large query.

And last, don?t dismiss or forget about the ?small things? referenced below while extracting the data from the source.

Change in data formats over time.
Increase in data velocity and volume.
Rapid changes on data source credentials.
Null issues.
Change requests for new columns, dimensions, derivatives and features.
Writing source specific code which tends to create overhead to future maintenance of ETL flows.

Combining all the above challenges compounds with the number of data sources, each with their own frequency of changes.

Data Cleansing Requirements

Data cleaning, cleansing, and scrubbing approaches deal with detection and separation of invalid, duplicate, or inconsistent data to improve the quality and utility of data that is extracted before it is transferred to a target database or Data Warehouse. With the significant increase in data volumes and data variety across all channels and sources, the data cleansing process plays an increasingly vital role in ETL to ensure that clean, accurate data will be used in downstream decision making and data analysis.

A solid data cleansing approach should satisfy a number of requirements:

Detection and removal of all major errors and inconsistencies in data either dealing with a single source or while integrating multiple sources.
Correcting of mismatches and ensuring that columns are in the same order while also checking that the data is in the same format (such as date and currency).
Enriching or improving data by merging in additional information (such as adding data to assets detail by combining data from Purchasing, Sales and Marketing databases) if required.
Data cleaning should not be performed in isolation but together with schema-related data transformations based on comprehensive metadata.
Mapping functions for data cleaning should be specified in a declarative way and be reusable for other data sources as well as for query processing.

A workflow process must be created to execute all data cleansing and transformation steps for multiple sources and large data sets in a reliable and efficient way.

Data Cleansing Problems

Data quality problems that can be addressed by data cleansing originate as single source or multi-source challenges as listed below:

Potential Problems with Data from a Single Source

Uniqueness
Misspelling
Redundancy/Duplicates
Outside domain range
Data entry errors
Referential integrity
Contradictory values

Potential Problems with Data from Multiple Sources

Naming conflicts at the schema level ? using the same name for different things or using a different name for the same things
Structural conflicts
Inconsistent aggregating
Inconsistent timing

Data Cleansing Approach

While there are a number of suitable approaches for data cleansing, in general, the phases below will apply:

Data Analysis

In order to know the types of errors and inconsistent data that need to be addressed, the data must be analyzed in detail. For data analysis, metadata can be analyzed that will provide insight into the data properties and help detect data quality problems. There are two related approaches to data analysis.

Data Profiling

As data gets bigger and infrastructure moves to the cloud, data profiling is increasingly important. Data profiling, data assessment, data discovery, data quality analysis is a process through which data is examined from an existing data source in order to collect statistics and information about it. In this step, a systematic up-front analysis of the content of the data sources is required.

Data profiling requires that a wide variety of factoring are understood including the scope of the data, variation of data patterns and formats in the database, identifying multiple coding, redundant values, duplicates, nulls values, missing values and other anomalies that appear in the data source, checking of relationships between primary and foreign key plus the need to discover how this relationship influences the data extraction, and analyzing business rules.

Data Mining

Data mining, data discovery, knowledge discovery (KDD) refers to the process of analyzing data from many dimensions, perspectives and then summarizing into useful information. It also refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data in databases.

In actual practice, data mining is a part of knowledge discovery although data mining and knowledge discovery can be considered synonyms. Through a defined approach and algorithms, investigation and analysis can occur on both current and historical data to predict future trends so that organizations? will be enabled for proactive and knowledge-driven decisions.

Defining Transformations and Mapping Rules

Many transformations and cleaning steps need to be executed, depending upon the number of data sources, the degree of heterogeneity, and the errors in the data. Sometimes, a schema translation is used to map a source to a common data model for a Data Warehouse, where typically a relational representation is used.

First, data cleaning steps could be used to correct single-source instance problems and prepare the data for integration. Later in the process, schema/data integration and cleaning multi-source instance problems, e.g., duplicates, data mismatch and nulls are dealt with.

Declarative query and a mapping language should be used to specify schema related data transformations and a cleaning process to enable automatic generation of the transformation code.

Verification

The transformation workflow and transformation definition should be tested and evaluated for correctness and effectiveness. Improving the sample or source data or improving the definition may be necessary. Multiple repetitions of analysis, verification and design steps are needed as well because some errors only become important after applying a particular transformation.

Transformation

Execution of transformational steps is required either by running the ETL workflow for loading and by refreshing the data in a data warehouse or during the period of answering the queries on multiple sources.

Backflow and Cleaned Data

After removal of errors, the cleaned data should also be used to replace on the source side in order improve the data quality of the source database. This process will avoid the re-work of future data extraction.

Data Transformation

Once data cleansing is complete, the data needs to be moved to a target system or to an intermediate system for further processing. The transformation step in ETL will help to create a structured data warehouse. Transformation refers to the data cleansing and aggregation that prepares it for analysis. There are two approaches for data transformation in the ETL process.

Multistage Data Transformation: In this process, extracted data is moved to an intermediate area (staging) where transformation occurs prior to loading the data into the final target area (data warehouse).
In-Warehouse Data Transformation: In this process, the flow would be ELT (Extract, Load and then Transform). The extracted data will be loaded into the data warehouse and there the transformation will occur.

Below, aspects of both basic and advanced transformations are reviewed.

Basic Transformation

Format Standardization: Standardize the data type and length according to field format to make it easy for end user to retrieve data.
Cleaning: Includes mapping of values to some derived/short meaning like mapping ?Male? to ?M?, null to ?0?, etc.
Deduplication: Involves removing of duplicate values.
Constraints Implementation: Establishment of key relationships across tables.

Advanced Transformation

Decoding of Fields: Data coming from multiple sources many times will be described by varying field values and often times legacy source systems use fairly cryptic codes to represent business values making it necessary to remove fields having similar information and or changing obscure codes into values that make business sense to users that consume the data.
Merging of Information: It?s common to merge related fields together and view the merged fields as a single entity, e.g. product, product price, product type, description, etc.
Splitting single fields: Splitting a large text field into a single field for easier consumption, e.g. splitting full name into first_name, middle_name and last_name.
Calculated and Derived Values: At times, an aggregation can be required on the dataset before loading it to a Data Warehouse, e.g. calculating total cost and profit margin.
Summarization: Values are summarized to obtain a total figure which is subsequently calculated and stored at multiple levels as business fact in multidimensional tables.

Data Loading

In this phase, extracted and transformed data is loaded into the end target source which may be a simple delimited flat file or a Data Warehouse depending on the requirement of the organization.

There are two types of tables in Data Warehouse: Fact Tables and Dimension Tables. Once the data is loaded into fact and dimension tables, it?s time to improve performance for BI data by creating aggregates.

Aggregation

In order to design an effective aggregate, some basic requirements should be met. First, aggregates should be stored in their own fact table. Next, all dimensions that are related should be a compacted version of dimensions associated with base-level data. Finally, affiliate the base fact tables in one family and force SQL to invoke it.

Aggregation helps to improve performance and speed up query time for analytics related to business decisions.

An Effective Loading Process

Referential Integrity Constraints

Referential integrity constraints will check if a value for a foreign key column is present in the parent table from which the foreign key is derived. This constraint is applied when new rows are inserted or the foreign key column is updated.

While inserting or loading a large amount of data, this constraint can pose a performance bottleneck. Hence, it?s imperative to disable the foreign key constraint on tables dealing with large amounts of data, especially fact tables. Make sure that the purpose for referential integrity is maintained by the ETL process that is being used.

Other Considerations for Effective Loading

Indexes should be removed before loading data into the target. They may be rebuilt after loading.
Manage partitions. The most recommended strategy is to partition tables by date interval such as a year, month, quarter, some identical status, department, etc.
In the case of incremental loading, the database needs to synchronize with the source system. The incremental load will be a more complex task in comparison with full load/historical load.

Below are the most common challenges with incremental loads.

Challenges with Incremental Loads

Ordering: To handle large amounts of data with high availability, data pipelines often leverage a distributed systems approach which implies that data may be processed in a different order than when it was received. If data is deleted or updated, then processing in the wrong order will lead to data errors, therefore maintaining and ordering is crucial for keeping data accurate.
Schema Evaluation: It is necessary to evaluate the source schema at the time of loading the data to ensure data consistency.
Monitoring Capability: Data coming from a variety of sources presents complexities, and potentially failures due to an API being unavailable, network congestion or failure, API credential expiration, data incompleteness or inaccuracy ? monitoring is critical as recovering from these issues can be complex.

A final note that there are three modes of data loading: APPEND, INSERT and REPLACE, and precautions must be taken while performing data loading with different modes as that can cause data loss as well.

Final Thoughts

I hope this article has assisted in giving you a fresh perspective on ETL while enabling you to understand it better and more effectively use it going forward. It would be great to hear from you about your favorite ETL tools and the solutions that you are seeing take center stage for Data Warehousing.

Feel free to share on other channels and be sure and keep up with all new content from Hashmap here.

Punit Kumar Pathak is a Jr. Big Data Developer at Hashmap working across industries (and clouds) on a number of projects involving ETL pipelining as well as log analytics flow design and implementation. He works with a group of innovative technologists and domain experts accelerating high value business outcomes for customers, partners, and the community.