Skip to main content

Power of Automation for Data Testing

Written By: Rajkanth Kamath


Power of Automation for Data Testing

July 28, 2020 7-Minute read

Testing of ETL Projects – Why is it crucial than ever?

Testing of business-critical applications that are data and process-intensive is a laborious process owing to varied factors. Processing of huge amount of data from multiple sources involves complex calculations, transformations that necessitate frequent and elaborate data loading and reloading activities.

A systematic approach towards data validation is necessary owing to high throughput, processing volumes and business sensitivity. Further, frequent and increased number of enhancements rolled out, demand additional levels of validation besides the frequent regression checks.

Accelerating Data Validation with Automation

The elaborate and time-consuming nature of data extraction, loading and validation of ETL implementations, calls for automation of repetitive activities involved.

A comprehensive validation of ETL-processes requires multiple validations at various interface points to establish if the ETL process is effective. Furthermore, ensuring data integrity requires testing of data warehouse as a whole, with multiple services between the ETL processes, the middleware, dashboard visualizations, complex interfacing logic, and messaging across the several layers.

Adopting a continuous testing approach requires automating the testing of different source and target data sets that helps rationalize testing effort when handling multiple and diverse data sources and volumes.

Leveraging the right tool for ETL Automation

While several advanced tools for test automation for various technology streams and platforms are available, tools that can automate the comprehensive testing activity of the ETL process model are limited.

The fundamental check in any ETL Testing is validating if all the data records have been loaded correctly into the data warehouse. This activity could be as simple as comparing a record count of the source and the target table or as complex as validating the data being transformed to a completely different structure through a data modelling technique. When there are multiple data source and targets, the data validation becomes more complex and crucial to the effectiveness of the implementation. 

While SQL Based ETL Testing is the initial step to ETL Testing, the very basic nature of the query-driven validation falls short when data arises from multiple sources and formats to be handled.  The queries would be limited to a single system and format that can be connected to. Furthermore the constant transformation rule changes, typical in large ETL implementations, would require constant maintenance efforts to update the queries to reflect the changes.

It is clear, that the need of the hour is a tool or a framework that would, apart from providing extensive automation support for a multi-technology and data source landscape, should be able to integrate with the test management tools available in the market.

Scenario: Data Consolidation for a Multi-Technology, Multi-application, Multi-Geo Landscape

Let us consider a scenario with one of our customers in the Travel IT space. The scenario involved consolidation of Travel Revenue Systems, of different source markets (geographies) with different instances of the backend revenue management systems (RMS). These different instances had to be merged into a single instance supporting all source markets where each source market can access the data and features relevant to them.

The backend revenue management system had to process the inventories by analyzing the sales of multiple source markets individually. It also had to provide yield adjustments to reservation system at source market level; support numerous currencies and source market data across geographies.

The challenges that transpired during data testing for an ETL implementation of this scale were:

  • Virtualization, manual effort and errors, support for large dataset volumes, performance monitoring
  • Support for (Extraction and Validation) different source (CSV/parquet/json) and target (Hive/presto/s3/Hdfs) systems
  • Comparison of data frames
  • Development, Synchronization, and Maintenance of the Test Cases & Scripts
  • Integration and synchronization with CI-CD pipelines

Building Custom Frameworks: Relevance and Maintenance

Given the limitations of the SQL Query based validation of data, the python Unit test framework was evaluated and recommended for the end-to-end data validation.  The python test framework was customized to automate tests, tie in with CI-CD (Continuous Integration, Continuous Delivery) pipeline, and integrate with the qTest Tool for End to End Test Automation and Management.

This python Unit Test Framework –

•Supported data comparisons across multiple data sources (CSV, parquet, json, table)

• Supported configuration file-based connectivity to multiple data sources and targets

•Was used to connect to multiple sources using apache spark and python, supported comparison with data frames

• Supported development and execution of unit test cases, integration test cases, functional and comparison test cases

• Supported continuous execution of test cases on-demand or on the trigger of changes within the delivery pipeline

• Supported integration of automation test cases into qtest using Jenkins

• Tracked automation test cases execution daily, helped to generate reports and programmed to send to the test team and business

Customization on Python Unit Test Framework

The customization of the framework allowed extension of the automation across the spectrum of testing, while integrating the test management tools for a comprehensive test management coordination. Tests that were done with the framework are listed below:

  • Test Automation, Execution & Maintenance
    • Monitored End-to-end system using with Automation test cases
    • Implemented Integration test cases, functional test cases, performance test cases and Unit test cases
    • Integrated all automated test cases into qtest using with Jenkins pipelines
    • Generated reports on a scheduled basis
    • Comparison of different data sources and targets
    • Connected to Hadoop ecosystems
    • High Frequency of Regression Test Cycles
  • Support for Multiple Data Sources
    •  Support for data comparison across multiple data sources
    •  Reading from different source and target systems with multiple hops
    • Data Integration and Data Sync Verification
    • Data format conversion & loading
  • Test Maintenance
    • Maintenance of test cases, re-usability of test cases, minimize manual test cases
  • Test Execution & Maintenance
    • Scheduled Execution of Automated Tests
    • Auto Publishing of Results and Reports to Business Teams
    • Maintenance of the automation suite

The Test Automation function is one of the most apt solution to overcome the existing manual testing snag. Automation testing can save you both time and money.

If you want to know how we can assist you in your path to automation, reach out to us at