Performance Test Data Engineer
Technology
Roseville, United States · Posted 1 month ago · Open until 5/23/2026
Contract
Job Description: Data Platform Engineer (QA Storage Focus)
Role Overview
We are looking for a Data Platform Engineer with strong QA and Data Validation experience to support large-scale data platforms. The ideal candidate will have hands-on experience in testing data pipelines, validating data lakes/storage systems, and ensuring data quality, accuracy, and performance across distributed environments.
Key Responsibilities
- Design, develop, and execute data validation and QA test strategies for ETL/ELT pipelines
- Perform end-to-end data validation between source systems and target data platforms (Data Lake / Data Warehouse)
- Validate large-scale datasets (millions/billions of records) using SQL, Python, and PySpark
- Perform file-level and storage validation across data lakes (S3 / ADLS / HDFS):
  - File count validation
  - Schema validation
  - Partition validation
  - Data completeness checks
- Test and validate data ingestion pipelines (batch & streaming)
- Validate data across Bronze / Silver / Gold layers (Medallion architecture)
- Perform data reconciliation and consistency checks across multiple systems
- Develop and maintain automated data validation frameworks using Python (PyTest or similar)
- Implement and monitor data quality checks (nulls, duplicates, referential integrity)
- Validate data formats such as Parquet, ORC, Delta Lake
- Conduct performance testing of data pipelines and queries (Spark / SQL)
- Analyze and validate data processing performance, latency, and throughput
- Collaborate with Data Engineers to identify and fix data issues and optimize pipelines
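The automated quality checks listed above (nulls, duplicates, source-to-target completeness) can be sketched in plain Python. This is an illustrative sketch only; the record layout and function names are assumptions for the example, not part of the role, and in practice these checks would run inside a PyTest suite against SQL or PySpark results.

```python
# Minimal sketch of row-level data quality checks of the kind a
# validation framework would automate. Rows are modeled as dicts;
# all names here are illustrative assumptions.

def check_nulls(rows, column):
    """Return the rows where the given column is missing or None."""
    return [r for r in rows if r.get(column) is None]

def check_duplicates(rows, key):
    """Return the key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        k = r[key]
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes

def check_completeness(source_rows, target_rows):
    """Record-count reconciliation between source and target."""
    return len(source_rows) == len(target_rows)

if __name__ == "__main__":
    source = [{"id": 1, "amount": 10.0},
              {"id": 2, "amount": None},
              {"id": 2, "amount": 5.0}]
    target = [{"id": 1, "amount": 10.0},
              {"id": 2, "amount": 5.0}]
    print(check_nulls(source, "amount"))       # one row with a null amount
    print(check_duplicates(source, "id"))      # {2}
    print(check_completeness(source, target))  # False: one record missing
```

In a real pipeline the same assertions would be expressed as PyTest test cases and pointed at query results rather than in-memory lists, but the shape of the checks is the same.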
Required Skills
Data QA / Testing
- Strong experience in ETL/ELT testing and data validation
- Expertise in SQL for data validation and reconciliation
- Experience with test case design, execution, and defect tracking
- Knowledge of data quality frameworks and validation techniques
Data Engineering Knowledge
- Understanding of data pipelines (ADF / Airflow / Glue / Databricks)
- Experience with PySpark / Apache Spark (basic to intermediate)
- Familiarity with data modeling and transformations
Storage / Data Lake Validation (MANDATORY)
- Hands-on experience with Data Lakes (AWS S3 / Azure ADLS / HDFS)
- Strong knowledge of:
  - File-based validation
  - Partitioning strategies
  - Schema evolution
- Experience validating Parquet / ORC / Delta Lake datasets
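The file-count and partition checks called out above can be illustrated with a short sketch over a hive-style partition layout (e.g. `table/dt=2024-01-01/part-0.parquet`). A local directory stands in for S3/ADLS/HDFS here, and all function names are assumptions for the example; a cloud setup would list objects through the storage API instead.

```python
# Illustrative sketch of file-level and partition validation over a
# hive-style directory layout. A local path stands in for a data lake.
from pathlib import Path

def list_partitions(table_root):
    """Return partition directory names, e.g. {'dt=2024-01-01'}."""
    root = Path(table_root)
    return {p.name for p in root.iterdir() if p.is_dir() and "=" in p.name}

def count_files(table_root, suffix=".parquet"):
    """Count data files under every partition of the table."""
    return sum(1 for _ in Path(table_root).rglob(f"*{suffix}"))

def validate_table(table_root, expected_partitions, expected_files):
    """Partition-presence and file-count checks of the kind listed above."""
    missing = set(expected_partitions) - list_partitions(table_root)
    count_ok = count_files(table_root) == expected_files
    return missing, count_ok
```

Schema validation would be layered on top by reading each file's embedded schema (Parquet and ORC both carry one) and comparing it to the expected definition, which is where schema-evolution rules come in.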
Programming & Tools
- Python (for automation/testing)
- SQL (strong)
- Experience with PyTest / automation frameworks
- Git and CI/CD basics
Cloud Platforms (Any One)
- AWS (S3, Glue, Athena) OR
- Azure (ADLS, ADF, Databricks)
Nice to Have
- Experience with Great Expectations / Deequ (data quality tools)
- Knowledge of Kafka / streaming validation
- Experience with Delta Lake features (time travel, versioning)
- Exposure to data governance tools (Glue Catalog, Unity Catalog)
Ideal Candidate Profile
- Strong Data Engineer with QA/testing experience
- Hands-on with data validation for storage systems and data lakes
- Comfortable working with large-scale distributed data platforms
- Detail-oriented with a focus on data accuracy, quality, and performance
Keywords
Unix, Apache Kafka, Apache Hadoop, Schema, Apache Spark, Referential integrity, Partitioning, Airflow, Python, SQL, Apache Parquet, ORC, Unity Catalog, AWS, Git
Interested in this position?