Performance Test Data Engineer
Technology
Roseville, United States · Posted 1 month ago · Open until 5/23/2026
Contract
Job Description: Data Platform Engineer (QA Storage Focus)
Role Overview
We are looking for a Data Platform Engineer with strong QA and Data Validation experience to support large-scale data platforms. The ideal candidate will have hands-on experience in testing data pipelines, validating data lakes/storage systems, and ensuring data quality, accuracy, and performance across distributed environments.
Key Responsibilities
- Design, develop, and execute data validation and QA test strategies for ETL/ELT pipelines
- Perform end-to-end data validation between source systems and target data platforms (Data Lake / Data Warehouse)
- Validate large-scale datasets (millions/billions of records) using SQL, Python, and PySpark
- Perform file-level and storage validation across data lakes (S3 / ADLS / HDFS):
  - File count validation
  - Schema validation
  - Partition validation
  - Data completeness checks
- Test and validate data ingestion pipelines (batch & streaming)
- Validate data across Bronze / Silver / Gold layers (Medallion architecture)
- Perform data reconciliation and consistency checks across multiple systems
- Develop and maintain automated data validation frameworks using Python (PyTest or similar)
- Implement and monitor data quality checks (nulls, duplicates, referential integrity)
- Validate data formats such as Parquet, ORC, Delta Lake
- Conduct performance testing of data pipelines and queries (Spark / SQL)
- Analyze and validate data processing performance, latency, and throughput
- Collaborate with Data Engineers to identify and fix data issues and optimize pipelines
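The automated quality checks listed above (nulls, duplicates, source-to-target completeness) can be sketched in plain Python. This is an illustrative sketch only; the record layout and function names are assumptions for the example, not part of the role, and in practice these checks would run inside a PyTest suite against SQL or PySpark results.

```python
# Minimal sketch of row-level data quality checks of the kind a
# validation framework would automate. Rows are modeled as dicts;
# all names here are illustrative assumptions.

def check_nulls(rows, column):
    """Return the rows where the given column is missing or None."""
    return [r for r in rows if r.get(column) is None]

def check_duplicates(rows, key):
    """Return the key values that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        k = r[key]
        if k in seen:
            dupes.add(k)
        seen.add(k)
    return dupes

def check_completeness(source_rows, target_rows):
    """Record-count reconciliation between source and target."""
    return len(source_rows) == len(target_rows)

if __name__ == "__main__":
    source = [{"id": 1, "amount": 10.0},
              {"id": 2, "amount": None},
              {"id": 2, "amount": 5.0}]
    target = [{"id": 1, "amount": 10.0},
              {"id": 2, "amount": 5.0}]
    print(check_nulls(source, "amount"))       # one row with a null amount
    print(check_duplicates(source, "id"))      # {2}
    print(check_completeness(source, target))  # False: one record missing
```

In a real pipeline the same assertions would be expressed as PyTest test cases and pointed at query results rather than in-memory lists, but the shape of the checks is the same.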
Required Skills
Data QA / Testing
- Strong experience in ETL/ELT testing and data validation
- Expertise in SQL for data validation and reconciliation
- Experience with test case design, execution, and defect tracking
- Knowledge of data quality frameworks and validation techniques
Data Engineering Knowledge
- Understanding of data pipelines (ADF / Airflow / Glue / Databricks)
- Experience with PySpark / Apache Spark (basic to intermediate)
- Familiarity with data modeling and transformations
Storage / Data Lake Validation (MANDATORY)
- Hands-on experience with Data Lakes (AWS S3 / Azure ADLS / HDFS)
- Strong knowledge of:
  - File-based validation
  - Partitioning strategies
  - Schema evolution
- Experience validating Parquet / ORC / Delta Lake datasets
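The file-count and partition checks called out above can be illustrated with a short sketch over a hive-style partition layout (e.g. `table/dt=2024-01-01/part-0.parquet`). A local directory stands in for S3/ADLS/HDFS here, and all function names are assumptions for the example; a cloud setup would list objects through the storage API instead.

```python
# Illustrative sketch of file-level and partition validation over a
# hive-style directory layout. A local path stands in for a data lake.
from pathlib import Path

def list_partitions(table_root):
    """Return partition directory names, e.g. {'dt=2024-01-01'}."""
    root = Path(table_root)
    return {p.name for p in root.iterdir() if p.is_dir() and "=" in p.name}

def count_files(table_root, suffix=".parquet"):
    """Count data files under every partition of the table."""
    return sum(1 for _ in Path(table_root).rglob(f"*{suffix}"))

def validate_table(table_root, expected_partitions, expected_files):
    """Partition-presence and file-count checks of the kind listed above."""
    missing = set(expected_partitions) - list_partitions(table_root)
    count_ok = count_files(table_root) == expected_files
    return missing, count_ok
```

Schema validation would be layered on top by reading each file's embedded schema (Parquet and ORC both carry one) and comparing it to the expected definition, which is where schema-evolution rules come in.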
Programming & Tools
- Python (for automation/testing)
- SQL (strong)
- Experience with PyTest / automation frameworks
- Git and CI/CD basics
Cloud Platforms (Any One)
- AWS (S3, Glue, Athena) OR
- Azure (ADLS, ADF, Databricks)
Nice to Have
- Experience with Great Expectations / Deequ (data quality tools)
- Knowledge of Kafka / streaming validation
- Experience with Delta Lake features (time travel, versioning)
- Exposure to data governance tools (Glue Catalog, Unity Catalog)
Ideal Candidate Profile
- Strong Data Engineer with QA/testing experience
- Hands-on with data validation for storage systems and data lakes
- Comfortable working with large-scale distributed data platforms
- Detail-oriented with a focus on data accuracy, quality, and performance
Keywords
Unix, Apache Kafka, Apache Hadoop, Schema, Apache Spark, Referential integrity, Partitioning, Airflow, Python, SQL, Apache Parquet, ORC, Unity Catalog, AWS, Git
Interested in this position?