Data Engineer at Cognizant Technology Solutions (2022-08 – Present)
Data Engineer with expertise in building scalable data pipelines using Databricks, Delta Lake, and AWS services
- Architected end-to-end ingestion pipelines in a medallion (Bronze–Silver–Gold) lakehouse on Databricks and Delta Lake, integrating 6+ enterprise sources (SAP, Oracle, Veeva CRM, APIs, partner feeds) for batch and near-real-time processing into a unified Snowflake analytics warehouse
- Developed PySpark pipelines on Databricks using broadcast joins, window functions for SCD Type 2 history tracking, and dynamic partition overwrite for late-arriving data, processing millions of records within SLA
- Built event-driven ingestion pipelines using Auto Loader and Spark Structured Streaming with Kafka sources, enabling low-latency ingestion with schema evolution and fault tolerance
- Developed CDC-based pipelines with Delta MERGE for idempotent upserts and deduplication, ensuring accurate current-state data and handling late-arriving records
- Implemented data governance and security using Unity Catalog, including row-level access control, column-level masking for PII, and centralized lineage tracking
- Orchestrated end-to-end data workflows in Apache Airflow across 10+ interdependent DAGs, with sensor-based dependencies, SLA tracking, and automated failure alerting
- Built a config-driven data quality framework enforcing schema validation, null checks, referential integrity, and composite-key deduplication before downstream data promotion
- Architected a 3-layer Snowflake warehouse (Staging → Enriched → DataMart) using MERGE-based loads, clustering keys for query performance, and scalable transformation patterns
- Designed a multi-pattern ingestion architecture (CDC, streaming, API, batch) on AWS handling 100M+ records/day with zero data loss, achieved through checkpoint-based recovery, dead-letter queues (DLQs), and idempotent processing
- Built end-to-end CDC pipelines using AWS DMS (full-load + CDC) with Multi-AZ failover, optimized LOB handling, and downstream merge logic driven by the DMS Op (insert/update/delete) flag for reliable replay and consistency
- Engineered a 3-zone S3 data lake (Raw, Quarantine, Curated) using AWS Glue with strict data contracts, enabling schema validation, auditability, and automated reprocessing workflows for failed records
- Engineered a schema drift detection and quarantine framework that compares runtime schema fingerprints against the Glue Data Catalog and routes non-conforming records to the S3 quarantine zone with enriched error metadata
- Developed scalable Glue Spark pipelines with Delta Lake SCD Type 2 MERGE, incorporating Deequ-based data quality checks, dynamic partitioning, and Z-order optimization for large-scale transformations
- Orchestrated pipelines using Step Functions with parallel execution and idempotent design, and built Redshift integration using manifest-based COPY for transactional loads and high-performance BI consumption