Site Reliability Engineer at Zeta (2022-09 – Present)
Optimized Kubernetes cluster resource utilization, managed deployments, enhanced observability, handled on-call incident management, and maintained 99.9%+ service uptime for mission-critical production systems.
- Optimized Kubernetes cluster resource utilization by analyzing performance bottlenecks using a custom script to collect key metrics. Implemented HPA, affinity rules, priority classes, topology adjustments, and JVM metric exposure, leading to a 30% reduction in resource wastage and a 20% improvement in application performance and stability.
- Independently managed deployments of new Kubernetes clusters using a rolling deployment strategy, ensuring zero downtime and 30% faster service transitions. Optimized cluster performance by fine-tuning resource requests/limits, configuring Horizontal Pod Autoscaler (HPA), and implementing node affinity rules, leading to a 35% improvement in resource utilization.
- Enhanced observability in Grafana by creating 4 dashboards for monitoring Spring Boot services in a different production zone. Configured Prometheus as a data source and set up key performance metrics (JVM memory, request latency, and error rates), improving real-time monitoring and incident response efficiency by 50%.
- Handled on-call incident management, troubleshooting critical production issues. Diagnosed application failures, infrastructure issues, and performance bottlenecks in Kubernetes, AWS, and CI/CD pipelines using logs and Prometheus monitoring. Coordinated with cross-functional teams for swift resolution and aligned with Customer Support for client communication. Authored internal and external RCAs, implementing preventive measures to reduce recurring incidents and improve system reliability.
- Troubleshot and resolved complex Kubernetes issues including pod crashes, deployment failures, resource bottlenecks, CrashLoopBackOff errors, volume mounting issues, and service disruptions using kubectl and monitoring tools.
- Proactively monitored system health using real-time metrics, log analysis, and intelligent alerting, identifying and resolving issues before customer impact.
- Utilized Terraform to manage AWS infrastructure as code (EC2, VPC, IAM, S3, ECR), maintaining versioned, peer-reviewed configurations that improved environment consistency and reduced provisioning time across dev, staging, and production.
- Collaborated with engineering, QA, and production support teams to ensure smooth releases, stable operations, and seamless code migrations across staging, demo, and production environments.
- Consistently maintained 99.9%+ service uptime for mission-critical production systems by engineering robust monitoring, alerting, and automated recovery mechanisms aligned with defined SLOs and SLIs.
- Deployed a script as a daily Apache Airflow DAG to check A records for Zeta customers' hosted websites, querying AWS Route 53 DNS, monitoring blacklist status, and triggering notifications and Zendesk ticket creation, achieving 100% compliance and decreasing downtime due to blacklisted IPs by 70%.
- Investigated and eliminated JVM memory leaks by capturing and analyzing heap dumps using jmap, preventing OOMKilled pods in production Kubernetes clusters.
- Strengthened infrastructure security by integrating AWS Security Hub with CloudTrail, Amazon Inspector, and AWS Shield, improving threat detection accuracy by 35% and ensuring compliance with AWS PCI and non-PCI benchmarks.
Application Support Engineer at Wipro Technologies Pvt Ltd (2022-02 – 2022-09)
Served as primary point of contact for L1 support, managing data requests, analyzing system performance metrics, and collaborating with QA and development teams for issue resolution.
- Served as primary point of contact for L1 support, managing data requests and coordinating with L2 teams for complex issue resolution.
- Analyzed and optimized system performance metrics including CPU utilization, memory usage, disk I/O, and network throughput to prevent resource exhaustion and ensure SLA compliance.
- Proactively tracked system health using real-time metrics and log analysis, identifying and resolving issues before customer impact using Grafana and Kibana.
- Documented actionable bugs with comprehensive reproduction steps and root cause analysis for engineering resolution, improving defect resolution time.
- Implemented lean principles to enhance workflow efficiency, reduce waste, and streamline support processes.
- Worked closely with QA and development teams to identify, reproduce, and resolve software bugs and implement enhancements based on user feedback.