Mani Kovvuru

Site Reliability Engineer at Zeta (2022-09 – Present)

Optimized Kubernetes cluster resource utilization, managed deployments, enhanced observability, handled on-call incident management, and maintained 99.9%+ service uptime for mission-critical production systems.

Optimized Kubernetes cluster resource utilization by analyzing performance bottlenecks using a custom script to collect key metrics. Implemented HPA, affinity rules, priority classes, topology adjustments, and JVM metric exposure, leading to a 30% reduction in resource wastage and a 20% improvement in application performance and stability.
Independently managed deployments of new Kubernetes clusters using a rolling deployment strategy, ensuring zero downtime and 30% faster service transitions. Optimized cluster performance by fine-tuning resource requests/limits, configuring Horizontal Pod Autoscaler (HPA), and implementing node affinity rules, leading to a 35% improvement in resource utilization.
Enhanced observability in Grafana by creating 4 dashboards for monitoring Spring Boot services in a different production zone. Configured Prometheus as a data source and set up key performance metrics (JVM memory, request latency, and error rates), improving real-time monitoring and incident response efficiency by 50%.

Handled on-call incident management, troubleshooting critical production issues. Diagnosed application failures, infrastructure issues, and performance bottlenecks in Kubernetes, AWS, and CI/CD pipelines using logs and Prometheus monitoring. Coordinated with cross-functional teams for swift resolution and aligned with Customer Support for client communication. Authored internal and external RCAs, implementing preventive measures to reduce recurring incidents and improve system reliability.
Troubleshot and resolved complex Kubernetes issues including pod crashes, deployment failures, resource bottlenecks, CrashLoopBackOff errors, volume mounting issues, and service disruptions using kubectl and monitoring tools.
Proactively monitored system health using real-time metrics, log analysis, and intelligent alerting, identifying and resolving issues before customer impact.
Utilized Terraform to manage AWS infrastructure as code (EC2, VPC, IAM, S3, ECR), maintaining versioned, peer-reviewed configurations that improved environment consistency and reduced provisioning time across dev, staging, and production.
Collaborated with engineering, QA, and production support teams to ensure smooth releases, stable operations, and seamless code migrations across staging, demo, and production environments.
Consistently maintained 99.9%+ service uptime for mission-critical production systems by engineering robust monitoring, alerting, and automated recovery mechanisms aligned with defined SLOs and SLIs.
Deployed a script as an Apache Airflow DAG (runs every day, settlements) to check A records for Zeta customers' website hosting by querying AWS Route 53 DNS, monitoring blacklist status, and triggering notifications and Zendesk ticket creation, achieving 100% compliance and decreasing downtime due to blacklisted IPs by 70%.
Investigated and eliminated JVM memory leaks by capturing and analyzing heap dumps using jmap, preventing OOMKilled pods in production Kubernetes clusters.
Improved infrastructure security by integrating AWS Security Hub with CloudTrail, Amazon Inspector, and AWS Shield, improving threat detection accuracy by 35% and ensuring compliance with AWS PCI and non-PCI benchmarks.

Application Support Engineer at Wipro Technologies Pvt Ltd (2022-02 – 2022-09)

Served as primary point of contact for L1 support, managing data requests, analyzing system performance metrics, and collaborating with QA and development teams for issue resolution.

Served as primary point of contact for L1 support, managing data requests and coordinating with L2 teams for complex issue resolution.
Analyzed and optimized system performance metrics including CPU utilization, memory usage, disk I/O, and network throughput to prevent resource exhaustion and ensure SLA compliance.
Proactively tracked system health using real-time metrics and log analysis, identifying and resolving issues before customer impact using Grafana and Kibana.
Documented actionable bugs with comprehensive reproduction steps and root cause analysis for engineering resolution, improving defect resolution time.
Implemented lean principles to enhance workflow efficiency, reduce waste, and streamline support processes.
Worked closely with QA and development teams to identify, reproduce, and resolve software bugs and implement enhancements based on user feedback.

