DevOps/Cloud Engineer
Cloud & DevOps Engineer with expertise in AWS, OpenStack, OpenShift, Linux, Kubernetes, and CI/CD automation. Adept at designing and managing cloud infrastructure, monitoring solutions, and containerized deployments. Passionate about DevOps automation, cloud security, and site reliability engineering (SRE).
Managed OpenStack infrastructure across compute, storage, and networking services,
ensuring 24/7 availability.
Integrated monitoring tools like Grafana, Kibana, and Prometheus for 70+ Pan-India sites.
Expanded compute, KVM, and Ceph nodes on live sites, following proper CR and incident timelines.
Configured and monitored virtual machine (VM) health and performance using
Prometheus and custom exporters.
Performed OpenStack upgrades and migrations with minimal downtime, ensuring seamless service continuity.
Automated routine tasks and system upgrades to enhance operational efficiency.
Collaborated with cross-functional teams to troubleshoot performance bottlenecks in
Nova, Neutron, and Cinder services.
Validated MBSS documents for the H-Cloud Project while implementing security hardening concepts.
Managed and configured backups for all sites using the Commvault backup tool.
Deployed and maintained monitoring tools (Prometheus, Thanos, Grafana) for real-time observability and long-term metrics storage.
Integrated Thanos for scalable and centralized metric queries across distributed
Prometheus instances.
Used the Prometheus query language (PromQL) and Kibana to support AI/ML use cases.
Performed all testing and alert configuration for site expansions and new node validations.
Created custom dashboards and set up alerts for critical infrastructure metrics.
Led site expansion projects by provisioning compute and storage resources using Heat templates and manual orchestration.
Validated and tested newly expanded nodes to ensure optimal performance and reliability.
Created and resolved TTs (Trouble Tickets) and CRs (Change Requests) for all issues and activities, ensuring compliance with SLAs.
Monitored and ensured the timely delivery of health checkup reports using Python and
Linux scripting.
Managed multiple projects within OpenShift, handling resource allocation and scaling for services and pods.
Created, updated, and deleted ConfigMaps and Secrets to manage dynamic application configurations.
Integrated Prometheus and Grafana for real-time monitoring and improved cluster observability.
Troubleshot pod failures and network issues, achieving 99.9% application uptime.
Monitored pod health and resolved performance issues using OpenShift Web Console and CLI tools.
Contributed to automation scripts for container builds, deployment rollbacks, and log collection.
Integrated multiple network components in the NLM project, applying data analytics and
Excel skills to data in Logstash and Elasticsearch.
Utilized advanced data analysis techniques to ensure accurate insights and system performance.