EPAM Systems
We are building robust GPU-ready Kubernetes and Linux platforms and need a Senior DevOps Engineer to automate, scale, and optimize orchestration. You will administer Kubernetes with Volcano scheduling, quotas, and isolation while automating with Python and Bash for AI and research workloads. Join our delivery team and apply today.
Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance optimal
Implement and operate Volcano job scheduling, including queue setup, pod execution, GPU allocation, and namespace quota enforcement
Administer Kubernetes end-to-end, covering namespaces, RBAC, resource quotas, and workload isolation strategies
Develop and maintain Python and shell automation to simplify job submission, resource provisioning, and system reporting
Collaborate with orchestration, optimization, and observability teams to improve scheduling efficiency, capacity utilization, and researcher workflows
Monitor infrastructure health and resource usage, supplying data and feedback for optimization and reporting requirements
Identify and propose improvements across infrastructure, tooling, and automation workflows to increase performance, scalability, and usability
Ensure operational processes provide researchers with a smooth and efficient experience across varied AI and computational workloads
At least 3 years of experience in DevOps or infrastructure engineering across complex, large-scale environments
Expert-level Kubernetes administration skills, including namespaces, pod scheduling and distribution, PVC, NFS, and resource quota management
Hands-on experience with Volcano scheduler for GPU job execution, including queue configuration and workload prioritization with Kubernetes integration
Proven ability to operate GPU cluster environments in Kubernetes as well as on standalone Linux compute nodes
Advanced Python scripting expertise for infrastructure automation, plus Unix shell scripting skills (for example, Bash)
Strong Linux system administration capabilities, including troubleshooting, performance tuning, and configuration management
Solid understanding of infrastructure automation and orchestration concepts and tooling
Fluent English communication skills (spoken and written) for direct client interaction
Knowledge of Helm package management for Kubernetes applications
Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana and Loki
Skills in Infrastructure as Code tools such as Terraform
Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE
Understanding of Azure Networking including VPN, ExpressRoute and network security
Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT and Claude
Experience with hybrid (cloud and on-premises) scheduling and resource optimization
International projects with top brands
Work with global teams of highly skilled, diverse peers
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Volunteer and community involvement opportunities
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
Interested in this position?