Senior DevOps Engineer

We are building robust GPU-ready Kubernetes and Linux platforms, and need a Senior DevOps Engineer to automate, scale, and optimize orchestration. You will run Kubernetes administration with Volcano scheduling, quotas, and isolation while automating with Python and Bash for AI and research workloads. Join our delivery team and apply today.

Responsibilities

Deploy, configure, and maintain GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance optimal

Implement and operate Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement

Administer Kubernetes end-to-end, covering namespaces, RBAC, resource quotas, and workload isolation strategies

Develop and maintain Python and Shell automation to simplify job submission, resource provisioning, and system reporting

Collaborate with orchestration, optimization, and observability teams to improve scheduling efficiency, capacity utilization, and researcher workflows

Monitor infrastructure health and resource usage, supplying data and feedback for optimization and reporting requirements

Identify and propose improvements across infrastructure, tooling, and automation workflows to increase performance, scalability, and usability

Ensure operational processes provide researchers with a smooth and efficient experience across varied AI and computational workloads

Requirements

At least 3 years of experience in DevOps or infrastructure engineering across complex, large-scale environments

Expert-level Kubernetes administration skills, including namespaces, Pod scheduling/distribution, PVC, NFS, and resource quota management

Hands-on experience with the Volcano scheduler for GPU job execution, including queue configuration, workload prioritization, and Kubernetes integration

Proven ability to operate GPU cluster environments in Kubernetes as well as on standalone Linux compute nodes

Advanced Python scripting expertise for infrastructure automation, plus UNIX Shell scripting skills such as Bash

Strong Linux system administration capabilities, including troubleshooting, performance tuning, and configuration management

Solid understanding of infrastructure automation and orchestration concepts and tooling

Fluent English communication skills (spoken and written) for direct client interaction

Nice to have

Knowledge of Helm package management for Kubernetes applications

Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana and Loki

Skills in Infrastructure as Code tools such as Terraform

Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE

Understanding of Azure Networking including VPN, ExpressRoute and network security

Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT and Claude

Experience with hybrid (cloud and on-premises) scheduling and resource optimization

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
