DevOps Engineer

We are looking for a Mid-level DevOps Engineer to automate and optimize Kubernetes platforms for GPU workloads and the Linux infrastructure behind AI research. You will implement and support Volcano-based scheduling, quotas, and cluster operations using Python and UNIX shell scripting while partnering with engineers and researchers. Apply to help deliver reliable, scalable compute environments.

Responsibilities

Deploy, configure, and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to improve scheduling and overall efficiency

Administer Volcano scheduling by setting up queues, managing Pods, assigning GPU resources, and enforcing namespace quota controls

Own Kubernetes platform management across namespaces, RBAC, resource quotas, and workload isolation approaches

Develop and maintain Python and Shell automation to simplify job submission, resource allocation, and infrastructure monitoring

Collaborate with orchestration, optimization, and observability teams to increase scheduling throughput, resource utilization, and researcher productivity

Monitor infrastructure health and resource consumption, and share metrics to guide optimization and reporting

Recommend and deliver improvements to infrastructure, tools, and automation processes to enhance scalability, performance, and user experience

Support operational routines that give researchers a smooth environment for AI and computational workloads

Requirements

2+ years of professional experience in DevOps or infrastructure engineering, supporting complex, large-scale systems

Deep expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling and balancing, PVCs, NFS, and resource quota controls

Hands-on experience with the Volcano scheduler for GPU job management, covering queue configuration, workload prioritization, and Kubernetes integration

Proven ability to run GPU cluster environments in Kubernetes and standalone Linux configurations for high-performance computing

Advanced Python scripting skills for automating infrastructure operations, job handling, and system monitoring

Practical proficiency in UNIX Shell scripting (e.g., Bash) to automate system tasks and streamline operational workflows

Strong Linux system administration background, including troubleshooting, performance optimization, and configuration management

Thorough understanding of automation and orchestration tooling and concepts to build scalable, dependable infrastructure

Excellent English communication skills (spoken and written) for direct client engagement and collaboration across teams

Nice to have

Experience with Helm to package and manage Kubernetes applications

Knowledge of monitoring and observability tools such as Prometheus, Grafana, and Loki to track health and performance

Familiarity with Infrastructure as Code tools like Terraform for automated cloud provisioning and management

Background with multi-cloud Kubernetes platforms including Amazon EKS and Google GKE

Skills in Azure networking, including VPN setup, ExpressRoute configuration, and network security

Experience using AI coding assistants (GitHub Copilot, ChatGPT, Claude) to improve development speed and code quality

Understanding of hybrid scheduling and resource optimization across cloud and on-premises environments

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn
