
DevOps Engineer

Technology
EPAM Systems
Posted 1 month ago · Until 10/4/2026
On-site

Job description

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture.

Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

We are delivering resilient Kubernetes and Linux platforms optimized for GPU scheduling and large-scale automation in AI compute environments. As a Middle DevOps Engineer, you will operate Kubernetes (including Volcano) and Linux GPU clusters, automate workflows with Python and UNIX shell scripting, and partner with a client-facing delivery team. Apply to help build reliable, high-throughput compute platforms for advanced AI workloads.

Responsibilities

Deploy, configure, and run GPU-enabled Kubernetes clusters and standalone Linux compute environments while keeping scheduling and performance optimized

Implement and manage Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement

Administer Kubernetes end to end, including namespaces, RBAC, resource quotas, and workload isolation approaches

Develop and maintain Python and Shell automation to simplify job submission, resource provisioning, and system reporting

Collaborate with orchestration, optimization, and observability teams to boost scheduling efficiency, improve capacity utilization, and streamline researcher workflows

Monitor infrastructure health and resource utilization, supplying data and feedback for optimization and reporting needs

Identify opportunities to improve infrastructure, tooling, and automation workflows to raise performance, scalability, and usability

Ensure operational processes deliver a smooth and efficient experience for researchers running diverse AI and computational workloads

Requirements

2+ years of hands-on experience in DevOps or infrastructure engineering in complex, large-scale environments

Deep expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling/distribution, PVC, NFS, and resource quota management

Practical experience using the Volcano scheduler for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes

Proven ability to run GPU cluster environments in Kubernetes and on standalone Linux compute nodes

Advanced Python scripting skills for infrastructure automation, plus proficiency in UNIX Shell scripting such as Bash

Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management

Solid understanding of infrastructure automation and orchestration concepts and related tooling

Fluent English communication skills (spoken and written) for direct client interaction

Nice to have

Knowledge of Helm package management for Kubernetes applications

Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki

Skills in Infrastructure as Code tools such as Terraform

Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE

Understanding of Azure Networking including VPN, ExpressRoute, and network security

Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT, and Claude

Experience with hybrid (cloud and on-premises) scheduling and resource optimization

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Keywords
Kubernetes, Linux, Python, UNIX Shell Scripting, Volcano Scheduler, GPU Scheduling, Automation, RBAC, Resource Quotas, Bash, Terraform, Prometheus, Grafana, Loki, Helm, EKS, DevOps, Digital Platform Engineering, AI Compute Environments, Volcano, Shell Scripting, Infrastructure as Code, GKE, Azure Networking, High-throughput Compute Platforms
