DevOps Engineer

Technology
EPAM Systems
Posted 1 month ago
Until 10/4/2026
On-site

Job description

We are building scalable Kubernetes and Linux infrastructure designed for GPU workloads, efficient scheduling, and repeatable automation at scale. As a Middle DevOps Engineer, you will support and enhance Kubernetes environments with Volcano, operate Linux compute nodes, and deliver automation in Python and Bash within a client-facing team. Apply to help researchers run AI jobs smoothly on reliable compute platforms.

Responsibilities

Install, configure, and operate GPU-enabled Kubernetes clusters and standalone Linux compute environments to maintain optimized scheduling and performance

Configure and administer Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement

Manage Kubernetes end to end, covering namespaces, RBAC, resource quotas, and workload isolation approaches

Build and maintain Python and Shell automation to streamline job submission, resource provisioning, and system reporting

Partner with orchestration, optimization, and observability teams to improve scheduling efficiency, increase capacity utilization, and simplify researcher workflows

Track infrastructure health and resource utilization, providing data and feedback for optimization and reporting needs

Drive enhancements to infrastructure, tooling, and automation workflows to improve performance, scalability, and usability

Support operational processes that ensure a smooth and efficient experience for researchers running diverse AI and computational workloads

Requirements

2+ years of hands-on experience in DevOps or infrastructure engineering within complex, large-scale environments

Strong expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling/distribution, PVC, NFS, and resource quota management

Practical experience with the Volcano scheduler for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes

Proven ability to operate GPU cluster environments in Kubernetes as well as on standalone Linux compute nodes

Advanced Python scripting skills for infrastructure automation, plus proficiency in UNIX Shell scripting such as Bash

Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management

Solid understanding of infrastructure automation and orchestration concepts and related tooling

Fluent English communication skills (spoken and written) for direct client interaction

Nice to have

Knowledge of Helm package management for Kubernetes applications

Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki

Skills in Infrastructure as Code tools such as Terraform

Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE

Understanding of Azure Networking including VPN, ExpressRoute, and network security

Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT, and Claude

Experience with hybrid (cloud and on-premises) scheduling and resource optimization

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Keywords
Kubernetes, Linux, GPU Workloads, Volcano Scheduler, Automation, Python, Bash, RBAC, Resource Quotas, Shell Scripting, System Administration, Performance Tuning, Infrastructure as Code, Terraform, Prometheus, Grafana, DevOps, GPU, Workloads, Scheduling, Volcano, Compute Nodes, Client-facing, AI Jobs, Workload Isolation, Orchestration, Observability, Helm, Loki, EKS, GKE, Azure Networking, Hybrid Cloud