DevOps Engineer

Technology
EPAM Systems
Posted 1 month ago
Open until 8/4/2026
On-site

Job description

EPAM is a leading global provider of digital platform engineering and development services. We are committed to having a positive impact on our customers, our employees, and our communities. We embrace a dynamic and inclusive culture.

Here you will collaborate with multi-national teams, contribute to a myriad of innovative projects that deliver the most creative and cutting-edge solutions, and have an opportunity to continuously learn and grow. No matter where you are located, you will join a dedicated, creative, and diverse community that will help you discover your fullest potential.

We operate Kubernetes and Linux GPU infrastructure with an emphasis on Volcano-based scheduling, reliability, and automation for AI compute at scale. As a Middle DevOps Engineer, you will administer Kubernetes and Linux environments, build Python and Bash automation for job workflows, and work closely with client stakeholders as part of a delivery team. Apply to help optimize compute performance and researcher experience for demanding AI workloads.

Responsibilities

Operate and tune GPU-enabled Kubernetes clusters and standalone Linux compute environments to keep scheduling and performance optimized

Manage Volcano job scheduling, including queue setup, Pod execution, GPU allocation, and namespace quota enforcement

Administer Kubernetes end to end with namespaces, RBAC, resource quotas, and workload isolation approaches

Create and support Python and Shell automation to simplify job submission, resource provisioning, and system reporting

Work with orchestration, optimization, and observability teams to raise scheduling efficiency, improve capacity utilization, and streamline researcher workflows

Measure infrastructure health and resource utilization, supplying data and feedback for optimization and reporting needs

Enhance infrastructure, tooling, and automation workflows to improve performance, scalability, and usability

Ensure operations deliver a smooth and efficient experience for researchers running diverse AI and computational workloads

Requirements

2+ years of hands-on experience in DevOps or infrastructure engineering in complex, large-scale environments

Expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling and distribution, PersistentVolumeClaims (PVCs), NFS storage, and resource quota management

Practical experience with the Volcano scheduler for GPU job execution, queue configuration, and workload prioritization integrated with Kubernetes

Proven ability to operate GPU cluster environments in Kubernetes as well as on standalone Linux compute nodes

Advanced Python scripting skills for infrastructure automation, plus proficiency in UNIX Shell scripting such as Bash

Strong Linux system administration skills, including troubleshooting, performance tuning, and configuration management

Solid understanding of infrastructure automation and orchestration concepts and related tooling

Fluent English communication skills (spoken and written) for direct client interaction

Nice to have

Knowledge of Helm package management for Kubernetes applications

Familiarity with monitoring and observability solutions, particularly Prometheus, Grafana, and Loki

Skills in Infrastructure as Code tools such as Terraform

Background in multi-cloud Kubernetes environments including Amazon EKS and Google GKE

Understanding of Azure Networking including VPN, ExpressRoute, and network security

Familiarity with AI-assisted coding tools such as GitHub Copilot, ChatGPT, and Claude

Experience with hybrid (cloud and on-premises) scheduling and resource optimization

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Keywords
Kubernetes, Linux, Volcano Scheduler, Python, Bash, Automation, RBAC, GPU Cluster, Shell Scripting, Troubleshooting, Performance Tuning, Helm, Prometheus, Grafana, Loki, Terraform, DevOps, Digital Platform Engineering, GPU Infrastructure, Volcano Scheduling, AI Compute, Middle DevOps Engineer, Stakeholder Collaboration, Performance Optimization, Researcher Experience, Resource Quotas, Python Scripting, Bash Scripting, Infrastructure Health, Capacity Utilization, Observability, EKS, GKE, Azure Networking, Hybrid Scheduling