
DevOps Engineer

Technology
EPAM Systems
Posted 1 month ago
Open until 8/4/2026
On-site

Job description

We are seeking a Middle DevOps Engineer to deliver Kubernetes and Linux automation for GPU-enabled platforms that support advanced AI and research workloads. You will run Volcano scheduling, manage quotas and workload isolation, and build Python and UNIX shell tooling to streamline operations as part of a client-facing team. Apply today to help scale reliable compute environments.

Responsibilities

Configure and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to improve workload scheduling and overall efficiency

Coordinate Volcano job scheduling by managing queues, Pods, GPU allocations, and namespace quota controls

Administer Kubernetes foundations including namespaces, RBAC, resource quotas, and workload isolation strategies

Create and maintain Python and shell scripts to automate job submission, resource allocation, and monitoring activities

Coordinate with orchestration, optimization, and observability teams to raise scheduling performance, utilization, and researcher productivity

Monitor infrastructure health and resource consumption, and provide data for reporting and optimization decisions

Deliver enhancements to infrastructure, tools, and automation processes to increase scalability, performance, and user satisfaction

Provide operational support that ensures researchers have a smooth environment for AI and computational work

Requirements

2+ years of experience in DevOps or infrastructure engineering roles managing complex, large-scale systems

In-depth Kubernetes administration skills across namespaces, Pod scheduling and balancing, PVCs, NFS, and resource quota controls

Experience with Volcano for GPU job scheduling, including queue setup, prioritization, and Kubernetes integration

Track record operating GPU cluster environments in Kubernetes and standalone Linux for high-performance computing

Advanced Python scripting capability for infrastructure automation, job handling, and system monitoring

Proficiency with UNIX shell scripting (including Bash) to automate tasks and enhance operational workflows

Strong Linux system administration knowledge for troubleshooting, performance optimization, and configuration management

Thorough grasp of automation and orchestration tools and practices to support scalable, dependable infrastructure

Excellent English communication skills (spoken and written) for client work and collaboration with cross-functional teams

Nice to have

Helm knowledge for managing Kubernetes application packaging and configuration

Experience with Prometheus, Grafana, and Loki for monitoring and observability

Familiarity with Terraform for Infrastructure as Code provisioning and management

Exposure to Amazon EKS and Google GKE for multi-cloud Kubernetes orchestration

Azure networking experience with VPN, ExpressRoute, and network security practices

Experience leveraging GitHub Copilot, ChatGPT, or Claude to improve development efficiency and code quality

Understanding of hybrid scheduling and resource optimization across cloud and on-premises platforms

We offer

International projects with top brands

Work with global teams of highly skilled, diverse peers

Healthcare benefits

Employee financial programs

Paid time off and sick leave

Upskilling, reskilling and certification courses

Unlimited access to the LinkedIn Learning library and 22,000+ courses

Global career opportunities

Volunteer and community involvement opportunities

EPAM Employee Groups

Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn

Keywords
Kubernetes, Linux, Automation, GPU, Volcano Scheduling, Python, UNIX Shell Scripting, RBAC, Resource Quotas, Workload Isolation, Monitoring, Orchestration, Optimization, Observability, Bash, System Administration, DevOps, AI, Research Workloads, Volcano, Scheduling, Quotas, Isolation, Client-facing, High-Performance Computing, Terraform, Prometheus, Grafana, Loki, EKS, GKE, Azure Networking, GitHub Copilot, ChatGPT, Claude
