DevOps Engineer
EPAM Systems
Job description
We are seeking a Middle DevOps Engineer to deliver Kubernetes and Linux automation for GPU-enabled platforms supporting advanced AI and research workloads. You will run Volcano scheduling, manage quotas and isolation, and build Python and UNIX shell tooling to streamline operations in a client-facing team. Apply today to help scale reliable compute environments.
Responsibilities
Configure and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to improve workload scheduling and overall efficiency
Coordinate Volcano job scheduling by managing queues, Pods, GPU allocations, and namespace quota controls
Administer Kubernetes foundations including namespaces, RBAC, resource quotas, and workload isolation strategies
Create and maintain Python and Shell scripts to automate job submission, resource allocation, and monitoring activities
Coordinate with orchestration, optimization, and observability teams to raise scheduling performance, utilization, and researcher productivity
Measure infrastructure condition and resource consumption, and provide data for reporting and optimization decisions
Deliver enhancements to infrastructure, tools, and automation processes to increase scalability, performance, and user satisfaction
Provide operational support that ensures researchers have a smooth environment for AI and computational work
Requirements
2+ years of experience in DevOps or infrastructure engineering roles managing complex, large-scale systems
In-depth Kubernetes administration skills across namespaces, Pod scheduling and balancing, PVC, NFS, and resource quota controls
Experience with Volcano for GPU job scheduling, including queue setup, prioritization, and Kubernetes integration
Track record operating GPU cluster environments in Kubernetes and standalone Linux for high-performance computing
Advanced Python scripting capability for infrastructure automation, job handling, and system monitoring
Proficiency with UNIX Shell scripting (including Bash) to automate tasks and enhance operational workflows
Strong Linux system administration knowledge for troubleshooting, performance optimization, and configuration management
Thorough grasp of automation and orchestration tools and practices to support scalable, dependable infrastructure
Excellent English communication skills (spoken and written) for client work and collaboration with cross-functional teams
Nice to have
Helm knowledge for managing Kubernetes application packaging and configuration
Experience with Prometheus, Grafana, and Loki for monitoring and observability
Familiarity with Terraform for Infrastructure as Code provisioning and management
Exposure to Amazon EKS and Google GKE for multi-cloud Kubernetes orchestration
Azure networking experience with VPN, ExpressRoute, and network security practices
Experience leveraging GitHub Copilot, ChatGPT, or Claude to improve development efficiency and code quality
Understanding of hybrid scheduling and resource optimization across cloud and on-premises platforms
We offer
International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn