DevOps Engineer
EPAM Systems
Job description
We are looking for a Middle DevOps Engineer to automate and optimize Kubernetes platforms for GPU workloads and the Linux infrastructure behind AI research. You will implement and support Volcano-based scheduling, quotas, and cluster operations using Python and UNIX shell scripting while partnering with engineers and researchers. Apply to help deliver reliable, scalable compute environments.
Responsibilities
Deploy, configure, and support GPU-enabled Kubernetes clusters and standalone Linux compute systems to improve scheduling and overall efficiency
Administer Volcano scheduling by setting up queues, managing Pods, assigning GPU resources, and enforcing namespace quota controls
Own Kubernetes platform management across namespaces, RBAC, resource quotas, and workload isolation approaches
Develop and maintain Python and shell automation to simplify job submission, resource allocation, and infrastructure monitoring
Collaborate with orchestration, optimization, and observability teams to increase scheduling throughput, resource utilization, and researcher productivity
Monitor infrastructure health and resource consumption, and share metrics to guide optimization and reporting
Recommend and deliver improvements to infrastructure, tools, and automation processes to enhance scalability, performance, and user experience
Support operational routines that give researchers a smooth environment for AI and computational workloads
Requirements
Professional experience of 2+ years in DevOps or infrastructure engineering, supporting complex large-scale systems
Deep expertise in Kubernetes administration and orchestration, including namespaces, Pod scheduling and balancing, PVCs, NFS, and resource quota controls
Hands-on experience with the Volcano scheduler for GPU job management, covering queue configuration, workload prioritization, and Kubernetes integration
Proven ability to run GPU cluster environments in Kubernetes and standalone Linux configurations for high-performance computing
Advanced Python scripting skills for automating infrastructure operations, job handling, and system monitoring
Practical proficiency in UNIX shell scripting (e.g., Bash) to automate system tasks and streamline operational workflows
Strong Linux system administration background, including troubleshooting, performance optimization, and configuration management
Thorough understanding of automation and orchestration tooling and concepts to build scalable, dependable infrastructure
Excellent English communication skills (spoken and written) for direct client engagement and collaboration across teams
Nice to have
Experience with Helm to package and manage Kubernetes applications
Knowledge of monitoring and observability tools such as Prometheus, Grafana, and Loki to track health and performance
Familiarity with Infrastructure as Code tools like Terraform for automated cloud provisioning and management
Background with multi-cloud Kubernetes platforms including Amazon EKS and Google GKE
Skills in Azure networking, including VPN setup, ExpressRoute configuration, and network security
Experience using AI coding assistants (GitHub Copilot, ChatGPT, Claude) to improve development speed and code quality
Understanding of hybrid scheduling and resource optimization across cloud and on-premises environments
We offer
International projects with top brands
Work with global teams of highly skilled, diverse peers
Healthcare benefits
Employee financial programs
Paid time off and sick leave
Upskilling, reskilling and certification courses
Unlimited access to the LinkedIn Learning library and 22,000+ courses
Global career opportunities
Volunteer and community involvement opportunities
EPAM Employee Groups
Award-winning culture recognized by Glassdoor, Newsweek and LinkedIn