Lead deployment and configuration of on-prem Linux-based platforms for AI workloads.
Design and configure KVM virtualization.
Architect and implement GPU-enabled environments for production LLM inference.
Deploy and operate containerized LLM serving stacks in production.
Design and validate GPU utilization, isolation, and health monitoring.
Integrate deployments with CI/CD pipelines and security controls.
Apply security hardening, RBAC, encryption, and audit-ready configurations.
Design HA and DR strategies and prepare systems for future scale-out.
Lead troubleshooting, performance tuning, and stabilization.
Produce architecture documentation, runbooks, and handover materials.
Primary Skills
5+ years of experience in DevOps, Platform Engineering, or Infrastructure Engineering roles.
Strong Linux system administration skills, including networking, storage, performance tuning, and security hardening.
Hands-on experience deploying and operating LLMs in production in on-premises environments.
Proven experience managing GPU infrastructure using NVIDIA GPUs (H100, A100, H200, or equivalent).
Hands-on experience installing, configuring, and troubleshooting CUDA drivers and GPU runtimes.
Experience with containerized workloads using Docker or OCI-compliant runtimes.
Hands-on experience with LLM serving frameworks such as vLLM, TensorRT-LLM, or Triton Inference Server.
Experience designing and supporting GPU utilization, isolation, and performance monitoring.
Strong hands-on experience with KVM virtualization in production.
Experience working with modern AI application stacks, including backend APIs, PostgreSQL, vector databases (e.g., Qdrant), and observability tools.
Strong experience using Infrastructure as Code and automation tools, including Ansible.
Hands-on experience designing and operating CI/CD pipelines in enterprise environments.
Hands-on experience setting up and operating Kubernetes on-premises.
Strong understanding of, and practical experience implementing, DevSecOps practices, including secure pipelines, SAST integration, secrets management, and least-privilege access.
Experience designing or supporting high availability and disaster recovery strategies, including backup, restore, and failover concepts.
Strong experience working in on-prem, air-gapped, or regulated environments.