
DevOps (AI)

Technology
iSupport Worldwide
Pasig, Philippines · 1 month ago · Until 5/14/2026
Full time

Job description

Role Overview

We are seeking a highly skilled DevOps / AI Specialist to join our architecture and delivery team. This role blends DevOps engineering, AI/ML integration, and modern cloud‑native infrastructure. The ideal candidate has a strong foundation in coding, experience with scalable server systems, and a deep understanding of container orchestration—especially how Kubernetes integrates with broader enterprise ecosystems.

You will partner with architects, engineers, and client stakeholders to design, build, and optimize automated, intelligent, cloud‑native systems.

Key Responsibilities

DevOps Engineering

  • Design, implement, and maintain CI/CD pipelines using modern automation tools (GitHub Actions, GitLab CI, Azure DevOps, Jenkins, etc.).
  • Manage infrastructure as code (IaC) using Terraform, Helm, Ansible, or similar tooling.
  • Build and support containerized workloads in cloud and hybrid environments.
  • Monitor and optimize system performance, availability, and reliability.
  • Implement GPU-specific observability (DCGM Exporter, nvidia-smi metrics) and inference dashboards tracking tokens/sec, time-to-first-token, queue depth, and VRAM utilization.
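As an illustration of the CI/CD pipeline work described above, a minimal GitHub Actions workflow that tests, builds, and pushes a container image might look like the following sketch (the registry, image name, and `make test` target are placeholders, not part of this posting):

```yaml
# Hypothetical CI pipeline: run tests, then build and push an image on main.
name: build-and-push
on:
  push:
    branches: [main]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run unit tests
        run: make test   # assumes the repo provides a test target
      - name: Build image
        run: docker build -t registry.example.com/app:${{ github.sha }} .
      - name: Push image
        run: docker push registry.example.com/app:${{ github.sha }}
```

Tagging the image with the commit SHA keeps deployments traceable back to source, which also fits the GitOps workflows mentioned later in this posting.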

AI/ML Integration

  • Assist in deploying AI/ML models into production using modern MLOps practices.
  • Integrate AI capabilities (LLMs, inference pipelines, vector stores, API integrations) into existing application and infrastructure workflows.
  • Work with data teams to ensure smooth model deployment, versioning, and scaling.
  • Implement observability and automated retraining workflows where applicable.
  • Deploy and manage inference servers (vLLM, TGI, Triton, Ollama) for production LLM workloads.
  • Apply inference optimization techniques: quantization (AWQ, GPTQ), batching strategies, and KV cache management.
  • Manage large model weights (files of 50 GB and up), model versioning, storage optimization, and model formats (SafeTensors, GGUF, AWQ).
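Several of the inference servers listed above (vLLM, TGI, Ollama) can expose an OpenAI-compatible HTTP API, which is why direct API calls are often simpler than an orchestration framework. A minimal Python sketch of building such a request payload (the model name is a placeholder):

```python
import json


def build_chat_request(model: str, prompt: str, max_tokens: int = 256) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions payload.

    The same payload shape works against vLLM, TGI, or Ollama when their
    OpenAI-compatible endpoints are enabled.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


# Example: serialize a request for a locally served model (name is illustrative).
payload = build_chat_request("meta-llama/Llama-3.1-8B-Instruct", "Hello!")
body = json.dumps(payload)
```

Because the payload is plain JSON, it can be POSTed to any of the servers above with a standard HTTP client and no framework dependency.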

Cloud & Server Infrastructure

  • Support and optimize cloud environments (AWS, Azure, GCP, or hybrid).
  • Manage Linux/Windows server systems, network configurations, and security controls.
  • Ensure high availability, disaster recovery, and secure architectural patterns.
  • Perform environment provisioning through automation and containerization.
  • Support air-gapped or restricted network deployments where cloud-native conveniences (public registries, managed services) are unavailable.

Kubernetes & Platform Engineering

  • Design, deploy, and manage Kubernetes clusters in cloud and on‑prem environments.
  • Integrate Kubernetes with CI/CD, service meshes, API gateways, observability stacks, and cloud-native services.
  • Configure role-based access control (RBAC), secrets management, and workload scaling.
  • Implement GitOps frameworks such as ArgoCD or FluxCD.
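GPU scheduling on these clusters (called out in the qualifications below) is typically handled through the NVIDIA device plugin's extended resource. A minimal Pod sketch, with placeholder names and image:

```yaml
# Hypothetical Pod requesting one GPU via the NVIDIA device plugin.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker              # placeholder name
spec:
  containers:
    - name: worker
      image: registry.example.com/inference:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1           # schedules only onto GPU-capable nodes
```

Requesting `nvidia.com/gpu` in `limits` lets the scheduler place the workload on a node with a free GPU; the GPU Operator automates installing the driver and device plugin that make this resource visible.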

Required Qualifications

  • Strong fundamental coding skills in one or more languages (Python, Go, JavaScript/TypeScript, Bash, etc.).
  • Hands‑on experience with DevOps tools and practices (CI/CD, IaC, containerization).
  • Proficiency with Kubernetes concepts—deployments, services, ingress, operators, scaling, and integrations.
  • Experience with GPU infrastructure: NVIDIA drivers, CUDA, container toolkit, and GPU scheduling in Kubernetes (GPU Operator, device plugins).
  • Understanding of GPU memory management and multi-GPU workloads.
  • Solid understanding of server infrastructure, networking, and cloud services.
  • Experience deploying or integrating AI/ML workloads into production systems.
  • Strong problem‑solving skills and ability to collaborate within cross‑functional teams.
  • Strong debugging skills: reading container logs, tracing through Kubernetes events, diagnosing GPU driver issues.
  • Comfort with clean-slate testing and isolating failures in complex multi-layer stacks.

Preferred Qualifications (Nice to Have)

  • Experience with MLOps frameworks (MLflow, Kubeflow, SageMaker, Vertex AI, etc.).
  • Familiarity with LLM orchestration (LangChain, LlamaIndex) and understanding of when NOT to use them (direct OpenAI-compatible API calls are often simpler).
  • Knowledge of service mesh technologies (Istio, Linkerd).
  • Familiarity with security best practices (DevSecOps, vulnerability scanning, secure coding).
  • Experience hardening APIs: rate limiting, authentication middleware, input validation, security headers, and production WSGI servers (Gunicorn, uWSGI).
  • Certifications in cloud platforms (AWS, Azure, GCP) or Kubernetes (CKA, CKAD).
