
AI Platform Engineering | LLMOps, MLOps Lead
Senior AI Platform and LLMOps Engineering Leader with 17+ years of experience designing and operating enterprise-scale AI systems across large production environments. Specialized in building Generative AI platforms, RAG pipelines, and agentic AI systems that automate diagnostics, reasoning, and operational decision-making.
Expert in productionizing LLM applications through scalable deployment architectures, evaluation frameworks, observability pipelines, and governance controls. Strong background in cloud-native AI infrastructure, DevOps automation, and Python-based AI services across AWS and Azure.
Architected and operated enterprise Generative AI and LLMOps platforms supporting production AI workloads across trading, compliance, and risk systems, enabling secure and scalable adoption of GenAI applications.
Designed and implemented agentic AI workflows using LangChain and LangGraph, enabling multi-step reasoning, retrieval, and tool orchestration for enterprise automation scenarios.
Operationalized production-grade RAG pipelines, implementing embedding pipelines, vector retrieval using pgvector/OpenSearch/ChromaDB, and prompt orchestration to support knowledge-driven GenAI systems.
Integrated AWS Bedrock for enterprise model access, configuring guardrails, managing secure foundation model usage, and performing supervised and reinforcement fine-tuning, including LLM-as-Judge–driven distillation to create cost-efficient domain-specific models.
Developed LLM evaluation and observability frameworks using RAGAs, Langfuse, and telemetry pipelines to monitor hallucination, relevance, latency, and response quality in production AI systems.
Built enterprise-grade data engineering pipelines for GenAI and ML workloads, leveraging AWS Glue for scalable ETL and enabling compliant data ingestion for RAG pipelines, fine-tuning, and evaluation workflows.
Designed multi-cloud (AWS/Azure) AI infrastructure using Kubernetes, Docker, and vector databases (pgvector, OpenSearch, ChromaDB) for efficient embedding generation and low-latency retrieval.
Built AI-driven operational intelligence pipelines combining observability data, logs, and knowledge bases to support automated analysis, summarization, and decision assistance for production environments.
Built Python-based GenAI platform services and APIs using FastAPI, enabling reusable components for embedding generation, retrieval orchestration, agent workflows, and model inference.
Built LLM-powered automation workflows for operational diagnostics, enabling retrieval of runbooks, analysis of observability signals, and automated remediation recommendations.
Led and mentored a team of AI platform and DevOps engineers, establishing best practices for LLM deployment, evaluation frameworks, and safe production rollout of AI systems.