We are always looking for exceptional talent to join us on the journey!

Your Mission

As an MLOps Engineer at Nuvei, your mission is to design, build, and operate the platforms that power our machine learning and generative AI products, spanning real-time use cases such as large-scale fraud scoring and support for MCP and agentic workflows. You'll create reliable CI/CD for models and agents, robust data/feature pipelines, secure model serving, and comprehensive observability.

You will also support our agentic AI ecosystem and Model Context Protocol (MCP) services so that models can safely use tools, data, and actions across . You will partner closely with Data Scientists, Data/Platform Engineers, Product, and SRE to ensure every model, from classic ML to LLM/RAG agents, moves from prototype to production with strong reliability, governance, cost efficiency, and measurable business impact.

Responsibilities:
- Operate and develop ML/LLM platforms on Kubernetes in the cloud (Azure; AWS/GCP also acceptable) with Docker, Terraform, and other relevant tools.
- Manage object storage, GPUs, and autoscaling for training and low-latency model serving.
- Manage the cloud environment, networking, service mesh, secrets, and policies to meet PCI-DSS and data-residency requirements.
- Build end-to-end CI/CD for models, agents, and MCP tooling (versioning, tests, approvals).
- Deliver real-time fraud/risk scoring and agent signals under strict latency SLOs.
- Maintain MCP servers/clients: tool/resource definitions, versioning, quotas, isolation, and access controls.
- Integrate agents with microservices, event streams, and rule engines; provide SLAs, tracing, and on-call runbooks.
- Measure operational metrics for ML/LLM workloads (latency, throughput, cost, tokens, tool success, safety events).
- Enforce governance: RBAC/ABAC, row-level security, encryption, PII/secrets management, and audit trails.
- Partner with Data Scientists on packaging (wheels/conda/containers), feature contracts, and reproducible experiments.
- Lead incident response and post-mortems.
- Drive FinOps: right-sizing, GPU utilization, batching/caching, and budget alerts.

Requirements:
- 4 years in DevOps/MLOps/Platform roles building and operating production ML systems (batch and real-time).
- Strong hands-on experience with Kubernetes, Docker, Terraform/IaC, and CI/CD.
- Practical experience with Spark/Databricks and scalable data processing.
- Proficiency in Python and Bash.
- Ability to operate data science code and optimize runtime performance.
- Experience with model registries (MLflow or similar), experiment tracking, and artifact management.
- Production model serving using FastAPI/Ray Serve/Triton/TorchServe, including autoscaling and rollout strategies.
- Monitoring and tracing with Prometheus/Grafana/OpenTelemetry; alerting tied to SLOs/SLAs.
- Solid understanding of PCI-DSS/GDPR considerations for data and ML systems.
- Experience with the Azure cloud environment is a big plus.
- Operating LLM/agent workloads in production (prompt/config versioning, tool execution reliability, fallback/retry policies).
- Building and maintaining RAG stacks (indexing pipelines, vector DBs, retrieval evaluation, hybrid search).
- Implementing guardrails (policy checks, content filters, allow/deny lists) and human-in-the-loop workflows.
- Experience with feature stores (Qwak Feature Store, Feast).
- A/B testing for models and agents; offline/online evaluation frameworks.
- Payments/fraud/risk domain experience, including integrating ML outputs with rule engines and operational systems, is an advantage.
- Familiarity with Databricks Unity Catalog, dbt, or similar tooling.

This position is open to all candidates.

Machine learning operations engineer
