Observability Engineer

Why We Need This Role:

• The platform requires comprehensive observability dashboards built from multiple sources (platform health, use case performance, cost, security)

• The platform needs advanced monitoring beyond GCP native tools for production operations

• Prometheus and Grafana expertise required for custom metrics, alerting, and dashboards

• OpenTelemetry instrumentation across all use cases requires dedicated focus

• No current team member has deep Prometheus/Grafana expertise

Job Description: Observability Engineer

About the Role:

Our GenAI platform requires comprehensive observability to ensure production reliability, performance optimisation, and cost management. As our Observability Engineer, you will design and implement the monitoring, alerting, and dashboarding infrastructure that gives teams visibility into platform health, use case performance, and operational costs.

Key Responsibilities:

• Design and implement observability architecture using Prometheus and Grafana

• Deploy and manage Prometheus stack on GKE with appropriate retention and HA configuration

• Create comprehensive Grafana dashboards for platform health, API performance, and use case metrics

• Implement custom metrics collection for CrewAI agents, Kong API Gateway, and LLM usage

• Configure OpenTelemetry instrumentation across all platform services

• Design alerting rules and notification channels for P0-P3 incident severity levels

• Build cost and usage dashboards for LLM token consumption and infrastructure spend

• Integrate with Cloud Monitoring and Cloud Logging for unified observability

• Establish SLI/SLO frameworks for platform and use case services

• Create runbooks for common alerting scenarios and incident response

Required Skills:

• 4+ years of experience in observability and monitoring engineering

• Strong expertise in Prometheus (PromQL, recording rules, alerting rules)

• Proficiency in Grafana (dashboard design, variables, annotations, alerting)

• Experience with OpenTelemetry for distributed tracing and metrics

• Knowledge of Kubernetes monitoring patterns and kube-state-metrics

• Understanding of SRE principles (SLIs, SLOs, error budgets)

• Experience with log aggregation and analysis (Loki, ELK, or similar)

• Familiarity with alerting best practices and on-call workflows

Desirable Skills:

• Experience with GCP Cloud Monitoring and Cloud Trace integration

• Knowledge of AI/ML observability patterns (model latency, token usage, drift detection)

• Background in API gateway monitoring (Kong, Envoy, or similar)

• Experience with long-term Prometheus storage

• Familiarity with FinOps and cost observability dashboards

DevOps - SRM (Observability)

Job Description