Requirements
Must have:
- Minimum of 5 years in DevOps, Cloud Engineering, or similar roles, with at least 2 years focused on MLOps in production settings
- Advanced Python skills for development, automation, and scripting
- Proven track record of building and deploying production-grade APIs and backend services (e.g., FastAPI, Flask, Django)
- Strong SQL skills and experience designing and optimizing data models for both relational and NoSQL databases (e.g., BigQuery, Cloud SQL)
- Hands-on experience with workflow orchestration tools (e.g., Dagster, Airflow)
- Expertise in Docker and Kubernetes, including Helm and Kubernetes-native Infrastructure as Code (IaC) tools; GKE experience preferred
- Extensive familiarity with the Google Cloud Platform (GCP) ecosystem (e.g., Compute Engine, Cloud Storage, BigQuery, Pub/Sub, Vertex AI)
- Proficiency with GitHub workflows (branching, pull requests, code reviews)
- Solid understanding of network architecture, security protocols, and large-scale data processing
- Excellent communication skills and a strong ability to collaborate and solve problems
Responsibilities:
- Oversee the design, implementation, and management of end-to-end MLOps pipelines for continuous training, deployment, monitoring, and versioning of production ML models
- Design, build, and maintain efficient data ingestion and transformation pipelines using modern orchestration tools such as Dagster or Airflow
- Architect, deploy, and maintain highly scalable, fault-tolerant infrastructure on Kubernetes (GKE) within Google Cloud Platform (GCP)
- Champion DevOps best practices, including Infrastructure as Code (IaC) with Terraform or similar tools, and build reliable CI/CD workflows
- Configure and maintain automated deployment and testing pipelines using GitHub Actions and related tools for fast, reliable releases
- Write clean, effective Python code for automation, infrastructure tooling, and service integration
- Design, develop, and deploy high-performance Python APIs (FastAPI, Flask, or similar) to serve ML predictions and application services in production
- Implement comprehensive monitoring, logging, and alerting (e.g., Prometheus, Grafana, cloud-native logging tools) to ensure system reliability
- Apply security best practices, access controls, and compliance measures suitable for large-scale enterprise environments
- Collaborate closely with data scientists, software engineers, and product teams, and provide technical guidance and mentorship to junior engineers
Company:
We are a global organization at the crossroads of technology, data, and digital platforms, providing products and services that reach millions of people worldwide. Our engineering teams build and run highly scalable, cloud-native systems that power data-driven decisions, advanced analytics, and machine learning across a diverse ecosystem. We emphasize modern engineering practices, automation, and innovation, and we invest heavily in our people and platforms to foster long-term growth and technical excellence. Joining us means working in a collaborative environment alongside experienced professionals, with opportunities for significant career advancement in a stable, mature setting.