Site Reliability Engineer
Job Description
Site Reliability Engineer (SRE) – Application Performance Monitoring (APM)
Location: Monterrey, Nuevo León, Mexico (Hybrid – candidates must reside in Monterrey or the metropolitan area).
Language requirement: Fluent English (spoken and written).
About the Role:
We’re looking for a Site Reliability Engineer (SRE) with a passion for Application Performance Monitoring (APM) and system optimization.
In this role, you’ll be at the heart of ensuring the reliability, scalability, and performance of NOV’s mission-critical applications. You’ll work closely with software engineering and operations teams to design monitoring strategies, analyze performance, and prevent issues before they affect users.
If you thrive in fast-paced environments, love solving complex technical challenges, and enjoy turning data into insight, this is the role for you.
What You’ll Do
- Design and manage APM strategies using tools like Elastic APM, Datadog, Dynatrace, or similar platforms.
- Perform deep performance analysis, tracing distributed requests and identifying bottlenecks in both code and infrastructure.
- Build real-time dashboards and alerting systems using Grafana, Kibana, or equivalent tools to visualize system health.
- Proactively monitor systems to detect performance degradations, security threats, and system failures before users are impacted.
- Define and track Service Level Objectives (SLOs) and Service Level Agreements (SLAs) to continuously improve reliability.
- Lead Root Cause Analysis (RCA) sessions after incidents and implement corrective actions to prevent recurrence.
- Automate repetitive tasks and monitoring setups using Python, Bash, or PowerShell.
- Collaborate with cross-functional teams to embed reliability, performance, and observability best practices into every stage of development.
- Continuously refine tools, processes, and APM strategies to enhance efficiency, reliability, and visibility across platforms.
- Engage with stakeholders to understand performance challenges and shape the platform roadmap.
What You Bring
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related field.
- 5+ years of experience in Site Reliability, DevOps, or Performance Engineering roles.
- Proven hands-on experience with APM tools such as Elastic APM, Datadog, Dynatrace, New Relic, or AppDynamics.
- Expertise in the Elastic Stack (Elasticsearch, Logstash, Kibana, Beats) for logging, monitoring, and APM.
- Deep understanding of SRE principles, DevOps methodologies, and Production Support operations.
- Strong scripting ability in Python, Bash, or PowerShell for automation and analysis.
- Solid grasp of Linux/Unix systems, networking fundamentals, and distributed system architecture.
- Experience with containerization (Docker) and orchestration (Kubernetes).
- Excellent analytical, problem-solving, and collaboration skills, with the ability to communicate effectively in a global team.
Preferred Skills
- Experience with Infrastructure as Code (IaC) tools such as Terraform, Ansible, or Chef.
- Familiarity with cloud-native services (AWS, Azure, or GCP) and serverless architectures (AWS Lambda, Azure Functions).
- Knowledge of CI/CD tools like GitHub Actions, Azure DevOps, or Jenkins.
- Understanding of other observability pillars, including metrics (Prometheus) and logging.
- Experience working in agile environments.
Why NOV
At NOV, we combine over 150 years of innovation with cutting-edge technology to power the global energy industry.
You’ll join a global engineering team that values collaboration, curiosity, and continuous improvement, giving you the opportunity to make a real impact on systems that matter.