Site Reliability Engineer Professional

Introduction

At IBM Infrastructure & Technology, we design and operate the systems that keep the world running. From high-resiliency mainframes and hybrid cloud platforms to networking, automation, and site reliability, our teams ensure the performance, security, and scalability that clients and industries depend on every day. Working in Infrastructure & Technology means tackling complex challenges with curiosity and collaboration. You’ll work with diverse technologies and colleagues worldwide to deliver resilient, future‑ready solutions that power innovation. With continuous learning, career growth, and a supportive culture, IBM provides the opportunities to build expertise and shape the infrastructure that drives progress.

Your Role And Responsibilities

We’re seeking a Site Reliability Engineer Professional to support the availability, performance, and day‑to‑day operations of our services and platforms. The engineer in this role will apply SRE best practices across automation, observability, Kubernetes, and CI/CD. Responsibilities include system maintenance, tooling improvements, participation in on‑call rotations, and contributions to the reliability and scalability of services.

Key Responsibilities

Operations & Reliability

Participate in an on‑call rotation with mentorship and established runbooks
Perform operational tasks: log reviews, rollouts, restarts, configuration updates, certificate renewals
Maintain and update runbooks, dashboards, diagrams, and documentation

Monitoring & Observability

Build or update dashboards and alerts using Prometheus, Grafana, and Loki
Tune alerts to reduce noise and improve signal quality
Apply golden signal and RED/USE patterns under guidance

Automation & Tooling

Develop automation scripts with Python, Bash, or Go to eliminate repetitive tasks
Contribute to CI/CD pipelines (linting, gates, templates)

Cloud & Platform

Support deployment and operation of workloads on Docker, Kubernetes, and OpenShift
Contribute to infrastructure changes using Terraform and Ansible, subject to peer review
Assist with basic cloud provisioning tasks

Networking & Security

Apply foundational networking concepts (TCP/IP, DNS, routing, HTTP, TLS) in troubleshooting
Follow least‑privilege and proper secrets‑management practices

Collaboration & Process

Participate in and/or lead Agile ceremonies (standups, planning, retros)
Contribute to and/or lead blameless post‑incident reviews
Collaborate with cross‑functional teams and use standard Git workflows

Required Technical And Professional Expertise

Between 1 and 3 years of experience in SRE/DevOps/Platform Engineering or related fields
Advanced English proficiency is a must
Strong Linux fundamentals: CLI, processes, permissions, logs, troubleshooting
Proficiency in at least one scripting language (Python, Bash, or Go)
Experience with Git and GitHub workflows
Familiarity with Docker and Kubernetes basics
Experience with CI/CD implementations
Basic networking knowledge

Preferred Technical And Professional Experience

OpenShift experience
Hands‑on exposure to Terraform and Ansible
Experience with Prometheus, Grafana, Loki, Thanos, or OpenTelemetry
Cloud platform fundamentals (IBM Cloud, AWS, Azure, or GCP)
Experience with JavaScript or TypeScript

Job Description