Job Description:
Key Areas of the Team's Responsibilities:
- Proactive monitoring and management of business-critical, 24x7 real-time applications, rectifying issues in a timely fashion where required to restore application functionality.
- Ensure incidents are correctly processed, assessing business and technical impact and severity.
- Taking ownership of application incidents and ensuring that they are resolved, including retaining ownership of incidents that require 3rd Line or IT Change activity to resolve.
- Ensuring the communication to the business community remains active.
- Application responsibilities will cover Application Infrastructure, Data Fixes, User Queries, User Education and Incident Investigation.
- Monitoring of application events and alerts, job schedules, capacity monitors and performance KPIs.
- Creation and ownership of change requests raised to address any of the above issues.
- Proactively share knowledge with the team and update the knowledge base with support documentation (Confluence).
- Work to provide services to agreed Service Level Targets and Operating Level Agreements.
- Leverage AI Ops techniques to analyse logs, metrics, traces, and event data, enabling proactive trend identification and continuous optimization of system performance.
Education and Hands-on Experience Required:
- Preferably 5+ years of direct experience in Site Reliability Engineering or DevOps roles, covering high availability and incident response in AWS, Azure, or GCP.
- Proficiency with cloud computing environments (AWS / GCP / Azure).
- Good understanding of Application Support processes.
- Ideally familiar with monitoring tools such as Splunk, CloudWatch, Dotcom and Monolith.
- Expertise in Oracle SQL/PostgreSQL: proficiency in advanced SQL techniques and query optimization, and experience with complex database systems.
- Experience with advanced observability tools (e.g., Prometheus, Grafana, Splunk) for monitoring, logging, and tracing.
- Experience in leading post-mortem analyses and implementing preventative measures to avoid recurrence of incidents.
- Excellent problem-solving skills and the capacity to lead effectively under pressure during incident response and outage management.
- Must understand operating systems, especially Windows and Linux.
- Good scripting experience (preferably including Python) is an advantage.