Logicloop - Site Reliability Engineer - Logicloop Ventures Limited

Key Areas of the Team's Responsibilities:

Proactive monitoring and management of business-critical, 24x7 real-time applications, rectifying issues in a timely fashion where required to restore application functionality.
Ensure incidents are correctly processed, assessing business and technical impact and severity.
Taking ownership of application incidents and ensuring that they are resolved; this includes retaining ownership of incidents that require 3rd Line or IT Change activity to resolve.
Ensuring that communication with the business community remains active.
Application responsibilities will cover Application Infrastructure, Data Fixes, User Queries, User Education and Incident Investigation.
Monitoring of application event alerts, job schedules, capacity monitors and performance KPIs.
Creation and ownership of change requests raised to address any of the above issues.
Proactively share knowledge with the team and update the knowledge base with support documentation (Confluence).
Work to provide services to agreed Service Level Targets and Operating Level Agreements.
Leverage AIOps techniques to analyse logs, metrics, traces, and event data, enabling proactive trend identification and continuous optimization of system performance.

Education and Hands-on Experience Required:

Preferably 5+ years of direct experience in Site Reliability Engineering or DevOps roles, covering high availability and incident response in AWS, Azure or GCP.
Proficiency with cloud computing environments (AWS / GCP / Azure).
Good understanding of Application Support processes.
Ideally familiar with monitoring tools such as Splunk, CloudWatch, Dotcom and Monolith.
Expertise in Oracle SQL/PostgreSQL: Proficiency in advanced SQL techniques, query optimization, and experience with complex database systems.
Experience with advanced observability tools (e.g., Prometheus, Grafana, Splunk) for monitoring, logging, and tracing.
Experience in leading post-mortem analyses and implementing preventative measures to avoid recurrence of incidents.
Excellent problem-solving skills and the capacity to lead effectively under pressure during incident response and outage management.
Must understand operating systems, especially Windows and Linux.
Good scripting experience (preferably including Python) is an advantage.
