DevOps Engineer / Sr. DevOps Engineer

*Roles & Responsibilities:
*Duties & accountabilities
Provide second line client-facing technical support for issues escalated by first line support teams.
Apply strong technical skills and good business knowledge, together with investigative techniques and problem-solving skills, to identify and resolve issues efficiently and in a timely manner.
Work collaboratively with the development team as required for third line escalation.
Coordinate with product and delivery teams to ensure the Service Management team is ready for new releases and engaged in early design of new enhancements.
Work on initiatives and continuous improvement process around proactive application health monitoring, reporting, and technical support.
Apply AI/ML techniques to detect anomalies, predict alerts, and enhance support operations.
*Key Areas of the Team's Responsibilities
Proactive monitoring and management of business-critical, 24x7 real-time applications, rectifying issues in a timely fashion where required to restore application functionality.
Ensure incidents are correctly processed, assessing business and technical impact and severity.
Taking ownership of application incidents and ensuring that they are resolved; this includes retaining ownership of incidents that require 3rd Line or IT Change activity to resolve.
Ensuring the communication to the business community remains active.
Application responsibilities will cover Application Infrastructure, Data Fixes, User Queries, User Education and Incident Investigation.
Monitoring of application event alerts, job schedules, capacity monitors and performance KPIs. Creation and ownership of change requests raised to address any of the above issues.
Proactively share knowledge with the team and update the knowledge base with support documentation (Confluence).
Work to provide services to agreed Service Level Targets and Operating Level Agreements.
Leverage AI Ops techniques to analyse logs, metrics, traces, and event data, enabling proactive trend identification and continuous optimization of system performance.
*Education and Hands-on Experience Required
Preferably 4+ years of direct experience in Site Reliability Engineering or DevOps roles, covering high availability and incident response in AWS, Azure, or GCP.
Proficiency with cloud computing environments (AWS / GCP / Azure).
Good understanding of Application Support processes.
Ideally familiar with monitoring tools such as Splunk, CloudWatch, Dotcom and Monolith.
Expertise in Oracle SQL/PostgreSQL: proficiency in advanced SQL techniques, query optimization, and experience with complex database systems.
Experience with advanced observability tools (e.g., Prometheus, Grafana, Splunk) for monitoring, logging, and tracing.
Experience in leading post-mortem analyses and implementing preventative measures to avoid recurrence of incidents.
Excellent problem-solving skills and the capacity to lead effectively under pressure during incident response and outage management.
Must understand operating systems, especially Windows and Linux. Good scripting experience (preferably including Python) is an advantage.
