Site Reliability Engineering Manager
Job Description
Lead, NOC & Incident Management – Data Center Operations
Location: Austin, TX
Employment Type: Full-Time
Salary: $200,000 – $300,000 base + equity
Benefits: Full Benefits Package
Our client is a fast-growing AI infrastructure company building next-generation data centers and cloud-scale systems. They are seeking a Lead, NOC & Incident Management to establish and lead a cross-functional operations center (NOC) and incident management function, ensuring reliable monitoring and response across the company’s infrastructure portfolio, including datacenter facilities, network backbone, and platform services.
This is a hands-on operational leadership role combining strategic process development with technical credibility. The successful candidate will build 24/7 monitoring and triage capabilities, operationalize incident management frameworks, and drive a culture of proactive, consistent operational excellence.
Key Responsibilities
NOC Build & Operations
- Stand up a cross-functional operations center from scratch, including staffing models, handoff processes, KPIs, and quality standards
- Select and onboard MSP partners for Tier 1 coverage
- Ensure qualified monitoring coverage 24/7 for all critical alerts
- Create, deploy, and operationalize structured incident management frameworks
- Manage on-call rotations, run incident bridges for SEV0/SEV1 events, and lead post-incident reviews
- Partner with internal teams to continuously refine incident response processes
- Maintain runbook quality assurance and tabletop exercises for new infrastructure domains
- Onboard new domains (Facilities, Network, Systems) into NOC coverage aligned with datacenter launches
- Build operational partnerships across Network Ops, DC Ops, Systems/Platform, and Security teams
- Define clear Tier 1 → Tier 2 escalation criteria and ensure the NOC acts as a force multiplier for engineering teams
- Establish processes for full lifecycle management of carrier and vendor tickets
- Track and enforce SLAs, escalate as needed, and maintain documentation for all vendor interactions
- Define and track operational metrics (MTTA, MTTR, escalation rate, false positives, runbook coverage)
- Produce operational reports and use data to reduce alert noise, improve runbooks, and shorten incident response times
Qualifications
- 5 years in network operations, infrastructure operations, or site reliability roles with NOC leadership experience
- Deep experience with structured incident response: severity classification, escalations, incident bridges, post-incident reviews, and RCA workflows
- Technical breadth across infrastructure domains: network, facilities, and platform services
- Proven ability to build operational processes, runbooks, and training programs from scratch
- Strong cross-team influence without direct authority
- Customer SLA mindset with focus on reliable 24/7 operations
- Comfortable operating in a fast-paced, high-growth environment
Preferred Experience
- Experience at hyperscale or large-scale infrastructure companies or telcos
- Hands-on experience with incident management tools (incident.io, PagerDuty, Opsgenie, ServiceNow)
- MSP/vendor selection, onboarding, and management experience
- Familiarity with datacenter facilities operations, BMS/SCADA alerts, and carrier/ISP processes
- Startup experience in high-growth environments
Compensation & Benefits
- Base salary of $200,000 – $300,000
- Equity participation from day one
- Health, dental, and vision insurance
- Retirement plan aligned with U.S. norms
- Generous PTO policy