Site Reliability Engineer (Linux & GPU Environment)

Technology
KMC Solutions
Metro Manila, Philippines | 1 month ago | Until 3/28/2026

Job description

Make Your Next Career Move and Defy Your Limits with KMC Solutions!

At KMC Solutions, we make it easy for the world’s fastest-growing companies to scale in the Philippines. As the country’s leading provider of flexible office space and Employer of Record (EOR) services, we help businesses expand without the red tape, offering a faster and easier path to growth.

Internal Job Title: Site Reliability Engineer (SRE - Linux & GPU)

Work Set-Up: Fully Remote (Work From Home)

Work Schedule: Monday to Friday (Central European Time; day, mid, or night shift, depending on business needs).

Position Summary:

The Site Reliability Engineer is responsible for supporting the validation, testing, and readiness of high-performance GPU compute infrastructure prior to production deployment. This role is critical in ensuring that all hardware, networking, and Linux-based systems meet operational and reliability standards before customer workloads are launched.

This position works closely with the Germany-based infrastructure and engineering teams to validate system integrity, diagnose and resolve infrastructure issues, and help maintain the stability and reliability of advanced GPU-powered environments. The role is ideal for someone who has strong Linux troubleshooting experience and is passionate about infrastructure reliability, automation, and high-performance systems.

Duties and Responsibilities:

  • Infrastructure/Cluster Validation & Testing:

- Validate GPU clusters and Linux-based systems to ensure readiness prior to production release.

- Perform system diagnostics, functional testing, and reliability validation across servers and infrastructure components.

- Verify system health, performance, and network connectivity to ensure operational standards are met.

  • Orchestration & Benchmarking:

- Provision and configure GPU clusters using automated workflows.

- Execute and analyze performance and stability benchmarks orchestrated via Slurm.

- Validate results against expected performance and reliability thresholds.

  • Automation & Framework Support:

- Support and enhance automated validation and testing workflows using Python and Ansible.

- Execute automated tests and analyze performance and reliability results.

- Contribute to improving automation coverage, efficiency, and reliability of validation processes.

  • Remediation & System Integrity:

- Diagnose and remediate unhealthy nodes through configuration changes or software fixes.

- Coordinate with on-site support and Smart Hands teams for hardware replacements when required.

- Ensure all issues are resolved and documented prior to handover to production operations.

  • Documentation & Handover:

- Produce clear, accurate documentation of test results, hardware states, and remediation actions.

- Ensure smooth handovers to operations and engineering teams.

- Maintain up-to-date runbooks and validation procedures.

Qualifications:

  • Bachelor’s degree in Computer Science, Information Technology, or a related field.
  • 3–5 years of hands-on experience administering, troubleshooting, and supporting Linux-based systems in infrastructure, server, or datacenter environments, with a strong focus on system reliability, validation, and operational readiness.
  • Proven ability to diagnose system-level issues using Linux CLI tools, including analysis of system logs, kernel logs, drivers, and system services.
  • Preferably with experience in automation and scripting using Python and/or infrastructure automation tools such as Ansible.
  • Exposure to high-performance computing (HPC), GPU environments, or clustered infrastructure systems is an advantage.
  • Familiarity with workload schedulers (e.g., Slurm), distributed systems, or high-speed networking technologies (e.g., InfiniBand) is a plus.
  • Understanding of datacenter hardware lifecycle and server validation processes is an advantage.
  • Excellent English communication skills, both written and verbal, with the ability to collaborate effectively with global infrastructure and engineering teams.
  • Strong sense of ownership, accountability, and commitment to maintaining system stability, documentation, and operational standards.
  • Ability to work in a fully remote, work-from-home environment with a reliable internet connection.
  • Amenable to working Monday to Friday (aligned with German business hours).
