Site Reliability Engineer (Linux & GPU Environment)

Job description

Make Your Next Career Move and Defy Your Limits with KMC Solutions!

At KMC Solutions, we make it easy for the world’s fastest-growing companies to scale in the Philippines. As the country’s leading provider of flexible office space and Employer of Record (EOR) services, we help businesses expand without the red tape, offering a faster and easier path to growth.

Internal Job Title: Site Reliability Engineer (SRE - Linux & GPU)

Work Set-Up: Fully Remote (Work From Home)

Work Schedule: Monday to Friday (Central European Time | Day Shift, Mid Shift, or Night Shift, depending on business needs).

Position Summary:

The Site Reliability Engineer is responsible for supporting the validation, testing, and readiness of high-performance GPU compute infrastructure prior to production deployment. This role is critical in ensuring that all hardware, networking, and Linux-based systems meet operational and reliability standards before customer workloads are launched.

This position works closely with the Germany-based infrastructure and engineering teams to validate system integrity, diagnose and resolve infrastructure issues, and help maintain the stability and reliability of advanced GPU-powered environments. The role is ideal for someone who has strong Linux troubleshooting experience and is passionate about infrastructure reliability, automation, and high-performance systems.

Duties and Responsibilities:

Infrastructure/Cluster Validation & Testing:

- Validate GPU clusters and Linux-based systems to ensure readiness prior to production release.

- Perform system diagnostics, functional testing, and reliability validation across servers and infrastructure components.

- Verify system health, performance, and network connectivity to ensure operational standards are met.

Orchestration & Benchmarking:

- Provision and configure GPU clusters using automated workflows.

- Execute and analyze performance and stability benchmarks orchestrated via Slurm.

- Validate results against expected performance and reliability thresholds.

Automation & Framework Support:

- Support and enhance automated validation and testing workflows using Python and Ansible.

- Execute automated tests and analyze performance and reliability results.

- Contribute to improving automation coverage, efficiency, and reliability of validation processes.

Remediation & System Integrity:

- Diagnose and remediate unhealthy nodes through configuration changes or software fixes.

- Coordinate with on-site support and Smart Hands teams for hardware replacements when required.

- Ensure all issues are resolved and documented prior to handover to production operations.

Documentation & Handover:

- Produce clear, accurate documentation of test results, hardware states, and remediation actions.

- Ensure smooth handovers to operations and engineering teams.

- Maintain up-to-date runbooks and validation procedures.

Qualifications:

- Bachelor’s degree in Computer Science, Information Technology, or a related field.

- 3–5 years of hands-on experience administering, troubleshooting, and supporting Linux-based systems in infrastructure, server, or datacenter environments, with a strong focus on system reliability, validation, and operational readiness.

- Proven ability to diagnose system-level issues using Linux CLI tools, including analysis of system logs, kernel logs, drivers, and system services.

- Experience in automation and scripting using Python and/or infrastructure automation tools such as Ansible is preferred.

- Exposure to high-performance computing (HPC), GPU environments, or clustered infrastructure systems is an advantage.

- Familiarity with workload schedulers (e.g., Slurm), distributed systems, or high-speed networking technologies (e.g., InfiniBand) is a plus.

- Understanding of the datacenter hardware lifecycle and server validation processes is an advantage.

- Excellent English communication skills, both written and verbal, with the ability to collaborate effectively with global infrastructure and engineering teams.

- Strong sense of ownership, accountability, and commitment to maintaining system stability, documentation, and operational standards.

- Ability to work in a fully remote, work-from-home environment with a reliable internet connection.

- Amenable to working Monday to Friday (aligned with German business hours).
