Site Reliability Engineer (Linux & GPU Environment)
Job description
Make Your Next Career Move and Defy Your Limits with KMC Solutions!
At KMC Solutions, we make it easy for the world’s fastest-growing companies to scale in the Philippines. As the country’s leading provider of flexible office space and Employer of Record (EOR) services, we help businesses expand without the red tape—offering a faster and easier path to growth in the Philippines.
Internal Job Title: Site Reliability Engineer (SRE - Linux & GPU)
Work Set-Up: Fully Remote (Work From Home)
Work Schedule: Monday to Friday (Central European Time; day shift, mid shift, or night shift, depending on business needs).
Position Summary:
The Site Reliability Engineer is responsible for supporting the validation, testing, and readiness of high-performance GPU compute infrastructure prior to production deployment. This role is critical in ensuring that all hardware, networking, and Linux-based systems meet operational and reliability standards before customer workloads are launched.
This position works closely with the Germany-based infrastructure and engineering teams to validate system integrity, diagnose and resolve infrastructure issues, and help maintain the stability and reliability of advanced GPU-powered environments. The role is ideal for someone who has strong Linux troubleshooting experience and is passionate about infrastructure reliability, automation, and high-performance systems.
Duties and Responsibilities:
- Infrastructure/Cluster Validation & Testing:
  - Validate GPU clusters and Linux-based systems to ensure readiness prior to production release.
  - Perform system diagnostics, functional testing, and reliability validation across servers and infrastructure components.
  - Verify system health, performance, and network connectivity to ensure operational standards are met.
- Orchestration & Benchmarking:
  - Provision and configure GPU clusters using automated workflows.
  - Execute and analyze performance and stability benchmarks orchestrated via Slurm.
  - Validate results against expected performance and reliability thresholds.
- Automation & Framework Support:
  - Support and enhance automated validation and testing workflows using Python and Ansible.
  - Execute automated tests and analyze performance and reliability results.
  - Contribute to improving automation coverage, efficiency, and reliability of validation processes.
- Remediation & System Integrity:
  - Diagnose and remediate unhealthy nodes through configuration changes or software fixes.
  - Coordinate with on-site support and Smart Hands teams for hardware replacements when required.
  - Ensure all issues are resolved and documented prior to handover to production operations.
- Documentation & Handover:
  - Produce clear, accurate documentation of test results, hardware states, and remediation actions.
  - Ensure smooth handovers to operations and engineering teams.
  - Maintain up-to-date runbooks and validation procedures.
Qualifications:
- Bachelor’s Degree in Computer Science, Information Technology, or a related field.
- At least 3–5 years of hands-on experience administering, troubleshooting, and supporting Linux-based systems in infrastructure, server, or datacenter environments, with a strong focus on system reliability, validation, and operational readiness.
- Proven ability to diagnose system-level issues using Linux CLI tools, including analysis of system logs, kernel logs, drivers, and system services.
- Experience in automation and scripting using Python and/or infrastructure automation tools such as Ansible is preferred.
- Exposure to high-performance computing (HPC), GPU environments, or clustered infrastructure systems is an advantage.
- Familiarity with workload schedulers (e.g., Slurm), distributed systems, or high-speed networking technologies (e.g., InfiniBand) is a plus.
- Understanding of datacenter hardware lifecycle and server validation processes is an advantage.
- Excellent English communication skills, both written and verbal, with the ability to collaborate effectively with global infrastructure and engineering teams.
- Strong sense of ownership, accountability, and commitment to maintaining system stability, documentation, and operational standards.
- Ability to work in a fully remote, work-from-home environment with a reliable internet connection.
- Amenable to working Monday to Friday (aligned with German business hours).