Cloud Infrastructure Engineer
Infinite Computer Solutions
Job description
This team is responsible for managing, troubleshooting, and optimizing containerized applications and infrastructure deployed on Kubernetes, Red Hat OpenShift, and OpenStack platforms. The team also supports NCS and CBIS products. You will serve as the Subject Matter Expert (SME) for core cloud infrastructure technologies, lead the investigation and resolution of complex, high-severity customer issues, and provide end-to-end Escalation, Monitoring, and Emergency (EME) support, acting as the final escalation point to ensure service availability and meet SLAs.
Responsibilities
Manage, troubleshoot, and optimize containerized applications and infrastructure deployed on platforms like Kubernetes, Red Hat OpenShift, and OpenStack.
Lead the investigation and resolution of complex, high-severity customer incidents.
Utilize expertise to quickly identify root causes and implement effective, durable solutions.
Prepare and conduct rigorous Root Cause Analysis (RCA) for critical incidents to identify systemic issues and prevent recurrence.
Develop, test, and maintain robust automation scripts using Python and Ansible to streamline daily operational tasks and improve overall service efficiency.
Provide immediate support for urgent cases as part of an on-call rotation.
Stay current with industry best practices and emerging technologies in cloud and containerization.
Required Skills and Experience
Linux Expertise: Strong knowledge and proven hands-on experience with advanced Linux (CentOS) system administration. Familiarity with Red Hat and CentOS is highly valued.
Networking Foundations: Strong knowledge of core networking principles (TCP/IP, routing, load balancing, firewalls) in a cloud environment. A solid grasp of computer networking fundamentals, such as VLANs and IP routing, is a must.
Containerization & Virtualization: Strong knowledge of Kubernetes orchestration, OpenStack platforms, and Docker/Containerization. Knowledge in areas like Podman, Kubernetes, Helm, and/or OpenStack, KVM/QEMU is a significant advantage.
Scripting and Automation: Solid Python scripting skills for task automation and system management. Proficiency in scripting with Bash and Python, or the willingness to learn and adapt, as well as familiarity with Ansible is required.
Root Cause Analysis (RCA): Expertise in preparing and conducting RCAs.
Escalation and Monitoring: Proven experience with EME (Escalation, Monitoring, and Emergency) management processes.