Overview:
The opportunity
The Senior Systems Engineer - HPC will be part of the Engineering team responsible for building HPC and AI solutions. This position will play a key role in the design and delivery of HPC systems and their hardware, platform, software, networking, and storage components. Responsibilities extend to collaborating with multiple internal stakeholders, leading complex projects, and implementing industry-standard and innovative solutions. Key capabilities required for this role include extensive experience in building HPC solutions across multiple industries and use cases, as well as experience in delivering complex, large-scale projects.
Core42 is the UAE's national-scale enabler for cloud and generative AI, combining G42 Group's expertise across multiple technology disciplines into a single platform for public sector and large enterprise transformations. Building on our capabilities as a sovereign cloud and HPC specialist, we bring generative AI, cybersecurity, and professional and managed services expertise to enable national-scale program deployments across industries.
Responsibilities:
• Oversee the design, deployment, and optimization of the HPC infrastructure, including hardware, platform, software, networking, and storage components.
• Participate in the preparation and review of HLD and LLD documents, scopes of work, RFIs, RFPs, and RFQs.
• Lead efforts to maximize the efficiency and performance of HPC systems, ensuring optimal resource utilization and minimal downtime.
• Collaborate closely with product and architecture teams to understand and implement customer computational needs and requirements. Provide tailored technical solutions that align with the company's strategic goals.
• Develop and implement automation solutions and tools for deployment and management.
• Set up monitoring, logging, and alerting systems.
• Act as L3 support for complex technical issues, perform root cause analysis, and implement solutions to ensure the reliability and availability of HPC systems.
• Maintain comprehensive documentation of HPC configurations, procedures, and best practices to facilitate knowledge sharing and future reference.
• Ensure the security and compliance of the HPC infrastructure by implementing necessary safeguards and adhering to company standards and regulations.
• Collaborate with HPC vendors and suppliers for hardware and software procurement, support, and delivery.
• Assist in budget planning and management for HPC-related expenditures, ensuring cost-effective solutions.
• Stay at the forefront of HPC technology trends, evaluating and recommending new technologies and practices to enhance HPC capabilities.
Qualifications:
To qualify for the role you must have
• Bachelor's degree in Information Technology, Computer Science, or a relevant field.
• Minimum 7 years of hands-on experience in High-Performance Computing (HPC) systems administration and infrastructure management.
• Advanced knowledge and expertise in configuring, optimizing, and maintaining complex HPC environments, including hardware, software, and storage systems.
• Proficiency in parallel computing principles, distributed computing, and cluster management.
• Comprehensive knowledge and hands-on experience in the system administration of Linux environments.
• Experience with job schedulers, resource managers, and workflow orchestration tools commonly used in HPC environments (Slurm, LSF, PBS, Kubernetes).
• Advanced knowledge of data center network design and related technologies (OSI model, TCP/IP stack, routing, VLAN/VXLAN, etc.).
• Competence in network design and configuration of switches/routers, including InfiniBand and RoCE.
• Experience with large-scale data storage solutions, particularly Ceph, NFS, and Lustre.
• Proficiency in one or more parallel libraries or programming models such as MPI, OpenMP, oneAPI, and CUDA.
• Competence in configuration management and infrastructure-as-code tools such as Ansible, Puppet, and Terraform, and their integration with Git.
• Strong scripting and automation skills (e.g., Python, Bash) for system administration tasks.
• Excellent problem-solving skills and the ability to troubleshoot complex HPC issues effectively.
• In-depth knowledge of performance tuning and optimization techniques for HPC systems.
• Familiarity with containerization and orchestration (Docker, Kubernetes).
• Experience with monitoring and observability (e.g., Prometheus, Grafana, Nagios, Zabbix, Ganglia, ELK).
• Effective communication and collaboration skills to work with cross-functional teams.
• Relevant certifications in cloud computing, virtualization, container technologies, and systems architecture are advantageous.
What we look for
If you are a performance-driven, inquisitive mind with the agility to adapt to ambiguity, you will fit right in. You should be eager to explore opportunities to build meaningful collaborations with stakeholders and aspire to create unique customer-centric solutions. A bias for action and a passion to conquer new frontiers in the AI space are at the heart of the Core42 community.
What working at Core42 offers
• Culture: An open, diverse and inclusive environment with a global vision that encourages personal growth and focuses on ground-breaking, industry-first innovations.
• Career: Outstanding learning, development & growth opportunities via structured training programs and innovative, high-tech projects.
• Work-Life: A hybrid work policy to strike the perfect balance between office and home.
• Rewards: A competitive remuneration package with a host of perks including healthcare, education support, leave benefits and more.
If you can confidently demonstrate that you meet the criteria above, please contact us as soon as possible.