Overview
The Lead Systems Engineer - Computing Technology engages in the design, leads implementation, and provides Level 3 expert support for large-scale private Cloud computing and/or HPC infrastructure, with a specific emphasis on computing technologies including hardware layer, operating system, hypervisor, and orchestration services.
Responsibilities
Co-design, lead implementation, and manage hybrid virtualization and containerized platforms based on OpenStack, VMware VCF, and/or Red Hat OpenShift, ensuring platform stability, performance, and compliance with industry standards and best practices.
Define and oversee the implementation of the roadmap for all Virtualization and HPC platforms across the company.
Collaborate with architecture and engineering teams on technology stack component evaluation and selection, ensuring solutions are designed following best practices and optimized from both functional and non-functional perspectives.
Lead regular capacity planning exercises to anticipate and accommodate the growing demands on the virtualized environment and HPC infrastructure, ensuring it meets current and future requirements.
Develop and oversee plans to enhance the reliability of the computing infrastructure, addressing potential points of failure and ensuring high availability of services.
Lead regular performance assessments and implement improvements based on findings in collaboration with relevant teams.
Define and oversee execution of disaster recovery strategies ensuring system integrity, availability, and protection across all platforms and environments.
Design and enhance the observability stack in collaboration with the infrastructure operations team, ensuring monitoring coverage and accuracy.
Provide L3 expert support, including on-call shifts, and act as the final tier of resolution for L2 support teams through problem analysis and communication with vendors' technical support.
Lead the collaboration with architecture and engineering teams on technology stack component evaluation and selection, ensuring solutions adhere to best practices and are optimized for both functional and non-functional requirements.
Lead the analysis and implementation of performance optimization strategies for the cloud computing and/or HPC environment to maximize efficiency and resource utilization.
Lead and mentor a team of engineers and collaborate with other infrastructure engineering and systems architect teams on solution design and delivery.
Collaborate with security management teams to ensure that systems are safe and secure against cybersecurity threats.
Write and maintain relevant documentation, ensuring completeness and quality.
Work closely with process management and operational teams, contributing to process development, standardizing the collaboration framework, and improving collaboration efficiency.
Participate in the hiring process by conducting technical interviews and contributing to the team's growth and expertise.
Qualifications
Bachelor's or master's degree in Computer Science, Engineering, Software Engineering, or a related technology field.
2+ years of experience leading a team of 3+ engineers, holding accountability for quality and timely delivery of infrastructure projects.
7+ years of experience and deep expertise in designing, implementing, and managing private cloud stacks with a focus on compute and virtualization technologies.
Extensive hands-on experience with at least one of the following platforms/stacks: OpenStack, Apache CloudStack, VMware VCF, or Red Hat OpenShift, and with related computing technologies such as x86 hardware, OS, KVM/ESXi, and orchestration services.
7+ years of hands-on experience in Linux environments and 3+ years of experience in a senior systems or infrastructure engineering role.
Profound understanding of hardware architecture and components (x86 and ARM, NUMA, types of memory and channels, types of NICs, etc.).
Good understanding of network and storage types and architecture.
Good understanding of Cloud Native concepts and technologies.
Experience in managing large-scale public or private cloud environments and/or working in a cloud service provider environment is highly desirable.
Advanced programming and scripting skills using Python and/or Golang, and Bash.
Good knowledge of data center network designs and related technologies (OSI model, TCP/IP stack, routing, VLAN/VXLAN, etc.).
Understanding of storage types, architecture, and protocols, such as object, block, and file storage, NFS/SMB, iSCSI, FC, etc.
Experience with integration of identity management, access management, and authorization solutions (PKI, LDAP, OAuth, OpenID).
Hands-on experience with monitoring and observability tools like Zabbix or Nagios, Prometheus, Grafana, ELK Stack (Elasticsearch, Logstash, Kibana).
Understanding of CI/CD principles, the Infrastructure as Code (IaC) approach, and software-defined infrastructure solutions.
Experience with database management and optimization for both SQL and NoSQL databases such as MySQL, PostgreSQL, MongoDB, or Cassandra is highly desirable.
Experience with ITSM tools such as Jira, Redmine, ServiceNow, etc.
Relevant certifications in Linux, virtualization, and cloud computing are a plus.
Knowledge of and experience working with GPU hardware and AI hardware accelerators is a plus.
Strong organizational skills with the ability to multitask and prioritize.
A proactive approach to problem-solving and decision-making.