About This Role
The A*STAR HPC System Engineer role requires expertise in designing, optimizing, and maintaining NSCC's supercomputing infrastructure, including compute, interconnect, and storage components.
Responsibilities
- Evaluate HPC system architecture across compute, interconnect, and storage components.
- Collaborate with administrators to ensure system reliability and performance.
- Assist in performance tuning and root-cause analysis for complex issues.
- Develop utility tools for system diagnostics and performance profiling.
- Configure job schedulers (Slurm, PBS Pro) to maximize resource utilization and throughput.
- Define security policies in collaboration with administrators.
Requirements
- Degree in Computer Science, Engineering, IT, or another relevant field.
- At least 3 years of experience managing HPC systems.
- Proficiency in UNIX/Linux environments and the command-line interface (CLI).
- Experience with cluster management software (xCAT, BCM, PHPC, HPCM).
- Experience with job scheduling and workload management software (Slurm or PBS Pro).
- Strong knowledge of HPC storage principles and parallel file systems (Lustre, GPFS, BeeGFS).
- Understanding of RDMA-based interconnects (InfiniBand, RoCE).
- Basic knowledge of network protocols such as DHCP, DNS, and TFTP.