HPC Specialist Role Overview
DNSnetworks is seeking an HPC (High-Performance Computing) Specialist to design, deploy, optimize, and manage high-performance computing systems used for scientific research, engineering simulations, AI/ML workloads, and large-scale data analytics.
Core Responsibilities
- Architecture & System Design
  - Design scalable HPC clusters (on-prem, cloud, or hybrid)
  - Choose appropriate CPUs, GPUs, interconnects (e.g., InfiniBand), and storage
  - Configure job schedulers such as Slurm or PBS, including OpenHPC-based deployments
- Cluster Deployment & Maintenance
  - Install and manage Linux-based compute nodes
  - Maintain job schedulers and resource managers
  - Integrate monitoring tools (Prometheus, Grafana, Nagios)
- Performance Tuning & Optimization
  - Benchmark workloads and tune for performance (e.g., MPI, CUDA, OpenMP)
  - Optimize I/O and inter-node communication
  - Ensure efficient job execution and queue handling
- Cloud & Hybrid HPC Integration
  - Deploy and manage cloud-based HPC environments (Azure CycleCloud, AWS ParallelCluster, Google Cloud HPC Toolkit)
  - Optimize workload portability and orchestration (e.g., Singularity, Kubernetes with Kubeflow or Volcano)
- AI/ML & GPU Workload Support
  - Manage AI pipelines that require HPC acceleration (e.g., LLM training)
  - Optimize GPU usage (NVIDIA A100/H100, AMD MI300)
  - Interface with TensorFlow, PyTorch, and HPC/ML tools
- Security & Compliance
  - Implement security best practices for multi-user environments
  - Support data governance for sensitive or regulated workloads
  - Maintain audit trails and role-based access control
- User & Application Support
  - Assist researchers and data scientists with job submissions and optimization
  - Develop documentation and training materials, and run code validation sessions
Qualifications
- Bachelor's or Master's degree in Computer Science, Engineering, Physics, or a related technical field
- 3 to 7 years of experience in HPC environments
- Experience supporting AI/ML teams is a major asset
- Certifications: NVIDIA DLI, AWS Certified HPC Specialist, Linux+ or RHCE