HPC Specialist

DNSnetworks Corp

Ottawa, ON

Posted today

Job Details:

Full-time

Entry Level

HPC Specialist Role Overview

DNSnetworks is seeking an HPC (High-Performance Computing) Specialist responsible for the design, deployment, optimization, and management of high-performance computing systems, often used in scientific research, engineering simulations, AI/ML workloads, and large-scale data analytics.

Core Responsibilities

Architecture & System Design
- Design scalable HPC clusters (on-prem, cloud, or hybrid)
- Choose appropriate CPUs, GPUs, interconnects (e.g., InfiniBand), and storage
- Configure job schedulers such as Slurm or PBS, including within OpenHPC-based software stacks
Cluster Deployment & Maintenance
- Install and manage Linux-based compute nodes
- Maintain job schedulers and resource managers
- Integrate monitoring tools (Prometheus, Grafana, Nagios)
Performance Tuning & Optimization
- Benchmark workloads and tune for performance (e.g., MPI, CUDA, OpenMP)
- Optimize I/O and inter-node communication
- Ensure efficient job execution and queue handling
Cloud & Hybrid HPC Integration
- Deploy and manage cloud-based HPC environments (Azure CycleCloud, AWS ParallelCluster, Google Cloud HPC Toolkit)
- Optimize workload portability and orchestration (e.g., Singularity, Kubernetes with KubeFlow or Volcano)
AI/ML & GPU Workload Support
- Manage AI pipelines that require HPC acceleration (e.g., LLM training)
- Optimize GPU usage (NVIDIA A100/H100, AMD MI300)
- Interface with TensorFlow, PyTorch, and HPC/ML tools
Security & Compliance
- Implement security best practices for multi-user environments
- Support data governance for sensitive or regulated workloads
- Maintain audit trails and role-based access control
User & Application Support
- Assist researchers and data scientists with job submissions and optimization
- Develop documentation, training materials, and run code validation sessions

Technical Skills & Tools

Category: Technologies
- Schedulers: Slurm, PBS, Torque, LSF
- HPC OS & Config: RHEL/CentOS, Rocky, Ubuntu Server
- HPC File Systems: Lustre, BeeGFS, GPFS
- Parallel Computing: MPI, OpenMP, CUDA
- Monitoring/Telemetry: Prometheus, Grafana, Ganglia
- Cloud HPC: AWS HPC, Azure CycleCloud, GCP HPC Toolkit
- Containers: Singularity, Apptainer, Docker, Kubernetes
- AI/ML Support: PyTorch, TensorFlow, Horovod, MLFlow
- DevOps Tools: Ansible, Terraform, Git, Jenkins

Qualifications

Bachelor's or Master's degree in Computer Science, Engineering, Physics, or a related technical field
3 to 7 years of experience in HPC environments
Experience supporting AI/ML teams is a major asset
Certifications: NVIDIA DLI, AWS Certified HPC Specialist, Linux+ or RHCE
