Job Title or Location

HPC Specialist

DNSnetworks Corp - 10 Jobs
Ottawa, ON
Posted today
Job Details:
Full-time
Entry Level

Salary:

HPC Specialist Role Overview

DNSnetworks is seeking an HPC (High-Performance Computing) Specialist responsible for the design, deployment, optimization, and management of high-performance computing systems, often used in scientific research, engineering simulations, AI/ML workloads, and large-scale data analytics.

Core Responsibilities
  1. Architecture & System Design
    • Design scalable HPC clusters (on-prem, cloud, or hybrid)
    • Choose appropriate CPUs, GPUs, interconnects (e.g., InfiniBand), and storage
    • Configure Slurm, PBS, or OpenHPC job schedulers
  2. Cluster Deployment & Maintenance
    • Install and manage Linux-based compute nodes
    • Maintain job schedulers and resource managers
    • Integrate monitoring tools (Prometheus, Grafana, Nagios)
  3. Performance Tuning & Optimization
    • Benchmark workloads and tune for performance (e.g., MPI, CUDA, OpenMP)
    • Optimize I/O and inter-node communication
    • Ensure efficient job execution and queue handling
  4. Cloud & Hybrid HPC Integration
    • Deploy and manage cloud-based HPC environments (Azure CycleCloud, AWS ParallelCluster, Google Cloud HPC Toolkit)
    • Optimize workload portability and orchestration (e.g., Singularity, Kubernetes with KubeFlow or Volcano)
  5. AI/ML & GPU Workload Support
    • Manage AI pipelines that require HPC acceleration (e.g., LLM training)
    • Optimize GPU usage (NVIDIA A100/H100, AMD MI300)
    • Interface with TensorFlow, PyTorch, and HPCML tools
  6. Security & Compliance
    • Implement security best practices for multi-user environments
    • Support data governance for sensitive or regulated workloads
    • Maintain audit trails and role-based access control
  7. User & Application Support
    • Assist researchers and data scientists with job submissions and optimization
    • Develop documentation, training materials, and run code validation sessions
Technical Skills & ToolsCategory TechnologiesSchedulersSlurm, PBS, Torque, LSFHPC OS & ConfigRHEL/CentOS, Rocky, Ubuntu ServerHPC File SystemsLustre, BeeGFS, GPFSParallel ComputingMPI, OpenMP, CUDAMonitoring/TelemetryPrometheus, Grafana, GangliaCloud HPCAWS HPC, Azure CycleCloud, GCP HPC ToolkitContainersSingularity, Apptainer, Docker, KubernetesAI/ML SupportPyTorch, TensorFlow, Horovod, MLFlowDevOps ToolsAnsible, Terraform, Git, Jenkins
Qualifications
  • Bachelors or Masters in Computer Science, Engineering, Physics, or a related technical field
  • 37 years of experience in HPC environments
  • Experience supporting AI/ML teams is a major asset
  • Certifications: NVIDIA DLI, AWS Certified HPC Specialist, Linux+ or RHCE

Share This Job: