Site Reliability Engineer

Boson
Toronto, ON
$150,000 - $300,000 / year
Experienced
Posted 22 days ago

Boson AI is looking for a Site Reliability Engineer to manage our cluster of GPU, CPU and storage servers, which we use to train and serve large AI models. You'll work on the latest NVIDIA GPUs and with tens of PB of storage. Your responsibilities include configuring, administering and maintaining the system.

You will join a team responsible for Boson AI's datacenter and beyond. Ideally you should live in the Toronto region and be prepared to go to the datacenter when an on-call incident requires physical presence (this workload is shared with other team members).

Tasks and Responsibilities

  • Firewall and IDS (e.g. OPNsense)
  • Storage system (Ceph arrays) for file and object storage
  • Machine provisioning and job scheduling (e.g. MAAS and Slurm) for general-purpose and ML training jobs
  • Kubernetes administration
  • UFM configuration (Unified Fabric Manager for InfiniBand)
  • Ethernet switch configuration (STP, BGP, RoCE)
  • VPN, LDAP and related configuration
  • Logging and monitoring (e.g. Grafana and Prometheus)
  • Network security

Desirable Qualifications and Experience

  • Systems administration
  • VPN / IDS / Security expertise
  • Shell scripting and programming
  • Networking experience (management and design)
  • Hardware experience (deployment, configuration, installation)

$150,000 - $300,000 a year

Compensation is competitive and will depend on the level of seniority of the role.

You should have worked in a related role before. A GitHub profile with publicly visible code is a plus, as are other artifacts that can be reviewed. If you are a fresh graduate, please let us know, as we might have similar roles for you, too.
