Senior Site Reliability Manager

Dawn InfoTek Inc.
Toronto, ON
Posted today
Job details:
Full-time
Management

Job Description

Dawn InfoTek Inc. is seeking a hands-on and strategic Senior Manager to lead our Site Reliability Engineering (SRE), Service Delivery, and Infrastructure Patching teams supporting the Digital Banking Platform at a major Canadian bank. This role is crucial to our mission of providing always-on, secure, and high-performing banking services for millions of customers.

Key Responsibilities

Technical Leadership & Incident Management

  • Act as the senior technical escalation point for on-call teams, diagnosing and resolving complex infrastructure, cloud, and application issues.
  • Lead major incident response efforts, ensuring rapid restoration and comprehensive root cause analysis (RCA).
  • Collaborate across engineering, platform, and security to troubleshoot issues spanning full-stack environments (cloud, container, and legacy platforms).
  • Maintain high availability and performance of digital banking applications (primarily AWS, OpenShift, Linux, with some legacy WebSphere).
  • Champion proactive monitoring, observability, and alerting (e.g., Dynatrace, OpenSearch, Prometheus, Grafana).

SRE & Reliability Engineering

  • Define and implement best practices for reliability, scalability, and availability tailored to large-scale digital banking.
  • Continuously improve CI/CD pipelines, release automation, and deployment practices.
  • Drive rigorous postmortem analysis and a culture of blameless continuous improvement.
  • Optimize for scalability, redundancy, and resilience, minimizing customer impact from incidents.

Infrastructure & Patching

  • Oversee patching and maintenance for cloud and on-prem environments (AWS, OpenShift, Red Hat VMs, some WebSphere).
  • Implement zero-downtime patching strategies and automation to mitigate operational risk and security vulnerabilities.
  • Partner with security teams to enforce compliance, harden platforms, and remediate vulnerabilities.

Team Leadership & Process Improvement

  • Lead, mentor, and grow a high-performing team of 8–10 SREs and service engineers.
  • Drive a culture of ownership, operational excellence, and continuous learning.
  • Establish and enforce best practices for incident management, operational documentation, and process automation.
  • Collaborate with development, infrastructure, and product teams to enhance observability, deployment, and proactive issue detection.

Required Skills

  • Exceptional hands-on troubleshooting skills in complex, distributed, or high-availability technical environments.
  • Experience in observability, monitoring, and incident management for critical platforms.
  • Demonstrated leadership in technical settings, which may include leading projects or initiatives or mentoring teams, even without prior experience as a formal people manager.
  • Excellent communicator, able to translate technical detail for both engineers and executives.

  • Bachelor's degree in a technical field.
