Senior Site Reliability Manager

Dawn InfoTek Inc.

Toronto, ON

Posted today

Job Details:

Full-time

Management

Job Description

Dawn InfoTek Inc. is seeking a hands-on, strategic Senior Manager to lead the Site Reliability Engineering (SRE), Service Delivery, and Infrastructure Patching teams that support the Digital Banking Platform at a major Canadian bank. This role is crucial to our mission of providing always-on, secure, and high-performing banking services for millions of customers.

Key Responsibilities

Technical Leadership & Incident Management

Act as the senior technical escalation point for on-call teams, diagnosing and resolving complex infrastructure, cloud, and application issues.
Lead major incident response efforts, ensuring rapid restoration and comprehensive root cause analysis (RCA).
Collaborate with engineering, platform, and security teams to troubleshoot issues spanning full-stack environments (cloud, container, and legacy platforms).
Maintain high availability and performance of digital banking applications (primarily AWS, OpenShift, Linux, with some legacy WebSphere).
Champion proactive monitoring, observability, and alerting (e.g., Dynatrace, OpenSearch, Prometheus, Grafana).

SRE & Reliability Engineering

Define and implement best practices for reliability, scalability, and availability tailored to large-scale digital banking.
Continuously improve CI/CD pipelines, release automation, and deployment practices.
Drive rigorous, blameless postmortem analysis and a culture of continuous improvement.
Optimize for scalability, redundancy, and resilience to minimize customer impact from incidents.

Infrastructure & Patching

Oversee patching and maintenance for cloud and on-prem environments (AWS, OpenShift, Red Hat VMs, some WebSphere).
Implement zero-downtime patching strategies and automation to reduce operational risk and remediate security vulnerabilities.
Partner with security teams to enforce compliance, harden platforms, and remediate vulnerabilities.

Team Leadership & Process Improvement

Lead, mentor, and grow a high-performing team of 8–10 SREs and service engineers.
Drive a culture of ownership, operational excellence, and continuous learning.
Establish and enforce best practices for incident management, operational documentation, and process automation.
Collaborate with development, infrastructure, and product teams to enhance observability, deployment, and proactive issue detection.

Required Skills

Exceptional hands-on troubleshooting skills in complex, distributed, or high-availability technical environments.
Experience in observability, monitoring, and incident management for critical platforms.
Demonstrated leadership in technical settings, such as leading projects or initiatives or mentoring teams, even without prior experience as a formal people manager.
Excellent communication skills, including the ability to translate technical detail for both engineers and executives.
Bachelor's degree in a technical field.
