Dawn InfoTek Inc. is seeking a hands-on and strategic Senior Manager to lead our Site Reliability Engineering (SRE), Service Delivery, and Infrastructure Patching teams supporting the Digital Banking Platform at a major Canadian bank. This role is crucial to our mission of providing always-on, secure, and high-performing banking services for millions of customers.
Key Responsibilities
Technical Leadership & Incident Management
- Act as the senior technical escalation point for on-call teams, diagnosing and resolving complex infrastructure, cloud, and application issues.
- Lead major incident response efforts, ensuring rapid restoration and comprehensive root cause analysis (RCA).
- Collaborate across engineering, platform, and security to troubleshoot issues spanning full-stack environments (cloud, container, and legacy platforms).
- Maintain high availability and performance of digital banking applications (primarily AWS, OpenShift, Linux, with some legacy WebSphere).
- Champion proactive monitoring, observability, and alerting (e.g., Dynatrace, OpenSearch, Prometheus, Grafana).
SRE & Reliability Engineering
- Define and implement best practices for reliability, scalability, and availability tailored to large-scale digital banking.
- Continuously improve CI/CD pipelines, release automation, and deployment practices.
- Drive rigorous postmortem analysis and a blameless culture of continuous improvement.
- Optimize for scalability, redundancy, and resilience to minimize customer impact from incidents.
Infrastructure & Patching
- Oversee patching and maintenance for cloud and on-prem environments (AWS, OpenShift, Red Hat VMs, some WebSphere).
- Implement zero-downtime patching strategies and automation to reduce operational risk and security vulnerabilities.
- Partner with security teams to enforce compliance, harden platforms, and remediate vulnerabilities.
Team Leadership & Process Improvement
- Lead, mentor, and grow a high-performing team of 8–10 SREs and service engineers.
- Drive a culture of ownership, operational excellence, and continuous learning.
- Establish and enforce best practices for incident management, operational documentation, and process automation.
- Collaborate with development, infrastructure, and product teams to enhance observability, deployment, and proactive issue detection.
Required Skills
- Exceptional hands-on troubleshooting skills in complex, distributed, or high-availability technical environments.
- Experience in observability, monitoring, and incident management for critical platforms.
- Demonstrated leadership in technical settings, such as leading projects or initiatives or mentoring teams; prior experience as a formal people manager is not required.
- Excellent communication skills, with the ability to translate technical detail for both engineers and executives.
- Bachelor's degree in a technical field.