Requisition ID: 182309
Tangerine is Canada's leading direct bank. We offer flexible and accessible banking options, innovative products, and award-winning Client service. The reason why Tangerine employees come to work each day is to help Canadians live better lives. We focus on making a difference in our communities, and that includes our own internal community. It's important to us that our employees feel empowered and enthusiastic about belonging to our Orange culture.
The Digital Engineering Operations SRE team comprises Site Reliability Engineers and Software Developers who improve the availability, scalability, performance, and reliability of Scotia Digital production services. The team proactively looks for ways to improve application monitoring, addresses production issues, and investigates and assists with customer inquiries.
Is this role right for you?
Are you passionate about improving automation and ensuring the resiliency of technology? Do you get your energy from delivering technology solutions as part of a team? We are currently seeking an experienced Site Reliability Engineer who is curious and derives insights from massive-scale data in real time. Specifically, we are searching for someone who brings fresh ideas, demonstrates a unique and informed viewpoint, and enjoys collaborating with a cross-functional team to investigate and help resolve recurring and major issues and to improve the performance of our supported applications.
- You will run the production environment by monitoring availability and taking a holistic view of system health.
- You will improve the reliability, quality, and time-to-market of our suite of software solutions.
- Measure and optimize system performance to push our capabilities forward, get ahead of customer needs, and innovate to improve continually.
- You will provide primary operational support and engineering for multiple large, distributed software applications.
- Participate in defining SLIs, SLOs and SLAs for Enterprise Systems.
- Gather and analyze metrics from both applications and infrastructure to assist in performance tuning and fault finding.
- Partner with development teams to improve services through rigorous testing and release procedures.
- Participate in system design consulting, release management, and capacity planning.
- Create sustainable systems and services through automation and process improvements.
- Balance feature development speed and reliability with well-defined service level objectives.
- Monitor the health of multiple applications and discover optimization opportunities in a large, complex, continuously growing hybrid environment.
- Lead on-call problem escalation and outage recovery efforts, covering not only code fixes in the presentation and integration layers but also infrastructure-level investigation and support where necessary.
- Lead post-incident technical retrospectives to discover and implement remediation actions.
- You will be part of a 24/7 on-call rotation and support multiple applications and occasional weekend releases.
- You will troubleshoot, deploy systems, and execute maintenance tasks as necessary to meet the specified SLOs.
Required skills and experience:
- Be self-motivated, autonomous and a team player in a fast-paced environment.
- Good understanding of networking concepts: TCP/IP, DNS, HTTP, TLS, and the OSI model.
- Good understanding of multi-tier applications and microservices (Docker, Kubernetes, etc.).
- Experience instrumenting and monitoring cloud-hosted software stacks (preferably GCP).
- Working knowledge of one or more programming languages (Java, NodeJS, Python, etc.).
- Basic knowledge of one or more scripting or infrastructure-as-code tools (Ansible, Terraform, Bash, etc.).
- 1-2 years of experience in developing and/or supporting complex, large-scale customer-facing platforms.
- Strong working experience with incident management and setting up monitoring alerts.
- Proficient understanding of code versioning tools, such as Git and Bitbucket.
- Knowledge of building a highly automated production monitoring and support model, with hands-on experience integrating Splunk, Ansible, Dynatrace, Sumo Logic, ServiceNow, PagerDuty, or equivalents.
- Proven ability to translate ideas into technical and business realities and map technology to business problems.
- Experience with private/public cloud services and platforms.
- Superior verbal and written communication skills with the ability to influence decision-making with stakeholders.
- A proactive approach to spotting problems, areas for improvement, and performance bottlenecks.
- Exceptional written and verbal communication skills.
- Excellent problem-solving skills.
- Flexible approach to work and the ability to adapt to change.
- Prior production support or SRE experience.
- Proficient with the Microsoft Office suite.
Nice to have:
- Experience working with scalable containerized systems in the public cloud (GCP etc.).
- Experience with Docker (or other container runtimes) and Kubernetes.
- Experience in building public and internal REST APIs.
- Experience with CI/CD tools such as Jenkins.
- Experience working with database technologies such as SQL Server and Oracle.
- Experience with Atlassian tools (Jira, Confluence).
What's in it for you?
- We have an inclusive and collaborative working environment that encourages creativity and curiosity and celebrates success!
- Dress codes don't apply here; being comfortable does.
- We provide you with the tools and technology needed to create meaningful customer experiences.
- Onsite cafeteria for when you work onsite.
- We offer a competitive total rewards package that includes a base salary, a performance bonus, company matching programs (on pension & profit sharing), generous vacation, personal & sick days, personal development funding, maternity leave top-up, parental leave, and more.
Location(s): Canada : Ontario : Toronto
At Tangerine, we value the unique skills and experiences each individual brings to the team, and we are committed to creating and maintaining an inclusive and accessible environment. If you require accommodation during the recruitment and selection process, please let our Recruitment team know.