Senior Cloud Engineer

RLDatix

Toronto, ON

Full-time

Executive

Posted 14 days ago

Sr. Cloud Engineer

Location: Canada - Toronto/Remote

Every single day around the world, thousands of patients are harmed by care delivery errors, many of which are preventable. We want to change that. RLDatix is on a mission to improve healthcare by enabling a world where patients receive the best and safest care possible. Trusted by thousands of clients around the world, our connected healthcare operations platform combines software and trusted services to empower organizations with critical data insights across risk, safety, compliance, provider lifecycle and workforce management. Our user-centric approach provides a holistic, real-time view of healthcare operations, connecting disparate information across the enterprise – thus giving organizational leadership the contextualized data they need to make better-informed decisions.

RLDatix is truly global, with over 2,000 employees across the UK, Europe, Middle East, Australia, Canada, and the United States. Our strategy is fueled by organic and inorganic growth that brings together the brightest minds and the latest technology – including AI – to deliver market-leading solutions for our clients. We are looking for people to join our team who are passionate about making a positive change in healthcare. Join us as we work towards our vision of safer, better healthcare for all.

What You Will Do:

Cloud engineers play a pivotal role in ensuring the smooth operation of all user-facing services and production systems within RLDatix. They embody a hybrid profile, combining practical operational skills with software engineering expertise to apply robust engineering principles, operational rigor, and advanced automation across our operational environments and the RLDatix codebase.

Specializing in systems, including operating systems, storage subsystems, and networking, cloud engineers implement leading practices to enhance availability, reliability, and scalability. They possess diverse interests spanning algorithms and distributed systems.

RLDatix presents distinct challenges due to its status as a prominent platform in the healthcare sector, necessitating specialized expertise to address unique operational demands. The insights and experience gained by the cloud engineering team inform and benefit other engineering groups within the organization, as well as RLDatix customers managing self-hosted installations.

Additionally, you will:

Be on an on-call (PagerDuty) rotation to respond to incidents that impact RLDatix's availability, and provide support for service engineers with customer incidents.
Use your on-call shift to prevent incidents from ever happening.
Run our infrastructure with Ansible, Terraform, GitHub CI/CD, Helm, Argo and Kubernetes.
Build monitoring that alerts on symptoms rather than on outages.
Document every action so your findings turn into repeatable actions and then into automation.
Improve operational processes (such as deployments and upgrades) to make them as boring as possible.

Design, build and maintain core infrastructure that enables RLDatix to scale to support hundreds of thousands of concurrent users.
Debug production issues across services and levels of the stack.
Plan the growth of RLDatix's infrastructure.

The role requires an individual who is a critical thinker, adept at analyzing systems to anticipate edge cases, failure modes, and behaviors, as well as implementing specific solutions. Proficiency in Linux and Unix Shell is essential, along with a deep understanding of configuration management systems like Terraform and Ansible. Strong programming skills in Shell, Python, and/or Go are necessary for automation and scripting tasks. Collaboration and asynchronous communication are key, as is a commitment to thorough documentation to avoid redundant learning. An enthusiastic, proactive attitude toward problem-solving and continuous improvement is paramount, along with a drive to deliver quickly, effectively, and iteratively. Candidates should align with our values and demonstrate experience with technologies such as Nginx, HAProxy, Docker, Kubernetes, Terraform, or similar tools.

What You Will Gain:

This is an excellent opportunity to join our technology company. We need to stay as close to the epicenter of pioneering solutions as possible. Even more, it should be us who develop these solutions. It's a common belief among the Cloud Team that we need to listen to our colleagues if we want to find the best solutions. On top of that, we run most of our projects according to Agile principles. You'll also be working on assignments that make a big impact on the future of healthcare globally.

Experience/Knowledge You Will Need:

Candidates should ideally have:

Technical

Basic knowledge of 4 technical expertise areas, with deep knowledge of 2 areas

Configuration Management tools including Ansible (basic syntax, tasks, playbooks).
Terraform basic syntax and GitHub CI/CD configuration, pipelines, jobs.
Cloud resources provisioning and configuration through CLI/API
Kubernetes basic understanding, CLI, service re-provisioning
Provisioning and setting up metrics, alerts and silences in Datadog or similar tooling
Understanding of how to run basic queries in logging tools to answer general questions.

Operating system (Linux) configuration, package management, startup, and troubleshooting
Block and object storage configuration.
Networking: VPCs, proxies and CDNs

Working knowledge of RLDatix products, including deeper knowledge in the area of performance management

Execution

Identifies significant projects that result in substantial improvements in reliability, cost savings and/or revenue.
Identifies changes to the product architecture from the reliability, performance and availability perspectives with a data-driven approach.
Influences the product roadmap and works with engineering and product counterparts to improve the resiliency and reliability of the RLDatix product.
Proactively works on efficiency and capacity planning to set clear requirements and reduce system resource usage, making RLDatix cheaper to run for all our customers.
Identifies parts of the system that do not scale, provides immediate palliative measures, and drives long-term resolution of these incidents.
Identifies Service Level Indicators (SLIs) that will align the team to meet the availability and latency objectives.

Collaboration and Communication

Leads initiatives, including problem definition and scoping, design, and planning, through epics and blueprints.
Has deep domain knowledge and radiates that knowledge through recorded demos, technical presentations, discussions, and Incident Reviews.
Performs and runs blameless RCAs on incidents and outages, aggressively looking for answers that will prevent the incident from ever happening again.

Influence and Maturity

Sets an example for the team of Cloud Engineers with positive and inclusive leadership and discussion of work.
Shows ownership of a major part of the infrastructure.
Is trusted to de-escalate conflicts inside the team.
