Lead Site Reliability Engineer

GuruLink - 136 jobs

Toronto, ON

Posted yesterday

Job details:

Full-time
Experienced

Location: REMOTE / Toronto, Ontario
This job allows you to work remotely.

We are partnering with a high-growth software company that operates a globally distributed, large-scale cloud platform supporting billions of daily transactions. The organization is focused on building highly available, data-intensive systems and is seeking a Lead Site Reliability Engineer to help shape the future of its infrastructure and reliability strategy.
This role combines hands-on technical leadership with long-term architectural ownership. You will play a key role in designing, scaling, and evolving mission-critical platform services across a multi-cloud environment while helping drive operational excellence, automation, observability, and platform reliability initiatives.
What You'll Do:
- Define and champion infrastructure automation strategies that reduce operational overhead, improve system performance, and enhance overall platform stability.
- Lead the design, reliability, and long-term evolution of core platform services, ensuring they align with business objectives and scalability requirements.
- Architect and guide the strategy for centralized logging and observability systems, balancing performance, availability, retention requirements, and operational costs.
- Establish frameworks for capacity planning, performance monitoring, and system optimization, proactively identifying opportunities to improve scalability and efficiency.
- Drive cross-functional reliability initiatives in partnership with engineering teams, influencing architectural decisions and promoting resilient service design practices.
- Identify systemic risks, operational bottlenecks, and platform improvement opportunities, taking ownership of solutions with a high degree of autonomy.
- Mentor engineers across the organization on reliability engineering principles, operational excellence, and distributed systems best practices.
Why This Opportunity:
- Work on systems operating at massive scale with demanding performance and reliability requirements.
- Influence platform strategy and architecture across a globally distributed cloud environment.
- Join a collaborative engineering culture that values technical excellence, ownership, and continuous improvement.
- Lead initiatives that directly impact the scalability, availability, and future growth of the platform.

Must Have Skills:
What We're Looking For:
- Demonstrated success in Site Reliability Engineering or Software Engineering roles, with experience designing, building, and operating highly scalable and resilient distributed systems.
- Deep expertise with distributed data and messaging platforms such as Apache Kafka, Apache Pulsar, ScyllaDB, Cassandra, Grafana Loki, or similar technologies.
- Strong experience creating and scaling automation frameworks, operational tooling, and performance analysis practices for large-scale production environments.
- 6+ years of hands-on experience in Site Reliability Engineering or Software Engineering, including ownership of cloud infrastructure strategy across AWS and GCP environments.
- Experience designing and leading observability platforms, including monitoring standards, service-level objectives (SLOs), alerting frameworks, and telemetry strategies.
- Exposure to technologies such as Prometheus, Grafana, Loki, Tempo, Thanos, or similar tooling is highly desirable.
- Proven track record of improving incident management processes, monitoring strategies, operational readiness, and on-call effectiveness within engineering organizations.
- Expert-level experience with Infrastructure as Code practices and tooling, including Terraform, Chef, or comparable platforms.
- Advanced Kubernetes expertise, including cluster architecture, multi-tenant environments, workload optimization, and large-scale container orchestration.
- Strong software development skills with experience across multiple languages and runtimes, such as Go, Node.js, Python, Ruby, and shell scripting.
- Advanced Linux systems knowledge, including performance analysis, troubleshooting, tuning, and root-cause investigation of complex infrastructure issues.
