GuruLink
Toronto, ON
Job Details:
Location: REMOTE / Quebec, Quebec
This job allows you to work remotely.
Talent.com is one of the largest job search and recruitment platforms in the world, with over 35 million job listings across 78 countries and tens of millions of unique monthly visitors. Our mission is to centralize all jobs available on the web — from company career sites, staffing agencies, and job boards — into a single, seamless experience for job seekers and employers.
Headquartered in Montreal with a global team across North America, South America, and Europe, we're investing in a new generation of AI-native products that will redefine how people find work and how employers find talent.
Talia is one of those products. Talia is the workforce-intelligence layer for high-volume hiring. It's how Talent.com is going from job platform to AI-native software company.
You'll join the Talia team — a small, focused team operating as a startup inside Talent.com. Same speed and autonomy as seed-stage. Distribution, data, and customer relationships of a 15-year-old platform underneath. You'll work directly with the Engineering Lead (your daily collaborator), alongside Product and design-partner customers across healthcare, retail, residential manufacturing, BPO/contact center, and education.
About the Role:
Talia runs agentic AI in production for regulated, high-volume hiring — which means the infrastructure underneath it has to be fast, observable, cheap enough to scale, and defensible to a customer's security team. We're looking for a Senior Backend / Infrastructure Engineer to own that foundation: the AWS footprint, the deployment pipeline, the workflow orchestration layer, and the observability that makes agentic decisions measurable.
This is a hands-on platform role with real ownership. You'll own infrastructure-as-code end-to-end, set the reliability and cost bars, and build the production scaffolding the rest of the team ships on top of. You'll be the person who turns "it works on my machine" into "it's deployed, monitored, and within budget."
How We Build (AI-First):
AI-first development is how this team operates: coding agents are part of the daily workflow, not a novelty. You're expected to bring AI into your own process — scaffolding, refactoring, test generation, code review, research — and to have strong, earned opinions about where it makes you faster and where it gets in the way. The codebase is built to be legible to both humans and agents, and you'll help keep it that way. We measure ourselves by output and judgment, not hours at a keyboard: AI handles the grind so you spend your time on the calls that actually matter.
What You'll Do:
-Own the AWS infrastructure end-to-end — Terraform across dev/staging/prod, ECS Fargate, ALB, ECR, Secrets Manager, networking, and IAM.
-Own and harden the GitLab CI/CD pipeline: build, test, migration, and release stages, including the self-hosted GitLab runners and semantic-release flow.
-Run the Temporal layer that orchestrates Talia's durable agentic workflows — deployment, worker scaling, versioning, and failure recovery.
-Build the observability and cost layer: Prometheus + Grafana dashboards for agent performance, LLM cost, and integration health, plus the alerting that catches regressions before customers do.
-Own the data infrastructure — Postgres (managed/RDS), connection pooling, async access patterns with SQLAlchemy/asyncpg, and safe Alembic migrations against live multi-tenant data.
-Make the latency/cost/reliability trade-offs on inference and data pipelines explicit and measurable — and tune them.
-Build the security and compliance scaffolding — secrets management, audit logging, tenant isolation — that clears customer security, procurement, and legal review.
-Partner with the Engineering Lead and product engineers to keep the deployment story fast: ship multiple times a day, safely.
-Set the operational bar — runbooks, on-call hygiene, incident response — and help interview future infra-minded hires.
Who You Are:
You don't need a specific number of years. You need this body of evidence:
-Built and owned production infrastructure for a real product — not a side project, not a single Terraform module you inherited. You've held the pager.
-Fluent in AWS + Terraform, used in anger on real systems. You've designed a footprint from networking up, and you know where it bites at scale.
-Run containerized services in production (ECS/Fargate, Kubernetes, or equivalent) — deploys, rollbacks, autoscaling, the works.
-Built observability that someone actually used to find a problem at 2am — metrics, traces, dashboards, alerts that mean something.
-Operated databases under real load — migrations against live data, connection limits, the failure modes of async access.
-Comfortable owning cost. You've looked at a cloud or inference bill, found the line that mattered, and cut it without breaking the product.
-Strong Python. You read and write application code, not just YAML and HCL — you can meet the product engineers where they are.
-Excellent communicator who can translate between technical and non-technical audiences, including a customer's security reviewer.
Nice to Have Skills:
-Production experience with Temporal (or comparable workflow orchestration — Airflow, Step Functions, Cadence).
-Operated infrastructure that supports LLM/agentic systems in production — inference cost controls, provider rate limits, latency budgets.
-Direct experience with Anthropic, OpenAI, or other major model providers in production, including the operational side (quotas, fallbacks, migrating between them).
-Self-hosted Prometheus + Grafana, or equivalent (Datadog, OpenTelemetry) at production grade.
-Cleared a regulated infrastructure gate — SOC 2, security review, data-residency, or equivalent.
-Multi-tenant SaaS infrastructure experience, especially tenant isolation and per-tenant cost attribution.
-Open source contributions or technical writing in the infra / platform / AI-ops space.