As a Platform/Site Reliability Engineer (SRE), you will play a key role in establishing and enhancing our engineering platform. You will help ensure the reliability, scalability, and efficiency of our systems while developing tools that improve engineering productivity.
You will also help define and shape our platform strategy, set best practices, and drive initiatives that enhance developer experience, system performance, and operational efficiency.
What You Will Be Doing
- DevOps & Infrastructure: Design, implement, and maintain scalable infrastructure to support engineering needs.
- CI/CD Optimization: Improve our continuous integration and continuous deployment pipelines using AWS CDK, including defining requirements for a deployment tool and a database migration tool, to enable fast and reliable releases.
- Release Tracking & Deployment: Establish visibility into release cycles, implement automation to streamline deployments, and ensure smooth rollouts.
- Site Reliability & Observability: Implement monitoring, logging, and alerting systems to ensure high availability and performance of services.
- Internal Tooling: Build and maintain tools that improve developer efficiency, automate repetitive tasks, and enhance productivity.
- Security & Compliance: Ensure security best practices are followed in infrastructure, deployments, and internal systems, with a focus on SOC, ISO, and GDPR compliance.
What We're Looking For
- 7+ years of technical experience, including 5+ years as an SRE or in a similar role. Prior startup experience is preferred but not required.
- Deep expertise in AWS, including Fargate and Kubernetes for container orchestration.
- Strong experience with CI/CD pipelines, specifically using AWS CDK, including deployment and database migration tools.
- Proficiency in observability tools (Datadog, Prometheus, Grafana) and performance monitoring.
- Deep understanding of scaling strategies and highly available architectures.
- Experience with scripting and automation using Python, Bash, or TypeScript.
- Knowledge of security best practices; experience with SOC, ISO, and GDPR compliance is a bonus.
- Ability to collaborate cross-functionally with engineering teams to drive platform improvements.
Our Tech Stack
- Infrastructure: AWS, Fargate, Redis, PostgreSQL, SQS, CDK, GitHub, Retool
- Backend: Django REST framework, Celery
- Frontend: Next.js, Tailwind CSS
- LLM: OpenAI, Claude, AWS Bedrock