Responsibilities:
• Design and develop robust observability solutions to monitor, analyze, and troubleshoot distributed systems.
• Apply OpenTelemetry (OTel) standards and tools.
• Work with application teams to implement "self-healing", i.e., alerting that triggers automated remediation.
• Implement and configure monitoring, logging, tracing, and alerting systems to ensure comprehensive coverage of our infrastructure and applications.
• Collaborate with software engineers to instrument code for telemetry data collection and analysis.
• Optimize observability tooling and processes to improve system reliability, performance, and scalability.
• Create dashboards, reports, and visualizations to provide actionable insights into system health and performance.
• Investigate and resolve incidents by analyzing telemetry data and identifying root causes.
• Stay current with industry trends and best practices in observability, and recommend improvements to our observability strategy and infrastructure.
Qualifications:
• Bachelor's degree in Computer Science, Engineering, or a related field (or equivalent experience).
• 3-5 years of experience as an Observability Engineer or in a similar role in a production environment.
• Deep understanding of observability principles, methodologies, and tools such as Prometheus, Grafana, Jaeger, and the ELK stack.