Blink Health is seeking a Staff Site Reliability Engineer to establish and advance reliability engineering practices across its healthcare technology platforms. This senior technical leadership role is responsible for improving the resilience, scalability, and operational excellence of the cloud infrastructure and application services that support millions of patients.
Responsibilities:
- Establish and evolve SRE best practices including SLIs, SLOs, error budgets, incident response, and postmortems.
- Define and drive observability strategy across metrics, logging, tracing, dashboards, and alerting.
- Design and implement automation to reduce operational toil and improve system reliability.
- Lead large, ambiguous infrastructure and reliability initiatives from concept through delivery.
- Partner with engineering teams to improve developer workflows, tooling, and operational readiness.
- Provide technical mentorship, architecture guidance, and design and code reviews across teams.
Requirements:
- 7+ years of experience in site reliability, infrastructure, or platform engineering roles.
- Expert-level troubleshooting across application, system, and network layers.
- Strong Linux and networking expertise, including load balancing, DNS, and TCP/IP.
- Experience with automation and tooling using languages such as Python, Go, or Bash.
- Deep experience with AWS and Kubernetes, including production-grade architectures.
- Strong background in Infrastructure as Code using Terraform or similar tools.
Benefits:
- Opportunity to work on systems that directly improve healthcare access and affordability.
- Collaborative, learning-focused engineering culture.
- Equal opportunity workplace committed to diversity and inclusion.
This role offers the chance to shape reliability at scale within a high-impact healthcare platform.