Principal Site Reliability Engineer - AI
New York, New York
Full Time
$200k - $250k
Our client is an AI-driven health-tech start-up on a mission to transform patient care through intelligent, secure, and highly reliable clinical automation tools. Their platform powers real-time insights for clinicians, improving patient outcomes and enabling healthcare systems to operate with unprecedented efficiency. They are entering a high-growth phase and are seeking a Principal Site Reliability Engineer to help scale their infrastructure and ensure world-class reliability.
Role Overview
Our client is hiring a Principal Site Reliability Engineer to serve as the technical authority for the reliability, scalability, and performance of their cloud-native infrastructure. This individual will design and implement systems that support rapid product development while meeting the resilience requirements of clinical-grade AI applications. The role blends hands-on engineering with architectural leadership and cross-functional collaboration across product, ML, infrastructure, and security teams.
What You’ll Do
- Architect, build, and optimize scalable, secure, and highly available cloud infrastructure (AWS/GCP/Azure).
- Lead incident response, root-cause analysis, and production reliability improvements across the platform.
- Implement observability frameworks (metrics, tracing, logging) that provide deep visibility into system performance.
- Partner with ML and data engineering teams to operationalize AI/ML pipelines, ensuring reliability from data ingestion through model deployment.
- Develop automated CI/CD pipelines, infrastructure-as-code, and guardrails for safer, faster deployments.
- Define SLOs/SLIs and establish long-term reliability roadmaps aligned with clinical-grade requirements.
- Mentor SREs and software engineers, promoting DevOps and reliability best practices across engineering.
- Lead capacity planning, performance testing, and system hardening initiatives.
- Collaborate with security teams to ensure compliance with HIPAA, SOC 2, and relevant privacy and security standards.
- Evaluate new technologies and drive adoption of tools that improve operational excellence.
What You’ll Bring
- 8+ years in SRE, DevOps, Infrastructure Engineering, or related fields.
- Deep expertise with Kubernetes, container orchestration, and microservices architecture.
- Strong experience with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code tools such as Terraform, Pulumi, or CloudFormation.
- Advanced proficiency in automation/scripting languages such as Python, Go, or Bash.
- Strong knowledge of distributed systems, reliability engineering patterns, and modern observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.).
- Experience supporting highly regulated or mission-critical environments (healthcare, fintech, SaaS).
- Hands-on experience with ML infrastructure, model lifecycle management, or data pipelines is a plus.
- Excellent communication skills and a proactive, ownership-oriented mindset.
Why Join
- Mission-driven work that directly influences patient care and health outcomes.
- Ownership of foundational infrastructure in a rapidly scaling AI start-up.
- Competitive compensation, equity, and benefits.
- A modern, cloud-native tech stack with the ability to shape future architecture.
- A collaborative and innovative engineering culture.