Principal Site Reliability Engineer - AI

New York, New York

Hybrid

Full Time

$200k - $250k

About Our Client

Our client is an AI-driven health-tech start-up on a mission to transform patient care through intelligent, secure, and highly reliable clinical automation tools. Their platform powers real-time insights for clinicians, improving patient outcomes and enabling healthcare systems to operate with unprecedented efficiency. They are entering a high-growth phase and are seeking a Principal Site Reliability Engineer to help scale their infrastructure and ensure world-class reliability.

Role Overview

Our client is hiring a Principal Site Reliability Engineer to serve as the technical authority for the reliability, scalability, and performance of their cloud-native infrastructure. This individual will design and implement systems that support rapid product development while meeting the resilience requirements of clinical-grade AI applications. The role blends hands-on engineering with architectural leadership and cross-functional collaboration across product, ML, infrastructure, and security teams.

What You’ll Do

Architect, build, and optimize scalable, secure, and highly available cloud infrastructure (AWS/GCP/Azure).
Lead incident response, root-cause analysis, and production reliability improvements across the platform.
Implement observability frameworks (metrics, tracing, logging) that provide deep visibility into system performance.
Partner with ML and data engineering teams to operationalize AI/ML pipelines, ensuring reliability from data ingestion through model deployment.
Develop automated CI/CD pipelines, infrastructure-as-code, and guardrails for safer, faster deployments.
Define SLOs/SLIs and establish long-term reliability roadmaps aligned with clinical-grade requirements.
Mentor SREs and software engineers, promoting DevOps and reliability best practices across engineering.
Lead capacity planning, performance testing, and system hardening initiatives.
Collaborate with security teams to ensure compliance with HIPAA, SOC 2, and relevant privacy and security standards.
Evaluate new technologies and drive adoption of tools that improve operational excellence.

What They’re Looking For

8+ years in SRE, DevOps, Infrastructure Engineering, or related fields.
Deep expertise with Kubernetes, container orchestration, and microservices architecture.
Strong experience with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code tools such as Terraform, Pulumi, or CloudFormation.
Advanced proficiency in automation/scripting languages such as Python, Go, or Bash.
Strong knowledge of distributed systems, reliability engineering patterns, and modern observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.).
Experience supporting highly regulated or mission-critical environments (healthcare, fintech, SaaS).
Hands-on experience with ML infrastructure, model lifecycle management, or data pipelines is a plus.
Excellent communication skills and a proactive, ownership-oriented mindset.

Why Candidates Love This Role

Mission-driven work that directly influences patient care and health outcomes.
Ownership of foundational infrastructure in a rapidly scaling AI start-up.
Competitive compensation, equity, and benefits.
A modern, cloud-native tech stack with the ability to shape future architecture.
A collaborative and innovative engineering culture.

Posted by: Nicholas Costello

Specialization:

Principal Site Reliability Engineer - AI
