Motion Recruitment | Jobspring | Workbridge

Principal Site Reliability Engineer - AI

New York, New York

Hybrid

Full Time

$200k - $250k

About Our Client

Our client is an AI-driven health-tech start-up on a mission to transform patient care through intelligent, secure, and highly reliable clinical automation tools. Their platform powers real-time insights for clinicians, improving patient outcomes and enabling healthcare systems to operate with unprecedented efficiency. They are entering a high-growth phase and are seeking a Principal Site Reliability Engineer to help scale their infrastructure and ensure world-class reliability.

Role Overview

Our client is hiring a Principal Site Reliability Engineer to serve as the technical authority for the reliability, scalability, and performance of their cloud-native infrastructure. This individual will design and implement systems that support rapid product development while meeting the resilience requirements of clinical-grade AI applications. The role blends hands-on engineering with architectural leadership and cross-functional collaboration across product, ML, infrastructure, and security teams.

What You’ll Do
  • Architect, build, and optimize scalable, secure, and highly available cloud infrastructure (AWS/GCP/Azure).

  • Lead incident response, root-cause analysis, and production reliability improvements across the platform.

  • Implement observability frameworks (metrics, tracing, logging) that provide deep visibility into system performance.

  • Partner with ML and data engineering teams to operationalize AI/ML pipelines, ensuring reliability from data ingestion through model deployment.

  • Develop automated CI/CD pipelines, infrastructure-as-code, and guardrails for safer, faster deployments.

  • Define SLOs/SLIs and establish long-term reliability roadmaps aligned with clinical-grade requirements.

  • Mentor SREs and software engineers, promoting DevOps and reliability best practices across engineering.

  • Lead capacity planning, performance testing, and system hardening initiatives.

  • Collaborate with security teams to ensure compliance with HIPAA, SOC 2, and relevant privacy and security standards.

  • Evaluate new technologies and drive adoption of tools that improve operational excellence.

What They’re Looking For
  • 8+ years in SRE, DevOps, Infrastructure Engineering, or related fields.

  • Deep expertise with Kubernetes, container orchestration, and microservices architecture.

  • Strong experience with cloud platforms (AWS/GCP/Azure) and infrastructure-as-code tools such as Terraform, Pulumi, or CloudFormation.

  • Advanced proficiency in automation/scripting languages such as Python, Go, or Bash.

  • Strong knowledge of distributed systems, reliability engineering patterns, and modern observability stacks (Prometheus, Grafana, OpenTelemetry, Datadog, etc.).

  • Experience supporting highly regulated or mission-critical environments (healthcare, fintech, SaaS).

  • Hands-on experience with ML infrastructure, model lifecycle management, or data pipelines is a plus.

  • Excellent communication skills and a proactive, ownership-oriented mindset.

Why Candidates Love This Role
  • Mission-driven work that directly influences patient care and health outcomes.

  • Ownership of foundational infrastructure in a rapidly scaling AI start-up.

  • Competitive compensation, equity, and benefits.

  • A modern, cloud-native tech stack with the ability to shape future architecture.

  • A collaborative and innovative engineering culture.

Posted by: Nicholas Costello