Motion Recruitment | Jobspring | Workbridge

Site Reliability Manager

Los Angeles, California

Hybrid

Full Time

$210k - $270k

Job Description

A fast-growing tech company that builds a data platform to help organizations make safe, fair, and compliant decisions is seeking an experienced Site Reliability Engineering Manager. You will lead the team responsible for the reliability, performance, and scalability of the company's cloud-based services. The role involves managing incident response, improving system observability, and working closely with product and infrastructure teams to maintain high availability and operational excellence.

Required Skills & Experience
  • 8+ years in relevant technical roles, with 4+ years in leadership or management.
  • Strong background in designing and managing observability tools like Datadog or Prometheus.
  • Experience with containerized microservices on a public cloud platform.
  • Proficient with Linux, Git, and CI/CD pipelines.
  • Skilled in on-call production support and incident management.
  • Ability to automate tasks and improve reliability using scripting (Python preferred).
  • Experience with Infrastructure as Code tools (Terraform, CloudFormation, etc.).
  • Strong problem-solving skills and commitment to security best practices.

Desired Skills & Experience
  • Familiarity with AWS, Kubernetes, and event-driven architectures.
  • Experience mentoring engineers and leading technical teams.
  • Knowledge of incident management and collaboration tools (PagerDuty, Jira).
  • Ability to define and track service-level objectives and metrics.
  • Active participation in continuous improvement efforts.

What You Will Be Doing

Daily Responsibilities:

  • Lead and mentor the SRE team, helping resolve blockers and grow skills.
  • Manage daily incident escalations and coordinate with on-call engineers.
  • Collaborate with other managers to define reliability metrics and dashboards.
  • Communicate incident updates to stakeholders and support cross-team collaboration.
  • Participate in design and infrastructure reviews to embed reliability early.
  • Oversee on-call rotations and ensure thorough incident reviews.
  • Drive automation projects to remove operational bottlenecks and improve system uptime.

The Offer
  • $210k - $270k
  • Hybrid

You will receive the following benefits:

  • Medical insurance coverage
  • Dental benefits
  • Vision benefits
  • 401(k) retirement plan with company match
  • Ongoing professional development opportunities
  • Equity ownership options
  • Additional perks and benefits

Posted by: Lily Caringer