Site Reliability Manager
Los Angeles, California
Hybrid
Full Time
$210k - $270k
Job Description
A fast-growing tech company that builds a data platform to help organizations make safe, fair, and compliant decisions is seeking an experienced Site Reliability Engineering Manager to lead the team responsible for the reliability, performance, and scalability of its cloud-based services. The role involves managing incident response, improving system observability, and working closely with product and infrastructure teams to maintain high availability and operational excellence.
Qualifications:
- 8+ years in relevant technical roles, with 4+ years in leadership or management.
- Strong background in designing and managing observability tools like Datadog or Prometheus.
- Experience with containerized microservices on a public cloud platform.
- Proficient with Linux, Git, and CI/CD pipelines.
- Skilled in on-call production support and incident management.
- Ability to automate tasks and improve reliability using scripting (Python preferred).
- Experience with Infrastructure as Code tools (Terraform, CloudFormation, etc.).
- Strong problem-solving skills and commitment to security best practices.
- Familiarity with AWS, Kubernetes, and event-driven architectures.
- Experience mentoring engineers and leading technical teams.
- Knowledge of incident management and collaboration tools (PagerDuty, Jira).
- Ability to define and track service-level objectives and metrics.
- Participation in continuous improvement.
Daily Responsibilities:
- Lead and mentor the SRE team, helping resolve blockers and grow skills.
- Manage daily incident escalations and coordinate with on-call engineers.
- Collaborate with other managers to define reliability metrics and dashboards.
- Communicate incident updates to stakeholders and support cross-team collaboration.
- Participate in design and infrastructure reviews to embed reliability early.
- Oversee on-call rotations and ensure thorough incident reviews.
- Drive automation projects to remove operational bottlenecks and improve system uptime.
You will receive the following benefits:
- Medical insurance coverage
- Dental benefits
- Vision benefits
- 401(k) retirement plan with company match
- Ongoing professional development opportunities
- Equity ownership options
- Additional perks and benefits