Site Reliability Engineer

Los Angeles, CA

Open to Remote

Full Time

$90k - $100k

We are working with a company that brings together design systems and a diverse, driven team of professionals to provide customer fulfillment. The company is looking for a team-oriented, creative Site Reliability Engineer. In this position you will work with the DevOps team to help build projects into products, systems, and SDLC pipelines. You will also seek to minimize constraints and reduce lead time, and you will measure and optimize system performance, pushing the limits of our capabilities and getting ahead of company needs. This job entails building, deploying, and managing critical infrastructure. You should bring hands-on experience combined with working knowledge of the following areas:
  1. System and Service Reliability: Monitor, measure, and improve the reliability of systems and services. This involves designing and implementing strategies to prevent and mitigate incidents.
  2. Service Level Objectives (SLOs) and Service Level Indicators (SLIs): Define and track SLOs and SLIs for critical services. Work to meet or exceed these objectives, and ensure they align with business and customer expectations.
  3. Incident Response and Management: Respond to and manage incidents effectively. This includes diagnosing and resolving issues, conducting post-incident reviews, and implementing measures to prevent recurrence.
  4. Automation and Tooling: Develop, maintain, and improve automation tools and processes to streamline operations and enhance the reliability of systems. This may include scripting, configuration management, and monitoring tools.
  5. Capacity Planning and Scaling: Analyze system capacity and performance metrics to plan for future growth. Implement scaling strategies to accommodate increased demand.
  6. Infrastructure as Code (IaC): Leverage IaC principles to manage and provision infrastructure resources. This may involve using tools like Terraform, Ansible, or similar technologies.
  7. Continuous Integration/Continuous Deployment (CI/CD): Work closely with development teams to ensure that CI/CD pipelines are reliable, efficient, and aligned with SRE best practices.
  8. Monitoring and Alerting: Set up and maintain robust monitoring and alerting systems. Ensure that key metrics and events are tracked, and that alerts are triggered appropriately to indicate potential issues.
  9. Capacity and Performance Management: Monitor and manage resource utilization to ensure systems meet performance requirements. This includes optimizing configurations and making recommendations for hardware or resource adjustments.
  10. Security and Compliance: Collaborate with security teams to implement best practices for securing systems and ensuring compliance with relevant industry standards and regulations.
  11. Disaster Recovery and Redundancy: Design and implement strategies for disaster recovery, including backup and redundancy solutions, to ensure high availability in case of system failures.

Posted by: Amanda Oliver

Specialization: DevOps