Staff DevOps or Site Reliability Engineer / AWS / IoT / Local to San Diego

San Diego, California


Full Time

$143k - $190k

We are seeking a highly qualified Senior/Staff Site Reliability Engineer (SRE) to join our team based in San Diego, CA. In this role, you will solve operational challenges and support the development teams that run critical business applications in production. Our primary focus is the reliability of all production services, and we help development teams measure their own reliability so they can make informed decisions.

Key Responsibilities:

  • Collaborate with teams to architect, engineer, and optimize products for Kubernetes and cloud environments.
  • Develop and enhance Continuous Integration/Continuous Deployment (CI/CD) pipelines, release management processes, and associated tools.
  • Maintain observability tools, champion standardization, and promote best practices for development teams.
  • Create tools, automation, and frameworks to enhance system stability and reliability.
  • Lead initiatives to prioritize and promote reliability, achieve uptime goals, and mentor colleagues in SRE best practices.
  • Provide on-call support to development teams for critical business applications in production.
  • Actively contribute to and facilitate an SRE guild, fostering knowledge sharing and collaboration among members.
  • Conduct thorough Production Readiness Reviews, working with teams to establish Service Level Objectives (SLOs) and ensure the delivery of high-quality, dependable services.
  • Contribute to project plans and engineering documentation, and develop standard operating procedures and runbooks that support operational excellence, with a strong focus on automation.

Experience and Qualifications:

  • 5+ years of experience in an SRE or Platform Engineer role supporting a 24x7 production environment.
  • 3+ years of experience with AWS or comparable cloud resource administration/support in a production environment.
  • Strong expertise in Kubernetes administration, containerization tools (e.g., Docker), and Helm, following industry best practices such as GitOps.
  • Proficiency in scripting or programming languages such as Python, Ruby, Bash, Node.js, and/or Go.
  • Experience with distributed tracing and proficiency in one or more monitoring solutions, such as Prometheus, Elasticsearch, Datadog, or CloudWatch.
  • Demonstrated proficiency in current software development lifecycle (SDLC) concepts and best practices, CI/CD pipelines, and test-driven development.
  • Strong problem-solving skills, operational expertise, and a passion for automation.

Posted by: John Bellon