Motion Recruitment | Jobspring | Workbridge

Site Reliability Engineer

Arlington, Virginia

100% Remote

Full Time

$175k - $225k

As a Senior or Staff SRE on the Platform Engineering team, you’ll join at a foundational stage and play a key role in building and shaping a secure, resilient, high-performance platform that powers engineering capabilities.

The company is located in New York, and this position will remain 100% remote.

What You Will Be Doing:
  • Drive Platform Excellence: Continuously improve the platform's reliability, scalability, and deployment efficiency through innovative solutions and resilient system design.
  • Build Advanced Observability Solutions: Design, implement, and maintain comprehensive observability and monitoring frameworks to ensure system health, availability, and reliability.
  • Establish and Track Key Performance Metrics: Develop and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to define and measure system performance benchmarks.
  • Resolve Complex Issues and Perform Root Cause Analysis: Respond swiftly to critical incidents, troubleshoot sophisticated system and application problems, and conduct detailed root cause analyses to implement long-term solutions.
  • Lead with Innovation: Stay current with industry trends and emerging technologies. Evolve best practices to boost development quality and delivery speed.
  • Architect Scalable Systems: Take ownership of designing scalable, fault-tolerant, and distributed systems that meet high standards for performance and reliability.
  • Mentor and Advocate: Promote the use of modern technologies and best practices, foster adoption of sound architectural patterns, and provide mentorship to engineering peers across the organization.
Required Skills & Experience:
  • 10–12+ years of experience in software engineering, DevOps, or Site Reliability Engineering (SRE)
  • Proficiency in at least two of the following languages: JavaScript, TypeScript, Python, Go
  • Strong expertise in diagnosing and resolving issues in complex distributed systems
  • Deep understanding of database performance tuning and optimization best practices
  • Proven ability to innovate and drive the adoption of new tools, processes, and standards
  • Strong skills in system design and cloud-native architecture
  • Expertise in CI/CD pipelines, configuration management, automation, and monitoring
  • Advanced understanding of observability practices and tools such as ELK, Datadog, OpenTelemetry, Prometheus, and Grafana
  • Experience with deployment and orchestration tools such as AWS ECS, Kubernetes, or Cloud Run
  • Solid knowledge of Linux systems, virtualization, networking, VPCs, firewalls, and security configurations
  • Hands-on experience with AWS services and infrastructure provisioning through CLI, APIs, or Infrastructure as Code (IaC)
  • Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience
Applicants must be authorized to work in the United States on a full-time basis, both now and in the future.
This position does not provide sponsorship.

Posted by: Ashton Corbett
