Site Reliability Engineer

Arlington, Virginia

100% Remote

Full Time

$175k - $225k

As a Senior or Staff SRE on the Platform Engineering team, you'll join at a foundational stage and play a key role in building and shaping a secure, resilient, high-performance platform that powers engineering capabilities.

The company is located in New York, and this role will remain 100% remote.

What You Will Be Doing:

Drive Platform Excellence: Continuously improve the platform's reliability, scalability, and deployment efficiency through innovative solutions and resilient system design.
Build Advanced Observability Solutions: Design, implement, and maintain comprehensive observability and monitoring frameworks to ensure system health, availability, and reliability.
Establish and Track Key Performance Metrics: Develop and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to define and measure system performance benchmarks.
Resolve Complex Issues and Perform Root Cause Analysis: Respond swiftly to critical incidents, troubleshoot sophisticated system and application problems, and conduct detailed root cause analyses to implement long-term solutions.
Lead with Innovation: Stay current with industry trends and emerging technologies. Evolve best practices to boost development quality and delivery speed.
Architect Scalable Systems: Take ownership of designing scalable, fault-tolerant, and distributed systems that meet high standards for performance and reliability.
Mentor and Advocate: Promote the use of modern technologies and best practices, foster adoption of sound architectural patterns, and provide mentorship to engineering peers across the organization.

Required Skills & Experience:

10–12+ years of experience in software engineering, DevOps, or Site Reliability Engineering (SRE)
Proficiency in at least two of the following languages: JavaScript, TypeScript, Python, Go
Strong expertise in diagnosing and resolving issues in complex distributed systems
Deep understanding of database performance tuning and optimization best practices
Proven ability to innovate and drive the adoption of new tools, processes, and standards
Strong skills in system design and cloud-native architecture
Expertise in CI/CD pipelines, configuration management, automation, and monitoring
Advanced understanding of observability practices and tools such as ELK, Datadog, OpenTelemetry, Prometheus, and Grafana
Experience with deployment and orchestration tools like AWS ECS, Kubernetes, Cloud Run, etc.
Solid knowledge of Linux systems, virtualization, networking, VPCs, firewalls, and security configurations
Hands-on experience with AWS services and infrastructure provisioning through CLI, APIs, or Infrastructure as Code (IaC)
Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience

Applicants must be authorized to work in the United States on a full-time basis, now and in the future.
This position does not provide sponsorship.

Posted by: Ashton Corbett

Specialization:

DevOps
