Site Reliability Engineer
Arlington, Virginia
100% Remote
Full Time
$175k - $225k
As the Senior or Staff SRE on the Platform Engineering team, you’ll join at a foundational stage and play a key role in building and shaping a secure, resilient, high-performance platform that powers engineering capabilities.
The company is located in New York and will remain 100% remote.
What You Will Be Doing:
- Drive Platform Excellence: Continuously improve the platform's reliability, scalability, and deployment efficiency through innovative solutions and resilient system design.
- Build Advanced Observability Solutions: Design, implement, and maintain comprehensive observability and monitoring frameworks to ensure system health, availability, and reliability.
- Establish and Track Key Performance Metrics: Develop and monitor Service Level Indicators (SLIs) and Service Level Objectives (SLOs) to define and measure system performance benchmarks.
- Resolve Complex Issues and Perform Root Cause Analysis: Respond swiftly to critical incidents, troubleshoot sophisticated system and application problems, and conduct detailed root cause analyses to implement long-term solutions.
- Lead with Innovation: Stay current with industry trends and emerging technologies. Evolve best practices to boost development quality and delivery speed.
- Architect Scalable Systems: Take ownership of designing scalable, fault-tolerant, and distributed systems that meet high standards for performance and reliability.
- Mentor and Advocate: Promote the use of modern technologies and best practices, foster adoption of sound architectural patterns, and provide mentorship to engineering peers across the organization.
What You Will Need:
- 10–12+ years of experience in software engineering, DevOps, or Site Reliability Engineering (SRE)
- Proficiency in at least two of the following languages: JavaScript, TypeScript, Python, Go
- Strong expertise in diagnosing and resolving issues in complex distributed systems
- Deep understanding of database performance tuning and optimization best practices
- Proven ability to innovate and drive the adoption of new tools, processes, and standards
- Strong skills in system design and cloud-native architecture
- Expertise in CI/CD pipelines, configuration management, automation, and monitoring
- Advanced understanding of observability practices and tools such as ELK, Datadog, OpenTelemetry, Prometheus, and Grafana
- Experience with deployment and orchestration tools such as AWS ECS, Kubernetes, and Cloud Run
- Solid knowledge of Linux systems, virtualization, networking, VPCs, firewalls, and security configurations
- Hands-on experience with AWS services and infrastructure provisioning through CLI, APIs, or Infrastructure as Code (IaC)
- Bachelor’s degree in Computer Science or a related technical field, or equivalent practical experience
This position doesn’t provide sponsorship.