Site Reliability Engineer - 5 days onsite, NoHo, NYC
Arlington, Virginia
Onsite
Full Time
$150k - $250k
Site Reliability Engineer
This company is developing AI thought partners designed to enhance human intelligence and creativity, transforming how knowledge is created and shared in financial services. We're unapologetically ambitious, driven by a clear goal: to build the world’s leading Financial AI company.
The company is located in NoHo, NYC, and the role is onsite 5 days a week.
What You Will Be Doing:
- Cloud Infrastructure Management: Design, implement, and maintain robust cloud infrastructure on AWS and/or Azure to ensure high availability, scalability, and fault tolerance.
- Monitoring & System Health: Leverage Datadog to build proactive monitoring and alerting systems, enabling rapid detection and resolution of performance issues.
- Kubernetes & Container Management: Administer and optimize Kubernetes clusters, utilizing Helm for efficient package management and deployment automation.
- Automation & Infrastructure as Code: Develop and maintain Infrastructure as Code (IaC) using Terraform; automate routine tasks with scripts written in Bash or Python.
- Cross-Functional Collaboration: Partner with development and operations teams to foster a DevOps mindset, streamline CI/CD workflows, and implement best practices.
- Incident Response & Troubleshooting: Diagnose and resolve complex issues across OS, networking, and database layers in cloud-based environments.
- Documentation: Create and maintain thorough documentation of infrastructure configurations, standard operating procedures, and troubleshooting playbooks.
Required Skills & Experience:
- Bachelor’s degree in Computer Science, Information Technology, or a related discipline.
- 3–5 years of hands-on experience with AWS and/or Azure, including services such as EC2, S3, VPC, and Lambda.
- 2–3 years managing Kubernetes clusters in production environments.
- 2–3 years of experience using Helm for Kubernetes application deployments.
- 2–3 years working with monitoring platforms like Datadog.
- 3–5 years of experience in Linux system administration and shell scripting.
- 2–3 years of experience with Infrastructure as Code (Terraform preferred).
- Strong scripting abilities in Bash and Python.
- Solid understanding of networking concepts, including TCP/IP, DNS, firewalls, and load balancers.
- Experience with CI/CD tools such as Jenkins, GitLab CI, or GitHub Actions.
- Familiarity with cloud-native security practices and regulatory compliance standards.
This position doesn’t provide sponsorship.