Motion Recruitment | Jobspring | Workbridge

Staff Site Reliability Engineer

New York, New York

Hybrid

Full Time

$185k - $225k

Who they are:
Our client is building the AI platform that transforms how insurers evaluate and price risk. Not just another tool that generates summaries or flags issues—they’re creating AI agents that actually understand risk the way veteran underwriters do.

Their platform is processing billions in premium for some of the world’s largest carriers, and they’re just getting started. The technical challenges are wild. They’re teaching AI to understand that a bakery in Florida faces different risks than one in Montana. To know when a manufacturing company’s pivot from toys to medical devices fundamentally changes its risk profile. To make million-dollar decisions with the same intuition as someone who’s been underwriting for 20 years.

What makes our client special isn’t just the technology—it’s that they’re building it with people who deeply understand insurance. Their team includes folks who’ve built and scaled carriers, researchers who’ve pushed the boundaries of AI, and engineers who just love solving seemingly impossible problems. They’re still early, but the impact is already real.

If you want to build AI that matters—that affects real businesses, real people, and billions in economic activity—our client is where you should be. They’re not just digitizing insurance. They’re reimagining what it can be.

What you’ll do:
Lead observability and reliability strategy across the company, moving them from disparate signals to a clear, trusted view of system health by establishing company standards, defining milestones toward higher levels of operational maturity, and building shared ownership. Operationally, you’ll lead their disaster recovery exercises and develop plans for reaching higher levels of maturity as their business needs evolve.

JD: Staff Reliability and Observability Engineer

  • Own the end-to-end incident and production experience, including on-call design, incident management, post-incident learning, and clear, template-driven customer communication in partnership with Customer Success.

  • Influence reliability at the application and system level, partnering with engineers to improve instrumentation in code, resolve cross-team tradeoffs, and design for failure across interconnected services and vendors.

  • Establish reliability patterns for modern, AI-driven systems, including long-running requests, partial failures, retries, and graceful degradation, while managing key vendor reliability standards.

Qualifications:

  • Senior+ reliability engineering experience, including time as an SRE, Platform Engineer, or Staff-level engineer, with a background that touches both infrastructure (preferably AWS and/or Azure) and application code.

  • Strong application-level fluency, including analyzing logs and traces, and contributing production code (e.g., meaningful PRs) to improve observability and reliability directly in services.

  • System-level thinking across complex ecosystems, with experience operating and reasoning about multiple interconnected services, vendors, and failure modes, and making explicit, well-documented tradeoffs.

  • Proven influence without authority, demonstrated by raising reliability standards through collaboration across Engineering, Product, and Customer Success, navigating disagreement, and driving alignment. Practical experience designing for reliability in AI- and LLM-backed systems using modern developer tooling.

Who you are:

  • A smart self-starter: You have a bias for action. You orient yourself around solutions and outcomes and don’t wait for others to tell you what to do. You also understand how to build alignment and conviction for decisions that can’t be easily reversed.

  • A force multiplier: You look for ways to magnify your impact and your team’s. When you find a productivity hack, you share it with teammates and build tools to make it easier. You document knowledge for future teammates and lean into AI and automation to improve productivity.

  • An empathetic communicator: You communicate nuanced ideas clearly, whether explaining technical decisions in writing or brainstorming in real time. You engage thoughtfully with differing perspectives and compromise when needed.

  • A learner: You thrive on learning new things. You stay current on tech and AI and are excited to share your latest discoveries and hacks.

Posted by: Jon Szynalski