Overview
You're the first reliability engineer at LiteLLM, a YC-backed open-source AI gateway that routes hundreds of millions of LLM API requests daily. You'll debug memory leaks in async Python services, tune connection pools, fix race conditions, optimize hot paths, ensure cache consistency, and make the proxy self-heal when things break. You work directly with Ishaan (founder) on infrastructure that's now critical for NASA, Adobe, Netflix, and Stripe.
Role Snapshot
| Aspect | Details |
|---|---|
| Role Type | Founding Reliability & Performance Engineer |
Company Context
Stage: Series A (YC W23)
Size: Small team, exact headcount not disclosed
Growth: 36K+ GitHub stars, processing hundreds of millions of requests daily, rapid enterprise adoption
Market Position: Leading open-source AI gateway - infrastructure layer for LLM API management
What You'll Actually Do
Time Breakdown
Debugging/Firefighting (40%) | Performance Optimization (35%) | Architecture/Planning (25%)
Key Activities
- Hunt down memory leaks: Async Python services leak memory at scale. You'll profile heap dumps, trace object lifecycles, and fix leaks in FastAPI/asyncio code that show up only under production load.
- Tune connection pools: Postgres and Redis connections get exhausted during traffic spikes. You'll figure out optimal pool sizes, implement circuit breakers, add connection retry logic that doesn't cascade failures.
- Fix race conditions: Distributed systems have race conditions: cache invalidation doesn't propagate correctly, request routing sees inconsistent state across replicas. You'll debug these with distributed tracing and fix them.
- Optimize hot paths: Latency matters when you're in the request path for LLM calls. You'll profile code, eliminate allocations, batch operations, add smarter caching - shaving milliseconds off P99 latency.
- Build self-healing: When Redis goes down or Postgres replication lags, the proxy should degrade gracefully. You'll add health checks, implement fallback logic, make the system recover automatically instead of paging you at 3am.
- On-call rotation: You're the first reliability hire. When things break in production (and they will), you're getting paged. Expect to be woken up during your first few months until you've hardened things.
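To make the memory-leak bullet concrete: one of the most common asyncio leaks is fire-and-forget background tasks that are never tracked or awaited. A minimal sketch of the standard fix, holding strong references and cancelling on shutdown (hypothetical code, not LiteLLM's actual implementation):

```python
import asyncio


class TaskRegistry:
    """Hold strong references to background tasks; cancel them on shutdown.

    Tasks from asyncio.create_task() are only weakly referenced by the event
    loop, so fire-and-forget tasks can be garbage-collected mid-flight -- and
    tasks that are spawned but never awaited accumulate as pending work,
    which is exactly the "memory grows linearly over time" leak pattern.
    """

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)                        # strong reference
        task.add_done_callback(self._tasks.discard)  # self-cleaning on finish
        return task

    async def shutdown(self) -> None:
        for task in list(self._tasks):
            task.cancel()
        # return_exceptions=True so CancelledError doesn't abort the gather
        await asyncio.gather(*self._tasks, return_exceptions=True)


async def _demo() -> int:
    registry = TaskRegistry()
    for _ in range(3):
        registry.spawn(asyncio.sleep(3600))  # simulate long-running workers
    await registry.shutdown()
    return len(registry._tasks)


leftover = asyncio.run(_demo())
assert leftover == 0  # every task cancelled and discarded
```

The `add_done_callback(discard)` line is the important part: without it, finished tasks stay in the set forever and you've traded one leak for another.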
The Honest Reality
What's Hard
- You're the only one: No reliability team to collaborate with. No senior engineer to review your work. You make the calls, and if you're wrong, production breaks for NASA and Netflix.
- Production is already at scale: You're not building from scratch. The system is live, handling massive traffic. You have to debug and fix issues without breaking what's working.
- Async Python is tricky: Memory leaks in async code are hard to find, and race conditions are subtle. Python's async stack is less mature than Go or the JVM for high-throughput services.
- Pager duty from day one: You'll be on-call immediately. Sleep will be interrupted. You'll be debugging production issues at odd hours until you've stabilized things.
- Fast-moving codebase: Small team shipping quickly means code quality varies. You'll spend time understanding what others built before you can fix it.
What Success Looks Like
- P99 latency stays under 100ms even during traffic spikes
- Zero-downtime deployments become routine
- Memory usage stays flat over 24-hour periods (no leaks)
- System self-heals from Redis/Postgres failures without manual intervention
- You're getting paged less each month as you harden the infrastructure
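The "self-heals from Redis failures" goal above can be sketched as a cache wrapper that trips to a local fallback on error, then retries Redis after a cooldown. Everything here is an illustrative assumption, not LiteLLM's actual code; the Redis calls are injected as plain callables so the sketch stays dependency-free:

```python
import time


class DegradingCache:
    """Serve from Redis while healthy; degrade to an in-process dict when not.

    redis_get/redis_set are injected callables standing in for a real redis
    client. After a ConnectionError the "circuit" stays open for cooldown_s
    seconds, during which reads and writes hit only the local fallback --
    requests keep succeeding instead of erroring out.
    """

    def __init__(self, redis_get, redis_set, cooldown_s: float = 5.0):
        self._redis_get = redis_get
        self._redis_set = redis_set
        self._cooldown = cooldown_s
        self._local: dict = {}
        self._open_until = 0.0  # circuit is closed when now >= this timestamp

    def _healthy(self) -> bool:
        return time.monotonic() >= self._open_until

    def _trip(self) -> None:
        self._open_until = time.monotonic() + self._cooldown

    def get(self, key):
        if self._healthy():
            try:
                return self._redis_get(key)
            except ConnectionError:
                self._trip()  # degrade instead of failing the request
        return self._local.get(key)

    def set(self, key, value) -> None:
        self._local[key] = value  # always keep the local copy warm
        if self._healthy():
            try:
                self._redis_set(key, value)
            except ConnectionError:
                self._trip()


# Demo: Redis is completely down, yet reads and writes still succeed.
def down_get(key):
    raise ConnectionError("redis unreachable")

def down_set(key, value):
    raise ConnectionError("redis unreachable")

cache = DegradingCache(down_get, down_set)
cache.set("deployment:status", "healthy")
assert cache.get("deployment:status") == "healthy"
```

A production version would also need TTLs and bounded size on the local dict (otherwise the fallback itself becomes a leak), but the trip-and-cooldown shape is the core of graceful degradation.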
The Technical Reality
Stack You'll Work With:
- Python (FastAPI, asyncio) - main application code
- Postgres - primary data store
- Redis - caching layer
- Kubernetes - orchestration
- Prometheus - metrics and monitoring
Problems You'll Actually Solve:
- Memory usage grows linearly over time (asyncio tasks not cleaned up)
- Connection pool exhaustion during traffic bursts
- Cache invalidation race conditions across replicas
- Hot path allocations causing GC pressure
- Graceful degradation when dependencies fail
- Distributed tracing gaps making debugging hard
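For the pool-exhaustion and cascading-failure problems above, the usual building block is capped exponential backoff with jitter, so retries from many replicas don't synchronize into a retry storm against a recovering Postgres or Redis. A hedged sketch with illustrative parameter values:

```python
import random
import time


def retry_with_backoff(fn, attempts: int = 4, base: float = 0.05, cap: float = 1.0):
    """Call fn(), retrying ConnectionError with capped, jittered backoff.

    Full jitter (sleep a random duration up to the backoff ceiling) spreads
    retries across a fleet, so a recovering dependency isn't hammered by
    every replica at the same instant -- the cascade that turns a blip
    into an outage.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure
            delay = min(cap, base * (2 ** attempt))  # capped exponential
            time.sleep(random.uniform(0, delay))     # full jitter


# Demo: a connection that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("pool exhausted")
    return "connected"

assert retry_with_backoff(flaky_connect) == "connected"
assert calls["n"] == 3
```

The cap matters as much as the jitter: without it, backoff on a long outage grows until retries effectively stop, which looks like a hang rather than a degraded dependency.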
Who You're Working With
Reports To: Ishaan (Founder/CEO)
Team Structure: Small engineering team, you're the first dedicated reliability hire
Cross-functional Work: You'll work with backend engineers shipping features and help them understand the reliability implications of their code
Requirements
- Experience debugging production issues in high-throughput Python services
- Strong understanding of async Python (asyncio, event loops, coroutines)
- Know how to profile and optimize memory usage, find leaks
- Experience with connection pooling, circuit breakers, retry logic
- Comfortable with Kubernetes, Prometheus, distributed systems concepts
- Can be on-call and debug production fires under pressure
- Based in San Francisco or willing to relocate (role is in-person)