Overview
You're the first reliability engineer at LiteLLM, a YC-backed open-source AI gateway that routes hundreds of millions of LLM API requests daily. You'll debug memory leaks in async Python services, tune connection pools, fix race conditions, optimize hot paths, ensure cache consistency, and make the proxy self-heal when things break. You work directly with Ishaan (founder) on infrastructure that's now critical for NASA, Adobe, Netflix, and Stripe.
Role Snapshot
| Aspect | Details |
|---|---|
| Role Type | Founding Reliability & Performance Engineer |
Company Context
Stage: Series A (YC W23)
Size: Small team, exact headcount not disclosed
Growth: 36K+ GitHub stars, processing hundreds of millions of requests daily, rapid enterprise adoption
Market Position: Leading open-source AI gateway - infrastructure layer for LLM API management
What You'll Actually Do
Time Breakdown
Debugging/Firefighting (40%) | Performance Optimization (35%) | Architecture/Planning (25%)
Key Activities
- Hunt down memory leaks: Async Python services leak memory at scale. You'll profile heap dumps, trace object lifecycles, and fix leaks in FastAPI/asyncio code that show up only under production load.
- Tune connection pools: Postgres and Redis connections get exhausted during traffic spikes. You'll figure out optimal pool sizes, implement circuit breakers, add connection retry logic that doesn't cascade failures.
- Fix race conditions: Distributed systems have race conditions: cache invalidation doesn't propagate correctly, request routing sees inconsistent state across replicas. You'll debug these with distributed tracing and fix them.
- Optimize hot paths: Latency matters when you're in the request path for LLM calls. You'll profile code, eliminate allocations, batch operations, add smarter caching - shaving milliseconds off P99 latency.
- Build self-healing: When Redis goes down or Postgres replication lags, the proxy should degrade gracefully. You'll add health checks, implement fallback logic, make the system recover automatically instead of paging you at 3am.
- On-call rotation: You're the first reliability hire. When things break in production (and they will), you're getting paged. Expect to be woken up during your first few months until you've hardened things.
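To make the memory-leak bullet concrete: one of the most common asyncio leaks is fire-and-forget background tasks that are never tracked or awaited. A minimal sketch of the standard fix, holding strong references and cancelling on shutdown (hypothetical code, not LiteLLM's actual implementation):

```python
import asyncio


class TaskRegistry:
    """Hold strong references to background tasks; cancel them on shutdown.

    Tasks from asyncio.create_task() are only weakly referenced by the event
    loop, so fire-and-forget tasks can be garbage-collected mid-flight -- and
    tasks that are spawned but never awaited accumulate as pending work,
    which is exactly the "memory grows linearly over time" leak pattern.
    """

    def __init__(self) -> None:
        self._tasks: set[asyncio.Task] = set()

    def spawn(self, coro) -> asyncio.Task:
        task = asyncio.create_task(coro)
        self._tasks.add(task)                        # strong reference
        task.add_done_callback(self._tasks.discard)  # self-cleaning on finish
        return task

    async def shutdown(self) -> None:
        for task in list(self._tasks):
            task.cancel()
        # return_exceptions=True so CancelledError doesn't abort the gather
        await asyncio.gather(*self._tasks, return_exceptions=True)


async def _demo() -> int:
    registry = TaskRegistry()
    for _ in range(3):
        registry.spawn(asyncio.sleep(3600))  # simulate long-running workers
    await registry.shutdown()
    return len(registry._tasks)


leftover = asyncio.run(_demo())
assert leftover == 0  # every task cancelled and discarded
```

The `add_done_callback(discard)` line is the important part: without it, finished tasks stay in the set forever and you've traded one leak for another.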
The Honest Reality
What's Hard
- You're the only one: No reliability team to collaborate with. No senior engineer to review your work. You make the calls, and if you're wrong, production breaks for NASA and Netflix.
- Production is already at scale: You're not building from scratch. The system is live, handling massive traffic. You have to debug and fix issues without breaking what's working.
- Async Python is tricky: Memory leaks in async code are hard to find, and race conditions are subtle. Python's async stack is less mature than Go or the JVM for high-throughput services.
- Pager duty from day one: You'll be on-call immediately. Sleep will be interrupted. You'll be debugging production issues at odd hours until you've stabilized things.
- Fast-moving codebase: Small team shipping quickly means code quality varies. You'll spend time understanding what others built before you can fix it.
What Success Looks Like
- P99 latency stays under 100ms even during traffic spikes
- Zero-downtime deployments become routine
- Memory usage stays flat over 24-hour periods (no leaks)
- System self-heals from Redis/Postgres failures without manual intervention
- You're getting paged less each month as you harden the infrastructure
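The "self-heals from Redis failures" goal above can be sketched as a cache wrapper that trips to a local fallback on error, then retries Redis after a cooldown. Everything here is an illustrative assumption, not LiteLLM's actual code; the Redis calls are injected as plain callables so the sketch stays dependency-free:

```python
import time


class DegradingCache:
    """Serve from Redis while healthy; degrade to an in-process dict when not.

    redis_get/redis_set are injected callables standing in for a real redis
    client. After a ConnectionError the "circuit" stays open for cooldown_s
    seconds, during which reads and writes hit only the local fallback --
    requests keep succeeding instead of erroring out.
    """

    def __init__(self, redis_get, redis_set, cooldown_s: float = 5.0):
        self._redis_get = redis_get
        self._redis_set = redis_set
        self._cooldown = cooldown_s
        self._local: dict = {}
        self._open_until = 0.0  # circuit is closed when now >= this timestamp

    def _healthy(self) -> bool:
        return time.monotonic() >= self._open_until

    def _trip(self) -> None:
        self._open_until = time.monotonic() + self._cooldown

    def get(self, key):
        if self._healthy():
            try:
                return self._redis_get(key)
            except ConnectionError:
                self._trip()  # degrade instead of failing the request
        return self._local.get(key)

    def set(self, key, value) -> None:
        self._local[key] = value  # always keep the local copy warm
        if self._healthy():
            try:
                self._redis_set(key, value)
            except ConnectionError:
                self._trip()


# Demo: Redis is completely down, yet reads and writes still succeed.
def down_get(key):
    raise ConnectionError("redis unreachable")

def down_set(key, value):
    raise ConnectionError("redis unreachable")

cache = DegradingCache(down_get, down_set)
cache.set("deployment:status", "healthy")
assert cache.get("deployment:status") == "healthy"
```

A production version would also need TTLs and bounded size on the local dict (otherwise the fallback itself becomes a leak), but the trip-and-cooldown shape is the core of graceful degradation.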
The Technical Reality
Stack You'll Work With:
- Python (FastAPI, asyncio) - main application code
- Postgres - primary data store
- Redis - caching layer
- Kubernetes - orchestration
- Prometheus - metrics and monitoring
Problems You'll Actually Solve:
- Memory usage grows linearly over time (asyncio tasks not cleaned up)
- Connection pool exhaustion during traffic bursts
- Cache invalidation race conditions across replicas
- Hot path allocations causing GC pressure
- Graceful degradation when dependencies fail
- Distributed tracing gaps making debugging hard
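For the pool-exhaustion and cascading-failure problems above, the usual building block is capped exponential backoff with jitter, so retries from many replicas don't synchronize into a retry storm against a recovering Postgres or Redis. A hedged sketch with illustrative parameter values:

```python
import random
import time


def retry_with_backoff(fn, attempts: int = 4, base: float = 0.05, cap: float = 1.0):
    """Call fn(), retrying ConnectionError with capped, jittered backoff.

    Full jitter (sleep a random duration up to the backoff ceiling) spreads
    retries across a fleet, so a recovering dependency isn't hammered by
    every replica at the same instant -- the cascade that turns a blip
    into an outage.
    """
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # retry budget exhausted: surface the failure
            delay = min(cap, base * (2 ** attempt))  # capped exponential
            time.sleep(random.uniform(0, delay))     # full jitter


# Demo: a connection that fails twice, then succeeds on the third attempt.
calls = {"n": 0}

def flaky_connect():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("pool exhausted")
    return "connected"

assert retry_with_backoff(flaky_connect) == "connected"
assert calls["n"] == 3
```

The cap matters as much as the jitter: without it, backoff on a long outage grows until retries effectively stop, which looks like a hang rather than a degraded dependency.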
Who You're Working With
Reports To: Ishaan (Founder/CEO)
Team Structure: Small engineering team, you're the first dedicated reliability hire
Cross-functional Work: You'll work with backend engineers shipping features and help them understand the reliability implications of their code
Requirements
- Experience debugging production issues in high-throughput Python services
- Strong understanding of async Python (asyncio, event loops, coroutines)
- Know how to profile and optimize memory usage, find leaks
- Experience with connection pooling, circuit breakers, retry logic
- Comfortable with Kubernetes, Prometheus, distributed systems concepts
- Can be on-call and debug production fires under pressure
- Based in San Francisco or willing to relocate (role is in-person)