Everyone wants AI features. Most of the ones shipped in the last two years are slow, unreliable, and expensive. Here's how we build them so they're none of those things.
Non-determinism is a feature, not a bug — if you design for it
LLMs are not deterministic functions. The same input won't always produce the same output. Applications that treat them as if they were deterministic tend to fail in production. Design for variance: validate outputs against a schema, retry on failure, and have graceful fallbacks for when the model produces garbage.
We use Zod schemas with structured output mode on every OpenAI call. If the output doesn't match the schema, we retry once, then fall back to a cached or default response.
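The validate, retry once, then fall back flow can be sketched as below. This is a minimal sketch under stated assumptions, not our production code: `guardedCall`, `callModel`, and `ParseResult` are hypothetical names, and the model call is injected so the example stays self-contained. The `ParseResult` shape mirrors what Zod's `safeParse` returns, so a Zod schema should slot in as the validator.

```typescript
// Mirrors the shape returned by Zod's `schema.safeParse(...)`.
type ParseResult<T> = { success: true; data: T } | { success: false };

interface GuardedCallOptions<T> {
  callModel: () => Promise<unknown>;          // hypothetical wrapper around the LLM request
  validate: (raw: unknown) => ParseResult<T>; // e.g. (raw) => MySchema.safeParse(raw)
  fallback: T;                                // cached or default response
  maxRetries?: number;                        // defaults to 1, i.e. "retry once"
}

async function guardedCall<T>(opts: GuardedCallOptions<T>): Promise<T> {
  const attempts = 1 + (opts.maxRetries ?? 1);
  for (let i = 0; i < attempts; i++) {
    try {
      const result = opts.validate(await opts.callModel());
      if (result.success) return result.data;
    } catch {
      // A thrown error (network, JSON parsing) counts as a failed attempt.
    }
  }
  // Every attempt failed validation: degrade gracefully instead of erroring.
  return opts.fallback;
}
```

Keeping the fallback a required argument forces each call site to decide up front what the user sees when the model misbehaves.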
Cost control
LLM costs compound fast. On a B2C product at scale, uncached GPT-4 calls will bankrupt you. Our approach:
- Semantic caching — cache responses by embedding similarity, not exact string match. Semantically similar prompts return cached responses without a new LLM call.
- Model tiering — use GPT-4 for complex reasoning, GPT-4o-mini for simple classification. Most tasks don't need the expensive model.
- Prompt budgets — every prompt has a max token budget. We prune context aggressively and summarise long histories.
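The three techniques above can be sketched in a few small pieces. Everything here is illustrative: the `SemanticCache` class, the 0.95 similarity threshold, the `pickModel` routing rule, and the 4-characters-per-token estimate are assumptions for the example, and embeddings are passed in as plain number arrays rather than fetched from an embedding API.

```typescript
// Cosine similarity between two embedding vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

interface CacheEntry { embedding: number[]; response: string }

class SemanticCache {
  private entries: CacheEntry[] = [];
  private threshold: number;

  constructor(threshold: number = 0.95) { // assumed cutoff; tune per product
    this.threshold = threshold;
  }

  // Returns the most similar cached response at or above the threshold, if any.
  // Linear scan for clarity; a vector index would replace this at scale.
  lookup(embedding: number[]): string | undefined {
    let best: CacheEntry | undefined;
    let bestScore = this.threshold;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score >= bestScore) { best = entry; bestScore = score; }
    }
    return best?.response;
  }

  store(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}

// Model tiering: only complex reasoning gets the expensive model.
type Task = "classification" | "reasoning";
function pickModel(task: Task): string {
  return task === "reasoning" ? "gpt-4" : "gpt-4o-mini";
}

interface Message { role: "system" | "user" | "assistant"; content: string }

// Rough estimate (~4 characters per token); stands in for a real tokenizer.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

// Prompt budget: keep the newest messages that fit, dropping the oldest first.
function trimToBudget(messages: Message[], maxTokens: number): Message[] {
  const kept: Message[] = [];
  let used = 0;
  for (let i = messages.length - 1; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    kept.unshift(messages[i]);
    used += cost;
  }
  return kept;
}
```

The similarity threshold is the main tuning knob: set it too low and users get answers to questions they didn't ask, too high and the cache rarely hits. The messages that `trimToBudget` drops are the natural candidates for summarisation.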
Latency
Users abandon a feature that takes more than 2 seconds to respond. LLM calls average 3–8 seconds. Streaming is the answer: show tokens as they arrive, not after the full response completes. With streaming, perceived latency drops from the full response time to the time it takes the first token to arrive.
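A minimal sketch of the consuming side, assuming the response arrives as an async iterable of text tokens. `fakeTokenStream` and `renderStream` are hypothetical names, and the fake stream stands in for a real streaming response (the official OpenAI Node SDK exposes streamed responses as an async iterable when `stream: true` is set, though its chunks are objects rather than plain strings).

```typescript
// Hypothetical stand-in for a streaming LLM response: yields tokens one by one.
async function* fakeTokenStream(tokens: string[]): AsyncGenerator<string> {
  for (const token of tokens) {
    yield token;
  }
}

// Forwards each token to the UI as soon as it arrives and returns the full text.
async function renderStream(
  stream: AsyncIterable<string>,
  onToken: (token: string) => void,
): Promise<string> {
  let full = "";
  for await (const token of stream) {
    onToken(token); // the user sees output here, before the response is complete
    full += token;
  }
  return full;
}
```

In a UI, `onToken` would append to the visible message. The returned full text is still available afterwards for validation or caching.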
Evals before you ship
You wouldn't ship a feature without tests. Don't ship AI features without evals. We build a golden dataset of input/expected-output pairs and run evals on every model change and every prompt change. Unnoticed regressions are how AI features silently break.
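A golden-dataset eval run can be sketched as follows. The names (`EvalCase`, `runEvals`, `assertNoRegression`) and the exact-match scoring default are assumptions for illustration. Real evals often need fuzzier scoring, which is why the comparison is injectable.

```typescript
interface EvalCase { input: string; expected: string }

interface EvalReport { passRate: number; failures: EvalCase[] }

// Runs every golden case through `generate` and scores it with `matches`.
async function runEvals(
  cases: EvalCase[],
  generate: (input: string) => Promise<string>, // hypothetical: the prompt + model under test
  matches: (actual: string, expected: string) => boolean =
    (actual, expected) => actual.trim() === expected.trim(),
): Promise<EvalReport> {
  const failures: EvalCase[] = [];
  for (const c of cases) {
    const actual = await generate(c.input);
    if (!matches(actual, c.expected)) failures.push(c);
  }
  const passRate =
    cases.length === 0 ? 1 : (cases.length - failures.length) / cases.length;
  return { passRate, failures };
}

// CI gate: throw when the pass rate drops below the agreed threshold.
function assertNoRegression(report: EvalReport, minPassRate: number): void {
  if (report.passRate < minPassRate) {
    throw new Error(
      `Eval pass rate ${report.passRate} is below the minimum ${minPassRate}`,
    );
  }
}
```

Running this on every prompt or model change, and failing the build through `assertNoRegression`, is one way to turn a silent regression into a loud one.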