How I Cut a Client's OpenAI Bill by 62% Without Hurting Quality

In February a client showed me an OpenAI bill that had quietly grown to USD 1,470/month. They were not getting more users; the app had simply been leaking tokens for six months. We took it to USD 558/month without changing what the user sees. Here is the actual playbook, in the order I apply it.

Result: 62% cost reduction. Same product, same UX, same response quality on our eval set. It took three working days plus one weekend.

Step 1 — Measure before you cut

Before I touched anything I added a small middleware in the Node API that logged, per request: model, input tokens, output tokens, latency, cache hit or miss, and which feature triggered it. We logged to a small SQLite database for the first week, then to ClickHouse once we knew what we were tracking.

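// Assumes route handlers set res.locals.model, inTokens, outTokens and cacheHit,
// and that `metrics` is the app's own writer (SQLite at first, ClickHouse later).
app.use((req, res, next) => {
  const start = Date.now();
  // 'finish' fires once the response has been sent, so latency covers the whole request.
  res.on('finish', () => {
    metrics.write({
      feature: req.route?.path, // matched route pattern, used as the feature name
      model: res.locals.model,
      in_tokens: res.locals.inTokens,
      out_tokens: res.locals.outTokens,
      cache: res.locals.cacheHit ? 'hit' : 'miss',
      latency_ms: Date.now() - start,
    });
  });
  next();
});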

Within 48 hours we knew that 71% of cost was coming from one feature — an auto-summariser that ran on every page load. Most product cost is concentrated like this; you just have to actually look.
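To get from raw log rows to a number like that 71%, something along these lines is enough. This is a minimal sketch, not the query we ran: the `UsageRow` shape mirrors the middleware fields, while `Price`, `costShareByFeature` and the price table are hypothetical names, and you would plug in the current per-token prices for your own models.

```typescript
// Hypothetical row shape, mirroring the fields the middleware logs.
interface UsageRow {
  feature: string;
  model: string;
  in_tokens: number;
  out_tokens: number;
}

// Assumed price table entry, in USD per 1M tokens. Fill in real prices yourself.
interface Price {
  inputPerM: number;
  outputPerM: number;
}

interface FeatureCost {
  feature: string;
  usd: number;
  share: number; // fraction of total cost, 0..1
}

function costShareByFeature(rows: UsageRow[], prices: Record<string, Price>): FeatureCost[] {
  const totals = new Map<string, number>();
  let grand = 0;
  for (const row of rows) {
    const price = prices[row.model];
    if (!price) continue; // skip rows for models without a known price
    const usd =
      (row.in_tokens / 1e6) * price.inputPerM +
      (row.out_tokens / 1e6) * price.outputPerM;
    totals.set(row.feature, (totals.get(row.feature) ?? 0) + usd);
    grand += usd;
  }
  const out: FeatureCost[] = [];
  totals.forEach((usd, feature) => {
    out.push({ feature, usd, share: grand > 0 ? usd / grand : 0 });
  });
  // Most expensive feature first.
  return out.sort((a, b) => b.usd - a.usd);
}
```

Sort by cost, read the first row, and you usually know where to start.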

Step 2 — Stop using a Ferrari to deliver pizza

The single biggest win was model routing. We were calling GPT-4 for everything, including tasks where GPT-4o-mini or Claude Haiku scored the same on our eval. A simple router decides per request:

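// Illustrative types; the complexity label is assigned upstream, per feature.
// Model IDs are the ones from our config: check each provider's current IDs.
type Model = 'gpt-4o-mini' | 'claude-haiku-4' | 'gpt-4o';
interface Task { complexity: 'low' | 'medium' | 'high'; }

function pickModel(task: Task): Model {
  // Routine summaries, classification, simple Q&A
  if (task.complexity === 'low') return 'gpt-4o-mini';
  // Style-sensitive writing, longer chains
  if (task.complexity === 'medium') return 'claude-haiku-4';
  // Hard reasoning, multi-step tool use
  return 'gpt-4o';
}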

Cost impact alone: ~38% reduction on the summariser feature without a measurable quality drop on the eval. The key word is measurable. If you do not have an eval set, you are guessing.
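Here is roughly what "measurable" means in practice: a gate that only allows a downgrade when the cheaper model stays within a tolerance of the baseline on the eval set. This is a sketch under assumptions. The `EvalResult` shape, the function names and the 0.02 default tolerance are all hypothetical, and how you score each case is up to you.

```typescript
// Hypothetical eval record: one score per (model, eval case), in the range 0..1.
interface EvalResult {
  model: string;
  caseId: string;
  score: number;
}

// Mean score of one model across all eval cases it was run on.
function meanScore(results: EvalResult[], model: string): number {
  const scores = results.filter((r) => r.model === model).map((r) => r.score);
  if (scores.length === 0) throw new Error(`no eval results for ${model}`);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// True if the candidate's mean score is at most `tolerance` below the baseline's.
function canDowngrade(
  results: EvalResult[],
  baseline: string,
  candidate: string,
  tolerance = 0.02, // arbitrary default; pick one that matches your eval's noise
): boolean {
  return meanScore(results, baseline) - meanScore(results, candidate) <= tolerance;
}
```

Run it per task type rather than over the whole eval set, since a model that is fine for classification may still lose on long-form writing.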

Step 3 — Put the prompt on a diet

I went through every system prompt and counted tokens. The "be helpful, be concise, be polite" boilerplate was 380 tokens, repeated on every single call. Trimmed to 95 tokens. Same behaviour, because most of that boilerplate was not actually changing the model's behaviour — it was making the dev who wrote it feel safe.

Rule of thumb: if removing a sentence does not change the eval score, the sentence was decoration.
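That rule can be turned into a simple greedy loop: drop one sentence at a time, re-run the eval, and keep the cut if the score holds. The sketch below assumes a hypothetical `scorePrompt` callback that runs your eval set with a given system prompt and returns a mean score. It is synchronous here to keep the example short; a real scorer would call the model and be async. The 0.01 tolerance is an arbitrary placeholder.

```typescript
// Hypothetical scorer: runs the eval set with this system prompt, returns a mean score.
type ScoreFn = (prompt: string) => number;

function prunePrompt(
  sentences: string[],
  scorePrompt: ScoreFn,
  tolerance = 0.01, // how much score drop we are willing to call "no change"
): string[] {
  // Score of the full, untrimmed prompt is the reference point.
  const baseline = scorePrompt(sentences.join(' '));
  let kept = sentences.slice();
  for (const sentence of sentences) {
    const idx = kept.indexOf(sentence);
    if (idx === -1) continue;
    // Try the prompt without this one sentence.
    const trial = kept.slice(0, idx).concat(kept.slice(idx + 1));
    const score = scorePrompt(trial.join(' '));
    // If the eval does not notice, the sentence was decoration: drop it for good.
    if (baseline - score <= tolerance) kept = trial;
  }
  return kept;
}
```

Each trial costs one full eval run, so this is something to do once per prompt, not on every deploy.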

Step 4 — Cache the obvious

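About 30% of summariser calls were on documents that had not changed since the previous call. We added a Redis-backed exact-match cache keyed by document hash + model + prompt-template version. This is not semantic caching: an unchanged document always produces the same key, and any change to the document, model, or template produces a new one. On a cache hit, we skip the model entirely.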

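// The key changes whenever the document, the model, or the prompt template changes.
const key = `sum:${docHash}:${model}:${templateV}`;
const cached = await redis.get(key);
if (cached) return JSON.parse(cached); // cache hit: no model call, $0
const reply = await openai.chat.completions.create(...);
await redis.set(key, JSON.stringify(reply), 'EX', 86400); // expire after 24 hours
return reply;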

Hit rate stabilised around 34% in week 2. That is 34% of summariser calls we no longer pay for.
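The part people get wrong is the key, so it is worth testing without Redis in the loop. Below is a small in-memory stand-in I would use in unit tests, not the production code: `cacheKey`, `KeyParts` and `TtlCache` are hypothetical names, and the clock is injectable so expiry can be tested without waiting.

```typescript
interface KeyParts {
  docHash: string;         // hash of the document contents
  model: string;           // model the summary was generated with
  templateVersion: string; // bump this whenever the prompt template changes
}

// Same document + same model + same template => same key. Change any one => miss.
function cacheKey(p: KeyParts): string {
  return ['sum', p.docHash, p.model, p.templateVersion].join(':');
}

// Minimal in-memory cache with per-entry TTL, standing in for Redis in tests.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expiresAt: number }>();
  private now: () => number;

  constructor(now: () => number = () => Date.now()) {
    this.now = now;
  }

  get(key: string): V | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= this.now()) {
      this.entries.delete(key); // expired: treat as a miss
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: V, ttlSeconds: number): void {
    this.entries.set(key, { value, expiresAt: this.now() + ttlSeconds * 1000 });
  }
}
```

Putting the template version in the key means a prompt change invalidates old summaries automatically instead of serving stale ones for up to a day.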

Step 5 — Batch the background jobs

Anything not user-facing (nightly summaries, doc re-indexing, embeddings refreshes) moved to OpenAI's Batch API. Batch requests are billed at a 50% discount, and the 24-hour completion window is fine for things that run while everyone sleeps.
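The Batch API takes a JSONL file with one request per line, each carrying a `custom_id` so you can match results back to jobs. A minimal sketch of building that file is below; `BatchJob` and `toBatchJsonl` are hypothetical names, and the upload and batch-creation calls are left out because they depend on your SDK version.

```typescript
// Hypothetical job shape: one background task we want summarised overnight.
interface BatchJob {
  id: string;     // becomes custom_id, used to match results to jobs
  prompt: string; // user message content
}

// Build the JSONL body for a chat-completions batch: one JSON object per line.
function toBatchJsonl(jobs: BatchJob[], model: string): string {
  return jobs
    .map((job) =>
      JSON.stringify({
        custom_id: job.id,
        method: 'POST',
        url: '/v1/chat/completions',
        body: {
          model,
          messages: [{ role: 'user', content: job.prompt }],
        },
      }),
    )
    .join('\n');
}
```

`JSON.stringify` escapes newlines inside prompts, so each request stays on a single line, which the JSONL format requires.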

Step 6 — Compress long context

For document-heavy queries we were pushing 8k+ tokens of context. Most of it was repetitive boilerplate (legal headers, navigation, repeated metadata). A pre-processor strips that out:

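// Patterns are specific to this client's documents; tune them against your own corpus.
function compress(text: string): string {
  return text
    // Drop confidentiality notices: the word plus up to 200 characters on the same line.
    .replace(/Confidential.{0,200}/gi, '')
    // Drop the first table of contents, up to the next newline followed by a capital letter.
    .replace(/Table of Contents.+?(?=\n[A-Z])/s, '')
    // Collapse runs of whitespace (including newlines) into a single space.
    .replace(/\s{2,}/g, ' ')
    .trim();
}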

Average context per call dropped from 4,800 tokens to 1,950. Input cost per call dropped roughly in proportion.
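A quick sanity check on those numbers: going from 4,800 to 1,950 input tokens is a cut of about 59%. The total cost per call falls a little less than that when output tokens stay the same, because the output side of the bill is untouched. The sketch below shows the arithmetic; the function names are hypothetical and the prices are whatever you pass in.

```typescript
// Fractional reduction between two values, e.g. 0.5 means "half as much".
function reduction(before: number, after: number): number {
  if (before <= 0) throw new Error('before must be positive');
  return (before - after) / before;
}

// Cost of one call in USD. Prices are per 1M tokens and supplied by the caller.
function callCostUsd(
  inTokens: number,
  outTokens: number,
  inputPerM: number,
  outputPerM: number,
): number {
  return (inTokens / 1e6) * inputPerM + (outTokens / 1e6) * outputPerM;
}

// Input-token reduction from the compression step.
const inputCut = reduction(4800, 1950);
```

With any fixed number of output tokens, `reduction(callCostUsd(4800, ...), callCostUsd(1950, ...))` comes out somewhat below `inputCut`, which is why I say "roughly".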

Step 7 — Stop streaming when you do not need to

Streaming is great for chat. For background workers that just need the final string, streaming wastes a connection slot and gains you nothing. Switching the worker pool to non-streaming responses freed up enough connections that we dropped one worker container entirely. Small win but free.

Step 8 — Set hard ceilings, alarm on them

Last but most important. Set a daily and a monthly token budget per feature, alarm at 70% / 90% / 100%. We use a Postgres counter + cron, but anything works. The point is to find runaway costs before the invoice does.

Common mistake: teams alarm only on the total OpenAI bill. By the time that alarm fires, the leak has already burned for hours. Alarm per feature, so you know which code path is misbehaving.
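The threshold logic itself is small. The sketch below is not our Postgres + cron setup, only the check it runs for each feature: given the token count at the previous check and now, it returns the thresholds that were crossed in between, so each alarm fires once instead of on every tick. `crossedThresholds` and `THRESHOLDS_PCT` are hypothetical names.

```typescript
// Alarm levels as whole percentages of the budget, to avoid floating-point edge cases.
const THRESHOLDS_PCT = [70, 90, 100];

// Returns the thresholds crossed between the previous check and the current one.
// An empty array means no new alarm should fire for this feature.
function crossedThresholds(
  prevTokens: number,
  currTokens: number,
  budgetTokens: number,
): number[] {
  return THRESHOLDS_PCT.filter(
    (pct) =>
      prevTokens * 100 < pct * budgetTokens && // was below the line last time
      currTokens * 100 >= pct * budgetTokens,  // at or above it now
  );
}
```

Call it once per feature per tick with that feature's own daily or monthly budget, and route any non-empty result to whatever pages you.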

The final bill

Before: USD 1,470/month, GPT-4 for everything.
After: USD 558/month, mixed model routing, prompt diet, 34% cache hit, batched background work.
Eval score before vs after: 0.84 vs 0.83 (the same within the eval's margin of error).

None of these tricks are clever. They are just disciplined. If your AI app is burning more than it should, I can usually find the leak in a day — see the contact section on the homepage.