RAG vs Fine-tuning: How I Actually Decide for Production AI Apps

Every other week a client asks me the same question. "Should we fine-tune a model, or just use RAG?" The honest answer is: it depends on three numbers, and most teams never actually calculate them. Here is how I decide on real client work, with the cost math nobody likes to share.

Quick definitions for the team behind you. RAG (Retrieval-Augmented Generation) means you pull relevant chunks of your own documents into the prompt at query time. Fine-tuning means you re-train a model on your data so the weights themselves change. They solve different problems.

The three numbers that decide it

Before I open any vendor docs, I write down three values on a sticky note:

1. How often does the source material change? Daily, weekly, never?
2. How many tokens of context does each query genuinely need? 500? 5,000? 50,000?
3. How sensitive is the response style? Is it factual lookup, or does the model need to sound like the brand?

If the docs change often → RAG. If responses must sound like your brand voice or follow a specific output schema reliably → fine-tuning probably wins. If you need both, you do both, and yes that is annoying.
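The sticky-note rule is small enough to write down as code. This is only a sketch of the heuristic as stated above; the `Workload` and `recommend` names are made up for illustration, and falling back to RAG when neither pressure applies is my own assumption.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    docs_change_often: bool  # number 1: source material changes daily or weekly
    context_tokens: int      # number 2: tokens of context each query needs
    style_sensitive: bool    # number 3: brand voice or a strict output schema


def recommend(w: Workload) -> str:
    """Return 'rag', 'fine-tune', or 'hybrid' using the rule of thumb above."""
    if w.docs_change_often and w.style_sensitive:
        return "hybrid"
    if w.docs_change_often:
        return "rag"
    if w.style_sensitive:
        return "fine-tune"
    # Neither pressure applies. Falling back to RAG is an assumption of this
    # sketch, not a rule stated in the text.
    return "rag"


print(recommend(Workload(docs_change_often=True, context_tokens=5_000, style_sensitive=False)))
```

Note that `context_tokens` is recorded but not used by the rule itself. It matters for cost rather than for the first yes/no decision.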

Where I have actually shipped each one

Two real engagements, anonymised:

Case 1 — Customer support knowledge base (RAG)

A SaaS client with about 12,000 support documents and a Help Centre that gets edited daily. Fine-tuning here would have been silly: every doc change would force a re-train cycle lasting hours, plus a new evaluation run. We used RAG with OpenAI embeddings stored in PostgreSQL + pgvector, and a thin Node.js layer that:

1. Embeds the user question.
2. Pulls the top 6 chunks via cosine similarity.
3. Stuffs them into the prompt with explicit instructions.
4. Streams the answer back through SSE so the support agent sees tokens as they arrive.
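To make the retrieval and prompt-stuffing steps concrete, here is a minimal in-memory sketch. It is not the client's Node.js code. In the setup described here the ranking runs inside PostgreSQL (pgvector exposes cosine distance through its `<=>` operator), and the embeddings come from an API call. The function names, the toy 2-dimensional vectors, and the prompt wording are all assumptions for illustration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], chunks: list[tuple[str, list[float]]], k: int = 6) -> list[str]:
    """chunks is a list of (text, embedding); return the k most similar texts."""
    ranked = sorted(chunks, key=lambda c: cosine_similarity(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Number the chunks and wrap them in explicit instructions (wording is hypothetical)."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context_chunks))
    return (
        "Answer using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


# Toy data: real embeddings have hundreds or thousands of dimensions.
docs = [("reset password", [1.0, 0.0]), ("billing", [0.0, 1.0]), ("password rules", [0.9, 0.1])]
print(build_prompt("How do I reset my password?", top_k([1.0, 0.0], docs, k=2)))
```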

Inference cost ended up at about USD 180/month for ~25,000 queries. Re-indexing happens nightly on whatever changed in the CMS.
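One common way to find "whatever changed" is to store a content hash next to each document's embedding and compare on every run. The sketch below assumes that approach. Hash comparison, the `plan_reindex` name, and the dictionary shapes are all illustrative guesses rather than how this project necessarily did it.

```python
import hashlib


def content_hash(text: str) -> str:
    """Stable fingerprint of a document body."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(current_docs: dict[str, str], indexed_hashes: dict[str, str]):
    """Compare CMS docs against stored hashes.

    current_docs:   doc_id -> current text in the CMS
    indexed_hashes: doc_id -> hash stored when the doc was last embedded
    Returns (to_embed, to_delete) as lists of doc ids.
    """
    to_embed = [
        doc_id for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != content_hash(text)  # new or edited
    ]
    to_delete = [doc_id for doc_id in indexed_hashes if doc_id not in current_docs]
    return to_embed, to_delete


cms = {"a": "hello", "b": "edited body", "c": "brand new"}
stored = {"a": content_hash("hello"), "b": content_hash("old body"), "d": content_hash("removed")}
print(plan_reindex(cms, stored))
```

Only the ids in `to_embed` cost embedding tokens, which is why the nightly job stays cheap even with thousands of documents.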

Case 2 — Voice-agent script discipline (fine-tuning)

A telephony client wanted their AI voice agents to follow a very specific opening, escalation phrasing, and closing line — every single call, no drift. RAG was wrong for this. The "knowledge" was already in the system prompt; the problem was style adherence. We fine-tuned a small open-source model on roughly 800 high-quality call transcripts. The system prompt then shrank to almost nothing, latency dropped by ~40%, and we stopped paying for the giant system-prompt tokens on every call.

Fine-tuning rewards consistency. RAG rewards freshness. Pick the one that matches the problem you actually have.

The cost math, with real numbers

This is where most blog posts go quiet. Let us be specific. Assume 100,000 queries per month, average 800-token responses, average 1,500-token context.

RAG with GPT-4o-mini (May 2026 prices): input ~USD 0.15 per million tokens, output ~USD 0.60 per million tokens. Roughly USD 95/month for inference, plus around USD 12/month for embedding storage (self-hosted pgvector) and ~USD 4/month for a nightly re-embed of changed docs. Total ~USD 111/month.
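If you want to redo this arithmetic for your own volumes, the token part is one multiplication per side. The sketch below plugs in only the stated assumptions (100,000 queries, 1,500 context tokens in, 800 tokens out). On those numbers alone it lands near USD 70.50, lower than the rough USD 95 quoted here. My reading is that real prompts also carry a system prompt and the user question on top of the retrieved context, but that is an assumption, so treat the function as a calculator rather than a reproduction of the figure.

```python
def monthly_token_cost(queries, input_tokens, output_tokens,
                       input_price_per_m, output_price_per_m):
    """Monthly USD cost from per-query token counts and per-million-token prices."""
    input_cost = queries * input_tokens / 1_000_000 * input_price_per_m
    output_cost = queries * output_tokens / 1_000_000 * output_price_per_m
    return input_cost, output_cost


input_cost, output_cost = monthly_token_cost(
    queries=100_000,
    input_tokens=1_500,       # retrieved context only, no system prompt or question
    output_tokens=800,
    input_price_per_m=0.15,
    output_price_per_m=0.60,
)
print(f"input  USD {input_cost:.2f}")
print(f"output USD {output_cost:.2f}")
print(f"total  USD {input_cost + output_cost:.2f}")
```

Output tokens dominate the bill at these prices, so trimming response length usually pays off faster than trimming context.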

Fine-tuned GPT-4o-mini at the same volume: training cost is one-off (~USD 25 for a small dataset), then inference at ~3x the base model price. So inference cost climbs to ~USD 285/month (3 × USD 95). You save on input tokens (smaller system prompt), call it ~USD 60 saved. Net ~USD 225/month, or ~USD 250 in the first month once the training run is counted, plus a re-train cycle every time the brief shifts.

For most teams, RAG is just cheaper. Fine-tuning starts winning when you can shrink the prompt enough that token savings outweigh the multiplier — which usually means high-volume, narrow-domain workloads.
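The break-even point can be worked out directly. The sketch below assumes the ~3x multiplier applies to both input and output tokens (my reading of the pricing, not something spelled out here), and the 12,000-token prompt in the second call is a hypothetical example of a prompt-heavy workload.

```python
from typing import Optional


def max_finetuned_prompt_tokens(base_input_tokens: int, output_tokens: int,
                                input_price: float, output_price: float,
                                multiplier: float) -> Optional[float]:
    """Largest fine-tuned prompt (tokens per query) that still undercuts the base model.

    Solves for x in:
        multiplier * (x * input_price + output_tokens * output_price)
            < base_input_tokens * input_price + output_tokens * output_price
    Returns None when even an empty prompt cannot break even.
    """
    base_cost = base_input_tokens * input_price + output_tokens * output_price
    finetuned_output_cost = multiplier * output_tokens * output_price
    budget = base_cost - finetuned_output_cost
    if budget <= 0:
        return None
    return budget / (multiplier * input_price)


# Stated workload: 1,500 tokens in, 800 out. No prompt size breaks even.
print(max_finetuned_prompt_tokens(1_500, 800, 0.15, 0.60, 3.0))

# Hypothetical prompt-heavy workload: 12,000 tokens in, 800 out.
print(max_finetuned_prompt_tokens(12_000, 800, 0.15, 0.60, 3.0))
```

With the stated 1,500-token context, the tripled output cost alone exceeds the whole base-model bill, so the function returns `None`. In the hypothetical 12,000-token case, the fine-tuned prompt would have to come in under roughly 1,867 tokens to win on price.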

The hybrid that actually ships

What I recommend on most production work is a hybrid. Fine-tune for tone, schema, and refusal behaviour. Use RAG for facts. Concretely:

User question
   ↓
[Retriever] → top 6 chunks from pgvector
   ↓
[Fine-tuned model] → style, format, safety rails
   ↓
Final answer (streamed)

The retriever stays simple and cheap. The fine-tune handles the part that humans complain about ("it sounds robotic", "it broke our output schema again", "it apologised instead of answering"). Those complaints rarely come from missing facts.
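In code, the hybrid is just two pluggable steps. This is a rough sketch with stand-in functions so it runs without a database or a model endpoint. The `answer`, `fake_retrieve`, and `fake_generate` names and the sample chunk are invented for illustration, and streaming is left out.

```python
from typing import Callable, List

Retriever = Callable[[str, int], List[str]]  # (question, k) -> context chunks
Generator = Callable[[str], str]             # prompt -> answer text


def answer(question: str, retrieve: Retriever, generate: Generator, k: int = 6) -> str:
    """Retrieve facts first, then let the fine-tuned model handle style and format."""
    chunks = retrieve(question, k)
    context = "\n\n".join(chunks)
    # The prompt stays short: tone, schema, and refusal behaviour are expected
    # to live in the fine-tuned weights, not in instructions.
    prompt = f"Context:\n{context}\n\nQuestion: {question}"
    return generate(prompt)


def fake_retrieve(question: str, k: int) -> List[str]:
    """Stand-in for the pgvector lookup; the chunk text is made up."""
    return ["Refunds are processed within 5 business days."][:k]


def fake_generate(prompt: str) -> str:
    """Stand-in for the fine-tuned model call."""
    return f"(model reply based on a {len(prompt)}-character prompt)"


print(answer("How long do refunds take?", fake_retrieve, fake_generate))
```

Keeping the two steps behind plain function types means either side can be swapped or evaluated on its own.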

Evaluation: the boring part that matters

Whichever way you go, you need an evaluation set. About 100 hand-graded examples is the absolute minimum I will start with on a client project. We score:

Factual accuracy — does the answer match the source?
Style adherence — does it sound on-brand?
Refusal correctness — does it say "I do not know" when it should?
Latency p95 — slow answers are worse than no answers in voice / live chat.

Run the same eval against RAG, fine-tuned, and hybrid. The winner is rarely the one you guessed.
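A scoring harness for those four measures does not need to be fancy. The sketch below assumes each example has already been graded by hand into pass/fail flags. The `GradedExample` fields and the nearest-rank p95 method are my choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class GradedExample:
    factual: bool      # answer matches the source
    on_brand: bool     # style adherence
    refusal_ok: bool   # said "I do not know" exactly when it should
    latency_ms: float


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile (integer maths avoids float rounding)."""
    if not values:
        raise ValueError("p95 needs at least one value")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def summarise(examples: list[GradedExample]) -> dict:
    """Aggregate hand grades into the four scores."""
    n = len(examples)
    return {
        "factual_accuracy": sum(e.factual for e in examples) / n,
        "style_adherence": sum(e.on_brand for e in examples) / n,
        "refusal_correctness": sum(e.refusal_ok for e in examples) / n,
        "latency_p95_ms": p95([e.latency_ms for e in examples]),
    }


# Synthetic grades, only to show the shape of the output.
demo = [GradedExample(i % 2 == 0, True, i != 0, 100.0 * (i + 1)) for i in range(20)]
print(summarise(demo))
```

Call `summarise` once per variant (RAG, fine-tuned, hybrid) on the same example set and compare the dictionaries side by side.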

My default starting point

Unless I have a specific reason to do otherwise, I start every AI project with: GPT-4o-mini or Claude Haiku + RAG on pgvector + a careful system prompt. That gets to a working v0 in about a week. We only graduate to fine-tuning after the eval set proves a real, measurable gap that retrieval cannot close.

If you want me to set this up for your product — embeddings pipeline, retriever, eval harness, dashboard, all of it — drop a message via the contact section on the homepage.
