Every other week a client asks me the same question. "Should we fine-tune a model, or just use RAG?" The honest answer is: it depends on three numbers, and most teams never actually calculate them. Here is how I decide on real client work, with the cost math nobody likes to share.
The three numbers that decide it
Before I open any vendor docs, I write down three values on a sticky note:
- How often does the source material change? Daily, weekly, never?
- How many tokens of context does each query genuinely need? 500? 5,000? 50,000?
- How sensitive is the response style? Is it factual lookup, or does the model need to sound like the brand?
If the docs change often → RAG. If responses must sound like your brand voice or follow a specific output schema reliably → fine-tuning probably wins. If you need both, you do both, and yes that is annoying.
Where I have actually shipped each one
Two real engagements, anonymised:
Case 1 — Customer support knowledge base (RAG)
A SaaS client with about 12,000 support documents and a Help Centre that gets edited daily. Fine-tuning here would have been silly — every doc change would force a re-train cycle of hours and a new evaluation run. We used RAG with OpenAI embeddings stored in PostgreSQL + pgvector, and a thin Node.js layer that:
- Embeds the user question.
- Pulls the top 6 chunks via cosine similarity.
- Stuffs them into the prompt with explicit instructions.
- Streams the answer back through SSE so the support agent sees tokens as they arrive.
Total monthly inference cost ended up about USD 180/month for ~25,000 queries. Re-indexing happens nightly on whatever changed in the CMS.
Case 2 — Voice-agent script discipline (fine-tuning)
A telephony client wanted their AI voice agents to follow a very specific opening, escalation phrasing, and closing line — every single call, no drift. RAG was wrong for this. The "knowledge" was already in the system prompt; the problem was style adherence. We fine-tuned a small open-source model on roughly 800 high-quality call transcripts. The system prompt then shrank to almost nothing, latency dropped by ~40%, and we stopped paying for the giant system-prompt tokens on every call.
The cost math, with real numbers
This is where most blog posts go quiet. Let us be specific. Assume 100,000 queries per month, average 800-token responses, average 1,500-token context.
RAG with GPT-4o-mini (May 2026 prices): input ~$0.15/M tokens, output ~$0.60/M tokens. Roughly USD 95/month for inference, plus around USD 12/month for embeddings storage (pgvector self-hosted) and ~USD 4 for a nightly re-embed of changed docs. Total ~$111/month.
Fine-tuned GPT-4o-mini at the same volume: training cost is one-off (~USD 25 for a small dataset), then inference at ~3x base model price. So output cost climbs to ~USD 285/month. You save on input tokens (smaller system prompt) — call it ~USD 60 saved. Net ~$250/month plus a re-train cycle every time the brief shifts.
For most teams, RAG is just cheaper. Fine-tuning starts winning when you can shrink the prompt enough that token savings outweigh the multiplier — which usually means high-volume, narrow-domain workloads.
The hybrid that actually ships
What I recommend on most production work is a hybrid. Fine-tune for tone, schema, and refusal behaviour. Use RAG for facts. Concretely:
User question
↓
[Retriever] → top 6 chunks from pgvector
↓
[Fine-tuned model] → style, format, safety rails
↓
Final answer (streamed)
The retriever stays simple and cheap. The fine-tune handles the part that humans complain about ("it sounds robotic", "it broke our output schema again", "it apologised instead of answering"). Those complaints rarely come from missing facts.
Evaluation: the boring part that matters
Whichever way you go, you need an evaluation set. About 100 hand-graded examples is the absolute minimum I will start with on a client project. We score:
- Factual accuracy — does the answer match the source?
- Style adherence — does it sound on-brand?
- Refusal correctness — does it say "I do not know" when it should?
- Latency p95 — slow answers are worse than no answers in voice / live chat.
Run the same eval against RAG, fine-tuned, and hybrid. The winner is rarely the one you guessed.
My default starting point
Unless I have a specific reason to do otherwise, I start every AI project with: GPT-4o-mini or Claude Haiku + RAG on pgvector + a careful system prompt. That gets to a working v0 in about a week. We only graduate to fine-tuning after the eval set proves a real, measurable gap that retrieval cannot close.
If you want me to set this up for your product — embeddings pipeline, retriever, eval harness, dashboard, all of it — drop a message via the contact section on the homepage.