LLM selection — AI Delivery Playbook

Decision 1

RAG or not?

RAG = before answering, fetch the few most relevant chunks of the client's knowledge and put them in the prompt. Walk this tree top to bottom and stop at the first match.

Q1 · Does it need facts the base model doesn't reliably know (policies, private / internal / recent data)?

└ NO → PROMPT-ONLY. Write a good system prompt. Cheapest. Always try first.

└ YES ↓

Q2 · Is that knowledge small AND stable (fits in the prompt, rarely changes)?

├ YES → LONG-CONTEXT. Paste the docs in every call + turn on prompt caching.

└ NO (large / many docs / changes often) → RAG. Embed once, retrieve top-k per question.

Q3 · Really about teaching a STYLE / FORMAT / BEHAVIOUR, not facts? → FINE-TUNE. Last resort; most expensive to build & maintain.

Default bias

prompt < long-context < RAG < fine-tune. Most business projects land on RAG. Fine-tuning is rare — it teaches behaviour, not facts, and a good RAG + prompt usually wins for far less effort.

Why our proposal assistant = hybrid

Rate card & method are small + stable → pinned in the prompt (long-context); past projects grow and similarity matters → RAG over them; it's facts not a writing style (not fine-tune). → pinned + RAG.

Cheap RAG that fits a low-cost agency

Embeddings: a local sentence-transformers model (e.g. all-MiniLM-L6-v2). Runs on CPU, $0 per call, no lock-in. Good enough for small/medium KBs.
Vector store: for a few thousand chunks, in-memory numpy cosine search is plenty. Add a real vector DB (pgvector, Qdrant) only when volume forces it.
Generation: Claude Haiku, grounded strictly in the retrieved chunks.

Decision 2

Which Claude model?

Default to the Claude family and pick by tier. Verify live pricing in the Anthropic docs before quoting a client — IDs and tiers below, prices change.

Tier	Model ID	When to use
cheap / fast	`claude-haiku-4-5-20251001`	Our default. High-volume, well-scoped tasks: classification, extraction, grounded RAG answers, drafting.
balanced	`claude-sonnet-4-6`	Harder reasoning, multi-step, ambiguous inputs, agent loops. Also a solid eval judge.
most capable	`claude-opus-4-8`	Hardest reasoning / agents / edge cases — or a strict eval judge over a cheaper shipped model.

The method

Start at Haiku. Build the thin slice on it.
Run the eval. Green + within budget → ship. Done, and cheap.
Red on a specific skill? Route just that step to Sonnet before upgrading everything. Re-eval.
Still red? Climb to Sonnet, then Opus, re-evaluating each step. Stop the moment it's green.
Judge with a stronger model than the one you ship.

Cost levers (pull before upgrading)

Prompt caching for stable parts (system prompt, pasted docs).
Retrieve fewer / smaller chunks — less context, fewer input tokens.
Cap output tokens.
Batch non-urgent work.

Why our proposal assistant climbed to Sonnet

We started on Haiku and ran the eval — it under-structured the proposals and mis-anchored estimates. Drafting a coherent, grounded proposal is real reasoning, so we climbed one tier to claude-sonnet-4-6 and judge with claude-opus-4-8. The eval, not the gut, decides.