Workflow — AI Delivery Playbook

Discovery & Scoping

› What does the client actually want, and how will we know we delivered?

Problem in one sentence: "<user> wants to <do X> so that <outcome>."
Collect 5–10 example input → ideal-output pairs. These become your eval set.
Write down what's out of scope. Agree a measurable success criterion + constraints (budget, latency, languages, privacy).

Gate: one-page scope + example I/O pairs exist. Can't write the examples? You don't understand the problem yet.
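
One way to hold the scope and the example pairs so that the gate becomes checkable is sketched below. This is a minimal illustration, not a prescribed format: the class names, field names, and the sample problem statement are all assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Example:
    """One input -> ideal-output pair collected during scoping."""
    input: str
    ideal_output: str


@dataclass
class Scope:
    """One-page scope. All field names here are illustrative assumptions."""
    problem: str
    success_criterion: str
    out_of_scope: list[str] = field(default_factory=list)
    examples: list[Example] = field(default_factory=list)

    def gate_passed(self, min_examples: int = 5) -> bool:
        """Gate: problem, criterion, and at least `min_examples` I/O pairs exist."""
        has_text = bool(self.problem and self.success_criterion)
        return has_text and len(self.examples) >= min_examples


# Made-up sample content, only to show the shape of the data.
scope = Scope(
    problem="A support agent wants to draft refund replies so that tickets close faster.",
    success_criterion="At least 8 of 10 golden answers judged correct",
    out_of_scope=["issuing refunds automatically"],
    examples=[Example(f"ticket {i}", f"ideal reply {i}") for i in range(5)],
)
print(scope.gate_passed())
```

The same `examples` list can later be reused as the golden set in the evaluation phase.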

↓

Data & Knowledge

› Is there knowledge the model needs that it wasn't trained on?

Inventory sources: docs, FAQs, tickets, a DB, an API, or nothing.
Tag each stable (policies, manuals) vs live (stock, prices, order status).
Check quality (accurate, current, one language, machine-readable) and estimate volume.

Gate: you know exactly which knowledge the answer depends on, and where it lives. This decides Phase 3 (Approach).
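
The inventory can be as simple as a small table in code. The sketch below assumes a hypothetical `Source` record with a stable/live tag, a machine-readable flag, and a rough volume estimate; the sample rows are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Freshness(Enum):
    STABLE = "stable"  # policies, manuals
    LIVE = "live"      # stock, prices, order status


@dataclass(frozen=True)
class Source:
    """One knowledge source. Field names are illustrative assumptions."""
    name: str
    kind: str               # e.g. "docs", "faq", "tickets", "db", "api"
    freshness: Freshness
    machine_readable: bool
    approx_items: int       # rough volume estimate


def needs_attention(sources):
    """Names of sources that fail the machine-readable quality check."""
    return [s.name for s in sources if not s.machine_readable]


# Invented sample inventory.
inventory = [
    Source("returns policy", "docs", Freshness.STABLE, True, 40),
    Source("order status", "api", Freshness.LIVE, True, 0),
    Source("scanned manuals", "docs", Freshness.STABLE, False, 300),
]

live = [s.name for s in inventory if s.freshness is Freshness.LIVE]
print("live sources:", live)
print("needs cleanup:", needs_attention(inventory))
```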

↓

Approach — prompt, RAG, or fine-tune

› How does the needed knowledge get to the model?

Use the decision tree on the LLM selection page. Default bias: prefer the simplest approach that passes, in this order: prompt, long-context, RAG, fine-tune.

Gate: the approach is written down with its reason.

↓

Model Selection

› Which is the cheapest model that passes the eval?

Default to the Claude family; start at the cheapest tier and climb only if forced.
Route just the hard step to a stronger model before upgrading everything.
Use a stronger model than the one you ship as the eval judge.

Gate: a model is chosen because the eval passed at that tier — not because it felt safe. Full detail on the LLM selection page.
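
"Start at the cheapest tier and climb only if forced" can be sketched as a loop over tiers ordered by price. The tier labels, the `run_eval` callable, and the scores below are hypothetical stand-ins; a real `run_eval` would run the golden set against that model.

```python
from typing import Callable, Optional, Sequence


def cheapest_passing_tier(
    tiers: Sequence[str],
    run_eval: Callable[[str], float],
    threshold: float,
) -> Optional[str]:
    """Return the first tier (ordered cheapest first) whose eval score meets the threshold."""
    for tier in tiers:
        if run_eval(tier) >= threshold:
            return tier
    return None  # no tier passed: revisit the approach, not just the model


# Canned scores standing in for real eval runs (assumed values).
fake_scores = {"small": 0.62, "medium": 0.86, "large": 0.91}

chosen = cheapest_passing_tier(
    ["small", "medium", "large"], fake_scores.__getitem__, threshold=0.80
)
print(chosen)
```

Because the function stops at the first passing tier, the more expensive tiers are never evaluated unless the cheaper ones fail.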

↓

Build the thin slice

› What's the smallest end-to-end version that handles one real example?

Wire the whole pipeline for ONE input: input → (retrieve) → prompt → model → output.
Add guardrails early: refuse / "I don't know" when the answer isn't in the KB; never invent facts, prices, or actions.
Keep prompts in version control.

Gate: one real example flows end to end and produces a sane, grounded answer.
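
A thin slice with an early guardrail might look like the sketch below. Everything here is an assumption for illustration: retrieval is a naive keyword match over a tiny in-memory KB, and `echo_model` is a stub in place of a real model call.

```python
from typing import Callable

REFUSAL = "I don't know."


def retrieve(question: str, kb: dict[str, str]) -> list[str]:
    """Naive retrieval: return passages whose key appears in the question."""
    q = question.lower()
    return [text for key, text in kb.items() if key in q]


def answer(question: str, kb: dict[str, str], model: Callable[[str], str]) -> str:
    """input -> retrieve -> prompt -> model -> output, for one question."""
    context = retrieve(question, kb)
    if not context:
        # Guardrail: nothing relevant in the KB, so refuse without calling the model.
        return REFUSAL
    prompt = (
        "Answer using only the context. If the context does not contain "
        f"the answer, reply exactly: {REFUSAL}\n\n"
        "Context:\n" + "\n".join(context) + f"\n\nQuestion: {question}"
    )
    return model(prompt)


def echo_model(prompt: str) -> str:
    """Stub model: returns the first context line instead of calling an API."""
    return prompt.split("Context:\n", 1)[1].split("\n", 1)[0]


kb = {"return": "Returns are accepted within 30 days."}
print(answer("What is your return policy?", kb, echo_model))
print(answer("Do you ship to Mars?", kb, echo_model))
```

Swapping `echo_model` for a real client call and `retrieve` for a real retriever keeps the same shape, which is the point of wiring the whole pipeline first.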

↓

Evaluation

› How do we prove it works — to ourselves and the client?

Turn the Phase-1 (Discovery & Scoping) examples into a golden set (input + ideal answer).
Score automatically: keyword/exact checks + an LLM-judge (correctness + groundedness: every claim supported by retrieved context).
Track a scorecard; watch cost & latency alongside quality — all three are the product.

Gate: a scorecard exists and meets the Phase-1 success criterion. No green eval, no launch.
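
The automatic part of the scoring can start very small. The sketch below covers only the keyword check and the scorecard aggregation; the LLM-judge step is left out, and the `GoldenCase` fields and scorecard keys are assumed names rather than a fixed schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One golden-set entry. Field names are illustrative assumptions."""
    input: str
    ideal_answer: str
    required_keywords: tuple[str, ...]


def keyword_pass(output: str, case: GoldenCase) -> bool:
    """True if every required keyword appears in the output (case-insensitive)."""
    text = output.lower()
    return all(k.lower() in text for k in case.required_keywords)


def scorecard(cases, outputs, costs_usd, latencies_s):
    """Aggregate quality, cost, and latency into one record."""
    n = len(cases)
    passed = sum(keyword_pass(o, c) for o, c in zip(outputs, cases))
    return {
        "pass_rate": passed / n,
        "avg_cost_usd": sum(costs_usd) / n,
        "max_latency_s": max(latencies_s),
    }


# Invented sample data.
cases = [
    GoldenCase("return window?", "30 days", ("30 days",)),
    GoldenCase("shipping cost?", "Free over 50 EUR", ("free", "50")),
]
outputs = ["You can return items within 30 days.", "Shipping costs vary."]

card = scorecard(cases, outputs, costs_usd=[0.25, 0.75], latencies_s=[1.0, 2.0])
print(card)
```

Keeping cost and latency in the same record as the pass rate makes it harder to report quality without the other two.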

↓

Deploy, Monitor & Hand-off

› Is it live, watched, and maintainable by someone other than its author?

Ship behind the smallest surface that works (CLI, internal API, webhook, widget).
Log every request: input, retrieved context, output, model, tokens, cost, latency. Add a feedback signal.
Set a budget alarm. Write a one-page runbook. Schedule re-eval — knowledge bases rot.

Gate: in production, logged, budget-alarmed, documented enough to hand over.
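
The logged fields listed above map naturally onto one record per request. The following sketch assumes a JSON-lines log and a simple cumulative budget check; the `RequestLog` name, the `feedback` values, and the sample numbers are illustrative only.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RequestLog:
    """One logged request, mirroring the fields listed in this phase."""
    input: str
    retrieved_context: list[str]
    output: str
    model: str
    tokens: int
    cost_usd: float
    latency_s: float
    feedback: Optional[str] = None  # assumed values, e.g. "up" or "down"


def to_json_line(record: RequestLog) -> str:
    """Serialize one record as a single JSON line."""
    return json.dumps(asdict(record), ensure_ascii=False)


def over_budget(records, budget_usd: float) -> bool:
    """Budget alarm: True once total logged cost exceeds the budget."""
    return sum(r.cost_usd for r in records) > budget_usd


# Invented sample records.
records = [
    RequestLog("return window?", ["Returns: 30 days."], "30 days.", "small", 120, 0.5, 0.8),
    RequestLog("shipping?", [], "I don't know.", "small", 60, 0.75, 0.4, feedback="down"),
]

print(to_json_line(records[0]))
print(over_budget(records, budget_usd=1.0))
```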

The loop

Phases 4–6 (Model Selection, Build the thin slice, Evaluation) are a loop, not a line: pick a model → build → eval → (too weak? climb a tier or add RAG; too expensive? drop a tier or shrink the prompt) → eval again. You exit when the eval is green and the cost fits the budget. That loop is the whole job.
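
The exit condition and the two corrective moves can be written down as a tiny decision function. This is a sketch under assumptions: the parameter names are invented, and checking quality before cost when both fail is a choice made here, not something the playbook specifies.

```python
def next_step(score: float, cost: float, threshold: float, budget: float) -> str:
    """Decide what to do after one eval run."""
    if score < threshold:
        return "climb a tier or add RAG"            # too weak
    if cost > budget:
        return "drop a tier or shrink the prompt"   # too expensive
    return "exit"                                   # eval green and cost fits


print(next_step(score=0.70, cost=0.02, threshold=0.80, budget=0.05))
print(next_step(score=0.90, cost=0.09, threshold=0.80, budget=0.05))
print(next_step(score=0.90, cost=0.03, threshold=0.80, budget=0.05))
```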