The delivery workflow

Seven phases, seven gates.

Each phase asks one question and ends with a gate. You don't advance until the gate is green. Phases 4–6 are a loop, not a line.

1

Discovery & Scoping

› What does the client actually want, and how will we know we delivered?
  • Problem in one sentence: "<user> wants to <do X> so that <outcome>."
  • Collect 5–10 example input → ideal-output pairs. These become your eval set.
  • Write down what's out of scope. Agree on a measurable success criterion and the constraints (budget, latency, languages, privacy).
Gate: one-page scope + example I/O pairs exist. Can't write the examples? You don't understand the problem yet.
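A minimal sketch of how the example pairs could be stored and checked against the gate. The two pairs, the field names `input` and `ideal`, and the helpers `check_gate` and `to_jsonl` are illustrative assumptions, not part of the workflow itself; the minimum of 5 comes from the "5–10" range above.

```python
import json

# Hypothetical pairs; replace with real client inputs and the agreed ideal outputs.
EXAMPLES = [
    {"input": "Where is my order 1234?", "ideal": "Order status with a tracking link."},
    {"input": "Do you ship to Norway?", "ideal": "Yes, with delivery time and cost."},
]

def check_gate(examples, minimum=5):
    """Return a list of problems; an empty list means the example half of the gate is met."""
    problems = []
    if len(examples) < minimum:
        problems.append(f"need at least {minimum} pairs, have {len(examples)}")
    for i, ex in enumerate(examples):
        if not ex.get("input", "").strip() or not ex.get("ideal", "").strip():
            problems.append(f"pair {i} is missing an input or an ideal output")
    return problems

def to_jsonl(examples):
    """Serialize one pair per line so the file diffs cleanly in version control."""
    return "\n".join(json.dumps(ex, ensure_ascii=False) for ex in examples)

if __name__ == "__main__":
    for problem in check_gate(EXAMPLES):
        print("gate not met:", problem)
```

Keeping the pairs in a plain file from day one means Phase 6 can load them as the golden set without retyping anything.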
2

Data & Knowledge

› Is there knowledge the model needs that it wasn't trained on?
  • Inventory sources: docs, FAQs, tickets, a DB, an API, or nothing.
  • Tag each stable (policies, manuals) vs live (stock, prices, order status).
  • Check quality (accurate, current, one language, machine-readable) and estimate volume.
Gate: you know exactly which knowledge the answer depends on, and where it lives. This decides Phase 3.
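The inventory can be as small as a list of records. The sketch below assumes a hypothetical `Source` record and two made-up entries; the field names are illustrative, and only the stable/live tag and the "where it lives" requirement come from the text above.

```python
from dataclasses import dataclass

FRESHNESS = ("stable", "live")

@dataclass
class Source:
    name: str
    kind: str        # e.g. "docs", "faq", "tickets", "db", "api"
    freshness: str   # "stable" (policies, manuals) or "live" (stock, prices, order status)
    location: str    # where the knowledge lives

    def __post_init__(self):
        # Reject untagged or mistyped entries early.
        if self.freshness not in FRESHNESS:
            raise ValueError(f"{self.name}: freshness must be one of {FRESHNESS}")

# Hypothetical entries for illustration only.
INVENTORY = [
    Source("returns policy", "docs", "stable", "shared drive"),
    Source("order status", "api", "live", "orders service"),
]

def split_by_freshness(inventory):
    """Group source names by tag so the Phase 3 decision can see both lists at a glance."""
    stable = [s.name for s in inventory if s.freshness == "stable"]
    live = [s.name for s in inventory if s.freshness == "live"]
    return stable, live
```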
3

Approach — prompt, RAG, or fine-tune

› How does the needed knowledge get to the model?
  • Use the decision tree on the LLM selection page. Default bias, simplest first: prompt, then long-context, then RAG, then fine-tune. Prefer the simplest option that works.
Gate: the approach is written down with its reason.
4

Model Selection

› Which is the cheapest model that passes the eval?
  • Default to the Claude family; start at the cheapest tier and climb only if forced.
  • Route just the hard step to a stronger model before upgrading everything.
  • For the eval judge, use a model stronger than the one you ship.
Gate: a model is chosen because the eval passed at that tier — not because it felt safe. Full detail on the LLM selection page.
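"Start at the cheapest tier and climb only if forced" can be sketched as a short search. The tier labels, the fake scores, and the `run_eval` callback are assumptions for illustration; in practice `run_eval` would run the golden set against a real model and return its score.

```python
def cheapest_passing_tier(tiers, run_eval, threshold):
    """Return (tier, score) for the first tier that meets the threshold.

    `tiers` must be ordered cheapest first. `run_eval(tier)` is a hypothetical
    callback returning a quality score between 0 and 1.
    Returns (None, None) when no tier passes.
    """
    for tier in tiers:
        score = run_eval(tier)
        if score >= threshold:
            return tier, score
    return None, None

# Made-up scores standing in for real eval runs.
FAKE_SCORES = {"small": 0.72, "medium": 0.91, "large": 0.95}

if __name__ == "__main__":
    tier, score = cheapest_passing_tier(["small", "medium", "large"], FAKE_SCORES.get, 0.90)
    print(tier, score)
```

Because the search stops at the first pass, more expensive tiers are never evaluated unless a cheaper one fails.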
5

Build the thin slice

› What's the smallest end-to-end version that handles one real example?
  • Wire the whole pipeline for ONE input: input → (retrieve) → prompt → model → output.
  • Add guardrails early: refuse / "I don't know" when the answer isn't in the KB; never invent facts, prices, or actions.
  • Keep prompts in version control.
Gate: one real example flows end to end and produces a sane, grounded answer.
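A thin slice can be a handful of functions wired in order. In the sketch below everything is a stand-in: the two-entry `KB`, the keyword `retrieve`, and especially `call_model`, which only echoes the first context line where a real model call would go. The part worth copying is the shape, including the early refusal when retrieval comes back empty.

```python
# Hypothetical knowledge base; real content comes from the Phase 2 inventory.
KB = {
    "returns": "Items can be returned within 30 days with a receipt.",
    "shipping": "Standard shipping takes 3 to 5 business days.",
}

def retrieve(question):
    """Toy retrieval: return KB passages whose key appears in the question."""
    q = question.lower()
    return [text for key, text in KB.items() if key in q]

def build_prompt(question, passages):
    """Put the retrieved passages in front of the question."""
    context = "\n".join(passages)
    return f"Answer only from the context.\n\nContext:\n{context}\n\nQuestion: {question}"

def call_model(prompt):
    """Stand-in for a real model call: echoes the first line of the context."""
    return prompt.split("Context:\n", 1)[1].split("\n", 1)[0]

def answer(question):
    """input -> retrieve -> prompt -> model -> output, with a refusal guardrail."""
    passages = retrieve(question)
    if not passages:
        # Guardrail: nothing in the KB supports an answer, so refuse instead of guessing.
        return "I don't know."
    return call_model(build_prompt(question, passages))

if __name__ == "__main__":
    print(answer("What is your returns policy?"))
    print(answer("Do you sell gift cards?"))
```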
6

Evaluation

› How do we prove it works — to ourselves and the client?
  • Turn Phase-1 examples into a golden set (input + ideal answer).
  • Score automatically: keyword or exact-match checks plus an LLM judge that rates correctness and groundedness (every claim is supported by the retrieved context).
  • Track a scorecard; watch cost & latency alongside quality — all three are the product.
Gate: a scorecard exists and meets the Phase-1 success criterion. No green eval, no launch.
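The automatic half of the scoring and the scorecard can be sketched as below. The LLM judge is left out because it needs a real model call; the function names, the result fields, and the choice of mean quality, mean cost, and worst-case latency are assumptions, not a prescribed scorecard format.

```python
def keyword_score(output, required_keywords):
    """Fraction of required keywords found in the output (case-insensitive)."""
    if not required_keywords:
        return 1.0
    text = output.lower()
    hits = sum(1 for kw in required_keywords if kw.lower() in text)
    return hits / len(required_keywords)

def scorecard(results, threshold):
    """Summarize per-example results into one card.

    `results` is a non-empty list of dicts with 'quality' (0..1),
    'cost_usd', and 'latency_s'. `threshold` is the Phase-1 success criterion.
    """
    n = len(results)
    card = {
        "mean_quality": sum(r["quality"] for r in results) / n,
        "mean_cost_usd": sum(r["cost_usd"] for r in results) / n,
        "max_latency_s": max(r["latency_s"] for r in results),
    }
    # Green only when quality meets the agreed criterion.
    card["green"] = card["mean_quality"] >= threshold
    return card
```

Cost and latency sit on the same card as quality so that a change which improves one at the expense of the others is visible in a single diff.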
7

Deploy, Monitor & Hand-off

› Is it live, watched, and maintainable by someone other than its author?
  • Ship behind the smallest surface that works (CLI, internal API, webhook, widget).
  • Log every request: input, retrieved context, output, model, tokens, cost, latency. Add a feedback signal.
  • Set a budget alarm. Write a one-page runbook. Schedule re-eval — knowledge bases rot.
Gate: in production, logged, budget-alarmed, documented enough to hand over.
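Request logging and a budget alarm fit in one small class. This is a sketch under assumptions: the `RequestLog` name, the in-memory list of JSON lines, and the cumulative-spend alarm are illustrative choices; a real deployment would write to a log store and send an alert instead of returning a flag. The logged fields follow the list above.

```python
import json
import time

class RequestLog:
    """Collects one JSON line per request and tracks spend against a budget."""

    def __init__(self, budget_usd):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0
        self.lines = []

    def record(self, *, user_input, context, output, model, tokens,
               cost_usd, latency_s, feedback=None):
        """Log one request; return True when cumulative spend exceeds the budget."""
        entry = {
            "ts": time.time(),
            "input": user_input,
            "context": context,
            "output": output,
            "model": model,
            "tokens": tokens,
            "cost_usd": cost_usd,
            "latency_s": latency_s,
            "feedback": feedback,  # e.g. thumbs up/down, filled in later
        }
        self.lines.append(json.dumps(entry, ensure_ascii=False))
        self.spent_usd += cost_usd
        return self.spent_usd > self.budget_usd
```

Logging the retrieved context next to the output is what lets someone other than the author later tell a retrieval failure from a model failure.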

The loop

Phases 4–6 are a loop, not a line: pick a model → build → eval → (too weak? climb a tier or add RAG; too expensive? drop a tier or shrink the prompt) → eval again. You exit when the eval is green and the cost fits the budget. That loop is the whole job.
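The exit condition and the two adjustments can be written as a single decision function. The name `next_step` and its parameters are hypothetical, and checking quality before cost is an assumption; the text does not say which to fix first when both fail.

```python
def next_step(quality, cost, threshold, budget):
    """Decide what to do after one eval round.

    quality and threshold are scores on the same scale;
    cost and budget are in the same currency unit.
    """
    if quality < threshold:
        return "too weak: climb a tier or add RAG"
    if cost > budget:
        return "too expensive: drop a tier or shrink the prompt"
    return "exit: eval is green and cost fits the budget"

if __name__ == "__main__":
    print(next_step(quality=0.82, cost=0.004, threshold=0.90, budget=0.010))
    print(next_step(quality=0.93, cost=0.020, threshold=0.90, budget=0.010))
    print(next_step(quality=0.93, cost=0.008, threshold=0.90, budget=0.010))
```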