Capstone — Ship It

Prereqs: every Applied AI module (LLM Basics, RAG & Agents, Serve & Ship, Frontier). This is the assembly project.

This is the moment Mosaic stops being a course and becomes a thing you’ve built. Eighty-seven lessons in, you have the parts: an embedding model, a vector store, a loop, structured output, an inference server, an eval harness, a safety layer. The capstone is the assembly. One weekend, one real problem, one deployed URL. A friend clones the repo and uses your agent in 30 minutes. That is the bar.

The single biggest reason this capstone fails for people is scoping. Not technical capability — every reader who got this far has the technical capability. The trap is “let me support every data source / make it beautiful / add multi-agent orchestration / train a custom embedding model.” Cut. Pick one corpus, one query type, one LLM, one tool set, one UI, one cloud. A working v0 in one weekend beats a half-built v3 in a month, and the v0 is what teaches you which optimizations were actually worth the time.

There are two non-obvious things that separate a “demo” from a “product,” and they’re the same two things that separate this capstone from a tutorial. First: eval is the bar. Twenty representative queries, rubric-graded by a judge LLM, run in CI. Build the eval before you start optimizing — without it you’ll lie to yourself about whether the system got better. Second: use it for a week. The bugs that matter are the ones that show up after you’ve answered “is the weather nice in SF” forty times and the wrong answer to “what did we decide about pricing in March” finally bites you.

TL;DR

  • The Applied AI track culminates here: build, eval, and deploy a working RAG-agent that you can use in real life. The lessons up to this point are the parts; this is the assembly.
  • Pick a real problem you have. Not a demo. Not a Hello-World. Something you would actually use weekly — your reading inbox, your meeting notes, your coding context, your research notes.
  • The full stack: ingest pipeline → chunked + indexed (hybrid retrieval) → ReAct agent with MCP tools → structured-output API → minimal UI → deployed somewhere persistent.
  • Eval is the bar. Without an eval, you have a demo. With an eval, you have a product. Build the eval before you start optimizing.
  • Most teams fail this not on technical capability but on scoping. Cut features ruthlessly. A working v0 in one weekend beats a half-built v3 in a month.

What “ship it” actually means

A real shipped project has these properties:

  1. Solves a real problem you have, not a demo.
  2. Has at least one real user, even if that user is you.
  3. Persists state: your data, your conversations, your indices live somewhere durable.
  4. Has a URL or a CLI that works without you sitting at the keyboard.
  5. Has at least 20 evaluation queries with rubric-graded answers, run automatically (see the sketch after this list).
  6. Has a README that someone else could clone and run.

Anything that meets these is shippable. Anything that doesn’t is a script.
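
Property 5 is the one people most often skip, so here is a minimal sketch of what "rubric-graded, run automatically" can look like, assuming the Anthropic Python SDK. The file name, judge model, pass threshold, and the ask_my_agent placeholder are all illustrative; wire in whatever your project actually uses.

    # eval_harness.py - minimal rubric-graded eval (sketch; names and model are illustrative)
    import json
    import anthropic

    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    JUDGE_PROMPT = """You are grading an AI assistant's answer.
    Question: {query}
    Rubric (what a good answer must contain): {rubric}
    Answer to grade: {answer}
    Reply with exactly one word: PASS or FAIL."""

    def ask_my_agent(query: str) -> str:
        """Placeholder: replace with a call to your agent or its API."""
        raise NotImplementedError

    def judge(query: str, rubric: str, answer: str) -> bool:
        msg = client.messages.create(
            model="claude-sonnet-4-5",  # assumption: use whichever judge model you prefer
            max_tokens=10,
            messages=[{"role": "user", "content": JUDGE_PROMPT.format(
                query=query, rubric=rubric, answer=answer)}],
        )
        return msg.content[0].text.strip().upper().startswith("PASS")

    if __name__ == "__main__":
        cases = [json.loads(line) for line in open("eval_cases.jsonl")]  # ~20 representative queries
        passed = sum(judge(c["query"], c["rubric"], ask_my_agent(c["query"])) for c in cases)
        print(f"{passed}/{len(cases)} passed")
        assert passed >= 0.8 * len(cases), "eval below threshold"  # makes CI fail loudly

Run it as a plain step in CI (python eval_harness.py is enough) and the "run automatically" box is ticked.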

The architecture template

Pick one ingest source, one retrieval store, one LLM, one tool set, one UI. Resist scope creep.
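
In code terms, the "one ingest source, one retrieval store" half of the template is small. Here is a sketch of the ingest pass (step 2 below), assuming sentence-transformers with a Nomic embedding checkpoint and an embedded, file-backed Qdrant store; the model name, corpus path, chunk size, and collection name are illustrative choices, not requirements.

    # ingest.py - chunk, embed, index (sketch; model, paths, and sizes are illustrative)
    from pathlib import Path

    from qdrant_client import QdrantClient
    from qdrant_client.models import Distance, PointStruct, VectorParams
    from sentence_transformers import SentenceTransformer

    CHUNK_CHARS = 1200  # naive fixed-size chunks; swap in a smarter splitter if the eval says so

    def chunk(text: str) -> list[str]:
        return [text[i:i + CHUNK_CHARS] for i in range(0, len(text), CHUNK_CHARS)]

    def main() -> None:
        # Assumption: use whichever embedding model you chose; Nomic models expect a task prefix.
        model = SentenceTransformer("nomic-ai/nomic-embed-text-v2-moe", trust_remote_code=True)
        store = QdrantClient(path="./qdrant_data")  # embedded Qdrant, persisted to disk

        docs = list(Path("corpus/").glob("**/*.md"))  # assumption: the corpus is Markdown files
        points = []
        for doc in docs:
            for piece in chunk(doc.read_text()):
                vec = model.encode("search_document: " + piece)
                points.append(PointStruct(id=len(points), vector=vec.tolist(),
                                          payload={"source": str(doc), "text": piece}))

        store.create_collection("capstone", vectors_config=VectorParams(
            size=len(points[0].vector), distance=Distance.COSINE))
        store.upsert(collection_name="capstone", points=points)
        print(f"Indexed {len(points)} chunks from {len(docs)} docs")

    if __name__ == "__main__":
        main()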

Build it — step by step

The capstone brief below walks through each step in detail; the summary:

  1. Pick the problem. Write a one-paragraph product spec. (30 min)
  2. Ingest a small corpus. 100–1000 docs. Chunk + embed with Nomic Embed v2; index in Qdrant or LanceDB. (2 hours)
  3. Build the ReAct agent. Claude or GPT, 2–3 tools (retrieve, read_doc, search_web). MCP-shaped if you want; ad-hoc is fine (see the sketch after this list). (3 hours)
  4. Wrap with structured-output endpoint. FastAPI, returns Pydantic objects. (1 hour)
  5. Build a minimal UI. Streamlit takes 30 minutes; Next.js takes 4. Pick. (1–4 hours)
  6. Deploy. Modal (one command for Python apps), Vercel (Next.js), or a cheap VPS. (1–2 hours)
  7. Build the eval harness. 20 representative queries, rubric-graded by Claude as judge. Add CI. (2 hours)
  8. Use it for a week. Find the bugs. Fix the worst ones. (a week of dogfooding)
  9. Write the README. Architecture diagram, install steps, eval results. (1 hour)

Total: one weekend (~12–16 focused hours) for v0. Iteration fills in the rest.
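
Step 3 is the part that looks magical and is actually a plain loop. A sketch of the agent core, assuming the Anthropic Python SDK and a single retrieve tool backed by the index from step 2; the tool set, model string, and turn cap are assumptions you will adjust.

    # agent.py - bare single-agent tool loop (sketch; tool names and model are assumptions)
    import anthropic

    client = anthropic.Anthropic()

    TOOLS = [{
        "name": "retrieve",
        "description": "Search the indexed corpus and return the most relevant chunks.",
        "input_schema": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    }]

    def retrieve(query: str) -> str:
        """Placeholder: query your vector store here and return the top chunks as text."""
        raise NotImplementedError

    def run_agent(question: str, max_turns: int = 8) -> str:
        messages = [{"role": "user", "content": question}]
        for _ in range(max_turns):
            resp = client.messages.create(
                model="claude-sonnet-4-5",  # assumption: whichever model you picked
                max_tokens=1024,
                tools=TOOLS,
                messages=messages,
            )
            if resp.stop_reason != "tool_use":
                return "".join(b.text for b in resp.content if b.type == "text")
            # Run every tool call the model asked for and feed the results back in.
            messages.append({"role": "assistant", "content": resp.content})
            messages.append({"role": "user", "content": [
                {"type": "tool_result", "tool_use_id": b.id, "content": retrieve(**b.input)}
                for b in resp.content if b.type == "tool_use"
            ]})
        return "Stopped: too many tool calls."

Note that the full message history in messages is the only "memory" v0 needs, which is the point the scope-creep list below makes.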

What to optimize, in order

When v0 is up, don’t optimize randomly. The order:

  1. Retrieval quality. Run the eval. If retrieval is wrong, nothing else matters. Add hybrid retrieval + a reranker if dense retrieval alone misses relevant chunks.
  2. Prompt quality. Tighten the system prompt; few-shot examples in the agent’s context.
  3. Tool design. Consolidate tools the LLM uses awkwardly; split tools that take confusing arguments.
  4. Schema / structured output. Add Pydantic field descriptions; tighten validators (see the sketch after this list).
  5. Latency. Streaming, caching, smaller models for cheap calls.
  6. Cost. Routing (small model for easy queries), prompt caching (Anthropic / OpenAI), aggressive retrieval pruning.

Skip levels at your peril. Optimizing prompts when retrieval is wrong = polishing the wheel that’s pointing the wrong way.
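
To make level 4 concrete (it is also step 4 above): a sketch of a structured-output response model with field descriptions and a validator, plus the FastAPI endpoint that returns it. The field names are illustrative, not a required schema.

    # api.py - structured-output endpoint (sketch; the schema fields are examples, not a spec)
    from fastapi import FastAPI
    from pydantic import BaseModel, Field, field_validator

    class Citation(BaseModel):
        source: str = Field(description="Path or URL of the chunk the claim came from")
        quote: str = Field(description="Verbatim excerpt that supports the answer")

    class Answer(BaseModel):
        answer: str = Field(description="Direct answer to the user's question, 2-4 sentences")
        citations: list[Citation] = Field(description="Every claim must be backed by a citation")
        confidence: float = Field(ge=0.0, le=1.0, description="Model's own confidence estimate")

        @field_validator("citations")
        @classmethod
        def must_cite(cls, v: list[Citation]) -> list[Citation]:
            if not v:
                raise ValueError("answers without citations are rejected")
            return v

    app = FastAPI()

    @app.post("/ask", response_model=Answer)
    def ask(question: str) -> Answer:
        # Call your agent here and parse its reply into Answer, e.g. via the
        # provider's structured-output / tool-calling mode. Placeholder below.
        raise NotImplementedError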

Common scope-creep traps

  • “Let me support all my data sources.” Ship one. Add others later.
  • “Let me make a beautiful UI.” Streamlit is fine for v0 (see the sketch below). The work that matters is in the agent.
  • “Let me add memory.” ReAct agent loop with full message history is enough memory for v0. Vector-DB-backed memory is v3.
  • “Let me train a custom embedding model.” Off-the-shelf is fine until you’ve validated the eval.
  • “Let me add multi-agent orchestration.” One ReAct agent. Multi-agent is v5.

The first version of every great project is embarrassing. Ship anyway.
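
"Streamlit is fine for v0" is meant literally; the whole chat front end is roughly the sketch below, assuming the /ask endpoint from step 4 is running locally (the URL and field names are assumptions).

    # ui.py - minimal chat UI over the /ask endpoint (sketch; URL and field names are assumptions)
    import requests
    import streamlit as st

    st.title("Capstone agent")

    if "history" not in st.session_state:
        st.session_state.history = []

    for role, text in st.session_state.history:
        st.chat_message(role).write(text)

    if question := st.chat_input("Ask your corpus something"):
        st.chat_message("user").write(question)
        resp = requests.post("http://localhost:8000/ask", params={"question": question}, timeout=120)
        answer = resp.json()["answer"]
        st.chat_message("assistant").write(answer)
        st.session_state.history += [("user", question), ("assistant", answer)]

Run it with streamlit run ui.py and you have a URL on localhost; deployment (step 6) is what turns that into one that works without you at the keyboard.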

Examples of successful capstones

(Compiled from learner submissions, anonymized.)

  • Notes assistant: 3 years of Markdown notes ingested into Qdrant, MCP server exposing retrieve / search; used daily via Cursor.
  • Paper triage: arXiv RSS → Anthropic summarizer → quality filter → Notion. Used daily.
  • Meeting recap bot: Otter.ai transcripts → ingest → “what did we decide about X?” agent. Replaces note-taker.
  • Code archaeologist: company codebase → embedded → LLM that answers “where is feature X implemented and who wrote it?”. Saves new-hire onboarding time.
  • Customer-support agent: company docs → RAG → ReAct agent with ticket-creation tool. Deployed to actual customers.

Each took one weekend for v0. Each is now used weekly or more.

What you walk away with

You finish this capstone with:

  1. A working AI product you actually use.
  2. An eval methodology that’s transferable to every future LLM project.
  3. A portfolio artifact more compelling than any course-completion certificate.
  4. The integrative skill of “I can take an idea and ship it.” Most engineers never develop this.

This is the bar. Hit it once, and every future LLM project starts from a position of “I’ve done this before; what’s special about this one?” instead of “where do I start?”

Key takeaways

  1. Solve a real problem you have. Not a demo. Not a Hello-World.
  2. Build v0 in one weekend. Cut features. Ship the integration.
  3. Eval is the bar. Without it, you have a demo, not a product.
  4. Optimize in order: retrieval → prompts → tools → schema → latency → cost.
  5. The README is the artifact. A working agent without a README is invisible. A README without a working agent is a lie.

Why this matters

Reading curriculum is necessary; shipping artifacts is sufficient. The capstone is what separates “I learned about agents” from “I built one and use it daily.” When recruiters / managers / collaborators ask “what have you built?”, you point at this. It is the single most useful artifact you produce in the Applied AI track, more than any individual lesson’s exercise.

Go deeper