Tutorial: How to Agentically Code a RAG System
A step-by-step guide to building RAG systems using AI agents. Learn the prompting and oversight process for creating semantic search with embeddings, PG Vector, and LLMs.
By Tom Hundley
(Ghostwritten with Grok 4.1, curated and orchestrated by Tom Hundley)
If my story above hooked you, here's the practical follow-on: a step-by-step tutorial on "agentically coding" a RAG system. That means using AI agents (like Grok, Claude, or GPT) to build iteratively via prompts while you steer with smart questions. I'll frame it around my own experience: building Sparkles (the RAG chatbot) and fixing the job tracker. Assume a stack like Next.js/React with Supabase/Postgres for the DB. Code is kept to short illustrative sketches; the focus is the prompting and oversight process. (Ask your agent for the production code.)
Step 1: Define Your Vision and Gather Knowledge
Before prompting, educate yourself (like I did). Key concepts:
- Chunking: Split documents (e.g., a 10-page resume PDF) into small pieces (say, 500 characters with 100-character overlap) so no single chunk overwhelms the embedding model's input window.
- Embedding Model: Pick one like OpenAI's text-embedding-3-small (1536 dims, API key needed) or open-source like bge-small-en (384 dims, hostable via Hugging Face).
- Vector DB: Use PG Vector in Postgres/Supabase for storing/querying embeddings.
- Similarity Metric: Cosine for direction-based matches (compares vector direction and ignores magnitude, so a short chunk and a long chunk on the same topic score alike).
- RAG Flow: Query → Embed → Retrieve top-k matches → Augment LLM prompt → Generate response.
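To make the similarity metric concrete, here's a minimal TypeScript sketch of cosine similarity. In practice PG Vector computes this for you server-side (its `<=>` operator returns cosine distance), so this is just to show why the metric ignores vector length:

```typescript
// Cosine similarity: dot product divided by the product of the two
// magnitudes. Dividing by magnitude is what makes the metric
// direction-only -- a long vector and a short vector pointing the
// same way score 1.0.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0;
  let magA = 0;
  let magB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    magA += a[i] * a[i];
    magB += b[i] * b[i];
  }
  return dot / (Math.sqrt(magA) * Math.sqrt(magB));
}

// Same direction, different lengths -> similarity 1.0.
console.log(cosineSimilarity([1, 2, 3], [2, 4, 6])); // 1
```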
Step 2: Prompt the Agent to Set Up the Basics
Start broad but specify RAG. My initial Sparkles prompt: "Build a RAG-based chatbot for my site thomashundley.com. Use semantic search on my resume data in PG Vector via Supabase."
Ask clarifying questions mid-build (agentic vibe):
- "What embedding model are you using to vectorize chunks and queries? (e.g., text-embedding-3-small)"
- "How are you chunking the input? (e.g., RecursiveCharacterTextSplitter with 500 chars and 100 overlap)"
- "Which similarity metric for comparisons? (e.g., cosine)"
Tools: Use LangChain or LlamaIndex for chunking/embedding (both ship Python and JS packages; have your agent install whichever fits your stack).
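As a reference point for those questions, here's a hedged sketch of the chunking step using LangChain's JS text splitter. (Package name and defaults are mine; your agent may reach for the Python equivalent instead. Verify against what it actually writes.)

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

// Placeholder standing in for text extracted from your resume (Step 3).
const resumeText = "...your extracted resume text...";

// 500-character chunks with 100 characters of overlap, per Step 1.
// The overlap keeps sentences that straddle a chunk boundary
// retrievable from either side.
const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,
  chunkOverlap: 100,
});

const chunks = await splitter.splitText(resumeText); // string[]
console.log(`${chunks.length} chunks`);
```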
Step 3: Handle Data Ingestion
For a PDF resume: "Extract text from my 10-page resume PDF, chunk it, embed each chunk with [model], and store in PG Vector."
Questions to ask:
- "How do you extract text? (e.g., PyPDF2)"
- "What's the chunk size/overlap? Why that?"
- "How many dims in the embeddings?"
Step 4: Implement Retrieval and Generation
Core RAG: "On query (e.g., job desc), embed it, query PG Vector for top-5 matches via cosine, stuff them into an LLM prompt (e.g., Claude Sonnet), and generate a score/response."
Questions:
- "What LLM for generation? (e.g., Anthropic's Claude, not for embeddings)"
- "How do you score? (e.g., cosine from 0-1, then LLM refines)"
- "Handle edge cases? (e.g., location restrictions in job desc)"
Step 5: Deploy and Audit
"Integrate into Next.js: API route for queries, frontend chatbot UI."
Post-deploy: "Run quality control: Score this job desc against my data. Explain why it caught/didn't catch [nuance]."
If it's keywords (plot twist!), redo: "Switch to full RAG with embeddings."
Step 6: Iterate and Scale
- Benchmark: "Compare keyword vs. RAG accuracy on 10 jobs." (A harness for this is sketched after the list.)
- Scale: "Add more data sources? Fine-tune embeddings?"
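One way that benchmark could look, as a rough TypeScript harness. The keyword baseline, the fit threshold, and the score parsing are all arbitrary choices I'm making up for illustration; the labels are judgments you make by hand.

```typescript
import { scoreJobDescription } from "./rag"; // hypothetical module holding the Step 4 helper

type LabeledJob = { description: string; isGoodFit: boolean };

// Naive keyword baseline: count shared words longer than 4 characters.
function keywordMatch(jobDesc: string, resumeText: string): boolean {
  const resumeWords = new Set(resumeText.toLowerCase().split(/\W+/));
  const hits = jobDesc.toLowerCase().split(/\W+/)
    .filter((w) => w.length > 4 && resumeWords.has(w));
  return hits.length >= 10; // arbitrary threshold -- that's the point
}

async function benchmark(jobs: LabeledJob[], resumeText: string) {
  let keywordCorrect = 0;
  let ragCorrect = 0;
  for (const job of jobs) {
    if (keywordMatch(job.description, resumeText) === job.isGoodFit) keywordCorrect++;
    const answer = await scoreJobDescription(job.description);
    // Crude parse: treat any mentioned score of 70+ as "good fit".
    const ragSaysFit = /\b(7\d|8\d|9\d|100)\b/.test(answer);
    if (ragSaysFit === job.isGoodFit) ragCorrect++;
  }
  console.log(`keyword: ${keywordCorrect}/${jobs.length}, RAG: ${ragCorrect}/${jobs.length}`);
}
```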
Ready to try? Prompt an AI with this tutorial as a base. Hit me up if you build something cool.