Building a Production RAG Chatbot with Supabase pgvector and GPT-4o-mini

Retrieval-Augmented Generation (RAG) is the most reliable way to make a large language model answer questions about your data instead of hallucinating. I built the AI assistant on this site — Imli, the chat widget in the corner — as a full production RAG pipeline, and this post walks through exactly how it works and what I learned shipping it.

The architecture at a glance

The pipeline has four moving parts:

A knowledge base — a single, well-structured markdown file containing everything the bot should know: bio, skills, project case studies, and an FAQ.
An ingestion script — chunks the markdown, generates OpenAI embeddings, and stores them in Postgres.
A vector store — Supabase Postgres with the pgvector extension and a cosine-similarity search function.
A serving endpoint — embeds the user's question, retrieves the most relevant chunks, injects them into the system prompt, and streams the model's answer token by token.

No vector database SaaS, no orchestration framework sprawl. Postgres handles vector search comfortably at this scale, and Supabase gives you pgvector with one SQL statement.

Chunking: smaller than you think

I use a recursive character splitter with a chunk size of 1,000 characters and 200 characters of overlap, splitting on paragraph boundaries first. Two practical rules I follow:

Structure the source document for retrieval. Headings, short paragraphs, and self-contained sections produce chunks that make sense in isolation. A chunk that starts mid-sentence retrieves poorly and reads worse in a prompt.
Overlap matters more than size. The 200-character overlap means a fact that straddles a boundary still lands intact in at least one chunk.

Embeddings and the dimension trap

I embed with OpenAI's text-embedding-3-small at 768 dimensions — the model supports dimension reduction, and 768 is plenty for a few hundred chunks while halving storage and search cost versus the default 1,536.

Here's the trap: the dimension is baked into your table schema (vector(768)), your ingestion call, and your query-time embedding call. If any one of them disagrees, you get either a hard insert error or — much worse — silently broken similarity scores. Pin the dimension in one shared constant if you can.

Similarity search in plain SQL

The search function is a single SQL function using pgvector's cosine distance operator:

create or replace function match_documents (
  query_embedding vector(768),
  match_count int default 3
) returns table (id bigint, content text, similarity float)
language plpgsql as $$
begin
  return query
  select documents.id, documents.content,
         1 - (documents.embedding <=> query_embedding) as similarity
  from documents
  order by documents.embedding <=> query_embedding
  limit match_count;
end;
$$;

One tuning note: I started with k = 3 retrieved chunks and found the bot answering "I'm not sure" to questions the knowledge base could answer — the right chunk was ranked 4th or 5th for some phrasings. Bumping to k = 5 fixed it with no noticeable quality loss. Retrieval count is a knob worth testing with real queries, not guessing.

The RLS gotcha that silently breaks re-ingestion

This one cost me real debugging time. Supabase enables Row-Level Security, and my ingestion script deleted all existing rows before re-inserting fresh embeddings — using the anon key. RLS had policies for select and insert but not delete.

Postgres doesn't error in that case. The delete silently affects zero rows, and every re-ingestion appends duplicates instead of replacing content. Your bot starts retrieving three copies of the same paragraph and crowding out everything else. If your RAG answers degrade after content updates, check your delete policy first.

Streaming and tool calling

The serving endpoint streams tokens with the Vercel AI SDK's streamText, which also gives the assistant tools. Imli has one: sendMessageToSujal, which delivers a visitor's message to my Telegram via the Bot API. The model decides when to use it — after collecting the message and contact details conversationally. That single tool turned a Q&A widget into a lead-capture assistant.

Every visitor question also fires a Telegram notification, so I know what people actually ask. That feedback loop drives knowledge-base updates: if the bot says "I'm not sure" twice about the same topic, that topic gets a section in the markdown file.

What I'd tell you to do differently

Keep the knowledge base in one version-controlled markdown file. Editing is trivial, diffs are reviewable, and re-ingestion is one command.
Log the retrieved chunks during development. Most "the AI is wrong" bugs are actually "the retrieval is wrong" bugs.
Write the system prompt to refuse gracefully and offer an escalation path ("I can pass a message to Sujal") instead of letting the model guess.

The whole stack — Supabase, pgvector, text-embedding-3-small, GPT-4o-mini, streaming — costs almost nothing at portfolio scale and the same architecture scales cleanly to real document corpora. If you're building something similar and get stuck, ask Imli to put you in touch.