
Building RAG with Postgres

A step-by-step guide to building a RAG system with Postgres

Introduction

Postgres is a powerful tool for implementing Retrieval-Augmented Generation (RAG) systems. Its versatility and robustness make it an excellent choice for this task. By diving deep into a technology you’re already familiar with, you can experience a significant productivity boost. As the saying goes, “stick with the tools you know.” Using Postgres for RAG allows you to reason about the system more easily, cutting through the hype and focusing on building something great.

Now, let’s explore how we can build a RAG system using Postgres. We’ll go through each component step by step, from data ingestion to response generation, and see how Postgres fits into the overall architecture.

Architecture

Let’s break down the various components we’ll use:

  1. Ingestion pipeline
  2. Postgres database
  3. RAG pipeline
  4. API
[Architecture diagram: the ingestion pipeline takes raw text from a source, cleans it, chunks it, embeds the chunks, and indexes the documents and vectors into Postgres. A user query comes in through the API endpoint to the RAG pipeline, which runs a hybrid search against Postgres, reranks the relevant chunks, and passes the top chunks to the LLM for response generation before the response is returned.]

As we can see in the architecture, Postgres is the central piece since we’ll be both writing to and reading from it frequently. Let’s start by setting that up.

Storage

The basic unit we’ll be working with is a document. Each document will have multiple chunks, which we’ll discuss in more detail later.

from dataclasses import dataclass

@dataclass
class Document:
    id: str
    title: str
    source: DocumentSource
    meta: dict  # arbitrary fields to filter on
    ...  # other fields you'll want for a good user experience or query filtering

CREATE TABLE documents (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    title TEXT NOT NULL,
    source TEXT,
    meta JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);

You’ll want to query against meta properties like dates, sources, users, etc. Early in development, I’d advise keeping this as a jsonb field in Postgres, as you might want to add or remove different fields. Since we’re working with Postgres, querying JSON data won’t be an issue. If it becomes one, we can simply add an index. Later, we can break out frequently used meta filters into a separate column or even a separate table. There are plenty of tricks we can apply to handle these issues when they occur, but not now.
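
For example, filtering on a JSONB key and, when those lookups get slow, adding a GIN index might look something like this (the keys and values are just illustrative):

-- filter documents by a meta property
SELECT id, title
FROM documents
WHERE meta @> '{"source_type": "wiki"}'
  AND created_at >= current_date - interval '30 days';

-- if these lookups become slow, a GIN index speeds up containment queries (@>)
CREATE INDEX IF NOT EXISTS idx_documents_meta ON documents USING GIN (meta);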

When we receive a request to run the RAG pipeline, we’ll want to provide only the relevant context. Depending on your source data, you might not need chunks. But in most cases, you do, so here’s a rough model of what it can look like:

@dataclass
class DocumentChunk:
    id: str
    document_id: str
    text: str
    embedding: list[float]

To get the embedding functionality, we’ll need to install pgvector (or pgvectorscale 👀):

CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE document_chunks (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    document_id UUID NOT NULL,
    content TEXT,
    fts TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED,
    embedding VECTOR(1536),
    FOREIGN KEY (document_id) REFERENCES documents(id) ON DELETE CASCADE
);

With this setup, we have a one-to-many relationship between the document and its chunks. I really want to emphasize the importance of meta filters, as they will be crucial when retrieving chunks.
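
Concretely, retrieving chunks usually means joining back to the documents table so the document-level fields (title, meta) are available for filtering and for presenting sources; a minimal sketch:

SELECT c.id, c.content, d.title, d.meta
FROM document_chunks c
JOIN documents d ON d.id = c.document_id
WHERE d.meta @> '{"source_type": "wiki"}';  -- document-level meta filter applied to chunk retrieval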

Now we have the basics in place for building the ingestion pipeline.

Ingestion Pipeline

Here’s a rough outline of what our ingestion pipeline will look like:

def ingest():
    raw_data = extract_data_from_source()
    clean_data = process_and_clean_data(raw_data)
    chunks = chunk_data(clean_data)
    chunk_embeddings = embed_chunks(chunks)
    document_chunks = zip(chunks, chunk_embeddings)
    insert_document_chunks(document_chunks)

A lot of the important work will happen in the process_and_clean_data function, as this is where we’ll optimize our data for reading. The specifics depend on your use case, but that’s a bit out of scope for this post.

Instead, we’ll focus on the chunk_data and embed_chunks functions. Many people might reach for a tool like LangChain or LlamaIndex for this. Personally, I prefer to be in control of these nitty-gritty details as they can significantly influence your system’s performance.

Chunking

There are many different ways to chunk data, but in 80% of cases, I use either a document-specific chunker or a recursive chunker. Greg has a great resource for different types of chunking/splitting (github.com). He also created this website (chunkviz.up.railway.app), which makes it easy to understand how chunking works.

It’s often beneficial to overlap your chunks to mitigate semantic gaps. My go-to settings are usually a chunk size of 256-304 tokens with 32 tokens of overlap. I’ve found this to work quite well, but your mileage may vary.

It’s worth noting that these are starting points and may need optimization. It’s crucial that you examine your chunks (and data in general) to understand if your chunking strategy makes sense for your data.
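
As a rough sketch of what a simple overlapping chunker can look like (whitespace-separated words are a crude stand-in for real tokenizer tokens):

def chunk_text(text: str, chunk_size: int = 256, overlap: int = 32) -> list[str]:
    # split on whitespace as a crude stand-in for a real tokenizer
    tokens = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        window = tokens[start : start + chunk_size]
        if window:
            chunks.append(" ".join(window))
    return chunks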

Embedding

This step is straightforward. You simply need to embed your chunks with an embedding model of your choice. However, this is also an interesting step in the pipeline as you can optimize it using techniques such as contextualization, HyDE, and fine-tuning embeddings.
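
A minimal sketch, assuming OpenAI's text-embedding-3-small (which happens to match the VECTOR(1536) column above); swap in whichever embedding model you actually use:

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed_chunks(chunks: list[str]) -> list[list[float]]:
    # one batched embeddings call; each vector has 1536 dimensions for this model
    response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [item.embedding for item in response.data]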

Indexing

The final step is to insert the data. Nothing fancy here, just a regular INSERT INTO statement. As mentioned earlier, you might want to optimize certain steps in the pipeline, which also means you’ll need to reindex documents. Keeping a reference to the source document is useful so you know which documents and chunks you should reindex.
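
For completeness, the insert plus the indexes you'll eventually want for the two search paths might look like this (HNSW requires pgvector 0.5+, and the parameters are just sensible defaults):

INSERT INTO document_chunks (document_id, content, embedding)
VALUES ($1, $2, $3);

-- approximate nearest neighbour search over embeddings
CREATE INDEX IF NOT EXISTS idx_chunks_embedding
    ON document_chunks USING hnsw (embedding vector_cosine_ops);

-- full-text search over the generated tsvector column
CREATE INDEX IF NOT EXISTS idx_chunks_fts
    ON document_chunks USING GIN (fts);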

RAG Pipeline

Retrieval

def retrieve(user_text: str) -> list[DocumentChunk]:
    query = extract_query(user_text)  # rewrite and extract keywords
    chunks = search(query)  # hybrid search with Postgres
    chunks = rerank(chunks)  # rerank your chunks to get the most relevant ones
    return chunks

I’ve written about using Postgres as a search engine in a different post, so I won’t go into depth here. The main difference is the tables we’re using and that we’ll be retrieving chunks instead of entire documents.
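
For reference, one common way to combine the two signals is reciprocal rank fusion; a rough sketch against the chunks table, where $1 is the query embedding, $2 is the rewritten query text, and the constants (limits, k=60) are typical defaults:

WITH semantic AS (
    SELECT id, RANK() OVER (ORDER BY embedding <=> $1) AS rank
    FROM document_chunks
    ORDER BY embedding <=> $1
    LIMIT 20
),
keyword AS (
    SELECT id, RANK() OVER (ORDER BY ts_rank_cd(fts, websearch_to_tsquery('english', $2)) DESC) AS rank
    FROM document_chunks
    WHERE fts @@ websearch_to_tsquery('english', $2)
    LIMIT 20
)
SELECT c.id, c.content,
       COALESCE(1.0 / (60 + semantic.rank), 0.0) +
       COALESCE(1.0 / (60 + keyword.rank), 0.0) AS score
FROM document_chunks c
LEFT JOIN semantic ON semantic.id = c.id
LEFT JOIN keyword ON keyword.id = c.id
WHERE semantic.id IS NOT NULL OR keyword.id IS NOT NULL
ORDER BY score DESC
LIMIT 10;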

Query rewrite

There are several techniques you can use for rewriting your user input. Prompts along the lines of:

Extract a search query from this text
<user_text>{user_text}</user_text>

or

Extract a very short search query and keywords from this text
<user_text>{user_text}</user_text>

You’ll want to adapt this to your specific use case. If it’s keyword-heavy, you might want to focus more on that aspect. If not, you might want to focus on similarity and optimize for embedding search.
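
In code, extract_query can be a single LLM call with one of the prompts above; a sketch, where llm_complete is a hypothetical stand-in for whatever LLM client you use:

def extract_query(user_text: str) -> str:
    prompt = (
        "Extract a very short search query and keywords from this text\n"
        f"<user_text>{user_text}</user_text>"
    )
    # llm_complete: hypothetical helper wrapping your LLM client
    return llm_complete(prompt).strip()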

Filter

I want to highlight that there are many different ways to find relevant chunks. On top of text search, you should apply filtering. Remember that metadata we talked about? Here are some examples:

  • Time filters: “What did you get done this week?” → WHERE updated_at >= current_date - interval '7 days'
  • Meta filters: “How many people live in Europe?” → WHERE meta->>'continent' = 'europe'

This essentially becomes a text-to-SQL problem, which, it turns out, LLMs are pretty good at too.
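
In practice, this can be as constrained as letting the LLM pick from a small set of allowed filters that you then append to the search query; a sketch of the resulting query (the filter values are made up, and created_at comes from the documents table above):

SELECT c.id, c.content
FROM document_chunks c
JOIN documents d ON d.id = c.document_id
WHERE d.created_at >= current_date - interval '7 days'  -- time filter
  AND d.meta @> '{"continent": "europe"}'               -- meta filter
ORDER BY c.embedding <=> $1                             -- $1: query embedding
LIMIT 10;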

Reranking

Now that you have your chunks, you’ll want to run them through a reranker. You probably just want to use Cohere’s reranker. It’s easy to set up, relatively cheap, and most importantly, effective.
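
A sketch using Cohere's Python SDK (the model name and top_n are illustrative, and note that the query needs to be passed along to this step):

import os
import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query: str, chunks: list[DocumentChunk], top_n: int = 5) -> list[DocumentChunk]:
    response = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=[chunk.text for chunk in chunks],
        top_n=top_n,
    )
    # map the reranked result indices back to the original chunk objects
    return [chunks[result.index] for result in response.results]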

Response Generation

You have the context, now it’s time to put it to use. In essence, this is your function:

system_prompt = "You're an expert at answering questions about Postgres. Always use the documents provided in the context."
user_prompt_template = """
Here are some documents. Use them to answer my question.
Always refer to which document you used to answer my question by document id
<documents>
{context}
</documents>

Answer my question:
<question>{question}</question>
"""

def serialize_chunks(chunks: list[DocumentChunk]) -> str:
    context = ""
    for index, chunk in enumerate(chunks):
        context += f"\n<document><id>{index}</id><content>{chunk.text}</content></document>"
    return context

def generate_answer(user_text: str, chunks: list[DocumentChunk]) -> str:
    context = serialize_chunks(chunks)
    user_prompt = user_prompt_template.format(question=user_text, context=context)
    # call_llm is a stand-in for your LLM client's chat completion call
    answer = call_llm(system=system_prompt, messages=[{"role": "user", "content": user_prompt}])
    return answer

There are, of course, many different things you can do here depending on your needs, such as merging chunks, tool calling, source citing, streaming, etc.

Evaluation

As you’ve probably noticed by now, there are many components in this system that can be tweaked. The higher up in the chain you change something, the more downstream effects it can have. This can be both pleasant and unpleasant. That’s why you really need evaluation in place to see if you’re actually improving or if you’re just changing things (with good intentions) but with unintended side effects.

Try to evaluate your components in isolation. How well is your chunking working? How is your retrieval performing? What happens if you change the FTS or replace it with pg_search (BM25)? If you can answer these questions independently, you’ll not only improve the context but also sleep better knowing there won’t be a whack-a-mole game of unexpected consequences.
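
A tiny example of evaluating retrieval in isolation, assuming you've hand-labelled a small set of queries with the chunk ids that should be returned:

def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int = 10) -> float:
    # fraction of the known-relevant chunks that appear in the top-k results
    if not relevant_ids:
        return 0.0
    hits = sum(1 for chunk_id in retrieved_ids[:k] if chunk_id in relevant_ids)
    return hits / len(relevant_ids)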

There are plenty of good resources on this topic; I can recommend Forest Friends (forestfriends.tech).

Feedback

If you can build a feedback loop into your application, that will help tremendously. You’ll get real data from real users. This can be in the form of 👍/👎 ratings, a preset of questions, or just free text. Be careful not to extrapolate too much from this and always put the feedback in context (what did the user ask, what response did they get).
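
A minimal way to store that feedback next to the context it was given (the schema is just a suggestion):

CREATE TABLE feedback (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    user_query TEXT NOT NULL,
    response TEXT NOT NULL,
    rating SMALLINT,   -- e.g. 1 for 👍, -1 for 👎
    comment TEXT,      -- optional free text
    created_at TIMESTAMP WITH TIME ZONE DEFAULT CURRENT_TIMESTAMP
);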

Observability

Log everything. Make sure to capture all the relevant pieces of information; you’ll want to understand how things are working.

Implement tracing of your calls. You’ll thank yourself later. It doesn’t have to be fancy; there are plenty of good options. Even though OpenTelemetry (OTEL) can be quite annoying to set up, it usually provides a lot of value.
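
For illustration, tracing the retrieval step with OpenTelemetry's Python API can be as small as this (exporter and SDK setup omitted):

from opentelemetry import trace

tracer = trace.get_tracer("rag-pipeline")

def traced_retrieve(user_text: str) -> list[DocumentChunk]:
    with tracer.start_as_current_span("retrieve") as span:
        span.set_attribute("query.length", len(user_text))
        return retrieve(user_text)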

Some argue that wide events are all you need (isburmistrov.substack.com), and I think there's a lot of truth in that. It's simple and efficient in the beginning, and that's exactly what you want. You can optimize later when you need to.

Next Steps

This is the first implementation of a RAG system. Next, you’ll likely want to consider:

  • Multi-turn messages: Would the user benefit from a back-and-forth conversation?
  • Query expansion: Can we break out a user query into multiple search queries, then assemble the results?
  • Query intent classification: Is the user asking a question or searching for something? This could affect how we rewrite or weigh our hybrid search.
  • More evaluations for retrieval & search: Consider metrics like recall, precision, and maybe NDCG.
  • Look at your data: Yes, this will help you build intuition about it.
  • Implement tool calling: You might have an API you need to integrate. It doesn’t always make sense to index everything yourself. Or you might need to generate and run some code on-demand.
  • Build reasoning & planning: You might be building a more advanced system where you need multiple steps.
  • Clean & preprocess your data: Can you use GAR or HyDE? Remember, garbage in, garbage out 🤷

A bit off topic, but I had to include it.

Recently I listened to Doug Turnbull talk about Generative AI-Augmented Retrieval (GAR), a technique to “make boring-old search better”. This approach focuses on how you write your data and how you can optimize it. That means you’re moving some of the challenges to the write stage, where you can usually handle more things asynchronously and optimize in different ways, rather than in the synchronous retrieval stage.

If you’re interested in learning more about GAR, Doug has some excellent resources available. You can find some amazing slides and notebooks on his blog (softwaredoug.com).

Conclusion

To wrap up, Postgres is also a great tool for RAG. Here are some takeaways to keep in mind when building:

  • Understand your specific use case and data, and don’t forget to look at it
  • Evaluate and refine each component independently and together
  • Balance performance and accuracy. If it’s too slow, users get impatient. If it’s not accurate, users lose trust.
  • Make sure you have a feedback loop and don’t be afraid to change course
  • Maintain observability so you can understand what’s happening

Remember, building an effective RAG system is an iterative process. Don’t be afraid to experiment, measure, and refine your approach as you gain more insights into your specific use case and data :)