Layman's Guide to Computing

Season 14

Issue 176: Retrieval-Augmented Generation (RAG)

Published:

Previously: LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the LLM in the next input.

We described hallucination (citing non-existent publications, stories, or facts as though they were real) as one of the pitfalls of GPT-3, and noted how reinforcement learning from human feedback (RLHF) helps to combat some of these tendencies in general use.

These days, ChatGPT, Claude, and other chatbots also allow you to upload documents. The runtime supporting these chatbots extracts text from the documents, along with supporting context, and includes it in the system prompt. This lets the chatbot answer from the document’s contents, which helps combat hallucination.

In some cases, a single document may be too large to fit in the system prompt. In other cases, a company may have a large set of documents the LLM should answer from, but together they are too large to include in the system prompt.

In such cases, retrieval-augmented generation (RAG) provides an alternative way to inject relevant information into the LLM’s system prompt.

Retrieval-Augmented Generation (RAG)

Like other LLM capabilities, this one comes from the runtime. The LLM plays no part in this and has no control over the process.

The source documents are split into chunks, and each chunk is converted into an embedding: a list of numbers that represents its content. Chunks that are closely related have embeddings that sit closer together.
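To make chunking and embedding concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: chunk_text, toy_embed, and cosine_similarity are hypothetical names, the sample text is made up, and the "embedding" is just a word count. A real system would use a trained embedding model, which captures meaning rather than shared words.

```python
import math
import re
from collections import Counter


def chunk_text(text: str, max_words: int = 8) -> list[str]:
    """Split text into consecutive chunks of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Return a closeness score: 1.0 for identical direction, 0.0 for no shared words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up sample document.
document = (
    "Refunds are issued within 14 days of purchase. "
    "Shipping takes 3 to 5 business days within the country."
)
chunks = chunk_text(document, max_words=8)

# The "index" pairs every chunk with its embedding, ready for searching later.
index = [(chunk, toy_embed(chunk)) for chunk in chunks]

# Related texts score higher than unrelated ones.
related = cosine_similarity(
    toy_embed("refunds within 14 days"), toy_embed("refunds are issued within 14 days")
)
unrelated = cosine_similarity(
    toy_embed("refunds within 14 days"), toy_embed("shipping takes 3 days")
)
```

Splitting on a fixed word count is the crudest possible choice. It can cut a sentence in half, which is one reason real systems tend to chunk by paragraph or section instead.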

Before the user’s input is passed to the LLM, the runtime converts it into an embedding as well. This embedding is used to retrieve the document chunks whose embeddings lie closest to it; other information may be used to determine relevant portions as well.
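The retrieval step can be sketched as follows. This is an illustration under stated assumptions, not how any particular product does it: retrieve, toy_embed, and cosine_similarity are hypothetical names, the chunks are made up, and the word-count "embedding" stands in for a real embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Toy stand-in for an embedding model (assumption): lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Closeness of two embeddings, from 0.0 (nothing shared) to 1.0."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[tuple[float, str]]:
    """Embed the query, score every chunk against it, and keep the top_k closest."""
    query_vec = toy_embed(query)
    scored = [(cosine_similarity(query_vec, toy_embed(chunk)), chunk) for chunk in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Made-up knowledge base chunks.
chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Shipping takes 3 to 5 business days.",
    "Our office is closed on public holidays.",
]

# The runtime runs this before the LLM sees anything.
results = retrieve("How many days until refunds are issued?", chunks, top_k=2)
best_score, best_chunk = results[0]
```

This sketch re-embeds every chunk on every query to stay short. In practice the chunk embeddings would be computed once and stored, so that only the query is embedded at request time.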

Instead of entire documents, only these relevant portions are included in the system prompt for the LLM to use when answering the user’s query. In more advanced implementations, the retrieved chunks may be further re-ranked by importance and other criteria.
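Here is a rough sketch of re-ranking and prompt assembly. The names rerank and build_system_prompt, the priority numbers, the blending weight, and the character budget are all assumptions made up for this example; real runtimes use their own criteria and usually count tokens rather than characters.

```python
def rerank(candidates, priority_weight=0.3):
    """Re-order (similarity, priority, text) candidates by a blended score.

    Blending similarity with a hand-assigned priority is a hypothetical
    criterion, used here only to show that re-ranking can change the order.
    """
    return sorted(
        candidates,
        key=lambda c: (1 - priority_weight) * c[0] + priority_weight * c[1],
        reverse=True,
    )


def build_system_prompt(ranked, max_chars=200):
    """Add chunks in ranked order until the character budget is used up."""
    header = (
        "Answer using only the excerpts below. "
        "If they do not contain the answer, say so.\n"
    )
    parts = [header]
    used = len(header)
    for _, _, text in ranked:
        entry = f"- {text}\n"
        if used + len(entry) > max_chars:
            break  # budget exhausted: lower-ranked chunks are left out
        parts.append(entry)
        used += len(entry)
    return "".join(parts)


# Made-up retrieval results: (similarity, priority, chunk text).
candidates = [
    (0.53, 0.2, "Refunds are issued within 14 days of purchase."),
    (0.50, 1.0, "Refunds for sale items are issued as store credit."),
    (0.14, 0.5, "Shipping takes 3 to 5 business days."),
]

ranked = rerank(candidates)
prompt = build_system_prompt(ranked, max_chars=200)
```

With these made-up numbers, the high-priority chunk about sale items moves ahead of the chunk with the slightly higher similarity score, and the shipping chunk is dropped because it no longer fits the budget.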

All of this happens in the runtime, beyond the LLM’s token generation loop.

Limitations

When it works well, it works really well: the LLM doesn’t hallucinate, quotes from the source, and if the source is well-tagged, can even cite the correct page and paragraph.

But RAG can go wrong too. If no matching chunks are found and the LLM isn’t told, it may hallucinate unless the runtime handles this well. At the opposite end of the spectrum, the search may return too many results, with no good way to select the most relevant ones. The documents themselves may be contradictory, incomplete, or require too much unwritten context. And lastly, the answer may miss important nuance found elsewhere in the document, or in other documents, that did not surface in the embedding search.
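One way a runtime might handle the "no matches" and "too many matches" cases is with a score threshold and a cap on the number of chunks. The sketch below assumes the search has already produced (score, text) pairs; the names select_context and context_for_prompt, the threshold of 0.25, the cap of 3, and the wording of the note are all hypothetical.

```python
NO_MATCH_NOTE = (
    "No relevant excerpts were found in the knowledge base. "
    "Tell the user you could not find an answer instead of guessing."
)


def select_context(scored_chunks, min_score=0.25, top_k=3):
    """Keep at most top_k chunks whose similarity score is at least min_score."""
    relevant = [(score, text) for score, text in scored_chunks if score >= min_score]
    relevant.sort(key=lambda pair: pair[0], reverse=True)
    return relevant[:top_k]


def context_for_prompt(scored_chunks, min_score=0.25, top_k=3):
    """Return the text to inject, or an explicit note when nothing qualifies."""
    selected = select_context(scored_chunks, min_score, top_k)
    if not selected:
        # Telling the LLM that the search came up empty gives it a chance
        # to say so, rather than answering from nothing.
        return NO_MATCH_NOTE
    return "\n".join(f"- {text}" for _, text in selected)


# Made-up search results for two different queries.
weak_results = [(0.10, "Shipping takes 3 to 5 business days."), (0.05, "Office hours.")]
many_results = [(0.9, "A"), (0.8, "B"), (0.7, "C"), (0.6, "D"), (0.1, "E")]

empty_case = context_for_prompt(weak_results)
crowded_case = context_for_prompt(many_results)
```

A threshold like this is a blunt instrument: set it too high and good chunks are discarded, too low and weak matches slip through. It does nothing for contradictory or incomplete source documents.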

Alternatives

Still, in cases where you can’t fit entire source documents in the LLM context, what alternatives do you have?

Then it’s back to giving your LLM a set of tools to search the company knowledge base, read documents, and extract the relevant portions itself. Naturally, your LLM will need to be trained on a dataset of positive examples of tool usage (Issue 175). In contrast to RAG, where retrieval is automatic and built into the runtime, here you are relying on the LLM’s judgement of which tool to use, and when to use it.
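As a sketch of the tool-based approach, the runtime below exposes two tools and executes whichever one the LLM asks for. The tool names (search_knowledge_base, read_document), the JSON shape of a tool call, and the tiny in-memory knowledge base are all assumptions for illustration; real chatbot platforms define their own formats. The LLM itself is not shown: the two tool calls at the bottom are written by hand to stand in for what a model might generate.

```python
import json

# Made-up in-memory knowledge base: document ID -> text.
KNOWLEDGE_BASE = {
    "refund-policy": "Refunds are issued within 14 days of purchase.",
    "shipping-policy": "Shipping takes 3 to 5 business days.",
}


def search_knowledge_base(query: str) -> list[str]:
    """Return IDs of documents sharing any word with the query (naive keyword match)."""
    words = set(query.lower().split())
    return [
        doc_id
        for doc_id, text in KNOWLEDGE_BASE.items()
        if words & set(text.lower().replace(".", "").split())
    ]


def read_document(doc_id: str) -> str:
    """Return the full text of one document."""
    return KNOWLEDGE_BASE.get(doc_id, "Document not found.")


# Hypothetical tool registry: the names the LLM is told about in its system prompt.
TOOLS = {
    "search_knowledge_base": search_knowledge_base,
    "read_document": read_document,
}


def execute_tool_call(raw_call: str) -> str:
    """Run one tool call emitted by the LLM and return the result as JSON text."""
    call = json.loads(raw_call)
    tool = TOOLS.get(call["name"])
    if tool is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    return json.dumps({"result": tool(**call["arguments"])})


# Hand-written stand-ins for tool calls a model might generate.
search_result = execute_tool_call(
    '{"name": "search_knowledge_base", "arguments": {"query": "refunds"}}'
)
read_result = execute_tool_call(
    '{"name": "read_document", "arguments": {"doc_id": "refund-policy"}}'
)
```

Notice that nothing here decides when to search or which document to read. Those decisions sit with the LLM, which is exactly the trade-off against RAG.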

Issue summary: In retrieval-augmented generation (RAG), the runtime performs a search with the user’s request to retrieve relevant chunks from a set of documents in a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the LLM’s input. One alternative to RAG, where information lookup happens outside of LLM generation, is to provide the LLM with search tools instead, and rely on its judgement to use them well.


Okay, that’s RAG de-mystified. It’s a program that runs a search on the user’s request and injects relevant chunks from the knowledge base into the LLM’s input, beyond the LLM’s control. Now you can speak about RAG a little more informatively.

I avoided discussing RAG’s performance, because results vary. For every detractor you can also find a supporter! Is it going to work well for you? You probably have to try it yourself, or find a consultant who can better advise you.

What I’ll be covering next

Next issue: Issue 177: Multimodal models

Many chatbot models accept images and even audio alongside text. How does this work? I’ll de-mystify it in the next issue, so stay tuned!