Layman's Guide to Computing

Season 14

Issue 175: LLM tools

Published:

Previously: Through reinforcement learning with human feedback (RLHF), the LLM is trained on labelled data until it can reliably follow instructions, avoid harmful output, and follow other desired behavior. A system prompt provides guidelines for output. The user’s prompt is inserted into a templated prompt and passed to the LLM, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive chatbot.

A chatbot is fun to use for a while, but if all it could do was talk we wouldn’t use it for very long. For starters, it would hallucinate a lot, or give outdated information, because it couldn’t access the internet or do a web search. What would it take for models to be able to use computers to do that?

While this problem was being actively worked on, LLMs were also being trained to generate programming code. It turned out that code, being text-based, was fertile training ground for LLMs. They were improving at it too; while early versions still failed at producing large yet coherent programs, many were able to generate boilerplate code with correct syntax already.

LLMs as tool-using models

For a LLM to use a tool, it needs to be trained to:

for example, to use a weather tool, the LLM needs to be able to:

It turns out that the first two problems are already known and solved: the first text-based interfaces were invented in the 1970s after all, and programmers have always needed a way to invoke programs through a text-based interface. They already had one, in the form of the command line (Issue 15) and the rich syntax that was already built around it. And they had another, in the form of the function call syntax that almost all programming languages had standardized on, like check_weather(location="Singapore, SG", show_temperature=True, show_humidity=True). And training data already existed for both of these, in the form of open-source code readily available online in code repositories (Issue 19).

The structure of a tool-using LLM

For a LLM to be able to output tool calls, you need:

In the system prompt, you would include:

[...]

## Tools available

- `check_weather(location: text, show_temperature: boolean, show_humidity: boolean)
  Check the weather at the given location. Example: "Singapore, SG"
  Pass show_temperature=True and show_humidity=True if temperature and humidity are required in the output
- ...
- ...

Providing a rich set of tools without using up too many tokens is a tricky design balance that requires regular tweaking. In any case, the model is then trained to output the tool calls in a specially marked section of their output.

Invoking the tools

At the point when the model outputs the stop token and the program stops using it to calculate more output tokens, its involvement stops. The program interprets the model’s output, separating the tool calls out, and passes them to another system.

You see, tool calls can be pretty dangerous, especially if they enable the model to carry out destructive actions. A shell command like rm -rf / on Linux or Mac could delete the entire operating system, or important subdirectories. A delete_database tool could do what it says, but with the wrong target specified. So it’s common to have a system that examines the tool call and attempts to determine if it is safe. In a code assistant, this tool call might be shown to the user for explicit approval. In a web-based chatbot like ChatGPT, tool safety is usually handled by another system instead.

Once validated, the tool needs to be executed on a computer system. This computer system needs to have the necessary programs installed. It should also be isolated against potentially destructive actions. We’ve covered how containerization (Issue 149) enables this to be done; an isolated container for each session where necessary.

Finally, the result of the tool call, whether success or failure, is captured and then added to the token sequence which is fed back into the LLM.

This all sounds pretty neat, but with one caveat: only the chatbot provider (OpenAI for ChatGPT, or Anthropic for Claude) can pass these tools to the LLM. Third-party integrations, such as with GitHub or Google Drive, would be tricky for OpenAI/Anthropic to design on their own, yet unsafe for external parties to inject into the system prompt.

Integrating third-party tools

So in Nov 2024, Anthropic proposed another standard: the Model Context Protocol, a way for external parties to specify a set of tools that work together to enable access to other web-based or software-based systems.

When the user registers a MCP server through a graphical or text-based interface, the system reads the tool specifications from the MCP server, injects them into the system prompt, and from there they work like other tools accessible to the LLM.

The runtime

Notice that none of this is mediated or controlled by the LLM. It follows instructions, generates tool calls with the correct syntax in its output, then sees the result in the next input, seemingly by magic. The LLM is operating in a virtualized environment controlled by an external system that doesn’t have a standardized name yet. For now we’ll call it the runtime.

Tools and toolsets make or break a LLM-based assistant. They are the only way a LLM can take actions, get data, and otherwise make sense of the external world. A LLM without any tools is analogous to a human in a sensory deprivation tank—without information from the outside world, even human beings quickly begin to hallucinate.


Issue summary: LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the LLM in the next input.


From here it’s another 3 issues before we get to the topic of the year: AI agents. Before I get there I want to cover three more buzzphrases: retrieval-augmented generation (RAG), multimodal models, and reasoning/thinking models.

By now I hope you’re starting to see that LLMs really are next-token predictors underneath, and all their actual capabilities—the ones that let them know what is happening in real-time and change things in the world—are provided through the runtime. As the runtime grows more powerful and capable, LLMs must also be post-trained (using reinforcement learning a.k.a. RLHF) to use them well.

What I’ll be covering next

Next issue: Issue 176: Retrieval-Augmented Generation (RAG)