Layman's Guide to Computing - Season 14 category

Articles in the Season 14 category

Issue 170: Machine learning models

Models simplify and represent a relationship between input values and output values. The more complex the relationship, the more parameters the model needs to learn. Models are simplifications of reality, and their performance depends on how well they capture underlying patterns in the data, as well as the quality and quantity of the dataset.

Issue 171: The first Generative Pre-Training model, GPT-1

The Transformer architecture, unlike previous machine learning model architectures, could generate its next item while processing all previous items at the same time. The technique of unsupervised learning trained models on unlabelled data, letting the model pick up patterns in underlying data instead of having it learn correct answers only, and was much faster than supervised learning. OpenAI applied both these ideas at scale, producing GPT-1, a model that beat best-performing models while requiring relatively little human supervision during training.

Issue 172: Tokens, the currency of LLMs

A model does not see letters or words, only tokens. These tokens are typically generated from user input through a pre-tokenizer program. Tokens are represented in the model as embeddings, a sequence of numbers representing the token’s position in the embedding matrix. The model uses each token’s embedding, and its surrounding tokens, to infer its meaning in context.

Issue 173: Training, Inference, and Scaling

OpenAI discovered, through models GPT-1 to GPT-3, that scaling compute and (training) data alone was sufficient to sharply increase the capabilities of a LLM: the transformer architecture and unsupervised learning together resulted in a model that was alarmingly intelligent.

Issue 174: Reinforcement Learning

Through reinforcement learning with human feedback (RLHF), the LLM is trained on labelled data until it can reliably follow instructions, avoid harmful output, and follow other desired behavior. A system prompt provides guidelines for output. The user’s prompt is inserted into a templated prompt and passed to the LLM, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive chatbot.

Issue 175: LLM tools

LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the LLM in the next input.

Issue 176: Retrieval-Augmented Generation (RAG)

In retrieval-augmented generation (RAG), the runtime performs a search with the user’s request to retrieve relevant chunks from a set of documents from a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the LLM’s input. One alternative to RAG, where information lookup happens outside of LLM generation, is to provide the LLM with search tools instead, and rely on its judgement to use them well.

Issue 177: Multimodal models

Multimodal models represent text, image, and audio tokens alongside each other in their embedding space. The model uses the input tokens, regardless of type, to calculate the next output token. Multimodal models typically only output text tokens in their response, delegating to more specialized models for image and audio generation if necessary.

Issue 178: Model thinking and reasoning

Thinking/reasoning models are those that have been trained on examples of how to think about different problems in different domains, or plan and execute complex tasks. They often use tools to aid them in goal tracking and updating. The full thinking trace from the model may be removed or hidden to present a more legible response to the user.

Issue 179: Agents

Agents are software applications that comprise a harness, a runtime, and a model (typically accessed through an API instead of directly executed on the computer). They enable a user to type in a request or send it by other means and thus instruct the agent to carry out a task on the computer until completion. The capabilities of agents are limited by the tools available to them.

Issue 180: Running a model

Proprietary models do not have their weights published publicly, while open-weight models do. Various runtimes are available for download, and can run models that have a compatible file format. But models are extremely compute- and memory-intensive, requiring extremely high-end hardware and capacious memory to run.

Issue 181: Quantization

Quantization trades parameter precision for a smaller memory footprint and faster inference, making many models feasible for running on user devices. Model capabilities depend on their parameter count and training data. Models with higher parameter counts can represent more patterns, while model capabilities are added by training them on well-labeled data.

Issue 182: Running a model, part 2

Open-weight models range in size from sub-1B to 100+B. A range of device options below SGD6,000 are already capable of running these models, ranging from the humble Raspberry Pi for running harness support to the Mac Studio M3 for running 70B models.