Layman's Guide to Computing - Season 14

Issue 182: Running a model, part 2

2026-08-31T08:00:00+08:00

Previously: Quantization trades parameter precision for a smaller memory footprint and faster inference, making many models feasible for running on user devices. Model capabilities depend on their parameter count and training data. Models with higher parameter counts can represent more patterns, while model capabilities are added by training them on well-labeled data.

Last issue, we discovered that there are quantized models that can actually run on laptops. (You can also run GPT-1 and GPT-2 on a laptop, but you would likely be disappointed in their performance today given the leaps-and-bounds improvement in AI capability that have happened since 2022.)

Besides gemma-4-12b, what else can we run?

Open-weight model options

Open-weight enthusiasts have a number of well-known options available to them (sizes are unquantized):

Google’s Gemma 3 series, available in sizes 1B to 27B
Google’s Gemma 4 series, available in sizes 5B to 31B
Microsoft’s Phi series, available in sizes 2.7B to 14B
Meta’s Llama series, available in sizes 8B to 405B
Alibaba’s Qwen series, available in sizes 0.5B to 72B
Deepseek’s titular Deepseek series, available in sizes 16B to 72B
Mistral’s Mistral series, available in sizes 7B to 176B

There are also many lesser-known models, whose capabilities are still increasing every few months.

I won’t give you a comprehensive low-down on what each model is good for, because:

the models can be fine-tuned by those who know how, and may have variants that are better at specific task categories,
the models are updated every few months, and see new capabilities added through post-training (supervised learning),
the agent harness and runtime do play a part: some models are useful “out-of-the-box”, some work best within a particular harness or with a particular set of tools

Model capabilities

In Issue 181 I mentioned that more parameters lets the model represent more patterns in its weights, while better training data determines the model’s capabilities. Useful to know as a general pattern, but difficult to apply when deciding on a specific model to run. Should we just run the largest model that our device is capable of running?

As of June 2026:

0.5B–3B models can handle classification, extraction, summarization tasks and are generally good for single request-response purposes.
7B–9B models are useful assistants (in a harness) that can hold short conversations, handle basic Q&A, do simple coding or tool calls, and otherwise generally match GPT-3’s capabilities.
12B–15B models can follow instructions consistently, generate code that mostly works (and do some debugging if necessary), generate tool calls more reliably, making them capable tool-using agents.
27B–35B models can handle most tasks, even across longer contexts: analyze documents and write reports, generate and debug code, execute requests involving multiple steps. With a well-designed harness and accurate task documentation, these become capable general-purpose agents.
70B models can handle what previous tiers can do, but better: fewer hallucinations and mistakes, better answers, better general understanding, more consistent planning, and over longer context windows—smaller models sometimes see a sharp performance drop when the context window extends past a certain length. Some users report better reasoning performance as well.
100B+ Frontier models—GPT-5, Claude Opus, Kimi K2.5, et al—can do all of the above, with state-of-the-art reasoning and thinking, knowledge, error recovery, ambiguity handling, and more

Specialized non-LLM models include:

OpenAI Whisper (0.4B–1.5B) for speech-to-text transcription, text-to-speech generation
Stable Diffusion (0.9B–8B) for text-to-image generation
FLUX.1 (12B) also for text-to-image generation
CLIP (0.4B) for image-to-text understanding
Stable Audio 3 (0.6B–2B) for text-to-audio generation

Models are still improving through post-training (supervised learning) and distillation—a process by which small models are trained on output from larger, more capable models. A 9B model today already exhibits capabilities that GPT-3 (175B) was capable of in 2022. So you should expect a different set of capability tiers this time next year.

Hardware options

The sweet spot for “value-for-money” sits around 12B–35B for now. Smaller models are faster and use less memory. Speed decreases and memory use increases as model size increases.

With this in mind, these are some popular options for running models on-device (local deployment) as of June 2026 (prices are Singapore retail):

Raspberry Pi (8–16GB RAM): popular for tiny models (2B or smaller), used to generate document embeddings for search, OCR documents and clean up the OCRed text, etc. These form the support system for the agent harness, and usually are not used directly for the agent models.
Mini-PCs with a sufficiently capable CPU, no dedicated GPU are a decent budget option.
- AMD Ryzen AI 300 CPUs, 12 CPU cores, 8–12 GPU compute units & 64GB RAM: this can run 7B–13B models capably (if slowly), and 34B quantized models at a crawl. [~SGD2,000]
- AMD Ryzen AI MAX+ (Strix Halo) CPUs, 16 CPU cores, 40–48 GPU compute units & 256GB RAM: this bundles a much more capable integrated GPU (Issue 123) and can run 34B models capably, 70B models at a crawl. [~SGD4,800]
- Mac Mini M4, 12 CPU cores, 10 GPU compute units & 24GB RAM: In a similar category as the Ryzen AI 300. [SGD1,299]
- Mac Mini M4 Pro, 14 CPU cores, 20 GPU compute units & 48GB RAM: In a similar category as the Ryzen AI MAX+. [SGD2,659]
- Mac Studio M3 Ultra, 28 CPU cores, 60 GPU compute units & 96GB RAM: With the highest memory bandwidth of all the units in this category, this can run everything mentioned above, and even run 70B models decently well. That’s what most folks would be buying this for.
  A higher-end 32 CPU core, 80 GPU compute unit configuration exists if you add SGD2,025—doesn’t add new capabilities, makes everything a little faster. [SGD5,199]
Full PCs with a capable CPU & dedicated GPU
- Many options exist here, none below SGD6,000, most above SGD10,000. Dedicated GPUs capable of running AI models already have prices in the thousands.

If you already have an existing laptop/PC and want to know how it will manage different model sizes, you can ask ChatGPT or Claude; they are pretty up-to-date with hardware capabilities and can give you an estimate. Alternatively, try to download and run the models and see for yourself—ground truth doesn’t care about your estimates.

Cloud options

Wow that’s a lot of zeros. Besides, owning hardware comes with its own maintenance needs and headaches. Enter the cloud, i.e. pay-per-use.

If you don’t want to have to manage the hardware that runs these models, don’t plan to be running a model long-term, or want to run a model larger than what your hardware can handle, these are the current most user-friendly options:

HuggingFace not only catalogues model weights, it also automates inference hosting (provided by AWS or Google Cloud underneath). Caveat: not all models are supported; you need a model that lists “HF Inference API” as an Inference Provider. The HuggingFace link in this bullet point links you to models that do. On the model card page, click Deploy > HF Inference Endpoints
Replicate provides an even simpler interface, but for a smaller catalogue of models. Try out the models directly on the model card page, or create an account for deployment options.
Fireworks AI is where you go once you’ve decided on a (supported) model and want reliable hosting. Browse their model list and click Try In Playground or Deploy On Demand (requires registration).

There are other options that require more technical expertise to use, but if you reach that point you shouldn’t be relying on a layman’s guide anymore :)

Issue summary: Open-weight models range in size from sub-1B to 100+B. A range of device options below SGD6,000 are already capable of running these models, ranging from the humble Raspberry Pi for running harness support to the Mac Studio M3 for running 70B models. For larger models, or short-term workloads, cloud options for deploying and running open-weight models also exist.

This is the most tentative issue for this season, and probably for the entire newsletter so far. I try not to write issues that I will have to retroactively edit as the frontier shifts, but I’ll make this an exception: I think expounding on available open-weight models illustrates how the ecosystem is similar to open-source software, that allows the (sufficiently educated) public to experiment and provide feedback, how advances in AI over the past 3–4 years have made them feasible to run on consumer-class devices, and how cloud infrastructure has made larger models accessible to those who don’t own sufficiently powerful hardware.

The Layman’s Guide to Computing archive

Buttondown still does not have a very browseable archive, so I’ve made the newsletter content available on a static site. You can browse past seasons more easily at https://ngjunsiang.github.io/laymansguide/categories.

I may add more seasons in future, as computing technology stabilizes enough for me to write about them in a static newsletter. If you’d like to receive future issues, do subscribe below:

Issue 181: Quantization

2026-08-24T08:00:00+08:00

Previously: Proprietary models do not have their weights published publicly, while open-weight models do. Various runtimes are available for download, and can run models that have a compatible file format. But models are extremely compute- and memory-intensive, requiring extremely high-end hardware and capacious memory to run.

Great, so a 12B model takes up 24GB of disk space, uses 24GB of RAM, and up to 96GB for the KV cache (model’s calculated representation of input tokens). That’s out of reach for most consumers without AI-grade GPUs, which currently cost tens of thousands per unit.

Enter quantization.

Parameter representation

Models are typically trained with full precision, allowing them to store each parameter using 16 bits (2 bytes). This is necessary because the training process results in multiple adjustments to the weights. If the intermediate values are not stored with full precision, subsequent adjustments to those values are not accurately represented, and may result in inaccurate training results.

However, once the model is trained and its weights released, they are effectively “frozen”: the weights do not change as the model is used for inference (Issue 173).

Quantizing parameters

Can we reduce the model size and memory footprint by reducing the precision? Yes. Experiments have shown that models lose some accuracy as their parameters are quantized: represented using 8 bits (twofold reduction), or even 4 bits (fourfold reduction!). Below that range, running the model at 2 bits often results in unacceptable performance.

This inaccuracy shows up in models not following instructions as well, potentially making mistakes more noticeably, especially on complex tasks, or being less accurate with tool call syntax. However, compared to the alternative of not running the model at all, this is usually an acceptable tradeoff for users running the model on their own computers.

Running a quantized model

Okay, let’s run those numbers on a quantized Gemma 4 12B model. We don’t even need to do the quantization ourselves usually: other enthusiasts have already done it, providing the models on HuggingFace as well (they can be identified through the “Q4” in the model naming scheme; 8-bit quantized models are labelled “Q8”).

We already see immediate benefits: the 4-bit quantized model weights are only 7GB, a stark contrast to the 24GB of full-precision weights.

The KV cache requirement now drops to ~6GB for 32K tokens, and ~50GB for 256K tokens. Very uncomfortable for a Macbook, which means we would have to limit ourselves to a 128K or even 64K token context length. Annoying, but not show-stopping.

The inference speed now increases to ~60 tokens/sec, about as responsive as ChatGPT or other chatbots!

What do we gain from larger models?

Unlike programs or data files, which store data as-is (perhaps compressing them for a smaller filesize), models represent information: the training process produces a highly compressed set of numbers that are able to approximately reproduce the training contents (not 100% accurately, but quite close), and more importantly generate tokens following the same pattern for inputs that it was not trained on.

What if we try to break the laws of physics, taking GPT or Claude’s training corpus, and training it into a 1B model? What happens?

1B parameters means the model only has 1 billion numbers to try to represent everything. If the training data is repetitive and largely similar, 1B might even be sufficient since there just isn’t that much variation in the data.

But if the data is highly varied, the model might not be able to adjust the weights to represent everything. It will end up storing one additional data point at the expense of worse representation for other data points. This might show up as a plateau in benchmark scores: the model can’t improve further. Or it might show up as the model not “remembering” data that shows up less frequently.

What do frontier models, often with parameter counts running into trillions, gain? With so many parameters, they can represent more patterns: more thinking scaffolds and reasoning frameworks, more sentence/paragraph patterns from more books and articles, etc. And not just more patterns, but higher-order patterns: writing styles, writing intents, idea development, longform writing structure, etc.

Google’s Gemma 4 12B model will end up not being able to represent everything. Our running model might give less nuanced answers, consider fewer perspectives in its answer, and otherwise give worse answers.

But hey, it runs! Give it a spin, see what you can do with 12B parameters.

Model capabilities

Even frontier models with poor training data will disappoint. 1 trillion parameters won’t necessarily make a model much smarter if the training data is poor.

Most new capabilities are added through additional training, usually supervised learning. If we can’t train the underlying model, we might be able to create skill files explaining how to do something, let the harness read it and add it into the input context, and lean on the model’s pattern-following capabilities to tackle the task.

Either way, if you have the hardware to support it and manage to get a local agent running, try it with different questions and tasks to get a feel for what it can and cannot handle. That beats any amount of reading on what these models are supposed to be able to do.

Issue summary: Quantization trades parameter precision for a smaller memory footprint and faster inference, making many models feasible for running on user devices. Model capabilities depend on their parameter count and training data. Models with higher parameter counts can represent more patterns, while model capabilities are added by training them on well-labeled data.

12 issues in, that’s a wrap! At this point I think what I’ve written is what’s unlikely to change in the next couple of years, and still useful for layfolks to know about the ongoing AI development. Anything newer is still in active development.

What I’ll be covering next

Next issue: Issue 182: Running a model, part 2

In the last issue, I’ll explore other options for running a model on your device (called local deployment in parlance): running smaller models, and other feasible hardware options.

Issue 180: Running a model

2026-08-17T08:00:00+08:00

Previously: Agents are software applications that comprise a harness, a runtime, and a model (typically accessed through an API instead of directly executed on the computer). They enable a user to type in a request or send it by other means and thus instruct the agent to carry out a task on the computer until completion. The capabilities of agents are limited by the tools available to them.

As of June 2026, OpenAI and Anthropic charge about $20/mth for their Pro/Plus plan, and about $200/mth for their Max plan. For those of us who like to stay on free tiers, it can be pretty annoying to hit the dreaded “You have reached the limit for Free plan”, but what can we do short of shelling out for a higher tier?

Wait—if a language model is a bunch of numbers, and a runtime is just a program, why can’t I run it on my own computer instead?

Proprietary models and open-weight models

For starters, you can’t download the GPT-5 or Claude models. They are proprietary models, and their weights (the file containing the model’s parameters) are a guarded trade secret; a leak of the weights would be disastrous for OpenAI or Anthropic.

Okay, fine you say, then let’s run something I can actually download. As of 2026, that typically means you would go to HuggingFace (yes, that is their actual name), currently the world’s largest platform for hosting open-weight models. An open-weight model, analogous to open-source software, means the model’s weights are publicly available and you can download them.

The parts: downloading weights

Let’s download the currently top-trending model, Google’s gemma-4-12B-it. The model card says that this is a multimodal model (Issue 177) with 11.95 billion (12B) parameters (Issue 170). It has a context length of 256K tokens (Issue 172)—important when deciding what kind of tasks it can plausibly take on, since the context length dictates what the total output length (including the input tokens) cannot exceed.

Under Files and versions, we see a whole bunch of files, most of them metadata, configuration information, and other data (such as the token list). The model weights are easy to tell: they are by far the largest file of the collection, weighing in at 23.9GB. We can calculate this: 11.95 billion parameters, with each parameter taking up 16 bits (Issue 40), means 2 bytes per parameter, and thus 23.9 billion bytes for all the parameters. 23.9GB.

The runtime

You have a few options here, listed from easiest to most difficult:

LM Studio – Comes with a graphical user interface (GUI), so click to load the model and you get a chat interface. Great for getting started ASAP, not great if you actually eventually want to use it as an agent.
Ollama – A commandline program, requiring some terminal chops. Sets up an API server that you can use with many other programs.
Hugging Face Transformers – A Python library for working with models, which means it’s programmers-only. Great if you are building or customizing your own agent harness, but definitely not ready-to-run as-is.
llama.cpp – The most low-level, close-to-the-metal option. Gives you a commandline program for using the model, but you have to manage all other technical detail on your own. Not for the faint-hearted.
vLLM – A GPU-only library for serving models over an API. Presumably we do not have four thousand bucks to spend on an entry-level GPU for models, such as the RTX 4090 with 24GB of GPU memory, and are running the model on a CPU, so this option is automatically disqualified for us.

Hardware requirements

Great. So we’ve downloaded and installed LM Studio, launched it, and then selected our gemma-4-12B-it model for loading.

A screenshot of LM Studio
Source: LM Studio

The first thing that would probably happen is your system will complain about insufficient memory and stop. You see, to run this model, we would need to read the model weights (23.9GB) into memory, immediately using up 24GB of memory. Even assuming no other apps are running, we still need more memory for the following: - operating system overhead (~1-2GB) - memory used by the runtime (1-3GB)

Oh? It didn’t crash for you? I see, you had the Macbook Pro with 64GB memory, or something in that weight class. Great, let’s start prompting your model then. It won’t work as quickly as ChatGPT, but it should manage a comfortable ~20–30 tokens/sec, slightly slower than reading speed but useable.

Unfortunately, as you ask more and more questions within the same session, it will run more and more slowly, and eventually it will crash. You see, the model generates a representation of the entire input, called the KV cache, which stores its computed values for how each token in the input relates to other tokens in the input. This is estimated to take up ~12GB for 32K tokens, so ~96GB if using the full 256K context length.

Yeah, this isn’t for the faint-hearted.

Issue summary: Proprietary models do not have their weights published publicly, while open-weight models do. Various runtimes are available for download, and can run models that have a compatible file format. But models are extremely compute- and memory-intensive, requiring extremely high-end hardware and capacious memory to run.

This is the pessimistic view. Next issue, we look at some optimizations that are available even to newcomers to enable models to run faster and with a smaller memory footprint.

What I’ll be covering next

Next issue: Issue 181: Quantization

Issue 179: Agents

2026-08-10T08:00:00+08:00

Previously: Thinking/reasoning models are those that have been trained on examples of how to think about different problems in different domains, or plan and execute complex tasks. They often use tools to aid them in goal tracking and updating. The full thinking trace from the model may be removed or hidden to present a more legible response to the user.

Let’s review the ingredients we have so far:

A large language model (Issue 170) or multimodal model (Issue 177): a next-token predictor that takes input tokens and keeps generating output tokens which feed back to the input
Training data, which the model is trained on to pick up general patterns through unsupervised learning (Issue 171), and then steered to avoid harmful output and generate useful output through the use of labelled training data through supervised learning (Issue 174)
A runtime (Issue 175), which handles multiple responsibilities:
- parsing the model output to block it if found to be harmful
- formatting the text for display to the user
- separating and executing tool calls (typically in an isolated container), and injecting the results back into the input (Issue 175)
- processing thinking tokens, removing or hiding them (Issue 178)
Other optional runtime extensions, such as those that add retrieval-augmented generation (RAG) capabilities (Issue 176), or add information that the model remembered about the signed-in user

What does an agent do?

AI Agents

agent(n.)
late 15c., “one who acts,” from Latin agentem (nominative agens) “effective, powerful,” present participle of agere “to set in motion, drive forward; to do, perform; keep in movement” (from PIE root *ag- “to drive, draw out or forth, move”).

The term “agent” means “one who acts”. So agents are software applications, comprising a trained model and a runtime. We can broadly think of the model as the “brains” of the partnership, and the runtime as the “body”.

Because agents need a computer (physical or virtual) to “act”, these software applications are typically installed on a computer, although they may also include a web interface to allow users to control them remotely.

The model has remained conceptually similar as I went from Issue 170 to here, but the runtime is picking up more and more responsibilities. So as not to muddy the terms, I’ll keep the runtime focused on the model: processing the output, executing tool calls and injecting results, re-invoking the model if it has not reached a stop token, and any RAG if implemented. Everything else that we are adding today, that makes the agent an effective partner and piece of software, I’ll explain under the label harness.

The model

Some harnesses make it easy to swap out the underlying model, allowing the model to run the agent harness with different models. Many model providers have standardized on OpenAI’s API (Issue 4) so as to make their models easily accessible to programmers.

While state-of-the-art models are capable enough to not require a more specialized version for agentic use, the agent harness usually provides a special system prompt for this purpose. This special prompt includes information on the use context, on the tools available to the model, and other pertinent information to guide the model and keep it on task.

The runtime

A runtime used within a harness needs to include additional features: the ability to pause or stop the model, to understand access control configuration (which tool calls require user approval) and route matching tool calls to the user for permission grants, and introspectability: allowing the harness program to check the state of the runtime and model.

The harness

When a user uses agentic software, the harness is what they see. That means the harness handles typical software responsibilities:

it handles installation and initial setup, allowing the user to select a directory that the agent will begin working from
it handles extensions/plugins that the user may wish to install, making the tools/MCPs (Issue 175) available to the runtime
it handles file uploads (and any necessary format conversion or resizing), request customisation (e.g. enabling extended thinking), other request-related settings
it handles the model output through the runtime, displaying to the user tool calls and their results, any visible thinking traces, and any permission requests which come from the runtime (remember that the model remains unaware of these). If the API supports it, the harness streams these to the user, allowing them to see tokens as the model outputs them, without having to wait for the model to finish the entire response
it provides an interrupt mechanism for the user to halt the runtime if the model is going off-track, or to queue up more messages for the runtime to inject into the request at an appropriate juncture
some harnesses may support agent memory features, giving the agent tools to write information to its internal memory, and retrieve the information when required
harnesses for continuously running agents may include features for setting the wake-up interval of the agent, e.g. invoking the agent every 30 seconds with standard instructions to check for outstanding tasks and complete them
harnesses that integrate with external services will include features for receiving requests via email, WhatsApp, Telegram, or other channels, passing them to the agent and returning the response when it is ready.

What an agent does

… I don’t know what to say here. By itself, a model can do nothing besides generate text. When embedded in a harness+runtime, what it can do is limited by the tools it has available—remember that the model relies on the runtime executing its tool calls to have any effect on the world.

With simple toolsets (primarily a commandline tool), the agent can plausibly:

read, edit, and delete text files on the computer
search through files on the computer
check the computer’s stats, such as memory usage, free space on disk, CPU usage
troubleshoot or diagnose computer issues
perform a web search or retrieve a web page

If given the appropriate tools and permissions from the user, the agent can also:

install or uninstall software on the computer (through the commandline)
download source code, compile it, install it, and run it
run a server on the computer, handle web requests, return responses
read, write, and test code
push code to a code repository
add bug reports or issues to a task board, or read existing ones from it
send requests to an API (if authenticated by the user), and thus execute any supported action through the APIs of Google Drive, Dropbox, Notion, and other services (Issue 6)

With more advanced tools or MCP servers that handle the complex details, an agent can even:

be registered as a plugin in Adobe or Microsoft Office software, reading and editing documents
work with PDF files
fix bugs

When provided with detailed explanations of how to perform complex tasks (typically through a skill file that the agent can read), the agent can plausibly:

analyze large datasets
follow company workflows
scan software or APIs for vulnerabilities

… Why haven’t they taken over the world yet?

Because most people aren’t using them!

… Just kidding, there are other reasons too. For example:

Most complex tasks aren’t described in skill files that are agent-readable, or are not well described
Many of the advanced tools or MCP servers that are needed don’t exist, e.g. those for editing PDF files reliably. If they exist they aren’t always reliable
The really effective tools might be hyper-customized for the tool author and not as useful for others
Most users are used to doing things themselves, and don’t have enough experience with an agent harness to be accustomed to instructing one
Users might not know that it is possible to do something, and have not considered asking an agent to do it
Agent models still have limited context windows (even a context window of 1 million tokens can fill up quickly with a sufficiently complex task), and ways to enable a model to keep relevant task details in context while removing irrelevant details are still being studied
The model might not have been trained on a particular task, and its general reasoning capabilities might not be sufficient to carry out the task effectively
Agent harnesses tend to run in the commandline, or be designed primarily for programmer use, thus scaring layfolks away
…

Agent capabilities tend to be emergent. That means researchers and frontier labs can train a model to carry out tasks A, B, and C, and a user giving the agent a different kind of task discovers that it is also effective at task D but not task E.

Generally, a question can “can an agent do F?” can’t be answered definitively prior to actually asking the model to do F. And even if one person fails to get the agent to execute the task successfully, another person might succeed, because they asked differently, because they are familiar with the terminology required to instruct the agent, or for some other reason.

All of this is still ongoing research work: agents only really took off in 2025, when Anthropic released Claude Code which became the first generally capable agent. Since then, every day users are discovering new things that it can do. The things that it can’t, Anthropic and other frontier labs are still training it to be able to do them.

Issue summary: Agents are software applications that comprise a harness, a runtime, and a model (typically accessed through an API instead of directly executed on the computer). They enable a user to type in a request or send it by other means and thus instruct the agent to carry out a task on the computer until completion. The capabilities of agents are limited by the tools available to them.

You now have a pretty good idea of all the pieces involved in getting an AI agent to do things. The part I can’t authoritatively tell you about is what they can or can’t do, because that is still changing every week as frontier labs continue to train more capable models and agent harnesses continue to add more tools and features.

If you’re curious, consider trying them out. You could search for an online guide, or let ChatGPT/Claude help get you started.

What I’ll be covering next

In ten issues, I’ve walked you through the key concepts that help you understand what AI agents do. With three issues left to go, what else should I cover?

Some questions I’m anticipating, or have fielded some variant of:

Can I run my own AI model?
Why can’t the AI do <thing>?

Question 2 has a boring answer and an interesting one. The boring answer is “because it hasn’t been trained yet”. The interesting answer is … not really suitable for a newletter titled Layman’s Guide to Computing, because it’ll be rooted in philosophy and cognitive science. In a different publication perhaps.

So let’s tackle question 1, which will draw on computing concepts I’ve covered in earlier issues and give you an idea of the kind of compute and memory capacity needed to run a model.

Next issue: Issue 180: Running a model

Issue 178: Model thinking and reasoning

2026-08-03T08:00:00+08:00

Previously: Multimodal models represent text, image, and audio tokens alongside each other in their embedding space. The model uses the input tokens, regardless of type, to calculate the next output token. Multimodal models typically only output text tokens in their response, delegating to more specialized models for image and audio generation if necessary.

In this issue we fill in the last piece of the puzzle needed to “unlock untold economic value”, if the AI labs are to be believed. Let’s talk about how models “think”.

Making thinking happen

You’re in a lesson. The teacher asks a question, something innocuous really: “What’s the value of X?” All eyes are on you. You reply with the first answer off the top of your head. Wrongly, it turns out.

Your teacher could mock you at this point, but if they decide to get you to think harder instead, what do they say?

As it happens, this trick works on LLMs too. The ways we try to get people to think harder appear to be well-represented in books, on the internet, and in other media that the models are trained on.

What this means is that you add any of the following:

“think step by step.”
“think carefully.”
“check your assumptions before you answer.”

And it influences the model’s next token. It begins to output phrases like:

“Let’s break this down.”
“First, let’s identify what’s being asked.”
“One way to approach this is…”
“Before answering, let’s consider…”
“Let’s work through the problem systematically.”

It begins to imitate the patterns of careful thinking that it picked up during training. Surprisingly (or perhaps unsurprisingly), this improves the model’s answer in many cases! It generates a much longer answer, taking more time and using more compute in the process—this is what AI folks call “spending compute for intelligence”. If you don’t have a large LLM, you can have a smaller LLM “think harder” and come up with a better answer.

Where thinking breaks down: insufficient examples

When this trick was first discovered, early adopters experimented with different prompt patterns, trying to get models to generate longer responses that led to better answers. But thinking doesn’t always succeed. We’ve all had the experience of trying to think through some difficult math problem, writing lots of working that ultimately led nowhere.

GPT-3 may have been trained on a really large dataset, but most webpages and books are not showcases of how to solve difficult problems through clear thinking.

So it’s back to supervised learning again. Look for examples of how to solve difficult problems. Recruit experts and have them write down their chain of thought for different kinds of problems. Then train the model on this labelled data, so that it doesn’t require users to be clever with prompts to extract this thinking. Train the model to differentiate between requests for a quick answer, and requests requiring deeper thinking.

Thinking vs. planning

A model that is able to think longer and in a more disciplined way to produce a better answer is able to tackle harder questions. These are the models that were solving olympiad questions that humans struggled to solve.

But this isn’t enough for another kind of challenge: long-horizon tasks that involve multiple tool calls, putting together information and feedback from multiple sources, maintaining task coherence and a consistent goal orientation throughout the process, and finally producing output in the correct format.

For example, filing tax returns involves digging through a large number of financial documents, remaining aware of legal requirements for filing, extracting relevant information, and putting it together following those requirements. None of the steps along the way involve extreme intelligence or genius insight, it’s just a lot of tedious steps and details to keep track of. Along the way, detours and failed tool calls threaten to derail the model; it can get stuck researching an edge case rule, debugging a failing tool call, or get distracted by other things.

This requires the model to plan. It has to take an end-goal, break it down into phases and steps, think about immediate steps, execute them and observe the result, decide next steps, repeat, …. Along the way, it has to keep track of goals and sub-goals (usually aided by task management tools), be able to tell when they are met and check them off the list.

Books and websites seldom contain detailed worked examples of how to do this, so the model has to be trained with labelled data (again!), given examples of planning steps through supervised learning until it is able to reproduce them reliably.

Hidden vs visible thinking

Frontier labs found that showing the full thinking process to users isn’t always beneficial. For example, the full thinking trace—tokens that constitute the analysis and are not part of the final answer—could be really lengthy. Users tend not to like that; they want to see the key steps for a quick check, and then the final answer.

Perhaps the full thinking trace includes mistakes the model made and corrected later, erroneous tool calls that it subsequently fixed, search tool calls which the user does not need to see the full contents of, etc. In other cases, frontier labs may have found ways for the model to output a more efficient form of thinking with tokens that is not human-readable.

This means one more step in the runtime: detecting and processing thinking tokens. If the model is trained to demarcate thinking tokens with a special start and end sequence, e.g. <thinking>...</thinking>, the runtime may look for it.

Once detected, this hidden thinking may be removed, summarized (with a different model), or collapsed to take up less space in the user interface.

Issue summary: Thinking/reasoning models are those that have been trained on examples of how to think about different problems in different domains, or plan and execute complex tasks. They often use tools to aid them in goal tracking and updating. The full thinking trace from the model may be removed or hidden to present a more legible response to the user.

This really is the primary concept behind thinking/reasoning models: more supervised training to output a sequence of tokens that lead the model to a useful answer.

If this sounds simple, that’s because most of the magic is in the model training: crafting and labelling training examples, and then training the model on them, is a much more complicated process than it sounds, and I’m excluding it from this issue because it is very technical and not suited for a newsletter named Layman’s Guide.

Now you know what a model is doing when you activate a feature named “Extended Thinking”, or switch to a model that is described as a thinking/reasoning model.

What I’ll be covering next

Next issue: Issue 179: Agents

Finally we can talk about this term, “agents”, and what differentiates them from a model. If you’ve heard this term before and wondered what goes into one, subscribe to be notified when I lay it bare ;)

Issue 177: Multimodal models

2026-07-27T08:00:00+08:00

Previously: In retrieval-augmented generation (RAG), the runtime performs a search with the user’s request to retrieve relevant chunks from a set of documents from a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the LLM’s input. One alternative to RAG, where information lookup happens outside of LLM generation, is to provide the LLM with search tools instead, and rely on its judgement to use them well.

Multimodal models. Try saying that three times quickly. It’s quite a mouthful, but if you’ve managed to keep up so far, it’s really not complicated, so I don’t expect this to be a long issue.

Multimodal models

While a large language model works only with text tokens, a multimodal model can work with other types of tokens as well. We’ve previously covered what text tokens are and how LLMs use them (Issue 172), so let’s focus on image and audio tokens.

The approach is similar, really: text gets broken up into common repeating patterns. Image and audio likewise gets broken up into common repeating patterns. Each common repeating pattern is represented by a number, or set of numbers, and located in an embedding space (Issue 172).

Image tokens

There are a variety of approaches for tokenizing images. A common way to do this is to break it up into 16×16-pixel patches. Each pixel has three values representing red+green+blue (Issues 43 & 44), so each patch is a sequence of 16×16×3 = 768 values.

Each unique combination of 768 values constitutes an image token. During training, these image tokens appear alongside other tokens (text, image, audio), and the model adjusts its embedding parameters to locate semantically similar tokens in close proximity.

During inference (Issue 173), hidden layers represent more abstract patterns that the model identifies: lower layers may encode information about edges, while higher layers capture information about shapes, textures, and even objects.

Audio tokens

While intuitively it seems natural to chunk audio into 1-second or even sub-second samples, in reality 1 second of audio contains 44,100 samples (Issue 45) which is still far too large.

Instead, audio is usually converted from waveform representation (amplitude vs time) into spectrum representation (frequency vs amplitude at a snapshot in time). The spectrogram gets split into shorter windows of a few milliseconds each (a few thousand samples per window). The values of each frequency in that window then naturally form an audio token, which appear alongside other tokens in training and get represented in embedding space the same way as other tokens.

Multimodal models need supervised training

Supervised learning plays a big part here. Images, audio, and text seldom appear together in unlabelled training data (except in video), so associating images and audio with text relies heavily on manual labelling. This is why multimodal models took so long to emerge.

During inference, all tokens regardless of type are represented as embeddings, and the model uses the input tokens to calculate the output token.

Multimodal models vs image/audio generation models

An app like ChatGPT can take user-uploaded image files, reference them in their response to the user, and then generate an image, or even convert the response from text to audio. But this seamlessness is an illusion; at the backend, these do not use the same model.

Multimodal models can take input tokens of multiple types, but typically only generate text in response; users do not expect image patches or audio snippets in the response, and would not know how to interpret them.

Instead, image and audio generation use different kinds of (non-Transformer) models, which might be worth exploring briefly in a future issue, but not this one.

Issue summary: Multimodal models represent text, image, and audio tokens alongside each other in their embedding space. The model uses the input tokens, regardless of type, to calculate the next output token. Multimodal models typically only output text tokens in their response, delegating to more specialized models for image and audio generation if necessary.

There you go. Multimodal models demystified: once you figure out how to tokenize something alongside text, and give the model lots of labelled data to associate it with text tokens during training, you can create another modality for your model. This sentence hides months of complexity that AI labs tackle, because that’s what you’re reading Layman’s Guide for, isn’t it?

What I’ll be covering next

Next issue: Issue 178: Model thinking and reasoning

We’ve covered retrieval-augmented generation (RAG), and now we’ve covered multimodal models. Text, images, audio: Check check checked. Tools? You bet.

We’ve got almost all the ingredients to assemble an AI to scare the economic labor pool, but we are still lacking one final piece of the puzzle: how do LLMs “think”?

Issue 176: Retrieval-Augmented Generation (RAG)

2026-07-20T08:00:00+08:00

Previously: LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the LLM in the next input.

We mentioned hallucination—mentioning non-existent publications, stories, or facts as though they were real—as one of the pitfalls of GPT-3, and mentioned how reinforcement learning with human feedback (RLHF) helps to combat some of these tendencies in general use.

These days, ChatGPT, Claude, and other chatbots also allow you to upload documents. The runtime supporting these chatbots helps to extract text from the documents with supporting context and include them in the system prompt, allowing the chatbot to answer from the document’s contents to combat hallucination.

In some cases, the document may be too large. In other cases, a company may have a large set of documents the LLM should answer from, but they are too large to all be included in the system prompt.

In such cases, retrieval-augmented generation (RAG) provides an alternative way to inject relevant information into the LLM’s system prompt.

Retrieval-Augmented Generation (RAG)

Like other LLM capabilities, this one comes from the runtime. The LLM plays no part in this and has no control over the process.

The source documents are chunked, and each chunk analyzed to create an embedding. Parts of the document that are closely related have embeddings located more closely.

Before the user’s input is passed to the LLM, it is parsed by the runtime and analyzed into an embedding. This embedding is used to retrieve relevant parts of documents; other information may be used to determine relevant portions as well.

Instead of embedding entire documents, only these relevant portions are included in the system prompt for the LLM to answer the user’s query. In more advanced implementations, the chunks may be further re-ranked by importance and other criteria.

All of this happens in the runtime, beyond the LLM’s token generation loop.

Limitations

When it works well, it works really well: the LLM doesn’t hallucinate, quotes from the source, and if the source is well-tagged, it can even cite from the correct page and paragraph.

But there are ways it can make mistakes too. If no matching documents are found and the LLM isn’t aware, it may hallucinate unless the runtime handles this well. On the opposite end of the spectrum, it may find too many results and not know how to select the most relevant ones. The documents themselves may be contradictory, incomplete, or require too much unwritten context. And lastly, it may miss important nuance found elsewhere in the document, or in other documents, that did not surface in the embedding search.

Alternatives

Still, in cases where you can’t fit entire source documents in the LLM context, what other alternatives do you have?

Then it’s back to a set of tools for your LLM to use for searching the company knowledge base, read documents, and manually extract relevant portions. Naturally, your LLM will need to be trained on a dataset of positive examples of tool usage (Issue 175). In contrast to RAG, where retrieval is automatic and built into the runtime, here you are relying on the LLM’s judgement of which tool to use, and when to use it.

Issue summary: In retrieval-augmented generation (RAG), the runtime performs a search with the user’s request to retrieve relevant chunks from a set of documents from a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the LLM’s input. One alternative to RAG, where information lookup happens outside of LLM generation, is to provide the LLM with search tools instead, and rely on its judgement to use them well.

Okay, that’s RAG de-mystified. It’s a program that runs a search on the user’s request and injects relevant chunks from the knowledge base into the LLM’s input, beyond the LLM’s control. Now you can speak about RAG a little more informatively.

I avoided discussing RAG’s performance, because results vary. For every detractor you can also find a supporter! Is it going to work well for you? You probably have to try it yourself, or find a consultant who can better advise you.

What I’ll be covering next

Next issue: Issue 177: Multimodal models

Many chatbot models accept image and even audio alongside text. How does this work? De-mystifying in the next issue, so stay tuned!

Issue 175: LLM tools

2026-07-13T08:00:00+08:00

Previously: Through reinforcement learning with human feedback (RLHF), the LLM is trained on labelled data until it can reliably follow instructions, avoid harmful output, and follow other desired behavior. A system prompt provides guidelines for output. The user’s prompt is inserted into a templated prompt and passed to the LLM, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive chatbot.

A chatbot is fun to use for a while, but if all it could do was talk we wouldn’t use it for very long. For starters, it would hallucinate a lot, or give outdated information, because it couldn’t access the internet or do a web search. What would it take for models to be able to use computers to do that?

While this problem was being actively worked on, LLMs were also being trained to generate programming code. It turned out that code, being text-based, was fertile training ground for LLMs. They were improving at it too; while early versions still failed at producing large yet coherent programs, many were able to generate boilerplate code with correct syntax already.

LLMs as tool-using models

For a LLM to use a tool, it needs to be trained to:

state the tool to use
pass the appropriate options
interpret the result, when passed back to the LLM (in the next request)

for example, to use a weather tool, the LLM needs to be able to:

say “use the weather tool”
pass options: “location: Singapore, SG, show me the temperature and humidity as well”
interpret the result: mostly self-explanatory, but e.g. it may need to understand if the location provided in the output may be the nearest known location and not the user’s actual location

It turns out that the first two problems are already known and solved: the first text-based interfaces were invented in the 1970s after all, and programmers have always needed a way to invoke programs through a text-based interface. They already had one, in the form of the command line (Issue 15) and the rich syntax that was already built around it. And they had another, in the form of the function call syntax that almost all programming languages had standardized on, like check_weather(location="Singapore, SG", show_temperature=True, show_humidity=True). And training data already existed for both of these, in the form of open-source code readily available online in code repositories (Issue 19).

The structure of a tool-using LLM

For a LLM to be able to output tool calls, you need:

tool specifications, usually injected through the system prompt, telling the LLM the available tools and their options
guidance on when to use each tool, typically through further instructions in the system prompt, through RLHF (Issue 174), or both
familiarity with the tool call syntax used, typically trained into the model through RLHF.

In the system prompt, you would include:

[...]

## Tools available

- `check_weather(location: text, show_temperature: boolean, show_humidity: boolean)
  Check the weather at the given location. Example: "Singapore, SG"
  Pass show_temperature=True and show_humidity=True if temperature and humidity are required in the output
- ...
- ...

Providing a rich set of tools without using up too many tokens is a tricky design balance that requires regular tweaking. In any case, the model is then trained to output the tool calls in a specially marked section of their output.

Invoking the tools

At the point when the model outputs the stop token and the program stops using it to calculate more output tokens, its involvement stops. The program interprets the model’s output, separating the tool calls out, and passes them to another system.

You see, tool calls can be pretty dangerous, especially if they enable the model to carry out destructive actions. A shell command like rm -rf / on Linux or Mac could delete the entire operating system, or important subdirectories. A delete_database tool could do what it says, but with the wrong target specified. So it’s common to have a system that examines the tool call and attempts to determine if it is safe. In a code assistant, this tool call might be shown to the user for explicit approval. In a web-based chatbot like ChatGPT, tool safety is usually handled by another system instead.

Once validated, the tool needs to be executed on a computer system. This computer system needs to have the necessary programs installed. It should also be isolated against potentially destructive actions. We’ve covered how containerization (Issue 149) enables this to be done; an isolated container for each session where necessary.

Finally, the result of the tool call, whether success or failure, is captured and then added to the token sequence which is fed back into the LLM.

This all sounds pretty neat, but with one caveat: only the chatbot provider (OpenAI for ChatGPT, or Anthropic for Claude) can pass these tools to the LLM. Third-party integrations, such as with GitHub or Google Drive, would be tricky for OpenAI/Anthropic to design on their own, yet unsafe for external parties to inject into the system prompt.

Integrating third-party tools

So in Nov 2024, Anthropic proposed another standard: the Model Context Protocol, a way for external parties to specify a set of tools that work together to enable access to other web-based or software-based systems.

When the user registers a MCP server through a graphical or text-based interface, the system reads the tool specifications from the MCP server, injects them into the system prompt, and from there they work like other tools accessible to the LLM.

The runtime

Notice that none of this is mediated or controlled by the LLM. It follows instructions, generates tool calls with the correct syntax in its output, then sees the result in the next input, seemingly by magic. The LLM is operating in a virtualized environment controlled by an external system that doesn’t have a standardized name yet. For now we’ll call it the runtime.

Tools and toolsets make or break a LLM-based assistant. They are the only way a LLM can take actions, get data, and otherwise make sense of the external world. A LLM without any tools is analogous to a human in a sensory deprivation tank—without information from the outside world, even human beings quickly begin to hallucinate.

Issue summary: LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the LLM in the next input.

From here it’s another 3 issues before we get to the topic of the year: AI agents. Before I get there I want to cover three more buzzphrases: retrieval-augmented generation (RAG), multimodal models, and reasoning/thinking models.

By now I hope you’re starting to see that LLMs really are next-token predictors underneath, and all their actual capabilities—the ones that let them know what is happening in real-time and change things in the world—are provided through the runtime. As the runtime grows more powerful and capable, LLMs must also be post-trained (using reinforcement learning a.k.a. RLHF) to use them well.

What I’ll be covering next

Next issue: Issue 176: Retrieval-Augmented Generation (RAG)

Issue 174: Reinforcement Learning

2026-07-06T08:00:00+08:00

Previously: OpenAI discovered, through models GPT-1 to GPT-3, that scaling compute and (training) data alone was sufficient to sharply increase the capabilities of a LLM: the transformer architecture and unsupervised learning together resulted in a model that was alarmingly intelligent.

Mechanistically, a LLM is a next-token predictor: from a set of parameters, and an input sequence of tokens, a program continually calculates the next token, which gets appended to the input sequence, and the new sequence gets fed in as the input again, until a stop token is generated.

OpenAI had discovered that by training GPT-3 (with over a hundred billion parameters) on a very large dataset (hundreds of billions of tokens), they ended up with a next-token predictor that appeared to generate readable, sensible text.

But that doesn’t mean that GPT-3 was ready for public use yet: what about those hallucinations, that toxic output, the prompt injections that caused it to ignore OpenAI’s instructions?

Reinforcement learning

Unsupervised learning may have created a genius model, but now OpenAI had to fall back on supervised learning to make it useful.

In 2022, OpenAI researchers submitted a paper titled “Training language models to follow instructions with human feedback”:

Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT.

It was back to painstaking human labelling of data again, getting humans to write desired outputs and label toxic content to train the model on. Through this process of reinforcement learning with human feedback (RLHF), InstructGPT was born.

RLHF was necessary to adjust the model parameters so that instructions like “explain …” were treated as guiding instructions rather than starting text that the model would steer away from.

Data cleaning and labelling

Prompt injections would continue to be an issue, but in the meantime OpenAI could address toxic content by first cleaning up the dataset to remove toxic, low-quality content and add other high-quality data sources.

This need for new, novel data sources still drives frontier machine learning labs today, who pay for high-quality data sources they can use to train their models.

Creating a chat assistant

InstructGPT was ready to take instructions. But … how do we get instructions from the user? How do we pass the responses back to them? The model was trained, the API was ready … but OpenAI needed a graphical interface, a familiar mental model of interaction that the public could use intuitively.

One already existed: chat apps like WhatsApp were popular at the time, and users intuitively understood a chat input when they saw one. But how could OpenAI get InstructGPT to respond reliably like a chat assistant with a consistent personality and style?

It turned out the answer was already in the training data.

Prompt framing

There was a lot of training data in the form of interviews, movie scripts, things that look like:

Alice: Why do cats like to jump on furniture?
Bob: …

And in many cases, arranging the user’s question along with a system prompt like so was enough to have the LLM roleplay a helpful assistant:

# System Prompt

You are ChatGPT, a large language model trained by OpenAI. [...]
Knowledge cutoff: 2024-06
Current date: 2025-09-03

Personality: Engage warmly yet honestly with the user. [...]

User: <user's input>
Assistant:

Pass the above prompt to InstructGPT and it helpfully follows the pattern, demonstrating GPT-3’s capabilities token after token, until it reaches a stop token. The program then takes the tokens generated after the prompt and displays them to the user.

What if the output is toxic, hallucinatory, or otherwise unacceptable? Back to RLHF again.

The ChatGPT wrapper

Even with the API in place, some window dressing is still needed. The LLM, being a language model, can only generate text, not format it. Most LLMs are RLHF-trained to generate text in a markup format (such as HTML or Markdown). The display system takes the LLM’s output, interprets the markup, and displays it as something the user can understand, making headers bold and larger, adding bullets or numbers to lists, formatting code accordingly, and so on.

The wrapper can also do some helpful things, like filter the LLM’s output for harmful text and block it from appearing, as a kind of last-layer defence against offensive output. Add a login screen, a way for users to access past chats, a few other niceties …

Finally, OpenAI launched ChatGPT in November 2022. And the world as we knew it changed forever.

Issue summary: Through reinforcement learning with human feedback (RLHF), the LLM is trained on labelled data until it can reliably follow instructions, avoid harmful output, and follow other desired behavior. A system prompt provides guidelines for output. The user’s prompt is inserted into a templated prompt and passed to the LLM, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive chatbot.

ChatGPT was the beginning of many other features to follow. Among them: multimodal models, and tool calls. The former is easy to understand, so let’s unpack how LLM tools work in the next issue.

What I’ll be covering next

Next issue: Issue 175: LLM tools

Issue 173: Training, Inference, and Scaling

2026-06-29T08:00:00+08:00

Previously: A model does not see letters or words, only tokens. These tokens are typically generated from user input through a pre-tokenizer program. Tokens are represented in the model as embeddings, a sequence of numbers representing the token’s position in the embedding matrix. The model uses each token’s embedding, and its surrounding tokens, to infer its meaning in context.

Model Training

In issue 171, I explained a little about how model training happens:

we pass tokens generated from text to the input
we pass the expected output (in supervised training), or the subsequent tokens (in unsupervised training)
the model generates output from input
we compare the model’s output to the expected output
we adjust model parameters
we repeat from step 3, attempting to adjust parameters to have the model generate output that is closer to the expected output

Notice that there’s a “forward” step: step 3, where the input “feeds forward” to each hidden layer. Here, the computer calculates the values for the next layer based on the values of the previous layer and on the model’s parameters between the two layers. This is repeated for each layer until we get to the output.

Notice also that there’s a “backward” step: step 5, where we could adjust model parameters randomly—inefficient! Instead, the mathematical technique of gradient descent gives us a more optimized way to adjust the last hidden layer based on how it would affect the output. The second-to-last hidden layer is then adjusted with the same technique, based on how it would affect the last hidden layer. And this is repeated all the way to the first hidden layer. This “backward trickling” is called backpropagation, or “backprop” more informally.

The above steps are repeated for each input:output data pair (supervised training) or for each token sequence run (unsupervised training). That’s a lot of repeated steps; researchers often have some shortcuts they take to speed up the process. Even then, it is still too many for a typical CPU to complete in a reasonable time; the big labs use specialized GPUs instead (Issue 123), resulting in training runs that take weeks to months to complete on multiple GPUs for today’s state-of-the-art LLMs.

This is not a cheap hobby.

Inference

Fortunately, using a model is a different affair, involving only steps 1 and 3 of the above. No backpropagation, no repeated runs. Just pass the input in, run one forward step per output token, repeat until done. This process is called inference, and is what happens when we users send a request to ChatGPT or Claude.

(Hang on, how does a model “know” when it is “done generating text”? In model training, a special token, e.g. <EOS> for end-of-sequence, is inserted at the end of text. When this token is detected in the program, it stops invoking the model.)

Needless to say, inference is much cheaper than training, which is why we are able to enjoy many of these models for free.

Scaling up to GPT-2

GPT-1 had 117 million parameters, was trained on ~7,000 books (about 5GB), took a few days to complete training on 8 GPUs, costing $0.5 mil or less.

In Nov 2019, OpenAI released GPT-2, which was the first large language model to capture some public attention. GPT-2 had 1.5 billion parameters (1.5B), was trained on ~40GB of text from the web, and took a few weeks to train on hundreds of GPUs, costing OpenAI $1 mil to $5 mil to train.

GPT-2 was the same architecture that GPT-1 used, only with a larger model (tenfold) and with more training data (eightfold). What they got was a model that:

could perform tasks it was never explicitly trained on (zero-shot learning): answer questions, understand text, summarize, translate (rudimentarily)
could generalize from examples given in user input (one-shot/few-shot learning) without needing supervised learning
showed emerging ability on non-language tasks: counting, basic arithmetic, even some attempts at simple proofs

These are capabilities we take for granted today, but in early 2019 this was cutting-edge performance never demonstrated by any other machine learning model, and certainly not with so little human supervision. This discovery was scary enough that it took OpenAI nine months to fully release GPT-2’s weights, fearing how its capabilities might be misused. The Verge reported: “OpenAI has published the text-generating AI it said was too dangerous to share”, but fortunately in the same article “the lab says it’s seen ‘no strong evidence of misuse so far’”.

The bitter lesson, and GPT-3

These findings prompted Rich Sutton, an influential machine learning researcher, to write a blog post published on 13 March 2019 where he summed up this finding in a single sentence: “The bitter lesson is that general methods that leverage computation are ultimately the most effective, and by a large margin.” Elaborating, he adds “seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation.”

A tenfold increase in model parameters and training data led to a surprising leap in capability. OpenAI and other researchers wondered: What if we pushed this to its logical conclusion, and threw more compute and more data into machine learning training?

In Jun 2020, OpenAI released GPT-3, available through their web API (Issue 4). GPT-3 had 175 billion parameters (175B, a hundredfold increase in model size), was trained on a mix of books and websites totalling 300 billion tokens, took weeks to train on hundreds of GPUs, and cost OpenAI up to $12 mil to train.

GPT-3 could:

take instructions given in natural language
reliably tackle many tasks zero-shot (with no examples)
reliably adapt examples given in the user input, and generalize from patterns

It had reached a level of capability that took the focus away from training data and placed it on the user input, called the prompt: without further training, the model could give you a response, the quality of which depended on the quality of your prompt.

Alarming behavior

LLMs had finally reached a point where they were easy enough to use by the general public. But before it could actually launch for public use, there were some concerns to be addressed.

For one, GPT-3 was extremely prone to hallucinations—making up things that never happened, papers that were never written, academic journals that never existed. It also readily reproduced toxic outputs from its data source—the internet (especially reddit and 4chan). It was extremely steerable through the prompt—a little too steerable for OpenAI’s liking, when some users got GPT-3 to leak its system prompt—the instructions that OpenAI prepended to every request guiding GPT-3’s response style and guardrails.

It would be some time before ChatGPT could even launch without dragging OpenAI down with it.

Issue summary: OpenAI discovered, through models GPT-1 to GPT-3, that scaling compute and (training) data alone was sufficient to sharply increase the capabilities of a LLM: the transformer architecture and unsupervised learning together resulted in a model that was alarmingly intelligent.

We are getting closer to the LLMs we know and love/hate today. This issue covered the miracle story of GPTs 1 to 3. If GPT-3 was a child genius, ChatGPT is GPT-3 dressed up for work. Let’s talk about what OpenAI had to do to it for public release—next issue.

What I’ll be covering next

Next issue: Issue 174: Reinforcement Learning

Issue 172: Tokens, the currency of LLMs

2026-06-22T08:00:00+08:00

Previously: The Transformer architecture, unlike previous machine learning model architectures, could generate its next item while processing all previous items at the same time. The technique of unsupervised learning trained models on unlabelled data, letting the model pick up patterns in underlying data instead of having it learn correct answers only, and was much faster than supervised learning. OpenAI applied both these ideas at scale, producing GPT-1, a model that beat best-performing models while requiring relatively little human supervision during training.

Wait—what exactly does a large language model (LLM) work with? Individual letters? Entire words? No, they work with—

Tokens

Tokens are clusters of letters that make up the training data. The large language model (LLM) does not “see” letters or words, only tokens.

Tokens are … quite unlike phonemes, syllables, or other word-fragments you and I are familiar with. They are typically programmatically generated by a separate program (not a model), based on letter-clusters that appear most frequently in the data.

For example, using OpenAI’s Tokenizer tool to visualize the above paragraph gives us this:

OpenAI Tokenizer - text view

OpenAI Tokenizer - token ID view

There is little human-discernible pattern as to what definitively constitutes a token: it could be a single punctuation mark, a letter or two (and sometimes including their preceding space, sometimes not), or an entire word.

Whatever the case, what we see as " you and I", a LLM sees as [481, 326, 357]. A pre-tokenizer program tokenizes all input into numerical values.

Now you understand a little better why ChatGPT struggles to count Rs in “strawberry”, or in any other fruit really.

Embeddings

How does the model tell 481, 326, and 357 apart? How does it store or represent them within itself? Here, I am going to need you to use your imagination. You are familiar with the concept of a scatter plot, yes? A graph that looks like this:

A scatterplot with 2 dimensions
Source: EmbeddedSource

Now imagine a scatterplot with as many data points as tokens. In GPT-1’s case, that’s approx. 40,000 tokens—its vocabulary size. Yes, I know that’s a lot of points, but you can roughly visualize that, yes? Good, that’s the easy part.

Now I need you to imagine the scatterplot with … *checks notes*—768 dimensions. No, that is not a typo, we are talking about a scatterplot with 768 dimensions. Oh, that’s too difficult to imagine? Yeah. Sorry, that’s why I don’t have an image attached. Just try your best 🙏

Essentially that is what a LLM generates as a result of its training. Each token in its vocabulary becomes a data point, and each data point is represented in this 768-dimensional space using 768 decimal numbers ranging from 0 to 1.0. This positional representation using many decimal numbers is called an embedding.

Other uses for embeddings

Embeddings are also not a new idea: they precede GPT by decades, having been conceptualized as early as the 1980s.

Because they’re such a handy and intuitive mathematical way to represent or visualize tokens and semantics, they’re also used often in semantic search engines (which try to infer what you mean instead of what you said), recommendation engines (suggesting similar things based on what you bought or liked), relevance scoring, etc.

How a LLM represents semantics

There’s more to a LLM than this collection of 40,000 embeddings; it forms only a tiny fraction of the entire model. But it is critical to how the LLM “learns” information from the text. Based on where the tokens appear relative to each other in the text, and the higher-order patterns that the model detects through its hidden layers, the model adjusts the embedding for each token, placing semantically similar ones closer to each other and dissimilar tokens farther away from each other.

And because this is a mathematical space with direction (in 768 dimensions), the model can also pick up on analogy to some extent: if you draw a (768-dimensional) arrow pointing king → queen and another arrow pointing father → mother within this embedding matrix, they end up almost parallel. This means the model can solve SAT vocab pairs, giving you “mother” when you give it “king:queen, father:?”

If an LLM relied only on this embedding matrix, it would not be able to distinguish “bat” as a warm flying mammal from “bat” as a piece of sporting equipment. The rest of the model—using the Transformer architecture, you’ll recall from issue 171—uses the tokens surrounding it and their positions to infer the context that “bat” is being used in.

Model pricing and limits

Most ChatGPT/Claude users are familiar with those products as subscriptions, where they pay a certain price per month to use ChatGPT/Claude for some arbitrary amount, and if they use too much too quickly they hit a usage limit and have to wait for it to reset.

But if you are a business, and using the API instead, you’ll be looking at a different page, such as the API pricing page for OpenAI’s API. Notice that prices are typically quoted in units of “1M tokens”, standing for “1 million tokens”. Now you know what those tokens are referring to.

Likewise, when Anthropic explains how usage and length limits work, and tell you that “Claude’s context window is 200K tokens”, you now know what they are referring to. More importantly, you know it doesn’t mean 200 characters or 200 words.

Issue summary: A model does not see letters or words, only tokens. These tokens are typically generated from user input through a pre-tokenizer program. Tokens are represented in the model as embeddings, a sequence of numbers representing the token’s position in the embedding matrix. The model uses each token’s embedding, and its surrounding tokens, to infer its meaning in context.

I would have gone on longer, but I think tokens are a pretty novel concept for most layfolks and deserve their own issue to sit with and digest before we talk about what a model does.

What I’ll be covering next

Next issue: Issue 173: Training, Inference, and Scaling

Issue 171: The first Generative Pre-Training model, GPT-1

2026-06-15T08:00:00+08:00

Previously: Models simplify and represent a relationship between input values and output values. The more complex the relationship, the more parameters the model needs to learn. Models are simplifications of reality, and their performance depends on how well they capture underlying patterns in the data, as well as the quality and quantity of the dataset.

We are going to set aside image and audio models for today, and narrow down to focus on language models in particular, because that’s what sparked off the AI craze.

The pre-2018 machine learning paradigm

I am not a machine learning researcher and can’t tell you what the prevailing research paradigm at that point was. But in open-source and consumer applications, it seemed machine learning models were bespoke:

You started with something specific you needed, like image classification, optical character recognition (OCR), speech recognition, translation, …
You collected a huuuuuge dataset of input-output pairs for that specific task: images and their labels, scanned documents and their text, audio, etc. And by huuuuuge I mean tens of thousands to millions of examples.
After collecting the data you often have to clean it up (remove duplicates, remove outliers, etc.) and label it (e.g. label images with their correct labels).
You then trained a model on part of the dataset, tweaking parameters and trying different architectures (ways of arranging parameters).
You tested the model on the other part of the dataset, passing each input through the model and comparing the output to the expected output, and measuring how well it performed.
You repeated steps 4 and 5 until you were satisfied with the model’s performance, and then you deployed it for use.

This technique of using labelled data to train the model is called supervised learning, because of the need to tweak the model’s parameters (under human supervision) to match the expected output.

There were (and still are) many machine learning models trained this way and used. For example, tesseract is an open-source OCR engine that was first released in 2005. It was trained on a dataset of scanned documents and their corresponding text, and has been used in various applications for OCR tasks. Another example is the ResNet architecture for image classification, which was introduced in 2015 and has been widely used for image recognition tasks.

The Transformer architecture

Before Google’s 2017 paper on the attention mechanism, the prevailing machine learning models had two problematic limitations:

they “looked” at input data one item at a time to produce the output, resulting in slow output generation
because of the above, data that was processed earlier seldom made it through to the end of the model, resulting in a recency bias: the model tended to focus on the most recent input data and ignore earlier input data

The attention mechanism introduced in Google’s 2017 paper allowed models to “look” at all input data at once, speeding up output generation. The same mechanism also computed which parts of the input data were most relevant for producing the output.

Attention was not a new mechanism in machine learning: prior models had used them, but in separate stages, and alongside other mechanisms. Google’s paper was the first to ask: “what if we only used attention everywhere?” The resulting architecture, which they called the “Transformer”, was a breakthrough in speed and simplicity.

Unsupervised learning

Besides the Transformer architecture, another breakthrough was already making its rounds: instead of task-specific datasets, researchers wondered why they needed so many task-specific datasets. Since the data represented different subsets of reality (from different tasks), what if they just trained a single model on a really, really large dataset of text to produce a base model? Then they could fine-tune it on smaller task-specific datasets to produce task-specific models.

This technique, called unsupervised learning, did not require data to be labelled—the model “learns” patterns in the underlying data without human correction, simply trying to predict the next word in the training data given the previous words.

Generative Pre-trained Transformer (GPT)

A few researchers at OpenAI then had the idea to try this pre-training approach on the Transformer architecture. OpenAI built the first Generative Pre-trained Transformer (GPT) model, which they released in 2018. Generative means the model generates output based on input, producing one output item at a time (but processing all inputs simultaneously). Pre-trained means the model was largely trained through unsupervised learning. Transformer refers to the underlying architecture.

They went big on scale: GPT-1 trained on a dataset of 7,000 self-published books comprising 985 million words, representing this data using 117 million parameters—an unheard-of scale at the time (but now considered paltry). It attracted attention from the research community not only by improving on best-performing models on various language tasks, but by improving on all of them, with minimal task-specific training.

Due to the unprecedented number of parameters used, GPT-1 was considered a large language model (LLM), to distinguish it from smaller models that came before. However, this was a research idea, with code that was far from release-ready, and nobody except research-minded folks knew how to get GPT-1 running. And thus, this went unnoticed by the public.

Still, this was a breakthrough: no research lab before OpenAI had the kind of resources that enabled them to try this idea. It did require resources that most labs didn’t have at the time: 8 GPUs, when most labs ran their training on a single GPU.

Issue summary: The Transformer architecture, unlike previous machine learning model architectures, could generate its next item while processing all previous items at the same time. The technique of unsupervised learning trained models on unlabelled data, letting the model pick up patterns in underlying data instead of having it learn correct answers only, and was much faster than supervised learning. OpenAI applied both these ideas at scale, producing GPT-1, a model that beat best-performing models while requiring relatively little human supervision during training.

We’re almost at the meaty part! I kinda snuck in 2 ideas today: the Transformer architecture (a minor part of this series actually) and unsupervised learning. I don’t think you would have wanted to wait a week in between before hearing how OpenAI combined the two, haha … so there you go.

What I’ll be covering next

Next issue: Issue 172: Tokens, the currency of LLMs

Wait—what exactly does a large language model (LLM) work with? Individual letters? Entire words? Find out next issue!

Issue 170: Machine learning models

2026-06-08T11:30:00+08:00

Previously: By better understanding how search bots categorise pages, a website owner can use keywords and other techniques to optimise the ranking of their page for specific search terms.

[Editor’s Note] Layman’s Guide to Computing went on hiatus after its 13th season, because my promise when I began was to write only things widespread enough that I thought layfolks should have an accessible-yet-useful introduction to.

As I wrapped up Season 13 in 2022, the trend at that time was cloud computing. I tackled emulation and virtualization in Season 12, then the internet and online services in Season 13. ChatGPT launched in November 2022 that year. In 2024, I was first asked if I would continue Layman’s Guide again to write about AI. I said no; less than half my colleagues were using ChatGPT or had heard of it, and I didn’t think there would be enough common knowledge for me to usefully write about AI yet.

But now, in 2026, even my employers are actively promoting genAI, my students are using ChatGPT, and by the end of this year it would likely be difficult to find someone who hasn’t heard of Claude Code or Gemini Pro or Codex. I suppose it’s time to add one more season.

There are many explainers out there; I’ve read a large number of them, many very good! But this is Layman’s Guide to Computing, and something I noticed talking to laypeople is confusion: where did this AI come from? Why hadn’t it been invented earlier? How does it work? What can it do? What can’t it do?

So let’s rewind time: I started writing Layman’s Guide to Computing in 2018. A year before that, eight machine learning engineers at Google had published “Attention is All You Need,” the paper that introduced the transformer architecture that underpins most of today’s genAI. In mid-2018, before I started writing, OpenAI was still a non-profit research lab founded by Elon Musk and Sam Altman, and had just released the first version of GPT, a language model that was not yet large enough to generate coherent text. Following Google’s whitepaper on the attention mechanism, they had just released a paper, “Improving Language Understanding by Generative Pre-Training”, that described the architecture and training process for GPT, their first large language model.

It’s a little hard to mentally reconstruct the tech culture and public awareness of the field of artificial intelligence and machine learning at that point in time. So let’s start by understanding: what is a model? How were they used then?

Models

You may not know it, but you were already using models in your daily life in 2017. When the iPhone launched, it had intelligent autocorrect and touch auto-adjustment features. For these features to work, Apple had to train machine learning models on large datasets of text and touch interactions. These models were then deployed on the iPhone to provide the autocorrect and touch adjustment functionality.

What are these models? You would likely have used them in a stats course, perhaps even in high school. If you were ever asked to sketch a best-fit line, a trendline, or a linear regression, you were already drawing a model. To do that, you:

Hypothesized a linear relationship between an input variable x and an output variable y.
Collected data points (x, y) through an experiment.
Represented the relationship between x and y using a mathematical formula (y = mx + b).
Determined the parameters m and b that best fit the data points.

You compressed the data—multiple sets of points (which we call a dataset)—into two parameters, m and b, a simpler representation that captures the underlying relationship. This representation is a model. (We sometimes call it a mental model when we don’t have it formally represented as a mathematical relationship, just a conceptual description.)

Apple’s machine learning models do something similar. An autocorrect model takes a dataset of incorrect words/phrases and their actual words/phrases, and compresses it into a text correction model. A touch auto-adjustment model takes a dataset of touch interactions and their intended targets, and compresses it into a model that can predict the intended touch target based on the touch input.

tl;dr A model takes in input values and produces output values based on patterns it has learned from training data.

More complex models

Of course, more complex models do not use a linear equation or a simple mathematical formula anymore. Machine learning researchers first represent more complex relationships using more complex formulas, such as polynomials or decision trees, which use more parameters.

But for other purposes the input may not be a single variable and the output may not be a single variable either. For example, in image recognition, the input is an image (which can be represented as a grid of pixel values), and the output is a label (e.g., “cat”, “dog”, “car”). An image classifier may have 64 input values (one for each pixel in an 8×8 image) and 10 output values (one for each possible label). The model would learn to map the input pixel values to the correct label based on patterns in the training data. That’s 640 parameters (64 input values x 10 output values) that the model would learn to adjust during training.

This direct mapping of input to output can only take us so far. Perhaps output 1 doesn’t just depend on inputs 1 to 10, but on some intermediate value calculated from them. Now we have to add intermediate layers between input and output, which researchers call “hidden layers”. These layers allow the model to learn and represent more complex relationships between input and output. Each layer can have its own parameters, and the model learns to adjust these parameters during training to improve its performance.

tl;dr More complex models use more parameters to represent the relationship between input values, intermediate values, and output values. Each parameter represents a relationship between two values. The more parameters, the more complex the relationships the model can learn.

Limitations of models

Models sound like mathematical dark magic, and often feel like it too. But like the mathematical models we learned in school, they have limitations.

If you’ve seen how far some of your data points deviate from your best-fit line or trendline, you already know that the model cannot accurately represent all the data points—it is only a simplification. Likewise, all machine learning models are simplifications of reality.

Their performance depends on how well they capture underlying patterns in the data: pick an inappropriate representation for the feature, e.g. a linear formula instead of a polynomial, and the model will perform poorly.

It is also possible to go to the other extreme, adding a complex model with many parameters that fits the training data perfectly, but does not predict other data points well—an overfitted model. You can have a computer come up with a sine-decay formula that fits your first 6 data points perfectly, but wildly overshoot a 7th data point.

Also, their performance depends on the quality and quantity of the dataset. If your data does not represent the underlying reality well enough, missing important patterns or exceptions, or not covering a sufficient variety of cases, the model can pick out the wrong features and learn the wrong patterns. In the early days of machine learning, some researchers found that when training image classifiers on images of dogs and cats, the model began identifying any brown creature sitting on grass as a dog, because the training dataset had many images of dogs sitting on grass, but few images of cats sitting on grass. The model had learned to associate grass with dogs, which was not the intended pattern.

Issue summary: Models simplify and represent a relationship between input values and output values. The more complex the relationship, the more parameters the model needs to learn. Models are simplifications of reality, and their performance depends on how well they capture underlying patterns in the data, as well as the quality and quantity of the dataset.

After experiencing the magic of ChatGPT and other genAI tools, it’s easy to forget, or perhaps not even realise, that fundamentally they are powered by the same underlying principles that we apply in simpler experiments.

But between y = mx + b and ChatGPT, there is still … such a huge gulf of complexity. We still have quite a way to go.

What I’ll be covering next

Next issue: Issue 171: The first Generative Pre-Training model, GPT-1

What was the fundamental insight that made GPT and other LLMs possible? Find out next season ;)