<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom"><title>Layman's Guide to Computing - Season 14</title><link href="https://ngjunsiang.github.io/laymansguide/" rel="alternate"></link><link href="https://ngjunsiang.github.io/laymansguide/feeds/season-14.atom.xml" rel="self"></link><id>https://ngjunsiang.github.io/laymansguide/</id><updated>2026-08-31T08:00:00+08:00</updated><entry><title>Issue 182: Running a model, part 2</title><link href="https://ngjunsiang.github.io/laymansguide/issue182.html" rel="alternate"></link><published>2026-08-31T08:00:00+08:00</published><updated>2026-08-31T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-08-31:/laymansguide/issue182.html</id><summary type="html">&lt;p&gt;Open-weight models range in size from sub-1B to 100+B. A range of device options below &lt;span class="caps"&gt;SGD6&lt;/span&gt;,000 are already capable of running these models, ranging from the humble Raspberry Pi for running harness support to the Mac Studio M3 for running 70B&amp;nbsp;models.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Quantization trades parameter precision for a smaller memory footprint and faster inference, making many models feasible for running on user devices. Model capabilities depend on their parameter count and training data. Models with higher parameter counts can represent more patterns, while model capabilities are added by training them on well-labeled&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;Last issue, we discovered that there are quantized models that can actually run on laptops. (You can also run &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 and &lt;span class="caps"&gt;GPT&lt;/span&gt;-2 on a laptop, but you would likely be disappointed in their performance today given the leaps-and-bounds improvements in &lt;span class="caps"&gt;AI&lt;/span&gt; capability that have happened since&amp;nbsp;2022.)&lt;/p&gt;
&lt;p&gt;Besides gemma-4-12b, what else can we&amp;nbsp;run?&lt;/p&gt;
&lt;h2&gt;Open-weight model&amp;nbsp;options&lt;/h2&gt;
&lt;p&gt;Open-weight enthusiasts have a number of well-known options available to them (sizes are&amp;nbsp;unquantized):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Google&amp;#8217;s Gemma 3 series, available in sizes 1B to&amp;nbsp;27B&lt;/li&gt;
&lt;li&gt;Google&amp;#8217;s Gemma 4 series, available in sizes 5B to&amp;nbsp;31B&lt;/li&gt;
&lt;li&gt;Microsoft&amp;#8217;s Phi series, available in sizes 2.7B to&amp;nbsp;14B&lt;/li&gt;
&lt;li&gt;Meta&amp;#8217;s Llama series, available in sizes 8B to&amp;nbsp;405B&lt;/li&gt;
&lt;li&gt;Alibaba&amp;#8217;s Qwen series, available in sizes 0.5B to&amp;nbsp;72B&lt;/li&gt;
&lt;li&gt;DeepSeek&amp;#8217;s eponymous DeepSeek series, available in sizes 16B to&amp;nbsp;72B&lt;/li&gt;
&lt;li&gt;Mistral&amp;#8217;s Mistral series, available in sizes 7B to&amp;nbsp;176B&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are also many lesser-known models, whose capabilities are still increasing every few&amp;nbsp;months.&lt;/p&gt;
&lt;p&gt;I won&amp;#8217;t give you a comprehensive low-down on what each model is good for,&amp;nbsp;because:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;the models can be fine-tuned by those who know how, and may have variants that are better at specific task&amp;nbsp;categories,&lt;/li&gt;
&lt;li&gt;the models are updated every few months, and see new capabilities added through post-training (supervised&amp;nbsp;learning),&lt;/li&gt;
&lt;li&gt;the agent harness and runtime do play a part: some models are useful &amp;#8220;out-of-the-box&amp;#8221;, some work best within a particular harness or with a particular set of&amp;nbsp;tools&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Model&amp;nbsp;capabilities&lt;/h2&gt;
&lt;p&gt;In &lt;a href="https://ngjunsiang.github.io/laymansguide/issue181.html"&gt;Issue 181&lt;/a&gt; I mentioned that &lt;strong&gt;more parameters&lt;/strong&gt; let the model represent more patterns in its weights, while &lt;strong&gt;better training data&lt;/strong&gt; determines the model&amp;#8217;s capabilities. Useful to know as a general pattern, but difficult to apply when deciding on a specific model to run. Should we just run the largest model that our device is capable of&amp;nbsp;running?&lt;/p&gt;
&lt;p&gt;As of June&amp;nbsp;2026:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;0.5B–3B&lt;/strong&gt; models can handle classification, extraction, and summarization tasks, and are generally good for single request-response&amp;nbsp;purposes.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;7B–9B&lt;/strong&gt; models are useful assistants (in a harness) that can hold short conversations, handle basic Q&amp;amp;A, do simple coding or tool calls, and otherwise generally match &lt;span class="caps"&gt;GPT&lt;/span&gt;-3&amp;#8217;s&amp;nbsp;capabilities.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;12B–15B&lt;/strong&gt; models can follow instructions consistently, generate code that mostly works (and do some debugging if necessary), generate tool calls more reliably, making them capable tool-using&amp;nbsp;agents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;27B–35B&lt;/strong&gt; models can handle most tasks, even across longer contexts: analyze documents and write reports, generate and debug code, execute requests involving multiple steps. With a well-designed harness and accurate task documentation, these become capable general-purpose&amp;nbsp;agents.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;70B&lt;/strong&gt; models can handle what previous tiers can do, but better: fewer hallucinations and mistakes, better answers, better general understanding, more consistent planning, and over longer context windows—smaller models sometimes see a sharp performance drop when the context window extends past a certain length. Some users report better reasoning performance as&amp;nbsp;well.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;100B+ Frontier models&lt;/strong&gt;—&lt;span class="caps"&gt;GPT&lt;/span&gt;-5, Claude Opus, Kimi K2.5, et al—can do all of the above, with state-of-the-art reasoning and thinking, knowledge, error recovery, ambiguity handling, and&amp;nbsp;more&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Specialized non-&lt;span class="caps"&gt;LLM&lt;/span&gt; models&amp;nbsp;include:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;OpenAI Whisper (0.4B–1.5B)&lt;/strong&gt; for speech-to-text&amp;nbsp;transcription&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stable Diffusion (0.9B–8B)&lt;/strong&gt; for text-to-image&amp;nbsp;generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;span class="caps"&gt;FLUX&lt;/span&gt;.1 (12B)&lt;/strong&gt; also for text-to-image&amp;nbsp;generation&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;span class="caps"&gt;CLIP&lt;/span&gt; (0.4B)&lt;/strong&gt; for image-to-text&amp;nbsp;understanding&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Stable Audio 3 (0.6B–2B)&lt;/strong&gt; for text-to-audio&amp;nbsp;generation&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Models are still improving through post-training (supervised learning) and distillation, a process by which small models are trained on output from larger, more capable models. A 9B model today already exhibits capabilities that &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 (175B) had in 2022, so you should expect a different set of capability tiers this time next&amp;nbsp;year.&lt;/p&gt;
&lt;h2&gt;Hardware&amp;nbsp;options&lt;/h2&gt;
&lt;p&gt;The sweet spot for &amp;#8220;value-for-money&amp;#8221; sits around &lt;strong&gt;12B–35B&lt;/strong&gt; for now. Smaller models are faster and use less memory. Speed decreases and memory use increases as model size&amp;nbsp;increases.&lt;/p&gt;
&lt;p&gt;With this in mind, these are some popular options for running models on-device (local deployment) as of June 2026 (prices are Singapore&amp;nbsp;retail):&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Raspberry Pi&lt;/strong&gt; (8–&lt;span class="caps"&gt;16GB&lt;/span&gt; &lt;span class="caps"&gt;RAM&lt;/span&gt;): popular for tiny models (2B or smaller), used to generate document embeddings for search, &lt;span class="caps"&gt;OCR&lt;/span&gt; documents and clean up the OCRed text, etc. These form the support system for the agent harness, and usually are not used directly for the agent&amp;nbsp;models.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mini-PCs&lt;/strong&gt; with a sufficiently capable &lt;span class="caps"&gt;CPU&lt;/span&gt; and no dedicated &lt;span class="caps"&gt;GPU&lt;/span&gt; are a decent budget option.&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;&lt;span class="caps"&gt;AMD&lt;/span&gt; Ryzen &lt;span class="caps"&gt;AI&lt;/span&gt; 300 CPUs, 12 &lt;span class="caps"&gt;CPU&lt;/span&gt; cores, 8–12 &lt;span class="caps"&gt;GPU&lt;/span&gt; compute units &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;span class="caps"&gt;64GB&lt;/span&gt; &lt;span class="caps"&gt;RAM&lt;/span&gt;&lt;/strong&gt;: this can run 7B–13B models capably (if slowly), and 34B quantized models at a crawl. [~&lt;span class="caps"&gt;SGD2&lt;/span&gt;,000]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;&lt;span class="caps"&gt;AMD&lt;/span&gt; Ryzen &lt;span class="caps"&gt;AI&lt;/span&gt; &lt;span class="caps"&gt;MAX&lt;/span&gt;+ (Strix Halo) CPUs, 16 &lt;span class="caps"&gt;CPU&lt;/span&gt; cores, 40–48 &lt;span class="caps"&gt;GPU&lt;/span&gt; compute units &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;span class="caps"&gt;256GB&lt;/span&gt; &lt;span class="caps"&gt;RAM&lt;/span&gt;&lt;/strong&gt;: this bundles a much more capable integrated &lt;span class="caps"&gt;GPU&lt;/span&gt; (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue123.html"&gt;Issue 123&lt;/a&gt;) and can run 34B models capably, 70B models at a crawl. [~&lt;span class="caps"&gt;SGD4&lt;/span&gt;,800]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mac Mini M4, 12 &lt;span class="caps"&gt;CPU&lt;/span&gt; cores, 10 &lt;span class="caps"&gt;GPU&lt;/span&gt; compute units &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;span class="caps"&gt;24GB&lt;/span&gt; &lt;span class="caps"&gt;RAM&lt;/span&gt;&lt;/strong&gt;: In a similar category to the Ryzen &lt;span class="caps"&gt;AI&lt;/span&gt; 300. [&lt;span class="caps"&gt;SGD1&lt;/span&gt;,299]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mac Mini M4 Pro, 14 &lt;span class="caps"&gt;CPU&lt;/span&gt; cores, 20 &lt;span class="caps"&gt;GPU&lt;/span&gt; compute units &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;span class="caps"&gt;48GB&lt;/span&gt; &lt;span class="caps"&gt;RAM&lt;/span&gt;&lt;/strong&gt;: In a similar category to the Ryzen &lt;span class="caps"&gt;AI&lt;/span&gt; &lt;span class="caps"&gt;MAX&lt;/span&gt;+. [&lt;span class="caps"&gt;SGD2&lt;/span&gt;,659]&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mac Studio M3 Ultra, 28 &lt;span class="caps"&gt;CPU&lt;/span&gt; cores, 60 &lt;span class="caps"&gt;GPU&lt;/span&gt; compute units &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;span class="caps"&gt;96GB&lt;/span&gt; &lt;span class="caps"&gt;RAM&lt;/span&gt;&lt;/strong&gt;: With the highest memory bandwidth of all the units in this category, this can run everything mentioned above, and even run 70B models decently well. That&amp;#8217;s what most folks would be buying this for.&lt;br /&gt;
A higher-end 32 &lt;span class="caps"&gt;CPU&lt;/span&gt; core, 80 &lt;span class="caps"&gt;GPU&lt;/span&gt; compute unit configuration exists if you add &lt;span class="caps"&gt;SGD2&lt;/span&gt;,025; it doesn&amp;#8217;t add new capabilities, but makes everything a little faster. [&lt;span class="caps"&gt;SGD5&lt;/span&gt;,199]&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Full PCs&lt;/strong&gt; with a capable &lt;span class="caps"&gt;CPU&lt;/span&gt; &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; dedicated &lt;span class="caps"&gt;GPU&lt;/span&gt;&lt;ul&gt;
&lt;li&gt;Many options exist here, none below &lt;span class="caps"&gt;SGD6&lt;/span&gt;,000, most above &lt;span class="caps"&gt;SGD10&lt;/span&gt;,000. Dedicated GPUs capable of running &lt;span class="caps"&gt;AI&lt;/span&gt; models already have prices in the&amp;nbsp;thousands.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If you already have an existing laptop/&lt;span class="caps"&gt;PC&lt;/span&gt; and want to know how it will manage different model sizes, you can ask ChatGPT or Claude; they are pretty up-to-date with hardware capabilities and can give you an estimate. Alternatively, try to download and run the models and see for yourself—ground truth doesn&amp;#8217;t care about your&amp;nbsp;estimates.&lt;/p&gt;
&lt;h2&gt;Cloud&amp;nbsp;options&lt;/h2&gt;
&lt;p&gt;Wow that&amp;#8217;s a lot of zeros. Besides, owning hardware comes with its own maintenance needs and headaches. Enter the cloud, i.e.&amp;nbsp;pay-per-use.&lt;/p&gt;
&lt;p&gt;If you don&amp;#8217;t want to have to manage the hardware that runs these models, don&amp;#8217;t plan to be running a model long-term, or want to run a model larger than what your hardware can handle, these are currently the most user-friendly&amp;nbsp;options:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/models?inference_provider=hf-inference"&gt;HuggingFace&lt;/a&gt; not only catalogues model weights, it also automates inference hosting (provided by &lt;span class="caps"&gt;AWS&lt;/span&gt; or Google Cloud underneath). Caveat: not all models are supported; you need a model that lists &amp;#8220;&lt;span class="caps"&gt;HF&lt;/span&gt; Inference &lt;span class="caps"&gt;API&lt;/span&gt;&amp;#8221; as an Inference Provider. The HuggingFace link in this bullet point links you to models that do. On the model card page, click &lt;strong&gt;Deploy &amp;gt; &lt;span class="caps"&gt;HF&lt;/span&gt; Inference&amp;nbsp;Endpoints&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://replicate.com/explore"&gt;Replicate&lt;/a&gt; provides an even simpler interface, but for a smaller catalogue of models. Try out the models directly on the model card page, or create an account for deployment&amp;nbsp;options.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://fireworks.ai/models"&gt;Fireworks &lt;span class="caps"&gt;AI&lt;/span&gt;&lt;/a&gt; is where you go once you&amp;#8217;ve decided on a (supported) model and want reliable hosting. Browse their model list and click &lt;strong&gt;Try In Playground&lt;/strong&gt; or &lt;strong&gt;Deploy On Demand&lt;/strong&gt; (requires&amp;nbsp;registration).&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;There are other options that require more technical expertise to use, but if you reach that point you shouldn&amp;#8217;t be relying on a layman&amp;#8217;s guide anymore&amp;nbsp;:)&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Open-weight models range in size from sub-1B to 100+B. A range of device options below &lt;span class="caps"&gt;SGD6&lt;/span&gt;,000 are already capable of running these models, ranging from the humble Raspberry Pi for running harness support to the Mac Studio M3 for running 70B models. For larger models, or short-term workloads, cloud options for deploying and running open-weight models also&amp;nbsp;exist.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;This is the most tentative issue for this season, and probably for the entire newsletter so far. I try not to write issues that I will have to retroactively edit as the frontier shifts, but I&amp;#8217;ll make this an exception: I think expounding on available open-weight models illustrates how the ecosystem resembles open-source software in allowing the (sufficiently educated) public to experiment and provide feedback, how advances in &lt;span class="caps"&gt;AI&lt;/span&gt; over the past 3–4 years have made these models feasible to run on consumer-class devices, and how cloud infrastructure has made larger models accessible to those who don&amp;#8217;t own sufficiently powerful&amp;nbsp;hardware.&lt;/p&gt;
&lt;h2&gt;The Layman&amp;#8217;s Guide to Computing&amp;nbsp;archive&lt;/h2&gt;
&lt;p&gt;Buttondown still does not have a very browseable archive, so I&amp;#8217;ve made the newsletter content available on a static site. You can browse past seasons more easily at &lt;a href="https://ngjunsiang.github.io/laymansguide/categories"&gt;https://ngjunsiang.github.io/laymansguide/categories&lt;/a&gt;.&lt;/p&gt;
&lt;p&gt;I may add more seasons in future, as computing technology stabilizes enough for me to write about them in a static newsletter. If you&amp;#8217;d like to receive future issues, do subscribe&amp;nbsp;below:&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 181: Quantization</title><link href="https://ngjunsiang.github.io/laymansguide/issue181.html" rel="alternate"></link><published>2026-08-24T08:00:00+08:00</published><updated>2026-08-24T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-08-24:/laymansguide/issue181.html</id><summary type="html">&lt;p&gt;Quantization trades parameter precision for a smaller memory footprint and faster inference, making many models feasible for running on user devices. Model capabilities depend on their parameter count and training data. Models with higher parameter counts can represent more patterns, while model capabilities are added by training them on well-labeled&amp;nbsp;data.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Proprietary models do not have their weights published publicly, while open-weight models do. Various runtimes are available for download, and can run models that have a compatible file format. But models are extremely compute- and memory-intensive, requiring extremely high-end hardware and capacious memory to&amp;nbsp;run.&lt;/p&gt;
&lt;p&gt;Great, so a 12B model takes up &lt;span class="caps"&gt;24GB&lt;/span&gt; of disk space, uses &lt;span class="caps"&gt;24GB&lt;/span&gt; of &lt;span class="caps"&gt;RAM&lt;/span&gt;, and needs up to &lt;span class="caps"&gt;96GB&lt;/span&gt; more for the &lt;span class="caps"&gt;KV&lt;/span&gt; cache (the model&amp;#8217;s calculated representation of input tokens). That&amp;#8217;s out of reach for most consumers without &lt;span class="caps"&gt;AI&lt;/span&gt;-grade GPUs, which currently cost tens of thousands per&amp;nbsp;unit.&lt;/p&gt;
&lt;p&gt;Enter&amp;nbsp;quantization.&lt;/p&gt;
&lt;h2&gt;Parameter&amp;nbsp;representation&lt;/h2&gt;
&lt;p&gt;Models are typically trained at full precision, storing each parameter using 16 bits (2 bytes). This is necessary because the training process makes many small adjustments to the weights. If the intermediate values are not stored at full precision, subsequent adjustments to those values are not accurately represented, and the trained model may end up less&amp;nbsp;accurate.&lt;/p&gt;
&lt;p&gt;However, once the model is trained and its weights released, they are effectively &amp;#8220;frozen&amp;#8221;: the weights do not change as the model is used for inference (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue173.html"&gt;Issue 173&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;Quantizing&amp;nbsp;parameters&lt;/h2&gt;
&lt;p&gt;Can we reduce the model size and memory footprint by reducing the precision? Yes, at a cost. Experiments have shown that models lose some accuracy when their parameters are quantized, that is, represented using 8 bits (twofold reduction) or even 4 bits (fourfold reduction!). Below that range, running the model at 2 bits often results in unacceptable&amp;nbsp;performance.&lt;/p&gt;
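&lt;p&gt;To make this concrete, here is a toy Python sketch of quantization, not the method used by any particular model file. It assumes the simplest possible scheme: every weight is rounded to the nearest of a few evenly spaced levels, with one shared scale for the whole list. The function names and the example weights are made up for illustration; real quantized formats are more&amp;nbsp;elaborate.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def quantize(weights, bits):
    """Map each weight to a small signed integer, plus one shared scale."""
    levels = 2 ** (bits - 1) - 1  # 7 for 4 bits, 127 for 8 bits
    scale = max(abs(w) for w in weights) / levels
    return [round(w / scale) for w in weights], scale


def dequantize(codes, scale):
    """Recover approximate weights from the integer codes."""
    return [c * scale for c in codes]


weights = [0.12, -0.53, 0.98, -0.07, 0.31]  # made-up example values
for bits in (8, 4, 2):
    codes, scale = quantize(weights, bits)
    restored = dequantize(codes, scale)
    worst_error = max(abs(w - r) for w, r in zip(weights, restored))
    print(bits, codes, round(worst_error, 3))
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The fewer bits available, the coarser the levels and the larger the rounding error. In this toy scheme, at 2 bits each weight can only become -1, 0, or 1 times the&amp;nbsp;scale.&lt;/p&gt;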
&lt;p&gt;This inaccuracy shows up as models following instructions less well, making more noticeable mistakes (especially on complex tasks), or being less accurate with tool call syntax. However, compared to the alternative of not running the model at all, this is usually an acceptable tradeoff for users running the model on their own&amp;nbsp;computers.&lt;/p&gt;
&lt;h2&gt;Running a quantized&amp;nbsp;model&lt;/h2&gt;
&lt;p&gt;Okay, let&amp;#8217;s run those numbers on a quantized Gemma 4 12B model. We don&amp;#8217;t even need to do the quantization ourselves usually: other enthusiasts have already done it, &lt;a href="https://huggingface.co/Brunobkr/OFFELLIA_Q4_0_gemma-4-12B-it.gguf"&gt;providing the models on HuggingFace as well&lt;/a&gt; (they can be identified through the &amp;#8220;Q4&amp;#8221; in the model naming scheme; 8-bit quantized models are labelled&amp;nbsp;&amp;#8220;Q8&amp;#8221;).&lt;/p&gt;
&lt;p&gt;We already see immediate benefits: the 4-bit quantized model weights are only &lt;span class="caps"&gt;7GB&lt;/span&gt;, a stark contrast to the &lt;span class="caps"&gt;24GB&lt;/span&gt; of full-precision&amp;nbsp;weights.&lt;/p&gt;
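&lt;p&gt;A rough Python sketch of the size estimate at each precision, assuming every one of the 12 billion parameters is stored at exactly the stated bit width (the helper name is made up for illustration). Real Q4 files tend to come out a little larger than this estimate, since they also have to store some extra data alongside the quantized&amp;nbsp;weights.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;PARAMETER_COUNT = 12e9  # a 12B model


def weights_gb(bits_per_parameter):
    """Estimated weight size in GB (billions of bytes) at a given precision."""
    return PARAMETER_COUNT * bits_per_parameter / 8 / 1e9


for bits in (16, 8, 4):
    print(f"{bits}-bit: ~{weights_gb(bits):.0f} GB")
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This prints roughly 24, 12, and 6 &lt;span class="caps"&gt;GB&lt;/span&gt;: halving the bits per parameter halves the size of the&amp;nbsp;weights.&lt;/p&gt;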
&lt;p&gt;The &lt;span class="caps"&gt;KV&lt;/span&gt; cache requirement now drops to ~&lt;span class="caps"&gt;6GB&lt;/span&gt; for 32K tokens, and ~&lt;span class="caps"&gt;50GB&lt;/span&gt; for 256K tokens. &lt;em&gt;Very&lt;/em&gt; uncomfortable for a MacBook, which means we would have to limit ourselves to a 128K or even 64K token context length. Annoying, but not&amp;nbsp;show-stopping.&lt;/p&gt;
&lt;p&gt;The inference speed now increases to ~60 tokens/sec, about as responsive as ChatGPT or other&amp;nbsp;chatbots!&lt;/p&gt;
&lt;h2&gt;What do we gain from larger&amp;nbsp;models?&lt;/h2&gt;
&lt;p&gt;Unlike programs or data files, which store data as-is (perhaps compressing them for a smaller filesize), models &lt;strong&gt;represent&lt;/strong&gt; information: the training process produces a highly compressed set of numbers that are able to approximately reproduce the training contents (not 100% accurately, but quite close), and more importantly generate tokens following the same pattern for inputs that it was not trained&amp;nbsp;on.&lt;/p&gt;
&lt;p&gt;What if we try to break the laws of physics, taking &lt;span class="caps"&gt;GPT&lt;/span&gt; or Claude&amp;#8217;s training corpus, and training it into a 1B model? What&amp;nbsp;happens?&lt;/p&gt;
&lt;p&gt;1B parameters means the model only has 1 billion numbers to try to represent everything. If the training data is repetitive and largely similar, 1B might even be sufficient since there just isn&amp;#8217;t that much variation in the&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;But if the data is highly varied, the model might not be able to adjust the weights to represent everything. It will end up storing one additional data point at the expense of worse representation for other data points. This might show up as a plateau in benchmark scores: the model can&amp;#8217;t improve further. Or it might show up as the model not &amp;#8220;remembering&amp;#8221; data that shows up less&amp;nbsp;frequently.&lt;/p&gt;
&lt;p&gt;What do frontier models, often with parameter counts running into trillions, gain? With so many parameters, they can represent more patterns: more thinking scaffolds and reasoning frameworks, more sentence/paragraph patterns from more books and articles, etc. And not just more patterns, but higher-order patterns: writing styles, writing intents, idea development, longform writing structure,&amp;nbsp;etc.&lt;/p&gt;
&lt;p&gt;Google&amp;#8217;s Gemma 4 12B model will end up not being able to represent everything. Our running model might give less nuanced answers, consider fewer perspectives in its answer, and otherwise give worse&amp;nbsp;answers.&lt;/p&gt;
&lt;p&gt;But hey, it runs! Give it a spin, see what you can do with 12B&amp;nbsp;parameters.&lt;/p&gt;
&lt;h2&gt;Model&amp;nbsp;capabilities&lt;/h2&gt;
&lt;p&gt;Even frontier models with poor training data will disappoint. 1 trillion parameters won&amp;#8217;t necessarily make a model much smarter if the training data is&amp;nbsp;poor.&lt;/p&gt;
&lt;p&gt;Most new capabilities are added through additional training, usually supervised learning. If we can&amp;#8217;t train the underlying model, we might be able to create skill files explaining how to do something, let the harness read it and add it into the input context, and lean on the model&amp;#8217;s pattern-following capabilities to tackle the&amp;nbsp;task.&lt;/p&gt;
&lt;p&gt;Either way, if you have the hardware to support it and manage to get a local agent running, try it with different questions and tasks to get a feel for what it can and cannot handle. That beats any amount of reading on what these models are supposed to be able to&amp;nbsp;do.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Quantization trades parameter precision for a smaller memory footprint and faster inference, making many models feasible for running on user devices. Model capabilities depend on their parameter count and training data. Models with higher parameter counts can represent more patterns, while model capabilities are added by training them on well-labeled&amp;nbsp;data.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;12 issues in, that&amp;#8217;s a wrap! At this point I think what I&amp;#8217;ve written is what&amp;#8217;s unlikely to change in the next couple of years, and still useful for layfolks to know about the ongoing &lt;span class="caps"&gt;AI&lt;/span&gt; development. Anything newer is still in active&amp;nbsp;development.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue182.html"&gt;Issue 182: Running a model, part&amp;nbsp;2&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;In the last issue, I&amp;#8217;ll explore other options for running a model on your device (called &lt;strong&gt;local deployment&lt;/strong&gt; in parlance): running smaller models, and other feasible hardware&amp;nbsp;options.&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 180: Running a model</title><link href="https://ngjunsiang.github.io/laymansguide/issue180.html" rel="alternate"></link><published>2026-08-17T08:00:00+08:00</published><updated>2026-08-17T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-08-17:/laymansguide/issue180.html</id><summary type="html">&lt;p&gt;Proprietary models do not have their weights published publicly, while open-weight models do. Various runtimes are available for download, and can run models that have a compatible file format. But models are extremely compute- and memory-intensive, requiring extremely high-end hardware and capacious memory to&amp;nbsp;run.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Agents are software applications that comprise a harness, a runtime, and a model (typically accessed through an &lt;span class="caps"&gt;API&lt;/span&gt; instead of directly executed on the computer). They enable a user to type in a request or send it by other means and thus instruct the agent to carry out a task on the computer until completion. The capabilities of agents are limited by the tools available to&amp;nbsp;them.&lt;/p&gt;
&lt;p&gt;As of June 2026, OpenAI and Anthropic charge about $20/mth for their Pro/Plus plan, and about $200/mth for their Max plan. For those of us who like to stay on free tiers, it can be pretty annoying to hit the dreaded &amp;#8220;You have reached the limit for Free plan&amp;#8221;, but what can we do short of shelling out for a higher&amp;nbsp;tier?&lt;/p&gt;
&lt;p&gt;Wait—if a language model is a bunch of numbers, and a runtime is just a program, why can&amp;#8217;t I run it on my own computer&amp;nbsp;instead?&lt;/p&gt;
&lt;h2&gt;Proprietary models and open-weight&amp;nbsp;models&lt;/h2&gt;
&lt;p&gt;For starters, you can&amp;#8217;t download the &lt;span class="caps"&gt;GPT&lt;/span&gt;-5 or Claude models. They are proprietary models, and their weights (the file containing the model&amp;#8217;s parameters) are a guarded trade secret; a leak of the weights would be disastrous for OpenAI or&amp;nbsp;Anthropic.&lt;/p&gt;
&lt;p&gt;&lt;em&gt;Okay, fine&lt;/em&gt; you say, &lt;em&gt;then let&amp;#8217;s run something I can actually download.&lt;/em&gt; As of 2026, that typically means you would go to &lt;a href="https://huggingface.co/"&gt;HuggingFace&lt;/a&gt; (yes, that is their actual name), currently the world&amp;#8217;s largest platform for hosting open-weight models. An open-weight model, analogous to open-source software, means the model&amp;#8217;s weights are publicly available and you can download&amp;nbsp;them.&lt;/p&gt;
&lt;h2&gt;The parts: downloading&amp;nbsp;weights&lt;/h2&gt;
&lt;p&gt;Let&amp;#8217;s download the currently top-trending model, Google&amp;#8217;s &lt;a href="https://huggingface.co/google/gemma-4-12B-it"&gt;&lt;code&gt;gemma-4-12B-it&lt;/code&gt;&lt;/a&gt;. The &lt;a href="https://huggingface.co/google/gemma-4-12B-it"&gt;model card&lt;/a&gt; says that this is a multimodal model (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue177.html"&gt;Issue 177&lt;/a&gt;) with 11.95 billion (12B) parameters (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue170.html"&gt;Issue 170&lt;/a&gt;). It has a context length of 256K tokens (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue172.html"&gt;Issue 172&lt;/a&gt;). This matters when deciding what kind of tasks it can plausibly take on, since the input and output tokens together cannot exceed the context&amp;nbsp;length.&lt;/p&gt;
&lt;p&gt;Under &lt;a href="https://huggingface.co/google/gemma-4-12B-it/tree/main"&gt;Files and versions&lt;/a&gt;, we see a whole bunch of files, most of them metadata, configuration information, and other data (such as the token list). The model weights are easy to spot: they are by far the largest file in the collection, weighing in at 23.&lt;span class="caps"&gt;9GB&lt;/span&gt;. We can check this figure: 11.95 billion parameters, with each parameter taking up 16 bits (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue040.html"&gt;Issue 40&lt;/a&gt;), means 2 bytes per parameter, and thus 23.9 billion bytes for all the parameters, i.e. 23.&lt;span class="caps"&gt;9GB&lt;/span&gt;.&lt;/p&gt;
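&lt;p&gt;The same arithmetic as a minimal Python sketch. The function name &lt;code&gt;weights_size_gb&lt;/code&gt; is made up for illustration, and &lt;span class="caps"&gt;GB&lt;/span&gt; here is taken to mean one billion&amp;nbsp;bytes.&lt;/p&gt;
&lt;pre&gt;&lt;code&gt;def weights_size_gb(parameter_count, bits_per_parameter):
    """Estimate the size of a model's weights in GB (billions of bytes)."""
    bytes_per_parameter = bits_per_parameter / 8  # 8 bits per byte
    return parameter_count * bytes_per_parameter / 1e9


# gemma-4-12B-it: 11.95 billion parameters stored at 16 bits each
print(round(weights_size_gb(11.95e9, 16), 1))  # prints 23.9
&lt;/code&gt;&lt;/pre&gt;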
&lt;h2&gt;The&amp;nbsp;runtime&lt;/h2&gt;
&lt;p&gt;You have a few options here, listed from easiest to most&amp;nbsp;difficult:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;a href="https://lmstudio.ai/"&gt;&lt;span class="caps"&gt;LM&lt;/span&gt; Studio&lt;/a&gt; – Comes with a graphical user interface (&lt;span class="caps"&gt;GUI&lt;/span&gt;): click to load the model and you get a chat interface. Great for getting started &lt;span class="caps"&gt;ASAP&lt;/span&gt;, not great if you eventually want to use it as an&amp;nbsp;agent.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ollama.com/"&gt;Ollama&lt;/a&gt; – A command-line program, requiring some terminal chops. Sets up an &lt;span class="caps"&gt;API&lt;/span&gt; server that you can use with many other&amp;nbsp;programs.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/huggingface/transformers"&gt;Hugging Face Transformers&lt;/a&gt; – A Python library for working with models, which means it&amp;#8217;s programmers-only. Great if you are building or customizing your own agent harness, but definitely not ready-to-run&amp;nbsp;as-is.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/ggml-org/llama.cpp"&gt;llama.cpp&lt;/a&gt; – The most low-level, close-to-the-metal option. Gives you a command-line program for using the model, but you have to manage all the other technical details on your own. Not for the&amp;nbsp;faint-hearted.&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/vllm-project/vllm"&gt;vLLM&lt;/a&gt; – A &lt;span class="caps"&gt;GPU&lt;/span&gt;-only library for serving models over an &lt;span class="caps"&gt;API&lt;/span&gt;. Presumably we do not have four thousand bucks to spend on an entry-level &lt;span class="caps"&gt;GPU&lt;/span&gt; for models, such as the &lt;span class="caps"&gt;RTX&lt;/span&gt; 4090 with &lt;span class="caps"&gt;24GB&lt;/span&gt; of &lt;span class="caps"&gt;GPU&lt;/span&gt; memory, and are running the model on a &lt;span class="caps"&gt;CPU&lt;/span&gt;, so this option is automatically disqualified for&amp;nbsp;us.&lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;Hardware&amp;nbsp;requirements&lt;/h2&gt;
&lt;p&gt;Great. So we&amp;#8217;ve downloaded and installed &lt;span class="caps"&gt;LM&lt;/span&gt; Studio, launched it, and then selected&amp;nbsp;our &lt;code&gt;gemma-4-12B-it&lt;/code&gt; model for&amp;nbsp;loading.&lt;/p&gt;
&lt;p&gt;&lt;img alt="A screenshot of LM Studio" src="https://ngjunsiang.github.io/laymansguide/lm-studio.png" /&gt;&lt;br /&gt;
&lt;em&gt;A screenshot of &lt;span class="caps"&gt;LM&lt;/span&gt; Studio&lt;/em&gt;&lt;br /&gt;
Source: &lt;a href="https://lmstudio.ai/"&gt;&lt;span class="caps"&gt;LM&lt;/span&gt;&amp;nbsp;Studio&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;The first thing that would probably happen is your system will complain about insufficient memory and stop. You see, to run this model, we would need to read the model weights (23.&lt;span class="caps"&gt;9GB&lt;/span&gt;) into memory, immediately using up &lt;span class="caps"&gt;24GB&lt;/span&gt; of memory. Even assuming no other apps are running, we still need more memory for the following:
- operating system overhead (~1-&lt;span class="caps"&gt;2GB&lt;/span&gt;)
- memory used by the runtime (1-&lt;span class="caps"&gt;3GB&lt;/span&gt;)&lt;/p&gt;
&lt;p&gt;Oh? It didn&amp;#8217;t crash for you? I see, you had the Macbook Pro with &lt;span class="caps"&gt;64GB&lt;/span&gt; memory, or something in that weight class. Great, let&amp;#8217;s start prompting your model then. It won&amp;#8217;t work as quickly as ChatGPT, but it should manage a comfortable ~20–30 tokens/sec, slightly slower than reading speed but&amp;nbsp;useable.&lt;/p&gt;
&lt;p&gt;Unfortunately, as you ask more and more questions within the same session, it will run more and more slowly, and eventually it will crash. You see, the model generates a representation of the entire input, called the &lt;strong&gt;&lt;span class="caps"&gt;KV&lt;/span&gt; cache&lt;/strong&gt;, which stores its computed values for how each token in the input relates to other tokens in the input. This is estimated to take up ~&lt;span class="caps"&gt;12GB&lt;/span&gt; for 32K tokens, so ~&lt;span class="caps"&gt;96GB&lt;/span&gt; if using the full 256K context&amp;nbsp;length.&lt;/p&gt;
&lt;p&gt;Yeah, this isn&amp;#8217;t for the&amp;nbsp;faint-hearted.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Proprietary models do not have their weights published publicly, while open-weight models do. Various runtimes are available for download, and can run models that have a compatible file format. But models are extremely compute- and memory-intensive, requiring extremely high-end hardware and capacious memory to&amp;nbsp;run.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;This is the pessimistic view. Next issue, we look at some optimizations that are available even to newcomers to enable models to run faster and with a smaller memory&amp;nbsp;footprint.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue181.html"&gt;Issue 181:&amp;nbsp;Quantization&lt;/a&gt;&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 179: Agents</title><link href="https://ngjunsiang.github.io/laymansguide/issue179.html" rel="alternate"></link><published>2026-08-10T08:00:00+08:00</published><updated>2026-08-10T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-08-10:/laymansguide/issue179.html</id><summary type="html">&lt;p&gt;Agents are software applications that comprise a harness, a runtime, and a model (typically accessed through an &lt;span class="caps"&gt;API&lt;/span&gt; instead of directly executed on the computer). They enable a user to type in a request or send it by other means and thus instruct the agent to carry out a task on the computer until completion. The capabilities of agents are limited by the tools available to&amp;nbsp;them.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Thinking/reasoning models are those that have been trained on examples of how to think about different problems in different domains, or plan and execute complex tasks. They often use tools to aid them in goal tracking and updating. The full thinking trace from the model may be removed or hidden to present a more legible response to the&amp;nbsp;user.&lt;/p&gt;
&lt;p&gt;Let&amp;#8217;s review the ingredients we have so&amp;nbsp;far:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;A large language model (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue170.html"&gt;Issue 170&lt;/a&gt;) or multimodal model (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue177.html"&gt;Issue 177&lt;/a&gt;): a next-token predictor that takes input tokens and keeps generating output tokens which feed back to the&amp;nbsp;input&lt;/li&gt;
&lt;li&gt;Training data, which the model is trained on to pick up general patterns through unsupervised learning (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue171.html"&gt;Issue 171&lt;/a&gt;), and then steered to avoid harmful output and generate useful output through the use of labelled training data through supervised learning (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue174.html"&gt;Issue 174&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;A runtime (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue175.html"&gt;Issue 175&lt;/a&gt;), which handles multiple responsibilities:&lt;ul&gt;
&lt;li&gt;parsing the model output to block it if found to be&amp;nbsp;harmful&lt;/li&gt;
&lt;li&gt;formatting the text for display to the&amp;nbsp;user&lt;/li&gt;
&lt;li&gt;separating and executing tool calls (typically in an isolated container), and injecting the results back into the input (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue175.html"&gt;Issue 175&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;processing thinking tokens, removing or hiding them (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue178.html"&gt;Issue 178&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Other optional runtime extensions, such as those that add retrieval-augmented generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;) capabilities (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue176.html"&gt;Issue 176&lt;/a&gt;), or add information that the model remembered about the signed-in&amp;nbsp;user&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;What does an agent&amp;nbsp;do?&lt;/p&gt;
&lt;h2&gt;&lt;span class="caps"&gt;AI&lt;/span&gt;&amp;nbsp;Agents&lt;/h2&gt;
&lt;blockquote&gt;
&lt;p&gt;agent(n.)&lt;br /&gt;
late 15c., &amp;#8220;one who acts,&amp;#8221; from Latin &lt;em&gt;agentem&lt;/em&gt; (nominative &lt;em&gt;agens&lt;/em&gt;) &amp;#8220;effective, powerful,&amp;#8221; present participle of &lt;em&gt;agere&lt;/em&gt; &amp;#8220;to set in motion, drive forward; to do, perform; keep in movement&amp;#8221; (from &lt;span class="caps"&gt;PIE&lt;/span&gt; root &lt;strong&gt;*ag-&lt;/strong&gt; &amp;#8220;to drive, draw out or forth,&amp;nbsp;move&amp;#8221;).&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;The term &amp;#8220;agent&amp;#8221; means &amp;#8220;one who acts&amp;#8221;. So agents are software applications, comprising a trained model and a runtime. We can broadly think of the model as the &amp;#8220;brains&amp;#8221; of the partnership, and the runtime as the&amp;nbsp;&amp;#8220;body&amp;#8221;.&lt;/p&gt;
&lt;p&gt;Because agents need a computer (physical or virtual) to &amp;#8220;act&amp;#8221;, these software applications are typically installed on a computer, although they may also include a web interface to allow users to control them&amp;nbsp;remotely.&lt;/p&gt;
&lt;p&gt;The model has remained conceptually similar as I went from Issue 170 to here, but the runtime is picking up more and more responsibilities. So as not to muddy the terms, I&amp;#8217;ll keep the runtime focused on the model: processing the output, executing tool calls and injecting results, re-invoking the model if it has not reached a stop token, and any &lt;span class="caps"&gt;RAG&lt;/span&gt; if implemented. Everything else that we are adding today, that makes the agent an effective partner and piece of software, I&amp;#8217;ll explain under the label &lt;strong&gt;harness&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;The&amp;nbsp;model&lt;/h2&gt;
&lt;p&gt;Some harnesses make it easy to swap out the underlying model, allowing the model to run the agent harness with different models. Many model providers have standardized on OpenAI&amp;#8217;s &lt;span class="caps"&gt;API&lt;/span&gt; (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue004.html"&gt;Issue 4&lt;/a&gt;) so as to make their models easily accessible to&amp;nbsp;programmers.&lt;/p&gt;
&lt;p&gt;While state-of-the-art models are capable enough to not require a more specialized version for agentic use, the agent harness usually provides a special system prompt for this purpose. This special prompt includes information on the use context, on the tools available to the model, and other pertinent information to guide the model and keep it on&amp;nbsp;task.&lt;/p&gt;
&lt;h2&gt;The&amp;nbsp;runtime&lt;/h2&gt;
&lt;p&gt;A runtime used within a harness needs to include additional features: the ability to pause or stop the model, to understand access control configuration (which tool calls require user approval) and route matching tool calls to the user for permission grants, and introspectability: allowing the harness program to check the state of the runtime and&amp;nbsp;model.&lt;/p&gt;
&lt;h2&gt;The&amp;nbsp;harness&lt;/h2&gt;
&lt;p&gt;When a user uses agentic software, the harness is what they see. That means the harness handles typical software&amp;nbsp;responsibilities:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;it handles installation and initial setup, allowing the user to select a directory that the agent will begin working&amp;nbsp;from&lt;/li&gt;
&lt;li&gt;it handles extensions/plugins that the user may wish to install, making the tools/MCPs (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue175.html"&gt;Issue 175&lt;/a&gt;) available to the&amp;nbsp;runtime&lt;/li&gt;
&lt;li&gt;it handles file uploads (and any necessary format conversion or resizing), request customisation (e.g. enabling extended thinking), other request-related&amp;nbsp;settings&lt;/li&gt;
&lt;li&gt;it handles the model output through the runtime, displaying to the user tool calls and their results, any visible thinking traces, and any permission requests which come from the runtime (remember that the model remains unaware of these). If the &lt;span class="caps"&gt;API&lt;/span&gt; supports it, the harness streams these to the user, allowing them to see tokens as the model outputs them, without having to wait for the model to finish the entire&amp;nbsp;response&lt;/li&gt;
&lt;li&gt;it provides an interrupt mechanism for the user to halt the runtime if the model is going off-track, or to queue up more messages for the runtime to inject into the request at an appropriate&amp;nbsp;juncture&lt;/li&gt;
&lt;li&gt;some harnesses may support agent memory features, giving the agent tools to write information to its internal memory, and retrieve the information when&amp;nbsp;required&lt;/li&gt;
&lt;li&gt;harnesses for continuously running agents may include features for setting the wake-up interval of the agent, e.g. invoking the agent every 30 seconds with standard instructions to check for outstanding tasks and complete&amp;nbsp;them&lt;/li&gt;
&lt;li&gt;harnesses that integrate with external services will include features for receiving requests via email, WhatsApp, Telegram, or other channels, passing them to the agent and returning the response when it is&amp;nbsp;ready.&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;What an agent&amp;nbsp;does&lt;/h2&gt;
&lt;p&gt;&amp;#8230; I don&amp;#8217;t know what to say here. By itself, a model can do nothing besides generate text. When embedded in a harness+runtime, what it can do is limited by the tools it has available—remember that the model relies on the runtime executing its tool calls to have any effect on the&amp;nbsp;world.&lt;/p&gt;
&lt;p&gt;With simple toolsets (primarily a commandline tool), the agent can&amp;nbsp;plausibly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;read, edit, and delete text files on the&amp;nbsp;computer&lt;/li&gt;
&lt;li&gt;search through files on the&amp;nbsp;computer&lt;/li&gt;
&lt;li&gt;check the computer&amp;#8217;s stats, such as memory usage, free space on disk, &lt;span class="caps"&gt;CPU&lt;/span&gt;&amp;nbsp;usage&lt;/li&gt;
&lt;li&gt;troubleshoot or diagnose computer&amp;nbsp;issues&lt;/li&gt;
&lt;li&gt;perform a web search or retrieve a web&amp;nbsp;page&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;If given the appropriate tools and permissions from the user, the agent can&amp;nbsp;also:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;install or uninstall software on the computer (through the&amp;nbsp;commandline)&lt;/li&gt;
&lt;li&gt;download source code, compile it, install it, and run&amp;nbsp;it&lt;/li&gt;
&lt;li&gt;run a server on the computer, handle web requests, return&amp;nbsp;responses&lt;/li&gt;
&lt;li&gt;read, write, and test&amp;nbsp;code&lt;/li&gt;
&lt;li&gt;push code to a code&amp;nbsp;repository&lt;/li&gt;
&lt;li&gt;add bug reports or issues to a task board, or read existing ones from&amp;nbsp;it&lt;/li&gt;
&lt;li&gt;send requests to an &lt;span class="caps"&gt;API&lt;/span&gt; (if authenticated by the user), and thus execute any supported action through the APIs of Google Drive, Dropbox, Notion, and other services (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue006.html"&gt;Issue 6&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;With more advanced tools or &lt;span class="caps"&gt;MCP&lt;/span&gt; servers that handle the complex details, an agent can&amp;nbsp;even:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;be registered as a plugin in Adobe or Microsoft Office software, reading and editing&amp;nbsp;documents&lt;/li&gt;
&lt;li&gt;work with &lt;span class="caps"&gt;PDF&lt;/span&gt;&amp;nbsp;files&lt;/li&gt;
&lt;li&gt;fix&amp;nbsp;bugs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When provided with detailed explanations of how to perform complex tasks (typically through a skill file that the agent can read), the agent can&amp;nbsp;plausibly:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;analyze large&amp;nbsp;datasets&lt;/li&gt;
&lt;li&gt;follow company&amp;nbsp;workflows&lt;/li&gt;
&lt;li&gt;scan software or APIs for&amp;nbsp;vulnerabilities&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;&amp;#8230; Why haven&amp;#8217;t they taken over the world&amp;nbsp;yet?&lt;/h2&gt;
&lt;p&gt;Because most people aren&amp;#8217;t using&amp;nbsp;them!&lt;/p&gt;
&lt;p&gt;&amp;#8230; Just kidding, there are other reasons too. For&amp;nbsp;example:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Most complex tasks aren&amp;#8217;t described in skill files that are agent-readable, or are not well&amp;nbsp;described&lt;/li&gt;
&lt;li&gt;Many of the advanced tools or &lt;span class="caps"&gt;MCP&lt;/span&gt; servers that are needed don&amp;#8217;t exist, e.g. those for editing &lt;span class="caps"&gt;PDF&lt;/span&gt; files reliably. If they exist they aren&amp;#8217;t always&amp;nbsp;reliable&lt;/li&gt;
&lt;li&gt;The really effective tools might be hyper-customized for the tool author and not as useful for&amp;nbsp;others&lt;/li&gt;
&lt;li&gt;Most users are used to doing things themselves, and don&amp;#8217;t have enough experience with an agent harness to be accustomed to instructing&amp;nbsp;one&lt;/li&gt;
&lt;li&gt;Users might not know that it is possible to do something, and have not considered asking an agent to do&amp;nbsp;it&lt;/li&gt;
&lt;li&gt;Agent models still have limited context windows (even a context window of 1 million tokens can fill up quickly with a sufficiently complex task), and ways to enable a model to keep relevant task details in context while removing irrelevant details are still being&amp;nbsp;studied&lt;/li&gt;
&lt;li&gt;The model might not have been trained on a particular task, and its general reasoning capabilities might not be sufficient to carry out the task&amp;nbsp;effectively&lt;/li&gt;
&lt;li&gt;Agent harnesses tend to run in the commandline, or be designed primarily for programmer use, thus scaring layfolks&amp;nbsp;away&lt;/li&gt;
&lt;li&gt;&amp;#8230;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Agent capabilities tend to be emergent. That means researchers and frontier labs can train a model to carry out tasks A, B, and C, and a user giving the agent a different kind of task &lt;em&gt;discovers&lt;/em&gt; that it is also effective at task D but not task&amp;nbsp;E.&lt;/p&gt;
&lt;p&gt;Generally, a question can &amp;#8220;can an agent do F?&amp;#8221; can&amp;#8217;t be answered definitively prior to actually asking the model to do F. And even if one person fails to get the agent to execute the task successfully, another person might succeed, because they asked differently, because they are familiar with the terminology required to instruct the agent, or for some other&amp;nbsp;reason.&lt;/p&gt;
&lt;p&gt;All of this is still ongoing research work: agents only really took off in 2025, when &lt;a href="https://www.anthropic.com/news/claude-3-7-sonnet"&gt;Anthropic released Claude Code&lt;/a&gt; which became the first generally capable agent. Since then, every day users are discovering new things that it can do. The things that it can&amp;#8217;t, Anthropic and other frontier labs are still training it to be able to do&amp;nbsp;them.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Agents are software applications that comprise a harness, a runtime, and a model (typically accessed through an &lt;span class="caps"&gt;API&lt;/span&gt; instead of directly executed on the computer). They enable a user to type in a request or send it by other means and thus instruct the agent to carry out a task on the computer until completion. The capabilities of agents are limited by the tools available to&amp;nbsp;them.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;You now have a pretty good idea of all the pieces involved in getting an &lt;span class="caps"&gt;AI&lt;/span&gt; agent to do things. The part I can&amp;#8217;t authoritatively tell you about is what they can or can&amp;#8217;t do, because that is still changing every week as frontier labs continue to train more capable models and agent harnesses continue to add more tools and&amp;nbsp;features.&lt;/p&gt;
&lt;p&gt;If you&amp;#8217;re curious, consider trying them out. You could search for an online guide, or let ChatGPT/Claude help get you&amp;nbsp;started.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;In ten issues, I&amp;#8217;ve walked you through the key concepts that help you understand what &lt;span class="caps"&gt;AI&lt;/span&gt; agents do. With three issues left to go, what else should I&amp;nbsp;cover?&lt;/p&gt;
&lt;p&gt;Some questions I&amp;#8217;m anticipating, or have fielded some variant&amp;nbsp;of:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Can I run my own &lt;span class="caps"&gt;AI&lt;/span&gt;&amp;nbsp;model?&lt;/li&gt;
&lt;li&gt;Why can&amp;#8217;t the &lt;span class="caps"&gt;AI&lt;/span&gt; do&amp;nbsp;&amp;lt;thing&amp;gt;?&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Question 2 has a boring answer and an interesting one. The boring answer is &amp;#8220;because it hasn&amp;#8217;t been trained yet&amp;#8221;. The interesting answer is &amp;#8230; not really suitable for a newletter titled Layman&amp;#8217;s Guide to &lt;em&gt;Computing&lt;/em&gt;, because it&amp;#8217;ll be rooted in philosophy and cognitive science. In a different publication&amp;nbsp;perhaps.&lt;/p&gt;
&lt;p&gt;So let&amp;#8217;s tackle question 1, which will draw on computing concepts I&amp;#8217;ve covered in earlier issues and give you an idea of the kind of compute and memory capacity needed to run a&amp;nbsp;model.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue180.html"&gt;Issue 180: Running a&amp;nbsp;model&lt;/a&gt;&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 178: Model thinking and reasoning</title><link href="https://ngjunsiang.github.io/laymansguide/issue178.html" rel="alternate"></link><published>2026-08-03T08:00:00+08:00</published><updated>2026-08-03T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-08-03:/laymansguide/issue178.html</id><summary type="html">&lt;p&gt;Thinking/reasoning models are those that have been trained on examples of how to think about different problems in different domains, or plan and execute complex tasks. They often use tools to aid them in goal tracking and updating. The full thinking trace from the model may be removed or hidden to present a more legible response to the&amp;nbsp;user.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Multimodal models represent text, image, and audio tokens alongside each other in their embedding space. The model uses the input tokens, regardless of type, to calculate the next output token. Multimodal models typically only output text tokens in their response, delegating to more specialized models for image and audio generation if&amp;nbsp;necessary.&lt;/p&gt;
&lt;p&gt;In this issue we fill in the last piece of the puzzle needed to &amp;#8220;unlock untold economic value&amp;#8221;, if the &lt;span class="caps"&gt;AI&lt;/span&gt; labs are to be believed. Let&amp;#8217;s talk about how models&amp;nbsp;&amp;#8220;think&amp;#8221;.&lt;/p&gt;
&lt;h2&gt;Making thinking&amp;nbsp;happen&lt;/h2&gt;
&lt;p&gt;You&amp;#8217;re in a lesson. The teacher asks a question, something innocuous really: &amp;#8220;What&amp;#8217;s the value of X?&amp;#8221; All eyes are on you. You reply with the first answer off the top of your head. Wrongly, it turns&amp;nbsp;out.&lt;/p&gt;
&lt;p&gt;Your teacher could mock you at this point, but if they decide to get you to think harder instead, what do they&amp;nbsp;say?&lt;/p&gt;
&lt;p&gt;As it happens, this trick works on LLMs too. The ways we try to get people to think harder appear to be well-represented in books, on the internet, and in other media that the models are trained&amp;nbsp;on.&lt;/p&gt;
&lt;p&gt;What this means is that you add any of the&amp;nbsp;following:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;think step by&amp;nbsp;step.&amp;#8221;&lt;/li&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;think&amp;nbsp;carefully.&amp;#8221;&lt;/li&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;check your assumptions before you&amp;nbsp;answer.&amp;#8221;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;And it influences the model&amp;#8217;s next token. It begins to output phrases&amp;nbsp;like:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;Let&amp;#8217;s break this&amp;nbsp;down.&amp;#8221;&lt;/li&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;First, let&amp;#8217;s identify what&amp;#8217;s being&amp;nbsp;asked.&amp;#8221;&lt;/li&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;One way to approach this&amp;nbsp;is&amp;#8230;&amp;#8221;&lt;/li&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;Before answering, let&amp;#8217;s&amp;nbsp;consider&amp;#8230;&amp;#8221;&lt;/li&gt;
&lt;li&gt;&lt;span class="dquo"&gt;&amp;#8220;&lt;/span&gt;Let&amp;#8217;s work through the problem&amp;nbsp;systematically.&amp;#8221;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It begins to imitate the patterns of careful thinking that it picked up during training. Surprisingly (or perhaps unsurprisingly), this improves the model&amp;#8217;s answer in many cases! It generates a much longer answer, taking more time and using more compute in the process—this is what &lt;span class="caps"&gt;AI&lt;/span&gt; folks call &amp;#8220;spending compute for intelligence&amp;#8221;. If you don&amp;#8217;t have a large &lt;span class="caps"&gt;LLM&lt;/span&gt;, you can have a smaller &lt;span class="caps"&gt;LLM&lt;/span&gt; &amp;#8220;think harder&amp;#8221; and come up with a better&amp;nbsp;answer.&lt;/p&gt;
&lt;h2&gt;Where thinking breaks down: insufficient&amp;nbsp;examples&lt;/h2&gt;
&lt;p&gt;When this trick was first discovered, early adopters experimented with different prompt patterns, trying to get models to generate longer responses that led to better answers. But thinking doesn&amp;#8217;t always succeed. We&amp;#8217;ve all had the experience of trying to think through some difficult math problem, writing lots of working that ultimately led&amp;nbsp;nowhere.&lt;/p&gt;
&lt;p&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-3 may have been trained on a really large dataset, but most webpages and books are not showcases of how to solve difficult problems through clear&amp;nbsp;thinking.&lt;/p&gt;
&lt;p&gt;So it&amp;#8217;s back to supervised learning again. Look for examples of how to solve difficult problems. Recruit experts and have them write down their chain of thought for different kinds of problems. Then train the model on this labelled data, so that it doesn&amp;#8217;t require users to be clever with prompts to extract this thinking. Train the model to differentiate between requests for a quick answer, and requests requiring deeper&amp;nbsp;thinking.&lt;/p&gt;
&lt;h2&gt;Thinking vs.&amp;nbsp;planning&lt;/h2&gt;
&lt;p&gt;A model that is able to think longer and in a more disciplined way to produce a better answer is able to tackle harder questions. These are the models that were solving olympiad questions that humans struggled to&amp;nbsp;solve.&lt;/p&gt;
&lt;p&gt;But this isn&amp;#8217;t enough for another kind of challenge: long-horizon tasks that involve multiple tool calls, putting together information and feedback from multiple sources, maintaining task coherence and a consistent goal orientation throughout the process, and finally producing output in the correct&amp;nbsp;format.&lt;/p&gt;
&lt;p&gt;For example, filing tax returns involves digging through a large number of financial documents, remaining aware of legal requirements for filing, extracting relevant information, and putting it together following those requirements. None of the steps along the way involve extreme intelligence or genius insight, it&amp;#8217;s just a lot of tedious steps and details to keep track of. Along the way, detours and failed tool calls threaten to derail the model; it can get stuck researching an edge case rule, debugging a failing tool call, or get distracted by other&amp;nbsp;things.&lt;/p&gt;
&lt;p&gt;This requires the model to &lt;em&gt;plan&lt;/em&gt;. It has to take an end-goal, break it down into phases and steps, think about immediate steps, execute them and observe the result, decide next steps, repeat, &amp;#8230;. Along the way, it has to keep track of goals and sub-goals (usually aided by task management tools), be able to tell when they are met and check them off the&amp;nbsp;list.&lt;/p&gt;
&lt;p&gt;Books and websites seldom contain detailed worked examples of how to do this, so the model has to be trained with labelled data (again!), given examples of planning steps through supervised learning until it is able to reproduce them&amp;nbsp;reliably.&lt;/p&gt;
&lt;h2&gt;Hidden vs visible&amp;nbsp;thinking&lt;/h2&gt;
&lt;p&gt;Frontier labs found that showing the full thinking process to users isn&amp;#8217;t always beneficial. For example, the full thinking trace—tokens that constitute the analysis and are not part of the final answer—could be really lengthy. Users tend not to like that; they want to see the key steps for a quick check, and then the final&amp;nbsp;answer.&lt;/p&gt;
&lt;p&gt;Perhaps the full thinking trace includes mistakes the model made and corrected later, erroneous tool calls that it subsequently fixed, search tool calls which the user does not need to see the full contents of, etc. In other cases, frontier labs may have found ways for the model to output a more efficient form of thinking with tokens that is not&amp;nbsp;human-readable.&lt;/p&gt;
&lt;p&gt;This means one more step in the runtime: detecting and processing thinking tokens. If the model is trained to demarcate thinking tokens with a special start and end sequence,&amp;nbsp;e.g. &lt;code&gt;&amp;lt;thinking&amp;gt;...&amp;lt;/thinking&amp;gt;&lt;/code&gt;, the runtime may look for&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;Once detected, this hidden thinking may be removed, summarized (with a different model), or collapsed to take up less space in the user&amp;nbsp;interface.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Thinking/reasoning models are those that have been trained on examples of how to think about different problems in different domains, or plan and execute complex tasks. They often use tools to aid them in goal tracking and updating. The full thinking trace from the model may be removed or hidden to present a more legible response to the&amp;nbsp;user.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;This really is the primary concept behind thinking/reasoning models: more supervised training to output a sequence of tokens that lead the model to a useful&amp;nbsp;answer.&lt;/p&gt;
&lt;p&gt;If this sounds simple, that&amp;#8217;s because most of the magic is in the model training: crafting and labelling training examples, and then training the model on them, is a much more complicated process than it sounds, and I&amp;#8217;m excluding it from this issue because it is very technical and not suited for a newsletter named Layman&amp;#8217;s&amp;nbsp;Guide.&lt;/p&gt;
&lt;p&gt;Now you know what a model is doing when you activate a feature named &amp;#8220;Extended Thinking&amp;#8221;, or switch to a model that is described as a thinking/reasoning&amp;nbsp;model.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue179.html"&gt;Issue 179:&amp;nbsp;Agents&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Finally we can talk about this term, &amp;#8220;agents&amp;#8221;, and what differentiates them from a model. If you&amp;#8217;ve heard this term before and wondered what goes into one, subscribe to be notified when I lay it bare&amp;nbsp;;)&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 177: Multimodal models</title><link href="https://ngjunsiang.github.io/laymansguide/issue177.html" rel="alternate"></link><published>2026-07-27T08:00:00+08:00</published><updated>2026-07-27T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-07-27:/laymansguide/issue177.html</id><summary type="html">&lt;p&gt;Multimodal models represent text, image, and audio tokens alongside each other in their embedding space. The model uses the input tokens, regardless of type, to calculate the next output token. Multimodal models typically only output text tokens in their response, delegating to more specialized models for image and audio generation if&amp;nbsp;necessary.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; In retrieval-augmented generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;), the runtime performs a search with the user&amp;#8217;s request to retrieve relevant chunks from a set of documents from a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s input. One alternative to &lt;span class="caps"&gt;RAG&lt;/span&gt;, where information lookup happens outside of &lt;span class="caps"&gt;LLM&lt;/span&gt; generation, is to provide the &lt;span class="caps"&gt;LLM&lt;/span&gt; with search tools instead, and rely on its judgement to use them&amp;nbsp;well.&lt;/p&gt;
&lt;p&gt;Multimodal models. Try saying that three times quickly. It&amp;#8217;s quite a mouthful, but if you&amp;#8217;ve managed to keep up so far, it&amp;#8217;s really not complicated, so I don&amp;#8217;t expect this to be a long&amp;nbsp;issue.&lt;/p&gt;
&lt;h2&gt;Multimodal&amp;nbsp;models&lt;/h2&gt;
&lt;p&gt;While a large language model works only with text tokens, a &lt;strong&gt;multimodal model&lt;/strong&gt; can work with other types of tokens as well. We&amp;#8217;ve previously covered what text tokens are and how LLMs use them (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue172.html"&gt;Issue 172&lt;/a&gt;), so let&amp;#8217;s focus on image and audio&amp;nbsp;tokens.&lt;/p&gt;
&lt;p&gt;The approach is similar, really: text gets broken up into common repeating patterns. Images and audio likewise get broken up into common repeating patterns. Each common repeating pattern is represented by a number, or set of numbers, and located in an embedding space (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue172.html"&gt;Issue 172&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;Image&amp;nbsp;tokens&lt;/h2&gt;
&lt;p&gt;There are a variety of approaches for tokenizing images. A common way to do this is to break the image up into 16×16-pixel patches. Each pixel has three values representing red+green+blue (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue043.html"&gt;Issues 43&lt;/a&gt; &lt;span class="amp"&gt;&amp;amp;&lt;/span&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue044.html"&gt;44&lt;/a&gt;), so each patch is a sequence of 16×16×3 = 768&amp;nbsp;values.&lt;/p&gt;
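&lt;p&gt;The patch arithmetic can be sketched in a few lines of Python. This is only an illustration: the 224×224 dummy image, the helper names, and the use of plain lists are assumptions, not how any particular model tokenizes&amp;nbsp;images.&lt;/p&gt;

```python
PATCH = 16     # patch side length in pixels, as described in the text
CHANNELS = 3   # red, green, blue


def make_image(width, height):
    """Build a dummy RGB image as rows of (r, g, b) tuples (illustrative only)."""
    return [[(x % 256, y % 256, (x + y) % 256) for x in range(width)]
            for y in range(height)]


def to_patches(image):
    """Split an image into flattened 16x16 patches, left to right, top to bottom."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - PATCH + 1, PATCH):
        for left in range(0, width - PATCH + 1, PATCH):
            values = []
            for y in range(top, top + PATCH):
                for x in range(left, left + PATCH):
                    values.extend(image[y][x])  # appends r, g, b
            patches.append(values)
    return patches


image = make_image(224, 224)   # assumed image size
patches = to_patches(image)
print(len(patches), len(patches[0]))
```

&lt;p&gt;With these assumed dimensions, the image yields 14×14 = 196 patches of 768 values&amp;nbsp;each.&lt;/p&gt;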
&lt;p&gt;Each unique combination of 768 values constitutes an image token. During training, these image tokens appear alongside other tokens (text, image, audio), and the model adjusts its embedding parameters to locate semantically similar tokens in close&amp;nbsp;proximity.&lt;/p&gt;
&lt;p&gt;During inference (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue173.html"&gt;Issue 173&lt;/a&gt;), hidden layers represent more abstract patterns that the model identifies: lower layers may encode information about edges, while higher layers capture information about shapes, textures, and even&amp;nbsp;objects.&lt;/p&gt;
&lt;h2&gt;Audio&amp;nbsp;tokens&lt;/h2&gt;
&lt;p&gt;While intuitively it seems natural to chunk audio into 1-second or even sub-second samples, in reality 1 second of CD-quality audio contains 44,100 samples (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue045.html"&gt;Issue 45&lt;/a&gt;), which is still far too large for a single&amp;nbsp;token.&lt;/p&gt;
&lt;p&gt;Instead, audio is usually converted from a waveform representation (amplitude vs time) into a spectrum representation (amplitude vs frequency at a snapshot in time). The spectrogram is built from short windows of a few tens of milliseconds each (roughly a thousand samples per window at 44,100 samples per second). The values of each frequency in that window then naturally form an audio token, which appears alongside other tokens in training and gets represented in embedding space the same way as other&amp;nbsp;tokens.&lt;/p&gt;
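&lt;p&gt;Here is a minimal sketch of that windowing step, using a naive Fourier transform on a synthetic tone. The window length of 1,024 samples (about 23 ms at 44,100 samples per second), the test tone, and the function names are all assumptions for illustration; real audio tokenizers are considerably more&amp;nbsp;involved.&lt;/p&gt;

```python
import cmath
import math

SAMPLE_RATE = 44100   # samples per second
WINDOW = 1024         # samples per window (assumed), about 23 ms


def magnitude_spectrum(window):
    """Naive discrete Fourier transform: one magnitude per frequency bin."""
    n = len(window)
    return [
        abs(sum(s * cmath.exp(-2j * math.pi * k * t / n)
                for t, s in enumerate(window)))
        for k in range(n // 2)
    ]


# One second of a pure tone whose frequency lands exactly on bin 10.
tone_bin = 10
tone_hz = tone_bin * SAMPLE_RATE / WINDOW
samples = [math.sin(2 * math.pi * tone_hz * t / SAMPLE_RATE)
           for t in range(SAMPLE_RATE)]

# Split the waveform into non-overlapping windows, dropping the partial tail.
windows = [samples[i:i + WINDOW]
           for i in range(0, len(samples) - WINDOW + 1, WINDOW)]

# The spectrum of one window is the raw material for one audio token.
spectrum = magnitude_spectrum(windows[0])
peak = max(range(len(spectrum)), key=lambda k: spectrum[k])
print(len(windows), len(spectrum), peak)
```

&lt;p&gt;Each window is reduced to 512 frequency values here, and the loudest bin matches the tone that was put&amp;nbsp;in.&lt;/p&gt;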
&lt;h2&gt;Multimodal models need supervised&amp;nbsp;training&lt;/h2&gt;
&lt;p&gt;Supervised learning plays a big part here. Images, audio, and text seldom appear together in unlabelled training data (except in video), so associating images and audio with text relies heavily on manual labelling. This is why multimodal models took so long to&amp;nbsp;emerge.&lt;/p&gt;
&lt;p&gt;During inference, all tokens regardless of type are represented as embeddings, and the model uses the input tokens to calculate the output&amp;nbsp;token.&lt;/p&gt;
&lt;h2&gt;Multimodal models vs image/audio generation&amp;nbsp;models&lt;/h2&gt;
&lt;p&gt;An app like ChatGPT can take user-uploaded image files, reference them in its response to the user, and then generate an image, or even convert the response from text to audio. But this seamlessness is an illusion; on the backend, these do not use the same&amp;nbsp;model.&lt;/p&gt;
&lt;p&gt;Multimodal models can take input tokens of multiple types, but typically only generate text in response; users do not expect image patches or audio snippets in the response, and would not know how to interpret&amp;nbsp;them.&lt;/p&gt;
&lt;p&gt;Instead, image and audio generation use different kinds of (non-Transformer) models, which might be worth exploring briefly in a future issue, but not this&amp;nbsp;one.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Multimodal models represent text, image, and audio tokens alongside each other in their embedding space. The model uses the input tokens, regardless of type, to calculate the next output token. Multimodal models typically only output text tokens in their response, delegating to more specialized models for image and audio generation if&amp;nbsp;necessary.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;There you go. Multimodal models demystified: once you figure out how to tokenize something alongside text, and give the model lots of labelled data to associate it with text tokens during training, you can create another modality for your model. This sentence hides months of complexity that &lt;span class="caps"&gt;AI&lt;/span&gt; labs tackle, because that&amp;#8217;s what you&amp;#8217;re reading Layman&amp;#8217;s Guide for, isn&amp;#8217;t&amp;nbsp;it?&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue178.html"&gt;Issue 178: Model thinking and&amp;nbsp;reasoning&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ve covered retrieval-augmented generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;), and now we&amp;#8217;ve covered multimodal models. Text, images, audio: Check check checked. Tools? You&amp;nbsp;bet.&lt;/p&gt;
&lt;p&gt;We&amp;#8217;ve got almost all the ingredients to assemble an &lt;span class="caps"&gt;AI&lt;/span&gt; to scare the economic labor pool, but we are still lacking one final piece of the puzzle: how do LLMs&amp;nbsp;&amp;#8220;think&amp;#8221;?&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 176: Retrieval-Augmented Generation (RAG)</title><link href="https://ngjunsiang.github.io/laymansguide/issue176.html" rel="alternate"></link><published>2026-07-20T08:00:00+08:00</published><updated>2026-07-20T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-07-20:/laymansguide/issue176.html</id><summary type="html">&lt;p&gt;In retrieval-augmented generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;), the runtime performs a search with the user&amp;#8217;s request to retrieve relevant chunks from a set of documents from a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s input. One alternative to &lt;span class="caps"&gt;RAG&lt;/span&gt;, where information lookup happens outside of &lt;span class="caps"&gt;LLM&lt;/span&gt; generation, is to provide the &lt;span class="caps"&gt;LLM&lt;/span&gt; with search tools instead, and rely on its judgement to use them&amp;nbsp;well.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the &lt;span class="caps"&gt;LLM&lt;/span&gt; in the next&amp;nbsp;input.&lt;/p&gt;
&lt;p&gt;We mentioned hallucination (citing non-existent publications, stories, or facts as though they were real) as one of the pitfalls of &lt;span class="caps"&gt;GPT&lt;/span&gt;-3, and noted how reinforcement learning from human feedback (&lt;span class="caps"&gt;RLHF&lt;/span&gt;) helps to combat some of these tendencies in general&amp;nbsp;use.&lt;/p&gt;
&lt;p&gt;These days, ChatGPT, Claude, and other chatbots also allow you to upload documents. The runtime supporting these chatbots helps to extract text from the documents with supporting context and include them in the system prompt, allowing the chatbot to answer from the document&amp;#8217;s contents to combat&amp;nbsp;hallucination.&lt;/p&gt;
&lt;p&gt;In some cases, the document may be too large. In other cases, a company may have a large set of documents the &lt;span class="caps"&gt;LLM&lt;/span&gt; should answer from, but they are too large to all be included in the system&amp;nbsp;prompt.&lt;/p&gt;
&lt;p&gt;In such cases, &lt;strong&gt;retrieval-augmented generation&lt;/strong&gt; (&lt;span class="caps"&gt;RAG&lt;/span&gt;) provides an alternative way to inject relevant information into the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s system&amp;nbsp;prompt.&lt;/p&gt;
&lt;h2&gt;Retrieval-Augmented Generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;)&lt;/h2&gt;
&lt;p&gt;Like other &lt;span class="caps"&gt;LLM&lt;/span&gt; capabilities, this one comes from the runtime. The &lt;span class="caps"&gt;LLM&lt;/span&gt; plays no part in this and has no control over the&amp;nbsp;process.&lt;/p&gt;
&lt;p&gt;The source documents are chunked, and each chunk is analyzed to create an embedding. Parts of the document that are closely related have embeddings located closer&amp;nbsp;together.&lt;/p&gt;
&lt;p&gt;Before the user&amp;#8217;s input is passed to the &lt;span class="caps"&gt;LLM&lt;/span&gt;, it is parsed by the runtime and analyzed into an embedding. This embedding is used to retrieve relevant parts of documents; other information may be used to determine relevant portions as&amp;nbsp;well.&lt;/p&gt;
&lt;p&gt;Rather than entire documents, only these relevant portions are included in the system prompt for the &lt;span class="caps"&gt;LLM&lt;/span&gt; to answer the user&amp;#8217;s query. In more advanced implementations, the chunks may be further re-ranked by importance and other&amp;nbsp;criteria.&lt;/p&gt;
&lt;p&gt;All of this happens in the runtime, beyond the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s token generation&amp;nbsp;loop.&lt;/p&gt;
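&lt;p&gt;The retrieval steps above can be sketched as follows. This is a toy under stated assumptions: the word-count &amp;#8220;embedding&amp;#8221; stands in for a trained embedding model, and the sample chunks, function names, and prompt wording are made up for&amp;nbsp;illustration.&lt;/p&gt;

```python
import math
from collections import Counter


def embed(text):
    """Toy embedding: word counts. Real systems use a trained embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, chunks, top_k=2):
    """Rank chunks by similarity to the query and keep the best top_k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:top_k]


# Hypothetical knowledge base, already split into chunks.
chunks = [
    "Annual leave is 14 days per year for full-time staff.",
    "Expense claims must be submitted within 30 days.",
    "The office pantry is restocked every Monday.",
]
query = "How many days of annual leave do staff get?"
top = retrieve(query, chunks, top_k=1)

# The runtime, not the LLM, assembles the final input.
prompt = ("Answer using only these excerpts:\n" + "\n".join(top)
          + "\n\nUser: " + query)
print(prompt)
```

&lt;p&gt;Note that the model never sees the other two chunks; it only receives what the runtime chose to&amp;nbsp;include.&lt;/p&gt;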
&lt;h2&gt;Limitations&lt;/h2&gt;
&lt;p&gt;When it works well, it works really well: the &lt;span class="caps"&gt;LLM&lt;/span&gt; doesn&amp;#8217;t hallucinate, quotes from the source, and if the source is well-tagged, it can even cite from the correct page and&amp;nbsp;paragraph.&lt;/p&gt;
&lt;p&gt;But there are ways it can make mistakes too. If no matching documents are found and the &lt;span class="caps"&gt;LLM&lt;/span&gt; isn&amp;#8217;t aware, it may hallucinate unless the runtime handles this well. On the opposite end of the spectrum, it may find too many results and not know how to select the most relevant ones. The documents themselves may be contradictory, incomplete, or require too much unwritten context. And lastly, it may miss important nuance found elsewhere in the document, or in other documents, that did not surface in the embedding&amp;nbsp;search.&lt;/p&gt;
&lt;h2&gt;Alternatives&lt;/h2&gt;
&lt;p&gt;Still, in cases where you can&amp;#8217;t fit entire source documents in the &lt;span class="caps"&gt;LLM&lt;/span&gt; context, what other alternatives do you&amp;nbsp;have?&lt;/p&gt;
&lt;p&gt;Then it&amp;#8217;s back to a set of tools that your &lt;span class="caps"&gt;LLM&lt;/span&gt; can use to search the company knowledge base, read documents, and manually extract relevant portions. Naturally, your &lt;span class="caps"&gt;LLM&lt;/span&gt; will need to be trained on a dataset of positive examples of tool usage (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue175.html"&gt;Issue 175&lt;/a&gt;). In contrast to &lt;span class="caps"&gt;RAG&lt;/span&gt;, where retrieval is automatic and built into the runtime, here you are relying on the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s judgement of which tool to use, and when to use&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; In retrieval-augmented generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;), the runtime performs a search with the user&amp;#8217;s request to retrieve relevant chunks from a set of documents from a knowledge base. The chunks may be further re-ranked by the runtime before finally being included in the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s input. One alternative to &lt;span class="caps"&gt;RAG&lt;/span&gt;, where information lookup happens outside of &lt;span class="caps"&gt;LLM&lt;/span&gt; generation, is to provide the &lt;span class="caps"&gt;LLM&lt;/span&gt; with search tools instead, and rely on its judgement to use them&amp;nbsp;well.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;Okay, that&amp;#8217;s &lt;span class="caps"&gt;RAG&lt;/span&gt; de-mystified. It&amp;#8217;s a program that runs a search on the user&amp;#8217;s request and injects relevant chunks from the knowledge base into the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s input, beyond the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s control. Now you can speak about &lt;span class="caps"&gt;RAG&lt;/span&gt; a little more&amp;nbsp;knowledgeably.&lt;/p&gt;
&lt;p&gt;I avoided discussing &lt;span class="caps"&gt;RAG&lt;/span&gt;&amp;#8217;s performance, because results vary. For every detractor you can also find a supporter! Is it going to work well for you? You probably have to try it yourself, or find a consultant who can better advise&amp;nbsp;you.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue177.html"&gt;Issue 177: Multimodal&amp;nbsp;models&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Many chatbot models accept image and even audio alongside text. How does this work? De-mystifying in the next issue, so stay&amp;nbsp;tuned!&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 175: LLM tools</title><link href="https://ngjunsiang.github.io/laymansguide/issue175.html" rel="alternate"></link><published>2026-07-13T08:00:00+08:00</published><updated>2026-07-13T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-07-13:/laymansguide/issue175.html</id><summary type="html">&lt;p&gt;LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the &lt;span class="caps"&gt;LLM&lt;/span&gt; in the next&amp;nbsp;input.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Through reinforcement learning with human feedback (&lt;span class="caps"&gt;RLHF&lt;/span&gt;), the &lt;span class="caps"&gt;LLM&lt;/span&gt; is trained on labelled data until it can reliably follow instructions, avoid harmful output, and follow other desired behavior. A system prompt provides guidelines for output. The user&amp;#8217;s prompt is inserted into a templated prompt and passed to the &lt;span class="caps"&gt;LLM&lt;/span&gt;, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive&amp;nbsp;chatbot.&lt;/p&gt;
&lt;p&gt;A chatbot is fun to use for a while, but if all it could do was talk we wouldn&amp;#8217;t use it for very long. For starters, it would hallucinate a lot, or give outdated information, because it couldn&amp;#8217;t access the internet or do a web search. What would it take for models to be able to use computers to do&amp;nbsp;that?&lt;/p&gt;
&lt;p&gt;While this problem was being actively worked on, LLMs were also being trained to generate programming code. It turned out that code, being text-based, was fertile training ground for LLMs. They were improving at it too; while early versions still failed at producing large yet coherent programs, many were able to generate boilerplate code with correct syntax&amp;nbsp;already.&lt;/p&gt;
&lt;h2&gt;LLMs as tool-using&amp;nbsp;models&lt;/h2&gt;
&lt;p&gt;For a &lt;span class="caps"&gt;LLM&lt;/span&gt; to use a tool, it needs to be trained&amp;nbsp;to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;state the tool to&amp;nbsp;use&lt;/li&gt;
&lt;li&gt;pass the appropriate&amp;nbsp;options&lt;/li&gt;
&lt;li&gt;interpret the result, when passed back to the &lt;span class="caps"&gt;LLM&lt;/span&gt; (in the next&amp;nbsp;request)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For example, to use a weather tool, the &lt;span class="caps"&gt;LLM&lt;/span&gt; needs to be able&amp;nbsp;to:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;say &amp;#8220;use the weather&amp;nbsp;tool&amp;#8221;&lt;/li&gt;
&lt;li&gt;pass options: &amp;#8220;location: Singapore, &lt;span class="caps"&gt;SG&lt;/span&gt;, show me the temperature and humidity as&amp;nbsp;well&amp;#8221;&lt;/li&gt;
&lt;li&gt;interpret the result: mostly self-explanatory, but e.g. it may need to understand that the location given in the output may be the nearest known location rather than the user&amp;#8217;s actual&amp;nbsp;location&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It turns out that the first two problems are already known and solved: the first text-based interfaces were invented in the 1970s after all, and programmers have always needed a way to invoke programs through a text-based interface. They already had one, in the form of the command line (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue015.html"&gt;Issue 15&lt;/a&gt;) and the rich syntax that was already built around it. And they had another, in the form of the function call syntax that almost all programming languages had standardized on,&amp;nbsp;like &lt;code&gt;check_weather(location="Singapore, SG", show_temperature=True, show_humidity=True)&lt;/code&gt;. And training data already existed for both of these, in the form of open-source code readily available online in code repositories (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue019.html"&gt;Issue 19&lt;/a&gt;).&lt;/p&gt;
&lt;h2&gt;The structure of a tool-using &lt;span class="caps"&gt;LLM&lt;/span&gt;&lt;/h2&gt;
&lt;p&gt;For a &lt;span class="caps"&gt;LLM&lt;/span&gt; to be able to output tool calls, you&amp;nbsp;need:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;tool specifications, usually injected through the system prompt, telling the &lt;span class="caps"&gt;LLM&lt;/span&gt; the available tools and their&amp;nbsp;options&lt;/li&gt;
&lt;li&gt;guidance on when to use each tool, typically through further instructions in the system prompt, through &lt;span class="caps"&gt;RLHF&lt;/span&gt; (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue174.html"&gt;Issue 174&lt;/a&gt;), or&amp;nbsp;both&lt;/li&gt;
&lt;li&gt;familiarity with the tool call syntax used, typically trained into the model through &lt;span class="caps"&gt;RLHF&lt;/span&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;In the system prompt, you would&amp;nbsp;include:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;&lt;span class="k"&gt;[...]&lt;/span&gt;

&lt;span class="c1"&gt;## Tools available&lt;/span&gt;

&lt;span class="na"&gt;- `check_weather(location&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;text, show_temperature: boolean, show_humidity: boolean)`&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;Check the weather at the given location. Example&lt;/span&gt;&lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s"&gt;&amp;quot;Singapore, SG&amp;quot;&lt;/span&gt;
&lt;span class="w"&gt;  &lt;/span&gt;&lt;span class="na"&gt;Pass show_temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;True and show_humidity=True if temperature and humidity are required in the output&lt;/span&gt;
&lt;span class="na"&gt;- ...&lt;/span&gt;
&lt;span class="na"&gt;- ...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Providing a rich set of tools without using up too many tokens is a tricky design balance that requires regular tweaking. In any case, the model is then trained to output the tool calls in a specially marked section of their&amp;nbsp;output.&lt;/p&gt;
&lt;h2&gt;Invoking the&amp;nbsp;tools&lt;/h2&gt;
&lt;p&gt;At the point when the model outputs the stop token and the program stops using it to calculate more output tokens, its involvement stops. The program interprets the model&amp;#8217;s output, separating the tool calls out, and passes them to another&amp;nbsp;system.&lt;/p&gt;
&lt;p&gt;You see, tool calls can be pretty dangerous, especially if they enable the model to carry out destructive actions. A shell command&amp;nbsp;like &lt;code&gt;rm -rf /&lt;/code&gt; on Linux or Mac could delete the entire operating system, or important subdirectories.&amp;nbsp;A &lt;code&gt;delete_database&lt;/code&gt; tool could do what it says, but with the wrong target specified. So it&amp;#8217;s common to have a system that examines the tool call and attempts to determine if it is safe. In a code assistant, this tool call might be shown to the user for explicit approval. In a web-based chatbot like ChatGPT, tool safety is usually handled by another system&amp;nbsp;instead.&lt;/p&gt;
&lt;p&gt;Once validated, the tool call needs to be &lt;em&gt;executed&lt;/em&gt; on a computer system. This computer system needs to have the necessary programs installed. It should also be isolated against potentially destructive actions. We&amp;#8217;ve covered how containerization (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue149.html"&gt;Issue 149&lt;/a&gt;) enables this: each session can get its own isolated container where&amp;nbsp;necessary.&lt;/p&gt;
&lt;p&gt;Finally, the result of the tool call, whether success or failure, is captured and then added to the token sequence which is fed back into the &lt;span class="caps"&gt;LLM&lt;/span&gt;.&lt;/p&gt;
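&lt;p&gt;The detect, validate, execute, and feed-back steps can be sketched as below. Everything here is hypothetical: the canned &lt;code&gt;check_weather&lt;/code&gt; stub, the allow-list as a stand-in for a real safety check, and the result format are assumptions for illustration, not any provider&amp;#8217;s actual&amp;nbsp;runtime.&lt;/p&gt;

```python
import ast


def check_weather(location, show_temperature=False, show_humidity=False):
    """Hypothetical tool: returns canned data instead of calling a weather service."""
    report = {"location": location, "condition": "cloudy"}
    if show_temperature:
        report["temperature_c"] = 31
    if show_humidity:
        report["humidity_pct"] = 80
    return report


# Stand-in for a safety check: only tools listed here may run.
ALLOWED_TOOLS = {"check_weather": check_weather}


def run_tool_call(call_text):
    """Parse one call like name(key=value, ...) and run it if allow-listed.

    Only keyword arguments with literal values are supported in this sketch.
    """
    node = ast.parse(call_text, mode="eval").body
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        return "error: not a tool call"
    if node.func.id not in ALLOWED_TOOLS:
        return "error: tool not allowed: " + node.func.id
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return ALLOWED_TOOLS[node.func.id](**kwargs)


model_output = ('check_weather(location="Singapore, SG", '
                'show_temperature=True, show_humidity=True)')
result = run_tool_call(model_output)
blocked = run_tool_call('delete_database(name="prod")')

# The result is appended to the sequence that is fed back to the model.
next_input = model_output + "\nTool result: " + str(result)
print(next_input)
print(blocked)
```

&lt;p&gt;The text is parsed rather than executed directly, so a call to anything outside the allow-list is refused before it can&amp;nbsp;run.&lt;/p&gt;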
&lt;p&gt;This all sounds pretty neat, but with one caveat: only the chatbot provider (OpenAI for ChatGPT, or Anthropic for Claude) can pass these tools to the &lt;span class="caps"&gt;LLM&lt;/span&gt;. Third-party integrations, such as with GitHub or Google Drive, would be tricky for OpenAI/Anthropic to design on their own, yet unsafe for external parties to inject into the system&amp;nbsp;prompt.&lt;/p&gt;
&lt;h2&gt;Integrating third-party&amp;nbsp;tools&lt;/h2&gt;
&lt;p&gt;So in Nov 2024, Anthropic proposed another standard: the &lt;a href="https://modelcontextprotocol.io/docs/getting-started/intro"&gt;Model Context Protocol&lt;/a&gt;, a way for external parties to specify a set of tools that work together to enable access to other web-based or software-based&amp;nbsp;systems.&lt;/p&gt;
&lt;p&gt;When the user registers a &lt;span class="caps"&gt;MCP&lt;/span&gt; server through a graphical or text-based interface, the system reads the tool specifications from the &lt;span class="caps"&gt;MCP&lt;/span&gt; server, injects them into the system prompt, and from there they work like other tools accessible to the &lt;span class="caps"&gt;LLM&lt;/span&gt;.&lt;/p&gt;
&lt;h2&gt;The&amp;nbsp;runtime&lt;/h2&gt;
&lt;p&gt;Notice that none of this is mediated or controlled by the &lt;span class="caps"&gt;LLM&lt;/span&gt;. It follows instructions, generates tool calls with the correct syntax in its output, then sees the result in the next input, seemingly by magic. The &lt;span class="caps"&gt;LLM&lt;/span&gt; is operating in a virtualized environment controlled by an external system that doesn&amp;#8217;t have a standardized name yet. For now we&amp;#8217;ll call it the &lt;strong&gt;runtime&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Tools and toolsets make or break a &lt;span class="caps"&gt;LLM&lt;/span&gt;-based assistant. They are the only way a &lt;span class="caps"&gt;LLM&lt;/span&gt; can take actions, get data, and otherwise make sense of the external world. A &lt;span class="caps"&gt;LLM&lt;/span&gt; without any tools is analogous to a human in a sensory deprivation tank—without information from the outside world, even &lt;a href="https://en.wikipedia.org/wiki/Sensory_deprivation"&gt;human beings quickly begin to hallucinate&lt;/a&gt;.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; LLMs can be trained to make tool calls, using the same training data used to train code assistants. The tool specifications are injected into the system prompt that is passed to the model, along with guidance on when to use a tool. Tool calls generated by the model are interpreted by a runtime that detects and executes them, then passes the results of the tool call back to the &lt;span class="caps"&gt;LLM&lt;/span&gt; in the next&amp;nbsp;input.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;From here it&amp;#8217;s another 3 issues before we get to the topic of the year: &lt;span class="caps"&gt;AI&lt;/span&gt; agents. Before I get there I want to cover three more buzzphrases: &lt;strong&gt;retrieval-augmented generation&lt;/strong&gt; (&lt;span class="caps"&gt;RAG&lt;/span&gt;), &lt;strong&gt;multimodal&lt;/strong&gt; models, and &lt;strong&gt;reasoning/thinking&lt;/strong&gt;&amp;nbsp;models.&lt;/p&gt;
&lt;p&gt;By now I hope you&amp;#8217;re starting to see that LLMs really are next-token predictors underneath, and all their actual capabilities (the ones that let them know what is happening in real-time and change things in the world) are provided through the runtime. As the runtime grows more powerful and capable, LLMs must also be post-trained (using reinforcement learning, such as &lt;span class="caps"&gt;RLHF&lt;/span&gt;) to use it&amp;nbsp;well.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue176.html"&gt;Issue 176: Retrieval-Augmented Generation (&lt;span class="caps"&gt;RAG&lt;/span&gt;)&lt;/a&gt;&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 174: Reinforcement Learning</title><link href="https://ngjunsiang.github.io/laymansguide/issue174.html" rel="alternate"></link><published>2026-07-06T08:00:00+08:00</published><updated>2026-07-06T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-07-06:/laymansguide/issue174.html</id><summary type="html">&lt;p&gt;Through reinforcement learning with human feedback (&lt;span class="caps"&gt;RLHF&lt;/span&gt;), the &lt;span class="caps"&gt;LLM&lt;/span&gt; is trained on labelled data until it can reliably follow instructions, avoid harmful output, and follow other desired behavior. A system prompt provides guidelines for output. The user&amp;#8217;s prompt is inserted into a templated prompt and passed to the &lt;span class="caps"&gt;LLM&lt;/span&gt;, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive&amp;nbsp;chatbot.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; OpenAI discovered, through models &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 to &lt;span class="caps"&gt;GPT&lt;/span&gt;-3, that scaling compute and (training) data &lt;em&gt;alone&lt;/em&gt; was sufficient to sharply increase the capabilities of a &lt;span class="caps"&gt;LLM&lt;/span&gt;: the transformer architecture and unsupervised learning together resulted in a model that was alarmingly&amp;nbsp;intelligent.&lt;/p&gt;
&lt;p&gt;Mechanistically, a &lt;span class="caps"&gt;LLM&lt;/span&gt; is a next-token predictor: from a set of parameters, and an input sequence of tokens, a program continually calculates the next token, which gets appended to the input sequence, and the new sequence gets fed in as the input again, until a stop token is&amp;nbsp;generated.&lt;/p&gt;
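&lt;p&gt;That loop can be sketched in a few lines. The lookup table below is a hypothetical stand-in for a trained model (a real &lt;span class="caps"&gt;LLM&lt;/span&gt; computes the next token from billions of parameters and the whole sequence), and the stop marker and function names are assumptions for&amp;nbsp;illustration.&lt;/p&gt;

```python
STOP = "[stop]"  # assumed marker for the stop token

# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_MODEL = {"the": "cat", "cat": "sat", "sat": "down", "down": STOP}


def next_token(sequence):
    """Pick the next token given the sequence so far (toy: looks at the last token only)."""
    return TOY_MODEL.get(sequence[-1], STOP)


def generate(prompt_tokens, max_tokens=20):
    """Append predicted tokens until a stop token appears or the limit is hit."""
    sequence = list(prompt_tokens)
    for _ in range(max_tokens):
        token = next_token(sequence)
        if token == STOP:
            break
        sequence.append(token)  # the new sequence becomes the next input
    return sequence


output = generate(["the"])
print(output)
```

&lt;p&gt;The loop itself belongs to the surrounding program; the model only ever answers the question &amp;#8220;what comes&amp;nbsp;next?&amp;#8221;&lt;/p&gt;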
&lt;p&gt;OpenAI had discovered that by training &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 (with over a hundred billion parameters) on a very large dataset (hundreds of billions of tokens), they ended up with a next-token predictor that appeared to generate readable, sensible&amp;nbsp;text.&lt;/p&gt;
&lt;p&gt;But that doesn&amp;#8217;t mean that &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 was ready for public use yet: what about those hallucinations, that toxic output, the prompt injections that caused it to ignore OpenAI&amp;#8217;s&amp;nbsp;instructions?&lt;/p&gt;
&lt;h2&gt;Reinforcement&amp;nbsp;learning&lt;/h2&gt;
&lt;p&gt;Unsupervised learning may have created a genius model, but now OpenAI had to fall back on supervised learning to make it&amp;nbsp;useful.&lt;/p&gt;
&lt;p&gt;In 2022, OpenAI researchers submitted a paper titled &lt;a href="https://arxiv.org/abs/2203.02155"&gt;&amp;#8220;Training language models to follow instructions with human feedback&amp;#8221;&lt;/a&gt;:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Starting with a set of labeler-written prompts and prompts submitted through the OpenAI &lt;span class="caps"&gt;API&lt;/span&gt;, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models&amp;nbsp;InstructGPT.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;It was back to painstaking human labelling of data again, getting humans to write desired outputs and label toxic content to train the model on. Through this process of &lt;strong&gt;reinforcement learning from human feedback&lt;/strong&gt; (&lt;span class="caps"&gt;RLHF&lt;/span&gt;), InstructGPT was&amp;nbsp;born.&lt;/p&gt;
&lt;p&gt;&lt;span class="caps"&gt;RLHF&lt;/span&gt; was necessary to adjust the model parameters so that prompts like &amp;#8220;explain &amp;#8230;&amp;#8221; were treated as instructions to follow, rather than as starting text for the model to simply continue&amp;nbsp;from.&lt;/p&gt;
&lt;h2&gt;Data cleaning and&amp;nbsp;labelling&lt;/h2&gt;
&lt;p&gt;Prompt injections would continue to be an issue, but in the meantime OpenAI could address toxic content by first cleaning up the dataset to remove toxic, low-quality content and add other high-quality data&amp;nbsp;sources.&lt;/p&gt;
&lt;p&gt;This need for novel data sources still drives frontier machine learning labs today; they pay for high-quality data they can use to train their&amp;nbsp;models.&lt;/p&gt;
&lt;h2&gt;Creating a chat&amp;nbsp;assistant&lt;/h2&gt;
&lt;p&gt;InstructGPT was ready to take instructions. But &amp;#8230; how do we get instructions from the user? How do we pass the responses back to them? The model was trained, the &lt;span class="caps"&gt;API&lt;/span&gt; was ready &amp;#8230; but OpenAI needed a graphical interface, a familiar mental model of interaction that the public could use&amp;nbsp;intuitively.&lt;/p&gt;
&lt;p&gt;One already existed: chat apps like WhatsApp were popular at the time, and users intuitively understood a chat input when they saw one. But how could OpenAI get InstructGPT to respond reliably like a chat assistant with a consistent personality and&amp;nbsp;style?&lt;/p&gt;
&lt;p&gt;It turned out the answer was already in the training&amp;nbsp;data.&lt;/p&gt;
&lt;h2&gt;Prompt&amp;nbsp;framing&lt;/h2&gt;
&lt;p&gt;There was a lot of training data in the form of interviews, movie scripts, things that look&amp;nbsp;like:&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;Alice: Why do cats like to jump on furniture?&lt;br /&gt;
Bob:&amp;nbsp;&amp;#8230;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;p&gt;And in many cases, arranging the user&amp;#8217;s question along with a system prompt like so was enough to have the &lt;span class="caps"&gt;LLM&lt;/span&gt; roleplay a helpful&amp;nbsp;assistant:&lt;/p&gt;
&lt;div class="codehilite"&gt;&lt;pre&gt;&lt;span&gt;&lt;/span&gt;&lt;code&gt;# System Prompt

You are ChatGPT, a large language model trained by OpenAI. [...]
Knowledge cutoff: 2024-06
Current date: 2025-09-03

Personality: Engage warmly yet honestly with the user. [...]

User: &amp;lt;user&amp;#39;s input&amp;gt;
Assistant: 
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;Pass the above prompt to InstructGPT and it helpfully follows the pattern, demonstrating &lt;span class="caps"&gt;GPT&lt;/span&gt;-3&amp;#8217;s capabilities token after token, until it reaches a stop token. The program then takes the tokens generated after the prompt and displays them to the&amp;nbsp;user.&lt;/p&gt;
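&lt;p&gt;The framing step described above can be sketched in a few lines of Python. This is only an illustration: the function names, the shortened system prompt, and the stand-in model are all hypothetical, and a real system would call an actual &lt;span class="caps"&gt;LLM&lt;/span&gt; where the stand-in is used.&lt;/p&gt;

```python
# Hypothetical template; the wording and names below are illustrative only.
SYSTEM_PROMPT = "You are a helpful assistant."

def build_prompt(user_input):
    """Frame the user's input as a dialogue for the model to continue."""
    return f"{SYSTEM_PROMPT}\n\nUser: {user_input}\nAssistant: "

def extract_reply(prompt, full_text):
    """Keep only the text generated after the prompt."""
    return full_text[len(prompt):].strip()

def fake_model(prompt):
    """Stand-in for an LLM: returns the prompt plus a canned continuation."""
    return prompt + "Cats jump on furniture to get a better view."

prompt = build_prompt("Why do cats like to jump on furniture?")
reply = extract_reply(prompt, fake_model(prompt))
print(reply)
```

&lt;p&gt;The user only ever sees the extracted reply; the system prompt and the dialogue framing stay behind the scenes.&lt;/p&gt;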
&lt;p&gt;What if the output is toxic, hallucinatory, or otherwise unacceptable? Back to &lt;span class="caps"&gt;RLHF&lt;/span&gt;&amp;nbsp;again.&lt;/p&gt;
&lt;h2&gt;The ChatGPT&amp;nbsp;wrapper&lt;/h2&gt;
&lt;p&gt;Even with the &lt;span class="caps"&gt;API&lt;/span&gt; in place, some window dressing is still needed. The &lt;span class="caps"&gt;LLM&lt;/span&gt;, being a language model, can only generate text, not format it. Most LLMs are &lt;span class="caps"&gt;RLHF&lt;/span&gt;-trained to generate text in a markup format (such as &lt;span class="caps"&gt;HTML&lt;/span&gt; or Markdown). The display system takes the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s output, interprets the markup, and displays it as something the user can understand, making headers bold and larger, adding bullets or numbers to lists, formatting code accordingly, and so&amp;nbsp;on.&lt;/p&gt;
&lt;p&gt;The wrapper can also do some helpful things, like filter the &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;#8217;s output for harmful text and block it from appearing, as a kind of last-layer defence against offensive output. Add a login screen, a way for users to access past chats, a few other niceties&amp;nbsp;&amp;#8230;&lt;/p&gt;
&lt;p&gt;Finally, &lt;a href="https://openai.com/index/chatgpt/"&gt;OpenAI launched ChatGPT in November 2022&lt;/a&gt;. And the world as we knew it changed&amp;nbsp;forever.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Through reinforcement learning from human feedback (&lt;span class="caps"&gt;RLHF&lt;/span&gt;), the &lt;span class="caps"&gt;LLM&lt;/span&gt; is trained on labelled data until it can reliably follow instructions, avoid harmful output, and exhibit other desired behavior. A system prompt provides guidelines for output. The user&amp;#8217;s prompt is inserted into a templated prompt and passed to the &lt;span class="caps"&gt;LLM&lt;/span&gt;, which generates text in a markup format that a display system can understand. A chat interface wraps the entire system to create the illusion of a responsive&amp;nbsp;chatbot.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;ChatGPT was the beginning of many other features to follow. Among them: multimodal models, and tool calls. The former is easy to understand, so let&amp;#8217;s unpack how &lt;span class="caps"&gt;LLM&lt;/span&gt; tools work in the next&amp;nbsp;issue.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue175.html"&gt;Issue 175: &lt;span class="caps"&gt;LLM&lt;/span&gt;&amp;nbsp;tools&lt;/a&gt;&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 173: Training, Inference, and Scaling</title><link href="https://ngjunsiang.github.io/laymansguide/issue173.html" rel="alternate"></link><published>2026-06-29T08:00:00+08:00</published><updated>2026-06-29T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-06-29:/laymansguide/issue173.html</id><summary type="html">&lt;p&gt;OpenAI discovered, through models &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 to &lt;span class="caps"&gt;GPT&lt;/span&gt;-3, that scaling compute and (training) data &lt;em&gt;alone&lt;/em&gt; was sufficient to sharply increase the capabilities of a &lt;span class="caps"&gt;LLM&lt;/span&gt;: the transformer architecture and unsupervised learning together resulted in a model that was alarmingly&amp;nbsp;intelligent.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; A model does not see letters or words, only tokens. These tokens are typically generated from user input through a pre-tokenizer program. Tokens are represented in the model as embeddings, a sequence of numbers representing the token&amp;#8217;s position in the embedding matrix. The model uses each token&amp;#8217;s embedding, and its surrounding tokens, to infer its meaning in&amp;nbsp;context.&lt;/p&gt;
&lt;h2&gt;Model&amp;nbsp;Training&lt;/h2&gt;
&lt;p&gt;In issue 171, I explained a little about how model training&amp;nbsp;happens:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;we pass tokens generated from text to the&amp;nbsp;input&lt;/li&gt;
&lt;li&gt;we pass the expected output (in supervised training), or the subsequent tokens (in unsupervised&amp;nbsp;training)&lt;/li&gt;
&lt;li&gt;the model generates output from&amp;nbsp;input&lt;/li&gt;
&lt;li&gt;we compare the model&amp;#8217;s output to the expected&amp;nbsp;output&lt;/li&gt;
&lt;li&gt;we adjust model&amp;nbsp;parameters&lt;/li&gt;
&lt;li&gt;we repeat from step 3, attempting to adjust parameters to have the model generate output that is closer to the expected&amp;nbsp;output&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Notice that there&amp;#8217;s a &amp;#8220;forward&amp;#8221; step: step 3, where the input &amp;#8220;feeds forward&amp;#8221; to each hidden layer. Here, the computer calculates the values for the next layer based on the values of the previous layer and on the model&amp;#8217;s parameters between the two layers. This is repeated for each layer until we get to the&amp;nbsp;output.&lt;/p&gt;
&lt;p&gt;Notice also that there&amp;#8217;s a &amp;#8220;backward&amp;#8221; step: step 5, where we could adjust model parameters randomly—inefficient! Instead, the mathematical technique of gradient descent gives us a more optimized way to adjust the last hidden layer based on how it would affect the output. The second-to-last hidden layer is then adjusted with the same technique, based on how it would affect the last hidden layer. And this is repeated all the way to the first hidden layer. This &amp;#8220;backward trickling&amp;#8221; is called &lt;strong&gt;backpropagation&lt;/strong&gt;, or &amp;#8220;backprop&amp;#8221; more&amp;nbsp;informally.&lt;/p&gt;
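&lt;p&gt;The forward and backward steps can be sketched with a toy model that has just one parameter, so there are no hidden layers to propagate through. This is a minimal Python illustration of gradient descent under made-up data and a hand-picked learning rate, not how a real &lt;span class="caps"&gt;LLM&lt;/span&gt; is trained.&lt;/p&gt;

```python
# Toy data: the expected output is always twice the input.
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

w = 0.0               # the single model parameter, starting from a bad guess
learning_rate = 0.05  # how big each adjustment is (chosen by hand here)

for epoch in range(100):
    for x, expected in pairs:
        output = w * x                 # forward step
        error = output - expected      # compare with the expected output
        gradient = 2 * error * x       # slope of the squared error w.r.t. w
        w -= learning_rate * gradient  # adjust the parameter downhill

print(round(w, 3))
```

&lt;p&gt;After enough repetitions the parameter settles very close to 2, the value that makes the output match the expected output for every pair.&lt;/p&gt;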
&lt;p&gt;The above steps are repeated &lt;em&gt;for each input:output data pair&lt;/em&gt; (supervised training) or &lt;em&gt;for each token sequence run&lt;/em&gt; (unsupervised training). That&amp;#8217;s &lt;strong&gt;a lot&lt;/strong&gt; of repeated steps; researchers often have some shortcuts they take to speed up the process. Even then, it is still too many for a typical &lt;span class="caps"&gt;CPU&lt;/span&gt; to complete in a reasonable time; the big labs use specialized GPUs instead (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue123.html"&gt;Issue 123&lt;/a&gt;), resulting in training runs that take weeks to months to complete on multiple GPUs for today&amp;#8217;s state-of-the-art&amp;nbsp;LLMs.&lt;/p&gt;
&lt;p&gt;This is not a cheap&amp;nbsp;hobby.&lt;/p&gt;
&lt;h2&gt;Inference&lt;/h2&gt;
&lt;p&gt;Fortunately, using a model is a different affair, involving only steps 1 and 3 of the above. No backpropagation, no repeated runs. Just pass the input in, run one forward step per output token, repeat until done. This process is called &lt;strong&gt;inference&lt;/strong&gt;, and is what happens when we users send a request to ChatGPT or&amp;nbsp;Claude.&lt;/p&gt;
&lt;p&gt;(Hang on, how does a model &amp;#8220;know&amp;#8221; when it is &amp;#8220;done generating text&amp;#8221;? In model training, a special token,&amp;nbsp;e.g. &lt;code&gt;&amp;lt;EOS&amp;gt;&lt;/code&gt; for end-of-sequence, is inserted at the end of each text, so the model learns to generate it when a response is complete. When the program detects this token in the model&amp;#8217;s output, it stops invoking the&amp;nbsp;model.)&lt;/p&gt;
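&lt;p&gt;The inference loop and its stopping rule can be sketched as follows. The lookup table is a hypothetical stand-in for a trained model, and the plain string used as the end-of-sequence marker is an assumption made for illustration.&lt;/p&gt;

```python
END_OF_SEQUENCE = "EOS"  # hypothetical marker standing in for the special token

# Stand-in for a trained model: maps the last token to the next one.
NEXT_TOKEN = {"The": "cat", "cat": "sat", "sat": "down", "down": END_OF_SEQUENCE}

def toy_model(tokens):
    """One forward step: look at the tokens so far, return the next token."""
    return NEXT_TOKEN[tokens[-1]]

def generate(prompt_tokens, max_tokens=20):
    tokens = list(prompt_tokens)
    for _ in range(max_tokens):      # safety cap in case the marker never appears
        next_token = toy_model(tokens)
        if next_token == END_OF_SEQUENCE:
            break                    # the program stops invoking the model
        tokens.append(next_token)
    return tokens

result = generate(["The"])
print(result)
```

&lt;p&gt;Each pass through the loop is one forward step producing one token; nothing in the model is adjusted along the way.&lt;/p&gt;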
&lt;p&gt;Needless to say, inference is much cheaper than training, which is why we are able to enjoy many of these models for&amp;nbsp;free.&lt;/p&gt;
&lt;h2&gt;Scaling up to &lt;span class="caps"&gt;GPT&lt;/span&gt;-2&lt;/h2&gt;
&lt;p&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-1 had 117 million parameters, was trained on ~7,000 books (about &lt;span class="caps"&gt;5GB&lt;/span&gt;), took about a month to complete training on 8 GPUs, costing $0.5 mil or&amp;nbsp;less.&lt;/p&gt;
&lt;p&gt;In Feb 2019, OpenAI announced &lt;span class="caps"&gt;GPT&lt;/span&gt;-2, the first large language model to capture some public attention; the full model followed &lt;a href="https://openai.com/index/gpt-2-1-5b-release/"&gt;in Nov 2019&lt;/a&gt;. &lt;span class="caps"&gt;GPT&lt;/span&gt;-2 had 1.5 billion parameters (1.5B), was trained on ~&lt;span class="caps"&gt;40GB&lt;/span&gt; of text from the web, and took a few weeks to train on hundreds of GPUs, costing OpenAI $1 mil to $5 mil to&amp;nbsp;train.&lt;/p&gt;
&lt;p&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-2 used the same architecture as &lt;span class="caps"&gt;GPT&lt;/span&gt;-1, only with a larger model (roughly tenfold) and more training data (eightfold). What they got was a model&amp;nbsp;that:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;could perform tasks it was never explicitly trained on (zero-shot learning): answer questions, understand text, summarize, translate&amp;nbsp;(rudimentarily)&lt;/li&gt;
&lt;li&gt;could generalize from examples given in user input (one-shot/few-shot learning) without needing supervised&amp;nbsp;learning&lt;/li&gt;
&lt;li&gt;showed emerging ability on non-language tasks: counting, basic arithmetic, even some attempts at simple&amp;nbsp;proofs&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;These are capabilities we take for granted today, but in early 2019 this was cutting-edge performance never demonstrated by any other machine learning model, and certainly not with so little human supervision. This discovery was scary enough that it took OpenAI nine months to fully release &lt;span class="caps"&gt;GPT&lt;/span&gt;-2&amp;#8217;s weights, fearing how its capabilities might be misused. The Verge reported: &amp;#8220;&lt;a href="https://www.theverge.com/2019/11/7/20953040/openai-text-generation-ai-gpt-2-full-model-release-1-5b-parameters"&gt;OpenAI has published the text-generating &lt;span class="caps"&gt;AI&lt;/span&gt; it said was too dangerous to share&lt;/a&gt;&amp;#8221;, but fortunately in the same article &amp;#8220;the lab says it&amp;#8217;s seen &amp;#8216;no strong evidence of misuse so&amp;nbsp;far&amp;#8217;&amp;#8221;.&lt;/p&gt;
&lt;h2&gt;The bitter lesson, and &lt;span class="caps"&gt;GPT&lt;/span&gt;-3&lt;/h2&gt;
&lt;p&gt;These findings prompted Rich Sutton, an influential machine learning researcher, to write &lt;a href="http://www.incompleteideas.net/IncIdeas/BitterLesson.html"&gt;a blog post published on 13 March 2019&lt;/a&gt; where he summed up this finding in a single sentence: &amp;#8220;The bitter lesson is that general methods that leverage computation are ultimately the most effective, and by a large margin.&amp;#8221; Elaborating, he adds &amp;#8220;seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of&amp;nbsp;computation.&amp;#8221;&lt;/p&gt;
&lt;p&gt;A tenfold increase in model parameters and training data led to a surprising leap in capability. OpenAI and other researchers wondered: What if we pushed this to its logical conclusion, and threw more compute and more data into machine learning&amp;nbsp;training?&lt;/p&gt;
&lt;p&gt;In Jun 2020, &lt;a href="https://web.archive.org/web/20200611150951/https://openai.com/blog/openai-api/"&gt;OpenAI released &lt;span class="caps"&gt;GPT&lt;/span&gt;-3&lt;/a&gt;, available through their web &lt;span class="caps"&gt;API&lt;/span&gt; (&lt;a href="https://ngjunsiang.github.io/laymansguide/issue004.html"&gt;Issue 4&lt;/a&gt;). &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 had 175 billion parameters (175B, a hundredfold increase in model size), was trained on a mix of books and websites totalling 300 billion tokens, took weeks to train on hundreds of GPUs, and cost OpenAI up to $12 mil to&amp;nbsp;train.&lt;/p&gt;
&lt;p&gt;&lt;span class="caps"&gt;GPT&lt;/span&gt;-3&amp;nbsp;could:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;take instructions given in natural&amp;nbsp;language&lt;/li&gt;
&lt;li&gt;&lt;em&gt;reliably&lt;/em&gt; tackle many tasks zero-shot (with no&amp;nbsp;examples)&lt;/li&gt;
&lt;li&gt;&lt;em&gt;reliably&lt;/em&gt; adapt to examples given in the user input, and generalize from&amp;nbsp;patterns&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;It had reached a level of capability that took the focus away from &lt;em&gt;training data&lt;/em&gt; and placed it on the user input, called the &lt;strong&gt;prompt&lt;/strong&gt;: without further training, the model could give you a response, the quality of which depended on the quality of your&amp;nbsp;prompt.&lt;/p&gt;
&lt;h2&gt;Alarming&amp;nbsp;behavior&lt;/h2&gt;
&lt;p&gt;LLMs had finally reached a point where they were easy enough for the general public to use. But before one could actually launch for public use, there were some concerns to be&amp;nbsp;addressed.&lt;/p&gt;
&lt;p&gt;For one, &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 was extremely prone to hallucinations—making up things that never happened, papers that were never written, academic journals that never existed. It also readily reproduced toxic outputs from its data source—the internet (especially reddit and 4chan). It was extremely steerable through the prompt—a little too steerable for OpenAI&amp;#8217;s liking, when some users got &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 to leak its system prompt—the instructions that OpenAI prepended to every request guiding &lt;span class="caps"&gt;GPT&lt;/span&gt;-3&amp;#8217;s response style and&amp;nbsp;guardrails.&lt;/p&gt;
&lt;p&gt;It would be some time before ChatGPT could even launch without dragging OpenAI down with&amp;nbsp;it.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; OpenAI discovered, through models &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 to &lt;span class="caps"&gt;GPT&lt;/span&gt;-3, that scaling compute and (training) data &lt;em&gt;alone&lt;/em&gt; was sufficient to sharply increase the capabilities of a &lt;span class="caps"&gt;LLM&lt;/span&gt;: the transformer architecture and unsupervised learning together resulted in a model that was alarmingly&amp;nbsp;intelligent.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;We are getting closer to the LLMs we know and love/hate today. 
This issue covered the miracle story of GPTs 1 to 3. If &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 was a child genius, ChatGPT is &lt;span class="caps"&gt;GPT&lt;/span&gt;-3 dressed up for work. Let&amp;#8217;s talk about what OpenAI had to do to it for public release—next&amp;nbsp;issue.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue174.html"&gt;Issue 174: Reinforcement&amp;nbsp;Learning&lt;/a&gt;&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 172: Tokens, the currency of LLMs</title><link href="https://ngjunsiang.github.io/laymansguide/issue172.html" rel="alternate"></link><published>2026-06-22T08:00:00+08:00</published><updated>2026-06-22T08:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-06-22:/laymansguide/issue172.html</id><summary type="html">&lt;p&gt;A model does not see letters or words, only tokens. These tokens are typically generated from user input through a pre-tokenizer program. Tokens are represented in the model as embeddings, a sequence of numbers representing the token&amp;#8217;s position in the embedding matrix. The model uses each token&amp;#8217;s embedding, and its surrounding tokens, to infer its meaning in&amp;nbsp;context.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; The Transformer architecture, unlike previous machine learning model architectures, could generate its next item while processing all previous items at the same time. The technique of unsupervised learning trained models on unlabelled data, letting the model pick up patterns in underlying data instead of having it learn correct answers only, and was much faster than supervised learning. OpenAI applied both these ideas at scale, producing &lt;span class="caps"&gt;GPT&lt;/span&gt;-1, a model that beat best-performing models while requiring relatively little human supervision during&amp;nbsp;training.&lt;/p&gt;
&lt;p&gt;Wait—what exactly does a large language model (&lt;span class="caps"&gt;LLM&lt;/span&gt;) work with? Individual letters? Entire words? No, they work&amp;nbsp;with—&lt;/p&gt;
&lt;h2&gt;Tokens&lt;/h2&gt;
&lt;p&gt;Tokens are clusters of letters that make up the training data. The large language model (&lt;span class="caps"&gt;LLM&lt;/span&gt;) does not &amp;#8220;see&amp;#8221; letters or words, only &lt;strong&gt;tokens&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Tokens are &amp;#8230; quite unlike phonemes, syllables, or other word-fragments you and I are familiar with. They are typically generated by a separate program (not a model), based on letter-clusters that appear most frequently in the&amp;nbsp;data.&lt;/p&gt;
&lt;p&gt;For example, using &lt;a href="https://platform.openai.com/tokenizer"&gt;OpenAI&amp;#8217;s Tokenizer tool&lt;/a&gt; to visualize the above paragraph gives us&amp;nbsp;this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="OpenAI Tokenizer tool - text view" src="https://ngjunsiang.github.io/laymansguide/tokenizer-text.png" /&gt;&lt;br /&gt;
&lt;em&gt;OpenAI Tokenizer - text view&lt;/em&gt;  &lt;br /&gt;
&lt;img alt="OpenAI Tokenizer tool - token ID view" src="https://ngjunsiang.github.io/laymansguide/tokenizer-ids.png" /&gt;&lt;br /&gt;
&lt;em&gt;OpenAI Tokenizer - token &lt;span class="caps"&gt;ID&lt;/span&gt;&amp;nbsp;view&lt;/em&gt;    &lt;/p&gt;
&lt;p&gt;There is little human-discernible pattern as to what definitively constitutes a token: it could be a single punctuation mark, a letter or two (and sometimes including their preceding space, sometimes not), or an entire&amp;nbsp;word.&lt;/p&gt;
&lt;p&gt;Whatever the case, what we see&amp;nbsp;as &lt;code&gt;" you and I"&lt;/code&gt;, an &lt;span class="caps"&gt;LLM&lt;/span&gt; sees&amp;nbsp;as &lt;code&gt;[481, 326, 357]&lt;/code&gt;. A pre-tokenizer program tokenizes all input into numerical&amp;nbsp;values.&lt;/p&gt;
&lt;p&gt;Now you understand a little better why ChatGPT struggles to count Rs in &amp;#8220;strawberry&amp;#8221;, or in any other fruit&amp;nbsp;really.&lt;/p&gt;
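&lt;p&gt;To make the text-to-numbers step concrete, here is a toy tokenizer in Python. It uses a simple greedy longest-match rule, which is a simplification and not the algorithm production tokenizers use. The three-entry vocabulary is hypothetical: the IDs reuse the numbers from the example above, but which ID belongs to which piece is an assumption.&lt;/p&gt;

```python
# Hypothetical vocabulary. The IDs reuse the numbers from the example above,
# but which ID belongs to which piece is an assumption made for illustration.
VOCAB = {" you": 481, " and": 326, " I": 357}

def tokenize(text):
    """Greedy longest-match: repeatedly take the longest known piece."""
    ids = []
    position = 0
    while position != len(text):
        for end in range(len(text), position, -1):
            piece = text[position:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                position = end
                break
        else:
            raise ValueError(f"no token for {text[position:]!r}")
    return ids

token_ids = tokenize(" you and I")
print(token_ids)
```

&lt;p&gt;Once the text has become a list of numbers, the individual letters are no longer visible to anything downstream.&lt;/p&gt;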
&lt;h2&gt;Embeddings&lt;/h2&gt;
&lt;p&gt;How does the model&amp;nbsp;tell &lt;code&gt;481&lt;/code&gt;, &lt;code&gt;326&lt;/code&gt;,&amp;nbsp;and &lt;code&gt;357&lt;/code&gt; apart? How does it store or represent them within itself? Here, I am going to need you to use your imagination. You are familiar with the concept of a scatter plot, yes? A graph that looks like&amp;nbsp;this:&lt;/p&gt;
&lt;p&gt;&lt;img alt="A scatterplot with 2 dimensions" src="https://ngjunsiang.github.io/laymansguide/scatter-plot.png" /&gt;&lt;br /&gt;
&lt;em&gt;A scatterplot with 2 dimensions&lt;/em&gt;&lt;br /&gt;
Source: &lt;a href="https://www.embeddedsource.de/use-a-scatterplot-to-interpret-data/"&gt;EmbeddedSource&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Now imagine a scatterplot with as many data points as tokens. In &lt;span class="caps"&gt;GPT&lt;/span&gt;-1&amp;#8217;s case, that&amp;#8217;s approx. 40,000 tokens—its vocabulary size. Yes, I know that&amp;#8217;s a lot of points, but you can &lt;em&gt;roughly&lt;/em&gt; visualize that, yes? Good, that&amp;#8217;s the easy&amp;nbsp;part.&lt;/p&gt;
&lt;p&gt;Now I need you to imagine the scatterplot with &amp;#8230; &lt;em&gt;*checks notes*&lt;/em&gt;—768 dimensions. No, that is not a typo, we &lt;em&gt;are&lt;/em&gt; talking about a scatterplot with 768 dimensions. Oh, that&amp;#8217;s too difficult to imagine? Yeah. Sorry, that&amp;#8217;s why I don&amp;#8217;t have an image attached. Just try your best&amp;nbsp;🙏&lt;/p&gt;
&lt;p&gt;Essentially that is what an &lt;span class="caps"&gt;LLM&lt;/span&gt; generates as a result of its training. Each token in its vocabulary becomes a data point, and each data point is represented in this 768-dimensional space using 768 decimal numbers, each of which can be positive or negative. This positional representation using many decimal numbers is called an &lt;strong&gt;embedding&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;Other uses for&amp;nbsp;embeddings&lt;/h2&gt;
&lt;p&gt;Embeddings are also not a new idea: they precede &lt;span class="caps"&gt;GPT&lt;/span&gt; by decades, having been conceptualized as early as the&amp;nbsp;1980s.&lt;/p&gt;
&lt;p&gt;Because they&amp;#8217;re such a handy and intuitive mathematical way to represent or visualize tokens and semantics, they&amp;#8217;re also used often in semantic search engines (which try to infer what you &lt;em&gt;mean&lt;/em&gt; instead of what you &lt;em&gt;said&lt;/em&gt;), recommendation engines (suggesting similar things based on what you bought or liked), relevance scoring,&amp;nbsp;etc.&lt;/p&gt;
&lt;h2&gt;How an &lt;span class="caps"&gt;LLM&lt;/span&gt; represents&amp;nbsp;semantics&lt;/h2&gt;
&lt;p&gt;There&amp;#8217;s more to an &lt;span class="caps"&gt;LLM&lt;/span&gt; than this collection of 40,000 embeddings; it forms only a fraction of the entire model (roughly a quarter of &lt;span class="caps"&gt;GPT&lt;/span&gt;-1&amp;#8217;s parameters, and a much smaller share in larger models). But it is critical to how the &lt;span class="caps"&gt;LLM&lt;/span&gt; &amp;#8220;learns&amp;#8221; information from the text. Based on where the tokens appear relative to each other in the text, and the higher-order patterns that the model detects through its hidden layers, the model adjusts the embedding for each token, placing semantically similar ones closer to each other and dissimilar tokens farther away from each&amp;nbsp;other.&lt;/p&gt;
&lt;p&gt;And because this is a mathematical space with direction (in 768 dimensions), the model can also pick up on analogy to some extent: if you draw a (768-dimensional) arrow&amp;nbsp;pointing &lt;code&gt;king → queen&lt;/code&gt; and another arrow&amp;nbsp;pointing &lt;code&gt;father → mother&lt;/code&gt; within this embedding space, they end up almost parallel. This means the model can solve &lt;span class="caps"&gt;SAT&lt;/span&gt; vocab pairs, giving you &amp;#8220;mother&amp;#8221; when you give it &amp;#8220;king:queen,&amp;nbsp;father:?&amp;#8221;&lt;/p&gt;
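&lt;p&gt;The parallel-arrows idea can be tried out with made-up numbers. The sketch below uses hypothetical 2-dimensional embeddings (not values from any real model) and solves the analogy by following the same arrow from a new starting point, then picking the nearest remaining word.&lt;/p&gt;

```python
import math

# Made-up 2-dimensional embeddings; real models use hundreds of dimensions.
EMBEDDINGS = {
    "king":   (0.9, 0.8),
    "queen":  (0.9, 0.2),
    "father": (0.5, 0.8),
    "mother": (0.5, 0.2),
    "apple":  (0.1, 0.5),
}

def solve_analogy(a, b, c):
    """a is to b as c is to ?  Follow the a-to-b arrow, starting from c."""
    arrow = [q - p for p, q in zip(EMBEDDINGS[a], EMBEDDINGS[b])]
    target = [x + d for x, d in zip(EMBEDDINGS[c], arrow)]
    candidates = [w for w in EMBEDDINGS if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(EMBEDDINGS[w], target))

answer = solve_analogy("king", "queen", "father")
print(answer)
```

&lt;p&gt;With these toy values the two arrows are exactly parallel; in a trained model they are only approximately so, which is why the nearest word is picked instead of an exact match.&lt;/p&gt;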
&lt;p&gt;If an &lt;span class="caps"&gt;LLM&lt;/span&gt; relied only on this embedding matrix, it would not be able to distinguish &amp;#8220;bat&amp;#8221; as a warm flying mammal from &amp;#8220;bat&amp;#8221; as a piece of sporting equipment. The rest of the model—using the Transformer architecture, you&amp;#8217;ll recall from &lt;a href="https://ngjunsiang.github.io/laymansguide/issue171.html"&gt;issue 171&lt;/a&gt;—uses the tokens surrounding it and their positions to infer the context that &amp;#8220;bat&amp;#8221; is being used&amp;nbsp;in.&lt;/p&gt;
&lt;h2&gt;Model pricing and&amp;nbsp;limits&lt;/h2&gt;
&lt;p&gt;Most ChatGPT/Claude users are familiar with those products as subscriptions, where they pay a certain price per month to use ChatGPT/Claude for some arbitrary amount, and if they use too much too quickly they hit a usage limit and have to wait for it to&amp;nbsp;reset.&lt;/p&gt;
&lt;p&gt;But if you are a business, and using the &lt;span class="caps"&gt;API&lt;/span&gt; instead, you&amp;#8217;ll be looking at a different page, such as the &lt;a href="https://developers.openai.com/api/docs/pricing"&gt;&lt;span class="caps"&gt;API&lt;/span&gt; pricing page for OpenAI&amp;#8217;s &lt;span class="caps"&gt;API&lt;/span&gt;&lt;/a&gt;. Notice that prices are typically quoted in units of &amp;#8220;1M tokens&amp;#8221;, standing for &amp;#8220;1 million tokens&amp;#8221;. Now you know what those tokens are referring&amp;nbsp;to.&lt;/p&gt;
&lt;p&gt;Likewise, when Anthropic explains how usage and length limits work, and tells you that &amp;#8220;Claude&amp;#8217;s context window is 200K tokens&amp;#8221;, you now know what they are referring to. More importantly, you know it doesn&amp;#8217;t mean 200,000 characters or 200,000&amp;nbsp;words.&lt;/p&gt;
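&lt;p&gt;Because prices are quoted per million tokens, estimating the cost of a request is simple arithmetic. The sketch below uses hypothetical prices, and the split into separate input and output prices is an assumption made for illustration; real figures are on the provider pricing page.&lt;/p&gt;

```python
# Hypothetical prices in dollars per 1 million tokens; check the provider's
# pricing page for real figures.
INPUT_PRICE_PER_MILLION = 2.00
OUTPUT_PRICE_PER_MILLION = 8.00

def request_cost(input_tokens, output_tokens):
    """Cost of one API request, given its token counts."""
    input_cost = input_tokens / 1_000_000 * INPUT_PRICE_PER_MILLION
    output_cost = output_tokens / 1_000_000 * OUTPUT_PRICE_PER_MILLION
    return input_cost + output_cost

cost = request_cost(input_tokens=1_500, output_tokens=500)
print(f"${cost:.4f}")
```

&lt;p&gt;A single request costs a fraction of a cent at these made-up prices; the bill only becomes noticeable across many thousands of requests.&lt;/p&gt;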
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; A model does not see letters or words, only tokens. These tokens are typically generated from user input through a pre-tokenizer program. Tokens are represented in the model as embeddings, a sequence of numbers representing the token&amp;#8217;s position in the embedding matrix. The model uses each token&amp;#8217;s embedding, and its surrounding tokens, to infer its meaning in&amp;nbsp;context.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;I would have gone on longer, but I think tokens are a pretty novel concept for most layfolks and deserve their own issue to sit with and digest before we talk about what a model&amp;nbsp;does.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue173.html"&gt;Issue 173: Training, Inference, and&amp;nbsp;Scaling&lt;/a&gt;&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 171: The first Generative Pre-Training model, GPT-1</title><link href="https://ngjunsiang.github.io/laymansguide/issue171.html" rel="alternate"></link><published>2026-06-15T08:00:00+08:00</published><updated>2026-06-08T16:00:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-06-15:/laymansguide/issue171.html</id><summary type="html">&lt;p&gt;The Transformer architecture, unlike previous machine learning model architectures, could generate its next item while processing all previous items at the same time. The technique of unsupervised learning trained models on unlabelled data, letting the model pick up patterns in underlying data instead of having it learn correct answers only, and was much faster than supervised learning. OpenAI applied both these ideas at scale, producing &lt;span class="caps"&gt;GPT&lt;/span&gt;-1, a model that beat best-performing models while requiring relatively little human supervision during&amp;nbsp;training.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; Models simplify and represent a relationship between input values and output values. The more complex the relationship, the more parameters the model needs to learn. Models are simplifications of reality, and their performance depends on how well they capture underlying patterns in the data, as well as the quality and quantity of the&amp;nbsp;dataset.&lt;/p&gt;
&lt;p&gt;We are going to set aside image and audio models for today, and narrow down to focus on &lt;strong&gt;language models&lt;/strong&gt; in particular, because that&amp;#8217;s what sparked off the &lt;span class="caps"&gt;AI&lt;/span&gt;&amp;nbsp;craze.&lt;/p&gt;
&lt;h2&gt;The pre-2018 machine learning&amp;nbsp;paradigm&lt;/h2&gt;
&lt;p&gt;I am not a machine learning researcher and can&amp;#8217;t tell you what the prevailing &lt;em&gt;research&lt;/em&gt; paradigm at that point was. But in open-source and consumer applications, it seemed machine learning models were &lt;em&gt;bespoke&lt;/em&gt;:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;You started with something specific you needed, like image classification, optical character recognition (&lt;span class="caps"&gt;OCR&lt;/span&gt;), speech recognition, translation,&amp;nbsp;&amp;#8230;&lt;/li&gt;
&lt;li&gt;You collected a &lt;em&gt;huuuuuge&lt;/em&gt; dataset of input-output pairs for that specific task: images and their labels, scanned documents and their text, audio, etc. And by huuuuuge I mean tens of thousands to millions of&amp;nbsp;examples.&lt;/li&gt;
&lt;li&gt;After collecting the data you often had to clean it up (remove duplicates, remove outliers, etc.) and label it (e.g. tag each image with its correct&amp;nbsp;label).&lt;/li&gt;
&lt;li&gt;You then trained a model on part of the dataset, tweaking parameters and trying different architectures (ways of arranging&amp;nbsp;parameters).&lt;/li&gt;
&lt;li&gt;You tested the model on the other part of the dataset, passing each input through the model and comparing the output to the expected output, and measuring how well it&amp;nbsp;performed.&lt;/li&gt;
&lt;li&gt;You repeated steps 4 and 5 until you were satisfied with the model&amp;#8217;s performance, and then you deployed it for&amp;nbsp;use.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This technique of using labelled data to train the model is called &lt;strong&gt;supervised learning&lt;/strong&gt;, because of the need to tweak the model&amp;#8217;s parameters (under human supervision) to match the expected&amp;nbsp;output.&lt;/p&gt;
&lt;p&gt;There were (and still are) many machine learning models trained this way and used. For example, &lt;a href="https://tesseractocr.org/"&gt;tesseract&lt;/a&gt; is an open-source &lt;span class="caps"&gt;OCR&lt;/span&gt; engine that was first released in 2005. It was trained on a dataset of scanned documents and their corresponding text, and has been used in various applications for &lt;span class="caps"&gt;OCR&lt;/span&gt; tasks. Another example is the ResNet architecture for image classification, which was introduced in 2015 and has been widely used for image recognition&amp;nbsp;tasks.&lt;/p&gt;
&lt;h2&gt;The Transformer&amp;nbsp;architecture&lt;/h2&gt;
&lt;p&gt;Before Google&amp;#8217;s 2017 paper on the attention mechanism, the prevailing machine learning models had two problematic&amp;nbsp;limitations:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;they &amp;#8220;looked&amp;#8221; at input data one item at a time to produce the output, resulting in slow output&amp;nbsp;generation&lt;/li&gt;
&lt;li&gt;because of the above, data that was processed earlier seldom made it through to the end of the model, resulting in a recency bias: the model tended to focus on the most recent input data and ignore earlier input&amp;nbsp;data&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The attention mechanism introduced in Google&amp;#8217;s 2017 paper allowed models to &amp;#8220;look&amp;#8221; at all input data at once, speeding up output generation. The same mechanism also computed which parts of the input data were most relevant for producing the&amp;nbsp;output.&lt;/p&gt;
&lt;p&gt;Attention was not a new mechanism in machine learning: prior models had used it, but in separate stages, and alongside other mechanisms. Google&amp;#8217;s paper was the first to ask: &amp;#8220;what if we &lt;em&gt;only used attention everywhere&lt;/em&gt;?&amp;#8221; The resulting architecture, which they called the &amp;#8220;Transformer&amp;#8221;, was a breakthrough in speed and&amp;nbsp;simplicity.&lt;/p&gt;
&lt;h2&gt;Unsupervised&amp;nbsp;learning&lt;/h2&gt;
&lt;p&gt;Besides the Transformer architecture, another breakthrough was already making its rounds: researchers wondered why they needed so many task-specific datasets. Since the data represented different subsets of reality (from different tasks), what if they just trained a single model on a really, really large dataset of text to produce a &lt;strong&gt;base model&lt;/strong&gt;? Then they could fine-tune it on smaller task-specific datasets to produce task-specific&amp;nbsp;models.&lt;/p&gt;
&lt;p&gt;This technique, called &lt;strong&gt;unsupervised learning&lt;/strong&gt;, did not require data to be labelled—the model &amp;#8220;learns&amp;#8221; patterns in the underlying data without human correction, simply trying to predict the next word in the training data given the previous&amp;nbsp;words.&lt;/p&gt;
&lt;h2&gt;Generative Pre-trained Transformer (&lt;span class="caps"&gt;GPT&lt;/span&gt;)&lt;/h2&gt;
&lt;p&gt;A few researchers at OpenAI then had the idea to try this pre-training approach on the Transformer architecture. OpenAI built the first &lt;strong&gt;Generative Pre-trained Transformer&lt;/strong&gt; (&lt;span class="caps"&gt;GPT&lt;/span&gt;) model, which they released in 2018. &lt;strong&gt;Generative&lt;/strong&gt; means the model generates output based on input, producing one output item at a time (but processing all inputs simultaneously). &lt;strong&gt;Pre-trained&lt;/strong&gt; means the model was largely trained through unsupervised learning. &lt;strong&gt;Transformer&lt;/strong&gt; refers to the underlying&amp;nbsp;architecture.&lt;/p&gt;
&lt;p&gt;They went &lt;em&gt;big&lt;/em&gt; on scale: &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 was trained on a dataset of 7,000 self-published books comprising 985 million words, representing this data using 117 million parameters, an unheard-of scale at the time (but now considered paltry). It attracted attention from the research community not only by improving on the best-performing models on individual language tasks, but by doing so on &lt;em&gt;9 of the 12 tasks studied&lt;/em&gt;, with &lt;em&gt;minimal task-specific training&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;Due to the unprecedented number of parameters used, &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 was considered a &lt;strong&gt;large language model&lt;/strong&gt; (&lt;span class="caps"&gt;LLM&lt;/span&gt;), to distinguish it from smaller models that came before. However, this was a research idea, with code that was far from release-ready, and nobody except research-minded folks knew how to get &lt;span class="caps"&gt;GPT&lt;/span&gt;-1 running. And thus, this went unnoticed by the&amp;nbsp;public.&lt;/p&gt;
&lt;p&gt;Still, this was a breakthrough, and one that required resources most labs didn&amp;#8217;t have at the time: 8 GPUs, when most labs ran their training on a single &lt;span class="caps"&gt;GPU&lt;/span&gt;.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; The Transformer architecture, unlike previous machine learning model architectures, could generate its next item while processing all previous items at the same time. The technique of unsupervised learning trained models on unlabelled data, letting the model pick up patterns in underlying data instead of having it learn correct answers only, and was much faster than supervised learning. OpenAI applied both these ideas at scale, producing &lt;span class="caps"&gt;GPT&lt;/span&gt;-1, a model that beat best-performing models while requiring relatively little human supervision during&amp;nbsp;training.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;We&amp;#8217;re almost at the meaty part! I kinda snuck in 2 ideas today: the Transformer architecture (a minor part of this series actually) and unsupervised learning. I don&amp;#8217;t think you would have wanted to wait a week in between before hearing how OpenAI combined the two, haha &amp;#8230; so there you&amp;nbsp;go.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue172.html"&gt;Issue 172: Tokens, the currency of&amp;nbsp;LLMs&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;Wait—what exactly does a large language model (&lt;span class="caps"&gt;LLM&lt;/span&gt;) work with? Individual letters? Entire words? Find out next&amp;nbsp;issue!&lt;/p&gt;</content><category term="Season 14"></category></entry><entry><title>Issue 170: Machine learning models</title><link href="https://ngjunsiang.github.io/laymansguide/issue170.html" rel="alternate"></link><published>2026-06-08T11:30:00+08:00</published><updated>2026-06-08T11:30:00+08:00</updated><author><name>J S Ng</name></author><id>tag:ngjunsiang.github.io,2026-06-08:/laymansguide/issue170.html</id><summary type="html">&lt;p&gt;Models simplify and represent a relationship between input values and output values. The more complex the relationship, the more parameters the model needs to learn. Models are simplifications of reality, and their performance depends on how well they capture underlying patterns in the data, as well as the quality and quantity of the&amp;nbsp;dataset.&lt;/p&gt;</summary><content type="html">&lt;p&gt;&lt;strong&gt;Previously:&lt;/strong&gt; By better understanding how search bots categorise pages, a website owner can use keywords and other techniques to optimise the ranking of their page for specific search&amp;nbsp;terms.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;[Editor&amp;#8217;s Note]&lt;/strong&gt; Layman&amp;#8217;s Guide to Computing went on hiatus after its 13th season, because my promise when I began was to write only things widespread enough that I thought layfolks should have an accessible-yet-useful introduction&amp;nbsp;to.&lt;/p&gt;
&lt;p&gt;As I wrapped up Season 13 in 2022, the trend at that time was cloud computing. I tackled emulation and virtualization in Season 12, then the internet and online services in Season 13. ChatGPT launched in November that year. In 2024, I was first asked if I would resume Layman&amp;#8217;s Guide to write about &lt;span class="caps"&gt;AI&lt;/span&gt;. I said no; less than half my colleagues were using ChatGPT or had heard of it, and I didn&amp;#8217;t think there would be enough common knowledge for me to usefully write about &lt;span class="caps"&gt;AI&lt;/span&gt;&amp;nbsp;yet.&lt;/p&gt;
&lt;p&gt;But now, in 2026, even my employers are actively promoting genAI, my students are using ChatGPT, and by the end of this year it will likely be difficult to find someone who hasn&amp;#8217;t heard of Claude Code or Gemini Pro or Codex. I suppose it&amp;#8217;s time to add one more&amp;nbsp;season.&lt;/p&gt;
&lt;p&gt;There are many explainers out there; I&amp;#8217;ve read a large number of them, many very good! But this is Layman&amp;#8217;s Guide to Computing, and something I noticed talking to laypeople is confusion: where did this &lt;span class="caps"&gt;AI&lt;/span&gt; come from? Why hadn&amp;#8217;t it been invented earlier? How does it work? What can it do? What can&amp;#8217;t it&amp;nbsp;do?&lt;/p&gt;
&lt;p&gt;So let&amp;#8217;s rewind time: I started writing Layman&amp;#8217;s Guide to Computing in 2018. A year before that, eight machine learning engineers at Google had published &amp;#8220;&lt;em&gt;Attention is All You Need&lt;/em&gt;,&amp;#8221; the paper that introduced the transformer architecture that underpins most of today&amp;#8217;s genAI. In mid-2018, before I started writing, OpenAI was still a non-profit research lab, co-founded by Elon Musk and Sam Altman among others, and had just released the first version of &lt;span class="caps"&gt;GPT&lt;/span&gt;, a language model that was not yet large enough to generate coherent text. Building on Google&amp;#8217;s paper, OpenAI described the architecture and training process for &lt;span class="caps"&gt;GPT&lt;/span&gt; in a paper of its own, &amp;#8220;&lt;em&gt;Improving Language Understanding by Generative Pre-Training&lt;/em&gt;&amp;#8221;.&lt;/p&gt;
&lt;p&gt;It&amp;#8217;s a little hard to mentally reconstruct the tech culture and public awareness of the field of artificial intelligence and machine learning at that point in time. So let&amp;#8217;s start by understanding: what is a model? How were they used&amp;nbsp;then?&lt;/p&gt;
&lt;h2&gt;Models&lt;/h2&gt;
&lt;p&gt;You may not know it, but you were already using models in your daily life in 2017. When the iPhone launched in 2007, it had intelligent autocorrect and touch auto-adjustment features. For these features to work, Apple had to train machine learning models on large datasets of text and touch interactions. These models were then deployed on the iPhone to provide the autocorrect and touch adjustment&amp;nbsp;functionality.&lt;/p&gt;
&lt;p&gt;What are these models? You would likely have used them in a stats course, perhaps even in high school. If you were ever asked to sketch a best-fit line, a trendline, or a linear regression, you were already drawing a model. To do that,&amp;nbsp;you:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Hypothesized a linear relationship between an input&amp;nbsp;variable &lt;code&gt;x&lt;/code&gt; and an output&amp;nbsp;variable &lt;code&gt;y&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Collected data points&amp;nbsp;(&lt;code&gt;x&lt;/code&gt;, &lt;code&gt;y&lt;/code&gt;) through an&amp;nbsp;experiment.&lt;/li&gt;
&lt;li&gt;Represented the relationship&amp;nbsp;between &lt;code&gt;x&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; using a mathematical formula&amp;nbsp;(&lt;code&gt;y = mx + b&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;Determined the&amp;nbsp;parameters &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt; that best fit the data&amp;nbsp;points.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;You &lt;em&gt;compressed&lt;/em&gt; the data—multiple sets of points (which we call a &lt;strong&gt;dataset&lt;/strong&gt;)—into two &lt;strong&gt;parameters&lt;/strong&gt;, &lt;code&gt;m&lt;/code&gt; and &lt;code&gt;b&lt;/code&gt;, a simpler representation that captures the underlying relationship. This representation is a &lt;strong&gt;model&lt;/strong&gt;. (We sometimes call it a mental model when we don&amp;#8217;t have it formally represented as a mathematical relationship, just a conceptual&amp;nbsp;description.)&lt;/p&gt;
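&lt;p&gt;The four steps above can be sketched in a few lines of Python. This is a minimal illustration using the standard least-squares formulas; the data points and the function name &lt;code&gt;fit_line&lt;/code&gt; are made up for the&amp;nbsp;example.&lt;/p&gt;

```python
def fit_line(xs, ys):
    """Least-squares best-fit line: returns the parameters (m, b) of y = mx + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # How x and y vary together, divided by how much x varies on its own.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# A made-up dataset: five (x, y) measurements from an imaginary experiment.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 8.8, 11.0]

m, b = fit_line(xs, ys)
print(round(m, 2), round(b, 2))  # 1.97 1.09

# The model can now estimate y for an x it has never seen.
print(round(m * 6 + b, 2))
```

&lt;p&gt;Ten numbers went in, and two parameters came out: that is the compression at&amp;nbsp;work.&lt;/p&gt;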
&lt;p&gt;Apple&amp;#8217;s machine learning models do something similar. An autocorrect model takes a dataset of incorrect words/phrases and the words/phrases that were actually intended, and compresses it into a text correction model. A touch auto-adjustment model takes a dataset of touch interactions and their intended targets, and compresses it into a model that can predict the intended touch target based on the touch&amp;nbsp;input.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; A model takes in input values and produces output values based on patterns it has learned from training&amp;nbsp;data.&lt;/p&gt;
&lt;h2&gt;More complex&amp;nbsp;models&lt;/h2&gt;
&lt;p&gt;Of course, more complex models do not use a linear equation or a simple mathematical formula anymore. Machine learning researchers first represent more complex relationships using more complex formulas, such as polynomials or decision trees, which use more&amp;nbsp;parameters.&lt;/p&gt;
&lt;p&gt;But for other purposes, neither the input nor the output may be a single variable. For example, in image recognition, the input is an image (which can be represented as a grid of pixel values), and the output is a label (e.g., &amp;#8220;cat&amp;#8221;, &amp;#8220;dog&amp;#8221;, &amp;#8220;car&amp;#8221;). An image classifier may have 64 input values (one for each pixel in an 8×8 image) and 10 output values (one for each possible label). The model would learn to map the input pixel values to the correct label based on patterns in the training data. That&amp;#8217;s 640 parameters (64 input values × 10 output values) that the model would learn to adjust during&amp;nbsp;training.&lt;/p&gt;
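&lt;p&gt;Here is a rough sketch of what such a direct input-to-output mapping could look like. The weights are random rather than trained, and the names &lt;code&gt;score_labels&lt;/code&gt; and &lt;code&gt;predict&lt;/code&gt; are invented for this example; the point is only to show where the 640 parameters&amp;nbsp;live.&lt;/p&gt;

```python
import random

NUM_PIXELS = 8 * 8   # 64 input values, one per pixel
NUM_LABELS = 10      # 10 output values, one per possible label

random.seed(0)
# One weight for every (label, pixel) pair. These would normally be learned
# during training; random values stand in for them here.
weights = [[random.uniform(-1, 1) for _ in range(NUM_PIXELS)]
           for _ in range(NUM_LABELS)]


def score_labels(pixels):
    """Each label's score is a weighted sum of all 64 pixel values."""
    return [sum(w * p for w, p in zip(row, pixels)) for row in weights]


def predict(pixels):
    """Pick the index of the label with the highest score."""
    scores = score_labels(pixels)
    return scores.index(max(scores))


num_parameters = sum(len(row) for row in weights)
print(num_parameters)  # 640

image = [random.random() for _ in range(NUM_PIXELS)]  # a fake 8x8 image
print(predict(image))
```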
&lt;p&gt;This direct mapping of input to output can only take us so far. Perhaps output 1 doesn&amp;#8217;t just depend on inputs 1 to 10, but on some intermediate value calculated from them. Now we have to add intermediate &lt;strong&gt;layers&lt;/strong&gt; between input and output, which researchers call &amp;#8220;hidden layers&amp;#8221;. These layers allow the model to learn and represent more complex relationships between input and output. Each layer can have its own parameters, and the model learns to adjust these parameters during training to improve its&amp;nbsp;performance.&lt;/p&gt;
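&lt;p&gt;Adding a hidden layer changes the parameter count. As a sketch, assume one hidden layer of 16 intermediate values (the number 16 is picked arbitrarily for illustration) and, as in the text above, count only the connections between values. The helper &lt;code&gt;count_weights&lt;/code&gt; is made up for this&amp;nbsp;example.&lt;/p&gt;

```python
def count_weights(layer_sizes):
    """Count one parameter per connection between neighbouring layers."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))


direct = count_weights([64, 10])           # input straight to output
with_hidden = count_weights([64, 16, 10])  # input, hidden layer, output

print(direct)       # 640
print(with_hidden)  # 1184, i.e. 64 * 16 + 16 * 10
```

&lt;p&gt;Each extra layer adds its own block of parameters, which is how model sizes climb so&amp;nbsp;quickly.&lt;/p&gt;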
&lt;p&gt;&lt;strong&gt;tl;dr&lt;/strong&gt; More complex models use more parameters to represent the relationship between input values, intermediate values, and output values. Each parameter represents a relationship between two values. The more parameters, the more complex the relationships the model can&amp;nbsp;learn.&lt;/p&gt;
&lt;h2&gt;Limitations of&amp;nbsp;models&lt;/h2&gt;
&lt;p&gt;Models sound like mathematical dark magic, and often feel like it too. But like the mathematical models we learned in school, they have&amp;nbsp;limitations.&lt;/p&gt;
&lt;p&gt;If you&amp;#8217;ve seen how far some of your data points deviate from your best-fit line or trendline, you already know that the model cannot accurately represent all the data points—it is only a simplification. Likewise, all machine learning models are simplifications of&amp;nbsp;reality.&lt;/p&gt;
&lt;p&gt;Their performance depends on how well they capture underlying patterns in the data: pick an inappropriate representation for the relationship, e.g. a linear formula instead of a polynomial, and the model will perform&amp;nbsp;poorly.&lt;/p&gt;
&lt;p&gt;It is also possible to go to the other extreme: choosing a complex model with many parameters that fits the training data perfectly but does not predict other data points well, known as an overfitted model. You can have a computer come up with a sine-decay formula that fits your first 6 data points perfectly, but wildly overshoots a 7th data&amp;nbsp;point.&lt;/p&gt;
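&lt;p&gt;A small sketch of overfitting, with one substitution stated plainly: instead of a sine-decay formula, it uses a degree-5 polynomial, because a polynomial with 6 parameters can always pass through 6 points exactly. The data points are made up and roughly follow &lt;code&gt;y = 2x + 1&lt;/code&gt;, so a sensible guess for the 7th point is about&amp;nbsp;13.&lt;/p&gt;

```python
def fit_line(xs, ys):
    """Simple model: least-squares line with 2 parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    return m, mean_y - m * mean_x


def interpolate(xs, ys, x_new):
    """Complex model: the polynomial that passes exactly through every point
    (Lagrange interpolation), evaluated at x_new."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total


# Six slightly noisy measurements of a roughly straight-line relationship.
xs = [0, 1, 2, 3, 4, 5]
ys = [1.0, 3.2, 4.9, 7.1, 8.8, 11.3]

m, b = fit_line(xs, ys)
line_guess = m * 6 + b               # prediction for a 7th point at x = 6
poly_guess = interpolate(xs, ys, 6)  # same prediction from the perfect-fit curve

print(round(line_guess, 2), round(poly_guess, 2))  # 13.1 22.5
```

&lt;p&gt;The polynomial matches all 6 training points perfectly, yet its guess for the 7th is far off, because it has memorised the noise along with the&amp;nbsp;pattern.&lt;/p&gt;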
&lt;p&gt;Also, their performance depends on the quality and quantity of the dataset. If your data does not represent the underlying reality well enough, missing important patterns or exceptions, or not covering a sufficient variety of cases, the model can pick out the wrong features and learn the wrong patterns. In the early days of machine learning, some researchers found that when training image classifiers on images of dogs and cats, the model began identifying any brown creature sitting on grass as a dog, because the training dataset had many images of dogs sitting on grass, but few images of cats sitting on grass. The model had learned to associate grass with dogs, which was not the intended&amp;nbsp;pattern.&lt;/p&gt;
&lt;p&gt;&lt;strong&gt;Issue summary:&lt;/strong&gt; Models simplify and represent a relationship between input values and output values. The more complex the relationship, the more parameters the model needs to learn. Models are simplifications of reality, and their performance depends on how well they capture underlying patterns in the data, as well as the quality and quantity of the&amp;nbsp;dataset.&lt;/p&gt;
&lt;hr /&gt;
&lt;p&gt;After experiencing the magic of ChatGPT and other genAI tools, it&amp;#8217;s easy to forget, or perhaps not even realise, that fundamentally they are powered by the same underlying principles that we apply in simpler&amp;nbsp;experiments.&lt;/p&gt;
&lt;p&gt;But&amp;nbsp;between &lt;code&gt;y = mx + b&lt;/code&gt; and ChatGPT, there is still &amp;#8230; such a huge gulf of complexity. We still have quite a way to&amp;nbsp;go.&lt;/p&gt;
&lt;h2&gt;What I’ll be covering&amp;nbsp;next&lt;/h2&gt;
&lt;p&gt;&lt;strong&gt;Next issue:&lt;/strong&gt; &lt;a href="https://ngjunsiang.github.io/laymansguide/issue171.html"&gt;Issue 171: The first Generative Pre-Training model, &lt;span class="caps"&gt;GPT&lt;/span&gt;-1&lt;/a&gt;&lt;/p&gt;
&lt;p&gt;What was the fundamental insight that made &lt;span class="caps"&gt;GPT&lt;/span&gt; and other LLMs possible? Find out next season&amp;nbsp;;)&lt;/p&gt;</content><category term="Season 14"></category></entry></feed>