Layman's Guide to Computing - Season 05

Issue 65: Memory Sharing in the Operating System

2020-03-28T17:12:00+08:00

Previously: Meltdown and Spectre require the programs executing them to have access to kernel memory space. Kernel address isolation attempts to prevent the program from even having access to the kernel address space in the first place. TLB flushing changes the virtual-to-physical memory mapping, disrupting Spectre’s reliance on a consistent virtual-to-physical memory mapping.

One question that makes sense to ask is: if the operating system is supposed to keep the memory used by each program separate, then how is one program able to access the memory of another program? How would a program trying to mount a Meltdown or Spectre attack be able to read the memory of any other program, let alone the operating system?

Let’s face it: it is impossible to completely separate programs from each other. Many programs need to communicate with each other: antivirus software needs to be able to scan the addresses accessed by your web browser for harmful links, Office applications need to be able to send data to each other (especially for features like Mail Merge), and of course your task manager has to know how many resources every app is using, so that it can show you this:

Task Manager in Windows 10
You can reveal the shared memory column by right-clicking on the column labels and then “Select Columns”.

or this:

System Monitor in KDE (Linux).

What is this shared memory?

Private memory

The memory I talked about earlier, which every software application has, is used to store various things: temporary information such as unsaved data, application settings, and graphics resources (every icon and image shown in the application has to come from somewhere …), but most importantly, libraries and other functions (Issue 17).

Very few software developers will write every single bit of code used by their program; often, they will use software libraries written by others to provide specialised functions (e.g. encrypting your data, or accessing a database). When program code is compiled into CPU instructions (Issue 54), these libraries of course have to be compiled and bundled up as well.

That makes the program really huge, doesn’t it? Yes, it does; it is one reason (but not the main reason) that mobile apps, especially Android apps, have become so bloated over the last half-decade or so. But I digress.

Shared memory

At some point, you start to realise that many of these apps need to use a set of identical functions: at the most basic level, requesting and managing memory, requesting file access, sending data over a network, …, and up to libraries for resizing images, and so on.

It doesn’t make sense for each app to have to bundle its own libraries for that! So the OS actually provides a set of common libraries that applications compiled for that OS can use. Each operating system bundles its own libraries for applications to use; this is one reason why applications compiled for Windows won’t work on macOS or Linux, and vice versa. That also means that these libraries have to be loaded into a part of memory that is accessible to all applications. These shared libraries thus go into shared memory space.

What else? Shared libraries can’t be taking up so much space by themselves, they’re just instructions …

Let’s try to find out what else is sitting in there.

Investigating memory details

On Windows, I’m going to need more specialised tools. I’ve only got an hour; let’s try something else.

Ah! System Monitor actually reveals more details about how an application uses memory. Let’s investigate the top few processes using the most memory.

Here’s Firefox:

Firefox detailed memory usage in KDE (Linux).

Oops, too much detail. Here’s the gist:

Firefox uses about 450 MB for its own stuff in private memory, in a place called the heap.
To communicate with other processes, it uses about 10 MB privately, and 82 MB shared with other processes (it does so through /SYSV00000000, which is deleted when not in use)
It has loaded one of its core libraries, libxul.so (almost all libraries start with the prefix lib) in shared space. This core library is shared with other Mozilla applications, such as its Thunderbird email client, so it makes sense to put it mostly in shared memory.
It uses a small amount of space for caching things (startup code, its own scripts, etc)
It uses some shared memory to communicate with other processes. (The acronym IPC in this context usually refers to inter-process communication.) This can be for playing audio/video (it has to communicate with the audio/video drivers), or loading content that has to be processed through plugins (used to be Flash content in the past, now it can be other things).

Hmm, interesting. Let’s try to find something more illuminating to wrap up this season with.

How is shared memory used?

I do my newsletter writing mainly in an app called Atom, made by GitHub. Atom runs on a platform called Electron (atom … electron … get it?). Electron is a GitHub project that allows developers to write desktop/laptop apps in JavaScript, traditionally the language of web scripting.

In System Monitor, I can see an app named atom, and one named electron. Let’s inspect them both.

Electron detailed memory usage in KDE (Linux).

Atom detailed memory usage in KDE (Linux).

We can see that:

Both apps are sharing the electron library (it does not have a lib prefix, but it is stored in the /usr/lib directory which is where libraries go)
They both use a bunch of shared libraries: libicu* for Unicode support, libc* & libstd* for standard operating system functions (reading/writing files, etc), libgtk* for user interface management, fontconfig for fonts, etc
Some libraries are still loaded privately, and both programs still have a heap for their own data which is not meant to be accessible to other programs
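
To make the private/shared split a little more concrete: on Linux, each process has a text file at /proc/<pid>/smaps that lists every memory mapping together with fields such as Shared_Clean and Private_Dirty. The sketch below is only an illustration. I am assuming that tools like System Monitor get their figures from a similar source, the sample text (including the library name libexample.so and all the numbers) is made up rather than taken from the screenshots, and summarise_mappings is a hypothetical helper name.

```python
import re
from collections import defaultdict

# Made-up sample in the style of /proc/<pid>/smaps (most fields omitted).
SAMPLE_SMAPS = """\
55d0c0000000-55d0c0100000 rw-p 00000000 00:00 0                          [heap]
Shared_Clean:          0 kB
Shared_Dirty:          0 kB
Private_Clean:         0 kB
Private_Dirty:      1024 kB
7f2c4a000000-7f2c4a200000 r-xp 00000000 08:01 1234                       /usr/lib/libexample.so
Shared_Clean:       2048 kB
Shared_Dirty:          0 kB
Private_Clean:        64 kB
Private_Dirty:         0 kB
"""

# A mapping header starts with an address range such as "55d0...-55d0...".
HEADER = re.compile(r"^[0-9a-f]+-[0-9a-f]+\s+\S+\s+\S+\s+\S+\s+\S+\s*(.*)$")
# Only the four shared/private counters are of interest here.
FIELD = re.compile(r"^(Shared|Private)_(?:Clean|Dirty):\s+(\d+) kB$")


def summarise_mappings(smaps_text):
    """Return {mapping name: {"Shared": kB, "Private": kB}}."""
    totals = defaultdict(lambda: {"Shared": 0, "Private": 0})
    current = None
    for line in smaps_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1) or "[anonymous]"
            continue
        field = FIELD.match(line)
        if field and current is not None:
            totals[current][field.group(1)] += int(field.group(2))
    return {name: dict(counts) for name, counts in totals.items()}


if __name__ == "__main__":
    for name, counts in summarise_mappings(SAMPLE_SMAPS).items():
        print(name, counts)
```

In this invented sample, the heap is entirely private, while most of the library sits in memory that other processes can share.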

You can see why the application memory usage shown in Task Manager/System Monitor doesn’t always tally with the total memory usage. The figure for each application usually includes both its private memory and the shared memory it uses, so a shared library used by several applications gets counted once for each of them, and the per-application figures add up to a number greater than the total memory usage.
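
A small worked example of that mismatch, using invented numbers (the app names are borrowed from above, but the sizes in MB are purely hypothetical, as are the function names):

```python
def summed_app_usage(apps, shared_blocks):
    """Add up what each app reports: its private memory plus every shared block it uses."""
    return sum(
        app["private"] + sum(shared_blocks[name] for name in app["uses"])
        for app in apps.values()
    )


def actual_usage(apps, shared_blocks):
    """Count each shared block only once, no matter how many apps use it."""
    used = {name for app in apps.values() for name in app["uses"]}
    private_total = sum(app["private"] for app in apps.values())
    return private_total + sum(shared_blocks[name] for name in used)


# Hypothetical sizes in MB.
shared_blocks = {"electron library": 100}
apps = {
    "atom": {"private": 200, "uses": ["electron library"]},
    "electron": {"private": 50, "uses": ["electron library"]},
}

if __name__ == "__main__":
    print(summed_app_usage(apps, shared_blocks))  # per-app figures added together
    print(actual_usage(apps, shared_blocks))      # memory really occupied
```

With these numbers the per-app figures add up to 450 MB, while only 350 MB is actually occupied, because the 100 MB shared block was counted twice.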

Issue summary: Shared memory helps to reduce the amount of memory needed by all the applications running on an operating system. It also allows applications to send data to each other, and to communicate.

Long issue, I hope the images make up for it. Computers in the early days didn’t share memory so easily, and that made things really inconvenient. They often had to communicate through one application writing data to a file, and then having the other application reading the data from that file. Slow, and often unreliable. Shared memory evolved as a way to make that process easier.
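
For the curious, here is a minimal sketch of two parties exchanging data through a named block of shared memory instead of a file. It uses Python's standard multiprocessing.shared_memory module (Python 3.8 or later). To keep it short, both the "writer" and the "reader" live in the same process here, which is my simplification; in real use the reader would be a different program that only knows the block's name. The function name round_trip is hypothetical.

```python
from multiprocessing import shared_memory


def round_trip(message: bytes) -> bytes:
    """Write a non-empty message into a shared block, then read it back via the block's name."""
    writer = shared_memory.SharedMemory(create=True, size=len(message))
    try:
        writer.buf[:len(message)] = message
        # A second handle attaches to the same block purely by name,
        # the way another process would.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            return bytes(reader.buf[:len(message)])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # free the block once nobody needs it


if __name__ == "__main__":
    print(round_trip(b"hello from shared memory"))
```

No file is written to disk at any point; the data goes from one handle to the other through memory that both can see.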

But shared memory, improperly secured and managed, is also how vulnerabilities like Meltdown and Spectre are made possible, and how malware can do what it does. It’s a double-edged sword.

What I’ll be covering next

Next issue: [LMG S6] Issue 66: Before the Cloud

Memory is one of those topics where I think laypeople and engineers have a completely different picture in their heads. I hope this issue has clarified that picture somewhat. It still won’t be completely clear until I can talk about heaps, but I won’t do that until I figure out how to simplify it.

Meanwhile, the newsletter must go on! I’ve finished Season 5, having explained how computers improve performance through reordering instructions (Out-of-Order Processing) and running instructions ahead of time if it thinks they will be needed (Speculative Execution). Both of these processes use the cache, which is controlled by the CPU hardware directly, not by the operating system. And through an esoteric loophole that exploits timing differences in cache access (cache hit = fast, cache miss = slow), an attacker is able to leak data out from protected kernel memory through the cache.

After this detour, it’s time to rewind back to where I stopped in Season 3: with networks and the Internet. I went through data types in Season 4 to talk about what complex documents are (because the web is made up of a series of complex documents). Then I laid out a CPU exploit in Season 5, to show you how data can be leaked inadvertently.

Now I’m ready to tell you more about how the current online advertising model became what it is today, and why it is so bad for privacy. You are going to learn a lot more about how ads really work, how advertisers track your online activity, and how they ensnare many companies (especially the big publishers) into a kind of self-reinforcing scheme that lets them target their content more effectively while also letting advertisers improve their targeting.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 64: Fixing Meltdown and Spectre

2020-03-14T08:00:00+08:00

Previously: For Meltdown and Spectre to work, they need two things: (1) Permission to carry out instructions (i.e. run programs) on the OS, and (2) knowledge of where the kernel address space is.

Last week, I explained two key limitations of Meltdown and Spectre that are needed for an attack to be successfully carried out. Hackers getting permission they shouldn’t have is not a security flaw related to Meltdown and Spectre, so that really belongs in a different season of Layman’s Guide.

So we’ll focus on problem 2: protecting access to the kernel address space, which is set aside for the OS’s use. The kernel address space contains key information, such as user privilege tables and OS state, which a hacker can compromise to gain higher-level permissions, or to find out where the memory address space of a certain program, such as the customer information database, is located.

Protecting the kernel address space

One common way of protecting knowledge of where the kernel address space (i.e. the “HQ”) is located is to keep changing its location. For example, newer versions of Linux randomise the location of the kernel address space at each computer bootup, to make it harder for an attacker to guess.

It is still possible for the attacker to slowly probe which parts of the address space it can access, and which parts it can’t, and make a guess where the kernel address space is; I will not go into detail about these various methods.

But do you see the bigger problem? Programs are actually allowed to request an address in the kernel address space. The OS checks its permission tables before it tells the program whether it is allowed to access that space. The only thing preventing program access to that space is an OS permission check.

In contrast, if a program tried to request memory address -56 or 2^65, the OS wouldn’t even need to check. Negative memory addresses are obviously invalid, as are memory addresses longer than 64-bit (which wouldn’t even be able to be sent).
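
The difference between the two kinds of rejection can be sketched like this. It is a toy model, not how any real OS is written: KERNEL_START is a made-up boundary, and check_request is a hypothetical name.

```python
ADDRESS_BITS = 64
KERNEL_START = 2 ** 63  # hypothetical boundary: everything above it is "kernel" in this toy model


def check_request(address, is_privileged=False):
    """Classify a memory request as 'invalid', 'denied' or 'allowed'."""
    if not (0 <= address < 2 ** ADDRESS_BITS):
        # Such an address cannot exist at all; no permission table is consulted.
        return "invalid"
    if address >= KERNEL_START and not is_privileged:
        # A perfectly valid address; only the permission check stands in the way.
        return "denied"
    return "allowed"


if __name__ == "__main__":
    for address in (-56, 2 ** 65, KERNEL_START + 4096, 2354476):
        print(address, check_request(address))
```

The point of the contrast: an invalid address can never be reached, while a kernel address is reachable in principle and is protected only by the check.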

Mitigating Meltdown: Kernel address isolation

One fix that has been merged into the Linux kernel since 2017 is KAISER, which aims to prevent programs from even having access to kernel address space. Similar patches have been released for Windows and macOS as well¹. Under this patch, two sets of address spaces are maintained.

The first set is the same as before: it is essentially the entire address space. But now, only the kernel (the “core” of the OS) has access to it. The second set contains the entire address space used by programs, excluding kernel address space. This way, programs running with user permissions will not even be able to get data from the kernel address space. It’s like trying to get to a room that doesn’t exist (to the program).
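
A heavily simplified sketch of the "two sets" idea, using plain dictionaries to stand in for the address mappings. The addresses and contents are invented, and build_views and load are hypothetical names; the real patch works on CPU page tables, not Python dictionaries.

```python
def build_views(full_mapping, kernel_addresses):
    """Return (kernel_view, user_view); the user view simply lacks the kernel entries."""
    kernel_view = dict(full_mapping)
    user_view = {
        address: contents
        for address, contents in full_mapping.items()
        if address not in kernel_addresses
    }
    return kernel_view, user_view


def load(view, address):
    """Look an address up in a view; unmapped addresses do not exist as far as that view goes."""
    if address not in view:
        raise LookupError(f"address {address:#x} is not mapped in this view")
    return view[address]


# Invented example data.
FULL_MAPPING = {0x1000: "program code", 0x2000: "program data", 0x9000: "privilege table"}
KERNEL_ADDRESSES = {0x9000}

if __name__ == "__main__":
    kernel_view, user_view = build_views(FULL_MAPPING, KERNEL_ADDRESSES)
    print(load(kernel_view, 0x9000))
    try:
        load(user_view, 0x9000)
    except LookupError as error:
        print("user program:", error)
```

In the user view there is no permission check to race against, because there is simply nothing at that address to load.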

Having to keep switching between two sets of pages when executing instructions from both kernel programs as well as user programs is, of course, going to make things take longer than usual. Up to 20% longer for some instructions.

This primarily mitigates the impact of Meltdown, which attempts to access the kernel address space before it gets caught and an exception is raised in the program. But it does not do anything for Spectre, which speculatively executes two possible outcomes where the code meets a decision point, but later discards the outcome which is not needed.

Crash course: Translation Lookaside Buffer

One concept to cover before we get to the Spectre mitigation. In Issue 55, I talked about how the virtual address space allows programs to access data from different parts of the computer: USB devices, hard drives, network, sound card, and of course not forgetting the physical memory itself.

How does the CPU know that virtual address 2354476 is actually pointing to physical memory address 3564241? It doesn’t. This mapping is stored in the CPU, within the memory management unit. Like all mappings (remember the CPU cache, and the DNS cache from Issue 39?), the lookup process can be greatly sped up with a cache. The part of the CPU that caches virtual-to-physical memory mappings is called the Translation Lookaside Buffer, or TLB.
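
As a rough sketch of the idea, here is a toy TLB that remembers translations it has already looked up. It reuses the two example addresses from the paragraph above; everything else (the class name ToyTLB, the counters) is made up for illustration, and a real TLB maps whole pages rather than individual addresses.

```python
class ToyTLB:
    """Cache of virtual-to-physical translations in front of a slower full mapping."""

    def __init__(self, full_mapping):
        self.full_mapping = full_mapping  # the complete mapping (slow to consult)
        self.cache = {}                   # translations seen so far (fast)
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address):
        if virtual_address in self.cache:
            self.hits += 1
        else:
            # Miss: do the slow lookup once, then remember the answer.
            self.misses += 1
            self.cache[virtual_address] = self.full_mapping[virtual_address]
        return self.cache[virtual_address]


if __name__ == "__main__":
    tlb = ToyTLB({2354476: 3564241})
    print(tlb.translate(2354476), tlb.translate(2354476))
    print("hits:", tlb.hits, "misses:", tlb.misses)
```

The first translation is a miss and the second is a hit, which is the whole point of having the buffer.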

A key requirement for Spectre to work is for the Translation Lookaside Buffer to remain unchanged, so that it is getting data from the same part of (kernel address space) memory.

Mitigating Spectre: TLB flushing

Naturally, one way to mitigate Spectre is to keep flushing the TLB. As can be expected whenever you flush a cache, lookups will cause a cache miss and result in the CPU memory management unit having to figure out the mapping all over again, leading to slowdown.
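
The cost of flushing can be illustrated by counting slow lookups in a toy model. The numbers and names here (count_slow_lookups, flush_every) are invented for the sketch and say nothing about real-world timings.

```python
def count_slow_lookups(requests, full_mapping, flush_every=None):
    """Count how many requests miss a toy TLB that is emptied every `flush_every` requests."""
    tlb = {}
    slow = 0
    for index, virtual_address in enumerate(requests):
        if flush_every and index > 0 and index % flush_every == 0:
            tlb.clear()  # the flush: every remembered translation is forgotten
        if virtual_address not in tlb:
            slow += 1    # the mapping has to be worked out all over again
            tlb[virtual_address] = full_mapping[virtual_address]
    return slow


# Invented mapping and request pattern: the same two addresses, over and over.
FULL_MAPPING = {1: 10, 2: 20}
REQUESTS = [1, 2, 1, 2, 1, 2, 1, 2]

if __name__ == "__main__":
    print(count_slow_lookups(REQUESTS, FULL_MAPPING))                 # never flushed
    print(count_slow_lookups(REQUESTS, FULL_MAPPING, flush_every=4))  # flushed occasionally
    print(count_slow_lookups(REQUESTS, FULL_MAPPING, flush_every=2))  # flushed constantly
```

With this pattern, never flushing costs 2 slow lookups, flushing every 4 requests costs 4, and flushing every 2 requests makes all 8 lookups slow.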

Some performance/security features that are being worked on for processors include selective TLB flushing (flushing only some parts of it but not all), or learning to identify when it should be flushed.

Last words on Meltdown and Spectre

I lied in the title of this issue: there is no fix. These are only mitigations, which can reduce the impact of these attacks, but not prevent them completely.

The dismal conclusion you might not have drawn is that there is little we can do to protect ourselves against such vulnerabilities, besides keeping our OS patched and up to date, and not leaving our computers running continuously for too long (the location of kernel address space is only randomised upon bootup).

The good news is: Meltdown and Spectre are a lot of work. No cases of them being used in the real world have been reported as of yet, and hackers are unlikely to go to this much effort to attack consumers; targets of their attack will probably be database servers of bigger companies.

Still, the origin of these exploits stemmed from an earlier time when our collective focus was on faster and faster CPUs. In the early ’00s, we didn’t hear CPUs being touted as “safe” or “secure”, just “fast”. Neither did we see a need for secure CPUs.

It was only with the explosion of the mobile internet after 2007 that the market became a lucrative target. By the time the hacking tools became widespread, CPUs had already incorporated so many features to speed up processing at the cost of security.

Perhaps it is time for us to reassess the situation, make the judgement call to ask for greater hardware security, and take the bitter pill of performance tradeoff, and then wait for the CPU manufacturers to get the message, if they haven’t already.

Issue summary: Meltdown and Spectre require the programs executing them to have access to kernel memory space. Kernel address isolation attempts to prevent the program from even having access to the kernel address space in the first place. TLB flushing changes the virtual-to-physical memory mapping, disrupting Spectre’s reliance on a consistent virtual-to-physical memory mapping.

Phew, that was quite a bit to type. I am glad to be done talking about Meltdown and Spectre; these are sombre topics, and the more I write about them, the less faith I have in the devices I use.

Funnily enough, I had originally titled this season “Operating Systems and the CPU”. I am obviously far from covering things that I think people should know about their operating system, so that will probably resume in another season.

Since I’ve been talking so much about memory, I think it makes sense to bring in one more topic here to close off the season: how do all those programs share a common memory space?

What I’ll be covering next

Next issue: [LMG S5] Issue 65: Memory Sharing in the Operating System

I have only one issue left and I don’t want to end with a cliffhanger, so I’m going to keep the next issue focused on one question: what is all that memory used for?

¹ Interestingly, these patches went out shortly before Meltdown and Spectre were announced … I won’t speculate about the timing here, you draw your own conclusions.

Issue 63: Limitations of Meltdown and Spectre

2020-03-07T08:00:00+08:00

Previously: To snoop the cache, we:

Flush the cache corresponding to the 256 memory addresses (to get a cache miss when attempting to load the data from memory)
Load the secret value using Meltdown or Spectre attacks (the secret value is only one byte, and cannot be greater than 255, so 256 addresses are sufficient)
Load the memory address from step 1 that corresponds to the value of the secret - this address is now cached, and the next request for it again will result in a cache hit
Request each address and look for the one with a lower request latency

At the heart of the exploit is a cache which, being managed by hardware, is not subject to more fine-grained OS control. This is presumed to be safe by engineers, since you can’t access data directly from hardware easily. But as we have seen in this season, with the right exploit, you can still get to that data, with or without permission.

Just how vulnerable are we to Meltdown and Spectre?

Getting permission

Every exploit relies on one or more things to work before it can do its thing. Meltdown and Spectre require a way to run themselves on the CPU, in the OS. That means a black hat hacker will have to obtain illegal access to the OS, and there are a few common ways to do so:

By cracking a password
If the hash of a password (future season) is leaked, hackers can try to reverse-engineer the original password that led to that hash. This requires A LOT of CPU time, and is often not feasible for properly hashed passwords.
Getting a password from an unsuspecting user
Other people, usually the admins and employees who use the OS, will already have access to it. A black hat hacker can try to get the password from them through phishing, by getting keystroke-logging malware onto a flash drive they use, or simply by posing as a contractor who needs the password for … whatever reason.
Exploiting vulnerabilities
There are many ways to get an OS to carry out instructions it is not supposed to. An improperly secured web app could receive malicious form data from any of its pages that tells the database to return supposedly-secured information. An improperly configured web server could be exploited by sending it more data than it expected. (Just see the number of “buffer overflow” entries on this page.) If it is not properly written to know what to do with this excess data, and naïvely stores it into memory or processes it, that leads to Bad Things Happening.
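
To give a feel for the first route in the list above (working backwards from a leaked password hash), here is a minimal sketch that hashes guesses one by one until one matches. Using plain SHA-256 from Python's hashlib is my assumption to keep the example short; properly hashed passwords use salted, deliberately slow schemes precisely so that this kind of guessing takes far too long. The function name and the sample passwords are invented.

```python
import hashlib


def guess_password(leaked_hash, candidates):
    """Return the first candidate whose SHA-256 hash matches the leaked one, else None."""
    for candidate in candidates:
        digest = hashlib.sha256(candidate.encode("utf-8")).hexdigest()
        if digest == leaked_hash:
            return candidate
    return None


if __name__ == "__main__":
    # Pretend this hash was leaked; the attacker never sees the password itself.
    leaked = hashlib.sha256(b"hunter2").hexdigest()
    print(guess_password(leaked, ["password", "123456", "hunter2"]))
```

The attacker never reverses the hash; they can only keep guessing, which is why the amount of CPU time matters so much.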

Once the black hat hackers find a way to get permission to run things in the OS, they are in lala-land! Not quite. There are different levels of permissions, and the most restrictive ones might not let you run any programs except from a whitelist. At the other end of the privilege spectrum, root accounts let you do pretty much everything and anything. This is why if you are ever asked to be root (or Admin) of a computer (including your router), you should really keep that password in a safe and secure place, such as a password manager.

Knowing where the loot is

During that tiny window of opportunity, the black hat hackers are trying to read data from parts of virtual memory they are not allowed to access. But which are those parts?

Within the physical memory part of the virtual memory space—

Okay, quick unpacking here. Remember that the virtual memory space is where all our devices get an address? Hard drives, USB devices, network interfaces, … and of course, physical memory (also known as “RAM”—yes, the same RAM you usually see on the specs of computers). Programs request data from and send data to these devices by using their virtual memory addresses. Each cell in physical memory also gets an address in virtual memory space.

Within the physical memory part of the virtual memory space, there are portions which are set aside for the OS only. This is the kernel address space, which is where critical information such as user privilege tables and OS state is stored. Knowing the addresses in the kernel address space is a big requirement for many exploits, so OS engineers obviously put a lot of work into making sure they are as hard to guess or discover as possible.

Issue summary: For Meltdown and Spectre to work, they need two things: (1) Permission to carry out instructions (i.e. run programs) on the OS, and (2) knowledge of where the kernel address space is.

Problem (1) has been with us since the operating system was born. Problem (2) is also not new: it’s basically figuring out where the HQ is. Spies have also been doing that since time immemorial. But we are now dealing with a space where humans cannot tread: the virtual memory space. Hackers are sending preprogrammed chunks of compiled code into the computer to sniff out data-loot and get it out, while we are programming computers to try to detect such attempts and warn us about them or stop them outright.

Meanwhile, the processor manufacturers are trying to make everything happen faster. They are, of course, trying to prevent hackers from doing their thing, but it’s hard to do that while also trying to make things go faster. Next issue, I’ll try to show you how many of these fixes (whether complete or incomplete) inevitably lead to lower CPU performance.

What I’ll be covering next

Next issue: [LMG S5] Issue 64: Fixing Meltdown and Spectre

Issue 62: Cache snooping

2020-03-03T17:00:00+08:00

Previously: A cache miss is slow, and a cache hit is fast. This difference in cache reading speed can be used to transmit secrets out from the cache, which cannot be read directly by programs.

Okay, okay, we managed to leak data from memory to the cache, now how do we leak it from the cache to our program?

Cache snooping: tapping on tiles

Here in Singapore, before moving in to a newly built apartment, we have a “ritual” of tapping each ceramic floor tile to check if they have been properly fastened to the ground.

Most people don’t actually know what a fastened or unfastened floor tile sounds like. What we do know is that they sound different.

So we go tok, tok, tok, tok, tok, … tik! Aha, there’s a loosened floor tile!

That’s kind of what we are going to do to the cache. We are going to load information located in memory cells addressed 1 through 256, and see how long each request takes.

Address 1: 135 ns
Address 2: 134 ns
Address 3: 136 ns
Address 4: 134 ns
…
Address 136: 130 ns
Address 137: 66 ns
Address 138: 137 ns
…
Address 256: 135 ns

Can you tell what the secret number is? It’s the one with an obviously lower request latency. In this case, the other addresses didn’t have a copy of their data already in the cache, so they result in a cache miss (Issue 57): the CPU has to go to main memory to read the data again, and that’s slow. Address 137 already had its data loaded before, and a copy of it was already in the cache, so loading it again results in a cache hit and is fast.
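
Picking out the "tik" among the "toks" can be sketched in a few lines. The latencies are the ones listed above (with the elided addresses left out); the 0.75 threshold for "obviously lower" and the function name are my own assumptions.

```python
from statistics import median


def find_cached_address(latencies_ns, ratio=0.75):
    """Return the address whose latency is clearly below the typical one, or None."""
    typical = median(latencies_ns.values())
    fastest = min(latencies_ns, key=latencies_ns.get)
    if latencies_ns[fastest] < ratio * typical:
        return fastest
    return None  # nothing stands out, so no secret can be read off


# The latencies from the list above, in nanoseconds.
SAMPLE = {1: 135, 2: 134, 3: 136, 4: 134, 136: 130, 137: 66, 138: 137, 256: 135}

if __name__ == "__main__":
    print(find_cached_address(SAMPLE))
```

Comparing against the median rather than just taking the minimum avoids declaring a "secret" when all the tiles sound the same.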

Treating memory addresses as data

One key thing to remember here is that each memory address points to a memory “cell”, which only stores one byte (8 bits), with a value that can run from 0 to 255 to give us 256 (i.e. 2^8) different values.

Meltdown or Spectre have gotten the secret number (137), but in that small window of opportunity before it gets terminated, it would not have time to even store it into a text file that we can open later. How could we get that secret number without Meltdown or Spectre storing it?

We can write a snooping program to do the following:

1) Empty the cache cells for memory addresses 1 to 256, so that loading information from them would result in a cache miss.

Then instead of storing the value 137 somewhere, we would get Meltdown/Spectre to load information from memory address 137. A load operation is much faster than a store operation, and Meltdown/Spectre would be able to pull this off within the window of opportunity. This would cause a copy of the information in memory address 137 to be stored in the cache; the next time any program tries to load information from address 137 again, it will be a cache hit (fast).

The snooping program would then:

2) Make requests for information from each of these 256 memory addresses (“tapping on tiles”) and see which request has an obviously lower latency.

3) Determine that memory address 137 has obviously lower request latency, and store the “transmitted” secret: “137”
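
Putting steps 1 to 3 together, here is a toy simulation of the whole trick. Nothing here touches a real CPU cache: ToyCache is a made-up stand-in that merely reports a short latency for addresses it has seen, and the two latency constants echo the figures from earlier in this issue.

```python
MISS_NS, HIT_NS = 135, 66  # illustrative latencies only


class ToyCache:
    """Remembers which addresses have been loaded; loads report a pretend latency."""

    def __init__(self):
        self.cached = set()

    def flush(self, addresses):
        self.cached.difference_update(addresses)

    def load(self, address):
        latency = HIT_NS if address in self.cached else MISS_NS
        self.cached.add(address)
        return latency


def leak_byte(secret, cache):
    """Transmit `secret` through the cache using timing alone, and return what was received."""
    addresses = range(1, 257)
    cache.flush(addresses)                            # step 1: guarantee cache misses
    cache.load(secret)                                # the attack loads address <secret>
    timings = {a: cache.load(a) for a in addresses}   # step 2: tap every tile
    return min(timings, key=timings.get)              # step 3: the fast one is the secret


if __name__ == "__main__":
    print(leak_byte(137, ToyCache()))
```

Notice that the secret value is never stored or returned directly; the snooping side only ever sees timings.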

It’s a lot of work to get a single byte (256 possible values), but computers are good at doing lots of tedious work in a short amount of time. Using sample working code that exploits out-of-order execution (Issue 58) and speculative execution (Issue 60), coupled with a snooping program like the one we described above, the Meltdown and Spectre authors are able to leak data at a rate of about 503 KB/s, which seems slow. But there are 86,400 seconds in a day, so that’s roughly 43 GB/day at full exploit speed! (There are 4 videos of demonstration exploits near the bottom of the Meltdown page.) Malicious actors would probably do it at a slower rate to keep it covert, but in the weeks or months it would take to notice something was amiss with the memory access operations, that’s a lot of data they can siphon off …
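
A quick check of the per-day arithmetic. The 503 KB/s rate used here is my assumption of the peak figure reported for Meltdown, and I am taking 1 GB to mean 1,000,000 KB; the function name is invented.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400


def gb_per_day(rate_kb_per_s):
    """Convert a leak rate in KB/s into GB/day (decimal units)."""
    return rate_kb_per_s * SECONDS_PER_DAY / 1_000_000


if __name__ == "__main__":
    print(round(gb_per_day(503), 1))  # roughly 43 GB/day
```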

We’ve covered quite a bit of technical ground, so I’ll summarise.

Issue summary: To prep the cache, our program empties the cache cells for addresses 1 to 256, so that they are guaranteed to have a cache miss if their information is loaded.

To cache snoop (after Meltdown/Spectre have “delivered the payload”), we load information from memory addresses 1 to 256 and look for the one with an obviously lower request latency (a cache hit). The memory address itself is the value to keep.

Okay, that’s it. Secret is leaked, cat is out of the bag, and now you know how Meltdown and Spectre work, without all the technical detail (like how addresses 1 to 256 need to be in separate pages which are 4 KiB each because the CPU will speculatively load adjacent data from memory, yaddah yaddah).

So what can we do about it? Why hasn’t Intel fixed it after a year? I’m no computer engineer, but I’ll offer some thoughts in the next issue.

What I’ll be covering next

Next issue: [LMG S5] Issue 63: Limitations of Meltdown and Spectre

This isn’t even a fraction of 1% of what happens inside a CPU. It’s hard to convey just how complex CPU design is; no single person can explain in full detail how every part of the CPU works. Much of the design and validation work is already being done by software, but it still takes a human to write the code that does the checking.

Does it surprise you that a little hack like this can get past so many pairs of eyes? It really shouldn’t.

Issue 61: Mapping the cache

2020-02-22T08:00:00+08:00

Previously: Speculative execution is a feature that lets the CPU speed up execution if it correctly predicts a decision point. The CPU carries out the operations along the predicted decision branch and loads the results if it predicts correctly.

Meltdown and Spectre need 2 pieces of the puzzle to leak data, and we have covered the first piece already: How to load the forbidden information into the cache, where it will not be immediately wiped by the OS when we are “found out”.

If we were trying to pull off a Meltdown or Spectre, we would try to:

Set up the request to have the info loaded into the cache
Attempt to read the cache … how?

The second piece of the puzzle, naturally, is how to get the info out of the cache before the CPU eventually evicts old data from it.

Failure from the start

At this point, we would have failed. We have gotten the secret into the cache, but we have no idea where it is in the cache, and we have no way to access the cache directly—remember that the cache is managed by the CPU and there is no instruction we can issue to the CPU to give us cache data directly.

We’ve come so far … and it doesn’t even matter.

We’ll need to modify our approach slightly. We can’t store the leaked data directly in the cache naïvely like that. We’ve got to be a little cleverer.

The cache “mirrors” a part of virtual memory

A quick refresher on how the cache works (Issue 57)):

When the CPU needs data from a memory address, it looks in the cache first.
If the data is not there (a cache miss), it will load the data from the memory address, and store a copy in the cache for faster reference in future. [SLOW]
If there is a cache hit, the data from the cache will be returned. [FAST]
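
The three steps above can be sketched as a toy model. A dictionary plays the role of the cache and another plays the role of memory; the names and the sample data are invented, and a real cache is of course managed by hardware rather than by code like this.

```python
def make_memory_reader(memory):
    """Return a read(address) function that consults a cache before going to memory."""
    cache = {}

    def read(address):
        if address in cache:
            return cache[address], "hit"   # FAST: served from the cache
        value = memory[address]            # SLOW: go all the way to memory
        cache[address] = value             # keep a copy for next time
        return value, "miss"

    return read


if __name__ == "__main__":
    read = make_memory_reader({137: "some data"})
    print(read(137))  # first request
    print(read(137))  # same request again
```

The first request for an address is a miss and the repeat is a hit, even though the data returned is identical both times.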

Hmm … there’s something here. A cache miss is slow, and a cache hit is fast. Could we exploit this in some way, possibly? If we are creative, yes!

Many secret ways of transmitting information involve a shared cipher, a secret way of converting what is sent to what is meant. Leaking cache information will require a cipher of some sort.

It’s like a WWII spy story. Two spies arrange 3 different dropoff locations. Dropoff location 1 means their country is going to attack. Dropoff location 2 means their country is not going to attack. And dropoff location 3 means the information is compromised and they should avoid contact. Even if they are caught by the secret police, there is no way of figuring out what the two spies had communicated to each other indirectly.

All right, I’m writing a newsletter here, not a workshop. And the rest of the story will need more technical detail, so let’s call it a week. Next issue, all will be revealed ;)

Issue summary: A cache miss is slow, and a cache hit is fast. This difference in cache reading speed can be used to transmit secrets out from the cache, which cannot be read directly by programs.

I know, I know, what a cliffhanger! Before you started reading this newsletter, you never thought you’d be waiting with bated breath to hear some technical explanation of how to read data from a CPU cache, huh? Or that you might (in the next issue) find newfound appreciation of an ingenious CPU vulnerability exploit, and just how difficult it would be to fully resolve it.

We are getting close to the big reveal. Same time next week.

What I’ll be covering next

Next issue: [LMG S5] Issue 62: Snooping the cache

I was about to write both the mapping and the snooping in one issue, then I momentarily lost my train of thought and was trying to trace it again. And I realised that if I could lose the train of logic like that, I probably should split it up into two issues. One idea per issue, and I will still try to stick to it. I haven’t been able to write short issues that communicate a single idea, and it feels good to achieve it again.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

Issue 60: CPU Optimisation Part 2 – Speculative Execution and Spectre

2020-02-15T08:00:00+08:00

Previously: A set of instructions can trick a CPU into reordering load instructions so that the data is temporarily loaded into the cache before the instructions are retired. The cache can then be snooped to retrieve the data.

At the heart of the matter is the fact that the OS has no control over the order in which instructions are carried out. Because of this, hackers who understand how the CPU reorders instructions can write malicious code that tricks the CPU into loading precious data into the cache for a fraction of a second, during which they can use cache-snooping techniques to read the data.

Before I go into the details of one cache-snooping technique, I want to outline another way that malicious code can get its targeted data into the cache. This exploits another feature, known as speculative execution.

Speculative execution: the CPU’s way of anticipating

010011011011101101000101 …

Can you predict the next number in the sequence? Kinda tough …

1111111111111111 …

How about now? Easier?

We have all been in that workplace situation where we are waiting on a colleague to make a decision. If they choose A, we have to perform one set of routines. If they choose B, we have to perform another set of routines.

If our past experience with this person tells us that there is no pattern to what they will choose in such a situation, it is very difficult to proceed until they have made their choice. However, if they regularly choose A and occasionally choose B, that’s another story. Especially if they take a long time to make their decision.

To speed up the process, we might just carry out the set of routines for A, wait for them to say “I choose A”, then give them the results—tada! And if they choose B instead, secretly dump the evidence and curse our luck.
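
To make the "guess based on past experience" idea concrete, here is a small sketch that tries to predict each digit of the two sequences above before seeing it. It uses a 2-bit saturating counter, a classic textbook prediction scheme chosen purely for illustration; it is an assumption of this example and not a description of what any particular CPU does. The function name count_correct is made up.

```python
# A 2-bit saturating counter, a classic textbook branch-prediction scheme.
# Used here only as an illustration; real CPUs use more elaborate predictors.

def count_correct(history, counter=2):
    """Count how many outcomes in `history` (a string of '0'/'1') are predicted correctly.

    counter ranges over 0..3; values 2 and 3 mean "predict 1", 0 and 1 mean "predict 0".
    """
    correct = 0
    for outcome in history:
        prediction = "1" if counter >= 2 else "0"
        if prediction == outcome:
            correct += 1
        # Nudge the counter towards what actually happened, staying within 0..3.
        if outcome == "1":
            counter = min(counter + 1, 3)
        else:
            counter = max(counter - 1, 0)
    return correct


mixed = "010011011011101101000101"
steady = "1111111111111111"
print(count_correct(mixed), "of", len(mixed))
print(count_correct(steady), "of", len(steady))
```

The steady sequence is predicted correctly every time, while the mixed one trips up the predictor again and again. That is the colleague who always chooses A versus the colleague with no pattern.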

Another model: the car valet

How would this work with a more concrete example? I could reuse the bank teller model from the Meltdown explanation, but I run the risk of muddling you up since the steps will look very similar. Instead, let’s model a pair of robot car valets. This pair still consists of a robot ALU (arithmetic logic unit) and LSU (load-store unit). The ALU, the brains of the pair, gets the car keys and driver’s license from the customer, and asks the customer for his ID number before asking the LSU to retrieve the vehicle. The LSU, the brawn of the pair, well, just parks or retrieves the vehicle.

Let’s exploit this CPU model to find out what kind of car our secretive neighbour drives. I don’t have my neighbour’s ID, but I do know his ID number (23983698576), and I give it to the CPU. It carries out the following instructions:

GET ID number[23983698576] from customer
Verify if I am the car owner [SLOW]
IF verified, LOAD car of 23983698576 by driving it to retrieval point and pass keys to customer
IF not verified, dump data and start over with the next customer

Sounds fair enough. The ALU finds out I am not my neighbour, and I don’t get to see his car. Awww. But let’s wait and see …

Speculative valeting

10 customers later, the CPU has been processing verified customers only. It goes into speculative execution mode (in a real CPU, of course you can’t disable speculative execution just like that; it is always on). Now the CPU works this way:

[1.] GET ID number[23983698576] from customer
[2a.] LOAD car of 23983698576 by driving it to retrieval point
[2b.] Verify if I am the car owner [SLOW]
[3.] IF verified, pass car keys to customer
[4.] IF not verified, dump data and start over with the next customer

2a and 2b are carried out simultaneously. Have you figured out where the cache is in this model? It’s where the car is temporarily held: at the retrieval point.

10 customers later, the ALU checks my ID, and at the same time the LSU in good faith starts to drive my neighbour’s car to the retrieval point. It is astutely hidden from my direct view. But if I know the mode of operation of this valet beforehand, I have a small window of opportunity to sneak a peek at the car before the ALU figures out I’m not the owner and a cache flush occurs (i.e. the LSU removes the car from the retrieval point). Perhaps I could plant a camera at the retrieval point …
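
The speculative valet can be sketched as a toy function that records what happens, in order. Everything here (the function name valet, the garage dictionary, the wording of the log entries) is hypothetical and only mirrors the analogy; a real CPU does not keep a log like this.

```python
# Toy model of the robot valet; all names here are hypothetical.

def valet(requested_id, customer_id, garage, speculate):
    """Serve one customer and return (car_or_None, log of what happened, in order)."""
    log = []
    if speculate:
        # Step 2a: the LSU fetches the car before the check has finished.
        log.append(f"{garage[requested_id]} driven to retrieval point")
    verified = requested_id == customer_id     # step 2b, [SLOW] in the analogy
    if verified:
        if not speculate:
            log.append(f"{garage[requested_id]} driven to retrieval point")
        log.append("keys passed to customer")
        return garage[requested_id], log
    if speculate:
        log.append("car removed from retrieval point")
    log.append("start over with the next customer")
    return None, log


garage = {23983698576: "neighbour's car", 11111111111: "my car"}
car, log = valet(23983698576, 11111111111, garage, speculate=True)
print(car)   # None: I never get the keys
print(log)   # but the car did sit at the retrieval point for a while
```

With speculate=False the neighbour's car never leaves the garage for an unverified customer; with speculate=True it is briefly at the retrieval point, which is the window of opportunity described above.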

Issue summary: Speculative execution is a feature that lets the CPU speed up execution if it correctly predicts a decision point. The CPU carries out the operations along the predicted decision branch and keeps the results if the prediction turns out to be correct.

And there you have it, two CPU features explained with robots. These are well-researched CPU features that have been used in CPUs for a long while … and nobody thought to thoroughly investigate ways in which this might be exploited for malicious intent.

You might blame this oversight on Intel, but I think I would blame the unpredictable nature of development. Early forts only needed walls, but not roofs, until catapults were invented. Hardware was invented to run fast, and the internet was designed to be robust, and very few people could predict accurately how they would be exploited for ill intent.

What I’ll be covering next

Next issue: [LMG S5] Issue 61: Mapping the cache

Okay, I’m done talking about the exploit part of Meltdown and Spectre. The scene freezes, goes into extreme time slowdown mode … the last 5 transactions are on the bank teller’s paper, and the neighbour’s car is at the retrieval point. The bank teller ALU is looking over my ID, checking various things, and the car valet ALU is verifying my ID … the quarry is at hand! Only a split second before they uncover the truth and the quarry is snatched away!

How are we going to snoop that precious cargo? You’ve watched enough heist movies, you know these things don’t happen without exhaustively detailed planning.

Let’s start planning our cache snoop next issue.

Issue 59: Meltdown

2020-02-08T08:00:00+08:00

Previously: The CPU comprises different types of execution units. All the execution units can run at the same time, but they may execute instructions over different numbers of clock cycles. To minimise wait time, CPU instructions are carried out in an order that keeps the execution units busy as often as possible.

Last issue: optimising the old-school robot bank teller

Last issue, I modelled a simple CPU bank teller consisting of two units, an ALU (arithmetic logic unit), and an LSU (load-store unit). The ALU does the calculations, while the LSU loads from or stores data to memory. For the CPU to hum along optimally, the ALU should not be kept waiting for data, and the LSU should not be left twiddling its thumbs while the ALU is working.

By reordering the instructions that come in, we can optimise CPU usage by making sure the LSU is loading data for the next few instructions while the ALU is still working; this is known as out-of-order execution.

For the CPU to give a customer his bank account balance, the following steps need to happen:

GET ID from customer
LOAD bank account owner from memory (using ID number)
Check that customer is the bank account owner (verifying other details) [SLOW]
IF verified, LOAD bank account balance
SEND bank account balance to customer

Where I last stopped, we were optimising the robot bank teller by carrying out the two LOAD steps together. This helps to optimise CPU use, because while the ALUs are busy carrying out operations to verify that the customer is the owner of the bank account, the LSU is loading the bank account details, ready to be used once the ALU is done.

Where do the bank account details go while the LSU is waiting for the ALU? In the case of the bank teller, they’re written on a piece of paper (yes, old-school, because analogy). In a real CPU, every piece of data requested by the CPU first goes through the CPU cache. This means the cache holds a copy of the most recently requested data, and it evicts older data to make way for new data. The bank teller’s piece of paper is an analogy for the CPU cache.

Meltdown: the exploit

Suppose I’m an ill-intentioned customer who wants to snoop on a neighbour’s bank transactions. I go up to the bank teller and ask it to retrieve the last 5 transactions of account ID 23983698576 (that’s my neighbour’s account ID, unknown to the robot tellers).

The bank tellers need to execute the following instructions. There is an implicit step after step 4 to ensure security:

GET ID[23983698576] from customer
LOAD bank account owner of [23983698576] from memory (written back to cache)
Check if I am the bank account owner [SLOW]
IF verified, LOAD bank account balance of [23983698576]
IF not verified, dump data and start over with the next customer
SEND bank account balance to me
LOAD last 5 transactions of [23983698576] from memory (written back to cache)
SEND last 5 transactions to me

However, after reordering for efficiency, the steps now look like this:

[1.] GET ID[23983698576] from customer
[2.] LOAD bank account owner of [23983698576] from memory (written back to cache)
[3.] Check if I am the bank account owner [SLOW]
[4.] IF verified, LOAD bank account balance of [23983698576]
[7.] LOAD last 5 transactions of [23983698576] from memory (written back to cache)
[5.] IF not verified, dump data and start over with the next customer
[6.] SEND bank account balance to me
[8.] SEND last 5 transactions to me

While the ALU is carrying out authenticity checks in step 3, the LSU is simultaneously carrying out steps 4 and 7, the LOAD steps, to avoid sitting idle.

This also leaves a copy of the data in the cache; the LSU teller has written down the bank balance and last 5 transactions on a piece of paper while waiting for the ALU.

When the ALU reaches step 5 and figures out I’m not the owner of that account, they start over with the next customer and I get evicted from the queue (in a real CPU, the instructions that ran ahead are thrown away instead of being retired, i.e. having their results made official). But meanwhile, the papers on the desk don’t get cleared!
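
The reordered steps can be sketched as a toy function in which the teller's paper is a plain dictionary. This is a sketch of the analogy only: the account data is invented, and the names ACCOUNTS, teller and paper are assumptions made for the example.

```python
# Toy model of the reordered bank teller; the account data is invented for the example.

ACCOUNTS = {
    23983698576: {"owner": "neighbour", "balance": 1200, "last_5": [-40, -15, 300, -60, -5]},
}


def teller(account_id, customer, paper):
    """Serve one request. `paper` is the teller's scratch paper (the cache)."""
    account = ACCOUNTS[account_id]
    paper["owner"] = account["owner"]          # step 2: LOAD owner
    # Steps 4 and 7 run early, while the slow check in step 3 is still going on.
    paper["balance"] = account["balance"]
    paper["last_5"] = account["last_5"]
    if customer != account["owner"]:           # step 3 finishes, step 5 applies
        return None                            # nothing is SENT to the customer
    return paper["balance"], paper["last_5"]   # steps 6 and 8


paper = {}
answer = teller(23983698576, "me", paper)
print(answer)   # None: the request was refused
print(paper)    # ...but the paper still holds the neighbour's details
```

The function never hands the data to the wrong customer, yet the scratch paper ends up holding it anyway.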

Cache snooping: the oldest trick in the book

If this sounds horrifying to you, remember that the real CPU is just a bunch of transistors and it really isn’t all that smart. And remember that programs cannot access the cache directly; it is a hardware implementation detail (like the backroom of any business), and so this is considered normal practice.

But still, that leaves a small window of opportunity for me to crane my neck and try to snoop the paper. And that’s all the time I need to see my neighbour’s transactions, and even his bank balance!

Issue summary: A set of instructions can trick a CPU into reordering load instructions so that the data is temporarily loaded into the cache before the instructions are retired. The cache can then be snooped to retrieve the data.

Okay, I’ve left out the meaty details of cache snooping here, because there are a whole bunch of tricks to doing it, written up into white papers by cybersecurity researchers. Also this is a one-idea-a-week newsletter, and cache snooping is a whole ’nother idea. Also, I’ll get round to it later.

But first I want to talk about Spectre, which is another way of getting the desired data into the cache. Spectre exploits a different feature, known as speculative execution. It is also an intuitive concept, not difficult for normal folks to understand, and I’ll go straight into it next issue.

What I’ll be covering next

Next issue: [LMG S5] Issue 60: CPU Optimisation Part 2 – Speculative Execution and Spectre

Cache snooping is interesting to me, because things like this actually happen all the time IRL! What’s really going on here is that any business operation needs to have a place to put things, move things, work on things, in a way that is invisible to customers and outsiders. But making sure that these inner workings are truly invisible to other people is helluva difficult.

Consider, for instance, this article from The Atlantic. It describes how some rich investors try to make more accurate predictions of their investments’ performance by buying satellite imagery of their factory or operations sites. By seeing visual data that is not readily available to other investors, they can better predict how those companies are really performing.

Cache snooping is another instance of hardware snooping, but at a different scale and scope. Just how hidden are our hardware implementations? It is difficult to think about ways people can obtain such dearly desired info if we are not those people; human ingenuity does seem almost boundless!

When we really try to do everything in a secure manner, often it means sacrificing performance for security. For instance, a CPU without out-of-order execution would not be subject to this leak risk. But it would also run 1.26 to 2.4 times slower, according to Bruce Dawson of Google.

Ah, how to have our cake and eat it too …

Issue 58: CPU Optimisation Part 1 – Out-of-Order Processing

2020-02-01T08:00:00+08:00

Previously: The CPU stores data for ready access in the CPU cache. Accessing data from the CPU cache is much faster than accessing data from main memory. When the CPU needs data from a memory address, it looks in the cache first. If the data is not there (a cache miss), it will load the data from the memory address, and store a copy in the cache for faster reference in future. The CPU cache is managed by the CPU and is invisible to the OS. Programs that need to ensure the data in the cache is “fresh” can perform a cache flush and reload.

In this issue, we look at one feature that CPUs use to speed up processing: out-of-order execution. “Out-of-order” makes it sound like something is broken in the CPU, but it really just means that the CPU instructions it is given are not executed in the same order that they were fed to the CPU.

If you have seen a busy Starbucks joint or Chinese restaurant at work, you would know that menu orders are not always carried out in the same order that they were taken (even if customers are eventually first-come-first-served). A fully staffed Starbucks joint or Chinese restaurant is not a single working unit, but a collection of specialised units.

CPU execution units

A CPU core comprises 3 types of execution units:

Arithmetic Logic Unit (ALU): The ALU is responsible for carrying out integer calculations
Floating Point Unit (FPU): The FPU is responsible for carrying out decimal (floating-point) calculations
Load/Store Unit (LSU): The LSU is responsible for loading data from memory into the CPU, or storing data from the CPU into memory (Issue 55)

An instruction decoding unit in the CPU decodes each instruction and sends it to the appropriate execution unit. All these units can work at the same time, and for maximum performance this is what you want to happen.

Not all instructions are executed equal(ly)

The CPU has an internal clock (called the CPU clock) that regulates when things are done in the CPU. Everything in a CPU takes place in cycles. Every operation takes at least one cycle, but some operations which require more steps will require more cycles.

For instance, the ALU can carry out most operations in one or two cycles, but the FPU often needs four or more cycles to do its work (moving decimals is hard work!). The LSU clock cycle latency varies, depending on which part of the cache you are fetching from (the cache has different regions; some regions are closer to CPU cores while other regions are shared among all the cores and therefore further. I won’t go into deeper detail.)

Keeping all the execution units busy is getting more complex now, eh?

Minimising wait time in a CPU

Let’s revisit the instructions from Issue 53:

1 LOAD 1   R1
2 ADD  2   R1, R2
3 MOV  R2, MEM1011

The third instruction is to store data from the CPU register to main memory, and this is gonna take a little while. Sending subsequent instructions to the ALU immediately after the third instruction will result in some wastage of clock cycles: the ALU will just be sitting there, waiting for the data to be available in main memory before it can do its thing.

Why not schedule an instruction, even from another application, while waiting? It doesn’t matter if the other application’s instruction came later; if it can be executed now, we might as well do it.

This, in a nutshell-issue, is out-of-order execution.
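
Here is a rough sketch of why this helps, using a pretend CPU with one ALU and one LSU. The instruction names, the cycle counts, and the scheduling rule are all assumptions made for this example; a real CPU's scheduler is far more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    unit: str            # "ALU" or "LSU"
    cycles: int          # made-up number of clock cycles the instruction takes
    deps: tuple = ()     # names of instructions whose results it needs


def run(program, in_order):
    """Return the cycle at which the last instruction finishes."""
    finish = {}                          # name -> cycle at which it completes
    unit_free = {"ALU": 0, "LSU": 0}     # cycle at which each unit is free again
    pending = list(program)
    clock = 0
    while pending:
        started = False
        for instr in pending:
            ready = all(d in finish and finish[d] <= clock for d in instr.deps)
            if ready and unit_free[instr.unit] <= clock:
                finish[instr.name] = clock + instr.cycles
                unit_free[instr.unit] = finish[instr.name]
                pending.remove(instr)
                started = True
                break
            if in_order:
                break                    # strict order: never look past the oldest one
        if not started:
            clock += 1                   # nothing could start, so wait a cycle
    return max(finish.values())


program = [
    Instr("load_x", "LSU", 3),
    Instr("add_x", "ALU", 1, deps=("load_x",)),   # must wait for the load
    Instr("mul_other", "ALU", 2),                 # unrelated work
    Instr("add_other", "ALU", 1),                 # unrelated work
]
print(run(program, in_order=True))    # 7 cycles
print(run(program, in_order=False))   # 4 cycles
```

In the in-order run the ALU sits idle while the load is in flight; in the out-of-order run it gets the unrelated work done in the meantime.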

Analogy: old-school robot bank teller

Let’s model a CPU core as two execution units: an ALU and an LSU. The ALU is a robot bank teller that does what the customer asks, while the LSU is a robot bank teller that retrieves data from and stores data back to the bank’s database (i.e. memory). Two such robot bank tellers work at a teller counter (CPU core).

If a customer needs to check their bank balance, the following instructions need to happen (like I said, this is old-school; no iBanking or ATMs here, because analogy).

GET ID from customer
LOAD bank account owner from memory (using ID number)
Check that customer is the bank account owner (verifying other details) [SLOW]
IF verified, LOAD bank account balance
SEND bank account balance to customer

If you’re wondering why steps 2 and 4 can’t happen at the same time … congratulations! You already understand out-of-order execution at an intuitive level. If the LSU can carry out steps 2 and 4 at the same time, the ALU can simply provide the bank balance once the customer is authenticated, or discard the information otherwise.

This frees up the LSU, and if the LSU’s load is low enough we might even reduce robotpower and share one LSU between two teller counters, seeding android fears of restructuring and impending robot retrenchment … but let’s stop the analogy here for today.

Issue summary: The CPU comprises different types of execution units. All the execution units can run at the same time, but they may execute instructions over different numbers of clock cycles. To minimise wait time, CPU instructions are carried out in an order that keeps the execution units busy as often as possible.

Some very smart people might harangue me about micro-ops, or about decode buffers, etc. My only answer to all such concerns is: not necessary at this point. Maybe in a future issue, if it is the linchpin in some layman explanation.

What I’ll be covering next

Next issue: [LMG S5] Issue 59: Meltdown

This little optimisation step, of doing things in an order that keeps the CPU busy, looks innocuous enough. But once we combine it with some features of the cache, it leaves a little loophole that enables an attacker to snoop on data: this is Meltdown. Stay tuned, we’ll get to the meat of Meltdown next issue!

Issue 57: Cache, the CPU’s working space

2020-01-25T08:00:00+08:00

Previously: The operating system is responsible for listing and managing the computer’s resources, making them available to programs running on the computer, and making sure they only use what they are allowed to.

Those who have been following Layman’s Guide since Season 3 will remember this term, caching. I first introduced it at the end of Season 3, in Issue 39:

Searching for anything takes time. Need to fill out a form? You need to search for a pen first. Need to call someone? Before speed dial and contacts apps existed, you used to need to look up a number in order to dial it. If you do it often enough, you would make sure you always had a pen with you, or you would write the number somewhere convenient for you to see so you don’t need to hunt for it.

Computers use the same trick, and it is called caching. Any information it needs repeatedly which is unchanging is stored in a cache.

The DNS cache, which I introduced in that issue, is a place where hostnames (such as facebook.com) and their associated IP addresses are stored, so that we don’t need to keep looking up the IP address for facebook.com.

We use caches to reduce latency and speed up processing: the full DNS querying process takes a few hundred milliseconds, but looking up a DNS entry in the cache only takes a few milliseconds — that’s a speedup by a factor of 100!
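
The same trick is easy to try out in a few lines of Python. In this sketch, a small dictionary stands in for the slow, full DNS querying process (the hostnames and addresses are placeholders from ranges reserved for documentation), and Python's built-in functools.lru_cache plays the role of the DNS cache. The names SLOW_DNS, slow_lookups and resolve are made up for the example.

```python
from functools import lru_cache

# Hypothetical lookup table standing in for the full DNS querying process.
# The addresses come from ranges reserved for documentation.
SLOW_DNS = {"example.com": "192.0.2.10", "example.org": "198.51.100.7"}
slow_lookups = 0


@lru_cache(maxsize=128)
def resolve(hostname):
    """Pretend to run a full DNS query; results are cached by lru_cache."""
    global slow_lookups
    slow_lookups += 1          # count how often the slow path is taken
    return SLOW_DNS[hostname]


for _ in range(100):
    resolve("example.com")

print(slow_lookups)            # the slow path ran only once
print(resolve.cache_info())    # hits and misses as counted by lru_cache
```

A hundred lookups, but only the first one pays the full price; the other 99 are answered from the cache.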

How long does it take to transfer data?

Let’s look at the transfer speeds and latencies for a few places where data can be stored:

Hard disk drive (HDD): ≈5 ms response latency, 100 MB/s transfer speed
Solid state disk (SSD): up to 0.1 ms response latency, 0.5–1+ GB/s transfer speed
Physical memory (RAM): 0.1 µs (0.0001 ms) response latency, 20 GB/s transfer speed
CPU register: <1 ns (<0.000001 ms) response latency[^1]

[^1]: We seldom talk about the transfer speed of CPU registers, because each register only holds a few bytes (8 bytes on a 64-bit CPU) and the transfer is near-instantaneous.

A CPU register is a slot within the CPU (the same slots from Issue 53) which it uses to hold the data it is processing.

Notice that the speed difference between each layer is more than 10×? If a computer did not have physical memory to store temporary data in, and had to transfer data to/from disk instead, it would be responding a thousand times more slowly!

A CPU can carry out operations very quickly on data loaded into its registers; it generally takes only a few nanoseconds for complex calculations to be done. Simple instructions (such as ADD) can even be done in less than 1 ns!

This means that the limiting factor for CPU speed is actually loading and storing data. In the time it takes to load data from physical memory or store data to physical memory, the CPU could have performed 100 simple operations. That’s damn slow in CPU time!
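
The arithmetic behind that figure, worked out with the latencies listed earlier. The only assumption added here is that one simple operation takes about 1 ns; the names LATENCY_NS and ops_while_waiting are made up for this sketch.

```python
# Latencies from the list above, converted to nanoseconds.
LATENCY_NS = {
    "HDD": 5_000_000,   # about 5 ms
    "SSD": 100_000,     # up to 0.1 ms
    "RAM": 100,         # 0.1 µs
}
SIMPLE_OP_NS = 1        # assumption: one simple operation takes about 1 ns


def ops_while_waiting(device):
    """How many simple operations fit into one access to `device`."""
    return LATENCY_NS[device] // SIMPLE_OP_NS


for device in LATENCY_NS:
    print(device, ops_while_waiting(device))
```

One trip to physical memory costs about 100 simple operations, and one trip to a hard disk costs about five million.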

CPU and cache performance

So CPU designers included some cache on the CPU:

CPU cache: 0.001–0.040 µs response latency, 175 GB/s transfer speed

Great, now we have some storage space that sits between physical memory and the CPU’s registers. It is only slightly slower than a CPU register, and much faster than physical memory.

Imagine working in an office that gave you a cubicle but no desk. When you need a piece of information, you have to go down the hallway to the filing cabinet, retrieve it, and return to your cubicle (no desk!). When you are done processing it (1 second), you have to put the results back in the filing cabinet, down the hallway … a process that takes about 100 seconds. SLOWWWWW!

If you had a desk, you could put some papers on it, and retrieve them much more quickly (a few seconds). You would be 10× more efficient!

By simply including a cache on the CPU, its designers sped up its performance by a factor of more than 10.

Cache is not managed by the OS

Just as an organisation would not control what information you should have on your desk, an operating system does not control the CPU cache. This feature is managed entirely by the CPU itself. The operating system is unable to see what is in the CPU cache, and has no access to it.

Like other caches, when the CPU needs data from a memory address, it looks in the cache first. If the data is not there (a cache miss), it will load the data from the memory address, and store a copy in the cache for faster reference in future.

Like other caches, this process has its own issues. The cache can fill up, requiring the CPU to eject old data so as to make way for fresher data. The cached data on the CPU cache can also become outdated when other programs and instructions update the data in memory. Programs that absolutely need to ensure they get the freshest data from memory can issue special instructions to perform a cache flush and cache reload.

A cache flush empties out the cached data while preserving the memory address it is linked to. A cache reload, well, reloads the data from those memory addresses. These two terms, jargon for very technical operations that take place in the CPU, are being introduced because they are the linchpin of Meltdown and Spectre. We will get there in the next two issues.
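
Here is a toy version of a cache that fills up, ejects old data, and can be flushed and reloaded. It is a sketch under stated assumptions: the class name TinyCache is made up, "eject the least recently used entry" is one simple policy picked for the example, and memory is just a dictionary that this toy lets us change behind the cache's back.

```python
from collections import OrderedDict


class TinyCache:
    """Toy cache with a fixed number of slots; not how real hardware is built."""

    def __init__(self, memory, capacity=2):
        self.memory = memory           # dict standing in for main memory
        self.capacity = capacity
        self.lines = OrderedDict()     # address -> cached copy, oldest first

    def read(self, address):
        if address in self.lines:                    # cache hit
            self.lines.move_to_end(address)
            return self.lines[address]
        self.lines[address] = self.memory[address]   # cache miss: copy from memory
        if len(self.lines) > self.capacity:
            self.lines.popitem(last=False)           # eject the least recently used entry
        return self.lines[address]

    def flush(self, address):
        self.lines.pop(address, None)                # forget the cached copy


memory = {1: "old"}
cache = TinyCache(memory)
cache.read(1)            # miss: the cache now holds "old"
memory[1] = "new"        # in this toy, memory changes behind the cache's back
stale = cache.read(1)    # hit: still the outdated copy
cache.flush(1)           # cache flush
fresh = cache.read(1)    # cache reload: fetched from memory again
print(stale, fresh)
```

After the flush, the next read has no choice but to go back to memory, which is exactly what a program that wants the freshest data is asking for.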

Issue summary: The CPU stores data for ready access in the CPU cache. Accessing data from the CPU cache is much faster than accessing data from memory. When the CPU needs data from a memory address, it looks in the cache first. If the data is not there (a cache miss), it will load the data from the memory address, and store a copy in the cache for faster reference in future. The CPU cache is managed by the CPU and is invisible to the OS. Programs that need to ensure the data in the cache is “fresh” can perform a cache flush and reload.

If CPU development had stopped at this point, Meltdown and Spectre would not have been possible … and we would have been stuck in the ’90s, somewhat. It is in human nature to try to exploit every last bit of available optimisation, and this is what happened with the design of CPUs.

As new manufacturing processes allowed computer engineers to cram more transistors into a CPU, the question arose: what should we do with more transistors? Just add more calculation units? Build new features into the CPU?

Adding more calculation units made the design of CPUs much more complex (as anyone who has had to work alongside other team members doing the same job can attest). The most popular optimisations thus hinged on adding CPU features to ensure that it is fully utilised as much as possible.

In the next two issues, I will do a shallow dive into two of these features: out-of-order processing, and speculative execution. These will not be technical issues, because there are ready human analogues for such optimisations. You probably already do some of this at work, or even at home!

What I’ll be covering next

Next issue: CPU Optimisation Part 1 – Out-of-Order Execution

Out-of-order execution is a solution that we have all discovered at one point or other in our lives. When we have to manage multiple tasks and carry each one out as quickly as possible, we don’t always carry out the steps in a logical order, but in a manner that makes sense and lets us work as quickly as possible.

CPUs do this as well, to perform calculations much more quickly. More in the next issue of Layman’s Guide.

Issue 56: Operating Systems and resource management

2020-01-18T08:00:00+08:00

Previously: The CPU just executes instruction after instruction after instruction. Each instruction may consist of loading data from a memory location, sending data to a memory location, or performing operations on the data it is holding.

If the CPU is mindless and simply carries out instructions, there must be some kind of a “higher mind” maintaining order and harmony within the CPU so that our programs don’t muck things up for each other.

In the early days of computing history, this higher mind was the programmer. In those days, a programmer had to mentally partition the limited memory space, and ensure that the programs being executed on the CPU don’t inadvertently muck up the memory in unexpected ways. This was manageable for a while: up to a few thousand, or tens of thousands of memory addresses, with a sensible set of rules. But as programs became more complex, and when multiple programs had to be run on the same computer, bugs started to creep in and become difficult to trace and fix.

Humans could no longer manage the CPU’s resources. It had to be automated. And so the operating system (OS) was born.

The operating system manages the computer’s resources

An operating system has to do a few things at minimum:

Enumerate the devices on the computer: checking all its available interfaces and listing the devices connected to each interface, to be made available to programs upon request.
Register device ports into the virtual memory address space. This includes physical memory, hard drive ports, printer ports, keyboard and mouse and other USB device ports, and so on. This makes the devices available to programs that need to load data from those devices, or send data to those devices.
Manage running programs, giving each program its own memory space, dividing up the available CPU time among programs so that each gets some runtime, allocating more memory to programs that request it, reclaiming memory from programs that release it.
Enforce security by ensuring that programs only carry out instructions that they are allowed to. This is why Windows keeps bugging you about program permissions. This also ensures that guest users cannot access the data of other guest users, and cannot modify important system files.

This is both an art and a science, and getting it right is an ongoing study. When an OS works well, instructions from different programs can be mixed into the same queue and executed by the CPU without the data somehow getting mixed up. And programs will not be able to dabble into the private memory area of other programs.

But cybersecurity is a multi-billion dollar industry with good reason. Black hat hackers and cybersecurity researchers are constantly trying to find loopholes in the OS logic so as to access data they are not supposed to be able to access. In Meltdown and Spectre, the loophole is not a fault in the OS logic, but in a hardware feature of the CPU which I will explain in the next issue: the CPU cache.

Issue summary: The operating system is responsible for listing and managing the computer’s resources, making them available to programs running on the computer, and making sure they only use what they are allowed to.

What I’ll be covering next

Next issue: Cache, the CPU’s working space

The pieces are in place now for me to introduce the crux of the matter: the CPU cache. This is where the heart of Meltdown and Spectre takes place, and yet we cannot do away with it. Stay tuned to learn why.

Issue 55: Addressing memory

2020-01-11T08:00:00+08:00

Previously: To get useful output from a CPU, we must translate the operations we want it to perform into CPU instructions, in a process known as compiling. Most compilers convert programming code into CPU instructions.

A CPU executes instructions, which load data from memory, store data to memory, or carry out operations on loaded data. Where exactly does this data go, and how is it organised?

Memory is a collection of bytes organised by address

In Season 4, I mentioned the byte as a convenient collection of 8 bits. Part of the reason is that memory is organised by bytes. Each byte of memory has its own address. Naturally, in a CPU, this memory address will be encoded in binary.
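If you want to poke at this idea yourself, Python's built-in `bytearray` behaves a lot like a tiny strip of memory: a row of bytes, each reachable by its position. This is only an analogy (the size of 16 bytes and the address 5 are arbitrary choices of mine), but it shows the "one address, one byte" idea.

```python
memory = bytearray(16)     # 16 bytes of memory, addresses 0 to 15, all zero

memory[5] = 3              # store the value 3 in the byte at address 5
value = memory[5]          # load it back

# A CPU works with the address in binary; format() shows it as 4 binary digits.
print(f"address 5 in binary is {format(5, '04b')}, value stored there: {value}")
```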

Working out the numbers, a CPU will need 10-bit memory addresses to use 1 KiB of memory (2^10 = 1,024). 20-bit addresses will let it use 1 MiB of memory (2^20 = 1,048,576). 30-bit addresses will let it use 1 GiB of memory (2^30 = 1,073,741,824). And 32-bit addresses will let a CPU use 4 GiB of memory.
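You can check these numbers yourself with a few lines of Python. The function name `addressable_bytes` is my own; the only rule it uses is that an address with n bits can take 2^n different values.

```python
def addressable_bytes(address_bits):
    """Number of distinct byte addresses that fit in this many bits."""
    return 2 ** address_bits

for bits, unit, unit_size in [(10, "KiB", 2**10), (20, "MiB", 2**20),
                              (30, "GiB", 2**30), (32, "GiB", 2**30)]:
    total = addressable_bytes(bits)
    print(f"{bits}-bit addresses: {total:,} bytes = {total // unit_size} {unit}")
```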

Are those numbers ringing a bell?

The 32-bit to 64-bit transition in the ’00s

A little history, for those who remember: Around the turn of the century, in the ’00s, there was some hoo-ha about 32-bit CPUs not being able to use more than 4 GiB of memory; this was a time when 2 GiB of memory on a laptop was considered beefy, Google Chrome hadn’t appeared on the scene yet, and browsers did not use up gobs of memory.

This was also a time when 64-bit CPUs started coming onto the scene, and there was much confusion in the software world about which software would work on 32-bit CPUs, which ones would work on 64-bit CPUs, and which ones would work on both.

So this is what it boils down to: a 32-bit CPU, without any hacky workarounds, can only work with about 4 billion memory addresses, and this became insufficient around the turn of the century. We needed to use CPUs that could work with more than 4 billion addresses. 64-bit CPUs were the solution that the computing industry settled on. 64-bit memory addresses extend the addressable memory capacity to 16 EiB (2^64 bytes, or about 16 million TiB), which should be enough for the foreseeable future.

16 EiB?! Why do we need so much memory?

Hold your horses — I want to be clear here. I’m not just talking about memory here, but about memory addresses. What’s the difference? Consider for a moment how the CPU would transfer data to the hard drive. Or send data to a printer. Or even send it out onto the network. How would those virtual “locations” be represented in a CPU instruction that can only handle memory addresses?

The most straightforward answer, which you may have some difficulty accepting, is that they are simply represented as memory addresses. Yep, in the entire space of memory addresses, most of it is used to address physical memory (what is known as Random Access Memory, or RAM), while some of it is used to address hard drive devices, USB devices, network devices, and various other connected peripherals.
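Here is a toy sketch of what such an address map could look like. The address ranges and the device names are invented for illustration, and so are the `device_for` and `store` helpers; real computers carve up their address space quite differently. The point is only that the same kind of instruction can end up at very different destinations depending on the address.

```python
# Hypothetical address map: the ranges and device names are invented for illustration.
ADDRESS_MAP = [
    (range(0, 1000), "RAM"),
    (range(1000, 1100), "hard drive"),
    (range(1100, 1200), "printer"),
    (range(1200, 1300), "network card"),
]

def device_for(address):
    """Return the name of the device that a given address belongs to."""
    for addresses, device in ADDRESS_MAP:
        if address in addresses:
            return device
    raise ValueError(f"nothing is mapped at address {address}")

def store(address, value):
    """Describe where a 'store' instruction to this address would end up."""
    return f"value {value} sent to {device_for(address)} (address {address})"

print(store(42, 7))      # lands in RAM
print(store(1150, 7))    # same kind of instruction, but it reaches the printer
```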

Of instructions and addresses

Let’s summarise the picture so far.

Issue summary: The life of the unconscious CPU is just executing instruction after instruction after instruction. Each instruction may consist of loading data from a memory location, sending data to a memory location, or performing operations on the data it is holding.

Not a very interesting life, but it forms the bedrock which supports everything we use a computer for. And things are about to get more complex once we throw programs into the picture. Each program is its own long list of CPU instructions, meant to produce different results. Excel carries out our spreadsheet processing, while Word helps us to format our documents. Yet the instructions from both programs are carried out in the same CPU! How does the CPU avoid mixing up data from different programs? How does it prevent Word from accidentally screwing up Excel’s data, and vice-versa?

What I’ll be covering next

Next issue: Operating Systems and resource management

Okay, I think I’ve laid out the basics of CPU operation in sufficient detail for now. I have yet to mention one key component—the CPU cache. And I have yet to explain how CPUs speed up processing. These two explanations will make more sense after I make a side trip about how operating systems prevent everything from becoming one gigantic mess.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

Issue 54: Compiling programming code into CPU instructions

2020-01-04T08:00:00+08:00

Previously: CPUs are unconscious slaves that simply execute instruction after instruction, at a very fast rate.

Last issue, I introduced the idea of the CPU as an unconscious instruction-executing machine. It cannot process programming code directly; that code must first be compiled into CPU instructions.

The compiler converts programming code to CPU instructions

Last issue, I showed you a short snippet of CPU instructions:

LOAD 1   R1
ADD  2   R1, R2
MOV  R2, MEM1011

But that’s not the kind of code we usually see in movies, on the screens of geeks, and in stock images. What gives?

Most code we see looks something like (example from Python):

num1 = 1
num2 = 2
sum = num1 + num2
print(f'The sum of {num1} and {num2} is {sum}')

How does that get turned into CPU instructions? That job is performed by a piece of software known as the compiler.[^1] The compiler compiles programming code into an executable file (sometimes shortened to executable), which contains the actual instructions executed by the CPU. This is why, in Windows, some files have a .exe file extension — those are executable files!

[^1]: Purists will argue with me that Python technically runs through an interpreter, not a compiler. For layfolk, the distinction between the two terms is not critical yet, so I choose clarity over accuracy until I can delve into more detail in a future issue.

The compiler itself is also a piece of software that reads in programming code (a process known as parsing), and follows its own instructions to break it down into CPU instructions.[^2]

[2]: If you find yourself wondering “how was the first compiler written? Which came first: the compiler code, or the compiler executable? How would a compiler compile its own code into its executable?”, you might be a prime candidate for a Computer Science degree programme :)
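To give a feel for what "parsing and breaking down" means, here is a deliberately tiny toy compiler written in Python. It is a sketch under heavy assumptions: the function name `compile_assignment` is mine, it only understands statements of the form "name = number + number", it borrows Python's own `ast` module to do the parsing, and it produces the made-up LOAD/ADD/MOV instructions from this newsletter rather than anything a real CPU accepts. The memory location is simply handed in, because deciding where variables live is a job I have not covered.

```python
import ast

def compile_assignment(source, memory_location):
    """Translate a statement like 'b = 1 + 2' into toy CPU instructions."""
    tree = ast.parse(source)                 # parsing: text -> structured tree
    statement = tree.body[0]
    if not (isinstance(statement, ast.Assign)
            and isinstance(statement.value, ast.BinOp)
            and isinstance(statement.value.op, ast.Add)
            and isinstance(statement.value.left, ast.Constant)
            and isinstance(statement.value.right, ast.Constant)):
        raise ValueError("this toy compiler only handles 'name = number + number'")
    left = statement.value.left.value
    right = statement.value.right.value
    return [
        f"LOAD {left} R1",                   # put the first number into slot R1
        f"ADD {right} R1, R2",               # add the second number, result into R2
        f"MOV R2, MEM{memory_location}",     # store the result in memory
    ]

for instruction in compile_assignment("b = 1 + 2", memory_location=1011):
    print(instruction)
```

A real compiler handles an entire programming language and a real instruction set, but the shape of the job is the same: read the code, work out its structure, and emit instructions.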

Okay, I think I am done talking about CPU instructions for now. On to the next piece of the puzzle: memory.

Computer memory: addressable bytes

In the CPU instruction snippet above, there was a line that involved storing data into memory:

MOV  R2, MEM1011

This line means “store the value in slot R2 into the memory location 1011”. Next issue, I will delve into what these memory locations are, and build out our mental model of how a CPU works.

Issue summary: To get useful output from a CPU, we must translate the operations we want it to perform into CPU instructions, in a process known as compiling. Most compilers convert programming code into CPU instructions.

A very short issue, just as I like it :) There’s something philosophical about the process of a CPU beginning with no knowledge of what to do, and slowly bootstrapping a library of code-to-instruction conversions through a compiler. These and other puzzles about information manipulation are what computer scientists love studying! And this is one good reason to differentiate Computer Science from general Computing: if you take up a degree in Computer Science and expect to learn more about general Computing, you might end up being disappointed.

What I’ll be covering next

Next issue: Addressing memory

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
~~compiling code into an application [Issue 26]?~~
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

Issue 53: The CPU is an instruction-obeying slave

2019-12-28T08:00:00+08:00

Previously: PDF’s markup language is more concerned with how things appear on the page than with what they were originally. Once the PDF is generated, it is almost impossible to retrieve the original data from it. Scanned documents that are converted to PDF may have a text layer generated by OCR that lets detected text be copied from it.

In Season 4, I laid out the basics of how data is represented: text, images, audio, video. I also explained how compression happens, and unpacked how these basic data types can be combined into more complex documents.

But data by itself isn’t of much value in a computer if we can’t do things to it, that is, perform operations on it. We are not talking surgical or military operations here, but chiefly mathematical operations to manipulate information: changing a bit here, a bit there, or making a massive set of changes throughout.

How exactly does that happen in a Central Processing Unit (henceforth CPU)?

CPUs are instruction-obeying slaves

The design of a CPU is very much inspired by human experience. One essential aspect of that experience is that everything we do consists of operations.

Civilization advances by extending the number of important operations which we can perform without thinking of them. — Alfred North Whitehead

Want to make coffee? Measure out one scoop of coffee beans per cup, add them to the grinder, press the Start button on the grinder and wait for the noise to stop, empty the coffee grounds into the drip machine, add water, press Start, and wait for a beep. 6 steps to make coffee. You can break those steps down differently depending on what kind of machine you are using and what kind of coffee you are making. Whatever the outcome we want, if it can’t be broken down into simple steps like that, we would not be able to design, make, and sell household appliances; we would have to be craftsmen (and craftswomen) of that trade.

A CPU is an unconscious operation-executing machine. Every outcome we want must be translated into operations which a CPU can perform without understanding.

A common mental model of how our computers work is that a programmer writes code in a language that a CPU understands, and the CPU simply carries out those instructions. Let’s go deeper into that model. How do those instructions get translated into the 1s and 0s of binary code?

Much the same way as information gets converted to binary in Season 4. The CPU can understand and execute a limited set of instructions, and each instruction is labelled with a number. The CPUs in use today have standardised on the instructions which they can be instructed to carry out. These sets of instructions are known as instruction sets.

What are these instructions like?

CPU instructions: moving data around

These instructions perform operations on one, two, or more pieces of data. This is how an instruction like b = 1 + 2 would be broken down:

LOAD 1   R1
ADD  2   R1, R2
MOV  R2, MEM1011

I am using this arcane presentation format in a newsletter for layfolk because I think it helps to distinguish between human thinking and computer thinking. What the computer is doing here is:

Load the value 1 into slot R1
Add the value 2 to the value in slot R1, and store the result in slot R2
Store the value in slot R2 into the memory location 1011 (where the variable b points, so that other programs/instructions can use the result)

Everything we ask a CPU to do essentially consists of loading data from somewhere, doing some kind of processing on it, and storing the result somewhere. The CPU processes lists of these instructions, at a rate of millions to billions of instructions per second.
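For the curious, the three steps above can be mimicked with a toy "CPU" in Python. This is a sketch of the mental model, not of real hardware: the `run` function, the way I write each instruction as a tuple, and the use of dictionaries for the slots and the memory are all my own simplifications.

```python
def run(program):
    """Execute a list of toy instructions and return (registers, memory)."""
    registers = {}   # the slots, e.g. R1 and R2
    memory = {}      # memory locations, keyed by address
    for instruction in program:
        operation, *operands = instruction
        if operation == "LOAD":          # LOAD value, destination slot
            value, destination = operands
            registers[destination] = value
        elif operation == "ADD":         # ADD value, source slot, destination slot
            value, source, destination = operands
            registers[destination] = registers[source] + value
        elif operation == "MOV":         # MOV source slot, memory address
            source, address = operands
            memory[address] = registers[source]
        else:
            raise ValueError(f"unknown instruction: {operation}")
    return registers, memory

program = [
    ("LOAD", 1, "R1"),
    ("ADD", 2, "R1", "R2"),
    ("MOV", "R2", 1011),
]
registers, memory = run(program)
print(registers)
print(memory)
```

After the three instructions have run, slot R2 holds 3, and so does memory location 1011. The toy CPU never "knew" it was computing b = 1 + 2; it just followed the list.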

Let that sink in for a moment. Every YouTube video, meme, or tweet we send or see is the result of hundreds and thousands of operations, taking place in CPUs around the world. CPUs converting text, audio, and images into raw data, encapsulating it into a data package along with some metadata, sending it out to another CPU that translates the destination address and forwards it to the next gateway, and so on, until it reaches its destination, gets decoded and processed, and signals get sent to the monitor and speakers to produce what we see and hear.

Why can’t I run an exe file from Windows on my smartphone, or an Android/iOS app on my Windows laptop?

There are many reasons for that, and I will explain one of those reasons here: the x86-64 instruction set used by Intel/AMD CPUs on your Windows laptop is not compatible with the ARM instruction set used by your smartphone CPU; the MOV, ADD, and other instructions have different numerical codes in each instruction set.

The same programming code for the app must be compiled into CPU instructions separately for Intel/AMD processors, and for ARM-based processors.
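Here is a small sketch of why the numbering matters. The two opcode tables below are invented for two imaginary CPUs; they are not the real x86-64 or ARM numbers, and the `encode` and `decode` helpers are mine. What the sketch shows is that a program compiled for one numbering turns into a different program when read with another.

```python
# Invented opcode numbers for two imaginary CPUs. These are NOT real
# x86-64 or ARM encodings; they only show that the numbering differs.
CPU_A_OPCODES = {"LOAD": 1, "ADD": 2, "MOV": 3}
CPU_B_OPCODES = {"MOV": 1, "LOAD": 2, "ADD": 3}

def encode(program, opcodes):
    """Turn instruction names into the numbers a particular CPU expects."""
    return [opcodes[name] for name in program]

def decode(numbers, opcodes):
    """Turn numbers back into instruction names, as a particular CPU would read them."""
    names = {number: name for name, number in opcodes.items()}
    return [names[number] for number in numbers]

program = ["LOAD", "ADD", "MOV"]
compiled_for_a = encode(program, CPU_A_OPCODES)

print(decode(compiled_for_a, CPU_A_OPCODES))   # CPU A reads back the original program
print(decode(compiled_for_a, CPU_B_OPCODES))   # CPU B reads a different program
```

Imaginary CPU B would happily execute those numbers, but as MOV, LOAD, ADD instead of LOAD, ADD, MOV, which is not the program we wrote.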

Issue summary: CPUs are unconscious slaves that simply execute instruction after instruction, at a very fast rate.

What I’ll be covering next

Next issue: Compiling programming code into CPU instructions

I think this is a good place to stop today. Before we can dig into CPU exploits, we must first unpack what a CPU does. And we are starting slow, because the CPU is ultimately a strange place. Stepping into it is kind of like stepping into Willy Wonka’s Chocolate Factory, where all kinds of wonderful things are happening, and once you figure out how everything fits together you can work out where to sneak globs of chocolate without people finding out.

See you in the next issue of Season 5: the Chocolate Processing Unit!

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]