Layman's Guide to Computing - Season 04

Issue 52: PDFs part 2 – Text and images

2019-12-21T08:00:00+08:00

Previously: PDF is the gold standard for universal compatibility (supported by most software and platforms) and visual fidelity (displays exactly the same way). When you need things to appear on a different device in exactly the same way you created it, without having to install additional software, use PDF.

I mentioned earlier that PDF is an incredibly complex and powerful format. You can do so much with it once you have digested the approximately 900 pages of its format specification, which are available for free. To support older versions of Acrobat and other readers, you may have to cross-reference the reference manuals of older versions. It’s not impossible, but I hope this helps you understand why many apps and services are reluctant to provide PDF support unless there are already libraries available for them to use in their own application. This is time-consuming stuff!

Nope, I’m not going to read that.

Sure, that’s why I’m writing this newsletter :) Now if you flip to page 238 of the reference manual (I’m just kidding, don’t go download the reference manual now!) and look at Example 1, you see this:

This example illustrates the most straightforward use of a font. The text ABC is placed 10 inches from the
bottom of the page and 4 inches from the left edge, using 12-point Helvetica.

BT
    /F13 12 Tf
    288 720 Td
    (ABC) Tj
ET

The five lines of this example perform these steps:

a) Begin a text object.
b) Set the font and font size to use, installing them as parameters in the text state. In this case, the font
resource identified by the name F13 specifies the font externally known as Helvetica.
c) Specify a starting position on the page, setting parameters in the text object.
d) Paint the glyphs for a string of characters at that position.
e) End the text object.

Remember when I showed you some markup languages in Issue 50? Here’s another one, but much more concise and much more specific: it lets you specify font and position for each string of characters. There are additional formatting codes for changing the colour, changing the text format to an outlined version, and making various other kinds of changes.
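If you are curious how an app might produce those five lines, here is a minimal Python sketch. The helper name text_object and its inches-based arguments are made up for illustration; the only facts it leans on are the operators from the example above and that PDF measures in points, with 72 points to an inch. A real PDF library does far more than this.

```python
POINTS_PER_INCH = 72  # PDF's default unit is the point, 1/72 of an inch


def text_object(font_name, font_size, left_in, bottom_in, text):
    """Return the operators for one text object (hypothetical helper)."""
    x = round(left_in * POINTS_PER_INCH)
    y = round(bottom_in * POINTS_PER_INCH)
    # Backslashes and parentheses have a special meaning inside ( ... ) strings.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "\n".join([
        "BT",
        f"    /{font_name} {font_size} Tf",
        f"    {x} {y} Td",
        f"    ({escaped}) Tj",
        "ET",
    ])


# 4 inches from the left edge, 10 inches from the bottom, as in the example.
print(text_object("F13", 12, left_in=4, bottom_in=10, text="ABC"))
```

Running it prints the same five lines as the reference manual's example, with 4 and 10 inches turned into 288 and 720 points.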

How PDF documents display this text

In the reference manual, there is a long and complicated way of putting a text block into a PDF document and specifying the line spacing and character spacing and how to insert line breaks and all that, so that it appears nicely. In practice, it is rather difficult for developers to convert their own format used in their app into what the PDF format fully requires (see my point at the start of this issue about having PDF libraries available).

If it is done properly, copying text from a PDF document is rather easy, and you may on rare occasions have experienced this. Apps that do not have access to high-quality PDF libraries may end up generating PDFs that simply place the text word by word, or line by line. If you’ve ever copied a paragraph of text from a PDF and had it come out as multiple lines instead of a single paragraph, or with some word spaces missing, this could be the reason why.
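Here is a rough Python sketch of what a PDF reader has to do when the text was placed fragment by fragment. Everything in it is an assumption for illustration: the fragment tuples, the rebuild_lines name, and especially the space_gap threshold, which is just a guess at how wide a gap must be before it counts as a word space. Real readers use more careful heuristics.

```python
def rebuild_lines(fragments, space_gap=2.0):
    """fragments: list of (x, y, width, text) tuples in page units."""
    lines = {}
    for x, y, width, text in fragments:
        lines.setdefault(y, []).append((x, width, text))

    result = []
    # Larger y is higher up the page, so read from top to bottom.
    for y in sorted(lines, reverse=True):
        pieces = sorted(lines[y])  # left to right
        line = pieces[0][2]
        for (prev_x, prev_w, _), (x, _, text) in zip(pieces, pieces[1:]):
            gap = x - (prev_x + prev_w)
            # Guess: a wide enough gap was probably a word space.
            line += (" " if gap >= space_gap else "") + text
        result.append(line)
    return result


fragments = [
    (72, 700, 30, "Hello"),
    (105, 700, 32, "world,"),  # gap of 3 after "Hello": treated as a space
    (138, 700, 20, "this"),    # gap of 1 after "world,": the space is lost
    (72, 686, 12, "is"),
    (87, 686, 28, "text."),
]
print(rebuild_lines(fragments))  # ['Hello world,this', 'is text.']
```

The sentence comes back as two separate lines with one word space missing, which is exactly the kind of result you get when the PDF only records where each fragment sits on the page.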

What about images?

Again, the one thing you need to remember is that PDF is concerned primarily with how things look, not with what things are. To display a JPG or GIF in a PDF, the app’s PDF library has to convert it from its compressed format into an array of pixels. The image’s pixel dimensions will seldom match those of the frame it must go into; often you may find yourself trying to fit an 800×600px image into a 400×300px space. The PDF encodes that stream of pixels, and you may not be able to get the original image back from that stream, especially after it has gone through some resizing and cropping.
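To see why resizing is a one-way street, here is a small Python sketch that halves a greyscale image by averaging each 2×2 block of pixels. The halve function and the tiny sample images are invented for illustration, and it assumes the width and height are even; real libraries use more sophisticated resampling.

```python
def halve(pixels):
    """Shrink a greyscale image by averaging each 2x2 block of pixels."""
    return [
        [
            (pixels[r][c] + pixels[r][c + 1]
             + pixels[r + 1][c] + pixels[r + 1][c + 1]) // 4
            for c in range(0, len(pixels[0]), 2)
        ]
        for r in range(0, len(pixels), 2)
    ]


checkerboard = [[0, 200], [200, 0]]
flat_grey = [[100, 100], [100, 100]]

print(halve(checkerboard))  # [[100]]
print(halve(flat_grey))     # [[100]]
```

Two very different images shrink to the same single grey pixel, so nothing that reads the result can tell which one it started as.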

Why can’t I copy text from a scanned document?

Ah, a common question, and one I have been dying to answer.

When you scan a document, your scanner does not produce text; it produces an image. When the scanning software lets you save your scan as a PDF, it basically puts the image into a full-page PDF and calls it a day. There is no text content in the PDF at all!

If the software is a bit smarter, or if you have Adobe Acrobat, you might have access to Optical Character Recognition (OCR) software. This is a feature in some apps that recognises text in images and recreates it for you: the app checks your scan for recognisable characters and produces a text stream from them. It can then put this text into an additional layer in the PDF, below the image.

It takes some additional trickery to ensure the text appears at exactly the same position where it was detected in the image (remember from above that the text position must be specified). If the PDF library gets the font size and positioning right, this simulates the experience of selecting text on the image and having it appear to be highlighted.
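Part of that trickery is plain arithmetic. Here is a minimal Python sketch, assuming the OCR step reports each word as a box measured in pixels from the top-left corner of the scan, while the PDF wants points measured from the bottom-left corner of the page. The function name, its arguments, and the idea of using the box height as the font size are my own simplifications.

```python
POINTS_PER_INCH = 72


def word_box_to_pdf(left_px, top_px, height_px, image_height_px, dpi):
    """Return (x, y, font_size) in points for one detected word."""
    x = left_px * POINTS_PER_INCH / dpi
    # Flip the vertical axis: measure up from the bottom of the page
    # to the bottom edge of the word's box.
    y = (image_height_px - (top_px + height_px)) * POINTS_PER_INCH / dpi
    # Rough guess: treat the height of the box as the font size.
    font_size = height_px * POINTS_PER_INCH / dpi
    return x, y, font_size


# A page 11 inches tall scanned at 300 dpi is 3300 pixels high.
print(word_box_to_pdf(left_px=300, top_px=300, height_px=50,
                      image_height_px=3300, dpi=300))
# (72.0, 708.0, 12.0)
```

A word found one inch in from the top-left corner ends up one inch from the left edge and a little under ten inches from the bottom, set in 12-point text.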

However, the state of OCR technology is such that you will often still get typos or missing/extra spaces in the text, so do be sure to check any text that you copy from a PDF!

Issue summary: PDF’s markup language is more concerned with how things appear on the page than with what they were originally. Once the PDF is generated, it is almost impossible to retrieve the original data from it. Scanned documents that are converted to PDF may have a text layer generated by OCR that lets detected text be copied from it.

… and Season 4’s a wrap! Phew, I hope Season 4 increased your understanding of how text, images, audio, and video are represented and stored in a computer, of how lossy and lossless compression work and why the former leads to a decrease in quality, of what a file is and how OSes tell them apart, and lastly of documents and other complex file types, and how they are put together.

What I’ll be covering next

Next season: The CPU - where it all happens

I was going to start Season 5 by continuing where I left off in Season 3. From networking I would have gone on to talk about the internet and its history, how it became the cloud, and how we ended up with the advertising networks we have today. But I realised that (1) I still need to do more research on some areas (particularly ad exchanges), and (2) Meltdown and Spectre are apparently not fully fixed yet.

If you remember, Meltdown and Spectre are the CPU vulnerabilities that can potentially allow attackers to access protected data in your computer’s memory. Most of us don’t have much on our computers that we need to worry about, but banks and other corporations that we rely on certainly do!

One year on, that vulnerability is still not fully fixed. Some people seem to be flabbergasted by the inability of the huge CPU companies (actually just mainly Intel) to figure this out. But once you understand what Meltdown and Spectre are and how they work, even at a layperson level, I think it is easier to see that there is no straightforward fix that will make everyone happy. With media outlets everywhere citing Moore’s Law uncritically and expecting performance to increase in accordance with it, I am disappointed that such a vulnerability had not been conceptualised earlier and prevented, but I am not surprised.

Security and privacy are the hot-button topics of the day, but there are many pundits and analysts talking with little idea of how they are implemented and why they are such a difficult challenge. With Season 5 I hope to lay out the basics of operating system security and CPU operation, and attempt to explain in simple terms how Meltdown and Spectre work and why they are so difficult to fix.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

Issue 51: PDFs part 1 – Compatibility and fidelity

2019-12-14T08:00:00+08:00

Previously: An HTML file contains markup tags that tell the browser how to interpret and format the text within the tags. Other document formats usually use tags in a similar way. These tags constitute a markup language that any app can use to mark up its own text too.

If you are old enough (or perhaps lucky enough) to remember the old days of document layout, you may remember a time when such software was non-existent. You typed the text using a typewriter, being very careful to do a carriage return and line break where the pictures were supposed to go. Then you literally cut the pictures and pasted them in. Not the right size? You’re outta luck.

And then computers came along. But in the days of dot matrix printers, which printed on paper with those holey tearaway strips on both sides, it was the same process, just digital. You still printed only text, and added the pictures later.

The early days of publishing

If you were working for a professional publisher, you formatted text by inserting control codes (including the formatting commands mentioned in Issue 41) using a special keyboard. But computers back then weren’t powerful enough to show you the effects of that formatting instantaneously. You would just see the formatting codes on the display and have to imagine what the result would look like. Which is easy to do, after many years of experience.

WordPerfect 5.1 (1989), with formatting codes revealed
From Anthology

And then desktop publishing software arrived on the scene in the mid-1980s, when Aldus released PageMaker. You could see how the pages actually looked! This feature was called What You See Is What You Get, or WYSIWYG. PageMaker was quickly overshadowed by QuarkXPress, which had extensions (whoa!), and Aldus languished and got bought over by Adobe in late 1994. Yup, that Adobe. And then Adobe released InDesign in 1999.

Publishing vs word processing

Why didn’t I mention Microsoft Word, even though it was first released much earlier, back in 1983 (with the Windows version following in 1989)? That’s because Word is a word processor, not a page layout application. A word processor is focused on helping you to produce reports with nice formatting, but still primarily text-based. You wouldn’t design a professional magazine in Microsoft Word; it doesn’t give you enough fine-grained control over positioning of the various elements. For that you need a proper page layout application, like InDesign.

I just mentioned fine-grained control. That’s something you are going to hear a lot in the world of graphic design. Designers and publishers want control, lots of it. They not only want to control where things go on the page (down to sub-millimetre precision), they even want to control exactly how the colour looks.

Going into more detail here would betray my principle of writing for the layperson, but I think it is important to present this perspective because it explains the need for a format many of us love and hate: the PDF format.

Ensuring print fidelity: the PostScript language

When you design something on the screen, how do you know that it will look exactly the same when printed? Short answer: you won’t, unless you have a markup language that is understood the same way by both the desktop software and your printer. That language is called PostScript, and it can handle text, images, shapes, and additional info (or metadata, i.e. data about data) that comes with them.

But people soon wanted to include even more things in their documents: forms, videos, 3D artwork, … many of which PostScript did not support natively. And that’s where PDF shines.

PDF: the standard for compatible fidelity

Today, it is easy to take for granted that when I create a DOCX document in Word on my iPad and upload it to Google Drive, it should open on my laptop and look the same. To an accurate enough degree, anyway.

But two decades ago, such compatibility was still a dream. You could not take for granted that a complex document produced by one piece of software would open correctly (if it opened at all) in another piece of software, or even in the same software written for a different machine (think of Word for Windows, Mac, and other OSes).

Needless to say, this was incredibly frustrating for industry. If you are running an ad campaign and your ad agency is trying to send poster designs to you but you each use different software in your workflow … well, how is that going to happen? Or what if two different government departments are trying to collaborate on a form that citizens will use to file taxes?

A lot of engineering and coordination went into ensuring that PDF would work everywhere (universal compatibility), and display exactly the same way on every device (visual fidelity), and that is why it is a gold standard for the printing and publishing industry. If you want to ensure your T-shirt design will appear exactly¹ the way you want, send it as a PDF file, not as an image file. Have your magazine cover all set up with the fonts, sizes, colours, and everything else absolutely correct?² Send it to the printers as a PDF file.

There is just one issue with PDF: because of the way it was designed to display correctly, editing it is a big pain compared to text-based formats like DOCX or even HTML. I’ll explain why in the next issue.

Issue summary: PDF is the gold standard for universal compatibility (supported by most software and platforms) and visual fidelity (displays exactly the same way). When you need things to appear on a different device in exactly the same way you created it, without having to install additional software, use PDF.

I am sooo glad I don’t have to go into technical detail here. PDF is an incredibly, amazingly, mind-blowingly complex specification. All the words written about it would fill tomes. I am not surprised that Adobe charged so much for the initial versions of Adobe Reader and Acrobat; the immense amount of work that went into it would have made that price feel justified. (But luckily for all of us, more enterprising minds prevailed.)

I hope this issue sheds some light on the uses of PDF. We don’t get taught these things by our parents, in school, or anywhere really; the only folks who know this are usually publishing industry professionals. But with more and more programs being able to handle and produce PDF files, if we hope to continue enjoying its benefits and avoiding the consequences of using it inappropriately, then it is time that such knowledge became more commonplace.

For more: Pretty Darn Fascinating: The story of the PDF, the portable document format that’s become one of the internet’s defining information formats.

What I’ll be covering next

Next issue: PDFs Part 2

Next issue, I’ll try to explain why PDF files are the idiosyncratic beasts you hate to edit. While still avoiding technical jargon as much as I can.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

¹ This is harder than it appears; for one, you have to ensure you get the image size and resolution correct, or the printer will have to modify it for you.
² Again, this is harder than it appears; the way to specify the exact colour you want is not something a layperson would know.

Issue 50: Complex file formats and the Document

2019-12-07T08:00:00+08:00

Previously: A file consists of data, preceded by a file header which describes the data. Software (including operating systems) detects the kind of data contained in a file by 1) glancing at the file extension, 2) looking at its declared MIME type (if any), and 3) checking the file header.

I took a small detour in Issue 49 to talk about how files are stored and how the operating system identifies them. This issue, let’s pick up where we left off in Issue 48 about complex data types, and encapsulated data (data in a shell of metadata in a shell of metadata …).

Video files can contain multiple data streams: video, audio, and text. That makes them a pretty complex type of file in which we can embed other types of data. But they are not the only complex file type. We deal with them every time we create a new Microsoft Office document, be it in Word, PowerPoint, or Excel. You can embed images, videos, fonts, and even stranger objects in Microsoft Word. How does a simple DOCX or PPTX document keep it all together?

We are going to dig into a webpage document and a Word document and see what it looks like in there.

Webpage: An HTML document

It may be 2019 now, where URLs can end with all kinds of extensions like .aspx and .php and even no extension, but a decade or two ago they almost always ended in .html. That’s because, as I mentioned back in Issue 3, the basic format of any web document is HTML. I apologise for leaving that acronym unexplained up till now.

HTML stands for Hypertext Markup Language. We’ve seen this word “Hypertext” before, when I explained the Hypertext Transfer Protocol (HTTP, Issue 7), the set of rules that our web browsers use to request Hypertext Markup Language documents. See a link now?

HTML is not a programming language. You can’t write code and tell a computer to make different decisions just by writing HTML. You can create a button using HTML, but you can’t use HTML to tell the computer to send your credit card details to another server on the Internet when you click that button. And that is why we refer to it by another term: a markup language.

HTML Markup tags

This is (a snippet of) the previous issue, as an HTML file:

Issue 49 as an HTML file

Thank goodness we have syntax highlighting, which should make it easier to notice all the little tags that start with an opening angle bracket < and end with a closing angle bracket >. These are called HTML tags, and they signify the start and end of segments in the document.

<html> starts the document, </html> ends it.
<head></head> contains information about the page: the page title (which will appear in the title bar of your web browser) and the styles to apply to the document. The styles are within <style type="text/css">…</style>, which I have hidden here and will show later.
<body class="app">…</body> contains the main part of the document, which is what we will see.
<h1> and <h3> signify different levels of headers, which can all be formatted separately.
<div> (for “division”) is a generic container, within which you can embed images or other text.
<p>…</p> (for paragraph) indicates to a web browser that the content is to be treated like a text paragraph.
<strong>…</strong> indicates that it is to be formatted in strong fashion (which is usually treated as bold text … but you can change that in the styles section in the <head>).
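If you would like to see a program pick out tags the way a browser does, here is a short Python sketch using the html.parser module from Python’s standard library. The page text is a cut-down stand-in that I made up, not the real source of Issue 49, and TagLister is just a name I chose.

```python
from html.parser import HTMLParser


class TagLister(HTMLParser):
    """Record every opening tag and its class attribute, if any."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        self.seen.append((tag, dict(attrs).get("class")))


# A made-up miniature page, for illustration only.
page = """
<html>
<head><title>Issue 49</title></head>
<body class="app">
<h1>What is a File?</h1>
<p>Files generally have a <strong>file header</strong>.</p>
</body>
</html>
"""

lister = TagLister()
lister.feed(page)
for tag, css_class in lister.seen:
    print(tag, css_class)
```

It prints each opening tag in order, and only body comes with a class (app); every other tag shows None.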

What are those class="…" attributes in the tags? The web browser creates a content element for each tag, and styles it according to the predefined style class in the document, defined inside <style>…</style>. This is what that section looks like when expanded:

Element styles for Issue 49

I don’t need to explain the specifications for you to notice that <h1>, <h2>, <h3> etc all have a slightly different style defined for them. .app is a little different; it starts with a period (.) and is applied to everything that has the class="app" attribute (psst … that’s the <body> element from the earlier image!).

Yet at the same time, there are also other styles defined for <body>…</body>. The browser has rules for how it chooses which styles override which. Those rules are like the bible for web programmers and web designers, which thankfully we are not (*waves to any web folks in this mailing list*).

Okay, just two more tags to illustrate embedding other content:

The <a> tag (for “anchor”; don’t ask) is used to define links (those clickable things in a webpage) and the place it links to is defined as a href="…" attribute.

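The <img> tag (for “image”) is used to insert images. Alternative text, which is shown when the image cannot be loaded and is read out by screen readers, is defined in the alt="…" attribute (rollover text, which appears when you put the mouse cursor over the image without clicking, belongs in the title="…" attribute), while the URL of the image is defined in the src="…" attribute.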
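The same standard-library parser can pull out just the links and images. This is a sketch only: the snippet, the example.com address, and the file name header.png are all invented for illustration.

```python
from html.parser import HTMLParser


class LinkAndImageFinder(HTMLParser):
    """Collect the href of every <a> and the src/alt of every <img>."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.images = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "img":
            self.images.append((attrs.get("src"), attrs.get("alt")))


# A made-up snippet with one link and one image.
snippet = (
    '<p><a href="https://example.com/issue-49">Read Issue 49</a> '
    '<img src="header.png" alt="The MP3 file structure"></p>'
)

finder = LinkAndImageFinder()
finder.feed(snippet)
print(finder.links)   # ['https://example.com/issue-49']
print(finder.images)  # [('header.png', 'The MP3 file structure')]
```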

(Embedding an image in a webpage is also possible, but I don’t want to go into depth here because I would have to explain many more concepts before that.)

Word document: An XML document

I probably didn’t need to explain so much in an issue that’s not Introduction to HTML, but I think it will help make the next part easier to grasp.

Last issue I said this:

That also means you can spoof a lot of software into thinking you have a zip file when you in fact have an .epub ebook file. This is a pretty common way to unpack files that use the zip archive format to pack their files!

Suppose we do that with a DOCX file … heck, let’s convert Issue 49 into a DOCX, rename it to a .zip file and see what happens.

Don’t run! Most of it is unimportantly technical, so we’ll just jump right into the interesting part, which is document.xml. Take a deep breath …

document.xml

Okay, ouch. That’s a different tag language, called eXtensible Markup Language (XML). Interestingly enough, each of those tags starts with w:, followed by some familiar phrases: body, p, and others that are not so familiar.

But look, there’s also “Heading1” and “Heading3”! Other than the fact that the tags look completely different, it still uses tags in similar fashion.

Documents are just another kind of complex file

So, that’s a Word document demystified. When you save a Word document, it just converts whatever you were working on into tags, like this, and zips it all up into a zip file. And any other program that knows how to read these XML files and edit them correctly can then open and edit a DOCX file too.
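For the curious, here is a Python sketch of that round trip using only the standard library. To keep it self-contained it builds a tiny stand-in for a DOCX in memory rather than opening a real one; the XML is a heavily simplified version I wrote by hand, and a real DOCX contains many more files than word/document.xml.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# The namespace that the w: prefix stands for in document.xml.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written, heavily simplified stand-in for a real document.xml.
DOCUMENT_XML = f"""<?xml version="1.0" encoding="UTF-8"?>
<w:document xmlns:w="{W}">
  <w:body>
    <w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr><w:r><w:t>What is a File?</w:t></w:r></w:p>
    <w:p><w:r><w:t>Files generally have a file header.</w:t></w:r></w:p>
  </w:body>
</w:document>"""

# "Save": zip the tags up, as a word processor would.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", DOCUMENT_XML)

# "Open": unzip, then read the tags back out.
paragraphs = []
with zipfile.ZipFile(buffer) as archive:
    root = ET.fromstring(archive.read("word/document.xml"))
    for p in root.iter(f"{{{W}}}p"):
        style = p.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        style_name = style.get(f"{{{W}}}val") if style is not None else None
        text = "".join(t.text or "" for t in p.iter(f"{{{W}}}t"))
        paragraphs.append((style_name, text))

print(paragraphs)
```

The first paragraph comes back labelled Heading1 and the second with no style at all, which is all a program needs to know to format them differently.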

Issue summary: An HTML file contains markup tags that tell the browser how to interpret and format the text within the tags. Other document formats usually use tags in a similar way. These tags constitute a markup language that any app can use to mark up its own text too.

Okay, I hope I’ve demystified webpages, text documents, and just about any place where you see formatted text just a little bit. Just about any place where you see formatting being done to text, there’s some kind of markup language working in the background. Of course, it’s often going to be much more complicated and messy than a little newsletter, but that is why we get computers to handle it.

What I’ll be covering next

Next issue: PDFs Part 1

I’ll round up Season 4 with two issues on everyone’s favourite hated format: PDFs. I think a lot of the reasons people love PDF are spot on, and were how PDFs were sort of intended to be used. And a lot of the reasons people hate PDF occur in cases that PDF was never meant to be used for. They still ended up being used because no better format came along to serve that purpose. More on this in Issue 51.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
~~HTML? [Issue 38]~~
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

Issue 49: What is a File?

2019-11-30T08:00:00+08:00

Previously: A video container can hold one or more audio, video, or text data streams. To encode or decode a data stream, you need to have the necessary codec installed. Most video runs at 25 or 30 fps, with high-quality video going up to 60 fps. You can use a program like MediaInfo to help you decipher the streams inside a video container file.

Images, audio, video, and more … we are so used to thinking of them as different kinds of files. But within the computer’s binary world, how does it tell that one file is a different type from another? In the human world, if you ran across a bunch of unlabelled boxes of various types and sizes, you would have no way of telling what is in each box. And you know this is a terrible way to move house—you would have to at least label the boxes by colour, by room, or by type of contents.

You would also have encountered this if you bought anything online. Your packages arrive with a shipping label, which is a quick and convenient way for the shipping companies to identify the package, type of contents, origin, and destination.

The box labels and the shipping labels tell us about the contents, but they are not the contents themselves. We refer to such data as metadata. Metadata is data about data.

For a computer to be able to handle so many files without inspecting them individually, it must also have metadata about each of these files.

The file header

Files generally have a file header. The GIF file format begins with a header (“GIF87a” or “GIF89a”), so anytime a piece of software (e.g. an image editor) starts to read a file header and detects that label, it knows it’s dealing with a GIF file and not a JPG file.

When the software opens a GIF file, and before it has read anything beyond this header signature (that’s what the GIF file specification calls the above label), it doesn’t know anything about this GIF file. Before doing anything else, it will at least need to know the width and height of this image, and in the case of GIF, some information about its colour palette (which can vary from GIF to GIF). All this information is stored within the file header, and the software will have to know how to read it from the header.
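Here is roughly what reading that header looks like in Python. It is a sketch that assumes the layout in the GIF specification: the six-character signature, then the width and height as two 16-bit numbers, then a byte of flags describing the palette. The sample bytes are made up by the script itself rather than taken from a real image.

```python
import struct


def read_gif_header(data):
    """Read the signature, size and palette info from the start of a GIF."""
    signature = data[:6]
    if signature not in (b"GIF87a", b"GIF89a"):
        raise ValueError("not a GIF file")
    # Two little-endian 16-bit numbers, then one byte of packed flags.
    width, height, packed = struct.unpack("<HHB", data[6:11])
    has_palette = bool(packed & 0b10000000)
    # The lowest three bits encode the palette size as 2 ** (n + 1).
    palette_colours = 2 ** ((packed & 0b00000111) + 1) if has_palette else 0
    return {
        "version": signature[3:].decode("ascii"),
        "width": width,
        "height": height,
        "palette_colours": palette_colours,
    }


# Fake header for an 800x600 GIF with a 256-colour palette.
sample = b"GIF89a" + struct.pack("<HHB", 800, 600, 0b10000111) + b"\x00\x00"
print(read_gif_header(sample))
```

Only after these few bytes does the software know how big the picture is and how many colours to expect.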

If for any reason you wish to start writing software that can edit GIF files, you can find its detailed specifications online. This is because when CompuServe came up with the format in the early days of the internet, they meant it to be widely used. Companies who design a file format to be used internally and not for public use will come up with proprietary file formats, which are inscrutable to most people. Anyone coming across such a file would have no idea how to open it.

If you want to figure out such a file format, you would have to reverse-engineer it, like this guy on StackExchange. Since typical engineering means starting with a blueprint and coming up with a product, reverse-engineering means starting with the product and trying to figure out its blueprint. Here’s Julia Evans having a go at reverse-engineering the Notability file format.

Another example: The MP3 file format is simpler (although not easier to decode). Audio data is organised into frames, each frame having its own header followed by data. What about the artist name, record label, genre, date of release, and other information that comes with the file? All that is stored within the ID3 portion of the file metadata.

The MP3 file structure
Image from Wikipedia
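As a small illustration, here is a Python sketch that reads the older ID3v1 flavour of that metadata. I am assuming ID3v1 specifically because it is the simplest: a fixed 128-byte block at the very end of the file that starts with the letters TAG. Newer files usually carry the more elaborate ID3v2 at the start instead, which this sketch ignores. The "file" here is fabricated in the script.

```python
def read_id3v1(data):
    """Return the ID3v1 fields from the last 128 bytes, or None if absent."""
    tag = data[-128:]
    if len(tag) < 128 or not tag.startswith(b"TAG"):
        return None

    def field(start, length):
        raw = tag[start:start + length]
        # Fields are padded with zero bytes up to their fixed length.
        return raw.split(b"\x00")[0].decode("latin-1").strip()

    return {
        "title": field(3, 30),
        "artist": field(33, 30),
        "album": field(63, 30),
        "year": field(93, 4),
    }


# A fabricated file: some stand-in audio bytes followed by a 128-byte tag.
fake_audio = b"\xff\xfb" + b"\x00" * 100
fake_tag = (
    b"TAG"
    + b"My Song".ljust(30, b"\x00")
    + b"Some Artist".ljust(30, b"\x00")
    + b"An Album".ljust(30, b"\x00")
    + b"2019"
    + b"\x00" * 30   # comment
    + b"\xff"        # genre byte
)
print(read_id3v1(fake_audio + fake_tag))
```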

File extension

That seems like an awfully complicated way for operating systems to detect what type of files they have. They would have to open each file individually, even if just to read the header, and then figure out which complicated set of patterns the header matches. When you open a folder, save a file, or download something from the internet, the computer seems to do that detection much faster.

That’s because when speed is a concern, software will often attempt to detect the filetype simply by detecting the file extension. File extensions are the ending characters in the filename, after the period (“.”). A file named sound.mp3 has a .mp3 file extension, and one named image.gif has a .gif file extension. That’s a much faster way to detect a whole bunch of filetypes: it’s quick-and-dirty, and it mostly works.

That also means you can spoof a lot of software into thinking you have a zip file when you in fact have an .epub ebook file. This is a pretty common way to unpack files that use the zip archive format to pack their files! So if you write software that absolutely needs to be sure it has the right filetype, you should double-check the file header instead of jumping to assumptions from the file extension alone.
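Here is a minimal Python sketch of the two approaches side by side. The short table of signatures covers only a handful of well-known formats, and the function names and the fake .epub bytes are made up for illustration.

```python
from pathlib import Path

# The first few bytes of some well-known formats.
SIGNATURES = {
    b"GIF87a": "gif",
    b"GIF89a": "gif",
    b"PK\x03\x04": "zip",
    b"%PDF-": "pdf",
    b"\x89PNG\r\n\x1a\n": "png",
}


def type_from_extension(filename):
    """Quick and dirty: trust whatever comes after the last period."""
    return Path(filename).suffix.lower().lstrip(".") or None


def type_from_header(data):
    """Slower but safer: compare the first bytes against known signatures."""
    for signature, kind in SIGNATURES.items():
        if data.startswith(signature):
            return kind
    return None


# A pretend .epub: its name says "epub", its first bytes say "zip".
epub_bytes = b"PK\x03\x04" + b"...rest of the file..."
print(type_from_extension("book.epub"))  # epub
print(type_from_header(epub_bytes))      # zip
```

The two answers disagree, which is precisely why renaming an .epub to .zip works: underneath the name, it was a zip file all along.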

What about the internet?

Guessing from file extensions might work in a computer, but on the internet it’s the Wild West. Images sent as data packets over the internet do not come with filenames; notice how some apps (e.g. WhatsApp) give your image a different name when you upload or download it? And on some web platforms, especially those that handle huge volumes of images, the filenames are just semi-random characters.

That is why we rely on what are known as MIME types. MIME stands for Multipurpose Internet Mail Extensions, and yes there is an RFC for it, RFC 6838. This is a much more standardised way of declaring what type of file you have. The exhaustive list of MIME types, maintained by IANA (whom we first met in Issue 27), has MIME types for application files, audio, font, image, messages, model, multipart formats, text, and video.

If you plan to come up with a file format that is intended to be used widely, you can apply for it to be included in the list.

MIME types and HTTP headers

Remember this HTTP header from Issue 8?

See that label in the third row, with the Content-Type label, “application/json”? That’s the MIME type for the JSON data format. When the server returns data, my browser (the client) has no idea what format it is. It might be nicely formatted HTML meant for human consumption, but it might also be plain text, JSON data (like in this case), XML, or any of the various data formats that people use. Declaring the MIME type properly makes it easier for the browser to know what to do with the data.
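In code, that decision can be as simple as the Python sketch below. It uses the standard library’s email.message module to pick the header apart (the Content-Type header follows the same conventions as in email); decode_body is a hypothetical helper, and a real browser handles many more types than the two shown.

```python
import json
from email.message import Message


def decode_body(content_type_header, body):
    """Decide how to treat the raw bytes based on the declared MIME type."""
    header = Message()
    header["Content-Type"] = content_type_header
    mime_type = header.get_content_type()
    # Assumption: fall back to UTF-8 when no charset is declared.
    charset = header.get_content_charset() or "utf-8"
    text = body.decode(charset)
    if mime_type == "application/json":
        return json.loads(text)  # structured data
    return text                  # anything else: hand back the text as-is


print(decode_body("application/json; charset=utf-8", b'{"issue": 49}'))
print(decode_body("text/html", b"<p>Hello</p>"))
```

The same bytes could be treated very differently depending on that one label, which is why servers are expected to declare it correctly.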

Issue summary: A file consists of data, preceded by a file header which describes the data. Software (including operating systems) detects the kind of data contained in a file by 1) glancing at the file extension, 2) looking at its declared MIME type (if any), and 3) checking the file header, in order of difficulty and accuracy.

I almost started writing a long post about filesystems, but stopped myself in time. I hoped with this issue to continue emphasising the theme of data encapsulation: data locked in shells upon shells upon shells of metadata. I’ll be back to describing other types of data again for the rest of the season, but I thought file headers would be good to introduce at this point.

After this season I won’t be digging into complex data types, but when I move on to operating systems I’ll cycle back to filesystems and what you need to know about them. Before I get to that season, though, here’s something for you to ponder: if all data is ultimately binary, how would an app know where one file ends and where another starts? Does the file header for mydocument.doc start at this 0, or another 0, or actually at this 1?

What I’ll be covering next

Next issue: Complex file formats and the Document

Two issues ago, I just talked about video formats, which include multiple types of data: video, audio, and even text (subtitles).

Next issue, we’ll pick up where we left off to look at another format that includes multiple data types: the document.

See you again next week, next issue.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
HTML? [Issue 38]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]

Issue 48: Of containers and codecs

2019-11-23T08:00:00+08:00

Previously: Data cannot be compressed beyond its predictability limit in a lossless fashion. Lossless compression does not discard any information. It spots patterns in the data and represents them with fewer bits, through a combination of predictive coding, run-length encoding, and entropy coding.

In past issues this season, I went into some detail about how images and sound are represented as data in computers. I also went into a little detail about lossy compression, in which imperceptible information is discarded, and lossless compression, in which the original information can be reconstructed.

That progression finally brings me to this issue, where I introduce the first complex data representation: the video file.

A video file, as we like to think about it, actually is not a simple form of data. It can have one or more of the following:

video data
audio data
subtitles
annotations (e.g. on YouTube videos)
chapters (which let you jump to certain points in the video, like a bookmark)
miscellaneous files (e.g. embedded copyright information)

These various types of information, if they are time-sensitive (video, audio, and subtitles), have to be presented in synchrony. It’s not like you can just throw them into a simple zip file or folder and the computer knows what to do with them! How does a computer know how to put them together into an engaging movie?

The video container

What we usually understand as a video file is actually a video container format. The common ones we encounter online today are MP4 (.mp4) and QuickTime (.mov). In the not-too-distant past, you would have commonly encountered AVI (.avi), 3GPP (.3gp), and Flash Video (.flv). And if you’re a video techie who dives into DVDs and Blu-ray discs, you would also have seen Video Objects (.vob) and MPEG Transport Streams (.ts) while digging through their contents on a computer.

The audio, image, and text data in the video container are referred to as streams. At the binary level, it’s all 1s and 0s; how does the computer know which part of the file contains audio, image, or text data? This information is in the video container metadata, along with more details on how to load the correct part of the video, audio, or text at the right time.

If you have come across poorly formed video where the image and audio data is not in sync, or the subtitles come too early/late, you know how critical it is to get this right: the human eye and ear can be pretty sensitive to even slight discrepancies in timing.

From still image to video

I’ve talked about how pixels are perceived in still image data. Now I’ll introduce one more aspect of psychovisuals: how the human eye perceives motion.

The eye interacts with the brain in strange ways. Over millions of years of evolution, the brain has evolved a ‘high-power’ and a ‘low-power’ way to receive information from the eye. Under everyday conditions, the brain is able to connect separate frames of image data into a coherent picture and interpretation without being confused by the differences between each frame.

Decades of experimentation have set the gold standard for motion pictures at 60 frames per second (fps) for a seamless experience. That’s a lot of images per second, and a lot of corresponding video data!

For everyday purposes, such as online streaming, it is more common to encounter 30 fps, or even 25 fps for older videos. In certain types of video entertainment, such as hand-drawn animation, the human eye can make do with 15 fps and the brain can still piece together an enjoyable performance!

Data streams

How about the data streams? How are they stored?

To start with the obvious, they are not stored uncompressed; we saw that a single image of 1920×1080 pixels (that’s the 1080p video standard, with 1080 pixels vertically) already requires 6 MiB (Issue 43), while one second of audio requires 86 KiB (Issue 45).
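If you want to check those figures yourself, here is a quick back-of-the-envelope calculation in Python. It assumes 3 bytes per pixel (one each for red, green, and blue), a single channel of 16-bit audio at 44.1 kHz, and a frame rate of 30 fps; the frame rate is my assumption, picked just to show how quickly things add up.

```python
# Uncompressed sizes, using the figures quoted above.
width, height = 1920, 1080
bytes_per_pixel = 3            # one byte each for red, green, and blue
frame_bytes = width * height * bytes_per_pixel
frame_mib = frame_bytes / (1024 * 1024)

fps = 30                       # assumed frame rate
video_mib_per_second = frame_mib * fps

sample_rate = 44_100           # CD audio samples per second
bytes_per_sample = 2           # 16 bits per sample, one channel
audio_kib_per_second = sample_rate * bytes_per_sample / 1024

print(f"One frame: {frame_mib:.2f} MiB")
print(f"One second of video at {fps} fps: {video_mib_per_second:.0f} MiB")
print(f"One second of audio: {audio_kib_per_second:.0f} KiB")
```

One frame comes to just under 6 MiB, so a single second of uncompressed video already needs well over a hundred MiB.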

In addition to the lossy compression techniques I covered in Issue 46, software that creates these streams can also compare video frames at different points in time and throw away identical parts (if there’s no scene change, or if the camera is panning slowly, for instance).

Various video stream formats exist to carry out this lossy compression of video data.

h264 (a.k.a. AVC, for Advanced Video Coding) is still the most common video stream format in use today.
h265 (a.k.a. HEVC, for High Efficiency Video Coding) is slated to replace it and is set to become more and more popular.
Google’s VP9 is attempting to compete with it (with companies such as Netflix already on board).
FLV (as a video stream format, not a container; I know it’s confusing) is becoming less and less common.

What about audio? We used to encounter MP3 pretty often, but today most audio stream data is stored as AAC (for Advanced Audio Coding, the standard that’s meant to replace MP3), Dolby (often on DVDs and Blu-rays), and sometimes Vorbis (.ogg).

Confused yet? Just remember that the video file you have (carrying the .mp4, .mov, etc file extension) is only the container, and it contains one or more streams of actual data.

Encoding and decoding

To use these streams, you need a piece of software on your computer. This piece of software encodes or decodes the data stream, so it is called a codec. If you don’t have the required codecs, you will get an error when you attempt to open a video container file that has one or more streams in that format.

The operating system you use comes bundled with support for the most common formats, although for free-and-open-source OSes (like some flavours of Linux) this may be hampered by copyright restrictions.

About a decade ago, when video formats proliferated like a tropical ecosystem, codec packs containing just about every codec you need were a common sight online. Today, with most video moved to online streaming platforms, you no longer need them.

MediaInfo: a program to decipher containers and streams

You can use a program like MediaInfo to help you read the metadata and figure out the container and stream formats. Here’s an example of the information it shows about the only video file on my laptop at the moment:

MediaInfo screenshot showing metadata for an MP4 file containing an h264 (a.k.a. AVC) video stream and an AAC audio stream.

Issue summary: A video container can hold one or more audio, video, or text data streams. To encode or decode a data stream, you need to have the necessary codec installed¹. Most video runs at 25 or 30 fps, with high-quality video going up to 60 fps. You can use a program like MediaInfo to help you decipher the streams inside a video container file.

The key part of this issue I really wanted to get to was about codecs. “Why can’t I open this video file?” was a much more popular question in the recent past, but it has gradually faded as more and more video gets moved to YouTube. Today, I suppose the only people who still run into this problem are teachers who come across archives of old videos while hunting for teaching resources.

But still, I anticipate that I need a gentle introduction to data encapsulation. That’s a complex way of talking about data being nested in a series of shells, like a Matryoshka doll. We’ve seen some examples from the previous season on networking: data stored in an HTTP request, which is encapsulated in a TCP packet, which is encapsulated in an IP packet before it is sent over the Internet.

Today, I can have video stream information stored in an MP4 container, placed in a folder in a losslessly-compressed ZIP file (for whatever strange reason), and sent over the Internet to somebody else. Data surrounded by shells and more shells. It’s like opening a delivery box: your tiny item inside, surrounded by cardboard packaging, surrounded by bubble wrap, surrounded by a cardboard box, which was probably placed on a pallet and shipped in a shipping container.

The next few issues will continue to be about encapsulated data, but I’ll start with something simple first: what is a file?

What I’ll be covering next

Next issue: What is a file?

Sometimes, the hardest questions are deceptively simple. We all have an intuitive idea of what a file is. But what actually goes on under the hood?

See you again next week, next issue.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
HTML? [Issue 38]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]

Come to think of it, that’s a good topic for a future issue: what goes on when a piece of software is installed on your computer? ↩

Issue 47: Lossless compression

2019-11-16T08:00:00+08:00

Previously: Computers compress image and audio data through a process similar to summarising: it analyses the data using algorithms that use brightness and colour instead of RGB values for images, and different frequencies of sound rather than samples at different points in time for audio. These algorithms then discard parts of the information that human senses do not perceive easily, and reduce the resolution of other parts that human senses are not as sensitive to.

I went into quite a bit of technical detail in the last issue, and am loath to do so again this issue. Let’s see how much math I can avoid explaining this time.

Lossless compression is necessary in cases where the information must be stored verbatim. For example, if you are sending a 26MB PowerPoint file to a friend, but Gmail’s attachment limit is only 25MB, one thing you might try to do is put it into a compressed ZIP file to see if you can bring the size below 25MB. However, you would not want any information to be lost when it reaches your friend; they must be able to decode the ZIP file to retrieve the original PowerPoint file.

While lossy compression depends very much on how our senses (particularly sight and hearing) work and on their deficiencies, lossless compression only depends on the characteristics of the information. Accordingly, a wide variety of lossless compression techniques have been developed, each suited for a particular domain. I will attempt to give a very brief overview of some common techniques before I explain some common things people try to do in compression.

Lossless audio compression

Your brain works in interesting ways. If it sees two images that are near-identical (like a game of Spot The Difference), it won’t remember them as two separate images, but as one image, and the difference between the two images. So when people try to recall the two images you hear things like “this photo had a cat and a dog staring each other down and it also had [blahblah], the other photo is exactly the same except the cat’s ears were furled back and the dog was drooling”. Certainly a lot faster than describing the second image exactly the same way, with the additional detail!

Lossless audio compressors work in a similar way. They sample the audio in short segments, and try to see how lazy they can get in describing the next sample. This is known as predictive coding, because it is a little similar to the process of trying to “predict” the next sample. For example, based on the past 10 samples, a predictive algorithm might say “the next sample will have 0.09% of sample 1, 1.02% of sample 2, 5.63% of sample 3, …”. Storing those percentages will use a lot less space than storing the entire sample; when decompressing, the algorithm can then multiply the percentages with the respective samples to reconstruct the original sample.

In lossless compression, the predictive algorithm already knows what the next sample is, so most of the work is in calculating exactly what those percentages are. It does so by making an initial guess, then refining that guess in successive stages of calculation, each stage bringing it closer to the original waveform. This requires a lot of computation time. If such a setting is available, the algorithm can shorten the process, leading to a poorer guess. It then calculates the difference between the best guess and the original sample, and stores the difference between the two. This part is what makes it lossless rather than lossy.
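Here is a toy sketch of the idea in Python. To keep it short, the predictor uses fixed weights (it simply continues the straight line through the last two samples) rather than calculating the best weights for each segment as a real compressor would, and the sample values are made up.

```python
def predict(previous, before_previous):
    """Guess the next sample by continuing the straight line through the last two."""
    return 2 * previous - before_previous

def encode(samples):
    """Keep the first two samples, then store only the prediction errors."""
    residuals = list(samples[:2])
    for i in range(2, len(samples)):
        guess = predict(samples[i - 1], samples[i - 2])
        residuals.append(samples[i] - guess)
    return residuals

def decode(residuals):
    """Rebuild the original samples by repeating the same predictions."""
    samples = list(residuals[:2])
    for i in range(2, len(residuals)):
        guess = predict(samples[i - 1], samples[i - 2])
        samples.append(guess + residuals[i])
    return samples

original = [100, 104, 109, 115, 120, 124, 127, 129]
stored = encode(original)

print("Stored values:", stored)
print("Perfectly reconstructed:", decode(stored) == original)
```

After the first two samples, the stored values are all tiny numbers close to zero, which need far fewer bits than the original samples. Because the differences are kept, nothing is lost.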

Lossless image compression

The most common image formats that use lossless compression are GIF (yes, really) and PNG. Some kinds of images, such as screenshots, have patterns that are repeated. The algorithms used in GIF (LZW) and PNG (DEFLATE, which builds on LZ77) attempt to spot these patterns, and reduce them to 1) the repeating portion, and 2) the number of repetitions. In its simplest form, this is known as run-length encoding. The nature of images makes the process easier, as each colour value of a pixel only has 256 possible values rather than the 65536 of a 16-bit audio sample.
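Run-length encoding in its most basic form fits in a few lines of Python. This is only a sketch of the principle, not what GIF or PNG actually do internally, and the row of pixels is made up (think of one row of a screenshot with a short black line on a white background).

```python
def rle_encode(pixels):
    """Collapse each run of identical values into a [value, count] pair."""
    runs = []
    for value in pixels:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs

def rle_decode(runs):
    """Expand each [value, count] pair back into the original run."""
    pixels = []
    for value, count in runs:
        pixels.extend([value] * count)
    return pixels

row = [255] * 12 + [0] * 3 + [255] * 9   # 24 pixels: white, black, white
encoded = rle_encode(row)

print(encoded)
print("Perfectly reconstructed:", rle_decode(encoded) == row)
```

The 24 pixel values shrink to 3 pairs of numbers, and decoding gives back exactly the same row.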

Those patterns are stored in a table, and references to them are used instead. So instead of saying “Pattern 0101011101110110”, the algorithm will store a list of these patterns, and refer to them as Pattern 0, Pattern 1, Pattern 10, Pattern 11, … (these are 0, 1, 2, and 3 respectively, in binary representation (Issue 40)).

This is known as entropy coding. By linking the most frequently occurring pattern with the shortest reference number (i.e. Pattern 0), the next-most-frequent pattern with the next-shortest reference number (Pattern 1, 10, 11, 100, 101, 110, …) you can reduce quite significantly the number of bits needed to represent the image.
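One widely used entropy coder is Huffman coding, which hands out shorter codes to patterns that occur more often. The sketch below is a bare-bones version in Python; the four "patterns" A to D and how often they occur are made up for illustration.

```python
import heapq
from collections import Counter

def huffman_codes(symbols):
    """Build a {symbol: bit string} table, with shorter codes for commoner symbols."""
    counts = Counter(symbols)
    # Each heap entry: (total count, tie-breaker, {symbol: code built so far})
    heap = [(count, i, {symbol: ""}) for i, (symbol, count) in enumerate(counts.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two rarest groups, prefixing their codes with 0 and 1.
        count_a, _, codes_a = heapq.heappop(heap)
        count_b, _, codes_b = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in codes_a.items()}
        merged.update({s: "1" + c for s, c in codes_b.items()})
        heapq.heappush(heap, (count_a + count_b, next_id, merged))
        next_id += 1
    return heap[0][2]

patterns = ["A"] * 8 + ["B"] * 4 + ["C"] * 2 + ["D"] * 2
codes = huffman_codes(patterns)

fixed_bits = 2 * len(patterns)                       # 2 bits each for 4 patterns
huffman_bits = sum(len(codes[p]) for p in patterns)

print(codes)
print(f"Fixed-length: {fixed_bits} bits, Huffman: {huffman_bits} bits")
```

The most common pattern gets a 1-bit code while the rarest ones get 3 bits, and the total still comes out smaller than giving every pattern the same 2 bits.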

Text compression

Text lends itself very well to compression, since there are so many repeated words and phrases. In general, text compression algorithms will use a combination of entropy coding and run-length encoding to reduce a document of text into repeating patterns, and use shorter references to those patterns rather than the full patterns themselves.

What is the maximum possible compression?

Excellent question. Shannon’s source coding theorem¹ defines a compression limit for each block of information, called Shannon entropy (unbolded, don’t worry!). The source coding theorem says it is impossible to compress data beyond its Shannon entropy.

So what is the Shannon entropy of the data? That depends on its predictability. A block of text that only consists of the letter ‘e’ would be highly predictable, and therefore have a low Shannon entropy (I will stop using this term and use predictability limit instead). A block of text that is just completely random characters would be unpredictable and would therefore have a high Shannon entropy.

tl;dr higher predictability = higher (lossless) compression, lower predictability = lower (lossless) compression
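For the curious, here is a simplified way to put a number on predictability in Python. It only looks at how often each character appears and ignores longer patterns such as repeated words, so treat it as a rough illustration rather than the true limit for a piece of text.

```python
import math
from collections import Counter

def entropy_bits_per_char(text):
    """Shannon entropy based on character frequencies alone, in bits per character."""
    counts = Counter(text)
    total = len(text)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

predictable = "eeeeeeee"      # only one possible character
unpredictable = "abcdefgh"    # eight characters, all equally likely

print(entropy_bits_per_char(predictable))
print(entropy_bits_per_char(unpredictable))
```

The all-‘e’ text needs 0 bits per character (there is nothing to tell), while the text with eight equally likely characters needs 3 bits per character.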

And now it is myth-busting time! Well, not really, since most observant folks would have noticed this by now.

When I put a zip file in another zip file, why is the second zip file no smaller in size than the first?

When the first zip file compressed its contents, the predictability of the resulting data decreased (ever tried compressing shorthand?). You won’t get very far trying to compress unpredictable data.

If you want greater compression, use a higher compression setting on the original file instead.
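You can try this out with Python’s built-in zlib module, which uses the same family of compression as zip files. The text being compressed here is made-up filler (random words picked from a short list), so the exact sizes will differ from your own files.

```python
import random
import zlib

# Made-up, fairly predictable text: 20,000 words drawn from a list of 8.
rng = random.Random(0)
words = ["file", "data", "header", "video", "audio", "codec", "stream", "pixel"]
original = " ".join(rng.choice(words) for _ in range(20_000)).encode("ascii")

once = zlib.compress(original, 9)    # compress the text
twice = zlib.compress(once, 9)       # then compress the compressed result

print("Original:        ", len(original), "bytes")
print("Compressed once: ", len(once), "bytes")
print("Compressed twice:", len(twice), "bytes")
```

The first pass shrinks the text a lot. The second pass has only unpredictable-looking data to work with, so it gains next to nothing.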

7zip archive settings for zip files.
Image from Wikimedia Commons

A higher compression level generally causes the algorithm to try more combinations and iterations of compression, while a larger dictionary size enables the algorithm to use more pattern references. Play with these two settings to find the best tradeoff between compression time and compression ratio (the ratio of final filesize to original filesize).
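Here is a small Python experiment along the same lines, again using the built-in zlib module and made-up filler text. It only varies the compression level (1 is fastest, 9 tries hardest); I have left the dictionary size at its default.

```python
import random
import zlib

# Made-up filler text: 20,000 words drawn from a short list.
rng = random.Random(1)
words = ["zip", "archive", "pattern", "table", "entropy", "shannon", "limit", "ratio"]
data = " ".join(rng.choice(words) for _ in range(20_000)).encode("ascii")

sizes = {level: len(zlib.compress(data, level)) for level in (1, 6, 9)}

for level, size in sizes.items():
    print(f"Level {level}: {size} bytes, ratio {size / len(data):.2f}")
```

On data like this, the higher levels should produce a somewhat smaller result at the cost of more computation time.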

Why do PowerPoint files sometimes compress very well and sometimes not at all?

PowerPoint is already a compressed file format, so the only filesize gains you will get are from compressing embedded media, such as videos or images. If you used any uncompressed images, you might be able to achieve some filesize gains. But it is better to have PowerPoint handle the compression instead; it offers a Compress Pictures option.

You talk about your highfalutin Shannon entropy, but I can find so many tiny video and image files online! How do they achieve that?

Shannon’s source coding theorem does not claim that you cannot compress data beyond its predictability limit. It only claims that you cannot do so losslessly. Which means you can compress data beyond its predictability limit, lossily.

You are getting video and image files from those sources with lots of information thrown away. If you can’t tell the difference, good for you.

Issue summary: Data cannot be compressed beyond its predictability limit (Shannon entropy) in a lossless fashion. Lossless compression does not discard any information. It generally tries to spot patterns in the data, and represent those patterns with fewer bits, through a combination of predictive coding, run-length encoding, and entropy coding.

Predictive coding: express samples as a combination of past samples
Run-length encoding: spot repetitions of patterns in the data
Entropy coding: Store the list of patterns, using a shorter symbol as reference to the pattern

If the lossy compression articles are hard to read, the lossless compression articles are even worse, because so much of their content is math theory. I got the gist of it as best I could.

I don’t like the way most layman explanations in the media completely skip over the details; before I understood lossless compression, these explanations were often no help to me. I think at least knowing what kind of patterns can be found in the data would help with imagining the process, hence the crash-course introductions to predictive coding, run-length encoding, and entropy coding.

What I’ll be covering next

Next issue: Of containers and codecs

Why have we been talking so much about images and audio and compression? Because I want to get to the meat, which is: video formats! This is probably the single biggest source of confusion for most people who come to look for me regarding file types: “What kind of video file is this? How do I open it? Why can’t it open?” Next issue: a simple way to understand video formats and what they need.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
HTML? [Issue 38]
OpenType? And what are fonts anyway? [Issue 42]

Yes, that’s the same Shannon from Nyquist-Shannon sampling theorem. Claude Shannon is lauded as “the father of information theory” with good reason. ↩

Issue 46: Lossy compression

2019-11-09T08:00:00+08:00

Previously: Humans can distinguish 120 dB of loudness, which means the loudest perceivable sound is a million times louder than the softest perceivable sound. CD audio provides 16 bits of information per sample, sufficient to provide 96 dB. Humans have a hearing range from 20 Hz to 20 kHz. CD audio is sampled at 44.1 kHz. Uncompressed audio thus requires 705,600 bits per second, or 86 kB/s.

Lots of numbers in the last issue, and you don’t need to memorise any of them, but those numbers were necessary to demonstrate some fundamental facts about data and information: We need a heck of a lot of data to produce images that don’t look distorted and audio that doesn’t sound distorted! And this is closely related to the limits of our eyes and ears.

Why are the images and audio files on the internet so much smaller?

Because they are compressed, that’s why.

We all have that one friend (or maybe more) who can just drone on and on about their day, or about something that happened, giving a detailed account with every little thing that happened, and all the things that it reminds them of, and finally in their entire speech there’s that piece of information you are looking for!

Or maybe you’ve been in an hour-long meeting and your colleague missed it and asked you what they missed. Would it take you an hour to recount the key points? Probably not. You’d give a summary, highlighting only the key bits that would make a difference.

Computers do something similar using compression algorithms that analyse the data and figure out which parts can be safely discarded without affecting the gist of what’s being transferred. Because information is being discarded, this is known as lossy compression—you can never get back all of the original information once it has been lossily compressed.

If you’re thinking “this part is going to be incredibly math-ey”, you are right, but I have only an hour for this issue so I’ll see how I can further summarise the theory for you readers :)

Lossy image compression: luma and chroma

In Issue 44, I mentioned that the human eye has 3 types of cones that sense red, green, and blue light. What I didn’t mention then is that partly due to the way these cones are distributed, the human eye is more sensitive to differences in brightness (or “luma”) than differences in colour (“chroma”).

A black-and-white image has only luma information (brightness), while a colour image has both luma and chroma information—you can mathematically separate the data of a colour image into the brightness component (which looks just like a black-and-white photo), and a colour component, which looks like nothing you have ever seen. The closest thing to chroma information would be analog colour photo negatives, if you were born early enough to get to see those.

So that’s another way of representing image information: you can either represent it as RGB (red-green-blue) colour values, or YUV (1 luma value, Y, and 2 chroma values, U & V). In RGB, all 3 colour components are equally important and you can’t treat them differently, but in YUV you can process them differently to achieve lossy compression.
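To give a feel for the conversion, here is a sketch in Python using the luma/chroma formulas that JPEG files commonly use (strictly speaking this variant is called YCbCr rather than YUV, but the idea is the same: one brightness value and two colour values). Results are rounded and kept within the 0 to 255 range.

```python
def clamp(value):
    """Round to a whole number and keep it within 0 to 255."""
    return max(0, min(255, round(value)))

def rgb_to_ycbcr(r, g, b):
    """Split an RGB colour into 1 luma value (Y) and 2 chroma values (Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return clamp(y), clamp(cb), clamp(cr)

print("Mid grey:", rgb_to_ycbcr(128, 128, 128))
print("Pure red:", rgb_to_ycbcr(255, 0, 0))
```

A grey pixel ends up with both chroma values sitting at the neutral midpoint of 128, because it has brightness but no colour. Pure red has a fairly low luma but a chroma value pushed all the way to one extreme.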

Lossy image compression: chroma

Since the human eye is less sensitive to chroma (colour) information, in the JPEG image format, the chroma components are compressed by averaging each 2×2 group of pixels into 1 value for U and V each. (This process is known as subsampling.) Theoretically that halves the amount of data required for the same image! (For every 4 pixels, 4 Y + 1 U + 1 V = 6 values, instead of the original 4 Y + 4 U + 4 V = 12 values.)

Compare the image without chroma compression (4:4:4) to the image with chroma compression (4:2:0).
Without scrutiny, the human eye is not very sensitive to lower resolution in chroma.
Image from Wikimedia Commons
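The 2×2 averaging can be sketched in a few lines of Python. The tiny 4×4 grid of U values below is made up, and the function assumes the image has an even number of rows and columns.

```python
def subsample_2x2(channel):
    """Average each 2x2 block of a chroma channel into a single value."""
    out = []
    for r in range(0, len(channel), 2):
        out_row = []
        for c in range(0, len(channel[0]), 2):
            block = [channel[r][c], channel[r][c + 1],
                     channel[r + 1][c], channel[r + 1][c + 1]]
            out_row.append(sum(block) // 4)
        out.append(out_row)
    return out

# Made-up U (chroma) values for a 4x4 image.
u = [
    [100, 102, 200, 202],
    [104, 106, 204, 206],
    [50, 50, 60, 60],
    [50, 50, 60, 60],
]
small_u = subsample_2x2(u)

pixels = 4 * 4
values_before = 3 * pixels                   # full Y, U, and V
values_after = pixels + 2 * (pixels // 4)    # full Y, quarter-size U and V

print(small_u)
print(f"{values_after} values instead of {values_before}")
```

The 16 chroma values become 4, and doing the same for V brings the whole image from 48 values down to 24.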

Lossy image compression: luma

Furthermore, even within the luma channel (i.e. looking at luma information only), the human eye is more sensitive to gradual changes in brightness over a larger area than to fine, rapid changes in brightness across adjacent pixels. Through a Discrete Cosine Transform (DCT) algorithm, a computer can separate the luma information into parts with gradual changes and parts with rapid changes.

As the compression level increases (this is the quality setting you often play with in Photoshop and other image-editing software), the computer discards more and more information, starting from the rapid-change information. For photographic images, you will generally hit diminishing returns below 85%: each 1% decrease in quality brings you less and less savings on filesize.
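To see how a DCT tells gradual changes apart from rapid ones, here is a one-dimensional sketch in Python working on a single row of 8 brightness values. JPEG works on 8×8 blocks in two dimensions and scales the results differently, so this is a simplification, and both rows of pixels are made up.

```python
import math

def dct(samples):
    """Plain (unscaled) DCT: coefficient 0 is the overall level, higher ones mean faster change."""
    n = len(samples)
    return [
        sum(x * math.cos(math.pi / n * (i + 0.5) * k) for i, x in enumerate(samples))
        for k in range(n)
    ]

smooth = [10, 11, 12, 13, 14, 15, 16, 17]   # brightness rising gradually
sharp = [10, 17, 10, 17, 10, 17, 10, 17]    # brightness flipping between neighbours

for name, row in (("smooth", smooth), ("sharp", sharp)):
    print(name, [round(c, 1) for c in dct(row)])
```

Ignoring the first number (the overall brightness), the smooth row puts most of its weight in the lowest coefficient, while the sharp row puts it in the highest one. That separation is what lets a compressor treat the two kinds of information differently.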

And that, in a nutshell, is how most lossy image compression works, and how the JPEG format works (well, okay, I’ve explained the main 30% of it maybe).

Lossy audio compression: discarding what we can’t hear

What about audio?

If you are all about the bass, or like tweaking sound settings, or have worked with audio systems before e.g. for a performance or for your school’s events, you would have used an equaliser at some point. An equaliser is a device (or software application) that lets you adjust how much bass (low pitch), medium (middle pitch), and treble (high pitch) you want from the sound. How is the system able to do that?

Through transforms! DCT, mentioned earlier, is one such transform; audio formats often use another one, known as the Fast Fourier Transform (FFT). (Aren’t you glad this is a newsletter about computing and not about math?) Anyway, a transform lets us transform information organised by position (e.g. in images) or by time (e.g. in audio) into information organised by other properties, such as frequency.

The FFT algorithm organises audio information (for a certain time length) by frequency. Depending on your equaliser settings, it increases or decreases the weightage of different frequencies to produce the sound you want, be it bass-heavy rock or medium-light jazz.

But the FFT algorithm can do much more! It is known that most sounds we hear are typically in the 40 Hz to 19 kHz range, so it is usually a safe bet to discard frequency information below 40 Hz and above 19 kHz. If we lower the frequency ceiling for discarding, down to 16 kHz, we can reduce the amount of audio information even more.
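Here is a small Python sketch of discarding frequencies. It uses a plain, slow Fourier transform instead of the FFT (the FFT is a faster way of getting the same numbers), and the signal is made up: 64 samples covering one second, containing a 4 Hz tone and a quieter 20 Hz tone, with an arbitrary cutoff at 10 Hz.

```python
import cmath
import math

def dft(samples):
    """Reorganise samples (by time) into frequency bins."""
    n = len(samples)
    return [sum(x * cmath.exp(-2j * math.pi * k * i / n) for i, x in enumerate(samples))
            for k in range(n)]

def inverse_dft(bins):
    """Turn frequency bins back into samples (by time)."""
    n = len(bins)
    return [(sum(b * cmath.exp(2j * math.pi * k * i / n) for k, b in enumerate(bins)) / n).real
            for i in range(n)]

N = 64  # 64 samples over one second, so bin k corresponds to k Hz
low_tone = [math.sin(2 * math.pi * 4 * i / N) for i in range(N)]
high_tone = [0.5 * math.sin(2 * math.pi * 20 * i / N) for i in range(N)]
mixed = [a + b for a, b in zip(low_tone, high_tone)]

cutoff = 10  # hypothetical cutoff: discard everything above 10 Hz
bins = dft(mixed)
# Bins k and N - k describe the same frequency, so both are checked.
kept = [b if min(k, N - k) <= cutoff else 0 for k, b in enumerate(bins)]
filtered = inverse_dft(kept)

error = max(abs(f - l) for f, l in zip(filtered, low_tone))
print(f"Largest difference from the low tone alone: {error:.1e}")
```

Once the bins above the cutoff are zeroed, what comes back is the low tone on its own; the high tone is simply gone.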

Lossy audio compression: masking

It is also known that the human ear, when it hears a very loud sound around one frequency, will not process much softer sounds in other frequencies. This is known as masking. With the help of the FFT algorithm, it’s easy to identify which frequencies will be masked for each range of time samples, and therefore can be discarded.

Furthermore, because of the way the cochlea works, right after hearing a very loud sound, the ear will not be able to hear softer sounds for a fraction of a second (maybe the fluid in the cochlea of the ear needs some time to settle? I don’t know). So softer sounds occurring right after a loud sound are masked. We can discard that audio information too.

Lastly, long periods of silence (a couple of seconds, for example) are not worth all the data they take up either, and can be compressed further.

Lossy audio compression: lowering dynamic range

We don’t always need to record audio with the full dynamic range of human hearing. For an orchestra concert, maybe that is important, but if you are just recording an interview, you don’t need to hear every tiny detail of how that person speaks (unless maybe you’re a doctor who can pick up telltale signs of cancer from the way a person speaks? That would be amazing.).

Human voice frequency typically ranges from 85 to 255 Hz, and only covers a range of up to 65 dB. That’s a full 30 dB lower than the 96 dB of CD audio, which means we don’t need 16-bit audio to store that; about 11 or 12 bits would be sufficient. And you won’t need a 44.1 kHz sampling rate for that; 11.025 kHz is sufficient.
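The arithmetic behind those numbers can be checked in Python. Each extra bit doubles the number of loudness levels, which works out to about 6 dB per bit; the data rates below are for a single channel of uncompressed audio.

```python
import math

DB_PER_BIT = 20 * math.log10(2)   # doubling the number of levels adds about 6.02 dB

def bits_needed(dynamic_range_db):
    """Smallest whole number of bits that covers the given range of loudness."""
    return math.ceil(dynamic_range_db / DB_PER_BIT)

voice_bits = bits_needed(65)
cd_rate = 16 * 44_100               # bits per second for CD audio
voice_rate = voice_bits * 11_025    # bits per second for voice-only audio

print(f"Bits needed for 65 dB: {voice_bits}")
print(f"CD audio: {cd_rate} bits per second")
print(f"Voice:    {voice_rate} bits per second")
```

That is already less than a fifth of the CD audio data rate, before any of the other compression tricks are applied.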

That, in a nutshell, is how we get such small images and audio files on the internet. If you’re particularly sensitive you can often make out the difference caused by this lost information. But most of the time, we’re not listening or looking closely, and it’s easy to overlook such minor differences.

Issue summary: Computers compress image and audio data through a process similar to summarising: it analyses the data using algorithms that use brightness and colour instead of RGB values for images, and different frequencies of sound rather than samples at different points in time for audio. These algorithms then discard parts of the information that human senses do not perceive easily, and reduce the resolution of other parts that human senses are not as sensitive to.

It took me a long while to understand the lossy compression algorithms well enough to explain them simply, and even longer to summarise them still further without using terms like RLE, high- and low-frequency components, and subsampling. If you found the previous two issues overly technical, I hope this issue makes up for that by helping you understand compression in less time than detailed technical articles elsewhere, yet in more depth than your mainstream internet sources.

What I’ll be covering next

Next issue: Lossless compression: like repacking but for data

If you’ve bought anything online before, you know how much of the space is taken up by packing peanuts or styrofoam or recycled cardboard or crumpled brown paper or those airbag things. You might also know about how some third-party shipping services help you cut down on shipping costs by repacking your items together before shipping so as to reduce the volumetric weight that you have to pay for. In all cases, you’re still getting the same thing, just in a smaller package.

Computers can also do something similar: give you the exact same information but in a smaller filesize. This is lossless compression, in contrast with what you learnt this issue on lossy compression (which keeps the gist of things but does not give the exact same information). How do computers do this? And when will you want to use lossless vs lossy compression?

I didn’t manage to get into what happens when you save, edit, and re-save a JPEG image repeatedly in this issue, so I’ll see how I can work it into the next issue :)

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
What is HTML? [Issue 38]
What is OpenType? And what are fonts anyway? [Issue 42]
~~What is compression? [Issue 43]~~
~~Why are music files so large when a voice call over internet uses so little data? [Issue 45]~~

Issue 45: Audio, a sampling of values

2019-11-02T08:00:00+08:00

Previously: An image’s resolution describes its dimensions. Its pixel resolution gives an indication of its physical size (if printed or displayed on a screen), and thus its sharpness. A display with imperceptibly small pixels is often referred to as a Retina display (Apple’s branding) or as a high-PPI display; this requires at least 220 PPI (pixels per inch) nominally. For an image to be printed sharply, it needs at least 300 DPI (dots per inch) on paper.

I uncovered some of the complexity of image display last issue, and I hope it has helped you to see that computers and humans influence each other very closely. The design of computers and the way they store information is inextricably linked to the way humans store and display information too. Colours are stored as RGB (red-green-blue) values because that is how humans perceive colour as well. And the monitors we buy have a pixel density that is just high enough for us to not perceive individual pixels easily.

This issue, we will explore the limits of human sensory perception again, and how it influences the way computers store information. This issue, we talk about audio. And let’s start with a question that I pondered a few years back:

Why are music files so large (a few MB) when a voice call over internet uses so little data?

This is not a question I can answer within one issue; you will have to wait until next issue for a complete answer :) But let’s start here.

We’ll answer the first part, why audio files are so large, by looking at just how much information we need to provide an undistracting audio experience.

The human ear

Humans detect sound through vibrations of the eardrum which are transmitted through the cochlea of the inner ear. These vibrations are caused by variations of air pressure in the ear, which in turn are caused by vibrations in the air.

These vibrations can be produced by computers through speakers. The cones of a speaker, which are the movable rubber parts, are connected to electromagnets which control the movement of the cone. The computer sends a signal that causes the cones to move in a particular pattern that produces … sound, or sometimes music!

Converting sound to data

If we plot the vibrations of air on a graph known as a waveform, they look something like this:

An audio waveform
Image by Gordon Johnson from Pixabay.

This waveform is converted into numeric values through a process called Pulse Code Modulation (PCM). If you see the acronym PCM or LPCM in any audio-related file, this is likely what it is referring to.

Pulse code modulation to convert a waveform into numeric values
Image from Wikimedia Commons.

These numeric values can then be stored digitally as bits (Issue 40).
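As a rough sketch of what PCM does, the snippet below takes readings of a pure tone at regular intervals and rounds each reading to a whole number. The function name `pcm_samples`, the 1,000 Hz tone, and the rate of 8,000 samples per second are illustrative assumptions, not values taken from any particular audio format.

```python
import math

def pcm_samples(freq_hz, sample_rate, n_samples, bits=16):
    """Sample a sine wave and round each reading to a signed integer level."""
    max_level = 2 ** (bits - 1) - 1  # largest positive level, 32767 for 16 bits
    samples = []
    for i in range(n_samples):
        t = i / sample_rate  # time of this sample in seconds
        value = math.sin(2 * math.pi * freq_hz * t)  # waveform height, -1 to 1
        samples.append(round(value * max_level))
    return samples

# One full wave of a 1,000 Hz tone, sampled 8 times
samples = pcm_samples(1000, 8000, 8)
print(samples)
```

Each number in the list is one sample: the height of the waveform at that instant, stored as a whole number.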

How high do these values go? Of decibels and dB

Well … how many different values would we need? That depends on how loud we need the sound to be … or does it?

The maximum loudness actually depends on your speaker, not on the signal. What the number of levels determines is the range between the loudest and softest sounds we can represent. If we have sixteen levels, the softest sound we can represent is one-sixteenth as loud as the loudest sound. Any sounds softer than that can’t be represented on the waveform.

So just how many levels can the ear make out? Welcome to the field of psychoacoustics, the study of how sound is processed in the ear and perceived in the brain.

Loudness is measured in decibels (dB). The softest sound the human ear can hear corresponds to 20 microPa (microPascals) of pressure; this is taken to be 0 dB, a reference point. A sound 10 times louder (200 microPa) is 20 dB, so every increase of 20 dB represents a tenfold increase in loudness. A jet liner taking off (120 dB) is 10^6 times louder, or a million times louder! That is generally the limit of human hearing: from 0 to 120 dB, or a range of 120 dB.
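The decibel arithmetic above can be checked with a few lines of Python. This is a minimal sketch that treats "loudness" as sound pressure, as the figures in the text do; the function names are my own.

```python
import math

REFERENCE_PRESSURE_PA = 20e-6  # 20 microPa, taken as 0 dB

def pressure_to_db(pressure_pa):
    """Convert a sound pressure in pascals to decibels."""
    return 20 * math.log10(pressure_pa / REFERENCE_PRESSURE_PA)

def db_to_ratio(db):
    """How many times greater the pressure is than the 0 dB reference."""
    return 10 ** (db / 20)

print(round(pressure_to_db(200e-6), 6))  # 10 times the reference pressure
print(round(db_to_ratio(120)))           # the jet liner, relative to 0 dB
```

Every extra 20 dB multiplies the ratio by 10, so 120 dB works out to six tenfold steps.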

CD-quality audio uses 16 bits to store a single sample of sound; that provides 65,536 (2^16) different levels, which corresponds to a 96 dB range of loudness. I doubt we will find speakers that can produce close to jet engine levels of sound, and if we do, they probably won’t be using CD audio as a sound format, so this is pretty much sufficient for most quality audio you’ll find on the internet.
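Here is a small sketch of where the 96 dB figure comes from, using the same 20 dB per tenfold rule as above. The function name `dynamic_range_db` is an assumption for illustration.

```python
import math

def dynamic_range_db(bits):
    """Range in dB between the smallest and largest level for a bit depth."""
    levels = 2 ** bits
    return 20 * math.log10(levels)

print(2 ** 16)                          # 65536 levels
print(round(dynamic_range_db(16), 1))   # 96.3
```

Each extra bit doubles the number of levels, which adds roughly 6 dB to the range.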

Today, 16-bit audio is pretty much standard on all computers. Audiophiles will tout the benefits of 24-bit audio, but we won’t go into detail on that in a layman’s guide to computing.

So each point produced from pulse code modulation (PCM, above) of sound contains 16 bits (2 bytes) of information. How many samples do we need?

Sampling and frequency

You probably can’t make out the individual waves in the waveform shown earlier in this issue; that’s because the waveform is visually squeezed horizontally. But if we expanded it, you would be able to make out individual waves.

A sound with higher pitch has higher frequency; it has more waves per second. A sound with lower pitch has lower frequency; it has fewer waves per second. It is the upper limit we need to worry about: we must have enough samples per second to be able to represent so many waves. To be able to see a complete wave, we need at least two points: one for the peak, and one for the valley.

This agrees with what signal engineers learn from the Nyquist-Shannon sampling theorem: to store a 1 Hz sound (1 wave per second), you need at least 2 samples per second (to distinguish the peak and valley of the wave).

The human range of hearing ranges from 20 Hz to 20 kHz (that’s 20,000 Hz). To store a 20 kHz sound, you need at least 40,000 samples per second. CD-quality audio is sampled at 44,100 samples per second (enough for up to 22.05 kHz), which is sufficient to cover the human hearing range of frequencies.
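The "two samples per wave" rule is simple enough to write down directly. A minimal sketch, with function names of my own choosing:

```python
def min_sample_rate(highest_freq_hz):
    """Samples per second needed to capture a given frequency."""
    return 2 * highest_freq_hz

def highest_freq(sample_rate):
    """Highest frequency a given sample rate can capture."""
    return sample_rate / 2

print(min_sample_rate(20_000))  # 40000
print(highest_freq(44_100))     # 22050.0
```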

So for 1 second of audio, uncompressed, we will need 16 bits × 44,100 samples = 705,600 bits, or 86 KiB. 1 minute of uncompressed audio would be 5.05 MiB!
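The same arithmetic in Python, as a sketch. Like the figures above, it assumes a single channel of audio; a stereo recording has two channels and would need twice as much.

```python
BITS_PER_SAMPLE = 16
SAMPLES_PER_SECOND = 44_100

bits_per_second = BITS_PER_SAMPLE * SAMPLES_PER_SECOND
bytes_per_second = bits_per_second // 8         # 8 bits in a byte
kib_per_second = bytes_per_second / 1024        # 1 KiB = 1,024 bytes
mib_per_minute = bytes_per_second * 60 / 1024 ** 2

print(bits_per_second)           # 705600
print(round(kib_per_second, 1))  # 86.1
print(round(mib_per_minute, 2))  # 5.05
```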

Issue summary: Humans can distinguish 120 dB of loudness, which means the loudest perceivable sound is a million times louder than the softest perceivable sound. CD audio provides 16 bits of information per sample, sufficient to provide 96 dB. Humans have a hearing range from 20 Hz to 20 kHz. CD audio is sampled at 44.1 kHz. Uncompressed audio thus requires 705,600 bits per second, or 86 KiB/s.

This issue and the past 2 issues set the stage for the next issue, which is the first milestone for this season when I can finally introduce compression! And then we will finally get to answering the question: Why are music files so large (a few MB) when a voice call over internet uses so little data?

What I’ll be covering next

Next issue: Lossy compression: a computer’s attempt to summarise

We’ve all done this before. “What were you talking about with X?” “Oh, we were just talking about Y. I said blah and X said blah and that was about all that was important.” It’s called summarising, and if we didn’t do it, 75% of our lives would just be talking.

How do computers summarise and attempt to convey only the important parts of all the information we store and transmit? This and more in the next issue on lossy compression!

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
HTML? [Issue 38]
OpenType? And what are fonts anyway? [Issue 42]
What is compression? [Issue 43]
Why are music files so large when a voice call over internet uses so little data? [Issue 45]

Issue 44: Image resolution

2019-10-26T08:00:00+08:00

Previously: Colour is stored as a combination of red, green, and blue. In a computer system, each colour is stored as one byte (8 bits), allowing for 256 different levels. An image is made up of many such pixels of colour.

An image is two-dimensional and certainly much larger than a single pixel. How do we talk about its size?

Image resolution

It is common to hear people refer to an image’s size as its pixel size. When we say that an image has a resolution of 1000×3000 pixels, that means it is 1000 pixels wide by 3000 pixels high. In other words, the image is made up of 3 million pixels of colour, arranged in a grid 1000 pixels wide by 3000 pixels tall.

But how large is this image physically? Well, that’s a harder question to answer …

Resizing

You see, on a computer, you can resize an image as you like. I’m sure you have done it many times, preparing for a presentation or just creating a document. So you can make that image 1cm×3cm, or 10cm×30cm. But how large is an image originally meant to be?

Image resolution: a ratio between dots and inches

In more finicky circles, the term “resolution” is used in another way: to refer to the ratio of pixels to a physical dimension, usually in inches (this is a legacy thing, I can’t explain why it’s imperial and not metric).

For example, if that 1000×3000 image was meant to be displayed as a 10cm×30cm image on screen (approx. 4 inches by 12 inches), it would have a resolution of 250 pixels per inch (PPI): 1000 pixels ÷ 4 inches. If you could see the pixels, and you took out a ruler to count them along a 1-inch line across or down the image, there would be 250 pixels.

If it was displayed as a 100cm×300cm image instead, that image would have a resolution of 25 pixels per inch (1000 pixels ÷ 40 inches). And it would look 10 times blurrier; each image pixel would be about 1mm wide!

So image resolution, as pixels per inch, also gives a measure of sharpness of the image.
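If you would rather let the computer do the division, here is a minimal sketch. It uses the exact conversion of 2.54 cm per inch instead of rounding 10 cm to 4 inches, so it gives 254 PPI rather than 250. The function names are hypothetical.

```python
CM_PER_INCH = 2.54

def pixels_per_inch(pixels, size_cm):
    """Image resolution when `pixels` are spread across `size_cm`."""
    return pixels / (size_cm / CM_PER_INCH)

def pixel_width_mm(ppi):
    """Width of one image pixel in millimetres at a given PPI."""
    return 25.4 / ppi

print(round(pixels_per_inch(1000, 10), 1))   # 254.0
print(round(pixels_per_inch(1000, 100), 1))  # 25.4
print(round(pixel_width_mm(pixels_per_inch(1000, 100)), 2))  # 1.0
```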

For printed images, the same idea applies: a 1000×3000 image printed as a 10cm×30cm image has a resolution of 250 dots per inch (DPI) — 1000 ÷ 4 inches. It’s dots instead of pixels because a printer lays down dots of colour rather than displaying pixels (I’ll go into more detail in a future season on computer accessories and peripherals).

Monitor resolution

When you buy or browse computer monitors, you would have heard the monitor’s pixel dimensions (number of pixels across and down) referred to as its resolution. Its measure of sharpness is usually listed under a label like DPI or PPI, if not pixel density. If not, you can calculate the PPI of a monitor yourself: Just take the horizontal pixel dimension (number of pixels in the screen horizontally) and divide it by the display width, or take the vertical pixel dimension and divide by the height.

Your OS might have a setting for fixing blurry apps, or making small text appear larger. These are typical problems faced on a high-PPI screen. But how high does the PPI need to be for us to get a reasonably sharp image?

Retina: a brand name for high pixel density displays

In 2010, the late Steve Jobs first used the term Retina referring to the iPhone 4. I suppose he meant to describe a class of devices with a display so sharp that the pixels were practically imperceptible; it wasn’t that long ago that if you squinted a little, you could make out the pixels on your monitor or laptop. High pixel density displays are a lot more common today, so you would probably have to visit the budget section of the computer monitor department in a store to see the low-pixel-density effect again.

So what’s the minimum PPI required to have a Retina display? Apple doesn’t specifically designate a number, but it appears that the minimum PPI of their Retina devices is 218. Devices that will be further from your eyes can get away with about 220 PPI, while those that will be closer to your eyes will need a higher PPI (up to about 400 on the iPhone 6 Plus).

But all of that is useless if you scale up an image and still view it at a low image PPI!

Why do my printed images come out blurry?

Here’s a problem I think some of you might have encountered: You are editing a picture on your laptop or computer monitor, and it looks just fine. You send it to the printer and it comes out really blurry. What happened?

What happened is that the image was presented in two different ways. On a screen, it appears as a grid of pixels. A 14” laptop with a 1920×1080 screen resolution actually only has a screen PPI of 157. An image at 100% zoom (1 image pixel displayed as 1 screen pixel) on such a screen would appear fine, because it would be displayed alongside other screen elements (such as the application window) that appear sharp.
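The 157 PPI figure can be reproduced with a short calculation. Screen sizes are usually quoted as a diagonal, so this sketch assumes the 14” is the diagonal and divides the number of pixels along the diagonal by it. The function name is my own.

```python
import math

def screen_ppi(width_px, height_px, diagonal_inches):
    """Pixel density of a screen, from its pixel dimensions and diagonal size."""
    diagonal_px = math.hypot(width_px, height_px)  # pixels along the diagonal
    return diagonal_px / diagonal_inches

print(round(screen_ppi(1920, 1080, 14)))  # 157
```

Dividing the horizontal pixel count by the display width in inches should give about the same answer, as long as the pixels are square.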

But once it is printed, it appears as a collection of ink dots on paper. These dots are a lot finer than the pixels on a screen, so any blurriness is immediately apparent. Your computer or laptop screen is a poor device for assessing print sharpness! To get a better sense of print sharpness, you will want to view the image on a high-PPI display (such as an iPad) and adjust the zoom such that the image on screen has the same size as when printed.

For printing images, you will want to make sure your image has a resolution of at least 300 DPI; at least 600 DPI is ideal. You can also calculate this by taking the horizontal pixel dimension of the image, and dividing by the horizontal size you intend to print it at.
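Here is a minimal sketch of that calculation in both directions: how large an image can be printed at 300 DPI, and how many pixels a given print size needs. The function names and the 2-metre banner are hypothetical examples.

```python
import math

CM_PER_INCH = 2.54

def max_print_width_cm(width_px, dpi=300):
    """Widest print (in cm) that keeps the image at or above `dpi`."""
    return width_px / dpi * CM_PER_INCH

def pixels_needed(print_width_cm, dpi=300):
    """Horizontal pixels required to print at `print_width_cm` and `dpi`."""
    return math.ceil(print_width_cm / CM_PER_INCH * dpi)

print(round(max_print_width_cm(1000), 2))  # 8.47
print(pixels_needed(200))                  # a 2-metre-wide banner
```

A 1000-pixel-wide image only stays sharp up to about 8.5 cm wide, while a 2-metre banner at 300 DPI needs well over 20,000 pixels across.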

Issue summary: An image’s resolution describes its dimensions. Its pixel resolution gives an indication of its physical size (if printed or displayed on a screen), and thus its sharpness. A display with imperceptibly small pixels is often referred to as a Retina display (Apple’s branding) or as a high-PPI display; this requires at least 220 PPI nominally. For an image to be printed sharply, it needs at least 300 DPI.

It took a lot of discipline this time to not burrow down rabbit holes (like image-to-screen pixel grid alignment); that would have taken a lot longer than an hour to write.

Pixels and dots are an abstraction that anyone working with computers has to think in terms of, and the relationship between them and physical size can be really tricky to articulate clearly. I hope in this issue I have at least introduced you, my dear readers, to PPI and DPI. And if you work with printers, I think knowing what is going on is a big relief, and takes away the stress from guesswork. Many times I have saved myself the stress of trying to get a sharp banner printed by doing the DPI calculations and realising that there is no way that is possible; I would need too large an image!

Okay, I think we’re done with basic colour and pixel theory! Next up, basic sound theory, and then we can move on to compression :)

What I’ll be covering next

Next issue: Audio, a sampling of values

Sound is so easily taken for granted, but how exactly is it represented in the computer, and how much information is required to store sound? Stay tuned.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
What is HTML? [Issue 38]
What is OpenType? And what are fonts anyway? [Issue 42]
What is compression? [Issue 43]

Issue 43: Images, a mosaic of 3 colours

2019-10-19T08:00:00+08:00

Previously: Unicode is an encoding format which is meant to support every language, ever. Most websites, apps, and interfaces support it today.

In the last two issues, I explained how text is stored as numbers through the use of lookup tables, whether ASCII or Unicode. The more total characters we want to store in the lookup table, the more bits we need for each character.

This is going to be a recurring theme: If we want to be able to differentiate more shades of colour, or more degrees of loudness in sound, we will need more and more bits for each sample, and that means our file—whether text, image, or sound—is going to have a larger filesize.

How many bits is good enough? In the case of text, that is determined largely by the upper limit on the number of symbols we might possibly need to communicate. But how do we decide that for colour? The number of different shades of colours is possibly infinite, and yet we can’t possibly differentiate between really fine shades, nor can our screens possibly produce all of them …

In this issue, I’ll be summarising and oversimplifying decades of colour theory and colour vision research. Buckle up!

The human eye

Any effective colour system must take into account how the human eye is structured, and how vision occurs. Today, we understand that humans are trichromatic: there are 3 types of cone cells in the eye (and also 1 type of rod cell, which I won’t be explaining here), and each one recognises a different shade of colour: red, green, blue. Each type of cone cell can differentiate roughly 100 different shades, which theoretically enables us to distinguish 1 million shades of colour (100^3).

So it makes good sense that our colour systems in computers evolved similarly, to store single dots of colour as a combination of red, green, and blue. To be able to store 100 different shades, we will need at least 7 bits (2^7 = 128), but computer systems like things in 8s. For this and other historical reasons, 1 byte (8 bits) is used for each shade, giving us 256 shades of red, green, and blue each. That’s over 16 million (256^3) shades of colour!
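The numbers in the last two paragraphs are quick to verify. A small sketch, with variable names of my own:

```python
SHADES_PER_CONE = 100

# Bits needed to give each of the 100 shades its own number (0 to 99)
bits_needed = (SHADES_PER_CONE - 1).bit_length()

print(SHADES_PER_CONE ** 3)  # 1000000 shades the eye can distinguish
print(bits_needed)           # 7
print(2 ** 8)                # 256 levels per byte
print(256 ** 3)              # 16777216 colours from three bytes
```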

Colour encoding

Since one byte stores one colour value, three bytes are needed for a single spot of colour combining red, green, and blue—a combination commonly called RGB. In a computer, each byte represents the level of that colour; 0 means minimum level (i.e. black) while 255 means maximum level (complete saturation of that colour). So any of those 16 million colours can be stored as a number triplet, representing the red, green, and blue values respectively.

(0,0,0) is black
(255,255,255) is white

So now you know what to do with colour pickers in applications: just find the combination of red, green, and blue that is closest to the colour you want!
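To see the "three bytes per colour" idea in action, here is a minimal sketch that stores one colour as three bytes and reads it back. The helper names `encode_pixel` and `decode_pixel` are hypothetical.

```python
def encode_pixel(red, green, blue):
    """Store one colour as three bytes; each level must be 0 to 255."""
    return bytes([red, green, blue])

def decode_pixel(data):
    """Read the three levels back out of three bytes."""
    red, green, blue = data
    return (red, green, blue)

black = encode_pixel(0, 0, 0)
white = encode_pixel(255, 255, 255)
print(len(white))           # 3 bytes for one pixel
print(decode_pixel(white))  # (255, 255, 255)
```

Trying a level of 256 raises an error, because a single byte cannot hold a number that large.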

A colour picker, common in graphics applications. This one is from Microsoft Paint.

You can play with a simple colour wheel on colorspire.com, or if you’re feeling more adventurous, try the more technical one on rapidtables.com.

Colour production

On a screen, colours are produced by millions of liquid crystals (in LCDs) or light-emitting diodes (in LED displays). These are arranged in a rectangular grid pattern, and each one is known as a pixel (shortened from picture element). Each pixel is capable of producing 256 shades of red, green, or blue.

It is extremely difficult to manufacture pixels that can produce any colour; this would require that the crystal or diode can emit light of different frequencies. Instead, the display industry has settled on combining 3 sub-pixels into a pixel. Each sub-pixel produces—you guessed it—either red, green, or blue light.

Extreme close-up shots of pixels.
Taken from lcdtech.info.

When colour information is sent from the computer to the display (through the video cable) and decoded in the display, it also uses RGB values. In this sense there is remarkable consistency in computer systems in how colour is stored, sent, and displayed. That minimises the amount of time spent by computers converting from one format to another.

Colour storage

In a computer, combinations of image pixels are stored as image files, but you already know that. I’m on the verge of exceeding my one-idea-per-week promise, so I’ll end this issue with a short comparison of common image formats. Each image format is labelled below by its file extension, the part of the filename that comes at the end.

BMP
BMP is short for “bitmap”. The bitmap format commonly encountered in computer systems stores pixels uncompressed. This means that each pixel requires 3 bytes of space, so a full-screen image on a typical modern laptop (1920 pixels horizontally, 1080 pixels vertically) would require about 6 MB (1920×1080×3)!

GIF
GIF (Graphics Interchange Format) is one of the earliest image formats, and is rather more restricted in its capabilities as a result. Each GIF pixel is only 8 bits, so a GIF image is limited to using only 256 colours. One of those colours can be “transparent”, allowing GIF to produce images with transparent parts.

JPEG
JPEG stands for Joint Photographic Experts Group, so it wouldn’t surprise you to learn that it was designed to display photographs with as small a filesize as possible. Today, it is in use for a variety of image types. JPEG can display pixels in 24 bits (i.e. 8 bits for RGB each), but does not store them uncompressed like BMP. Instead, it applies compression to reduce the filesize by “discarding information” from the image in a way that does not affect the final image visibly.

PNG
PNG (Portable Network Graphics) was designed as a replacement for GIF. It supports 24-bit image pixels, with an additional 8 bits per pixel for transparency information. That means PNG pixels have 256 different levels of transparency, allowing for blending effects where one image overlaps another. PNG files support image compression, allowing them to be stored with smaller filesizes than BMP.
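To compare the bits-per-pixel figures above, this sketch works out how much raw pixel data a full-screen image holds at each pixel size. It deliberately ignores file headers, colour tables, and compression, so real GIF, JPEG, and PNG files will usually be smaller than these figures. The function name is my own.

```python
WIDTH, HEIGHT = 1920, 1080

def raw_size_bytes(width, height, bytes_per_pixel):
    """Size of the pixel data alone, before any compression."""
    return width * height * bytes_per_pixel

bmp_24bit = raw_size_bytes(WIDTH, HEIGHT, 3)  # 3 bytes: red, green, blue
gif_8bit = raw_size_bytes(WIDTH, HEIGHT, 1)   # 1 byte: one of 256 colours
png_32bit = raw_size_bytes(WIDTH, HEIGHT, 4)  # 3 bytes RGB + 1 byte transparency

print(bmp_24bit)  # 6220800, about 6 MB
print(gif_8bit)   # 2073600
print(png_32bit)  # 8294400
```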

Issue summary: Colour is stored as a combination of red, green, and blue. In a computer system, each colour is stored as one byte (8 bits), allowing for 256 different levels. An image is made up of many such pixels of colour.

I get carried away easily explaining colour, and it took incredible discipline to rein that exploratory instinct in and stick to the most essential parts. There’s so much to go into, even for laypeople! But, I know, one idea a week, and I’ve sort of worked out where the other ideas should go, so we’ll have a nice and gradual introduction to colour over the course of several seasons.

What I’ll be covering next

Next issue: Image resolution

After examining a single pixel, I’ll look at a whole image: what does it take to trick our brains into seeing an image instead of a collection of pixels?

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
What is HTML? [Issue 38]
What is OpenType? And what are fonts anyway? [Issue 42]
What is compression? [Issue 43]

Issue 42: Unicode, computers go international

2019-10-15T22:22:00+08:00

Title: Issue 42: Unicode, computers go international
Date: 2019-10-12 08:00
Tags:
Category: Season 4
Slug: lmg-s4-issue-42-unicode-computers-go-international
Author: J S Ng
Summary:

Previously: In ASCII encoding, text is stored as a 7-bit sequence. Text consists of letters, numbers, symbols, and control codes. Control codes instruct the computer how to format the text so that it looks the way we intended.

Last issue, I explained what ASCII is and what it does: it allows us to encode letters, numbers, symbols, and control codes into bits (0s and 1s) to be sent digitally to another computer, where they can be decoded.

That still does not explain accented characters (such as á), umlauts (like ö), and emojis. Where are those represented in ASCII? And what about glyphs (symbols) used in Greek, Cyrillic, Chinese, Japanese, and other languages?

Again, some history

In short, other countries and cultures were not happy with ASCII. It did not allow them to communicate effectively in their own languages.

The first thing that happened was that the European Computer Manufacturers Association (ECMA) extended US-ASCII into ISO 8859-1. In ISO 8859-1, each character is represented by 8 bits. Let’s look at some numbers:

Characters needed minimally (lower- + upper-case, and numerals): 26+26+10 = 62
Common symbols: 30
7 bits can encode 2^7 = 128 different characters
8 bits can encode 2^8 = 256 different characters

8 bits was enough to provide for a number of additional glyphs seen below. But very quickly it ran into limitations as well. 256 characters just aren’t enough!

The ISO 8859-1 characters
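You can see both the gain and the limit of 8 bits using Python, which knows ISO 8859-1 under the name `latin-1`. This is a small sketch; the helper `fits_in_latin1` is hypothetical.

```python
def fits_in_latin1(character):
    """True if the character has a place in the 256-entry ISO 8859-1 table."""
    try:
        character.encode("latin-1")
        return True
    except UnicodeEncodeError:
        return False

print("á".encode("latin-1"))  # b'\xe1', a single byte
print("ö".encode("latin-1"))  # b'\xf6', a single byte
print(fits_in_latin1("中"))    # False: no room for Chinese characters
```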

Encoding hell

Computer systems in these other countries soon came up with their own ways of representing the huge number of glyphs they needed. There were other ISO 8859-* encoding systems which I do not want to list. The Chinese had GB encoding on the mainland, Big5 in Taiwan, and numerous extensions on that. The Japanese used Shift-JIS.

It was encoding hell.

If you remember the internet circa the ’90s and early ’00s, you will recall that it often had pages of what looked like gibberish. Because webpages then did not include information about their encoding, most of the time web browsers simply had to guess. If your page wasn’t encoded in ASCII (or ISO 8859-1), it was anybody’s guess what encoding you were using. You just tried each encoding until you got a page that made sense!

That simply would not do.
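Here is a rough re-creation of that guessing game. The sketch encodes a short Japanese word with Shift-JIS, then tries to read the same bytes back using several encodings. The sample text and the list of guesses are my own choices for illustration.

```python
def try_encodings(raw_bytes, guesses):
    """Decode the same bytes with each guessed encoding."""
    results = {}
    for name in guesses:
        # errors="replace" swaps undecodable bytes for a placeholder symbol
        results[name] = raw_bytes.decode(name, errors="replace")
    return results

original = "日本語"
raw = original.encode("shift_jis")

results = try_encodings(raw, ["latin-1", "big5", "gb2312", "shift_jis"])
for name, text in results.items():
    print(name, repr(text), text == original)
```

Only the correct guess gives back the original text; every other guess produces a different string of gibberish.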

Origins of Unicode

In 1988, a bunch of engineers from Xerox and Apple started thinking about a universal encoding that can encompass all languages. The first volume of this encoding was published in 1991, with extensions added subsequently.

At that point, a Unicode character was represented using 16 bits (for a possible 65,536 characters!). In 1996, a method of extending the Unicode scheme was added, so that Unicode could easily represent over a million different characters!
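A quick way to see both limits is to ask Python for character numbers. In this sketch, `sys.maxunicode` is Python's own record of the highest character number Unicode allows, and the emoji is just an example of a character that would not fit in 16 bits.

```python
import sys

SIXTEEN_BIT_LIMIT = 2 ** 16  # 65536 possible characters with 16 bits

grinning_face = "\U0001F600"  # an emoji
print(SIXTEEN_BIT_LIMIT)
print(ord(grinning_face))                      # its number in Unicode
print(ord(grinning_face) > SIXTEEN_BIT_LIMIT)  # True: needs the extension
print(sys.maxunicode + 1)                      # 1114112 possible numbers
```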

Unicode today

Today, the global significance of the internet has resulted in Unicode being the standard encoding on any interface a user interacts with.

If something you try to submit in a form (such as your name) or view on a page does not display properly, chances are the service you are interacting with has not updated itself with proper Unicode support yet. Write a support request to them and ask for it to be done!

One big reason for the increased support in Unicode is the space that was set aside for emoji … more evidence that war may drive the development of technology, but it is social factors that lead to its widespread adoption :)

Cool things about Unicode

Aside from the fact that it could include encodings for just about any character in any language, here are some things about Unicode which may not be entirely relevant for the layperson, but I think are good to know. Feel free to skip this section.

Unicode is able to “craft” characters by combining multiple glyphs.
For instance, a̅ is not represented with a single character, but can be printed through combining the ‘a’ glyph with the ◌̅ (Combining Overline) glyph.
Unicode has an area set aside for alternate character representations.
For instance, “fl” is sometimes stylistically combined into an “ﬂ” ligature; there is room for this ligature in Unicode. (Try to select the ‘fl’s above if you can’t see the difference.)
Some high-quality fonts provide such alternate glyph representations, and with the right software (such as Adobe InDesign) you can make use of them.
Some languages (e.g. Arabic) actually require ligatures for combining adjacent glyphs, so this is a pretty big deal.
With the right font type (i.e. OpenType), you can actually include programmatic features through Unicode.
FF Chartwell is a font for creating mini-charts just by typing!
The font uses ligatures to turn numbers into a mini chart.
Unicode has a “Private Use Area” that you can use for your own private purposes. You can insert symbols from this area for use in a webpage.
I have seen websites use this to create custom icons that can scale in size and change colour easily, just like text.
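A few of these features can be poked at with Python's built-in `unicodedata` module. This is only a sketch of the first, second, and last points above; the font-related features need actual fonts and layout software to demonstrate.

```python
import unicodedata

# 1. Crafting a character: 'a' followed by the Combining Overline
overlined_a = "a" + "\u0305"
print(len(overlined_a))             # 2: two characters, drawn as one
print(unicodedata.name("\u0305"))   # COMBINING OVERLINE

# 2. The 'fl' ligature has its own slot in Unicode
ligature = "\ufb02"
print(unicodedata.name(ligature))               # LATIN SMALL LIGATURE FL
print(unicodedata.normalize("NFKC", ligature))  # back to plain 'fl'

# 3. The Private Use Area: category 'Co' means "private use"
print(unicodedata.category("\ue000"))           # Co
```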

Issue summary: Unicode is an encoding format which is meant to support every language, ever. Most websites, apps, and interfaces support it today.

That was really short, thank goodness. I’ve of course skipped over Unicode complexity, because the average layperson does not need to know that. But people need to know that it is possible, and actually easy, to represent different languages on the same page, and there is no excuse not to do so.

What’s really interesting is that it took 20 years or more for a format like Unicode to be conceptualised, born, and finally reach the mainstream. Many ideas in computing are like that. When you see something really novel hit the market, it has probably been brewing in somebody’s head for over a decade!

I think we’re as done with text as we need to be. I’ll start going into other types of data in the next issue, starting with colours and images.

What I’ll be covering next

Next issue: Images, a tri-colour mosaic

Coming up: a highly compressed crash course in psychovisual theory, colour theory, and how an LCD screen works! All condensed into layperson language, of course.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
~~Unicode? And what does it have to do with emoji? [Issue 8]~~
a good reason developers write code and give it away for free online? [Issue 21]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
What is HTML? [Issue 38]
What is OpenType? And what are fonts anyway? [Issue 42]

Issue 41: ASCII, the typewriter digitised

2019-10-05T08:00:00+08:00

Previously: 8 bits comprise 1 byte. Humans count bytes in multiples of thousands, while computers count bytes in multiples of 1,024.

It’s still difficult to wrap our minds around how computers do everything with exactly two symbols: 0 and 1. Let’s start simple: How do computers represent text?

The simple answer is that text can be represented as numbers. In the simplest scheme we know of, A=1, B=2, C=3, and so on. A computer does something more complicated: it keeps a table of characters and the numbers that represent them, called an encoding table. One encoding that is commonly used for plain text is known as the American Standard Code for Information Interchange, or ASCII.
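If you have Python handy, you can look up the table yourself: `ord` turns a character into its number and `chr` goes the other way. A small sketch:

```python
text = "ABC"

codes = [ord(character) for character in text]
bits = [format(code, "07b") for code in codes]  # each number as 7 bits

print(codes)  # [65, 66, 67]
print(bits)   # ['1000001', '1000010', '1000011']
print("".join(chr(code) for code in codes))  # ABC
```

So in ASCII, A is 65 rather than 1, but the idea is the same: every character has an agreed number.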

Some ASCII background and history

To put things in some context, keep in mind that ASCII actually predates the internet! (We had computers way longer than we had the Internet, after all.) ASCII dates from the 1960s. Before that, Morse code was the standard in telegraph transmission until the 1900s, when the Murray code was used instead (itself derived from the earlier Baudot code). The Murray code employed a keyboard much like a typewriter’s. This was an improvement over Morse code, because instead of tapping a single control key (like you see in classic movies), you could now use all five fingers of the hand to type.

In the 1920s, the Murray code was developed into the International Telegraph Alphabet No. 2 code (ITA2 code). Behold:

Image from Wikimedia Commons

But the Murray code actually used more bits to transmit the same information! In Morse code, every letter or number is represented with between 1 and 5 symbols. Each symbol is either a dash or a dot:

Image from Wikimedia Commons

What do we gain from using more symbols to transmit each number or letter? If you compare the two, you see that ITA2 has some things that are missing in Morse code:

Spaces
Carriage return
Line feed
Symbols

Symbols and spaces are easy enough to understand, and very welcome; if you’ve ever tried reading early telegrams (or using Morse code) you’ll appreciate their addition. But what are carriage return and line feed?

The typewriter

With the advent of the typewriter, people had access to nicely formatted text. You could type text on multiple rows instead of one long row! But for the text to be formatted properly, you had to remember to perform certain actions on the typewriter.

The Underwood Five typewriter
Image from Wikimedia Commons

There were two separate actions involved as you pulled the leftmost lever to the right: (1) the carriage, which holds the paper and moves a bit to the left after each letter is typed, now resets its position so you can start typing from the left edge of the paper again, and (2) the paper is moved up so you can begin typing on the next line.

(1) is called a carriage return, (2) is called a line feed.

ITA2 could not only send letters and symbols, it could send formatting commands!

ASCII proper

The ASCII code chart expands the capabilities of ITA2, while requiring 7 bits for each character. Each character is situated in a specific row and column, out of 8 columns and 16 rows which are numbered starting from 0. (Note that 8 is 2^3 and requires 3 bits, 16 is 2^4 and requires 4 bits.)

(An early version of) The US ASCII code chart. Each row number is represented by 4 bits, while each column number is represented by 3 bits.
Image from Wikimedia Commons
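To find a character on the chart, split its 7 bits into the top 3 bits (the column) and the bottom 4 bits (the row). A minimal sketch, with a function name of my own:

```python
def chart_position(character):
    """Return (column, row) of an ASCII character on the code chart."""
    code = ord(character)
    if code > 127:
        raise ValueError("not an ASCII character")
    column = code >> 4   # top 3 bits: 0 to 7
    row = code & 0b1111  # bottom 4 bits: 0 to 15
    return column, row

print(chart_position("A"))  # (4, 1)
print(chart_position("a"))  # (6, 1)
print(chart_position("0"))  # (3, 0)
```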

Technical details aside, look at what ASCII has:

Symbols and numbers (mainly columns 2 and 3, but also scattered elsewhere)
Upper and lowercase letters (columns 4 to 7)
Lots and lots of control codes! (columns 0 and 1)

What do these control codes mean?

NUL stands for null, a placeholder code for when the machine wasn’t transmitting.
SOH: start of header, to indicate the portion of the transmission that contained information about the message.
STX and ETX: start of text and end of text, to indicate the message portion.
EOT: end of transmission.
DEL: to delete the previous character (hello, backspace).
CR and LF: we just met them, carriage return and line feed.

I won’t explain the rest in this supposedly-short newsletter, but if you’re interested the full list is on Wikipedia.
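For reference, here are the numbers behind the control codes listed above, as a small sketch. The dictionary name is hypothetical; the numbers are the standard ASCII values.

```python
CONTROL_CODES = {
    "NUL": 0,    # null
    "SOH": 1,    # start of header
    "STX": 2,    # start of text
    "ETX": 3,    # end of text
    "EOT": 4,    # end of transmission
    "LF": 10,    # line feed
    "CR": 13,    # carriage return
    "DEL": 127,  # delete
}

for name, code in CONTROL_CODES.items():
    print(name, code, format(code, "07b"))
```

All of them fit in 7 bits, just like the letters and numbers; the computer simply treats them as instructions rather than something to display.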

ASCII today

In a basic text file, text is still stored using ASCII (although it has seen some modifications since). Some of the control codes are obsolete, while some are still in use today. Remember this image from Issue 12?

An HTTP request captured in Wireshark.

The \r and \n you see there are control codes. They stand for ‘return’ and ‘newline’, the modern equivalent of ‘carriage return’ and ‘line feed’.
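Here is a small sketch of how a program might use those two control codes to break a request into lines. The request text is a made-up example (`example.com` is a placeholder), not the one captured in the image.

```python
request = "GET / HTTP/1.1\r\nHost: example.com\r\n\r\n"

print(ord("\r"), ord("\n"))  # 13 10: the ASCII numbers for CR and LF

lines = request.split("\r\n")
print(lines)  # ['GET / HTTP/1.1', 'Host: example.com', '', '']
```

The empty strings at the end come from the blank line that finishes the request.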

Formatting codes are alive and well today, and they are more prosperous than ever! Without formatting codes, all our files would be stored only in the same boring format, represented only as letters and numbers and punctuation marks.

And there you have it.

Issue summary: In computers that can encode and decode ASCII, text is stored as a 7-bit sequence. Text consists of letters, numbers, symbols, and control codes.

A rather long issue, but that’s what it takes to explain carriage return and line feed, which don’t make sense to folks who have never seen or used a typewriter before (why can’t you just have a single control code that moves to the start of the line and moves to the next line? Well, sit down and let me tell you a story …)

Many of the idiosyncrasies of computers and technology are this way: accumulated from decades of historical developments, forming legacy baggage in some cases, and interesting bits of history in others.

What I’ll be covering next

Next issue: Unicode, computers go international

These days, most of the text you encounter is not encoded in ASCII. It is rather limited, after all, and we need a lot more than just letters, numbers, and symbols today. Next issue, we’ll go into modern-day text encoding, using Unicode.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
Unicode? And what does it have to do with emoji? [Issue 8]
~~those ‘\r\n’s in the HTTP request packet [Issue 12,17]?~~
a good reason developers write code and give it away for free online? [Issue 21]
~~ASCII? [Issue 23]~~
compiling code into an application [Issue 26]?
firmware? [Issue 34]
HTML? [Issue 38]

Issue 40: Bits and bytes

2019-09-28T08:00:00+08:00

Previously: Networks enable data packets to get from one computer in the network to another through gateways that forward the data packets according to fixed rules. These rules are encoded in the various protocols followed by network systems, and all computers on the network agree to follow the same protocol.

But what kind of data gets transmitted over the network? And why do strange file-related things happen on my computer? I’ll unpack some of these gradually over the course of this 13-issue season.

Let’s start Season 4 slow, with a simple question: when I buy a 1TB hard drive, why does my computer say it has only 931GiB available?

A bit: the littlest bit of data

You know the game Animal, Plant or Mineral, where you ask yes/no questions to guess what the other person has chosen (from the Animal, Plant, or Mineral category)? Each yes/no question narrows down the range of options until you are finally reasonably certain you know what they have in mind.

It seems everybody knows that all the way down, computers work with 0s and 1s. They work kind of like Yes and No, too, with each digit acting like the answer to a Yes/No question, to narrow down the available information. Quick example:

Animal

Does it have more than two legs? → Yes (1)
Does it have four legs? → No (0)
Does it crawl on the ground? → No (0)
Can it jump? → Yes (1)

With this question sequence, a grasshopper would be represented as YNNY, or 1001. A millipede would be represented as 1010. A dog would be represented as 1101, but so would a cat. 4 digits can help us categorise different animals, but not all. The more questions we can ask, the better we can categorise them.
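
Here is the same idea as a short Python sketch. The function name `encode` and the `animals` table are made up for this illustration; the answers are the ones from the example above.

```python
def encode(answers: list[bool]) -> str:
    """Turn a list of yes/no answers into a string of 1s and 0s."""
    return "".join("1" if answer else "0" for answer in answers)


# Answers in question order: more than two legs? four legs? crawls? jumps?
animals = {
    "grasshopper": [True, False, False, True],
    "millipede": [True, False, True, False],
    "dog": [True, True, False, True],
    "cat": [True, True, False, True],
}

codes = {name: encode(answers) for name, answers in animals.items()}
print(codes)
```

The dog and the cat end up with the same code, which is exactly the problem: four questions are not enough to tell them apart.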

The answer to each question has 2 possible outcomes, and gives us a little bit more information. Claude Shannon, the father of modern Information Theory, called this unit the bit, a name he credited to the statistician John Tukey. What is a bit? It’s a unit of measure for information. Just as we measure weight in units of kilograms, height in units of centimetres, or time in units of seconds, we measure information in bits.

1 bit of information is enough information to reduce the uncertainty by 50%. Each question you ask in Animal, Plant, or Mineral should reduce the possibilities by half, until the remaining possibilities are small enough to guess.
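
As a rough sketch of the halving idea, this Python snippet counts how many perfectly chosen yes/no questions it takes to get down to one possibility. It assumes every possibility is equally likely and that each question splits the remainder as evenly as it can; `questions_needed` is a hypothetical name.

```python
def questions_needed(possibilities: int) -> int:
    """Count the yes/no questions needed to narrow down to a single possibility."""
    count = 0
    while possibilities > 1:
        # Each question halves the options; keep the larger half as the worst case.
        possibilities = (possibilities + 1) // 2
        count += 1
    return count


print(questions_needed(16))
print(questions_needed(1000))
```

So 4 bits are enough to pick one thing out of 16, and 10 bits are enough to pick one out of 1000.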

So in a computer, a single digit—0 or 1—is a bit.

A byte: a convenient cluster of 8 bits

In the 1970s, 8-bit microprocessors were all the rage. These were processors that processed everything in clusters of 8 bits. The word byte had been around since the 1950s, but it now became convenient to use it for exactly 8 bits, and that meaning has stuck since. It didn’t die off because so many things still use clusters of 8 bits to represent information.

8 bits can store 256 (2^8) unique values, and that turns out to be enough for many purposes. I won’t list examples here, since those examples will come in subsequent issues. If you need greater precision, you can always use 2 bytes.
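
If you want to check the arithmetic, a tiny Python sketch will do (the function name `unique_values` is made up for illustration):

```python
def unique_values(bits: int) -> int:
    """Number of different values that a given number of bits can represent."""
    return 2 ** bits


print(unique_values(8))   # 1 byte
print(unique_values(16))  # 2 bytes
```

One byte gives 256 values, while two bytes give 65,536.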

It’s all Greek (prefixes): kilo, mega, giga, tera

The metric system gave us nice prefixes to count in thousands (kilo-), millions (mega-), billions (giga-), or trillions (tera-), neatly represented by the letters k, M, G, and T respectively (case-sensitive).

So a kilobyte is 1,000 bytes, a megabyte is 1,000,000 bytes, a gigabyte is 1,000,000,000 bytes, and a terabyte is 1,000,000,000,000 bytes.

Uh oh …

Here we run into a little bit of a problem. Computers like to count in powers of two, because increasing the number of bits by one gives us double the number of possible values.

Imagine giving every byte in storage its own number, written in bits, so the computer can find it. With 1 bit we can number two bytes, since the bit can be 0 or 1. With 2 bits we can number four bytes, since the 2 bits can be 00, 01, 10, or 11.

3 bits: 8 bytes (000, 001, 010, 011, 100, 101, 110, 111)
4 bits: 16 bytes (I won’t list them from this point onwards; I think you can see the pattern)
5 bits: 32 bytes
6 bits: 64 bytes
7 bits: 128 bytes
8 bits: 256 bytes
9 bits: 512 bytes
10 bits: 1024 bytes

1024 bytes is the closest a power of two can come to 1000 bytes.
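
The doubling pattern is easy to check for yourself. Here is a minimal Python sketch that starts from 1 and doubles ten times; the names `doublings` and `closest` are just for illustration.

```python
value = 1
doublings = []
for _ in range(10):
    value *= 2  # one more bit means twice as many possibilities
    doublings.append(value)

print(doublings)

# Which of these powers of two lands nearest to 1000?
closest = min(doublings, key=lambda v: abs(v - 1000))
print(closest)
```

Ten doublings take us to 1024, which is 2^10.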

Can’t be unseen

If you’re on a Windows computer, go to My Computer. If you’re on another OS, go to whichever app shows you available disk space. Look carefully at the units for free space.

Disk space is not always counted in MB, GB, or TB. Many systems count it in MiB (mebibytes), GiB (gibibytes), or TiB (tebibytes), and some (Windows among them) still label the result MB, GB, or TB! Those units are not the decimal units we are used to. We count bytes differently from computers.

Humans: Since 10^3 is 1000, a kilobyte is 1000 bytes, a megabyte is 1000 kilobytes, a gigabyte is 1000 megabytes, and a terabyte is 1000 gigabytes.
Computers: Since 2^10 is 1024, a kibibyte (KiB, or kilo binary byte) is 1024 bytes, a mebibyte (MiB, or mega binary byte) is 1024 kibibytes, a gibibyte (GiB, or giga binary byte) is 1024 mebibytes, and a tebibyte (TiB, or tera binary byte) is 1024 gibibytes.

When you buy a 1TB hard drive, you are buying a 1,000,000,000,000-byte drive.

1,000,000,000,000 bytes ÷ 1,024 = 976,562,500 kibibytes (KiB)
976,562,500 kibibytes ÷ 1,024 ≈ 953,674 mebibytes (MiB)
953,674 mebibytes ÷ 1,024 ≈ 931 gibibytes (GiB)

So your computer isn’t lying, it’s just using different units of counting.
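
If you don’t have a calculator handy, the same division can be sketched in Python. The variable names are my own; the only assumption is that the drive holds exactly 1,000,000,000,000 bytes.

```python
drive_bytes = 1_000_000_000_000  # a "1TB" drive, counted the human way

kib = drive_bytes / 1024  # bytes to kibibytes
mib = kib / 1024          # kibibytes to mebibytes
gib = mib / 1024          # mebibytes to gibibytes

print(f"{kib:,.0f} KiB")
print(f"{mib:,.0f} MiB")
print(f"{gib:,.0f} GiB")
```

Dividing by 1,024 three times is the same as dividing by 2^30, which is why the final figure comes out at about 931.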

Issue summary: A bit is a unit of measurement for information. 1 bit of information is enough to reduce the uncertainty by 50%. 8 bits comprise 1 byte. Humans count bytes in multiples of thousands, while computers count bytes in multiples of 1,024.

There are much shorter versions of this explanation on the Internet, but I found none of them satisfying, because they try to paper over the mathematical detail. While this newsletter is intended for layfellas, the math is something that can be worked out with a calculator, and I found that showing the detail makes it easier to understand.

There may be a social-construct argument to be made here for units of measurement, but I won’t go into that here. I wanted Issue 40 to start with an example of how things work differently between a human mind and a computer’s “computational mind”, and I hope I’ve achieved that.

What I’ll be covering next

Next issue: ASCII, the typewriter digitised

We started with a (figurative) bit of bean-counting. Let’s get right into how computers work with text in Issue 41, so that I can finally answer one of the sometime-in-the-future questions below: What is ASCII? And I’ll answer another one in Issue 42: What is Unicode?

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
Unicode? And what does it have to do with emoji? [Issue 8]
those ‘\r\n’s in the HTTP request packet [Issue 12,17]?
a good reason developers write code and give it away for free online? [Issue 21]
ASCII? [Issue 23]
compiling code into an application [Issue 26]?
firmware? [Issue 34]
HTML? [Issue 38]