Layman's Guide to Computing - Season 06

Issue 78: uMatrix: voyeuring the voyeurs

2020-07-04T08:00:00+08:00

Previously: The default settings of most browsers expose a lot of information to scripts that request it. To prevent such scripts from running, we need services that can filter the source of these scripts. These services generally work by matching browser requests against a blacklist, and blocking the request if it comes from a domain known to host malicious scripts.

Many existing solutions to blocking scripts—let’s call them script-blockers—rely on manually managing a blacklist. That is to be expected, but few of them make it easy to see which domains the scripts are coming from.

uMatrix

As part of my research for this season, I installed uMatrix, a browser extension for Firefox, Chrome, and Opera.

Once installed, it adds a button beside the address bar. When clicked, this button pops up a matrix showing the number of resources loaded from each domain:

uMatrix in Firefox showing default settings.
Items highlighted in green are permitted to load, items in red are blocked.

Understanding the Matrix

Along the top row, the column headers tell us what kind of resources are being requested by the page. A quick refresher:

cookies are little bits of information that scripts attach to a domain in the browser (Issue 69)
CSS (Cascading Style Sheet) files describe the styling to be applied to the page
image files need no explanation I hope
media covers any rich/animated media e.g. videos
scripts are JavaScript files containing code to be executed when the page has loaded them
XHR (XMLHttpRequest) calls are requests for other resources: to verify a Captcha, get a winning ad bid (Issue 73) … or something as innocent as getting the weather forecast
frame refers to iframes (inline frames), which are a way of embedding a webpage inside another. You see this often on sites which display PDF files within their pages. But this can also be used to embed Captcha puzzles within a login box, for instance.
other: I won’t go into the other esoteric means of loading data onto a webpage; we won’t need that for this issue

At a glance, I can see that just to load the login, the Dropbox webpage is pulling resources not only from dropbox.com, but also from:

dropboxcaptcha.com
dropboxstatic.com
google.com
fonts.googleapis.com
gstatic.com
googletagmanager.com

These are represented as row labels.

The numbers in each cell represent how many resources of each type are being loaded from each domain.

CSS and images are considered important and quite harmless, and are thus allowed by default. First-party resources (Issue 76) are allowed too: since the website itself has full control over them, they are considered “secure”, assuming you trust that website enough to be there in the first place.
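To make the matrix a little more concrete, here is a minimal Python sketch of how such a grid could be tallied and coloured. This is not uMatrix's actual code: the request list, the built-in blacklist, and the order of the rules are my own assumptions, based only on the defaults described above (CSS and images allowed, first-party allowed, known trackers blocked).

```python
from collections import Counter

# Hypothetical requests made while loading a page: (domain, resource type).
REQUESTS = [
    ("dropbox.com", "script"),
    ("dropbox.com", "cookie"),
    ("cfl.dropboxstatic.com", "script"),
    ("cfl.dropboxstatic.com", "script"),
    ("cfl.dropboxstatic.com", "css"),
    ("fonts.googleapis.com", "css"),
    ("google.com", "frame"),
    ("googletagmanager.com", "script"),
]

BLACKLIST = {"googletagmanager.com"}  # assumed built-in tracker list
ALLOWED_TYPES = {"css", "image"}      # considered harmless by default


def build_matrix(requests):
    """Count requests per (domain, type) cell, like the numbers in the grid."""
    return Counter(requests)


def is_first_party(domain, site):
    """Treat the site itself and its subdomains as first-party."""
    return domain == site or domain.endswith("." + site)


def decide(domain, rtype, site):
    """Return 'allow' or 'block' for one cell under the assumed defaults."""
    if domain in BLACKLIST:
        return "block"
    if is_first_party(domain, site) or rtype in ALLOWED_TYPES:
        return "allow"
    return "block"


matrix = build_matrix(REQUESTS)
for (domain, rtype), count in sorted(matrix.items()):
    verdict = decide(domain, rtype, "dropbox.com")
    print(f"{domain:25} {rtype:7} {count:2} {verdict}")
```

Each printed row corresponds to one cell of the matrix: the domain is the row label, the resource type is the column, the number is the count, and the verdict is the colour.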

Blacklisting or whitelisting domains

By default, some domains known to host scripts for tracking are already blacklisted. googletagmanager.com (highlighted in bold red) is the domain for Google’s Tag Manager platform for measuring and analysing browsing data. It is how their ads can get personalised data on you, so it is on uMatrix’s blacklist once you install it.

Other third-party domains are blacklisted by default (highlighted in light red) for your safety, but I can choose to whitelist them by clicking on them until they are highlighted in light green.

Dissecting page functionality

That’s interesting … blocking all third-party resources does not stop the page from loading at all! So what are those resources doing (especially the 63 scripts from cfl.dropboxstatic.com)? Let’s continue using the webpage to find out.

Error (405) means Method Not Allowed: the server refused the type of request it received, which suggests that with so much blocked, the webpage no longer knew how to send the right one. Oops.

Error 405. Looks like I broke something. This is the tedious part: I whitelist one domain at a time, reloading the page each time to see if anything changes.

It turns out the Dropbox webpage is doing a surprising number of things behind the scenes! By the time I managed to get a login, uMatrix looked like this:

uMatrix in Firefox showing settings that got Dropbox working.
I had to allow embedded frames from dropboxcaptcha.com and google.com as well.

Spotting the patterns

If you are thinking of trying this, be warned: this will frustrate your browsing experience for the first week or so (after you take a couple of days to figure out how the uMatrix interface works) while you build up a custom whitelist of domains on your usual online haunts. There is an “off” button for times when you really don’t have the brainspace to be figuring this out (e.g. when you are just trying to get some ibanking done quickly), but it shouldn’t be the default setting.

I did this because I wanted to know what my web browser is doing. And here are some things I’ve figured out through this exercise:

Big websites often load their unchanging (static) resources, such as images, CSS files, script files, etc, from a separate domain.
Presumably they do this so that this other domain can be set up for caching (Issue 39). Having static files cached on the browser makes the browsing experience much smoother, as static parts such as the icons and stylesheets can be rendered (put on screen) first while waiting for dynamic data to load.
Dropbox loads their static resources from dropboxstatic.com.
Big websites may load their dynamic data from a CDN (Issue 73).
Once traffic gets large enough that a single server might not be able to handle peak load, many online services switch to delivering their content through a CDN (such as Squarespace). These resources will appear to be loaded from a third-party. So anything with a “cdn” in the domain is probably safe.
ReCaptchas don’t always need a pop-up.
Some of them run in the background, checking to see if you have already been verified human somewhere else, or verifying you by other means.
Dropbox loads its captchas from dropboxcaptcha.com and google.com (for Google’s reCaptcha service). Two layers of captchas!
There are many websites out there that rely on google.com being whitelisted. This is what happens when you have a single company providing so many critical services that their domain has to be whitelisted. If blocked, the webpage will no longer work.
Some websites rely on “daisy-chaining”, where script A loads script B which loads script C, and so on.
You know this because when using uMatrix, you whitelist a domain and reload the page, and another domain appears. You whitelist that domain, and another one appears …
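The daisy-chaining experience can be imitated with a toy model. The sketch below is only an illustration: the domain names and the `LOADS` table are made up, and I am assuming that a blocked domain still shows up in the matrix (because the request was attempted) while everything its script would have loaded stays hidden until it is whitelisted.

```python
# Hypothetical dependency chain: each domain's script requests the next one.
LOADS = {
    "example-site.test": ["static.example-cdn.test"],
    "static.example-cdn.test": ["captcha.example.test"],
    "captcha.example.test": ["fonts.example.test"],
    "fonts.example.test": [],
}


def visible_domains(site, whitelist):
    """Domains that show up on one page load, given the current whitelist."""
    seen, queue = [], [site]
    while queue:
        domain = queue.pop(0)
        if domain in seen:
            continue
        seen.append(domain)
        # Only the site itself and whitelisted domains get to run their
        # scripts, so only they can pull in further domains.
        if domain == site or domain in whitelist:
            queue.extend(LOADS.get(domain, []))
    return seen


def whitelist_until_stable(site):
    """Whitelist one newly revealed domain per reload until nothing is blocked."""
    whitelist, reloads = set(), 0
    while True:
        reloads += 1
        blocked = [d for d in visible_domains(site, whitelist)
                   if d != site and d not in whitelist]
        if not blocked:
            return whitelist, reloads
        whitelist.add(blocked[0])


final_whitelist, reload_count = whitelist_until_stable("example-site.test")
print(sorted(final_whitelist), reload_count)
```

With a chain of three third-party domains, the loop needs one reload per domain plus a final reload to confirm nothing new has appeared, which matches the whitelist-reload-repeat routine described above.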

Issue summary: Modern webpages rely on many third-party resources for their functionality. Blocking access to some domains may cause these webpages to break and stop working.

This was fun, in a masochistic sort of way. Most of what I learnt here is not really newsletter-worthy: how prevalent Google is, what a clean webpage looks like in the backend (very few domains), what a massive webpage looks like (lots of domains! E.g. Trello), what the most popular CDNs are, and some dead giveaways of a webpage quickly spiralling out of control (large numbers on a single domain, slow loading with no static domain or CDN) … maybe I’ll figure out the layman-worthy parts of it someday and put it in another season.

What I’ll be covering next

Next issue: [LMG S7] Issue 79: A Base for Data

Next season, we go back to data again. Specifically, we look at how data is stored and managed for most of the internet: in a database.

What is a database and why do we need one?

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 77: Wearing clothes on the Internet

2020-06-20T08:00:00+08:00

Previously: Cookies with the same domain as the site are first-party cookies, while cookies with domains different from the site are third-party cookies. Cookies are used for all kinds of purposes, from remembering browsing sessions, to logging users in, to tracking their identity across websites. Blocking all third-party cookies indiscriminately can result in most if not all of these functions breaking. And yet, not blocking them at all means that you are being tracked across all your browsing sessions, likely without your explicit permission.

I apologise for the titillating title, though I believe it is apt. After all, your choice of clothing is not about ensuring not a single square centimetre of skin is seen, nor is it about covering the absolute bare minimum. It is not about everybody having to follow the exact same dress code. It is about giving you choices about how far along the spectrum you want to be, from totally uncovered at one end to totally covered at the other end. It is about giving you options in deciding where to cover and where not to cover.

But I’m getting ahead of myself. Cover yourself from what? From scripts that seek to see things they shouldn’t. And what are they trying to see? Your information.

What a script sees

There are websites online (such as privacy.net) which can tell you what information is exposed by your browser (and other settings). They do so by, of course, actually extracting this information by any means possible. Go on, give it a try if you’re not paranoid.

If you are, I did it for you so you don’t have to. Here’s what it can see, in decreasing order of control (I skip privacy hacks/cheats here because the list would be almost endless):

IP address
From your IP address, it is one IP lookup away from finding out your ISP and your approximate location (using geoIP services).
Browser (and probably OS), with version information
This lets scripts know if you are using a (possibly outdated) browser version. Since most browser vulnerabilities are published online (to help security researchers patch them), you should keep your browser updated to benefit from these security patches.
OS information can provide some demographic information (e.g. if you are an Apple user or Linux user), and also whether you are on a mobile browser or laptop browser. With many data points, a data aggregator can learn if you are on the move often (mostly on mobile browser) or generally static (about 50/50 between mobile and laptop).
Screen resolution (Issue 44)
This can provide enough info to put you in an income bracket; cheaper devices generally have lower resolution. A mid-range or high-end phone usually has a resolution of 1080×1920 or higher.
Autofill information
Any information you save in your browser, to be autofilled in forms, can be extracted by a script. It creates a hidden input field that the browser detects and autofills. The script can then send this information as an HTTP request back to the originating server.
Accounts you are logged in to.
A script can sniff other cookies on your browser session and match them against known cookies to do this. These cookies may also contain other info, such as your username, last accessed timestamp, last search term, etc.
Information that you have given permission to access
If you run browser plugins and third-party services on your accounts (e.g. Google Drive Addons), you may have granted additional permissions that give these services permission to access your contact list, location, microphone, camera, etc. Needless to say, they now have access to that information.
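As an illustration of the browser and OS item in the list above, here is a rough Python sketch of how a server could read that information from the User-Agent text a browser sends with each request. The sample string and the matching rules are simplified assumptions of mine; real detection code handles far more cases.

```python
import re

# A sample User-Agent string of the kind a browser sends with every request.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:78.0) Gecko/20100101 Firefox/78.0"


def describe(user_agent):
    """Pull a browser name/version and a rough OS guess out of a UA string."""
    browser = re.search(r"(Firefox|Chrome|Safari)/([\d.]+)", user_agent)
    # Android is checked before Linux because Android UA strings mention both.
    os_name = next(
        (name for name in ("Windows", "Android", "iPhone", "Mac OS X", "Linux")
         if name in user_agent),
        "unknown",
    )
    platform = "mobile" if os_name in ("Android", "iPhone") else "laptop/desktop"
    return {
        "browser": browser.group(1) if browser else "unknown",
        "version": browser.group(2) if browser else "unknown",
        "os": os_name,
        "platform": platform,
    }


print(describe(USER_AGENT))
```

A data aggregator that logs this for every request only needs to count how often the platform comes back as mobile to guess how much you are on the move.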

All this, before a script even lays a single cookie on you! Then there’s all the information it can get through the tracking pixels and cookie IDs on the webpage, when it looks up those IDs in its own database. And if it takes a step further and attempts to exploit some common vulnerabilities, it may also know:

Your browsing history
Browsers show links you have visited before in a different style from links you have not (if you’re a millennial or Gen-Xer, remember the blue links and purple links?). A script can check which style a link has been given, and so learn whether you have visited it. By applying this check on every URL it comes across, it is able to build up a browsing history of your device, albeit in a limited way.

Okay, I’ll stop scaring you here, although I am by no means done with all the things a script can do once it has been loaded by a webpage. But I hope I’ve made my point: you need to limit what scripts can see about you. In other words, you need to wear clothes on the Internet.

Let’s talk about some broadly useful strategies (note: this is a newsletter, not a howto guide. I won’t walk you through the steps here, just outline the strategies available):

DNS blocking

A quick refresher on DNS (Issue 28): each time the browser is given a URL to load, it first figures out the IP address associated with the domain name of the URL (e.g. facebook.com is the domain name of a URL like https://www.facebook.com/<username>/posts/17-digit-number). It does this through a DNS lookup request to a DNS server.

Your default DNS server is usually your ISP. This allows your ISP to do some content filtering for you (e.g. if you signed up for a parental control service by them), by simply blocking all requests to a particular IP address or domain. e.g. if you have ISP parental controls enabled, and the ISP detects a DNS lookup request to resolve a blacklisted domain like www.xxxchicksxxx.com to its IP address, it will simply block the request by not returning any result—stopped at the source! (Note: that URL is probably fictional, I have not tested it!)
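Here is a minimal Python sketch of the idea behind DNS blocking. Everything in it is made up for illustration: the record table stands in for a real DNS server, the addresses are reserved documentation addresses, and the blocklist has a single hypothetical entry. A real resolver would ask other servers instead of looking in a dictionary.

```python
from urllib.parse import urlsplit

# Hypothetical DNS records and blocklist.
RECORDS = {
    "www.example.test": "192.0.2.10",
    "ads.tracker.test": "203.0.113.7",
}
BLOCKED_DOMAINS = {"tracker.test"}


def hostname_of(url):
    """Extract the host part of a URL, which is what gets looked up."""
    return urlsplit(url).hostname


def is_blocked(hostname):
    """True if the hostname or any of its parent domains is on the blocklist."""
    labels = hostname.split(".")
    return any(".".join(labels[i:]) in BLOCKED_DOMAINS
               for i in range(len(labels)))


def resolve(url):
    """Return the IP address for a URL, or None if filtered or unknown."""
    host = hostname_of(url)
    if is_blocked(host):
        return None  # no answer is returned, so the browser cannot connect
    return RECORDS.get(host)


print(resolve("https://www.example.test/user/posts/123"))
print(resolve("https://ads.tracker.test/pixel.gif"))
```

Because the check happens before any connection is made, a blocked domain never even receives the request.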

What if you don’t want to pay for that service? You could use other alternatives, such as OpenDNS. You will need to:

Register an account. You need an account for OpenDNS to remember your settings.
Change your DNS server IP address to OpenDNS’s servers: 208.67.222.222 and 208.67.220.220
If you do this on your wireless router, anyone using that wifi connection will use the same DNS server—benefits for all!
Decide the level of filtering you want. You can customise the blocked domain names, or whitelist some that you need (the higher levels can be pretty aggressive and cause some services to stop working)
Register your IP address with your account, so OpenDNS can apply your setting to requests from your IP address. Since your ISP may change your IP address periodically, you may need to enable a DDNS service (Issue 31), again best done on your router. Some modern routers may have this built-in for you to configure.

Script filtering with a browser addon

Some browser addons can help you detect script sources, and block the script from loading if the originating domain is blacklisted. The blacklist is full of tracking companies and data aggregators, and is updated by volunteers on a regular basis.

This currently works on laptop browsers only, as most mobile browsers do not support addons.

Web-filtering mobile apps

Although mobile browsers do not support addons, some mobile apps are able to help you do this blocking. They do so by setting up an app-controlled VPN on your phone, routing all internet traffic through that VPN, and filtering blacklisted DNS lookup requests.

Conclusion

The options are not many, and they often don’t leave you with many configuration options. Adding a domain to a blacklist/whitelist is tedious, and most users end up not enabling it at all.

Issue summary: The default settings of most browsers expose a lot of information to scripts that request it. To prevent such scripts from running, we need services that can filter the source of these scripts. These services generally work by matching browser requests against a blacklist, and blocking the request if it comes from a domain known to host malicious scripts.

What I’ll be covering next

Next issue: [LMG S6] Issue 78: uMatrix: voyeuring the voyeurs

In my virtual travels, I have found an addon that actually makes it easy for you to see what domains the scripts on a page are coming from. It even makes it easy for you to decide if you want to block them in future. It is by no means easy to use, as it requires some background knowledge of what the different kinds of requests are and what they do, but it makes it really easy to experiment and learn about privacy at the same time!

I’ll reserve the last issue of this season to show you some screenshots from it :)

Issue 76: Third-parties and cross-site resources

2020-06-13T08:00:00+08:00

Previously: By not enforcing strict cookie policies on their own sites, publishers allowed advertisers to sneakily set cookies on their site audience. This allowed advertisers to reach the same audience via their advertising slots on other websites, which could be bought more cheaply. The publishers were cut out of the value chain and were no longer “gatekeepers” to their own site readers. They could not sell their advertising slots at a premium.

First-party cookies

Almost every site that needs to “remember” who you are will set cookies on your browser. The reasons for doing so can range from simply remembering that you are not new to the site and don’t need to be reminded to subscribe to their promotional newsletter, to giving you a login cookie so that the site knows you are logged in. (This cookie gets removed when you log out, which is why clearing cookies automatically logs you out of most sites.)

The site publisher sets these cookies via scripts that are often hosted on the same URL. Since cookies are tagged by URL domain, these cookies will have the same domain as the site. These are first-party cookies. Blocking these will result in internet-wide breakage, particularly the large majority of login mechanisms.

Third-party cookies

On the other hand, if the site uses a script from another domain, and this other-domain script sets a cookie, that cookie will have a domain tag that is not the same as the site URL domain (e.g. huffpost.com using a script from an advertiser that sets an advertising cookie, which will not have huffpost.com as its domain). These cookies are known as third-party cookies.
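The distinction boils down to comparing two domains. Below is a simplified Python sketch of that comparison. Treat it as an assumption-laden toy: real browsers work out the “registrable” part of a domain using a public list of domain suffixes, whereas this version just expects you to hand it the site’s main domain. The advertiser domain is made up.

```python
def classify_cookie(cookie_domain, site_domain):
    """Label a cookie first-party or third-party relative to the visited site.

    Simplification: the cookie counts as first-party if its domain is the
    site's domain or one of its subdomains.
    """
    cookie_domain = cookie_domain.lstrip(".").lower()
    site_domain = site_domain.lower()
    if cookie_domain == site_domain or cookie_domain.endswith("." + site_domain):
        return "first-party"
    return "third-party"


# Cookies found while visiting huffpost.com (the advertiser domain is hypothetical).
for domain in ("huffpost.com", ".www.huffpost.com", "ads.advertiser.test"):
    print(domain, "->", classify_cookie(domain, "huffpost.com"))
```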

These cookies enable advertisers and data-mining companies to track you across websites. Any website you visit which is running their script can retrieve these cookies and send the cookie information back to their servers. This is known as cross-site tracking.

A simple way to block pretty much all cross-site tracking is to block third-party cookies. But this also causes other problems, as I will explain below.
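To see why blocking third-party cookies disrupts cross-site tracking, here is a small Python simulation. The `Tracker` class, the cookie ID format, and the site names are all hypothetical; the only point is to show what the tracker’s records look like when the browser does or does not hand back the cookie it was given.

```python
import itertools
from collections import defaultdict


class Tracker:
    """A hypothetical third-party tracker whose script runs on many sites."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = defaultdict(list)  # cookie ID -> sites it was seen on

    def handle_request(self, site, cookie_id=None):
        """Log the visit and return the cookie ID for the browser to store."""
        if cookie_id is None:
            cookie_id = f"id-{next(self._ids)}"  # first sighting: new ID
        self.history[cookie_id].append(site)
        return cookie_id


def browse(tracker, sites, third_party_cookies=True):
    """Visit each site in turn and return what the tracker has learnt."""
    stored = None  # the tracker's cookie as kept by the browser
    for site in sites:
        returned = tracker.handle_request(site, stored)
        if third_party_cookies:
            stored = returned  # the browser keeps the cookie and resends it
    return dict(tracker.history)


SITES = ["news.test", "shop.test", "social.test"]
print(browse(Tracker(), SITES, third_party_cookies=True))
print(browse(Tracker(), SITES, third_party_cookies=False))
```

With cookies allowed, all three visits end up under one ID. With them blocked, the tracker sees three apparently unrelated visitors.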

Software-as-a-Service needs third-party scripts

There was once a time when sites took it upon themselves to run all the services they needed. Login, authentication, database management … everything was handled on the server, by scripts that originated on the same server. First-party everything.

As sites grew more complex, Software-as-a-Service (SaaS) companies grew to provide more specialised services involved in the running of such sites. Companies cropped up offering off-site databases, login servers, and all manner of services. That means that when users visit your site, the browser downloads the SaaS company’s script, which carries out the task for you.

For example, Google’s reCAPTCHA service lets you add a CAPTCHA to your site. A CAPTCHA is a test that humans are supposed to pass and bots (automated scripts) are supposed to fail: usually some image recognition-based task such as “identify all buses” or “identify all traffic lights” or “type the letters you see”. The code involved in carrying this out is not simple, and most sites are not capable of running the full backend required to make it work. So they embed a reCAPTCHA script from Google on their site, let the script verify that the user is a human, and then carry on as usual.

However, the Google reCAPTCHA script sets and retrieves cookies. (I am guessing it probably sends your Google cookie to its servers to look up your online history and determine if you are malicious or not.) Since the script originates from Google and not from the website itself, the browser considers it a third-party cookie. Disabling third-party cookies will also cause reCAPTCHA to fail, resulting in a non-functional login for the site.

Cookie categories

For this reason, cookie policies often differentiate between cookie categories:

Session cookies
Cookies that are set for that browsing session only. These cookies are removed when the browser window is closed. These cookies may be used to remember your progress in a multi-step transaction, e.g. doing a multi-page survey.
Persistent cookies
These cookies last beyond the current browsing session, and normally terminate after a pre-defined period of time (I often see “1 year” as a default value of sorts, although it can even be set to 30 years!) Such cookies are used to remember the state you left a service in (e.g. what you have in your shopping cart, even if you didn’t log in or create an account).
Strictly necessary cookies
(Subjectively) necessary cookies for legal compliance or other reasons, for example implementing parental controls, or internal analytics (tracking most-visited pages, or visit frequency).
Functional cookies
Cookies set for the intent of enabling site functionality, e.g. remembering preferences and settings.
Performance cookies
Cookies that enhance the website’s performance, not always what you think that means. For example, if the website is trying out a new feature, they may do A/B testing, giving one cohort of users the “A” interface and another cohort the “B” interface. Which cohort you are in is decided at random, and remembered with a performance cookie.
Advertising cookies
Just explained in Issues 73 and 74.
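The difference between the first two categories, session and persistent cookies, comes down to whether the cookie is given a lifetime when it is set. The Python sketch below uses the standard library’s `http.cookies` module to show this; the cookie names and values are made up, and I use Max-Age with the “1 year” value mentioned above.

```python
from http.cookies import SimpleCookie

ONE_YEAR = 365 * 24 * 60 * 60  # lifetime in seconds


def make_cookie(name, value, max_age=None):
    """Build the text a server sends to set a cookie.

    With no Max-Age (and no Expires), the browser treats it as a session
    cookie and discards it when the browsing session ends.
    """
    jar = SimpleCookie()
    jar[name] = value
    if max_age is not None:
        jar[name]["max-age"] = max_age
    return jar[name].OutputString()


session_cookie = make_cookie("survey_page", "3")
persistent_cookie = make_cookie("cart_id", "abc123", max_age=ONE_YEAR)
print(session_cookie)
print(persistent_cookie)
```

The other categories in the list (strictly necessary, functional, performance, advertising) describe what a cookie is used for, not how long it lasts, so nothing in the cookie itself marks it as one or the other.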

Caveats

I think it is only responsible for me to point out here that the above categorisation is not exactly enforced by law, and nothing stops a company from miscategorising their cookies so as to mislead a user into enabling them. For instance, some sites may categorise a cookie for tracking identity as a functional cookie, justifying it by claiming it as part of their security measures, and thereby require the user to enable third-party functional cookies before they are able to use the site.

Objections to internet-wide disabling of third-party cookies

It would come as no surprise that ad companies object to such measures, claiming it will “hurt the user experience”, “sabotage the economic model for the Internet”, and “disrupt the valuable digital advertising ecosystem that funds much of today’s digital content and services”. (The quoted parts come from an open letter from the Digital Advertising Community to Apple Inc.)

Other websites have chimed in with the above concerns about disrupted provision of third-party services (X-as-a-Service providers e.g. Software-as-a-Service especially). Right now the shakeout is happening, with the browsers working out an alternative to third-party cookies, software service providers working out alternatives to cookies for providing services, and ad companies finding more subtle ways to track users. It remains to be seen what the Internet will be using in the next 5 years.

Issue summary: Cookies with the same domain as the site are first-party cookies, while cookies with domains different from the site are third-party cookies. Cookies are used for all kinds of purposes, from remembering browsing sessions, to logging users in, to tracking their identity across websites. Blocking all third-party cookies indiscriminately can result in most if not all of these functions breaking.

What I’ll be covering next

Next issue: [LMG S6] Issue 77: Wearing clothes on the Internet

Today we wear clothes for all kinds of reasons: to look cool, to cover ourselves up, to feel comfy … but I suppose in more prehistoric times, the primary purposes of clothing were more basic: to protect one from the elements, and to hide information.

What kind of information? Wounds, vulnerabilities, illnesses, recognisable features (e.g. tattoos), sometimes even sex … all of these are information that people sought to hide from each other in popular fiction, and presumably in real life as well.

Whether you believe in sharing about yourself openly or only sharing what is necessary, nobody today goes around naked (with the exception of nudist communities). Yet, as recently as ten years ago, we were doing the equivalent on the Internet: any information that websites requested about us was given freely, with few restrictions if any.

Today, with advertisers and other data-mining companies tracking you everywhere you go, with malicious hackers, phishers, and scammers waiting to snare unsuspecting users, and with more at stake being tied to your personal digital identity, we have to do better.

We have to wear clothes on the Internet.

Issue 75: The Costs of Data Leakage

2020-06-06T08:00:00+08:00

Previously: Data companies use the data they have gathered to determine what ads to serve you when you visit sites that load their cookie-setting scripts. This data is sent from your browser via a document request, or via a tracking pixel request.

Content adjacency

Ads used to be much more discriminate: You would publish only certain kinds of ads in Playboy magazine, another kind of ad in The New Yorker, and yet another kind in The New York Times. This, of course, is seldom dictated by the publishers themselves; mostly, the advertisers self-selected where they would like their ads to appear. IBM wouldn’t publish their ads in Playboy; they won’t reach their target group this way, and their ad spending would be wasted.

This idea was known as content adjacency: to reach your target group, you want to place your ads next to content that they would read. Content adjacency gave publishers a lot of power, since they were the gatekeepers to published ads.

But today, that power has mostly leaked away, to ad exchanges. The ads on HuffPost, NYT, and just about any newspaper look largely similar. These advertising slots are sold to ad exchanges, which decide (through the automated bidding) which ads to display to the viewer; no two viewers see the same set of ads. Content adjacency is irrelevant here. The power of ad filtering lies not with the publishers, but with the ad exchanges now.

The danger of advertising: cookie leakage

In Issue 71, I mentioned that part of the value QuantCast brought to the table is that in exchange for letting them put a cookie on your site, they would also tell you more about your audience, far more than you could ever know collecting information on your own.

But here’s the thing: it is very hard for a website’s publisher to know when an advertiser is setting a cookie. When an advertiser is allowed to put advertisements on a website, you are tacitly allowing them to put in a script that is supposed to request an ad from the ad server (after getting a winning bid from the ad exchange). This script could easily, at the same time, set a cookie and return cookie data along with that request.

The only way to catch this is to load the page yourself, compare the site data before and after, and see if any cookies are being set. You could automate this, but you’ll need resources to run that regularly on every webpage you publish, and those were resources that publishers were loath to spend to protect their data and their readers.
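The before-and-after comparison is simple in principle. Here is a minimal Python sketch of it, assuming the cookies have already been collected into two snapshots. The snapshot format and all the names in it are hypothetical, and the hard part (loading every page in a real browser to capture the snapshots) is left out.

```python
# Cookie snapshots are modelled as {(cookie domain, cookie name): value}.
BEFORE = {("news.test", "session"): "s1"}
AFTER = {
    ("news.test", "session"): "s1",
    ("news.test", "prefs"): "dark",
    ("adnetwork.test", "uid"): "abc123",
}


def cookies_added(before, after):
    """Cookies present after the page load that were not there before."""
    return {key: value for key, value in after.items() if key not in before}


def leaked_to_third_parties(before, after, site):
    """Domains other than the site (or its subdomains) that set new cookies."""
    def is_first_party(domain):
        domain = domain.lstrip(".")
        return domain == site or domain.endswith("." + site)

    return sorted({domain for (domain, _name) in cookies_added(before, after)
                   if not is_first_party(domain)})


print(cookies_added(BEFORE, AFTER))
print(leaked_to_third_parties(BEFORE, AFTER, "news.test"))
```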

The danger of cookie leakage: audience leakage

Why would advertisers want to sneak cookies like this? Let me put it this way: nobody ever uses the Internet just for reading The NYT. NYT readers might head to Facebook to see how their friends are doing (and view Facebook ads), they might send out some angry tweets on Twitter (and see Twitter ads), they might head to Amazon or Barnes & Noble or any number of sites to do the necessaries.

And these readers can be reached on these other sites if the advertisers buy advertising slots with them. They no longer needed to rely on The NYT to reach a particular class of consumers. If The NYT thought they could price their advertising slots more expensively for the exclusive reach to upper-class readers, they now no longer have that advantage. Those readers are tied to a cookie ID now, not to a website URL.

The publishers were being cut out of the value chain.

Issue summary: By not enforcing strict cookie policies on their own sites, publishers allowed advertisers to sneakily set cookies on their site audience. This allowed advertisers to reach the same audience via their advertising slots on other websites, which could be bought more cheaply. The publishers were cut out of the value chain and were no longer “gatekeepers” to their own site readers. They could not sell their advertising slots at a premium.

What I’ll be covering next

Next issue: [LMG S6] Issue 76: Third-parties and cross-site resources

One way that web browsers and privacy advocates are trying to protect users is by pushing for stricter third-party cookie restrictions. Firefox started blocking third-party cookies by default in Sep 2019, Safari started doing so in Apr 2020, and Chrome intends to do so from 2022.

Many sites are against this, arguing that it will break some “basic internet functionality”. What is this furore about? I’ll explain what third-party cookies and resources are in the next issue, and summarise some of the objections that sites are raising.

Issue 74: The Walls Have Pixels

2020-05-30T08:00:00+08:00

Previously: When a page loads advertisements through header bidding, it sends your cookie along with other information to an ad exchange. The ad exchange conducts automated bidding among the ad-buyers, determines the winner(s), and sends the winning code(s) back to your browser. Your browser then sends these codes to the CDN, which sends back the winning ads for your page to render in your browser.

So how does Facebook know what you just bought on Amazon? I hope the previous post sheds some light on that. But not everything is a web browser, and not everything uses cookies (especially apps). This post is about another way that your data gets shuttled along to whoever has a data-sharing agreement with the site you are on.

Tracking pixels: another way of sending information

Even if you disable third-party tracking cookies and JavaScript that didn’t originate from the same page, information about where you went can still be sent to these servers. Can you guess how?

Obviously when you loaded the page, some information already went to the server to tell it what your browser wants. But beyond that, have you ever wondered about the images that get loaded?

Let’s revisit HuffPost again, this time filtering only for image loads:

Chrome DevTools showing filtered image requests.
A request for a tracking pixel is highlighted in blue.

Hmm … why does an image request need to be so long? Anytime you see a long URL like that, with a ? after the URL proper, and peppered with &s and =s, alarm bells should be going off in your head: data is being sent to the server (Issue 70)!
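If you want to see what such a URL is carrying, the part after the ? can be unpacked mechanically. Here is a short Python sketch using the standard library; the URL is a made-up example, not the one from the screenshot, and the parameter names are my own invention.

```python
from urllib.parse import urlsplit, parse_qs

# A hypothetical tracking-pixel URL.
URL = ("https://pixel.tracker.test/p.gif"
       "?uid=abc123&page=%2Fnews%2Fstory&ref=search&w=1920&h=1080")


def data_in_url(url):
    """Split a URL into the server, the 'image' path, and the data sent along."""
    parts = urlsplit(url)
    # parse_qs returns a list per key; keep the first value of each.
    data = {key: values[0] for key, values in parse_qs(parts.query).items()}
    return parts.hostname, parts.path, data


host, path, data = data_in_url(URL)
print(host, path)
for key, value in data.items():
    print(f"  {key} = {value}")
```

Every name=value pair separated by an & is one piece of data handed to the server along with the image request.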

Let’s see what this image looks like:

This is a tracking pixel.
You can’t see it. The image info sidebar shows that its dimensions are 1×1 pixels.

Wha—?!

What is your browser doing, loading a useless 1×1 image? If it appears to be doing something useless, you’re not looking in the right place. The image itself is clearly useless; it’s just a way to get your browser to send information to a server.

Tracking pixels work hand in hand with cookies

This request for the tracking pixel was sent from a script. My cookie information was embedded in the request URL when it was sent. So a tracking pixel is another mechanism for sending cookies, besides sending a generic document request via the script like we saw in Issue 70.

If you have a popular website, ad exchanges will ask to pay you to put their ads on your website. These ads are served after the user’s browser sends the user’s cookie to the ad exchange, which triggers an automated bidding process. The winning bid gets sent to the CDN (content delivery network), which serves the ads (Issue 73).

On the other hand, data companies don’t serve ads. They usually ask to put a tracking pixel on your website, which means they ask you to put in their script. This script will scrape whatever data it can about the page the user is on and related user activity, and embed it in the pixel request along with the user’s cookie.
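To make the mechanism concrete, here is a minimal sketch in Python of how a script might pack a cookie and some page details into an image URL, and what the receiving server can read back from that URL alone. The endpoint `tracker.example`, the parameter names (`uid`, `page`, `event`), and the cookie value are all made up for illustration; real trackers use their own names and send far more fields.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, for illustration only; real trackers use their own.
PIXEL_ENDPOINT = "https://tracker.example/pixel.gif"


def build_pixel_url(cookie_id, page_data):
    """Pack the cookie ID and page details into the query string of an image URL."""
    params = {"uid": cookie_id, **page_data}
    return f"{PIXEL_ENDPOINT}?{urlencode(params)}"


def read_pixel_request(url):
    """What the tracker's server can recover from the image request alone."""
    query = parse_qs(urlsplit(url).query)
    return {key: values[0] for key, values in query.items()}


url = build_pixel_url(
    "abc123",  # made-up cookie value
    {"page": "https://shop.example/cart", "event": "cart_abandoned"},
)
print(url)
print(read_pixel_request(url))
```

The image that comes back does not matter at all; everything the data company wants has already arrived in the request.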

When you visit Facebook, it looks up your cookie and sees if you have been visiting any websites recently, or left any shopping carts un-checked-out. Then it knows what ads to serve you :)

Issue summary: There are two ways your browser can send cookies back to the server:

By sending an HTTP document request (known as an XHR, short for XMLHttpRequest) which usually returns a chunk of text data,
By sending an HTTP image request which usually returns a 1×1 pixel, known as a tracking pixel.

Data companies use the data they have gathered to determine what ads to serve you when you visit sites that load their cookie-setting scripts.

What I’ll be covering next

Next issue: [LMG S6] Issue 75: The Costs of Data Leakage

Notice at this point that ad and data companies are still more concerned with what you are doing, not who you are. That’s right; they don’t gather names, credit card numbers, and the like; that is useless for serving ads!

I kind of did some time travel in the past few issues. One moment, it was 2006 and ad companies were still just serving static images with some request tags and QuantCast had just discovered the power of the cookie. The next moment, there are a gazillion ad companies and a billion ad exchanges all bidding to serve ads before your eyeballs. How did this happen so abruptly?

It didn’t. Not abruptly anyway, but quite rapidly. The costs of data leakage have already been paid, not by us but by the websites. They have been greatly diminished in the value chain, replaced by ad exchanges which have sopped up most of the profit of advertising like a sponge.

Next issue, I’ll describe how this happened.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 73: The Heart of Darkness (Header Bidding)

2020-05-23T08:00:00+08:00

Previously: QuantCast gathers a large amount of data on internet users directly through its cookie (which other publishers serve through their websites), and also by cross-checking it against data which it purchases from other data brokers who gather their information through other means, such as internet activity and credit card transactions.

What exactly does QuantCast do with all this data?

I’ll take a classic ad-infested website as an example: HuffPost. HuffPost may not look ad-infested, but peek under the hood and it will look different to those who know what to look for. If you just take a quick skim through the website’s source code, you will see that almost a third of the website is just javascript loading!

Do a search for <head> and </head>, then for <body> and </body>.

Webpage loading: header, followed by body

The section flanked by <head></head> is the page header. This is the most important section of the page for everyone else besides the reader. When a page is requested by the browser, the HTML code for the entire page is retrieved. But it is not rendered all at once.

The browser starts processing the page header first. It looks at all the file requests: CSS files (for styling the page), fonts (for formatting text), javascript code (for running code to make the page responsive, for loading cookies (Issue 70), etc.). It sends off another round of requests for each of these resources. The rest of the page (flanked by the <body></body> tags) does not start rendering until critical files have been retrieved.
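If you would rather not eyeball the source code, the same inspection can be sketched with Python's built-in HTML parser. This is only a rough illustration: the page below is made up, and a real browser does much more (inline scripts, preloading, fonts pulled in by CSS) than this sketch accounts for.

```python
from html.parser import HTMLParser


class HeadResourceLister(HTMLParser):
    """Collect the files a page header asks the browser to fetch."""

    def __init__(self):
        super().__init__()
        self.in_head = False
        self.resources = []  # (kind, url) pairs in document order

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "head":
            self.in_head = True
        elif self.in_head and tag == "script" and attrs.get("src"):
            self.resources.append(("script", attrs["src"]))
        elif self.in_head and tag == "link" and attrs.get("href"):
            # rel="stylesheet" marks CSS; other rel values are kept as-is
            self.resources.append((attrs.get("rel", "link"), attrs["href"]))

    def handle_endtag(self, tag):
        if tag == "head":
            self.in_head = False


# A made-up page, for illustration only.
PAGE = """
<html><head>
  <title>Example</title>
  <link rel="stylesheet" href="/style.css">
  <script src="https://ads.example/bidder.js"></script>
</head><body><script src="/late.js"></script></body></html>
"""

lister = HeadResourceLister()
lister.feed(PAGE)
print(lister.resources)
```

Only the two files named inside the header are listed; the script in the body is ignored because it is not part of the header.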

Often, the javascript code is considered critical, because some scripts actually change the page body or affect what is loaded. These scripts are therefore placed in the page header and loaded first before the body is rendered.

Normally, on a non-advertising page, the page header is very short: just the page title, some metadata (to tell Google’s bot what the page is about for ranking in searches), some fonts, some CSS, and a bit of javascript to spice up page interactions. That’s it. A fancy photo carousel or other features will involve a bit more code, but still not a whole lot.

How ad bidding works

When advertising comes into the mix, the information flow gets much more complicated. The header loads an ad script that passes the cookie (embedded in the page), along with any other relevant information (type of website, device info, etc¹) to the advertising exchange.

What does an exchange do? It matches this cookie to its huge database of cookies, and then it conducts an auction. “Here’s a user browsing New York Times! *Looks up user in database* Probably a woke young twenty-something, good credit history, into yoga, and health-fad-ish.” So it’s pretty much like a marketplace, but one that you cannot participate directly in. It’s actually automated bidding.

The ad-buyers bid. These bids are not placed on the spot, but in advance (through the advertisers’ dashboards, or through an API). Higher bids win over lower bids, but more relevant bids win over less relevant bids.
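Here is a toy version of that matching and bidding in Python. Everything in it is an assumption made for illustration: the profile database, the advertisers, and especially the scoring rule (bid amount multiplied by the share of targeted interests found in the user's profile). Real exchanges keep their actual formulas to themselves.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Bid:
    advertiser: str
    amount: float        # what the ad-buyer offered in advance
    targets: frozenset   # interests the ad-buyer wants to reach


def relevance(bid, profile):
    """Share of the bid's targeted interests found in the user's profile."""
    if not bid.targets:
        return 0.0
    return len(bid.targets & profile) / len(bid.targets)


def pick_winner(bids, profile):
    """Assumed scoring rule: bid amount weighted by relevance."""
    return max(bids, key=lambda bid: bid.amount * relevance(bid, profile))


# Made-up database of cookie IDs and pre-placed bids.
PROFILES = {"cookie-42": frozenset({"yoga", "health"})}
BIDS = [
    Bid("car-dealer", 3.00, frozenset({"cars", "finance"})),
    Bid("yoga-studio", 2.00, frozenset({"yoga", "health"})),
    Bid("gym-chain", 2.50, frozenset({"health", "sports"})),
]

winner = pick_winner(BIDS, PROFILES["cookie-42"])
print(winner.advertiser)
```

The car dealer placed the highest bid, but none of its targeted interests match this user, so the lower but fully relevant yoga studio bid wins.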

The advertiser’s server sends the winning bid code back to your browser. Then another piece of the advertiser’s javascript code kicks in, sending this code to the advertiser’s content delivery network (CDN).

Yup, online ads need specialised servers to do different things. The ad exchange carries out the bidding and determines the winner (much like a stock exchange); this requires intensive CPU calculations and low latency connections. The CDN, on the other hand, is a global network of servers that keep the content ready to deliver. Servers in the US can get content to US web browsers most quickly, while servers in Southeast Asia are better placed to serve Southeast Asian browsers.

These servers continually talk to each other or to a coordinating server, which determines what content should be on each server depending on the demand from each region. Each regional server caches the most frequently requested ads and cat images in server memory (which is quick to access), leaving the rest on hard disk or solid state storage (which is slower to access).
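The memory-versus-storage split can be sketched in a few lines of Python. Note the swap: this sketch evicts the least recently used item (LRU), which is a common and simple stand-in for "most frequently requested"; I am not claiming any particular CDN works this way. The file names and the capacity of two items are made up.

```python
from collections import OrderedDict


class EdgeCache:
    """Small in-memory cache in front of slower storage (LRU eviction)."""

    def __init__(self, storage, capacity):
        self.storage = storage        # stands in for disk / solid state storage
        self.capacity = capacity
        self.memory = OrderedDict()   # least recently used item comes first
        self.hits = 0
        self.misses = 0

    def get(self, name):
        if name in self.memory:
            self.hits += 1
            self.memory.move_to_end(name)    # mark as recently used
            return self.memory[name]
        self.misses += 1
        content = self.storage[name]         # the slow path
        self.memory[name] = content
        if len(self.memory) > self.capacity:
            self.memory.popitem(last=False)  # evict the least recently used
        return content


disk = {"ad-1.png": b"A", "ad-2.png": b"B", "cat.jpg": b"C"}
cache = EdgeCache(disk, capacity=2)
for name in ["ad-1.png", "cat.jpg", "ad-1.png", "ad-2.png", "ad-1.png"]:
    cache.get(name)
print(cache.hits, cache.misses, list(cache.memory))
```

The popular ad stays in memory the whole time, while the less popular cat image gets pushed back out to slow storage.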

These servers are configured for high bandwidth (to serve as many images as quickly as possible) and with large memory + storage space.

This is what that invisible one-third of the page is doing.

Issue summary: When a page loads advertisements through header bidding, it sends your cookie along with other information to an ad exchange. The ad exchange conducts automated bidding among the ad-buyers, determines the winner(s), and sends the winning code(s) back to your browser. Your browser then sends these codes to the CDN, which sends back the winning ads for your page to render in your browser.

Phew, that’s as briefly as I can describe ad exchanges and CDNs. (One more long-running question answered, yay!) You may or may not be surprised at what is going on at the backend, but often people don’t expect that so much of the internet backend is actually dedicated to just serving ads. But it’s true. The services you have come to rely on: this is the price we pay for them to be “free”.

What I’ll be covering next

Next issue: [LMG S6] Issue 74: The Walls Have Pixels

It gets worse … after ad exchanges came about in the mid-2010s, second-order effects were responsible for much of the data leakage and privacy concerns that hog the headlines of some publications today. I’ll explore a couple of them in the next two issues.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
~~a CDN? [Issue 8]~~
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

¹ It’s hard to know what exactly is going on because the javascript is often obfuscated with all kinds of codes and renaming. Only folks in the industry will be able to tell you what exactly is going on in their backend, and even then they might not be able to tell you what exactly a competitor is doing.

Issue 72: The Data Brokers

2020-05-16T08:00:00+08:00

Previously: In 2006, Quantcast offered complete audience analytics for any site that puts their cookie on the site. In this way, they managed to gather information on a wider audience than they, or any single website, could reach on their own.

I’m about to begin talking about Quantcast’s proposition to advertisers, and how that led to the ad exchange, and what an ad exchange is, but to avoid confusing you, I had better first talk about data brokers and what they are.

The demand for data, and its players

There is a huge market out there for data. In a way reminiscent of the slave trade of the 16th to 19th centuries, in which people were being auctioned and sold and shipped to countries far from their homeland, data today is being sold in data markets, copied to places far from its point of origin, and used to put together profiles of consumers. Who are these data brokers?

Some are sources of information: subscription lists of email addresses to free journals and magazines, (anonymised) credit card activity (how much money spent where by what income bracket), your social media clicks and likes and other activity, your web browsing history, even your mobile device telemetry data (coming from a data-mining app disguised as a mobile game which you unwittingly downloaded). They sell this data to other third parties, or to advertisers directly (rare).

Some are middlemen: third-party brokers who offer a consultancy-like service: they buy information, recompile it into profiles that are more legible to advertisers, and then resell this information.

Some are end-buyers: insurance and other risk-management companies, investigation firms, fraud detection services, … just about any company that may need information on a person or category of consumer.

FastCo has a (non-exhaustive) A–W list of some of these companies, if you’d like a more detailed sampling.

QuantCast, in effect, was acting like a data broker (though it didn’t buy or sell this information; it gathered it directly through its cookie).

The data QuantCast gathers

The end result looks like what a soulless Santa Claus would have managed to gather on its own. A Privacy International journalist sent a Data Subject Access Request to QuantCast for the data it has gathered on her[^1]. By her own analysis, QuantCast has “amassed […] more than 46 columns worth of data including URLs, time stamps, IP addresses, cookies IDs, browser information and much more.” Furthermore, the data she received “suggest that [it was obtained through] data brokers like Acxiom and Oracle, but also MasterCard and credit referencing agencies like Experian.”

Interestingly enough, first name, last name, Social Security identification, and other personally identifying information are hardly collected. Such information is of little interest to advertisers; it is too specific and tells them nothing about whether an ad can be served to you, to extract another click.

[^1]: QuantCast is legally obligated to fulfill such requests under the terms of the GDPR legislation which was implemented in 2018.

Issue summary: QuantCast gathers a large amount of data on internet users directly through its cookie (which other publishers serve through their websites), and also by cross-checking it against data which it purchases from other data brokers who gather their information through other means, such as internet activity and credit card transactions.

That’s … a lot of information, but how does it help advertisers? How does the engine of ad customisation work? All this and more in the next few issues.

What I’ll be covering next

Next issue: [LMG S6] Issue 73: The Heart of Darkness (Header Bidding)

What we have come to know as “targeted advertisements” are known in the advertising industry by other terms, depending on the mode of operation: sponsored search auction, real-time bidding, etc. Generally, they are known as header bidding, because the code tag that triggers it is embedded in the header section of a webpage.

An entire cascade of bidding operations, reminiscent of eBay bidding but entirely automated, is completed in mere milliseconds; the smorgasbord of ads vying for your attention are shaken out, winners emerge, and are served into your browser view while you wait for the page to load.

Stay tuned next issue as we super-slo-mo this process to a speed you can grasp.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 71: The Rise of Audience Analytics

2020-05-09T08:00:00+08:00

Previously: A tracking script retrieves the existing cookie on a web domain if there is one, or sets a cookie on a webpage if there isn’t an existing one. The tracking script sends the cookie information back to the originating server, along with many other fragments of information.

A quick refresher from Issue 68: it is 2006. The market had just recovered and shaken itself out from the dot-com bust, which started at the turn of the century and lasted about two years.

Post-bust, the remaining companies quickly realised that throwing money blindly was not the way. They needed to target audiences more specifically. Google led the charge with their IPO in 2004, demonstrating that targeted search actually brought more users. (They introduced similar ideas to their advertising arm, AdWords.) The race began: Facebook, YouTube, Twitter, and many more. Even the news was going online. And then the iPhone launched in 2007, sparking off the mobile Internet wave, and the rise of mobile apps.

These companies all had the same problem: they only knew what users did on their site, but not what these users did around the Internet. Each company set its own cookie and tracked its own cookie, and managed its own analytics (or an analytics company did it for them).

Then Quantcast got thinking: what if we could get these cookies synchronised? Better yet, what if we could get all these companies to load our tracking script on their sites (Issue 70) and thereby put our cookie on their sites? We would be able to gather cross-site data and build a more complete profile of the audience!
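A tiny Python sketch shows why one shared cookie is so powerful. The cookie IDs, sites, and pages below are made up; the only assumption is that the same tracking script reports back from every site that embeds it, carrying the same cookie ID each time.

```python
from collections import defaultdict

# Made-up reports sent by the same tracking script embedded on different sites.
# Each report carries the third-party cookie ID and the site it was sent from.
reports = [
    ("cookie-42", "news.example", "/politics"),
    ("cookie-77", "news.example", "/sports"),
    ("cookie-42", "shop.example", "/yoga-mats"),
    ("cookie-42", "recipes.example", "/vegan"),
]


def build_profiles(reports):
    """Group page visits by cookie ID, across every site that sent a report."""
    profiles = defaultdict(list)
    for cookie_id, site, path in reports:
        profiles[cookie_id].append(site + path)
    return dict(profiles)


profiles = build_profiles(reports)
print(profiles["cookie-42"])
```

No single site in this example knows that its visitor also reads the other two, but whoever owns the shared cookie does.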

Now, why would these companies agree to that? There has to be some upside for them. The only thing Quantcast had to offer them was the very information it had gathered: in return for putting our cookie on your site, Quantcast would offer you demographic analytics on your site audience, more complete than you could ever hope to build by yourself.

Demographic analytics can help these companies know if their website design and other features are helping them reach their desired target audience. But this alone would not have catapulted Quantcast into the limelight.

The true value of Quantcast’s cookie came when it was coupled to targeted online advertising.

Issue summary: In 2006, Quantcast offered complete audience analytics for any site that puts their cookie on the site. Websites would know more about their audience than they could otherwise gather through their site alone. But Quantcast would make most of their money through their offering to advertisers.

Another very short issue (phew!), that I hope explains how the unification of user data began. It is important to note that nobody was forced into this arrangement, at least not by the usual anti-competitive practices. Quantcast offered a product, companies that hopped on the bandwagon became highly successful at targeting specific audiences, and soon any company not doing that found itself unable to compete in the same space.

What I’ll be covering next

Next issue: [LMG S6] Issue 72: The Data Brokers

QuantCast does not do all its data gathering by itself; it also gets information from other data providers, known as data brokers. Let’s visit them next issue.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 70: The Cookie Factory

2020-05-02T08:00:00+08:00

Previously: Cookies are little fragments of information with a name and a value, and associated with a domain address. They are most commonly used to identify new or returning users. This cookie is issued by a website upon the first visit, stored in the browser, and returned to the issuing server whenever the server requests it.

This issue is a short one, just to put one more piece in place. Last issue, I said that analytics.js loaded a _gid cookie with a value of GA1.2.1807773255.1584140066. At that point, the cookie only existed in my web browser. How did it get sent back to Google Analytics for counting?

Let’s watch what is happening with Chrome DevTools:

Chrome DevTools showing the (filtered) sequence of requests made by the webpage I loaded.
The request made by analytics.js (third-last line) is highlighted in gray. The Initiator column tells us this request was initiated by analytics.js on line 25 of the script.

The full URL of the highlighted request is http://www.google-analytics.com/collect?v=1&_v=j81&a=227860763&t=pageview&_s=1&dl=http%3A%2F%2Fwww.adopsinsider.com%2Fad-serving%2Fhow-does-ad-serving-work%2F&ul=en-us&de=UTF-8&dt=How%20Ad%20Serving%20Works&sd=24-bit&sr=3840x2160&vp=1319x1284&je=0&_u=QACAAAAB~&jid=&gjid=&cid=184706471.1584140066&tid=UA-13115681-1&_gid=1807773255.1584140066&gtm=2wg340NLT927&z=1600454420.

That’s unreadable for humans!

In layman’s terms, analytics.js sends a request to http://www.google-analytics.com (yup, unsecured transmission since it does not use HTTPS) with the following information:

v: 1
_v: j81
a: 227860763
t: pageview
_s: 1
dl: http://www.adopsinsider.com/ad-serving/how-does-ad-serving-work/
ul: en-us
de: UTF-8
dt: How Ad Serving Works
sd: 24-bit
sr: 3840x2160
vp: 1319x1284
je: 0
_u: QACAAAAB~
jid:
gjid:
cid: 184706471.1584140066
tid: UA-13115681-1
_gid: 1807773255.1584140066
gtm: 2wg340NLT927
z: 1600454420

See anything interesting there? Here, let me highlight it for you:

_gid: 1807773255.1584140066

Yup, analytics.js sets a cookie if there isn’t one, or retrieves the existing cookie if there is one. It sends the cookie back to google-analytics.com with your cookie ID, so Google Analytics knows who is visiting the page and can count visitor stats for the webpage.
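If you want to pull a URL like that apart yourself, Python's standard library can do the decoding. This is just a convenience sketch using the exact request URL quoted earlier in this issue; nothing here is specific to Google Analytics, and it works on any URL with a query string.

```python
from urllib.parse import urlsplit, parse_qsl

# The request URL quoted earlier in this issue, split across lines for readability.
URL = (
    "http://www.google-analytics.com/collect?v=1&_v=j81&a=227860763&t=pageview&_s=1"
    "&dl=http%3A%2F%2Fwww.adopsinsider.com%2Fad-serving%2Fhow-does-ad-serving-work%2F"
    "&ul=en-us&de=UTF-8&dt=How%20Ad%20Serving%20Works&sd=24-bit&sr=3840x2160"
    "&vp=1319x1284&je=0&_u=QACAAAAB~&jid=&gjid=&cid=184706471.1584140066"
    "&tid=UA-13115681-1&_gid=1807773255.1584140066&gtm=2wg340NLT927&z=1600454420"
)

# Everything after the "?" is a list of name=value pairs joined by "&".
# keep_blank_values=True keeps empty fields such as jid and gjid.
params = dict(parse_qsl(urlsplit(URL).query, keep_blank_values=True))

for name, value in params.items():
    print(f"{name}: {value}")
```

The decoder also turns the percent-encoded bits (%3A, %2F, %20) back into readable characters, which is why the dl and dt fields come out as a normal address and page title.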

It makes sense for a webpage to embed analytics.js so that Google Analytics can help it count page visits. But why would a webpage allow Facebook and other ad services to put their cookies on a reader’s browser and then send it back to their own servers? Doesn’t that worsen the site experience? What is the benefit to them?

That is the key insight that Quantcast arrived at.

Issue summary: When browsing a webpage, a tracking script retrieves the browser’s existing cookie, if there is one, or sets a cookie for the browser if there isn’t one. The tracking script sends the cookie information back to the originating server, along with many other fragments of information.

Short issue just to close the loop on cookie setting and returning. Enjoy the mental break! :)

What I’ll be covering next

Next issue: [LMG S6] Issue 71: The Rise of Audience Analytics

When it comes to ad networks, there is the How aspect, and the Why aspect. The How aspect is almost hopelessly complicated, an ever-evolving race of advertisers vs ad-blockers, each trying to outdo the other. I will focus less on this aspect, and more on the Why aspect. I think it is more critical to understanding what information advertisers actually extract, and why it does not make any sense for them to want to know your personal details.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 69: The Cookie Monster

2020-04-25T08:00:00+08:00

Previously: The old CPM model (cost per thousand impressions) in the early Internet was replaced by the CPC model (cost per click) after the dot-com bust. But CPC only works well if publishers and advertisers can get users to click; they need to target advertisements accurately to users. QuantCast figured out a way to do so in 2006.

How to do that? The key, it turns out, centres around cookies.

Wait, what’s a cookie?

When you visit any website in Chrome or Firefox, if you click on the icon to the left of the address bar:

Clicking the icon to the left of the address bar shows basic site information

It shows you some basic information, including the cookies loaded by the website.

You can view the content of cookies through that window in Chrome or Vivaldi. This information is also available in other web browsers through a different menu option.

The cookies themselves are just little fragments of information. They are identified with a name, they have a bunch of content (usually gibberish to humans), and they are associated with a website. Above, you can see that this website has a cookie named _gid with a value of GA1.2.1807773255.1584140066.


Little snippets of JavaScript create and delete cookies. These snippets are usually loaded as a script, with a .js file extension. The script code used by Google Analytics is named analytics.js.

What do cookies do?

This cookie was set by analytics.js when the web browser ran the script. It is how Google Analytics identifies users on the website. The value stored in the _gid cookie is the client ID assigned by Google Analytics to identify a unique user.

Many bloggers and website owners rely on Google Analytics to tell them how much internet traffic their website is getting every month, which countries their visitors are from, what time of day they are most active, which search results are bringing these visitors to the site, and so on.

But each visit represents one browser loading the page; how do we know that’s not the same user repeatedly refreshing the page waiting for something to happen? (It happens on auction sites, or game sites, and many other places).

Whenever the webpage is loaded, the cookie information gets sent to the Google Analytics server. That is how Google Analytics knows it’s the same fella on the same browser doing it. The cookie associates each client (Issue 7) with a _gid id. But if the user is using two different web browsers, or using a smartphone browser and then a laptop, that actually gets classified under two different identifiers, even though it’s the same person!
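The counting itself is simple enough to sketch in Python. The log below is made up (including the cookie values), and I am only guessing at the general idea; Google Analytics' real processing is far more involved.

```python
from collections import Counter

# Made-up page-view log: each entry is the _gid value sent with one page load.
page_views = [
    "GA1.2.111.1584140066",  # laptop browser
    "GA1.2.111.1584140066",  # same browser, page refreshed
    "GA1.2.111.1584140066",  # refreshed again
    "GA1.2.222.1584150000",  # same person on a phone: a different cookie
]

views_per_cookie = Counter(page_views)
print(len(page_views))        # total page loads
print(len(views_per_cookie))  # distinct cookies, counted as "unique visitors"
```

Four page loads collapse into two "visitors", even though in this made-up scenario there was really only one person behind all of them.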

Plain cookies are not enough

Before 2006, this wasn’t a big issue. Users mostly browsed the internet on desktop computers, and they seldom used more than one regular device. The famous Intel Core series processors only arrived in 2006, and the first iPhone would not arrive until June 2007.

That meant the average user was using a Pentium-based computer to browse the internet, and that was probably their only internet-enabled device. At most, they had a desktop at home and a laptop at work. If you got a website visit with a user’s cookie, you knew it was not coming from a smartphone or their Amazon Alexa or any other smart device: those did not exist yet. One or two cookie identifiers were enough.

In a year, this would change.

Issue summary: Cookies are little fragments of information with a name and a value, and associated with a domain address. They are most commonly used to identify new or returning users. This cookie is issued by a website upon the first visit, stored in the browser, and returned to the issuing server whenever the server requests it.

Time to dispel some myths: cookies don’t actually contain any information about you. At least, in the context of advertising, what gives you away is not the cookie information. Think of cookies as queue numbers or collection slips that you get when you go shopping. They are impersonal identifiers simply used to ensure that a product gets delivered to the person who actually paid for it.

So what’s actually leaking your information, and helping Facebook know what you bought on Amazon? We’ll get there, patience please. The pieces are not yet in place.

Earlier in this issue, I said

Whenever the webpage is loaded, the cookie information gets sent to the Google Analytics server.

How does this actually happen? In Issue 38, I showed you a graphic from Chrome’s Developer Tools that represented the loading sequence a webpage goes through. With that same feature, we can find out when and how the Google Analytics cookie gets returned to the server.

What I’ll be covering next

Next issue: [LMG S6] Issue 70: The Cookie Factory

We’ve seen how cookies are served; next issue we’ll get a bit closer and see how information from the cookie is returned. And then in the subsequent issue, you’ll understand Quantcast’s genius insight, and how it led to the ad landscape we have today.

Sometime in the future: What is:

booting up? [Issue 15]
~~a cookie? [Issue 8]~~
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 68: The Age of Bloat

2020-04-18T08:00:00+08:00

Previously: Each click on a link, or even an ad, sends data to the server. This information can include an ID for the link you clicked, or the category of ad you clicked. But without Javascript, the webpage can’t know very much about you.

The dot-com bust

Once Javascript was made available … surprisingly little happened on the ad front. Javascript could animate your pages and make buttons that changed image or even changed colour when you clicked them. But it was doing little else for now.

3 years after Javascript was announced, the online advertising industry had achieved revenue of $4.6 billion. It’s hard to imagine that this was largely achieved through banner ads alone … many new companies were being founded, there was lots of capital in the market, and it looked like the Internet was the new growing industry, with stock prices continually soaring beyond what people could imagine.

On March 10, 2000, the NASDAQ Composite index reached its peak, and then it all went downhill from there. It was the dot-com bust which welcomed the 21st century.

The old model of advertising: cost-per-mille (CPM)

Past online publishers (who displayed ads on their sites) primarily used a CPM model of pricing (“cost per mille”, meaning “cost per thousand ads served”). You paid for a certain number of ads to be served on a certain number of pageloads. You could pay more to have your ad served in a more prominent slot, or to have more ads served, and that was it.

You would often have little idea who saw it or who clicked it, and you just sat and waited for the clicks to come through. Sometimes they did, and often they didn’t. It was cheaper than highway banner ads and huge posters on buildings, but it was still expensive.

Recovery and restrategising

As free-flowing cash quickly shrank during the dot-com bust, many companies began to rethink their advertising campaigns. They could no longer just spend freely on banner ads that online users were getting accustomed to. The pop-up ad, invented in 1997, was being blocked by non-Microsoft-owned major browsers (Netscape, Firefox, and Opera) around the time the economy started to recover from the bust. New services were needed, new value needed to be created.

The dot-com low lasted until early 2002, when stock prices finally started to pick up again. Google led the rise with its revamped AdWords.

Google AdWords, revamped after its premature introduction 2 years earlier, offered a CPC model: cost-per-click. You only had to pay if somebody clicked through the ad to your site, not if they ignored the ad.
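The difference between the two pricing models is plain arithmetic, sketched here in Python. The rates and campaign figures are made up purely to show the calculation; they are not historical prices.

```python
def cpm_cost(impressions, rate_per_thousand):
    """CPM: pay for every thousand ads served, clicked or not."""
    return impressions / 1000 * rate_per_thousand


def cpc_cost(clicks, rate_per_click):
    """CPC: pay only for the clicks that came through."""
    return clicks * rate_per_click


# Made-up campaign figures, for illustration only.
impressions, clicks = 50_000, 100
print(cpm_cost(impressions, 2.00))  # 50 thousand impressions at $2 per thousand
print(cpc_cost(clicks, 0.50))       # 100 clicks at $0.50 each
```

Under CPM the advertiser pays the same bill whether 100 people click or nobody does; under CPC the bill follows the clicks.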

This was not a new idea: Yahoo already offered a similar model back in 1998. That was a flop, because Yahoo didn’t know enough about its users to optimise the click-through rate.

Google innovated over the old model in one unique way though.

The new model: cost-per-click (CPC)

Early CPC models literally just counted clicks on a link and invoiced you accordingly. As the number of advertisers buying ads rocketed, the publishers switched to an auction model: highest bidder wins. This model disadvantaged smaller companies, who had much smaller advertising budgets, and could not out-compete the big ad-buyers on price.

Google (back then still a tiny company) saw this and, inspired by its search engine algorithm, introduced one change to it: if an ad with a lower bid got more clicks than ads with higher bids, it could climb the ranking ladder.
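One simple way to picture that change is to rank ads by bid multiplied by click-through rate, rather than by bid alone. The Python sketch below does exactly that, but treat the formula as my simplification and the advertisers and numbers as made up; Google's actual ranking is not described here.

```python
def click_through_rate(ad):
    """Fraction of times the ad was shown that it actually got clicked."""
    return ad["clicks"] / ad["impressions"]


def rank_ads(ads):
    """Order ads by bid x click-through rate instead of by bid alone."""
    return sorted(ads, key=lambda ad: ad["bid"] * click_through_rate(ad), reverse=True)


# Made-up advertisers, for illustration only.
ads = [
    {"name": "big-brand", "bid": 1.00, "clicks": 10, "impressions": 1000},
    {"name": "small-shop", "bid": 0.50, "clicks": 50, "impressions": 1000},
]

ranking = [ad["name"] for ad in rank_ads(ads)]
print(ranking)
```

The small shop bids half as much, but its ad gets clicked five times as often, so it ranks above the big brand.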

Now the race was on to grab every user click, with new services and web media. Facebook launched in 2004, YouTube in 2005, Twitter in 2006.

The search for unified user data

There was just one problem: these companies still didn’t know very much about the market. Every company had a piece of the puzzle. Online publishers knew a bit about their users: what time they visited most often and their approximate locations. But they didn’t know what kind of ads their users wanted, and they had to balance the annoyance their users experienced against the revenue that could be brought in by online advertising.

Ad buyers, on the other hand, mostly knew who their target market was, but had little idea how to reach them. They had to make a guess, or talk to online publishers to see if there was a fit somewhere.

Analytics companies such as comScore and Nielsen quickly saw this need, and started researching demographic behaviours online. But this didn’t work for niche markets, or when data was lacking.

Ad servers (such as DoubleClick, whom you already met in Issue 66) helped to aggregate advertising slots from online publishers. But they were not in a place to gather data on the users; users were not visiting their site. Nor were they in a place to gather the disparate information from online publishers and ad buyers to build coherent profiles of users.

That piece of the puzzle would come later. Konrad Feldman and Paul Sutter had noticed the surge of interest in search advertising after Google’s IPO in 2004, and were working on an interesting puzzle: “How would we get direct data on users of sites that we don’t own?”

They figured it out two years later, and founded a company called QuantCast.

Issue summary: Advertising was sold on a CPM model (cost per thousand impressions) in the early Internet, until the dot-com bust forced companies to reconsider their ad-buying strategy. The CPC model (cost per click) became more popular, but was still not very user-targeted. It would take QuantCast, founded in 2006, to figure out a way to gather data on users and build a coherent profile of each demographic.

What I’ll be covering next

Next issue: [LMG S6] Issue 69: The Cookie Monster

We will take a short detour next week so that I can explain what cookies are, how they came about, and what they do. It’s the linchpin for understanding how modern online advertising works today.

Sometime in the future: What is:

booting up? [Issue 15]
a cookie? [Issue 8]
XSS? [Issue 8]
a CDN? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 67: The Innocent Times

2020-04-11T08:00:00+08:00

Previously: DoubleClick, the first commercially successful ad server, launched in 1996. It ran a system that tracked the performance of banner ads across 30 sites, working to optimise their return on investment. This was made possible by standardisation of the web (thanks to the HTTP specification), and the birth of Javascript, a scripting language integrated into the webpage rather than being a separate module from it. All of this happened in 1995–1996.

The Internet Archive is a 501(c)(3) non-profit that aims to achieve nothing less than a digital library of the Internet and its artifacts. The Wayback Machine is your Google-style portal to the past. This is where you can type in any URL and see how it looked in the past (as long as The Wayback Machine has a saved copy of it from that time).

Advertising in 1996

Back on 22 Oct 1996, Yahoo! already had advertising front and centre, right above its search bar. (Google had not even been founded yet.)

Yahoo! in 1996 already had advertising right above the search bar

The URL of that page was http://www10.yahoo.com:80/, and we can see a few things from that:

HTTP 1.0 had not been fully effected yet. When it was, port 80 would be standardised as the default port for web traffic. Before that happened, though, you sometimes had to specify the port (Issue 33) for your web browser to send the request through.
The Internet was small, but it was big enough for Yahoo! to need more than 1 server to serve its homepage. Yahoo had one domain name, yahoo.com, to route all internet traffic through, but it had to somehow direct this traffic to multiple servers. 1 such server was www.yahoo.com, the others were named www2.yahoo.com, www3.yahoo.com, … you get the idea.
HTTPS was not yet a thing. Privacy was the last thing on people’s minds. Who cares what you were searching for? There wasn’t much on the Internet to implicate people with yet. You couldn’t book hotels or buy stuff online or send a tweet. The Internet was an interesting place, far removed from real life.
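If you would like to pull that URL apart yourself, here is a small sketch using Python’s standard urllib.parse module. The describe helper is my own name for illustration; it simply splits a URL into the three pieces discussed above.

```python
from urllib.parse import urlsplit


def describe(url: str) -> dict:
    """Split a URL into its scheme, host name and (optional) port."""
    parts = urlsplit(url)
    return {"scheme": parts.scheme, "host": parts.hostname, "port": parts.port}


# The 1996 Yahoo! homepage URL from this issue: plain http, server number 10,
# and the port spelled out explicitly.
print(describe("http://www10.yahoo.com:80/"))

# When a URL leaves the port out, there is none to read, so port is None.
print(describe("http://www.yahoo.com/"))
```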

What’s in an ad link?

The URL that the ad points to is http://www.yahoo.com/homet/SpaceID=0/AdID=2754/?http://la.yahoo.com. Why does yahoo.com appear twice? What’s going on?

That link is doing quite a number of things: it is sending an HTTP request to Yahoo’s servers with some information attached:

SpaceID = 0
Website owners categorise ad slots into different “spaces”. The primary, busiest parts of the webpage might have ads categorised as SpaceID 0. Pages with less traffic might have ads categorised as SpaceID 1, and so on. This allows for some limited form of ad targeting, and different pricing tiers: SpaceID 0 would be more expensive, SpaceID 1 less so, and so on.
AdID = 2754
In the table of customers, AdID 2754 would belong to the Yahoo! Los Angeles page.
http://la.yahoo.com
This is the page that users should be redirected to.

Back in 1996, websites like Yahoo! could already track how many times an ad was clicked before redirecting users to the actual page. But it had no way of knowing anything about the user who clicked it. The only information it would have was the user’s IP address (Issue 27).
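The sketch below shows one way a server could read that ad link and count the click before sending the user onwards. It is only a guess at the general idea: the function names and the in-memory counter are hypothetical, and I have no knowledge of how Yahoo!’s servers actually did it.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical in-memory tally of clicks per AdID.
click_counts = Counter()


def parse_ad_link(url: str):
    """Return the key=value fields found in the path, plus the redirect target."""
    parts = urlsplit(url)
    fields = {}
    for segment in parts.path.split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            fields[key] = value
    # Everything after the '?' is the page the user should end up on.
    return fields, parts.query


def handle_click(url: str) -> str:
    """Count the click against its AdID, then return where to redirect the user."""
    fields, destination = parse_ad_link(url)
    click_counts[fields.get("AdID")] += 1
    return destination


link = "http://www.yahoo.com/homet/SpaceID=0/AdID=2754/?http://la.yahoo.com"
print(parse_ad_link(link))
```

Notice that nothing here identifies the person clicking. The server learns which ad and which slot, and that is all.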

You might find it surprising that none of this requires Javascript; in fact, that page doesn’t have a single scrap of Javascript in it!

So what does Javascript do for ads?

Issue summary: Each click on a link, or even an ad, sends data to the server. This information can include an ID for the link you clicked, or the category of ad you clicked. But without Javascript, the webpage can’t know very much about you.

What I’ll be covering next

Next issue: [LMG S6] Issue 68: The Age of Bloat

Still starting slow … because the picture of online advertising is not complete yet

Issue 66: Before the Cloud

2020-04-04T08:00:00+08:00

Previously: Shared memory helps to reduce the amount of memory needed by all the applications running on an operating system. It also allows applications to send data to each other, and to communicate.

Season 5 focused on the vulnerabilities that arise from optimising CPUs for speed. Speed means sharing; the more easily data is made available to the CPU without all kinds of permission checks, the more quickly the processing takes place.

This season, Season 6, I will finally go back to the topic that got me started writing Layman’s Guide to Computing in the first place: online data privacy. But this is a huge, complex topic, and I’ve spent two weeks so far trying to build a timeline of key events, identifying key moments, and chasing interesting connections down deep rabbit-holes. Where do I even start?

Part of the difficulty of getting started is trying to definitively find out when it all started. Today, when you dig into a website’s code, it is mostly a gobbledygook of interacting code, advertising tags, accessibility declarations, and more. A mere 25 years ago, early in 1995, websites were still only static content! How did it turn out like this?

Did it start, maybe, in 1993, when Tim O’Reilly, who had already founded what would later be O’Reilly Media, started the first online information project, Global Network Navigator? (Yahoo! would follow suit with Yahoo! Directory a year later, trying to create the world’s biggest index of websites. By hand.) The site had to be funded somehow, since online commerce had not been born yet (the web was still static content, remember?). The enterprising O’Reilly, taking a page from the huge highway banner ads, sold the first clickable ad to a Silicon Valley law firm. Five months later, Hotwired, a commercial web magazine (which would later be renamed to just WIRED), started doing just that in large quantities.

That seems to be a reasonable starting point … except that was not the same as the online advertising we know today. People emailing each other image files and signing off advertising contracts on paper is not the same as online ad space being sold to the highest bidder within microseconds while your page loads.

The birth of Javascript

No. I think it started in mid-1995, when Netscape hired Brendan Eich to create a scripting language for the web. They already had Java, a language which the web didn’t understand; it had to be compiled (Issue 54) to a Java application (which you might know as a Java applet) and put into its own little box so it wouldn’t hurt the rest of the webpage. But they wanted a scripting language, which could be run directly in the browser without compilation, in real time, as part of the page. In Eich’s words, “The idea was to make something that Web designers, people who may or may not have much programming training, could use to add a little bit of animation or a little bit of smarts to their Web forms and their Web pages.”

Mr Eich created a prototype for the language, Mocha, in 10 days, just in time to be included in Netscape Navigator 2.0 beta 3 when it was released in November that year. Its name had been changed to LiveScript. But in December, when his prototype language was announced to the world by Netscape Communications and Sun Microsystems, it would be known as Javascript.

The same year, Internet Explorer 2.0 was also released to the world. Work on it had also started early that year. Both Netscape Navigator and Internet Explorer were based on very similar codebases: both originated from NCSA’s Mosaic browser, which Eric Bina and Marc Andreessen had begun developing three years earlier, at the end of 1992. (Andreessen would later be best known as co-founder of Andreessen-Horowitz Capital Management.)

By Spring 1996, things were heating up. Before this point, web browsers were only working with HTTP v0.9, a protocol so simple I probably wouldn’t need to laymanise it for you. But a new standard was needed to support all the new things that Web 2.0 was supposed to be able to do. That new standard, HTTP v1.0, was published in 1996. (See Issue 7 if you’re still wondering what HTTP is.)

What else happened in that magical year of 1996?

As if to signal a shift in the zeitgeist, Global Network Navigator was bought by AOL that year; by year-end it was shuttered, its subscribers moved to AOL. Static banner ads would go the way of the dinosaur.
The Internet Advertising Bureau was founded to streamline industry standards and provide legal support—instead of stunting growth through regulation like today, this was meant to help growth by standardising things when most things were non-standard, such as the pixel dimensions of online ads.
Macromedia introduced Flash (Adobe would acquire it later). It would have a good run for 15 years until Apple decided not to support it in their iOS devices, and it would see browser support removed entirely in 2020, less than 25 years after its beginnings.
While Google was the first to successfully monetise putting ads in your search, Yahoo! was the first to put their search engine in an ad. They launched their IPO in April of 1996.

And one more thing. Slow as the internet seemed to be growing, people quickly ran into the limits of static banner ads. You couldn’t do very much on static websites. You couldn’t track clicks, for instance, and you couldn’t quickly deploy different ads to different websites to see which ones did better. To do something like that, you had to work with different websites—talking to them over phone or email(!)—and work out performance metrics and tracking arrangements with them. In an era when it was hard to know precisely how much a TV ad, poster, or radio ad contributed to your campaign’s success, many companies were hoping to change things with an online presence through ads. It was unsurprisingly turning out to be harder than expected.

But right then in 1995, one company figured out how to do just that. Instead of serving their own ads, they decided to run their banner ad system, deployed across 30 sites, and sell ad space to other companies. By early 1996, they decided to launch their business. DoubleClick, an ad server, was born.

They would be acquired just over 10 years later by Google for US$3.1 billion.

Issue summary: DoubleClick, the first commercially successful ad server, launched in 1996. It ran a system that tracked the performance of banner ads across 30 sites, working to optimise their return on investment. This was made possible by standardisation of the web (thanks to the HTTP specification), and the birth of Javascript, a scripting language integrated into the webpage rather than being a separate module from it. All of this happened in 1995–1996.

What I’ll be covering next

Next issue: [LMG S6] Issue 67: The Innocent Times

The more astute by this point could have imagined the portentous future that Javascript would herald. But this was still an age of innocence, still enchanted by the immense untapped potential of the desktop and still-new laptop.

Online advertising already existed even then. Visually, it would look familiar. But at the backend, ads today work very differently from how they did in 1996.

In the next issue, I will try to trace how online ads developed, as the industry changed and grew and shifted, to show you how they became what they are today.
