Layman's Guide to Computing - Season 07

Issue 91: Commercial database alternatives

2020-10-11T11:00:00+08:00

Previously: A URI (Uniform Resource Identifier) is required to connect to a database. This URI can be provided by a hosting service provider that runs your own database for you, or by a cloud service provider that runs your database on their platform.

So you’re running up against the limits of a spreadsheet and want to do more with the data inside it. Databases sound cool and kind of like what you want right now. But writing a whole app and setting up the database yourself, or even getting someone else to do it and checking their work … it all sounds like so much.

What to do?

Airtable

Airtable is a database-like platform that lets you set up Bases (similar to databases), which can contain different tables for your data. You can specify a data type for each column, limit entries to a list of options, and even create lookups (match the value here with a column in another table, and return data from another column in the same row).

Just as databases don’t have a single canonical view, and everything depends on queries, Airtable also lets you create different views of your data. You can set it up as a list, a gallery, a job status board, and filter it as you like.

Interestingly, Airtable also dynamically generates an API for each of your bases, so that apps you create have a way to retrieve, modify, or delete data from the database. That saves you a lot of trouble having to set up your own database, for simple needs.

Smartsheet

Smartsheet is another platform that lets you create sheets with different views. Unlike Airtable, it leans more heavily towards workplace workflows, with built-in task management features and integration with many services. If you are already using one or more of these services, Smartsheet could be a way to store information for collaboration.

Knack

Knack is yet another database-as-a-platform, which also allows you to craft queries to extract the data you need. It has an interesting feature that lets you specify how tables relate to each other (e.g. Contact connects with one Company, Company connects with many contacts) to improve queries.

Knack also lets you create simple apps with limited access to the data, for employee or customer use. If you mainly need internal apps for disseminating or allowing field access to data, this is probably a simpler option than hiring an app programmer/company.

Zoho Creator

Zoho Creator is a database platform that is more focused on app-building (or so it appears). The database just comes bundled as part of the deal. Another option for corporate operations-focused apps.

Issue summary: Depending on what you need a database for, there may be online database platforms that can manage and automate much of the work for you. Airtable, Smartsheet, Knack, and Zoho Creator are just 4 of many options that offer an easier way to set up and input your data, then access them through apps or other means.

The best thing about these cloud services is that you probably don’t need to learn SQL or other advanced query languages to use them. A passing familiarity with spreadsheets, and time to sit down and watch tutorial videos, is probably sufficient to get started.

What I’ll be covering next

Next issue: [LMG S8] Issue 92: All about apps

I’ve spent a whole season talking about data (Season 4, Issue 40 to Issue 52), then detoured to talking about computers, and the internet, and now back to databases. I think that’s plenty of foundation to finally move on to something more familiar: apps.

What exactly are apps and what do they do? What are they like under the surface? What makes them tick?

This and more in Season 8 … which will start after a two-week hiatus. It has been really fun putting finger to keyboard and watching everything come together, but I noticed the quality of recent issues has been sliding more than I’d like. I’m going to take a little break to reconsolidate, recuperate, and think about the next couple of seasons.

See you next issue!

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 90: Using a database

2020-10-03T08:00:00+08:00

Previously: Graph databases treat the details of things as secondary, and optimise for managing the network of relationships. A graph database can quickly look up how things are related to each other, and return the results.

At some point in the past, getting a database meant talking to a consultant or contractor, who would then sit with you to understand your requirements, then set everything up for you without letting you touch any part of it. And that is probably for the benefit of you both. But today, for SMEs with some relevant expertise, it is actually possible to get your own database up and running very quickly.

Setting up a database on a server

If you have admin rights to the workplace server (which can be both a blessing and a curse), you’ll have to find the setup instructions that came with the server software (or Google it online). I’m sorry, it is painful for layfolks (and even for many experienced database admins) and there just isn’t an easier way yet.

Registering a database in the cloud

If you do not have admin rights to the workplace server, you usually ask your friendly server administrator to help you install the database and set up a web admin panel for you. They will give you a URL and login credentials for that web admin panel, and you configure the database through the database section of the admin panel.

If your company has decided to do away with organic IT support, your next bet is to outsource that help from cloud services. Each of the major cloud providers provides multiple database types for your perusal. Some app hosting services will also host a database for you (usually intended for app use, but who’s asking?).

Relational databases:
- Amazon Relational Database Service
- Google Cloud SQL
- Microsoft Azure SQL Database

Document databases (you will see many of them referred to as NoSQL databases):
- Amazon DynamoDB
- Google Cloud Firestore (part of Firebase)
- Microsoft Azure Cosmos DB

Graph databases:
- Amazon Neptune
- Microsoft Azure Cosmos DB also has an API (Issue 4) for graph databases

Getting the database identifier

After you have successfully registered a database (of any type), you will be given a connection URI (Uniform Resource Identifier), which is a fancy way of saying “URL to identify your database uniquely”. It can be a simple line of text, like:

mongodb://mongodb0.example.com:27017

which identifies your database as a mongodb (document) database running on the server at mongodb0.example.com on port 27017. (I covered server hostnames in Issue 29 and port numbers in Issue 33.)

or it can look like:

postgres://myusername:myverylongwindedpasswordwhichisobviouslygeneratedbyacomputerandnotahuman@ec2-52-207-124-89.compute-1.amazonaws.com:5432/d77ila0heea1lk

which identifies your database as a postgres (relational) database running on the server at ec2-52-207-124-89.compute-1.amazonaws.com on port 5432, and your particular database is named d77ila0heea1lk (you can run multiple databases on a single server). The part before the @ sign carries the username and password.
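For the curious, here is a minimal Python sketch of how a program might pull such a URI apart, using the urlsplit function from Python's standard library. The URI below is a made-up example in the same shape as the one above; the hostname db.example.com and the password are assumptions, not a real server.

```python
from urllib.parse import urlsplit

# Hypothetical URI for illustration only (not a real database server).
uri = "postgres://myusername:secretpassword@db.example.com:5432/d77ila0heea1lk"

parts = urlsplit(uri)

database_type = parts.scheme            # the part before "://"
username = parts.username               # the part before the ":" in the credentials
hostname = parts.hostname               # the server the database runs on
port = parts.port                       # the port number, as an integer
database_name = parts.path.lstrip("/")  # the part after the last "/"

print(database_type, username, hostname, port, database_name)
```

Most database modules do this splitting for you, so you rarely need to do it by hand. It is mainly useful when a service asks for the hostname, port, and credentials separately.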

Connecting to a database

This is where it gets a bit trickier.

If you are using another online service that integrates with your database, that service needs to know your URI and its associated information. The service will either ask you for your login/authentication credentials, hostname, and port separately, or ask for it in a single URI, or some mix of the two options.

If you are hiring your own developer (including possibly yourself), you will have to figure out which module you need to connect to the database.

For example, MongoDB in Python (using the pymongo module): MongoClient('mongodb://mongodb0.example.com:27017')

And for PostgreSQL in Python (using the psycopg2 module): psycopg2.connect('postgres://myusername:myverylongwindedpasswordwhichisobviouslygeneratedbyacomputerandnotahuman@ec2-52-207-124-89.compute-1.amazonaws.com:5432/d77ila0heea1lk')

Note: It is considered insecure to simply leave your login credentials in code like that. Please read up on best practices for importing sensitive information from more secure sources in your programming language of choice.
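One common approach, sketched below under my own assumptions, is to keep the URI in an environment variable and have the code read it from there. The variable name DATABASE_URI and the function name get_database_uri are hypothetical choices for this example; your hosting service may use a different name.

```python
import os


def get_database_uri(variable_name="DATABASE_URI"):
    """Read the connection URI from an environment variable (hypothetical name)."""
    uri = os.environ.get(variable_name)
    if uri is None:
        # Fail loudly rather than falling back to a URI written in the code.
        raise RuntimeError(f"Please set the {variable_name} environment variable")
    return uri
```

The code would then call something like MongoClient(get_database_uri()), so the credentials never appear in the file itself. This is only one option; dedicated secret managers are another.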

Issue summary: A URI (Uniform Resource Identifier) is required to connect to a database. This URI can be provided by a hosting service provider that runs your own database for you, or by a cloud service provider that runs your database on their platform.

Once you go through the painful process the first time, it gets easier. A lot of engineering work has been done to make this possible: connect to a database with one identifier. URIs are their own fascinating bit of information engineering, definitely not within the scope of Layman’s Guide. It is something to think about whenever you need to identify everything in your office or warehouse with a unique name (think barcode system or inventory/asset management).

What I’ll be covering next

Next issue: [LMG S7] Issue 91: Commercial database alternatives

What if we don’t want to do all of that? Next issue, to wrap up this season, I’ll give you some alternatives that sit somewhere between a full database solution, and a simple Excel/Google Sheets spreadsheet.

Issue 89: Graph Databases

2020-09-26T08:00:00+08:00

Previously: Document databases organise data into documents, each containing a number of field-value pairs. Each value can itself be a document, and multiple values/documents can be grouped under a field. Document databases do not enforce data consistency across documents, so those rules need to be managed by the application which is using the database. This allows document databases to continue operating even when partitioned, at the cost of some consistency.

In the past two issues, I laid out how relational databases primarily focus on the relations between tables, while document databases primarily focus on organising data into documents. I’ll look at one more application today.

If I’m trying to start a new social media platform today, I would have to store posts and user account data into a database. Which type of database should I use?

I could use a relational database, but joining multiple tables to get a chain of posts, Twitter-style, could get ugly and involve lots of lookups … that is going to be one laggy service at scale!

I could use a document database, but it would involve retrieving each post one at a time, searching to find posts which are linked to it, and then checking which posts are linked to those posts … that is too many searches!

Maybe I’m approaching this wrong. I don’t need to relate many different types of tables or retrieve self-contained documents here. I am actually trying to store a humongous, densely linked network of data—a graph!

What?

Okay, stay with me here, I know you are thinking of a horizontal and a vertical axis, and axis labels and bars and lines and—that’s not the kind of graph I am talking about.

“In mathematics, graph theory is the study of graphs, which are mathematical structures used to model pairwise relations between objects.”
— Graph theory (Wikipedia)

That’s what I’m talking about. And it looks like this:

This network graph shows the co-editing patterns on Wikipedia. The size of the arrows indicates the number of Wikipedia editors for one language edition of Wikipedia, who also edited another language edition.
Source: Wikimedia Commons

Okay, phew.

Graph databases: a network of relationships

So if I’m going to make a social media platform that can retrieve chains of posts, how would a graph database make it easier?

A graph database will still need to have some data for the users and posts:

personA:User {name:"Alice"}
personB:User {name:"Bob"}
...
post001:Post {tags:"...", contents:"..."}
post002:Post {tags:"...", contents:"..."}
...

But the heart of the graph database is the data that stores the relationships between those users and posts:

(personA)-[:SAYS_TO {message:"..."}]->(personB)
(personB)-[:SAYS_TO {message:"..."}]->(personA)
...

If I want to look up a conversation between Alice and Bob, I can search for SAYS_TO relationships with Alice and Bob at either end of the relationship arrow (-->), and sort the results in chronological order.

Graph databases put relationships first

What about posts and comments? For social media, we can treat them as the same type of data (Post), but link them with relationships:

(personA)-[:WROTE {datetime:"..."}]->(post001)
(personB)-[:WROTE {datetime:"..."}]->(post003)
(personC)-[:WROTE {datetime:"..."}]->(post005)
(personD)-[:WROTE {datetime:"..."}]->(post007)
(personA)-[:WROTE {datetime:"..."}]->(post011)
(personB)-[:WROTE {datetime:"..."}]->(post013)
(personA)-[:WROTE {datetime:"..."}]->(post017)
...
(post003)-[:REPLY_TO {datetime:"..."}]->(post001)
(post005)-[:REPLY_TO {datetime:"..."}]->(post003)
(post007)-[:REPLY_TO {datetime:"..."}]->(post003)
(post011)-[:REPLY_TO {datetime:"..."}]->(post005)
(post013)-[:REPLY_TO {datetime:"..."}]->(post011)
(post017)-[:REPLY_TO {datetime:"..."}]->(post013)
...

Because the relationships contain only the bare minimum data for figuring out the network, they are quick to search through. I don’t have to load the names, post tags, post contents, and other irrelevant detail.

Although I would still have to retrieve post001, check for replies, check those replies for replies, and so on, this is much faster with relationships between labels. A graph database optimises for this type of lookup.

Once I have figured out which users and posts are involved in this chain, I can then retrieve their information in a subsequent query. I won’t even need to load all the information at a go, since the app user is not going to see the contents of later posts until they scroll.
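To make the "check for replies, then check those replies for replies" step concrete, here is a rough Python sketch that walks the REPLY_TO relationships from the example above. This is not how a real graph database is implemented, and the function name find_thread is my own; it only illustrates that the walk needs nothing but the relationships.

```python
from collections import deque

# The REPLY_TO relationships from the example, as (reply, original) pairs.
reply_to = [
    ("post003", "post001"),
    ("post005", "post003"),
    ("post007", "post003"),
    ("post011", "post005"),
    ("post013", "post011"),
    ("post017", "post013"),
]


def find_thread(root, relationships):
    """Collect the root post and every post that replies to it, directly or indirectly."""
    # Index the relationships: original post -> list of its direct replies.
    replies = {}
    for reply, original in relationships:
        replies.setdefault(original, []).append(reply)

    thread = []
    queue = deque([root])
    while queue:
        post = queue.popleft()
        thread.append(post)
        queue.extend(replies.get(post, []))  # visit this post's replies next
    return thread


thread = find_thread("post001", reply_to)
print(thread)
```

Notice that the walk never touches names, tags, or contents. Those can be fetched afterwards, only for the posts that turned up in the thread.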

Issue summary: Graph databases treat the details of things as secondary, and optimise for managing the network of relationships. A graph database can quickly look up how things are related to each other, and return the results.

So there you go, three types of databases in three weeks. I picked these three because they’re the least technical to give an overview of (in my opinion), and are three different ways of thinking about data that I think you are likely to encounter.

There are, of course, other types of databases: key-value stores (used heavily in web browsers), wide column databases, search databases (very similar to document-based), … but beyond this point the differences are primarily technical, and not really suitable for this newsletter.

What I’ll be covering next

Next issue: [LMG S7] Issue 90: Using a database

I’ve been cracking my head trying to come up with 2 more topics to round up this season on databases. I suppose most layfolks would (hopefully) never ever have to start or run their own database. But it could be helpful to know what is needed to get a database up and running, and the most common ways of getting access to one. Expect a short issue next week.

Issue 88: Document Databases

2020-09-19T08:00:00+08:00

Previously: Relational databases are designed to maintain a well-structured set of data tables through constraint rules. This makes them very useful for preventing accidental inconsistencies in data, but makes any changes to the data schema difficult to implement. Changing from one schema to another involves downtime and a migration.

One problem I keep running into with Excel is when I think the data has a consistent structure, but halfway through I realise that it actually doesn’t: sometimes I might have two students with different categories of accomplishments, and that requires a big change in the way I design the columns.

Document databases bypass this problem by not enforcing a strict schema on the data. That is not to say you can’t; it is optional and up to you to enforce.

Document databases: a collection of fields and values

When we think of documents, we usually think of Office documents, or PDFs, or things that are … more associated with the way a workplace works.

These documents are not the ones I have in mind when talking about document databases. In these databases, documents are simply bits of data grouped together. Each bit of data is described by a field. For example, I might start out defining a student document this way:

{
  name: "Harry Potter",
  school: "Hogwarts School of Witchcraft and Wizardry",
  characteristics: "lightning-shaped scar on forehead"
}

I can add more fields later, if I wish:

{
  name: "Harry Potter",
  school: "Hogwarts School of Witchcraft and Wizardry",
  characteristics: "lightning-shaped scar on forehead",
  mother: "Lily Potter",
  father: "James Potter",
  ...
}

But what makes document databases truly document-oriented is the way they can be nested. Suppose I want to expand a bit more on this student’s education, to include the years of study. I could expand each entry in the school field to include that:

{
  name: "Harry Potter",
  school: {
    name: "Hogwarts School of Witchcraft and Wizardry",
    start: "1991",
    end: "1997"
  },
  characteristics: "lightning-shaped scar on forehead"
}

Yup, now I’ve just expanded the value of the school field into … another document! This document has a name field, a start field, and an end field. I can embed documents just about any place I want.

I can also group multiple values under a field:

{
  ...
  characteristics: ["wears glasses", "lightning-shaped scar on forehead"]
}

I can also group multiple documents under a field. It’s documents all the way down!

Collections: the only way to organise documents

While relational databases have tables for organising rows, document databases have collections for organising documents.

Each collection can contain multiple documents. There is no constraint on what kind of documents each collection can contain.

I could have a collection for teachers containing only teacher documents, a collection for students containing only student documents, a collection for subjects containing only subject documents, … or I could just have a collection for the department containing a mix of all three types of documents.

What can I do with a document database?

Just about … anything? If you can think of a way to organise the data as documents, you can put it into a document database.

A document database lets you find documents based on their fields. I can look up all documents which have a name field, or check that the word “Harry” is in the name field. I could look for students who enrolled in the year "1991" or later, or more specifically students who enrolled in "Hogwarts School of Witchcraft and Wizardry" in "1991" or later.
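As a rough illustration, the same lookups can be sketched in plain Python, treating each document as a dictionary and the collection as a list. A real document database has its own query syntax; this sketch only shows the idea. The second and third documents, and the helper name school_of, are made up for the example.

```python
# Three documents in one collection. Everything except Harry's details is made up.
collection = [
    {
        "name": "Harry Potter",
        "school": {
            "name": "Hogwarts School of Witchcraft and Wizardry",
            "start": "1991",
            "end": "1997",
        },
    },
    {
        "name": "Hypothetical Student",
        "school": {"name": "Hypothetical Academy", "start": "1989", "end": "1995"},
    },
    {"subject": "Potions"},  # no name field and no school field at all
]


def school_of(document):
    """Return the nested school document, or an empty one if it is missing."""
    school = document.get("school")
    return school if isinstance(school, dict) else {}


# All documents which have a name field.
has_name = [d for d in collection if "name" in d]

# Documents where the word "Harry" is in the name field.
named_harry = [d for d in collection if "Harry" in d.get("name", "")]

# Students who enrolled in "1991" or later (four-digit years compare correctly as text).
enrolled_1991_or_later = [
    d for d in collection if school_of(d).get("start", "") >= "1991"
]

print(len(has_name), len(named_harry), len(enrolled_1991_or_later))
```

Because documents are not forced to share a schema, each lookup has to cope with documents that lack the field altogether, which is why the sketch uses .get() with a default value.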

Drawbacks

Since this is not a relational database, you don’t have the protection of foreign keys and other features that stop you from making the data inconsistent—there’s no concept of enforced consistency here! You’ll have to write those rules into your app when it accesses the document database; the database won’t enforce them for you.

Advantages

Data organised as documents tends to be more self-contained. Since the database does not enforce consistency, it has less to worry about when edits or changes are made to the database. In a distributed document database, we thus sacrifice some consistency, unless we take pains to ensure it in our application code.

This does provide an advantage: when the distributed document database suffers a network outage, causing it to partition into multiple clusters (Issue 86), the database can continue to operate. However, each cluster only has access to its own data, and not data on the other clusters. Over time, each cluster will become less and less consistent, since changes in each cluster are not synchronised to other clusters.

Once the network issue is resolved and the clusters are synchronised again, these changes can subsequently be merged following rules for resolving conflicts. The database remains operational throughout the ordeal, just with some desynchronisation.

Issue summary: Document databases organise data into documents, each containing a number of field-value pairs. Each value can itself be a document, and multiple values/documents can be grouped under a field. Document databases do not enforce data consistency across documents, so those rules need to be managed by the application which is using the database. This allows document databases to continue operating even when partitioned, at the cost of some consistency.

What I’ll be covering next

Next issue: [LMG S7] Issue 89: Graph Databases

Okay, relational and document databases were easy enough. They are more easily mapped to spreadsheets and file/folder hierarchies, respectively.

But now we go up the abstraction ladder, and get to more abstract ideas of data. In a social network, the user profile is usually the least significant part of the account; what often matters most is how this account is linked to other accounts (followers and following). The study of such interlinked objects is known in mathematics as graph theory (nope, not the kind of graphs we are so used to in reports). This is where terms like “social graph”, the representation of your social network on Facebook or Twitter, come from.

What is the most intuitive way to represent, store, and modify this kind of graph data? Using a graph database, of course.

Issue 87: Relational Databases

2020-09-12T08:00:00+08:00

Previously: To increase the performance of a distributed database, we can scale up/scale vertically by increasing the computers’ performance, or scale out/scale horizontally by adding more computers. Distributed databases can only prioritise two of the following three factors: consistency, availability, partition tolerance (CAP theorem).

I’ve already discussed one big strength of relational databases in Issue 84 when I illustrated how the JOIN keyword, one of many SQL commands (Issue 83), can join our data from multiple tables into a single view. This is where we look under the surface to see what makes that possible.

Linking tables through foreign keys

From Issue 84:

To join the Customer and Sales data so that we get the sales data along with custName, we would write a SQL query like this:

SELECT salesID, orderDate, Sales.custID, custName FROM Sales JOIN Customer ON Sales.custID = Customer.custID

Here, Sales.custID refers to the custID of the Sales table, while Customer.custID refers to the custID of the Customer table. This query effectively says “select the salesID, orderDate, and custID columns from the Sales table, and add the custName from the Customer table where the custID column matches”. This returns each sale along with the name of the customer who made it.

Did you catch the fact that there were actually two custID columns? One in the Sales table, and one in the Customer table … by linking two tables like that, we actually introduce a point of potential breakage.

Suppose one day, a customer goes out of business, or changes name, and the corresponding Customer entry gets deleted. Now if we attempt to retrieve Sales to that customer, the query will quietly leave those sales out of the results, because it is unable to find the matching Customer entry.

We can protect ourselves from this kind of error by declaring Sales.custID as a foreign key that references Customer.custID, thus informing the database that every value in Sales.custID must match an entry in Customer. If we attempt to delete that customer again, the database will check whether that entry is referenced by other tables through a foreign key. Entries can only be deleted if they are not referenced by other entries.
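Here is a minimal sketch of that protection in action, using the small SQLite database that ships with Python. The table and column names follow the example above; the customer name and the sale are made up. SQLite is only one relational database among many, and other databases have their own setup steps.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a throwaway database that lives in memory
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked

conn.execute("CREATE TABLE Customer (custID INTEGER PRIMARY KEY, custName TEXT)")
conn.execute(
    """CREATE TABLE Sales (
        salesID INTEGER PRIMARY KEY,
        orderDate TEXT,
        custID INTEGER REFERENCES Customer(custID)  -- the foreign key
    )"""
)

# Made-up data: one customer, and one sale to that customer.
conn.execute("INSERT INTO Customer VALUES (1, 'Hypothetical Trading Co')")
conn.execute("INSERT INTO Sales VALUES (100, '2020-09-12', 1)")

try:
    conn.execute("DELETE FROM Customer WHERE custID = 1")
    delete_blocked = False
except sqlite3.IntegrityError:
    # The database refuses: a Sales entry still references this customer.
    delete_blocked = True

print("Delete blocked:", delete_blocked)
```

The customer can only be deleted after the sales that reference it are deleted or reassigned, which is exactly the kind of constraint that keeps the data consistent and makes later changes harder.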

These and other constraints allow us to protect ourselves from inadvertent harm, but over time, they accumulate and make a relational database very hard to modify. Database administrators will tell you to think about your database tables in advance, as even attempting to add a column or change a column type is going to be a pain in future!

The tradeoff: downtime for database maintenance and migrations

To modify a relational database, we have to shut it down¹, and migrate the database from the old schema to the new schema. In essence, we are exporting our data and re-importing it again. Attempting to migrate while the database is active—known as a live migration—is strongly discouraged, as changing a database while a migration is in progress can introduce data inconsistency; a real headache with constraints!

Relational databases can also develop problems that require them to be shut down and rectified. It’s the tradeoff for having a consistent and structured way to store our data, and automated rules to enforce this structure.

Relational databases: excellent for predictable data needs

If you don’t expect to be changing your database schema often, or if you are able to design the schema to minimise such migrations, relational databases can be quite excellent for your needs. Please consult a professional database engineer if you are planning to use a database for your business needs.

Issue summary: Relational databases are designed to maintain a well-structured set of data tables through constraint rules. This makes them very useful for preventing accidental inconsistencies in data, but makes any changes to the data schema difficult to implement. Changing from one schema to another involves downtime and a migration.

What I’ll be covering next

Next issue: [LMG S7] Issue 88: Document Databases

Relational databases work well for data that we can imagine as an Excel table. But often, we have data that might not share the same set of properties, or might not have a predictable structure (such as online collaboration data). Such data is more intuitively imagined as a set of documents than as a set of tables. What do databases that encourage a document-based model of data look like?

¹ There are ways to avoid this, but I’ll let a real database administrator tell you how to make it happen.

Issue 86: Distributed databases

2020-09-05T08:00:00+08:00

Previously: Forms that naïvely inject user-submitted data into a SQL query template may end up sending valid SQL commands to the database, with disastrous consequences.

So far, we have been assuming that the database runs from a single computer, and all its data is stored on one as well. What happens when it outgrows this single computer?

We could add more disk space, more memory, more cores on the processor; this is called vertical scaling/scaling up (because we are increasing the performance of the computer, which usually feels like pushing up the performance bar on the vertical axis of a graph).

Or we could spread that database over two or more computers. And keep them constantly synchronised. This is called horizontal scaling/scaling out (because we are adding more computers, which is usually depicted as adding more units on a horizontal axis).

We can only take vertical scaling so far; at some point we will have the most powerful server possible and it still won’t be enough. So if we are expecting massive growth, that means we will need a distributed database.

Wait, who actually expects a database to not have to store a lot of information?

There are tiny databases out there!

These are often used in places where the task is not expected to grow beyond a single PC. For example, the database that stores your WhatsApp messages on your mobile phone, or a tiny database that stores records from a remote standalone sensor. These databases are designed to be extremely efficient at handling small amounts of data, to use very little memory, and/or to ensure that data is always written securely.

Okay, fine. Back to distributed databases

Buying more computers to run a server is similar to hiring more employees to do the company’s work. The good: you now have more help. The bad: you now have to talk to them! Regularly!

In distributed databases, there are three factors that are impossible to achieve together in full:

Consistency — reading the same data multiple times should not give us different results
Availability — we should get a response from the database quickly
Partition tolerance — If network disruptions or software/hardware failures break communication, our cluster of servers break up into smaller clusters—they get partitioned. Computers in each subcluster can communicate with each other, but not with computers outside the subcluster. Under such conditions, the system should still continue to operate.

This is known as the CAP theorem: you can only really prioritise two out of the three factors.

Consistency and Availability

The databases we have been examining so far in Season 7 are known as relational databases, which handle data in the form of tables. When implemented as distributed databases, they often prioritise consistency and availability.

How does that work? When our distributed database is being hit with 100,000s of requests per second, more than one computer can handle, we need multiple computers to serve these requests. These computers had better be synchronised (to achieve consistency) so that the request will always return the same response from any of those computers.

One way to achieve this is to have a Single Source of Truth: perhaps we design it so that only one “leader” computer handles edits/changes to the database, which then get sent to all the other “follower” computers. (This design assumes that reading data occurs much more frequently than writing/changing data, which holds up for most use cases.) What happens if the “leader” computer goes down, and our distributed database goes from a leader-follower system to a partitioned bunch of followers? No writes can happen, so the system is no longer operational.

(There are multiple approaches to designing this system so that a new leader is selected automatically or manually, but I won’t go into that here. The fundamental problem of ensuring consistency and availability in such cases remains.)

When a partition happens

So it comes down to this: when communication failure happens in a scenario like the above, we have to choose.

If we need a workaround to ensure that updates on one computer still reach all the computers so that the data is consistent, that is going to be slow: we lose availability.

If we want to achieve availability, we could have each computer just return or update the data it has, then worry about synchronisation later: we lose consistency.
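The choice can be sketched as a toy Python model with two computers. Everything here (the Replica class, the write function, the "balance" example) is made up for illustration, and real distributed databases are far more involved. In this sketch, prioritising consistency is simplified to refusing the write outright while the computers cannot talk to each other.

```python
class Replica:
    """One computer in the cluster, holding its own copy of the data."""

    def __init__(self):
        self.data = {}


def write(local, remote, connected, key, value, prioritise):
    """Try to write via the local replica; return True if the write was accepted."""
    if connected:
        # Normal operation: the change reaches both computers.
        local.data[key] = value
        remote.data[key] = value
        return True
    if prioritise == "consistency":
        # Partitioned: refuse the write so the two copies never disagree.
        return False
    # Partitioned, prioritising availability: accept the write locally
    # and leave synchronisation for later.
    local.data[key] = value
    return True


a, b = Replica(), Replica()
write(a, b, connected=True, key="balance", value=100, prioritise="consistency")

# The network breaks. The same write under each priority:
refused = not write(a, b, connected=False, key="balance", value=80, prioritise="consistency")
accepted = write(a, b, connected=False, key="balance", value=80, prioritise="availability")

print(refused, accepted, a.data, b.data)
```

After the second write is accepted, the two computers hold different values for the same key. That disagreement is the consistency we gave up in exchange for staying available.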

If you find yourself in the position of having to choose a distributed database, it would be immensely helpful to know upfront which 2 factors you want to prioritise!

Examples

Consistency and Availability: Bank databases fall in this category. Financial transactions must be accurate, and people need to quickly know whether they were successful. So we have to live with these databases requiring regular maintenance (usually late at night) to minimise the risk of partitioning failure.

Partitioning and Consistency: Authentication systems are relied upon to ensure that data is only accessed by people who are authorised to do so, and cannot afford to go down for long periods of time. This requires that permissions be properly synchronised across all computers, so consistency is key. These two factors are more important than ensuring a speedy response.

Partitioning and Availability: Essential services, such as Google Maps, have to remain operational even with (recoverable) failures, and still have to respond in a reasonable amount of time (otherwise real-time navigation would fail). Roads do not change often, so it is okay if the info we are getting is slightly out of date; we might occasionally get a slower route or find ourselves at a business whose operating hours are not updated in Google Maps, but these are not critical failures.

The CAP theorem does not say we can never have the third factor! It means we have to pick 2 factors to prioritise, and live with the lowered performance of the third.

Issue summary: To increase the performance of a distributed database, we can scale up/scale vertically by increasing the computers’ performance, or scale out/scale horizontally by adding more computers. Distributed databases can only prioritise two of the following three factors: consistency, availability, partition tolerance (CAP theorem).

This actually ran longer than I expected; the examples were an unplanned addition that I think helps to clarify use cases for each combination.

What I’ll be covering next

Next issue: [LMG S7] Issue 87: Relational Databases

I’ll spend the next 3 issues talking about 3 major types of databases in use today. This isn’t strictly layman content, but I suspect in some non-technical conversations these terms may pop up. More importantly, I think the 3 major types cover 3 different concepts of data, and I hope that elaborating on these in a little bit more detail will help to develop a more nuanced way of thinking about data.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 85: SQL Injections

2020-08-29T08:00:00+08:00

Previously: SQL queries let you join multiple tables based on specified conditions using the JOIN keyword. This enables crafting complex queries to return only the specific data that is required.

SQL databases are really powerful; this is usually a good thing since it allows developers to do amazing things with the data inside. But it can also lead to disastrous consequences in the unsupervised hands of inexperienced developers. And matters can be even worse if these powers are not carefully granted. A malicious actor could “borrow” these powers to wreak havoc on the database!

Relevant xkcd comic

Adding data to an SQL database

Adding data to an SQL database is easy. If our Customer table looks like this (from Issue 84):

The relevant SQL query to add another customer is:

INSERT INTO Customer
VALUES ('Ernest', 'ernest@lmn.com', 57564986);

What could go wrong?

Interacting with an SQL database

The most direct way of managing and interacting with a database is through its command-line tool. Needless to say, this is not how you would want your users using it. It’s just a terrible user experience, and gives them waaaay too much power.

So we usually design a frontend—an app, webpage, or database form—that formats and lays out the data nicely for them, and limits the things they can do to the data. This frontend will usually only allow users to edit or delete existing data, and add new data. Then it constructs an SQL query to be sent to the database. The code to do this might look like the following:

# request holds the submitted form data; sql is an open database cursor
# (both assumed to be set up elsewhere in the app)
custName = request.form['custName']
custEmail = request.form['custEmail']
custContact = request.form['custContact']
# Unsafe on purpose: form values are pasted straight into the SQL text
sql.execute(f"INSERT INTO Customer VALUES ('{custName}', '{custEmail}', {custContact})")

This code naïvely inserts data from the submitted form into the database without any checks. That’s not smart; the contact number might have the wrong number of digits, the email might not even have an ‘@’, and people often type the wrong things in the wrong fields.

What else could go wrong?

SQL Injections: sending SQL commands through an unsecured form

A malicious/clever user might attempt to submit the following form data:

Customer Name: Ernest
Customer Email: ernest@lmn.com
Customer Contact: 10); DROP TABLE Customer--

Why would they do that? When inserted into the template above, the full SQL query becomes:

INSERT INTO Customer VALUES ('Ernest', 'ernest@lmn.com', 10); DROP TABLE Customer--)

Two things to explain:
- The semicolon (;) indicates the end of an SQL query. It is used to write two or more queries in one line.
- The database ignores everything after the --. It is a useful way to add comments to SQL queries (for human consumption) … or to make the database ignore invalid syntax (such as the leftover closing parenthesis), which is what happens in this case.

So the database ends up executing this:

INSERT INTO Customer VALUES ('Ernest', 'ernest@lmn.com', 10);
DROP TABLE Customer

Goodbye, Customer table …
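If you want to watch this happen safely, here is a minimal sketch using Python's built-in sqlite3 module and a throwaway in-memory database. The helper names naive_insert and table_exists are mine. I use executescript() because it runs every statement in the string it is given, which stands in for a database connection that accepts several queries at once; not every database driver allows that. Note that real SQL needs quotes around text values, so the template adds them around the name and email.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway database, nothing real is harmed
conn.execute(
    "CREATE TABLE Customer (custName TEXT, custEmail TEXT, custContact INTEGER)"
)


def naive_insert(name, email, contact):
    # Deliberately unsafe: form values are pasted straight into the SQL text.
    query = f"INSERT INTO Customer VALUES ('{name}', '{email}', {contact})"
    # executescript() runs every statement found in the string.
    conn.executescript(query)
    return query


def table_exists(table_name):
    # sqlite_master is the catalogue of tables that SQLite keeps.
    row = conn.execute(
        "SELECT COUNT(*) FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table_name,),
    ).fetchone()
    return row[0] == 1


naive_insert("Ernest", "ernest@lmn.com", "57564986")
exists_before = table_exists("Customer")

# The malicious contact "number" closes the INSERT and smuggles in a DROP.
sent = naive_insert("Ernest", "ernest@lmn.com", "10); DROP TABLE Customer--")
exists_after = table_exists("Customer")
```

After the second call, exists_after is False: the table is gone, exactly as described above.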

Data leakage through SQL injections

This app is probably going to have some kind of search or filtering feature, where we enter a name to search for and get results that match. If we were searching for a customer named George, an inexperienced developer might send this as the SQL query:

SELECT * FROM Customer WHERE custName = 'George'

If I submit the following in the search box:

Customer Name: George' OR '1'='1

It might get naïvely substituted to form the following query:

SELECT * FROM Customer WHERE custName = 'George' OR '1'='1'

The database will attempt to parse this, and come across custName = 'George' OR '1'='1'. It gets interpreted as “return all results from the Customer table where the custName column matches George, or where '1' equals '1'”.

The second test, '1'='1', is true for every row regardless of the name. Anything OR True evaluates to True, so the database ends up returning … all the rows in Customer.
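Here is the leak as a small runnable sketch, again with Python's built-in sqlite3 module. The sample customers and the function names are invented for illustration. Because real SQL wraps text values in quotes, the injected search text carries its own quotes. For contrast, the second function uses a ? placeholder, one common way (among others) of keeping user input out of the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (custName TEXT, custEmail TEXT)")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?)",
    [
        ("George", "george@lmn.com"),  # made-up sample rows
        ("Ernest", "ernest@lmn.com"),
        ("Hana", "hana@lmn.com"),
    ],
)


def naive_search(name):
    # Unsafe: the search text becomes part of the SQL itself.
    query = f"SELECT custName FROM Customer WHERE custName = '{name}'"
    return [row[0] for row in conn.execute(query)]


def placeholder_search(name):
    # The ? placeholder passes the search text as plain data, never as SQL.
    query = "SELECT custName FROM Customer WHERE custName = ?"
    return [row[0] for row in conn.execute(query, (name,))]


honest = naive_search("George")
leaked = naive_search("George' OR '1'='1")
contained = placeholder_search("George' OR '1'='1")
```

The honest search returns only George. The injected search returns every customer. The placeholder version looks for a customer literally named George' OR '1'='1 and finds nobody.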

Conclusion

If you’re going to use a database with a frontend, get an experienced developer to do it. If all you have are inexperienced developers, send them for the appropriate training. If you don’t have developers, use an established product over an untested one. If in doubt, find someone with the relevant credentials to ask for advice.

Issue summary: Forms that naïvely inject user-submitted data into a SQL query template may end up sending valid (but otherwise unauthorised) SQL commands to the database, with disastrous consequences.

This would have been 3–5 times as long if I had started going into some basic ways to prevent this kind of mistake. Fortunately, this is just a layman’s guide, and I can foist that responsibility off to the rest of the internet.

On a serious note, database security is a whole field of study. If you are using a database for enterprise purposes, please give database security the resources it needs; there are just so many ways that things can go wrong!

What I’ll be covering next

Next issue: [LMG S7] Issue 86: Distributed databases

So far, we have been assuming that the database runs from a single computer, and all its data is stored on one as well. What happens when it outgrows this single computer? Why, it then gets transmitted and infects another computer … just kidding, we then have to spread that database over two or more computers. And keep them constantly synchronised. If that sounds like a pain, you are exactly right! More on this next issue.

Issue 84: JOIN – supercharged VLOOKUP

2020-08-22T08:00:00+08:00

Previously: Structured Query Language (SQL) is a computer language for managing data in databases. It has keywords and keyphrases that let you filter rows and columns, group and order data, perform basic arithmetic on data, and more. It is complex and powerful, but astute and efficient use requires specialised training.

VLOOKUP: The bread-and-butter of spreadsheets

If I have a Customer data table that looks like this:

And a Sales data table that looks like this:

I could add a custName column to the sales table that looks up the custID, and inserts the custName info from the same row. This feature of spreadsheets is known as VLOOKUP (vertical lookup)¹. This is what the formula for each cell in custName would look like:

Let’s break down each part of that formula:

=VLOOKUP(C2,Customer!A:D,2)

This means “in columns A:D of the Customer table, look for the value from cell C2 (which is 1) in the first column of the Customer table, and return the value from the same-row cell in the 2nd column of the Customer table.”
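For readers who find code easier to follow than spreadsheet formulas, here is roughly the same lookup as a Python sketch. The customer rows are made up, and the vlookup function is my own stand-in, not a real spreadsheet API. It always matches exactly, which corresponds to passing FALSE as the fourth argument of the spreadsheet VLOOKUP.

```python
# Hypothetical Customer rows: (custID, custName, custEmail, custContact)
customer_rows = [
    (1, "Alice", "alice@abc.com", 91234567),
    (2, "Bob", "bob@def.com", 98765432),
    (3, "Carol", "carol@ghi.com", 90001111),
]


def vlookup(value, table, column_number):
    """Find the first row whose first column equals value, and return
    the cell in column_number (counting from 1) of that same row."""
    for row in table:
        if row[0] == value:
            return row[column_number - 1]
    return None  # no matching row


# Equivalent in spirit to =VLOOKUP(C2,Customer!A:D,2) when C2 holds 1
cust_name = vlookup(1, customer_rows, 2)
missing = vlookup(99, customer_rows, 2)
```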

What if you needed to insert more than one column? What if you need to “join” two or more tables? Your spreadsheet would soon be filled with VLOOKUP cells, and this really slows down the performance of the spreadsheet. This method is not suitable for data involving millions of rows, for sure.

SQL JOIN: VLOOKUP on steroids

In a database, there is no “standard view” of the data. All data you want to see has to be retrieved with a query. So it makes no sense to require cells filled with VLOOKUPs; we just need to figure out how to do the equivalent in a query. The keyword for that is JOIN.

To join the Customer and Sales data so that we get the sales data along with custName, we would write a SQL query like this:

SELECT salesID, orderDate, Sales.custID, Customer.custName FROM Sales
JOIN Customer ON Sales.custID = Customer.custID

Here, Sales.custID refers to the custID of the Sales table, while Customer.custID refers to the custID of the Customer table. This query effectively says “select the salesID, orderDate, and custID columns from the Sales table, and add the custName from the Customer table where the custID column matches”. This will return:

That is much easier—once you’ve been trained in SQL syntax—than writing separate VLOOKUP formulas for each column you want, and having to maintain a whole table of formulas!
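A JOIN like this can be tried out with Python's built-in sqlite3 module. This is only a sketch: the table contents below are invented, and the tables are trimmed to the columns the query needs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (custID INTEGER PRIMARY KEY, custName TEXT);
    CREATE TABLE Sales (salesID INTEGER PRIMARY KEY, orderDate TEXT, custID INTEGER);
    -- made-up sample data
    INSERT INTO Customer VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO Sales VALUES
        (101, '2020-08-01', 2),
        (102, '2020-08-02', 1),
        (103, '2020-08-03', 2);
""")

# custID exists in both tables, so the query says which one it means.
rows = conn.execute("""
    SELECT salesID, orderDate, Sales.custID, Customer.custName
    FROM Sales
    JOIN Customer ON Sales.custID = Customer.custID
    ORDER BY salesID
""").fetchall()
```

Each sales row comes back with the matching customer name attached, with no lookup formulas to maintain.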

You can even join more than two tables together with a query like:

SELECT salesID, orderDate, Sales.custID, Sales.invoiceID, Customer.custName, Customer.custContact, invoiceDate, invoiceAmt FROM Sales
JOIN Customer ON Sales.custID = Customer.custID
JOIN Invoice ON Sales.invoiceID = Invoice.invoiceID

This is barely scratching the surface of what SQL can do; it has at least 4 types of JOINs, and many more ways of crafting queries to return specifically the data you want.

SQL queries are a whole different way of talking to your computer, and they can be really frustrating to write for people who are new to them. But they are behind many of the interfaces you see, which seem to seamlessly pull data from multiple sources together into a coherent view.

Issue summary: SQL queries let you join multiple tables based on specified conditions using the JOIN keyword. This enables crafting complex queries to return only the specific data that is required.

What I’ll be covering next

Next issue: [LMG S7] Issue 85: SQL injections

Databases are immensely powerful software systems when it comes to searching for information. One recurring challenge that all admins face is ensuring that only authorised use is permitted; how do we prevent malicious activity from being able to access the database?

Next week, I will introduce a common vulnerability that web developers always have to guard against: SQL injection.

¹ There is an equivalent feature for columns known as HLOOKUP (horizontal lookup) that looks up info in a row and inserts data from the same column, but it is not as popular. So the VLOOKUP name is more commonly used for this kind of operation.

Issue 83: Structured Query Language

2020-08-08T08:00:00+08:00

Previously: A database system follows rules that enable multiple users to send commands to the database at the same time. The system attempts to execute each action one at a time, locking data that is in use by other users, and ensuring that each user does not carry out actions that they are not permitted to. Such systems are better able to prevent data corruption compared to a text-based system.

Have you experienced the pain of having really huge tables in your spreadsheet, sometimes spanning more than a hundred columns? Then you might know how painful it can be trying to filter data from it, e.g. if your boss just wants a few columns of info from certain rows. Like if he asks for the performance numbers of employees who are up for promotion.

In a spreadsheet, you would have to apply filters for nextPromoYear to only show the appropriate rows, then you’ll have to hide all the other irrelevant columns. Or you’d just copy all more-than-a-hundred columns for those rows into another new spreadsheet, and manually delete the unnecessary columns.

Database designers don’t want you to do that. You should be able to ask the database to do this querying and filtering for you, and return you only the data you want. But how would that be designed?

Structured Query Language: the universal database language

Structured Query Language (SQL) is another computer language designed to manage data in databases. It reads almost like English, but more logical and less poetic. It has its own syntax and grammar, which are not the same as in English. And sending a proper SQL query to any database that supports it will get you what you want.

Here’s what an SQL query for the above info might look like:

SELECT employeeName, teamName, salesCount, salesTotal FROM SalesData
WHERE nextPromoYear = 2020
ORDER BY teamName, salesTotal;

The SELECT keyword lets you filter only the columns you want FROM a table
The WHERE keyword lets you filter only the rows you want, based on one or more criteria
The ORDER BY keyphrase lets you sort the returned results according to values in one or more columns
The GROUP BY keyphrase (not used above) lets you collapse rows that share a value in a column into one summary row per group, e.g. total sales per team
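These keywords can be tried out with Python's built-in sqlite3 module. The sketch below uses four invented employees. The first query filters and sorts individual rows; the second uses GROUP BY together with SUM to get one summary row per team.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE SalesData (
        employeeName TEXT, teamName TEXT, salesCount INTEGER,
        salesTotal INTEGER, nextPromoYear INTEGER
    )
""")
conn.executemany(
    "INSERT INTO SalesData VALUES (?, ?, ?, ?, ?)",
    [
        ("Ana", "East", 12, 3000, 2020),  # made-up sample rows
        ("Ben", "East", 8, 1500, 2021),
        ("Cai", "West", 20, 5000, 2020),
        ("Dee", "West", 5, 900, 2020),
    ],
)

# SELECT picks the columns, WHERE picks the rows, ORDER BY sorts the result.
per_employee = conn.execute("""
    SELECT employeeName, salesTotal FROM SalesData
    WHERE nextPromoYear = 2020
    ORDER BY salesTotal
""").fetchall()

# GROUP BY collapses the matching rows into one summary row per team.
per_team = conn.execute("""
    SELECT teamName, SUM(salesTotal) FROM SalesData
    WHERE nextPromoYear = 2020
    GROUP BY teamName
    ORDER BY teamName
""").fetchall()
```

Ben is filtered out of both results because his nextPromoYear is 2021.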

A database has no “main view”

One difficulty many people have in “upgrading” from a spreadsheet mindset to a database mindset is that they expect to have a “main spreadsheet” where (almost) all the data lives, and where sub-spreadsheets pull data from. In a database, all data lives in separate tables, and is joined only when a query is executed. The only way to get data from a database is to use queries!

Most websites or software that retrieve data for you end up executing one or more queries like the one above to get that data. And the job of the database software is to interpret such commands, pull the data from the various tables together, collate it correctly, and send it to you.

A database can give you almost exactly what you want

By using these and many other keywords and keyphrases, it is possible to put together a query that gives you only the data you want. SQL has arithmetic functions such as count, average, sum, and it can even return only unique values.

The tradeoff is that you have to learn another language, and use it regularly enough to understand the ins and outs. This is why every big corporation has a data team that can do this!

Issue summary: Structured Query Language (SQL) is a computer language for managing data in databases. It has keywords and keyphrases that let you filter rows and columns, group and order data, perform basic arithmetic on data, and more. It is complex and powerful, but using it in an astute and efficient manner requires specialised training.

What I’ll be covering next

Next issue: [LMG S7] Issue 84: JOIN – supercharged VLOOKUP

I haven’t even touched on SQL’s really powerful features yet. Filtering data from a table is fine, but if my data is spread across many tables, how do I pull that data together? Excel folks have a command they rely on heavily to do this, and it is called VLOOKUP. I’ll show you the SQL version next issue.

Issue 82: Multiplayer databases

2020-08-01T08:00:00+08:00

Previously: Putting all data into one table results in unnecessary duplication of data. Making data atomic by splitting it up into multiple tables makes the data easier to work with, but requires multiple lookups and joins to get the required data. A standard database language, SQL, makes it possible to write queries that are supported by multiple databases.

This issue is going to be a short one, because it is simple enough to explain :)

“The action can’t be completed because the file is open. Close the file and try again.”

This happens in File Explorer because the operating system treats a text file as a single block of data. When a user opens this file, they do not expect data inside to change. To prevent other users inadvertently modifying it, the operating system “locks” the file, preventing any changes—including deletion!

How do we resolve this with a database? In the previous issue (Issue 81), I described the process of making data atomic: breaking it up into the smallest level of detail. By splitting up one huge spreadsheet’s worth of data into multiple tables representing different things, we allow the database to do the heavy work of data processing for us, while we avoid the tedium of repeating the same data row after row (such as author name for different blog posts).

Locking specific data

Now that the data is atomic, the database is better able to figure out which data needs to be locked. If a user is requesting data from a particular table and from certain rows only, we do not need to lock the entire database and prevent other users from accessing it. Such systems are called row-locking systems, and some databases (but not all) support this feature.

Action deconflicting

When multiple users access a database and attempt to write data to it at the same time, the database takes these requests and puts them in a queue, processing them one by one so that no two conflicting actions end up causing the data to be corrupted.

But sometimes, conflicting actions can end up getting queued. For instance, User 1 might send a command to delete a table while User 2 sends a command to retrieve data from that table (because it had not been deleted at the point when User 2 accessed it). User 1’s command gets through first and deletes the table, and when the database reaches User 2’s command, it is no longer able to execute it. What happens then?

Well, that’s when the database throws an error. A database system is able to detect actions whose logic conflicts with other actions. With our previous text-based system, even with the table gone, the program could still continue to search for results and, finding none, return empty data instead of alerting the user.
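The difference can be seen in a small sketch using Python's built-in sqlite3 module. For simplicity, a single connection plays both users, with their commands run in the order the queue would process them. The text-based half is a deliberately naive stand-in for the earlier text-file program.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customer (custName TEXT)")
conn.execute("INSERT INTO Customer VALUES ('George')")

# User 1's command is processed first and removes the table.
conn.execute("DROP TABLE Customer")

# User 2's command is processed next; the database refuses instead of guessing.
try:
    conn.execute("SELECT custName FROM Customer").fetchall()
    database_outcome = "returned rows"
except sqlite3.OperationalError as error:
    database_outcome = f"error: {error}"

# A naive text-based search over data that has vanished just finds nothing.
text_file_contents = ""
text_outcome = [
    line for line in text_file_contents.splitlines() if "George" in line
]
```

The database reports that the table no longer exists, while the text search quietly returns an empty list.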

Issue summary: A database system follows rules that enable multiple users to send commands to the database at the same time. The system attempts to execute each action one at a time, locking data that is in use by other users, and ensuring that each user does not carry out actions that they are not permitted to. Such systems are better able to prevent data corruption compared to a text-based system.

What I’ll be covering next

Next issue: [LMG S7] Issue 83: Structured Query Language

If you have an Excel maven in your workplace who writes VLOOKUPs, INDEX-MATCHs and other chained functions with ease, you will know how spreadsheets can do downright amazing things. But wait till you see Structured Query Language (SQL); it will blow your mind! It almost looks like Excel code, except with fewer nested parentheses, and reads a little (deceptively) more like English. I’ll show you next issue.

Issue 81: Data Normalisation

2020-07-25T08:00:00+08:00

Previously: An index is a separate table containing key terms in the database (usually names, IDs, or some other key identifier), alongside the row numbers where they are found. An index greatly speeds up row lookups, but slows down the writing of new rows.

In this post, I will use CSV format to describe data, although if you have followed this season from the start you would be aware that in a database, this data would not be in text form. Nonetheless, at this point it would be represented similarly.

If we were constructing a database of blog posts from multiple authors of a blog, we might organise the post data like this:

id,author,title,content
1,knowitall,Why the world is falling apart,blahblahblah…
2,knowitall,Make the world great again,blahblahblah…
3,whatsgoingon,Why have things come to this,blahblahblah…

Later, when the authors start to add avatars and other information to their profile, the table might grow:

id,author,avatarURL,about,title,content
1,knowitall,http://avatars.me/knowitall.jpg,I know everything!,Why the world is falling apart,blahblahblah…
2,knowitall,http://avatars.me/knowitall.jpg,I know everything!,Make the world great again,blahblahblah…
3,whatsgoingon,http://avatars.me/whatsgoingon.jpg,Curious about the world,Why have things come to this,blahblahblah…

And when we start to add post tags:

id,author,avatarURL,about,title,content,tags
1,knowitall,http://avatars.me/knowitall.jpg,I know everything!,Why the world is falling apart,blahblahblah…,daily+apocalypse
2,knowitall,http://avatars.me/knowitall.jpg,I know everything!,Make the world great again,blahblahblah…,daily+ambition
3,whatsgoingon,http://avatars.me/whatsgoingon.jpg,Curious about the world,Why have things come to this,blahblahblah…,essay+apocalypse

What problems are we going to run into with a table like this?

Suboptimal table forms

Even with constant-width tables and pre-determined data types, plus speeding up lookups with indexes, we will run into some issues as the number of posts grows:

Duplication of data
The avatar URL and About description (for the author) are repeated in each post. In a real blog, where these are often longer and you might have more information about each author (such as contact info), the amount of duplicated data is simply wasteful.
Difficult data extraction
Notice that the tags are all jammed up into one column. How would we search for all posts with the “apocalypse” tag?
We would have to retrieve each row one by one, split up the tag strings, and check if “apocalypse” is in there … that’s really slow!
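The row-by-row tag search described above can be sketched in Python as follows. To keep it short, I left out the avatarURL, about, and content columns; the function name posts_with_tag is my own.

```python
import csv
import io

# The posts table from above, trimmed to the columns needed here.
posts_csv = """id,author,title,tags
1,knowitall,Why the world is falling apart,daily+apocalypse
2,knowitall,Make the world great again,daily+ambition
3,whatsgoingon,Why have things come to this,essay+apocalypse
"""


def posts_with_tag(csv_text, wanted):
    matches = []
    # Every single row has to be visited ...
    for row in csv.DictReader(io.StringIO(csv_text)):
        # ... and its tag string split apart before we can check it.
        if wanted in row["tags"].split("+"):
            matches.append(row["title"])
    return matches


apocalypse_posts = posts_with_tag(posts_csv, "apocalypse")
```

With three posts this is instant. With millions of posts, splitting every tag string on every search adds up quickly.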

Data normalisation: making data atomic

When data is really complex, it makes sense to split it up and make it atomic. When data is atomic, it means that it has been broken down to the lowest level of detail; typically this would mean individual records that avoid duplication.

For instance, we might have an Author table:

ID,name,avatarURL,about
1,knowitall,http://avatars.me/knowitall.jpg,I know everything!
2,whatsgoingon,http://avatars.me/whatsgoingon.jpg,Curious about the world
…

And a Posts table:

ID,authorID,title,content
1,1,Why the world is falling apart,blahblahblah…
2,1,Make the world great again,blahblahblah…
3,2,Why have things come to this,blahblahblah…
…

What to do with the Tags? Often a database designer will create a Tags table like this:

ID,tag
1,daily
2,apocalypse
3,ambition
4,essay
…

and a PostTags table like this:

postID,tagID
1,1
1,2
2,1
2,3
3,2
3,4
…

This process of splitting up a complex data set into atomic, related data fields is known as data normalisation. A data set that is not normalised will make it difficult to do lookups efficiently as new needs arise later.

Advantages of data normalisation

The first advantage you can see above is that retrieving author-only data, post-only data, etc is now much faster. We don’t have to pull up a whole lot of other unrelated information, incurring unnecessary data transfer overhead.

The second advantage you can see is that entities—authors, posts, tags—are now referred to by an ID. An ID is usually a number, which is represented more compactly in a computer in binary form as compared to a name or title in text form (Issue 79). This allows our program to carry out any processing on relationships between these entities much more quickly (e.g. “how many posts does this author have?” “How many posts have this tag?”), with lower data transfer overhead.

Disadvantages: greater complexity

The disadvantage is that pulling data together to render a blog post on a webpage now involves looking up three different tables and joining the data together. Each query is going to involve multiple lookups and joins, and is going to require many lines of code … if each programming language is going to come up with its own way of writing these lookups and joins, and each new database format also comes up with its own commands, very soon we would have a huge unmaintainable mess of syntax and commands to learn!

So programmers and database designers came together and came up with a new language to do lookups and joins: Structured Query Language, or SQL. This is the reason why today you can write SQL queries that will work on a Microsoft SQL (MSSQL), PostgreSQL, MySQL, or MariaDB database; they all support SQL!
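As a preview, here is a sketch of the normalised tables from this issue loaded into an in-memory database with Python's built-in sqlite3 module. The avatar, about, and content columns are left out for brevity. One query finds every post tagged “apocalypse” along with its author; the other counts posts per author.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Author (ID INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE Posts (ID INTEGER PRIMARY KEY, authorID INTEGER, title TEXT);
    CREATE TABLE Tags (ID INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE PostTags (postID INTEGER, tagID INTEGER);

    INSERT INTO Author VALUES (1, 'knowitall'), (2, 'whatsgoingon');
    INSERT INTO Posts VALUES
        (1, 1, 'Why the world is falling apart'),
        (2, 1, 'Make the world great again'),
        (3, 2, 'Why have things come to this');
    INSERT INTO Tags VALUES
        (1, 'daily'), (2, 'apocalypse'), (3, 'ambition'), (4, 'essay');
    INSERT INTO PostTags VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (3, 4);
""")

# Follow the IDs from Posts to Author, and from Posts through PostTags to Tags.
tagged = conn.execute("""
    SELECT Posts.title, Author.name
    FROM Posts
    JOIN Author ON Posts.authorID = Author.ID
    JOIN PostTags ON PostTags.postID = Posts.ID
    JOIN Tags ON Tags.ID = PostTags.tagID
    WHERE Tags.tag = 'apocalypse'
    ORDER BY Posts.ID
""").fetchall()

# "How many posts does this author have?" only needs the compact IDs.
post_counts = conn.execute("""
    SELECT authorID, COUNT(*) FROM Posts
    GROUP BY authorID
    ORDER BY authorID
""").fetchall()
```

No tag strings get split here; the database simply matches numeric IDs across the four tables.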

Issue summary: Putting all data into one table results in unnecessary duplication of data. Making data atomic by splitting it up into multiple tables makes the data easier to work with, but requires multiple lookups and joins to get the required data. A standard database language, SQL, makes it possible to write queries that are supported by multiple databases.

I am jumping ahead of myself a little here; I’ll only talk about SQL a couple of issues later. Before I go into what SQL does, there are two features our program does not yet support: allowing multiple users to read and write data, and setting access permissions on data.

What I’ll be covering next

Next issue: [LMG S7] Issue 82: Multiplayer databases

“The action can’t be completed because the file is open. Close the file and try again.”

How often have you run into this error on Windows?

This makes it difficult for multiple users to work on a file at the same time. How do databases work around this? Find out in the next issue!

Issue 80: Indexing

2020-07-18T08:00:00+08:00

Previously: Comma-separated value (CSV) files store all data in text form. Within each row, a separator divides each chunk of data, and rows are separated by a line delimiter. To keep the data compact and read it more quickly, we have to decide beforehand what data type each chunk should be, and how much space it is allowed to take up. Such a data form can no longer be opened in a simple text editor program like Notepad.

Last issue, we were still looking at how to speed up a text-based data storage solution. When we finished, we had a program that could skip the process of reading every single line and counting line breaks, but it could no longer be opened in Notepad. (That’s not a big loss really; Notepad can’t really handle text files larger than 0.5–1 GB anyway …)

No matter, so long as our program does not need to read in everything at a go!

The search problem

Right now, our data is still stored in a huge, continuous text block. Retrieving information from this block is easy if you already know the row number you want; our data program can quickly calculate the required row and jump to its starting byte.
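Here is a minimal sketch of that calculation in Python. The 16-byte row width and the sample names are assumptions picked for illustration, and an in-memory BytesIO object stands in for a file on disk.

```python
import io

ROW_WIDTH = 16  # assumed width in bytes of every row, chosen for this sketch

names = ["George", "Ernest", "Hana"]  # made-up sample data
# Pad each name with spaces so that every row takes up exactly ROW_WIDTH bytes.
block = b"".join(name.encode("ascii").ljust(ROW_WIDTH) for name in names)
storage = io.BytesIO(block)  # stands in for a file on disk


def read_row(row_number):
    # No scanning or counting line breaks: the starting byte is plain arithmetic.
    start_byte = row_number * ROW_WIDTH
    storage.seek(start_byte)
    return storage.read(ROW_WIDTH).decode("ascii").rstrip()


first_row = read_row(0)
third_row = read_row(2)
```

Reading the third row jumps straight to byte 32 without touching the first two rows.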

Most if not all of the time, you would have no idea which row to retrieve, although you might know something to tell you what data to look for—a name, a date of birth, etc. You would need to search for this row. And blocks are just not really optimised for such operations.

Nonetheless, this is not a new challenge. Paper books are often dense and long, especially textbooks. If you wanted to find a passage in there to quote, you would not be flipping through more than 800 pages and scanning paragraph by paragraph just to find it again! You would just flip to the index, look up the term you were hoping to find, and simply check those page references.

Why not do that here?

Indexes

To create an index, we would need to create another block of data. This data block would contain select pieces of data from our table (names, dates, or other key fields), along with the corresponding row number(s) where they are found.

Yes, that would take up more space, but it would speed up the search immensely, and that is often a worthy tradeoff. This index would be stored together with the table in our database. When the database is opened, this index would be read into memory, because accessing memory is much faster than accessing physical storage (Issue 57). Our database would use it to look up the row number of the record containing the name we want, and retrieve it with the row number much more quickly than a row-by-row lookup could.

Tradeoffs

You can see how an index would greatly speed up searches, which do not modify the database. But what if we need to store data?

Each row we add to the database would necessitate updating the index. Instead of updating one table with our original database format, we now have two tables to update; that is definitely slower. You would not want to include an index for tables that are often written to.
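Both sides of this tradeoff fit in a small Python sketch. It is a toy: the table is just a list of names, the index is a dictionary from name to row numbers, and the function names are my own. Real databases use more sophisticated index structures.

```python
# Hypothetical name column of a table, one entry per row.
rows = ["George", "Ernest", "Hana", "Ernest"]


def build_index(column):
    # Map each name to the list of row numbers where it appears.
    index = {}
    for row_number, name in enumerate(column):
        index.setdefault(name, []).append(row_number)
    return index


name_index = build_index(rows)


def find_rows(name):
    # One dictionary lookup instead of scanning every row.
    return name_index.get(name, [])


def add_row(name):
    # Every write now touches two structures: the table and the index.
    rows.append(name)
    name_index.setdefault(name, []).append(len(rows) - 1)


ernest_rows = find_rows("Ernest")
add_row("Hana")
hana_rows = find_rows("Hana")
```

The search never looks at rows that do not match, but add_row has to do its work twice.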

Now that creates a conundrum for us: if I have customer records, should I add an index, or not? I would often have to search these records for a customer’s information, but I would also be adding to this information often. So it looks like indexes would greatly speed up the lookup, but slow down the adding of records.

I’ll examine this issue next week with data normalisation.

Issue summary: An index is a separate table containing key terms in the database (usually names, IDs, or some other key identifier), alongside the row numbers where they are found. An index greatly speeds up row lookups, but slows down the writing of new rows.

What I’ll be covering next

Next issue: [LMG S7] Issue 81: Data Normalisation

In a spreadsheet, we sometimes love to split a page into multiple tables, with lovely table labels and such. With our database now optimised for fast access with constant-width rows and specific data types, we can no longer do that.

How should we organise our data then? More on this in the next issue.

Sometime in the future: What is:

booting up? [Issue 15]
XSS? [Issue 8]
a good reason developers write code and give it away for free online? [Issue 21]
firmware? [Issue 34]
OpenType? And what are fonts anyway? [Issue 42]
What is involved in installing a piece of software? [Issue 48]
How do apps know where a file starts and ends? [Issue 49]
What is a password hash? [Issue 63]

Issue 79: A Base for Data

2020-07-11T08:00:00+08:00

Previously: Modern webpages rely on many third-party resources for their functionality. Blocking access to some domains may cause these webpages to break and stop working.

We start a new season this issue, and now I circle back to the theme of data again. In Season 4, I laid out the broad categories of data, and showed how these basic data types get put together into more complex containers, such as video and documents. Let’s take it one step further.

base

noun
the bottom of something considered as its support: [foundation]

Whatever you do, I doubt your entire life can fit into a single document. Heck, even your household data, your work data … whatever it is, it is probably too complex and varied to fit into a single container. So many numbers and paragraphs, related in some ways, all interconnected … how to make sense of it?

We need a foundation for all this data, a base on which we can build our lives and our worlds. We need databases.

Let’s start from square 1: plain text.

Text files and CSV

Starting simple, let’s try to put our data into a text file. Inside the computer, a text file is just a long string of text:

This is the first line\nThis is the second line\nThis is the third line\n…

That \n? That is the newline character (Issue 41), a non-printing code that tells the computer “the subsequent parts go on a new line”. Without it, everything would just be one long, continuous string. The newline character determines the limits of each line; it delimits the line. \n, the newline character, is therefore a line delimiter.

Not all our data is just a single line like that. In spreadsheets, for example, we want multiple data types in the same row. How do we get the computer to understand that these data are not one big bundle, but separate pieces? We need a separator. Commonly, commas are used to separate data, like this:

5,bubbleSort,1.734122735215351e-06
5,insertionSort,1.0771698807366193e-06
5,mergeSort,5.6086346949450675e-06
5,quickSort,4.135697910096496e-06

That’s some data I was compiling to compare different sorting methods. The first piece of data, a number, represents how many numbers I was sorting. The second piece of data is the sorting method, and the third piece of data is the time taken. The three pieces are separated by commas.

This format is known as comma-separated values, and referred to by the acronym CSV.
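As a small illustration, Python's built-in csv module can split such rows back into their pieces. This is only a sketch: the variable names are my own, and the data is held in a string here instead of a file on disk.

```python
import csv
import io

# The sample rows from above, held in memory as one string.
text = (
    "5,bubbleSort,1.734122735215351e-06\n"
    "5,insertionSort,1.0771698807366193e-06\n"
    "5,mergeSort,5.6086346949450675e-06\n"
    "5,quickSort,4.135697910096496e-06\n"
)

rows = []
for count, method, time_taken in csv.reader(io.StringIO(text)):
    # Everything in a CSV file is text, so we convert each piece ourselves.
    rows.append((int(count), method, float(time_taken)))

print([row[1] for row in rows])
# ['bubbleSort', 'insertionSort', 'mergeSort', 'quickSort']
```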

Searching through data in CSV

In Issue 41, I mentioned that each character takes up a standard number of bytes (1 byte, in the case of the characters on your keyboard; anything outside of that, it’s complicated). That makes it easy for the computer to retrieve characters. First character, first byte. 100th character, 100th byte.

What about the 5th row? Which byte is that?

Now the computer has to start reading from byte 1, count the newlines (\n) as it goes, and only after the 4th newline does it know “this is the fifth line”. That works for a small amount of data, perhaps even for a household, but for businesses with thousands of customers and millions, even billions, of lines of data, this is unworkable. What can we do about this?
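The counting process can be sketched like this. The function name find_row_start is hypothetical, and note that Python counts byte positions from 0, where the text above counts from 1.

```python
def find_row_start(data: bytes, row_number: int) -> int:
    """Return the byte offset (from 0) where the given row (counted from 1) starts."""
    offset = 0
    newlines_seen = 0
    # Read byte by byte until enough newlines have gone past.
    while newlines_seen < row_number - 1:
        if offset >= len(data):
            raise IndexError("the data has fewer rows than that")
        if data[offset:offset + 1] == b"\n":
            newlines_seen += 1
        offset += 1
    return offset

data = (
    b"5,bubbleSort,1.734122735215351e-06\n"
    b"5,insertionSort,1.0771698807366193e-06\n"
    b"5,mergeSort,5.6086346949450675e-06\n"
)
start = find_row_start(data, 3)
print(data[start:].split(b"\n")[0])  # b'5,mergeSort,5.6086346949450675e-06'
```

Every byte before the wanted row has to be examined, so the further down the row is, the longer the search takes.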

If you recognise the themes that have been recurring so far, you probably know it subconsciously: we need standardisation.

If we could decide beforehand (a big IF, but possible) how many data pieces each row should have, and the largest number of bytes each data piece will take up, things would be much easier. If each row has only 3 pieces of data, and each piece of data is allotted 8 bytes (64 bits), then each row takes up 3 × 8 = 24 bytes. The first 4 rows occupy 4 × 24 = 96 bytes, so the 5th row starts from byte 97.

This process is much faster for a computer. It does not need to read every single byte and count newlines anymore; it can just jump to the position of byte 97 and start reading from there.

Reducing data size

One more problem to resolve: 96 bytes for 4 rows is a lot of data! A chunk of data in text form, such as “$1,234.56”, is 9 characters, which means 9 bytes. If we could somehow standardise this data type (say, let’s just call it currency), and reduce it to just the number form 1234.56, we could store it in just 4 bytes, for example as a whole number of cents! That’s far fewer bytes to retrieve, store, and transfer.

The tradeoff is that now we can no longer just open that file in Notepad to peek at the data. We would need a program that:

remembers how many bytes each row and piece of data should take up,
remembers what type each piece of data is.

This program will figure out where to start reading the file from, retrieve the data we want, and return it. Compared to managing data in CSV, the data will be more compact, and the program will be faster. And we would have taken one step away from plain text files, towards a full database.
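A program of that kind might look roughly like the sketch below, which uses Python's standard struct module. The row layout (a 4-byte whole number, a 16-byte name, an 8-byte decimal number) and the function names are assumptions chosen for this example, not a description of any real database's file format.

```python
import io
import struct

# Assumed layout for one row: 4-byte unsigned integer, 16-byte name, 8-byte float.
# "<" means no padding between pieces, so every row has exactly the same width.
ROW = struct.Struct("<I16sd")

def write_rows(f, rows):
    for count, method, time_taken in rows:
        # struct pads short names with zero bytes up to 16 bytes.
        f.write(ROW.pack(count, method.encode("ascii"), time_taken))

def read_row(f, row_number):
    """Jump straight to a row (counted from 1) without reading the rows before it."""
    f.seek((row_number - 1) * ROW.size)
    count, method, time_taken = ROW.unpack(f.read(ROW.size))
    return count, method.rstrip(b"\x00").decode("ascii"), time_taken

rows = [
    (5, "bubbleSort", 1.734122735215351e-06),
    (5, "insertionSort", 1.0771698807366193e-06),
    (5, "mergeSort", 5.6086346949450675e-06),
    (5, "quickSort", 4.135697910096496e-06),
]

f = io.BytesIO()  # stands in for a file on disk
write_rows(f, rows)
print(ROW.size)           # 28 bytes per row with this particular layout
print(read_row(f, 3)[1])  # mergeSort
```

The program has to remember both the row width (ROW.size) and the type of each piece (the format string), which is exactly the bookkeeping listed above. The bytes it writes are no longer readable in a text editor.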

Issue summary: Comma-separated value (CSV) files store all data in text form. Within each row, a separator divides each chunk of data, and rows are separated by a line delimiter. To keep the data compact and read it more quickly, we have to decide beforehand what data type each chunk should be, and how much space it is allowed to take up. Such a data form can no longer be opened in a simple text editor program like Notepad.

For tech junkies and programmers, it is easy to fall into the blind pursuit of performance. I wanted this issue to start right, by demonstrating the tradeoffs involved in increasing performance. We went from a data format so simple it can be opened in Notepad and read by a human, to a format that needs a program to read.

At least this program is simple to write; I could do it in less than fifty lines of Python code. Let’s look at more tradeoffs in the next issue.

What I’ll be covering next

Next issue: [LMG S7] Issue 80: From Blocks to Trees

We are so used to seeing data in a single blob—as a dense spreadsheet table, as densely packed lines of text, etc—that it is difficult to see it as a loosely organised tree structure.

But in our daily lives, that is much more commonly the way data is organised. Data in organisations is never all put in a single document or place; it is loosely spread across departments, each of which manages a portion of it, and these departments send information updates to each other to update their separate sections.

In the next issue, I’ll apply this idea to the way computers manage information.
