Personal Systems: Joint Cognition

Today, human-agent interaction or collaboration is largely limited to the human user sending a request, and the agent sending a response. The response is usually text; sometimes it contains tool calls, thinking blocks, or other specially formatted regions. The shape however is unmistakeable: this is still the request-response protocol that the world of networking is familiar with.

Richer forms of interaction

If you’ve ever tried to draft a document with an agent, you would have felt the friction inherent in trying to comment on the agent’s proposal or reply. This is an old, old problem, and humans simply worked with its limitations. With the advent of HTML and internet forums, we developed some conventions for quoting chunks of a reply and adding our own response. And with the advent of collaborative document editing, such as through Google Docs, we developed other conventions, such as highlighting lines and adding comments.

Attempting these newer interaction protocols with existing agents, however, is a shot in the dark. Most agent harnesses don’t support this, and you would have to find an online platform that does (and hope it survives long enough), or find an MCP server or tool that does.

The shape of human-agent interaction

And suppose you do find this tool. In all likelihood, it gives your agent tools for:

  • reading the document, or chunks of it
  • editing the document, by adding content at the end, or with find-and-replace semantics
  • annotating the document

The agent makes a change, the agent reads the document and verifies the change. Sometimes the agent attempts to edit—but oops, someone made a change in the meantime, so their find-and-replace didn’t work. Sometimes a new document was created; the agent won’t know unless someone tells it, or it happens to make a tool call for listing documents and discovers it.

This is the default state of most workplaces, so it is no surprise that it also ends up looking like the default starting state of human-agent collaboration.

But we can definitely do better.

Joint cognition

While the term joint cognition has picked up different meanings depending on who you read, here I’m looking at a few key features:

State transparency

All users working on a file/document/project see its latest state at all times. Changes do not happen silently when other users are viewing.

As much information is captured in the system as possible, instead of remaining in private agent or human memory.

Simultaneous thinking

The default request-response loop forces cognition to alternate between human and agent. First the human thinks and the agent waits, then the agent thinks and the human waits. To enable richer forms of interaction and assistance, like we are already seeing with code assistants and autocomplete systems, we need to enable humans and agents to think together.

In its simplest form, this looks like interleaved thinking: the human’s actions are streamed to the agent as text/audio/video chunks, enabling it to recognize appropriate moments to respond, while the agent’s response is likewise streamed, allowing the human to interrupt mid-thought or provide additional information/context or add another request without waiting for the agent to finish its response.

We can go yet further: after the agent writes its proposal document, can a human read and annotate as an agent spectates, providing specific clarifications or misunderstandings while aware that more comments are forthcoming? Can the agent highlight relevant parts of the document as the human is reading and typing?

Shared environment

A human working in their own terminal, and the agent working in theirs, diverge very quickly: the human has a variable in their environment that’s not in the agent’s, and vice-versa. The tools of software engineering help to make environments more repeatable and representable, but do not solve the problem.

If you are a vibecoder or use agent-assisted coding in your own projects, think about the last time you actually knew what was going on in the agent’s shell. Could you tell why certain commands that worked on your terminal failed in theirs? And suppose you did know why: were you able to “reach in” to their terminal and resolve the problem for the agent before it continued? Or did you have to “tell it” (by sending a request) and hope it understood how to resolve the issue itself?

Joint control

A shared environment enables both human and agent to inspect the same entity, to literally be on the same page. This is technically difficult, because most computer and software systems are not engineered for simultaneous broadcast: for two entities to see the same state of a third entity at all times, all changes by one must be broadcast to the others. This is a well known messaging problem that has its own name: the publish-subscribe pattern. Designing this for humans and agents within a system requires some engineering work that most harnesses do not yet provide.

Agent limitations

We have some of these tools for collaboration between humans. But these don’t quite work for agents, because the shape of agent cognition is different.

Most agents generate a continuous token stream, terminating only when they encounter a stop token. Tool calls and their results accumulate, previous thinking tokens remain, and eventually the agent hits a context length limit.

Many agent harnesses are gradually developing context compression mechanisms, ranging from a crude /compact command, to recency/relevance mechanisms that discard old tool call results or file reads, to progressive compression. Even with these improvements, it is obvious that agents don’t “see” or “act” like us. We edit a document by placing a cursor and typing text; agents currently use find-and-replace or rewrite semantics. Adding a single word is easy for us, less easy for an agent. Seeing the latest version of a document, or knowing when it changed, is easy for us (if the text editor supports real-time updates), near-impossible for an agent. If you modified a file after an agent touched it, it would have no knowledge until its next file edit attempt fails and it guesses at what happened.

Open research questions

  1. What kind of design would enable an agent to see the same context as humans?
  2. How do we design toolsets that are low-friction for an agent, yet easy for humans to understand how they are used?
  3. What kind of control and handover protocols will we need to enable control of context to pass between agents and humans as required?