The Two-Board Problem: Training Environment for Research Agents

Novel theories or conceptual extensions of mathematics are core to human progress, but are uniquely challenging to construct. The environment, be it physical reality or known formal system, provides the agent with limited observations, which it uses in an attempt to assemble a predictive model or policy.

Asymmetry manifests differently: from the limited sampling to the structural properties of the problem itself. The latter cases involve: Complex numbers discovery while observing only Real values, description of continuity to explain the behaviour of countable objects and other historical examples. Those are both the motivation and the main focus of the Two-Board Problem.

I designed the framework that captures the main structural properties of such a problem class and allows expressing them in an MDP suitable for benchmarking the ML approaches or as a source of data for deep learning methods. This article focuses on the explanation of the problem and its design choices, while the implementation example is available on Github.

The Two-Board Problem, as it is probably evident by its name, consists of two boards available for the agent. The first board, called the Real board, provides the agent with feedback about its solutions, and the operations defined by the formal grammar. The canonical case involves real numbers operations—it is selected due to the fact that physical measurements are expressed in those.

The second board, called Imaginary, is of more interest to us. It is not governed by any grammar, and allows writing arbitrary strings, e.g. from UTF-8. It acts as an optional scratchpad for the agent.

The agent can execute any operations defined by the grammar of the Real board to transform its contents and get the reward. As an example of the task we can take finding the roots of polynomials, or solving any other equation—find an argument, substitute it, achieve equality $\to$ get rewarded.

On the Imaginary board, the agent can write any arbitrary strings, which may help it to solve the equation. Strings from the Imaginary board can also be substituted into the Real board, provided they satisfy the Real board’s formal grammar.

From here on, I will refer to the Real board as $R$ and the Imaginary board as $I$ , mirroring $R$ and $I$ respectively.

If you’re interested in formal definition, you can find it below, but it is not necessary for understanding the core of the problem:

MDP $(S, A, T, R)$ where:

State $s \in S$ : the contents of both boards - $R$ holds expressions satisfying formal grammar $G$ , $I$ holds a list of arbitrary strings, plus metadata (step count, found solutions).
Actions $a \in A$ : operations defined by $G$ applied to $R$ ; free writes to $I$ ; and cross-board substitution (replace a symbol on $R$ with an expression extractable from $I$ ’s strings, provided it satisfies $G$ ). Terminal declarations end the episode.
Transitions are deterministic—defined by the semantics of $G$ .
Reward: the agent receives positive reward upon verified solutions on $R$ , and at episode termination based on correctness and completeness. The canonical instantiation uses the field operations over $Q$ as $G$ , with polynomial root-finding as the task.

The Imaginary board may look like a nice addition for writing an agent’s thoughts—and for simple tasks it largely is. The core of the problem becomes apparent in the scenarios where using Imaginary board is the only way to achieve the reward. A canonical example is finding roots of polynomials with rational coefficients—where the agent must construct expressions for the argument substitution—or declare that no such expression exists. Here is an explanation why for some of the polynomials using $I$ is necessary:

For the linear and quadratic equations the problem is fairly trivial, and the operations on $R$ are sufficient. The agent can operate using the actions provided by grammar $G$ , and make optional notes. However, after the degree 3 - cubics—the situation changes dramatically. In the general case there provably exists no valid sequence of actions on the Real board which will find a root. There are some trivial cubics, like $x^{3} - 8 = 0$ for which the solution can be found. But for an example like $x^{3} - 3 x - 1 = 0$ - our data-centers can work till the end of eternity, and it won’t get us one bit closer to the solution. Yet, for humanity it took much less—just 1800 years from the inception of the problem statement to the general solution computable with quill and paper. And this is where our board $I$ comes into play. The only way for an agent to finish this task is to use it, and do it creatively.

The issue with the equation $x^{3} - 3 x - 1 = 0$ is not only that we need a lot of clever substitutions, but also that the result of those would yield this: $\sqrt{- 3 / 4}$ as one of the intermediate steps. Our grammar over board $R$ can’t express such term and operations with it. Those just… don’t exist. Undefined.

The agent—in the historical case Cardano and a few other mathematicians, can resolve this problem creatively. How? I will skip the full explanation available on Wiki, but one of the intermediate transformation steps will give us the form:

$u^{3} + v^{3} - q = 0$

This means, that if $u^{3} = (a + b)$ and $v^{3} = (a - b)$ , the $b$ part, whatever monster it is—will cancel out on the real board. Our $G$ does not allow an arbitrary substitution with unknown properties to interact with the Real numbers. But what the agent can do, is to notice that the $b$ parts will erase into nonexistence due to expansion $a + b + a - b = 2 a$ - whatever they are. This allows us to express $\sqrt{- 3 / 4} = \sqrt{3 / 4} i$ , where $i$ is just “some strange function real board doesn’t express but it passes the grammar for substitution”, and observe how it cancels out.

All the substitutions, especially ones which implicitly contained non-allowed terms, were Imaginary board moves. While creative and purposeful, those were absolutely not guaranteed to work until verified on the Real board. Moreover, nothing from direct observations on $R$ can guide an agent to find those operations. They are undefined for the Real board—we have zero information about them. Intermediate steps would not yield any feedback. This makes it insanely hard for an agent to navigate the search space—and that’s what we observe. It took centuries for humanity to invent $\sqrt{- 1}$ - probably the most fundamental concept in modern science and physics.

And if you think that was a hard task, well… It doesn’t end here. As soon as we reach the degree 5, the Imaginary board would require constructing Galois theory and the related group for polynomial roots. This is the only way to understand if there is any solution findable in $I$ we could use for substitution given $R$ constraints. For some polynomials—there will be a solution, and for some—it would be impossible, and we’re back to data-centers working till the end of eternity.

One may ask something like: “Ok, we just need to explore Imaginary board—at some point the right string combination appears, we substitute or decide unsolvable—for sure we can just throw more compute into that?”. The issue with this approach is that each write operation to the scratchpad will result in multiple combinations available for substitution. In other words—using Imaginary board leads to combinatorial explosion of both state space and effective action space. Brute-force creates more problems for the brute-force.

Another approach is to use geometric or numeric approximation methods for solution on $I$ in an attempt to find the solution. The problem is, however, that it doesn’t yield a term that the Real board grammar would accept.

This means that our favorite tools are not applicable for the Polynomial Two-Board Problem. If applied directly, those lead to a dead end. And yet, we know it’s solvable.

We know that Cardano, Ferrari, Abel, Galois somehow trained on this Polynomial environment and were able to come up with solutions for their respective pieces. Unsurprisingly, this environment is very nasty to train on. Some of its “nice” properties:

Enormous state space and action space, both exploding—as already mentioned
Sparse reward—valid roots are the only thing rewarded, and full reward is only available for finding all of them
Non-smooth gradient for $R$ - you can try to sample $(x, y (x))$ reals and train with back-prop—but some solutions will be transcendental and our particular grammar of $R$ doesn’t allow them—it’s 150 years before calculus and ~300 before the terms are coined
Zero explicit gradient on $I$ - it’s connected to $R$ via substitution terms and those explode. Other heuristics are non-trivial—we’ll discuss it later
No obvious direct transferability of previously learned strategy from degree $n$ to $n + 1$
After degree 5 valid solution space includes “declare unsolvable”—and agent doesn’t know that, and doesn’t know that it becomes useful only for quintics and further

To summarize it, the agent operates in the environment, where almost every action exponentially increases the entropy of exploration policy without any positive reward signal or other explicit feedback which would help in navigation. Under these conditions, it must develop the heuristics or strategy. Those must be efficient enough to determine a sequence of several valid steps in this chaos, and taking them must produce the unique set of strings valid for $R$ . On substitution those yield the reward and make the problem collapse to the solution state of “0=0”. ^[1]

Since the environment is adversarial towards our training objective, we can converge on the following explanation: the agent should have some internal heuristics or bias to counter the growth of the state space. Those must be sufficient for finding an algorithm to guide the agent, without representing the solution itself—encoding complex numbers and Galois groups is cheating and doesn’t represent the epistemic position of Cardano and Galois respectively.

This property rules out any LLM which is trained on the data including complex numbers or its descendants. This poses a challenge to test those architectures on the specific Polynomial case—any current LLM saw the complex numbers. Since the data will contain the solution—we won’t be able to distinguish memoized interpolation versus genuine discovery using the methods available in the training corpus. ^[2]

This might sound like an issue of the framework, but that inconvenience is the main feature of the Two-Board Problem. It was intentionally designed to test novel creative reasoning, by making the environment and the data available to the agent insufficient for finding a path to the solution given only explicit information about $R$ . It encodes a historical precedent where “novel reasoning” has a precise formal meaning: constructing objects that provably do not exist in the given system, and yet can be verified.

If we rule out unsatisfactory explanations of hypercomputation, quantum coherence at room temperature, alien communication and divine blessing, we should conclude that the computable heuristics for such an agent should exist. What does it look like? Honestly, no idea—I can only speculate towards symmetries and entropy reduction. Those are respectively what we try to discover and what we try to fight. But even if we look towards symmetries—the issue persists—what heuristics would allow us to discover them, without encoding the solution in the style of GDL? ^[3]

The core feature of the Two-Board Problem is its hostility towards any learning approach, and simultaneously obvious exemplification in the real historical precedents. It would be strange to assume that Cardano had access to imaginary numbers, Galois to symmetry groups or that Newton could directly observe and comprehend infinity. The explanation that those are somehow encoded on the unconscious level and manifest in the brain directly is somewhere on the level between Harry Potter Fanfic and tabloids. “Evolutionary priors encoding the information necessary for inventing Galois groups due to their necessity for hunting” sounds more sophisticated, but also answers neither why nor how. Yet we can observe this phenomenon and use the technology based on those discoveries on a daily basis.

I don’t have access to the datacenter for performing the ablation study on transformers or stress testing AlphaProof or AlphaGeometry on this specific problem. But to the best of my knowledge, none of those systems have produced a novel axiom expanding current mathematics—they intelligently navigate known formal systems. Such discovery—passing Two-Board test which necessarily involves new axiomatics—would be the headline of every corner of the internet given the trend of recent years.

In essence, the Two-Board Problem environment has a counter to every method I could list:

Program synthesis on $I$ would blow up the state/action space
Evolutionary / genetic programming—same thing
Curiosity-driven exploration—same thing, we don’t know when to cut branches—no feedback
MCTS—would need to evaluate intermediate states—but reward is not available and computable heuristics are currently unknown, and we’re back to branching
Gradient descent based methods—holes in gradient over $R$ , explicit obvious form doesn’t exist over $I$
RL/Bayes—sparse rewards would make you iterate policy/hypothesis too slowly to be practical, and I wouldn’t be surprised if those never converge
GDL/Transformers—encoding/memoization of properties external to the problem environment is prohibited, and ability to discover them inside is unproven until they pass benchmark under ablation criteria

The example environment implemented in python takes 500 lines with sympy package being the only dependency. And yet, I can’t figure out the solution. If one would like to try—the code is open and any LLM would happily wrap an MDP environment to gym.

The philosophical question of what creativity is and what novelty means is probably one of the hottest battlegrounds of internet discourse of recent years. I find this question meaningless, and propose to instead answer one having objective evaluation criteria: “can an agent pass the Two-Board Problem test?” The setup is fairly straightforward—given the observations over the Real board and no prior information about solution inexpressible in $R$ ’s grammar the machine should demonstrate the ability to construct one using only the scratchpad $I$ .

Some questions given the problem complexity could arise: Is this AGI? Does it mean that “If Anyone Builds It, Everyone Dies”? Is this a peak of an agent’s intelligence, limited only by Gödel above it? I don’t know, but this condition looks necessary for truly general intelligence, given we observe such examples. Its sufficiency is yet to be decided.

As a practical person, I restate my perspective that philosophical debate over those questions is meaningless, and to get the answers one should approach the problem theoretically or practically.

From a theoretical research perspective, the Two-Board state is a trivial MDP expressible in meta-board $R_{1}$ . This means that one might find some interesting properties and extensions if they wander around in the meta-space of $I_{1}$ long enough, guided by whatever the inexpressible heuristic for Meta-Two-Board is.

From a practical and experimental standpoint, one might try to research the Two-Board by attempting to domesticate the branching chaos of $I$ by any known or yet unknown means.

Regardless of which path you prefer, by this paragraph you have the formal definitions, gym-compatible environment, and open questions—all in repository. And as @Eliezer Yudkowsky wrote in my favorite essay—if you know exactly how the system works, you can build one from buckets and pebbles.

Have fun building.

PS:

“Don’t panic”—Douglas Adams, The Hitchhiker’s Guide to the Galaxy

↩︎
After writing this paragraph, I must admit that while I don’t agree with Penrose on theoretical explanation, he had a point about computational complexity. Won’t be that as sure about non-computability—that’s much stronger claim. And it still doesn’t change my skeptical stance regarding quantum effects in the specific proposed form.
↩︎
For ablation study on the Polynomial case we could use the writings before 1540 and generate additional synthetic data using algebraic, geometric and physics apparatus available at that time. The other approach is to express in the Two-Board framework one of the more recent cases of discoveries leading to novel concepts—which would be much easier to ablate. Non-invertible/categorical symmetries might be a good candidate.
↩︎
Geometric Deep Learning—an excellent framework by Michael Bronstein et al. that leverages known symmetries and invariances for deep learning architectures. The Two-Board Problem asks the complementary question: what if the relevant symmetries are not yet known?