Gearing Up for Long Timelines in a Hard World

Tl;dr: My current best plan for technical alignment is to work on: (1) improving our gears-level understanding of the internals of advanced AI systems and their dynamics, and (2) pathways decorrelated with progress in the first direction—likely in the form of pivotal processes in davidad’s Neorealist perspective.

Thing We Want

The Thing We Want is an awesome future for humanity. If we’re living in a world where alignment is hard and timelines are short, we’re probably dead whatever we do. So I would rather focus my attention on addressing worlds where alignment is hard and timelines are long.[1]

In my view, we can’t trust empirical facts about prosaic models to generalize, because there are likely rapid phase transitions and capability regimes that we don’t get much chance to iterate on. For example, I expect that once a system starts to self-reflect, optimization pressure will be applied to invariants that held in earlier systems, breaking those properties (and any theoretical guarantees that relied on them).

How to Get the Thing We Want

One perspective for categorizing alignment proposals is to take a view from the highest level, at which there are three broad targets: Corrigibility, Value alignment, and Pivotal act/process (+ other stuff I’m ignoring for now, like DWIM).

Building systems that can be trusted to follow their natural capability attractor

A commonality between Corrigibility and Value alignment is that their aim is to ensure that the AI, following its natural gradient towards increasing coherence and consequentialist cognition, ends up at an equilibrium that we prefer—like controlling where an arrow lands by controlling the bow, and trusting the arrow to do what we expect it to do after we let it go.

Thus, in order to trust these systems to do the sort of consequentialist cognition we want, these two targets seem to require that we have strong theoretical guarantees about those AI systems in a form that’s robust to extrapolation, at least until the point where the system is reflective enough to ensure goal-preservation[2].

Since behavior alone can’t distinguish between different systems, this would require a better theory of internal representations, of how they evolve over time, and of how to control them.
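As a toy illustration of this point (my own sketch, not from the original post): the two hypothetical “models” below agree on every input a behavioral test covers, yet have entirely different internals, so only claims about their internals tell us how they’ll extrapolate.

```python
# Toy illustration: two "models" that behave identically on every input we
# evaluate, but differ completely in their internals. A purely behavioral
# test suite gives them identical scores and cannot tell them apart; only a
# theory of internals predicts how they differ off-distribution.

def model_lookup(x: int) -> int:
    """Memorized lookup table covering exactly the evaluation distribution."""
    table = {0: 0, 1: 1, 2: 4, 3: 9, 4: 16}
    return table[x]

def model_general(x: int) -> int:
    """General algorithm that happens to agree on the evaluated inputs."""
    return x * x

# Behavioral evaluation: both models pass identically.
for x in range(5):
    assert model_lookup(x) == model_general(x)

# Off-distribution, their (safety-relevant) properties diverge:
# model_general(10) == 100, while model_lookup(10) raises a KeyError.
```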

Building such a theory is part (1) of my plan above. Clusters of work relevant to this direction include, but are not limited to:

Building systems that can’t be trusted to follow their natural capability attractor

If we don’t have such theories, value alignment and corrigibility are probably fucked. We’ll need a plan that’s not contingent on having a robust understanding of internals.

This would involve explicitly prohibiting the AI from rolling down that natural path of increasing coherence and consequentialist cognition in the first place, so that it never reaches the level of danger that requires strong guarantees about its internals—all while the AI remains able to do useful work in getting us the Thing We Want.

Clearly, this runs up against the alignment-is-hard model, in which sufficiently difficult targets convergently require consequentialist cognition (and thus require following the attractor). So in this direction we can’t ask for much: the ask needs to be as specialized and as free of dangerous cognition as possible, while still meaningfully pursuing the Thing We Want, i.e. pivotal acts.

Davidad’s Neorealist perspective is one such example, with the ask being “end the acute risk period,” subject to constraints such as the act being ethical and cooperative with the rest of humanity (closer to the spirit of Pivotal Processes).

His motivation for adopting such a perspective matches our motivation in this section (and I was in fact inspired by it):

unlike a typical “prosaic” threat model, in the neorealist threat model one does not rely on empirical facts about the inductive biases of the kind of network architectures that are practically successful.

So, my rephrasing of this branch of work would be:

Subject to the constraints of the alignment-is-hard risk model, we want a strategy for building an AI system that helps us ethically end the acute risk period, without the strategy depending too heavily on theoretical guarantees or on our understanding of system internals. The latter condition might be satisfied by using highly specialized and constrained systems specifically for running a pivotal act/process.

Creating such a strategy is part (2) of my plan above. Work relevant to this direction includes, but is not limited to:

  • davidad’s Open Agency Architecture

    • I consider this an existence proof against Point 7 of the List of Lethalities (that there are no weak pivotal acts): the Open Agency Architecture is a pivotal process that, to me, does not seem clearly doomed.

    • Perhaps it isn’t realistically feasible in its current form, but his proposal suggests that such a process might actually exist if we’re clever enough; we just have to keep searching for it.

Caveats

The distinction between the two branches above isn’t clear-cut. Advances in one can easily feed into the other (eg interpretability is broadly useful, such as for checking the internals of a specialized model used in a pivotal act, just to be safe).

Also, this way of carving up the alignment problem space has its flaws, since it focuses on alignment targets. For example, I’m personally quite optimistic about Cyborgism (which sidesteps the consequentialism-is-convergent problem by focusing on the narrow task of addressing the bottlenecks found in human researchers), yet it fits into neither of these categories.

Conclusion

Anyways, working on both of these branches seems especially useful since they’re pretty much anticorrelated (i.e. if we fail to produce the theories needed for (1), we still have (2) as a backup), which justifies my current best plan, copy-pasted below from the Tl;dr:

(1) improving our gears-level understanding of the internals of advanced AI systems and their dynamics[3], and (2) pathways decorrelated with progress in the first direction—likely in the form of pivotal processes in davidad’s Neorealist perspective.

  1. ^

    The detailed reason for my choice of working on the hard/long world is a bit more nuanced. For example, I actually do think alignment is hard (eg consequentialism is convergent, capabilities generalize further than alignment, etc.). I also think work in this direction is a better personal fit. But for the rest of this post I won’t justify this particular choice further, and will take my threat model as a background assumption.

  2. ^

    Feedback from Tsvi Benson-Tilsen: there’s probably a better term here than “goal,” which is pre-theoretic and probably confused. I agree. A better, more expanded phrasing would be “reflective enough to ensure preservation of that-which-we-want-to-preserve-to-get-the-thing-we-want.”

  3. ^

    Tsvi also notes that language like “dynamics” or “evolution over time” is temporal, and that it would be better to talk about a more general sense of determination.
