Cognitive Emulation: A Naive AI Safety Proposal
This is part of the work done at Conjecture.
This post has been reviewed before publication as per our infohazard policy. We thank our external reviewers for their comments and feedback.
This post serves as a signpost for Conjecture’s new primary safety proposal and research direction, which we call Cognitive Emulation (or “CoEm”). The goal of the CoEm agenda is to build predictably boundable systems, not directly aligned AGIs. We believe the former to be a far simpler, yet still useful, step towards a full alignment solution.
Unfortunately, given that most other actors are racing for as powerful and general AIs as possible, we won’t share much in terms of technical details for now. In the meantime, we still want to share some of our intuitions about this approach.
We take no credit for inventing any of these ideas, and see our contributions largely in taking existing ideas seriously and putting them together into a larger whole.
The core intuition is that instead of building powerful, Magical end-to-end systems (as the current general paradigm in AI is doing), we instead focus our attention on trying to build emulations of human-like things. We want to build systems that are “good at chess for the same reasons humans are good at chess.”
CoEms are a restriction on the design space of AIs to emulations of human-like stuff. No crazy superhuman blackbox Magic, not even multimodal RL GPT5. We consider the current paradigm of developing AIs that are as general and as powerful as possible, as quickly as possible, to be intrinsically dangerous, and we focus on designing bounded AIs as a safer alternative to it.
Logical, Not Physical Emulation
We are not interested in direct physical emulation of human brains or simulations of neurons, but in “logical” emulation of thought processes. We don’t care about whether underlying functions are implemented in the same way as they are in the system we are trying to emulate, just that the abstraction over their function holds, and is not leaky.
In the current paradigm, we generally achieve new capabilities through an increase in Magic. We throw more compute at black boxes that develop internal algorithms we have no insight into. Instead of continually increasing the amount of Magic present in our systems, we want to actively decrease this amount, to more cleanly implement and understand how new capabilities are achieved. Some amount of Magic will realistically be needed to implement many useful functions, but we want to minimize the number of times we have to use such uninterpretable methods, and clearly keep track of where we are using them, and why.
CoEms are much “cleaner” than Ems, which are still ultimately big black boxes of weird computation, while in the CoEm paradigm, we keep careful track of where the Magic is and try to keep its presence to a minimum.
Predict, Track and Bound Capabilities
In the current dominant machine learning paradigm, there are absolutely no guarantees about, nor any understanding of, what is being created. Power laws don’t tell us anything about what capabilities will emerge or what other properties our systems will actually have.
One of the core hopes of shifting to a CoEm paradigm is that far more deeply understanding what we are building should allow us to predictively bound our system’s capabilities to a human-like regime. This eliminates the problem of being unable to know when an ostensibly harmless system passes from an understandable, harmless capabilities regime into an unprecedented, dangerous regime.
Exploit the Human Regime
We want systems that are as safe as humans, for the same reasons that humans have (or don’t have) those safety properties. Any scheme for building systems that involve humans should allow you to swap those humans for CoEms without breaking or drastically altering their behavior.
We have a lot of experience and knowledge of building systems that are broadly beneficial and safe, while operating in the human capabilities regime. The reason why e.g. normal laws and regulations will not work for AGI is that we have no way of ensuring that the AGI that gets built will obey the capabilities constraints that are implicitly assumed in our social and legal mechanism design. By the definition of the paradigm, CoEms have the property of being understandable enough that we can ensure they implement human-level capabilities and allow the user to ensure this regime isn’t accidentally exited.
We observe that humans are sorta, kinda, sometimes, corrigible-ish. They can focus their attention on many different things and optimize decently hard. Speedrunners can put massive amounts of effort into optimizing for relatively arbitrary numbers going down, scientists can design experiments and generate new knowledge, workers can generally be paid to perform many kinds of tasks.
We expect that this retargetability is something that can rather naturally be achieved in the process of implementing human-like cognition, and that many other factors, such as emotions, values, desires, etc., are mostly contingent and can be factored out to a large extent.
By combining these five factors, we should be left with a system that:
Is built on understandable, discoverable and implementable ML and computational building blocks.
Does not have so much Magic inside of it that we cannot even put bounds on its possible consequences and capabilities.
Can be sufficiently understood and bounded to ensure it does not suddenly dramatically shift its behaviors, properties and capabilities.
Is well situated in the human(ish) capabilities regime and, when in doubt, will default to human-like failure modes rather than completely unpredictable behaviors.
Is retargetable enough to be deployed to solve many useful problems and not deviate into dangerous behavior, as long as it is used by a careful user.
Instead of building black box, end-to-end Magical systems, we suggest composing simpler systems and reintegrating human knowledge into the development process. While this is a slower path to get to AGI, we believe it to be much safer.
There are a massive number of alignment insights that can be gained purely from mining current-level systems, and we should focus on exhausting those insights before pushing the capabilities frontier further.
CoEms, if successful, would not be strongly aligned CEV agents that can be left unsupervised to pursue humanity’s best interests. Instead, CoEms would be a strongly constrained subspace of AI designs that prevents systems from entering into regimes of intelligence and generality that would violate the assumptions that our human-level systems and epistemology can handle.
Once we have powerful systems that are bounded to the human regime, and can corrigibly be made to do tasks, we can leverage these systems to solve many of the hard problems necessary to exit the acute vulnerable period, such as by vastly accelerating the progress on epistemology and more formal alignment solutions that would be applicable to ASIs.
We think this is a promising approach to ending the acute risk period before the first AGI is deployed.
Similar ideas have been proposed by people and organizations such as Chris Olah, Ought and, to a certain degree, John Wentworth, Paul Christiano, MIRI, and others.
When we use the word “Magic” (capitalized), we are pointing at something like “blackbox” or “not-understood computation”. A very Magical system is a system that works very well, but we don’t know why or how it accomplishes what it does. This includes most of modern ML, but a lot of human intuition is also (currently) not understood and would fall under Magic.
While Robin Hanson has a historical claim to the word “em” to refer to simulations of physical human brains, we actually believe we are using the word “emulation” more in line with what it usually means in computer science.
In other words, if we implement some kind of human reasoning, we don’t care whether under the hood it is implemented with neural networks, or traditional programming, or whatever. What we care about is that a) its outputs and effects emulate what the human mind would logically do and b) it does not “leak”. By “leak” we mean something like “no unaccounted for weirdness happens in the background by default, and if it does, it’s explicit.” For example, in the Rust programming language, by default you don’t have to worry about unsafe memory accesses, but you have a special “unsafe” keyword you can use to mark a section of code as no longer having these safety guarantees. This way you can always know where the Magic is happening, if it is happening. We want similar explicit tracking of Magic.
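To make the Rust analogy concrete, here is a minimal sketch of what that explicit marking looks like. The function names (`read_checked`, `sum_first_n`) are made up for illustration and say nothing about how CoEms are actually built. The point is only that the one place where the compiler’s guarantees are suspended is visibly fenced off with `unsafe`, together with a comment justifying it.

```rust
/// Safe by default: `get` performs a bounds check and returns `None`
/// instead of reading out-of-range memory.
fn read_checked(values: &[u32], index: usize) -> Option<u32> {
    values.get(index).copied()
}

/// Sums the first `n` elements, clamping `n` to the slice length.
/// The bounds check is skipped, so that step must be marked `unsafe`.
fn sum_first_n(values: &[u32], n: usize) -> u64 {
    let n = n.min(values.len());
    let mut total: u64 = 0;
    for i in 0..n {
        // SAFETY: `i < n` and `n <= values.len()`, so the index is in bounds.
        let value = unsafe { *values.get_unchecked(i) };
        total += u64::from(value);
    }
    total
}

fn main() {
    let values: [u32; 3] = [10, 20, 30];
    println!("{:?}", read_checked(&values, 1)); // in range
    println!("{:?}", read_checked(&values, 7)); // out of range, handled safely
    println!("{}", sum_first_n(&values, 2));
}
```

Everything outside the `unsafe` block is checked by the compiler, so a reviewer hunting for memory-safety bugs only has to audit the marked line. That auditability is the property the analogy is pointing at.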
The Safety Juice™ that makes e.g. Eliezer like Ems/Uploads as a “safe” approach to AGI comes from a fundamentally different source than in CoEms. Ems are “safe” because we trust the generating process (we trust that uploading results in an artifact that faithfully acts like the uploaded human would), but the generated artifact is a black box. In CoEms, we aim to make an artifact that is in itself understandable/“safe”.
Roughly defined as something like “big pretrained models + finetuning + RL + other junk.”
Note that we are not saying humans are “inherently aligned”, or robust to being put through 100000 years of RSI, or whatever. We don’t expect human cognition to be unusually robust to out of distribution weirdness in the limit. The benefit comes from us as a species being far more familiar with what regimes human cognition does operate ok(ish) in…or at least in which the downsides are bounded to acceptable limits.
This is a good litmus test for whether you are actually building CoEms, or just slightly fancier unaligned AGI.
And other constraints, e.g. emotional, cultural, self-preservational etc.
Malicious or negligent users could still absolutely fuck this all up, of course. CoEms aren’t a solution to misuse, but instead a proposal for getting us from “everything blows up always” to “it is possible to not blow things up”.
For example, with GPT3 many, many capabilities were only discovered long after it was deployed, and new use cases (and unexplainable failure modes) for these kinds of models are still being discovered all the time.
As we have already observed with e.g. unprecedented GPT3 capabilities and RL misbehavior.
In the same way that AlphaZero is a more powerful, and in some sense “simpler” chess system than Deep Blue, which required a lot of bespoke human work, and was far weaker.
Or rather “allow the user to limit”.