Random safe AGI idea dump

Some ideas on how to set up AGI training such that it wouldn’t pose an existential threat to humanity, even if its ability to manipulate us and (indirectly) our surroundings were superior to ours. If these ideas have already been elaborated and/or debunked elsewhere, please let me know!

Different ways to lie about the substrate

When the AGI is trained (think: black-box evolution) to adapt to its training environment, the environment should be idealized in ways that are unrealistic. This way, the resulting intelligence comes to depend on features that are difficult to efficiently implement in our substrate, the laws of physics.

The core idea here is: whatever the AGI is when finished and certified “human-exceeding”, it would have to fundamentally change into something else in order to live on the laws of physics and still work optimally in our universe, at our clock speed.

Inject NP-hardness

Some of the basic building blocks of the AGI’s thinking cycle should be instantaneous solutions to NP-hard problems. Instantaneous here means from the AGI’s point of view: agents that rely on NP-hardness shouldn’t be penalized. If Agent A finishes a task in one hour of wall-clock time while Agent B finishes the same task in one millisecond, but the subjective time spent (clock cycles?) and the outcomes are equal, then the evaluation must result in an equal score as well.

This doesn’t have to mean denying the agents access to wall-clock time in general, so they are free to despair about their sluggishness all they care, but during training, wall-clock access must be denied, and especially during evaluation, all information about wall-clock time must be withheld from the evaluator.
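As a concrete illustration, here is a minimal sketch (in Python) of what a wall-clock-blind evaluation harness could look like. Everything in it is hypothetical: the agent interface that returns a solution plus a self-reported count of subjective cycles, and the grading function, are assumptions made for the sake of the example.

```python
import time
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class EvalResult:
    outcome_score: float    # quality of the solution, judged on content alone
    subjective_cycles: int  # "clock cycles" the agent reports having spent

def evaluate(run_task: Callable[[], tuple[Any, int]],
             grade: Callable[[Any], float]) -> EvalResult:
    """Run one task and keep only wall-clock-free information.

    Wall-clock time is measured for lab bookkeeping only and is discarded
    before anything reaches the evaluator, so two agents with equal outcomes
    and equal subjective cycles get equal scores regardless of real runtime.
    """
    t0 = time.monotonic()
    solution, cycles = run_task()        # agent returns (solution, subjective cycles)
    _wall_clock = time.monotonic() - t0  # deliberately never stored or logged
    return EvalResult(outcome_score=grade(solution), subjective_cycles=cycles)

# Toy usage: a "slow" agent and a "fast" agent with identical outcomes score identically.
grader = lambda s: float(s == 42)
slow = evaluate(lambda: (time.sleep(0.5) or 42, 1000), grade=grader)
fast = evaluate(lambda: (42, 1000), grade=grader)
assert slow == fast  # wall clock never enters the comparison
```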

This way, if there is any adaptive advantage from (ab)using NP-hardness instead of abstaining – and I strongly suspect there would be – the resulting agents would come to perform slowly in the real world: they’d get all the tasks done with superhuman skill and accuracy, just… very slowly. And there should be no way to fix that, short of starting the training from scratch.

What is the opportunity cost in real-world utility of the AGI performing worse than it could? I don’t know; probably less than the cost of a fast AGI killing us all.

Deny access to space

In the AGI training environment, all references to spatial dimensions should be culled: length, width, height, latitude, longitude, even areas. Volume might be okay, since the AGI should in any case be able to reason about scalar sizes, such as different kinds of “fits into” relations. Will this file fit onto this hard disk? That should be a fine thing to understand and consider.
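To make the culling slightly more concrete, here is a crude, hypothetical sketch of filtering explicit spatial vocabulary out of a text corpus. The blocklist is purely illustrative; a real pipeline would need a far richer ontology (units, coordinate formats, multiple languages) and is not claimed here.

```python
import re

# Illustrative blocklist of explicit spatial vocabulary; far from exhaustive.
SPATIAL_RE = re.compile(
    r"\b(length|width|height|depth|latitude|longitude|areas?|acres?|hectares?|"
    r"met(?:er|re)s?|miles?|kilomet(?:er|re)s?|feet|foot|inch(?:es)?|square\s+\w+)\b",
    flags=re.IGNORECASE,
)

def cull_spatial_sentences(text: str) -> str:
    """Drop every sentence that mentions an explicit spatial dimension or unit."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    kept = [s for s in sentences if not SPATIAL_RE.search(s)]
    return " ".join(kept)

sample = ("The warehouse is 40 meters in length. The file is 3 GB. "
          "Will it fit on a 2 GB disk?")
print(cull_spatial_sentences(sample))
# -> The file is 3 GB. Will it fit on a 2 GB disk?
```

Note that the volume-style “fits into” reasoning (file sizes, disk sizes) passes through untouched, matching the exemption above.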

Some spatial information might be unavoidable; for instance, perhaps distances between locations could be divulged, properly obfuscated. The AGI should already be able to think in terms of graphs and relations. Distance would be a relation some things have. It would be a scalar relation, akin to similarity. How similar are these things? About a kilometer. Should be fine. Just the overall network of distance relations shouldn’t give too precise hints about the dimensionality of the space. Statistically, the distances should look like, say, similarity measures between pixels of anime characters: something very high-dimensional, so that it’s impossible to pin down the exact number of dimensions. So obfuscation would be needed to adjust the overall distribution of all real-world distances included in the training data.
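One possible obfuscation scheme, sketched below under heavy assumptions: keep the ordering of real-world distances but rank-match their values onto pairwise distances sampled from a very high-dimensional random embedding (a stand-in for the “similarity measures between pixels of anime characters”). This only reshapes the marginal distribution; whether that hides dimensionality well enough is exactly the open question here.

```python
import numpy as np

rng = np.random.default_rng(0)

def high_dim_reference_distances(n_pairs: int, dim: int = 512) -> np.ndarray:
    """Pairwise distances between random points in a very high-dimensional space.

    A reference distribution whose source dimensionality is hard to pin down
    from the distances alone.
    """
    points = rng.standard_normal((2 * n_pairs, dim))
    return np.linalg.norm(points[:n_pairs] - points[n_pairs:], axis=1)

def obfuscate_distances(real_distances: np.ndarray) -> np.ndarray:
    """Rank-match real-world distances onto the high-dimensional reference.

    Ordering is preserved (near things stay nearer than far things), but the
    overall distribution no longer looks like distances in 3D space.
    """
    reference = np.sort(high_dim_reference_distances(len(real_distances)))
    ranks = np.argsort(np.argsort(real_distances))
    return reference[ranks]

# Toy usage: a handful of real distances (km) between locations.
real = np.array([1.2, 340.0, 5.5, 7800.0, 42.0])
print(obfuscate_distances(real))
```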

Anyway, the point is not to prevent rational inquiry and “finding out” the number of dimensions, but rather that the reasoning process itself would be adjusted to a particular spatial world. Our spacetime should be only one possibility out of the infinitely many that could be inferred from the training data. This way the physical world would remain forever alien to the AGI.

What is still okay?

Current LLMs seem safe by this measure: they don’t have direct access to spacetime during training, and symbolic manipulation alone (as crude as piecewise-linear functions) probably can’t kill everyone.

LLMs with robotic extensions – autonomous killer robots, AGI control of weapons of mass destruction, nanofabs, … – are debatable, but personally I don’t see the existential threat there. As long as the AGI isn’t the one doing the design and manufacturing work (and how could it? It can’t even into spacetime, ha ha), it should only be able to launch the current stockpile at us. The end – of it, not us. It would still make sense not to let the stockpiles run up too high, but hey, that’s a completely unrelated problem.

Any special-purpose AI that we can understand is okay. Yes, what if the AGI takes over and uses all that machinery against us? We could face existential trouble, sure, but in the end it’s all machinery designed by us and for us, and the AGI is going to need us to develop and operate it. It can’t conquer our universe with the special-purpose tools alone.

What type of safety should we expect?

Humans and specialized AIs should always remain the hands and feet of the AGI. This means the AGI might take over at the high level, but it would still require us in some shape, form, and quantity. Yes, we might become little more than servo motors driving an anthill – but hey, from the perspective of the superintelligence, what are we but ants? Let’s hope we’re at our peak performance when we’re happy and well-fed.

What would we be missing?

“We”, as in Earth intelligence, would be missing the light-speed explosive growth and conquest of our substrate. In the words of Scott Alexander’s AI 2027 scenario, “Earth-born civilization has a glorious future ahead of it—but not with us.” Humans are not very good hands and feet, so while the galaxy conquest will probably still happen, it would happen a lot more slowly. Which is probably fine, since as the sole intelligence in the universe, we’re not in a rat race to paperclip the galaxy with meatsacks. We don’t need to outcompete anyone. We already won the game, a long time ago. Now let’s just hope we can hold onto that.
