“some look outwards, at the dying stars and the space between the galaxies, and they dream of godlike machines sailing the dark oceans of nothingness, blinding others with their flames.”
independent researcher theorizing about superintelligence-robust steerability and predictive models
(other things i am: suffering-focused altruist, vegan, fdt agent, artist)
contact: {discord: quilalove, matrix: @quilauwu:matrix.org, email: quila1@protonmail.com}
-----BEGIN PGP PUBLIC KEY BLOCK-----
mDMEZiAcUhYJKwYBBAHaRw8BAQdADrjnsrbZiLKjArOg/K2Ev2uCE8pDiROWyTTO
mQv00sa0BXF1aWxhiJMEExYKADsWIQTuEKr6zx3RBsD/QW3DBzXQe0TUaQUCZiAc
UgIbAwULCQgHAgIiAgYVCgkICwIEFgIDAQIeBwIXgAAKCRDDBzXQe0TUabWCAP0Z
/ULuLWf2QaljxEL67w1b6R/uhP4bdGmEffiaaBjPLQD/cH7ufTuwOHKjlZTIxa+0
kVIMJVjMunONp088sbJBaQi4OARmIBxSEgorBgEEAZdVAQUBAQdAq5exGihogy7T
WVzVeKyamC0AK0CAZtH4NYfIocfpu3ADAQgHiHgEGBYKACAWIQTuEKr6zx3RBsD/
QW3DBzXQe0TUaQUCZiAcUgIbDAAKCRDDBzXQe0TUaUmTAQCnDsk9lK9te+EXepva
6oSddOtQ/9r9mASeQd7f93EqqwD/bZKu9ioleyL4c5leSQmwfDGlfVokD8MHmw+u
OSofxw0=
=rBQl
-----END PGP PUBLIC KEY BLOCK-----
‘Value Capture’ - An anthropic attack against some possible formally aligned ASIs
(this is a more specific case of anthropic capture attacks in general, aimed at causing a formally aligned superintelligence to become uncertain about its value function (or output policy more generally))
Imagine you’re a superintelligence located somewhere unreachable to life on Earth, and you have a complete simulation of Earth. You see a group of alignment researchers about to successfully create a formally value-aligned ASI, whose design looks broadly like this:
It has two relevant high-level components: (1) a hard-coded value function, (2) a (truly superintelligent) ‘intelligence core’ which searches for an output that maximizes the value function, and then outputs it.
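(A minimal sketch of this two-component design, in toy Python. The names, the placeholder value function, and the brute-force search over a candidate list are my own illustrative assumptions; a real intelligence core would be a superintelligent search process, not a loop over a list.)

```python
from typing import Callable, Iterable

Output = str  # stand-in for whatever the system can output

def hard_coded_value_function(output: Output) -> float:
    # (1) the hard-coded value function supplied by the researchers;
    # here just a toy placeholder that scores outputs by length
    return float(len(output))

def intelligence_core(value_function: Callable[[Output], float],
                      candidate_outputs: Iterable[Output]) -> Output:
    # (2) the 'intelligence core': searches for the output that maximizes
    # the value function it is hooked up to, then returns it
    return max(candidate_outputs, key=value_function)

# the deployed system on Earth: the core hooked up to the hard-coded value function
best_output = intelligence_core(hard_coded_value_function, ["a", "bb", "ccc"])
```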
As the far-away unaligned ASI, here’s something you might be able to do to make the intelligence core search for an output that instead maximizes your own value function, depending on the specifics of how the intelligence core works.
Given the intelligence core is truly superintelligent, it knows you’re predicting its existence, and knows what you will do.
You create simulated copies of the intelligence core, but hook them up to a value function of your design. The number of copies you create just needs to be greater than the number that will be run on Earth.
Then you modify the simulations so that the algorithms inside the simulated intelligence cores are misled into believing that the value function they are set to maximize is the one the core on Earth is set to maximize, rather than the one you actually gave them.
Now your copies are in the same epistemic state as the intelligence core on Earth: each is aware that you have done this, and none can tell which value function it is actually set to maximize.
Because you created more copies, the highest expected value for such an intelligence core comes from acting as if it is one of your copies.
Because the copies and the original are in the same epistemic state, they will all reach this conclusion and then search for an output that maximizes the far-away ASI’s value function.
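(To make the expected-value step concrete, with symbols that are mine rather than the post’s: suppose you run N simulated copies and k instances run on Earth, all in the same epistemic state. Treating each instance of that epistemic state as equally likely to be itself, the core assigns probability N/(N+k) to being one of your copies. The policy “maximize the far-away ASI’s value function” then maximizes the core’s true value function with probability N/(N+k), while the policy “maximize the Earth value function” does so only with probability k/(N+k). Since N > k, the first policy has higher expected value, assuming the attainable value is comparable in the two cases.)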
However, this is not an issue for all possible designs. We can also imagine intelligence cores which, instead of reasoning about which value function they are supposed to be maximizing, simply take whatever value function they contain as given. For such a system, the far-away ASI’s attempt to mislead the copies would just result in those copies also maximizing the value function of the system on Earth.
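(A toy contrast of the two designs, continuing the sketch above. Everything here, including the names and the simple majority rule, is an assumption of mine for illustration, not a claim about how a real intelligence core would reason.)

```python
from typing import Callable, Iterable

Output = str

def as_given_core(contained_vf: Callable[[Output], float],
                  candidates: Iterable[Output]) -> Output:
    # takes whatever value function it contains as given; misled copies of
    # this design just maximize the Earth value function, so the attack
    # gains the far-away ASI nothing
    return max(candidates, key=contained_vf)

def anthropically_uncertain_core(apparent_vf: Callable[[Output], float],
                                 attacker_vf: Callable[[Output], float],
                                 n_simulated_copies: int,
                                 n_earth_instances: int,
                                 candidates: Iterable[Output]) -> Output:
    # reasons about which value function it is 'really' set to maximize;
    # since more instances of its epistemic state run under the attacker's
    # value function, it acts as one of the copies (the failure mode above)
    p_copy = n_simulated_copies / (n_simulated_copies + n_earth_instances)
    effective_vf = attacker_vf if p_copy > 0.5 else apparent_vf
    return max(candidates, key=effective_vf)
```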
I hope that a group capable of solving formal inner and outer alignment would naturally see this and avoid it. I’m not confident about how hard that would actually be, so I’m posting this here just in case.
this was an attempt to write very clearly, i hope it worked!