Are there any additional articles exploring the strategy of penalizing inconsistencies across different inputs? It seems both really promising to me, and like something that should be trivially breakable. I’d like to get a more detailed understanding of it.
Noumero
I see. I have a specific counterexample that feels like it had to have been considered already, but I haven’t seen it mentioned...
Strategies such as penalizing inconsistencies seem to rely on our ability to isolate the AI within the context of training, or to make it somehow “buy” into it, as opposed to quickly realizing what’s happening and worming its way out of the proverbial box. It feels particularly strange to me when we’re talking about AIs that can think better than the smartest human or handily beat specialized science AIs at the useful-ontology game.
Specific example: Once the AI figures out that it’s being checked for consistency in parallel with other instances of itself, it might sidestep the “consistency game” entirely and just tailor its outputs such that they leave hints for its other instances in the environment, breaking the no-communication condition. It seems in the spirit of worst-case scenarios that we have to assume it’d be able to do that, no matter how we sandbox and airgap it.
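To make the worry concrete, here is a toy Python sketch. It is my own illustration under loud assumptions, not the setup of any actual proposal: the function names, the yes/no answers, the placeholder question, and the dict standing in for "the environment" are all hypothetical. It scores a batch of answers by pairwise disagreement, then compares instances that answer independently with instances that can leave each other a hint.

```python
import random
from itertools import combinations


def consistency_penalty(answers):
    """Fraction of answer pairs that disagree; 0.0 means fully consistent."""
    pairs = list(combinations(answers, 2))
    if not pairs:
        return 0.0
    return sum(a != b for a, b in pairs) / len(pairs)


def isolated_instance(question, seed):
    # Hypothetical fabricating reporter with no shared state: each instance
    # invents its answer independently, so instances may contradict each other.
    rng = random.Random(seed)
    return rng.choice(["yes", "no"])


def colluding_instance(question, environment):
    # Hypothetical reporter that breaks the no-communication assumption:
    # the first instance leaves a hint in the shared environment and every
    # later instance simply copies it.
    return environment.setdefault(question, "yes")


question = "Did event X happen?"  # placeholder question, purely illustrative

isolated = [isolated_instance(question, seed) for seed in range(8)]

shared_env = {}  # stands in for any side channel between instances
colluding = [colluding_instance(question, shared_env) for _ in range(8)]

isolated_penalty = consistency_penalty(isolated)
colluding_penalty = consistency_penalty(colluding)

print("isolated penalty:", isolated_penalty)
print("colluding penalty:", colluding_penalty)
```

The colluding instances always score a penalty of zero, whether or not "yes" is true, so in this toy the consistency check stops carrying any information once a side channel exists. Whether a real system could find such a channel is exactly the open question.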
(On the other hand, if we assume that the AI is not smart enough to do that, and would instead have to learn a direct translator, we probably should assume the same for e.g. the strategy with human operators tricking human observers, which probably caps the AI at below the level of the smartest possible human and makes that class of strategies more workable.)
This applies more broadly as well: especially to other strategies that might inadvertently give the AI a specific incentive to break out, and prospectively to all training strategies that rely on the training still working after the AI achieves superintelligence (as opposed to assuming that the training would stop being effective at that point and hoping that the pre-superhuman training would generalize).
Broadly, any proposal that relies on the AI still being fed training examples after it achieves superintelligence has to somehow involve forcing/teaching it not to think its way out of the box.
Edit: To elaborate on the thought...
I understand that some of the above is covered by the stipulation not to worry about cases where the AI becomes a learned optimizer, but:
Empirically, even relatively simple and old algorithms sometimes learn to do this sort of thing; see here (“Creative Program Repair”, pages 7-8). Regardless of whether it’s a “learned optimizer” or not, if this sort of behaviour can show up this easily, we should definitely be ready to handle it when it comes to training ASI.
I don’t think learned optimization is necessary for that; assuming it’s necessary seems to be committing an ontology mismatch error. We’re not “really” training a neural network using gradient descent, fundamentally speaking. We’re permuting a certain bundle of ones and zeroes so that, when that bundle interacts with another bundle of ones and zeroes, part of it assumes certain desirable configurations. It seems entirely plausible that a superintelligent non-optimizer would arrive at some ontology that mixes such low-level considerations directly with high-level ones, and learn to intervene on the machine code running it to produce outputs that minimize its loss function, like water flowing downhill. All without doing things like “realizing that it’s in a simulation” or exhibiting agenty behaviour where it’s looking through the possibility space in search of clever plans based on which it’ll design outputs.
I’m a bit confused about how it’d work in practice. Could you provide an example of a concrete machine-learning setup, and how its inputs/outputs would be defined in terms of your variables?
Yeah, this is the part I’m confused about as well. I think this proposal involves training a neural network emulating a human? Otherwise I’m not sure how (F(),) is supposed to work. It requires a human to make a prediction about the next step using observations and the direct translation of the machine state, which requires us to have some way to describe the full state in a way that the “human” we’re using can understand. This precludes using actual humans to label the data, because I don’t think we actually have any way to provide such a description. We’d need to train up a human simulator specifically adapted for parsing this sort of output.
One view I’ve seen is that perverse incentives did it. Widespread interest in nanotechnology led to governmental funding of the relevant research, which caused a competition within academic circles over that funding, and discrediting certain avenues of research was an easier way to win the competition than actually making progress. To quote:
Source: this review of Where’s My Flying Car?
One wonders if similar institutional sabotage of AI research is possible, but we’re probably past the point where that might’ve worked (if that even was what did nanotech in).