I have a fairly strong heuristic that clever schemes like this one are doomed. The proposal seems to lack security mindset, as Eliezer would put it.
The most immediate/simple concrete objection I have is that no one has any idea how to create aligned-AI-part-2.exe? I don’t think figuring out what we’d do if we knew how to make a program like that is really the difficult part here.

This is a Heuristic That Almost Always Works, and it’s the one most likely to cut off our chances of solving alignment. Almost all clever schemes are doomed, but if we as a community let that meme stop us from assessing the object-level question of how (and whether!) each clever scheme is doomed, then we are guaranteed not to find one.
Security mindset means looking for flaws, not assuming all plans are so doomed that you don’t need to look.
If this is, in fact, a utility function which, if followed, would lead to a good future, that is concrete progress: it lays out a new set of true names as a win condition. It’s not a solution, since we can’t train AIs with arbitrary goals, but it’s progress in the same way that quantilizers were progress on mild optimization.
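(For readers who haven’t run into quantilizers: the idea is to soften “pick the highest-utility action” into “sample from the top q-fraction of some trusted base distribution.” A toy sketch, with illustrative names and not anyone’s actual implementation:)

```python
import random

def quantilize(actions, utility, base_weights, q=0.1):
    """Toy q-quantilizer: instead of argmaxing utility, sample from the
    base distribution restricted to the top q-fraction of actions."""
    # Rank candidate actions by utility, best first.
    ranked = sorted(actions, key=utility, reverse=True)
    # Keep only the top q-fraction (at least one action).
    k = max(1, int(len(ranked) * q))
    top = ranked[:k]
    # Sample from the base distribution over that top set,
    # rather than deterministically picking the argmax.
    weights = [base_weights[a] for a in top]
    return random.choices(top, weights=weights, k=1)[0]
```

The point of sampling from a base distribution instead of argmaxing is to bound how hard the system optimizes, which is why it counted as progress on mild optimization even though it isn’t itself a full alignment solution.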
I don’t think security mindset means “look for flaws.” That’s ordinary paranoia. Security mindset is something closer to “you better have a really good reason to believe that there aren’t any flaws whatsoever.” My model is something like “A hard part of developing an alignment plan is figuring out how to ensure there aren’t any flaws, and coming up with flawed clever schemes isn’t very useful for that. Once we know how to make robust systems, it’ll be more clear to us whether we should go for melting GPUs or simulating researchers or whatnot.”
That said, I have a lot of respect for the idea that coming up with clever schemes is potentially more dignified than shooting everything down, even if clever schemes are unlikely to help much. I respect carado a lot for doing the brainstorming.
I think a better phrasing is “clever schemes have too many moving parts and make too many assumptions, and each assumption we make is a potential weakness that an intelligent adversary can and will optimize against”.
i would love a world-saving plan that isn’t “a clever scheme” with “many moving parts”, but alas, i don’t expect that’s what we get. as clever schemes with many moving parts go, this one seems not particularly complex compared to other things i’ve heard of.
to me it kind of is; i mean, if you have that, what do you do then? how do you use such a system to save the world?
I mostly expect that by the time we know how to make a seed superintelligence and give it a particular utility function… well, first of all, the world has probably already ended; but second, I would expect progress to have been made on corrigibility and the like, and probably to present better avenues.
If Omega handed me aligned-AI-part-2.exe, I’m not quite sure how I would use it to save the world? I think trying to just work on the utility function outside of a simulation is probably better, but if you are really running out of time then sure, I guess you could try to get it to simulate humans until they figure it out. I’m not very convinced that referring to the thing a person would have done in a hypothetical scenario is a robust method of getting that thing to happen, though?