Many users would immediately tell that predictor "predict what an intelligent agent would do to pursue this goal!" and all of the standard worries would recur.
I don’t think it works this way. You have to create a context in which the true training-data continuation is what a superintelligent agent would do. You can’t, because there are no superintelligent agents in the training data, so the answer to your prompt would look like something out of, e.g., Understand by Ted Chiang. (Granted, you wrote ‘intelligent agent’, a level matched by the humans who wrote the training data; that prompt would work, but it wouldn’t be dangerous.)
If we can clarify why alignment is hard and how we’re likely to fail, seeing those futures can prevent them from happening, provided we see them early enough and clearly enough to convince the relevant decision-makers to make better choices.
Okay, that’s true.
After many years of study, I have concluded that if we fail, it won’t be in the ‘standard way’ (though of course I’m always open to changing my mind back). Thus we need to identify and solve new failure modes, which I think largely don’t fall under classic alignment-to-developers concerns.
I meant if the predictor were superhumanly intelligent.
You have spent years studying alignment? If so, I think your posts would benefit from including more ITT/steelmanning of that worldview.
I agree with your arguments that alignment isn’t necessarily hard. I think there is a complementary set of arguments against alignment being easy. Both must be addressed and factored in to produce a good estimate of alignment difficulty.
I’ve also been studying alignment for years, and my take is that everyone has a poor understanding of the whole problem and so we collectively have no good guess on alignment difficulty.
It’s just really hard to accurately imagine AGI. If it’s just a smarter version of LLMs that acts as a tool, then sure, it will probably be aligned enough, just like current systems.
But it almost certainly won’t be.
I think that’s the biggest crux between your views and mine. Agency and memory/learning are too valuable, and too easy to add, to stay out of the picture for long.
I’m not sure the reasons Claude is adequately aligned won’t generalize to AGI that’s different in those ways, but I don’t think we have much reason to assume they will.
I’ve probably expressed this best in LLM AGI may reason about its goals, the post I linked to previously.