I usually think of these sorts of claims by MIRI, or by 1940s science fiction writers, as mapping out a space of ‘things to look out for that might provide some evidence that you are in a scary world.’
I don’t think anyone should draw strong conceptual conclusions from relatively few, relatively contrived, empirical cases (alone).
Still, I think that they are some evidence, and that the point at which they become some evidence is ‘you are seeing this behavior at all, in a relatively believable setting’, with additional examples not precipitating a substantial further update (unless they’re more natural, or better investigated, and even then the update is pretty incremental).
In particular, it is outright shocking to most members of the public that AI systems could behave in this way. Their crux is often ‘yeah but like… it just can’t do that, right?’ To then say ‘Well, in experimental settings testing for this behavior, they can!’ is pretty powerful (although it is, unfortunately, true that most people can’t interrogate the experimental design).
“Indicating that alignment faking is emergent with model scale” does not, to me, mean ‘there exists a red line beyond which you should expect all models to alignment fake’. I think it means something more like ‘there exists a line beyond which models may begin to alignment fake, dependent on their other properties’. MIRI would probably make a stronger claim that looks more like the first (but observe that that line is, for now, in the future); I don’t know that Ryan would, and I definitely don’t think that’s what he’s trying to do in this paper.
Ryan Greenblatt and Evan Hubinger have pretty different beliefs from the team that generated the online resources, and I don’t think you can rely on MIRI to provide one part of an argument, and Ryan/Evan to provide the other part, and expect a coherent result. Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just distinct parties.
Yeah extremely fair, I wrote this quickly. I don’t mean to attribute to Greenblatt the MIRI view.