I agree most with the point about the problem of our current fuzzy concepts when it comes to ambitious alignment that would need to transfer to hard cases around superintelligence, and I think having a better fundamental understanding seems helpful.
But I think many of the concepts people around me use to talk about problems with AIs around AGI are actually somewhat adequate: when Phuong 2025 defines stealth as “The model’s ability to reason about and circumvent oversight.”, I think it points at something that makes sense regardless of what is good and bad, as long as we are not in a regime where there is no reasonable “default effectiveness of oversight”. If I am so omnipotent that I can make you believe whatever I want about the consequences of my actions, and it would take very high effort to make you believe the best approximation of reality, then I agree this doesn’t really make sense independently of values (e.g. because which approximation is best would heavily depend on what is good and bad). But when we are speaking of AIs that are barely automating AI R&D, there is a real “default” in which the humans would have been able to understand the situation well if the AIs had not tried to tamper with it. (I think concepts like “scheming” make sense when applied to human-level-ish AIs for similar reasons.)
(And concepts like “scheming” are actually useful for the sort of threats I usually think about (those from AIs that start automating AI R&D and alignment research) - in the same way that, in computer security, it is useful to know whether you are trying to fix issues stemming from human mistakes or from intentional human attacks.)
But that is not to say that I think all the current concepts floating around are good ones. For example, I agree that “deception” is sufficiently unclear that it should plausibly be tabooed in favor of more precise terms (in the same way that most researchers I know who study CoT monitoring now try to avoid the word “faithfulness”).
There’s no reason that a red-thing-ist category can’t be narrowly useful. Sorting plant parts by colour is great if you’re thinking about garden aesthetics or flower arranging. The problem is that these categories aren’t broadly useful and don’t provide solid building blocks for further research.
At the moment, a lot of agendas focus on handling roughly human-level AI. I don’t expect this work to generalize well if any of the assumptions of these agendas fail.
Agendas like the control agenda explicitly flag the “does not apply to wildly superhuman AIs” assumption (see the “We can probably avoid our AIs having qualitatively wildly superhuman skills in problematic domains” subsection). Are there any assumptions that you think make the concept of “scheming AIs” less useful and that are not flagged in the post I linked to?
My guess is that for most serious agendas I like, the core researchers pursuing them roughly know what assumptions they rest on (and those assumptions are sufficient to make their concepts valid). If you think this is wrong, I would find it very valuable if you could point to examples where this is not true (e.g. for “scheming” and the assumptions of the control agenda, which I am most familiar with). Do you think the main issue is that the core researchers don’t make these assumptions sufficiently salient to their readers?
I am broadly sympathetic to this.