if you can differentially suppress the information that would let someone measure an AI's competence well enough to use your metric as a training target[1], then people can still see when AIs get better on your metric and be concerned, while trying to optimize for the metric only makes AIs a little better before quickly making them worse. if you unambiguously know you've succeeded at this, your metric is unambiguously protective.
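to make that failure mode concrete, here's a toy scalar model (entirely my own illustrative construction, with made-up functions and constants, not anything measured or proposed above): a published score that keeps climbing under optimization pressure while true capability rises a little and then falls, because past a point the cheapest way to move the score stops being genuine improvement.

```python
import math

def true_capability(genuine, gaming):
    # genuine work helps with diminishing returns; effort spent gaming the
    # published score (e.g. fitting to leaked eval details) erodes real competence
    return math.log1p(genuine) - 0.05 * gaming ** 2

def published_metric(genuine, gaming):
    # the published score reflects genuine gains but is blind to gaming damage,
    # and gaming moves it at a cheap constant rate
    return math.log1p(genuine) + 0.3 * gaming

genuine = gaming = 0
print(f"{'step':>4} {'metric':>8} {'true cap':>9}")
for step in range(1, 31):
    base = published_metric(genuine, gaming)
    # greedy optimizer: spend the next unit of effort wherever the *published*
    # metric improves most
    if published_metric(genuine + 1, gaming) - base >= published_metric(genuine, gaming + 1) - base:
        genuine += 1
    else:
        gaming += 1
    if step <= 3 or step % 5 == 0:
        print(f"{step:>4} {published_metric(genuine, gaming):8.2f} "
              f"{true_capability(genuine, gaming):9.2f}")
```

the numbers are arbitrary; the only structural assumption is that the published score can't see the damage that metric-gaming does to real competence, which is roughly what differentially suppressing the target-grade information is supposed to buy you.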
I’m currently pretty concerned that a first major impact of a conversation-grade AI[2] being specifically misaligned[3] is the psychosis stuff, and it seems like that might be amplified by people thinking the AIs are cute, at least right now.
unless/until you can be sure that the target is naturally structured to generalize favorably out of human control, i.e. you’ve solved the alignment problem and your metric reliably measures how good AIs are, even well past 1 MW AGI into 1 MW ASI and 100 W ASI
specifically, I have a hunch that it’s mainly from chatgpt, seemingly not claude or gemini, though I’m not sure of that; the hunch comes from noticing that prompts which get chatgpt to do it don’t seem to have the same effect on the others.
as opposed to harm caused through AI by human intent, though that’s debatable; it’s unclear why it’s happening, and it could well stem from some form of human intent, e.g. toward user retention, or toward some other goal for which this psychosis outcome is instrumental