So, I see the upside, I appreciate you noting the downside, but I am pretty wary.
I think capabilities-evals are probably net negative by default unless you’re actively putting in work to make sure they can’t be used as a benchmark to grind on (i.e., at a minimum, making sure AI labs never have direct access to the eval).
AI Village seems like it’s basically building out the infra for scalable open-world evals. I also think if it takes off, it likely inspires marginal copycats that won’t be at all safety-minded.

This is indeed my main concern.
To elaborate a bit more: I’m kinda taking a leap of faith in humanity here. I am thinking that if the public in general comes to be more aware of the situation w.r.t. AI and AGI and so forth, that’ll be on-net good, even though that means that some particular members of the public, such as the capabilities researchers at the big corporations, will be able to do bad stuff with that info, e.g. accelerate their blind race towards doom. I agree that my faith might be misplaced, but… consider applying the reversal test. Should I be going around trying to suppress information about what AIs are capable of, how fast they are improving, etc.? Should I be recommending to the AI companies that they go silent about that stuff?
I think my main remaining source of hope for the future is that humanity-outside-the-corporations ‘wakes up’, so to speak, before it is too late, and draws fairly obvious conclusions like “Gosh, we don’t really understand how these AIs work that well; maybe when they get smarter than us we’ll lose control, and that would be bad”, “Also, right now ‘we’ means ‘the C-suite of some tech company and/or POTUS/Xi Jinping’, which is not exactly comforting...”, and “Both of these problems could be significantly ameliorated if we ended this crazy arms race to superintelligence, so that we could proceed more cautiously and in a more power-distributed/transparent/accountable way.”
Insofar as I’m right that that’s the main source of hope, it seems like the benefits of broader public & scientific understanding outweigh the costs.
Should I be recommending to the AI companies that they go silent about that stuff?
I do think this isn’t actually equivalent. In a vacuum, if we have information on how competent things are, it seems good to share that information. But the question is “does the mechanism of gaining that information provide an easy feedback loop for companies to use to climb? (and how bad is that?)”, which isn’t a simple reversal.
Fair point that if companies already have the info, that’s different. A proper reversal test would use cases where the info isn’t already widely known within the companies.
There’s also a matter of degree – the thing that seems bad to me is the easy repeatability. (I don’t know how much AI Village itself would have this problem, but the scaled-up versions in your vignettes seem likely to veer into easy repeatability.)
If you can differentially suppress the information that makes it possible to measure how competent an AI is well enough to use that measurement as a target[1], so that people can still see when AIs get better on your metric and be concerned, while trying to optimize for your metric only makes AIs a little better before quickly making them worse, and you unambiguously know you’ve succeeded at this, then your metric is unambiguously protective.
I’m currently pretty concerned that a first major impact of a conversation-grade AI[2] being specifically misaligned[3] is the psychosis stuff, and it seems like maybe that might be amplified by people thinking the AIs are cute, at least right now.
unless/until you can be sure that the target is naturally structured to generalize favorably out of human control, i.e. you’ve solved the alignment problem and your metric reliably measures how good AIs are, even well past 1MW AGI into 1MW ASI and 100W ASI
specifically, I have a hunch that it’s mainly from ChatGPT, seemingly not Claude or Gemini, though I’m not sure of that; the origin of the hunch is that prompts which get ChatGPT to do it seem not to have the same effect on the others.
as opposed to harm caused through AI by human intent, though that’s debatable; it’s unclear why it’s happening, and it could well be some form of human intent, e.g. toward user retention, or toward some other goal for which the psychosis outcome is instrumental