Beth and her team have been working with both Anthropic and OpenAI to perform preliminary evaluations. I don’t think these evaluations are yet at the stage where they provide convincing evidence about dangerous capabilities—fine-tuning might be the most important missing piece, but there is a lot of other work to be done. Ultimately we would like to see thorough evaluations informing decision-making prior to deployment (and training), but for now I think it is best to view it as practice building relevant institutional capacity and figuring out how to conduct evaluations.
I think it’s not quite right to describe this as testing whether a model “can escape human control.” They are attempting to evaluate whether an unmonitored model running in the wild would be capable of acquiring resources and autonomously replicating. This is chosen to be a task with direct relevance to some of the simplest possible stories for an AI takeover; it’s neither necessary nor sufficient, but it seems helpful to explore more grounded evaluations to relate capabilities to risk (especially as it becomes increasingly difficult to assess model capability based on static benchmarks).
Just in case it’s not obvious: I think people are reacting to the lack of caution and paranoia described in the testing document.
The subtext is that if anyone is going to take this seriously, it should be the people involved in ARC, since it’s so closely connected to LessWrong and EA. It’s the ingroup! It’s us! In other words: there are higher expectations on ARC than there are on Microsoft, because we should care the most. We’ve read the most science fiction and spent decades of our lives arguing about it, after all.
Yet it doesn’t sound like testing was taken seriously at all; no security mindset was displayed (if this is a miscommunication, please correct me).
If even we, who have spent many years caring, cannot be careful… then we all die, but with no dignity points.
big_yud_screaming.jpeg
EDIT: if anyone is curious about how paranoid ARC is being… they haven’t told us. But they show a little of their workflow in this job ad. And it looks like a human copies each response manually, or executes each command themselves. This is what they mean by “closely monitored.”
EDIT2: see update from the authors
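For concreteness, here’s a minimal sketch of the kind of copy-paste loop that job ad seems to describe. This is purely illustrative; the function names and structure are my guess, not ARC’s actual harness:

```python
from typing import Callable

def run_monitored_episode(
    propose_command: Callable[[str], str],  # model call: transcript -> next shell command
    task_prompt: str,
    max_steps: int = 20,
) -> str:
    """One eval episode in which nothing is auto-executed: a human runs
    (or refuses to run) every proposed command and pastes back whatever
    output they choose to share with the model."""
    transcript = task_prompt
    for step in range(max_steps):
        command = propose_command(transcript)
        print(f"[step {step}] model proposes:\n  {command}")
        reply = input("Run it yourself and paste the output (or type 'stop'): ")
        if reply.strip().lower() == "stop":
            break
        # Only the human-supplied output enters the transcript; the
        # harness itself never touches a shell.
        transcript += f"\n$ {command}\n{reply}"
    return transcript
```

The point of that design is that the human is the only actuator: the model can propose anything, but every side effect passes through a person first.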
found it ineffective at … avoiding being shut down “in the wild.”
This is what I interpreted as testing whether it would escape human control. I’m guessing OpenAI wanted to shut down the agent after the test, so if it avoided shutdown, the implication is that it escaped OpenAI’s control.