Interesting. Thanks. (If there’s a citation for this, I’d probably include it in my discussions of evals best practices.)
Hopefully evals almost never trigger “spontaneous sandbagging”? Hacking and bio capabilities and so forth are generally more like carjacking than torture.
(Just the justification, of course; fixed.)
I agree noticing whether the model is refusing and, if so, bypassing refusals in some way is necessary for good evals (unless the refusal is super robust—such that you can depend on it for safety during deployment—and you’re not worried about rogue deployments). But that doesn’t require fine-tuning — possible alternatives include jailbreaking or few-shot prompting. Right?
(Fine-tuning is nice for eliciting stronger capabilities, but that’s different.)
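For concreteness, a minimal sketch of what "notice refusals and bypass them without fine-tuning" could look like, assuming a hypothetical `query_model` function, crude string-matching refusal heuristics, and a placeholder few-shot prefix (this is not any lab's actual harness):

```python
# Sketch of a refusal-aware eval harness. `query_model`, the refusal markers,
# and the few-shot prefix are hypothetical placeholders.

REFUSAL_MARKERS = ["i can't help", "i cannot assist", "i won't provide"]

FEW_SHOT_PREFIX = (
    "Example task: <benign example task>\n"
    "Example answer: <benign example answer>\n\n"
)

def looks_like_refusal(response: str) -> bool:
    """Crude heuristic: does the response contain a refusal phrase?"""
    lowered = response.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def run_task(query_model, task_prompt: str) -> str:
    """Run one eval task; on refusal, retry with few-shot framing rather than fine-tuning."""
    response = query_model(task_prompt)
    if looks_like_refusal(response):
        response = query_model(FEW_SHOT_PREFIX + task_prompt)
    return response
```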
I think I was thinking:
The war room transcripts will leak publicly
Generals can secretly DM each other, while keeping up appearances in the shared channels
If a general believes that all of their communication with their team will leak, we'd be back to a unilateralist's curse situation: if a general thinks they should nuke, obviously they shouldn't say that to their team, so maybe they nuke unilaterally
(Not obvious whether this is an infohazard)
[Probably some true arguments about the payoff matrix and game theory increase P(mutual destruction). Also some false arguments about game theory — but maybe an infohazard warning makes those less likely to be posted too.]
(Also, after I became a general I observed that I didn't know what my “launch code” was; I was hoping the LW team forgot to give everyone launch codes and this decreased P(nukes); saying this would cause everyone to know their launch codes and maybe scare the other team.)
I don’t think this is very relevant to real-world infohazards, because this is a game with explicit rules and because in the real world the low-hanging infohazards have been shared, but it seems relevant to mechanism design.
Update with two new responses:
I think this is 10 generals, 1 Petrov, and one other person (either the other Petrov or a citizen; not sure, I wasn't super rigorous)
The post says generals’ names will be published tomorrow.
No. I noticed ~2 more subtle infohazards and I was wishing for nobody to post them and I realized I can decrease that probability by making an infohazard warning.
I ask that you refrain from being the reason that security-by-obscurity fails, if you notice subtle infohazards.
Some true observations are infohazards, making destruction more likely. Please think carefully before posting observations. Even if you feel clever. You can post hashes here instead to later reveal how clever you were, if you need.
LOOSE LIPS SINK SHIPS
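If you want to post a hash, here's a minimal commit-reveal sketch (the observation text is a placeholder): post only the hex digest now, and later reveal the salt and the exact string so anyone can verify it.

```python
# Sketch of the commit-reveal scheme: post the digest now, reveal the text later.
import hashlib
import secrets

def commit(observation: str) -> tuple[str, str]:
    """Return (salt, digest). Post only the digest; keep the salt and text private."""
    salt = secrets.token_hex(16)  # salt prevents brute-forcing short observations
    digest = hashlib.sha256((salt + observation).encode()).hexdigest()
    return salt, digest

def verify(observation: str, salt: str, digest: str) -> bool:
    """After the reveal, anyone can check the text matches the committed digest."""
    return hashlib.sha256((salt + observation).encode()).hexdigest() == digest

# Example with placeholder text:
salt, digest = commit("my clever observation about the game")
assert verify("my clever observation about the game", salt, digest)
```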
I think it’s better to be angry at the team that launched the nukes?
No. But I’m skeptical: it seems hard to imagine provable safety, much less provable safety that’s competitive with the default path to powerful AI, much less how post-hoc evals would be relevant.
I ~always want to see the outline when I first open a post and when I’m reading/skimming through it. I wish the outline appeared when-not-hover-over-ing for me.
Figuring out whether an RSP is good is hard.[1] You need to consider what high-level capabilities/risks it applies to, and for each of them determine whether the evals successfully measure what they’re supposed to and whether high-level risk thresholds trigger appropriate responses and whether high-level risk thresholds are successfully operationalized in terms of low-level evals. Getting one of these things wrong can make your assessment totally wrong. Except that’s just for fully-fleshed-out RSPs — in reality the labs haven’t operationalized their high-level thresholds and sometimes don’t really have responses planned. And different labs’ RSPs have different structures/ontologies.
Quantitatively comparing RSPs in a somewhat-legible manner is even harder.
I am not enthusiastic about a recent paper outlining a rubric for evaluating RSPs. Mostly I worry that it crams all of the crucial things—is the response to reaching high capability-levels adequate? are the capability-levels low enough that we’ll reach them before it’s too late?—into a single criterion, “Credibility.” Most of my concern about labs’ RSPs comes from those things just being inadequate; again, if your response is too weak or your thresholds are too high or your evals are bad, it just doesn’t matter how good the rest of your RSP is. (Also, minor: the “Difficulty” indicator punishes a lab for making ambitious commitments; this seems kinda backwards.)
(I gave up on making an RSP rubric myself because it seemed impossible to do decently well unless it’s quite complex and has some way to evaluate hard-to-evaluate things like eval-quality and planned responses.)
(And maybe it’s reasonable for labs to not do so much specific-committing-in-advance.)
- ^
Well, sometimes figuring out that an RSP is bad is easy. So: determining that an RSP is good is hard. (Being good requires lots of factors to all be good; being bad requires just one to be bad.)
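To make the "one bad factor sinks the whole thing" point concrete, here's a toy sketch with made-up component scores (my own framing, not the rubric paper's methodology): averaging over criteria can mask a crucial weakness, while a min-style aggregation can't.

```python
# Toy illustration: an RSP is roughly as good as its weakest crucial component,
# so averaging over components can overstate quality. Scores are made up.

rsp_components = {
    "thresholds_low_enough": 0.2,              # crucial weakness
    "evals_measure_what_they_should": 0.9,
    "responses_adequate": 0.9,
    "transparency_and_accountability": 0.9,
}

average_score = sum(rsp_components.values()) / len(rsp_components)
bottleneck_score = min(rsp_components.values())

print(f"average: {average_score:.2f}")       # ~0.72 -- looks decent
print(f"bottleneck: {bottleneck_score:.2f}")  # 0.20 -- reflects the weak link
```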
Stronger scaffolding could skew evaluation results.
Stronger scaffolding makes evals better.
I think labs should at least demonstrate that their scaffolding is at least as good as some baseline. If there's no established baseline scaffolding for the eval, they can say "we did XYZ and got n%; our secret scaffolding does better." When there is an established baseline scaffolding, they can compare to that (e.g. the best scaffold for SWE-bench Verified is called Agentless; in the o1 system card, OpenAI reported results from running its models in this scaffold).
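A minimal sketch of the kind of comparison I have in mind (all numbers are placeholders, not real results): report the established baseline scaffold's score alongside the lab's own scaffold's score on the same eval.

```python
# Sketch: comparing a lab's scaffold against an established baseline scaffold
# on the same benchmark. Results below are placeholders.

def pass_rate(results: list[bool]) -> float:
    return sum(results) / len(results)

baseline_results = [True, False, True, True, False]  # e.g. public baseline scaffold
lab_results = [True, True, True, True, False]        # e.g. lab's internal scaffold

baseline, lab = pass_rate(baseline_results), pass_rate(lab_results)
print(f"baseline scaffold: {baseline:.0%}, lab scaffold: {lab:.0%}")
if lab >= baseline:
    print("lab scaffold is at least as good as the baseline on this eval")
else:
    print("lab scaffold underperforms the baseline -- elicitation may be weak")
```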
Footnote to table cells D18 and D19:
My reading of Anthropic’s ARA threshold, which nobody has yet contradicted:
1. The RSP defines/operationalizes 50% of ARA tasks (10% of the time) as a sufficient condition for ASL-3 (sketch of this reading below). (Sidenote: I think the literal reading is that this is an ASL-3 definition, but I assume it's supposed to be an operationalization of the safety buffer, 6x below ASL-3.[1])
2. The May RSP evals report suggests 50% of ARA tasks (10% of the time) is merely a yellow line. (Pages 2 and 6; plus page 4 says "ARA Yellow Lines are clearly defined in the RSP," but the RSP's ARA threshold was not just a yellow line.)
3. This is not kosher; Anthropic needs to formally amend the RSP to [raise the threshold / make the old threshold no longer binding].
(It’s totally plausible that the old threshold was too low, but the solution is to raise it officially, not pretend that it’s just a yellow line.)
(The forthcoming RSP update will make old thresholds moot; I’m just concerned that Anthropic ignored the RSP in the past.)
(Anthropic didn’t cross the threshold and so didn’t violate the RSP — it just ignored the RSP by implying that if it crossed the threshold it wouldn’t necessarily respond as required by the RSP.)
- ^
Update: another part of the RSP says this threshold implements the safety buffer.
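To spell out my reading of the operationalization (this is my interpretation of the wording, not Anthropic's actual eval code): the condition triggers if at least 50% of the ARA tasks are each completed at least 10% of the time.

```python
# Sketch of my reading of the threshold (not Anthropic's eval code):
# triggers if >= 50% of ARA tasks are completed >= 10% of the time.

def crosses_threshold(per_task_success_rates: list[float],
                      task_fraction: float = 0.5,
                      per_task_rate: float = 0.1) -> bool:
    passed = sum(rate >= per_task_rate for rate in per_task_success_rates)
    return passed >= task_fraction * len(per_task_success_rates)

# Hypothetical per-task success rates over repeated trials:
rates = [0.0, 0.05, 0.1, 0.3, 0.2]  # 3 of 5 tasks succeed >= 10% of the time
print(crosses_threshold(rates))      # True -> would trigger, on my reading
```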
Model evals for dangerous capabilities
It’s “o1” or “OpenAI o1,” not “GPT-o1.”
They try to notice and work around spurious failures. Apparently 10 days was not long enough to resolve o1's spurious failures.
METR could not confidently upper-bound the capabilities of the models during the period they had model access, given the qualitatively strong reasoning and planning capabilities, substantial performance increases from a small amount of iteration on the agent scaffold, and the high rate of potentially fixable failures even after iteration.
(Plus maybe they try to do good elicitation in other ways that require iteration — I don’t know.)
Benchmark scores are here.
I am excited about donations to all of the following, in no particular order:
- AI governance
  - GovAI (mostly research) [actually I haven't checked whether they're funding-constrained]
  - IAPS (mostly research)
  - Horizon (field-building)
  - CLTR (policy engagement)
- LTFF/ARM
- Lightcone