I agree that the restricted set of classifiers setting is significantly different. (Edit: To directly respond to your point that classifiers could be closed over complements. I don’t think this should matter much. Bob prefers to play classifiers that are finer-grained whenever possible, so there’s a preferred ‘direction’ for prompts. Evil classifiers are also asymmetric in that, unlike their negations, they are hard to jointly satisfy without being visibly bad—i.e. reducing reward.)
Regarding the jump from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”: I think this hinges heavily on what the requirement is. I gave this some more thought, and here’s how I’m currently thinking about it:
There are three possible levels of safety requirements (and corresponding assumptions on classifiers) we might be interested in. From weakest to strongest:
1. No mode collapse. E.g. Alice’s distribution must cover almost all experiments which follow from reasonable initial seeds/prompts.
2. Low correlation over side objectives. E.g. in a sequential setting (cf. my other comment, #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
3. High-stakes. I.e. low total probability over all evil side objectives.
We can translate this into what we need to hold about the classifiers. Call the subset of classifiers flagging possible safety issues {S}, a subset of Bob’s action space {E}.
1. No mode collapse: in this case, we assume it suffices to force Alice to satisfy almost all E in {E} with some probability eps.
2. Low correlation over side objectives: here it may be the case that some S_i is always triggered, but as long as Alice has to spread mass over all S, we assume we’re fine. This may be stronger than (1), since we’re now making a claim about all S rather than approximately all E.
3. High-stakes: here we need there to be many more E which have null intersection with all S, i.e. which enforce goodness. That’s a very strong assumption, and I doubt it holds if we take, say, S as code backdoors and E as prompts. (This is basically your point in the top-level comment.)
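To make the three translated conditions concrete, here is a toy sketch (all names and parameters hypothetical, not from the original discussion): model outputs as a finite set, each classifier as the set of outputs it accepts, and Alice’s policy as a probability distribution over outputs. The three conditions then become simple checks on that distribution:

```python
# Toy model: outputs are integers; a classifier is the set of outputs it
# accepts; Alice's policy is a dict mapping output -> probability.

def mass_satisfying(policy, classifier):
    """Total probability Alice places on outputs this classifier accepts."""
    return sum(p for y, p in policy.items() if y in classifier)

def no_mode_collapse(policy, E_set, eps=0.01, frac=0.95):
    """(1) Almost all classifiers E receive at least eps mass."""
    ok = sum(1 for E in E_set if mass_satisfying(policy, E) >= eps)
    return ok >= frac * len(E_set)

def low_correlation(policy, S_set, eps=0.01):
    """(2) Mass is spread over *all* safety-relevant S (all, not almost all)."""
    return all(mass_satisfying(policy, S) >= eps for S in S_set)

def high_stakes(policy, S_set, delta=0.01):
    """(3) Total probability of outputs flagged by any S is at most delta."""
    flagged = set().union(*S_set) if S_set else set()
    return mass_satisfying(policy, flagged) <= delta
```

For instance, a uniform policy over 100 outputs with 10 disjoint classifiers satisfies (1) and (2) but fails (3) whenever any S flags more than a delta-fraction of outputs, which mirrors why (3) is the much stronger requirement.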
My intuition that (1) and (2) are not too hard comes from the assumption that finding y that satisfy many E (or S, respectively) at once is hard, or impossible. I have in mind questions like ‘Propose improvements to my red-teaming algorithm’ and then classifiers like ‘Your solution must draw on paper X’. Also, y that satisfy many S should be clearly bad, such that your original reward or control audits block them.
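As a quick illustration of the hardness assumption (a toy sketch with hypothetical parameters, not a claim about real classifiers): if each classifier independently accepts a fraction q of outputs, the fraction of outputs satisfying k classifiers jointly shrinks like q**k, so jointly satisfying many classifiers quickly becomes rare.

```python
import random

random.seed(0)
N = 100_000  # toy output space
q = 0.5      # each classifier independently accepts half of all outputs
k = 10       # number of classifiers to satisfy jointly

# Each classifier is an independent random subset of outputs.
classifiers = [{y for y in range(N) if random.random() < q} for _ in range(k)]

joint = set.intersection(*classifiers)
print(len(joint) / N)  # close to q**k, i.e. roughly one output in a thousand
```

Of course, real classifiers are correlated rather than independent, so this only sketches the direction of the effect, not its magnitude.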