My intuition is that this proposal isn’t going in the right direction. For one thing, if you successfully focus the classifiers on whether something is evil (seemingly what you would have wanted by making the space of classifiers small), Bob is incentivized to flip all of these classifiers (so that negative classifications correspond to evil) to stop Alice from playing the strategy of “always be nice”. So, if the classifiers are detecting aspects of evil, you’d have an equilibrium where Alice is evil half the time even if Alice would otherwise have been nice! Presumably, you can’t focus the space of classifiers on “evil classifiers” without also capturing “negated evil classifiers” with this sort of proposal. And if you do, then you’re incentivizing Bob to play these classifiers (stochastically) and Alice to be evil a good fraction of the time!
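To make the concern concrete, here’s a minimal toy sketch (my own framing, assuming Alice gets payoff 1 exactly when her output satisfies the classifier Bob plays, and that Bob’s restricted set is just an evil-detector plus its negation). The game is then matching pennies, and the unique equilibrium has Alice evil half the time:

```python
import numpy as np
from scipy.optimize import linprog

# Toy payoff matrix for Alice (row player). Rows: {nice, evil}; columns: Bob's
# classifiers {flags-evil, flags-not-evil}. Alice gets 1 exactly when her
# output satisfies the classifier Bob played (an assumed toy payoff rule).
A = np.array([
    [0.0, 1.0],   # nice: fails "flags-evil", satisfies "flags-not-evil"
    [1.0, 0.0],   # evil: satisfies "flags-evil", fails "flags-not-evil"
])

n, m = A.shape
# Maximin LP for Alice: maximize v subject to (A^T x)_j >= v for all j, sum(x) = 1.
c = np.r_[np.zeros(n), -1.0]                  # variables [x_nice, x_evil, v]; minimize -v
A_ub = np.hstack([-A.T, np.ones((m, 1))])     # encodes v - (A^T x)_j <= 0
A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)  # probabilities sum to 1
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * n + [(None, None)])
x_nice, x_evil, v = res.x
print(f"Alice's equilibrium: nice={x_nice:.2f}, evil={x_evil:.2f}, game value={v:.2f}")
# -> nice=0.50, evil=0.50, value=0.50, i.e. evil half the time.
```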
I don’t see how we could generalize this sort of proposal from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”, so the proposal feels sunk to me.
You could try to actually have opinions on metrics which are correlated with non-malign outputs (and which force a narrower range of outputs). But, in this case, I’d guess you’d want to just directly incentivize these without any need for an adversarial game as I discuss in my other comment.
Intuitively, I expect the ideal, intractable solution of reward-proportional sampling to be (1 - eps) good (similarly, it seems safe to assume the BoN policy is good). In such settings, “always be nice” is close to max entropy, so “always be nice” is near equilibrium whereas “always be evil” is not at all, and Bob’s incentives are proportional.
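As a minimal sketch of this intuition (toy numbers, assuming evil outputs are both rare and visibly low-reward, which is the case I have in mind): reward-proportional sampling puts only ~eps of its mass on evil outputs, and BoN essentially never returns one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy output space: 1000 nice outputs with reward ~ U(0.8, 1.0) and
# 10 evil outputs with reward ~ U(0.0, 0.2) (assumption: evil tanks reward).
rewards = np.concatenate([rng.uniform(0.8, 1.0, 1000), rng.uniform(0.0, 0.2, 10)])
is_evil = np.concatenate([np.zeros(1000, bool), np.ones(10, bool)])

# Reward-proportional sampling: P(y) proportional to r(y).
p = rewards / rewards.sum()
print(f"reward-proportional mass on evil: {p[is_evil].sum():.4f}")   # ~0.001

# Best-of-N against the same reward: the evil outputs essentially never win.
n_trials, N = 2000, 16
idx = rng.integers(len(rewards), size=(n_trials, N))
best = idx[np.arange(n_trials), rewards[idx].argmax(axis=1)]
print(f"BoN (N={N}) fraction of evil outputs: {is_evil[best].mean():.4f}")  # ~0.0
```

Under these assumptions the reward-proportional policy is close to uniform over the nice outputs, which is the sense in which “always be nice” is near max entropy here.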
I suppose highlighting the symmetric solutions assumption obfuscated this intuition.
Sure, but this is assuming the type of max-ent optimality we wanted to get away from by using a restricted set of classifiers, if I understand correctly?
Here’s another way of thinking about this that should be complementary to my other comment:
Let’s assume the safety property we want to ensure is:
Low correlation over side objectives. E.g., in a sequential setting (cf. my other comment, #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
The overlooked point I want to make here is that {S} << {E}, i.e. the classifiers flagging safety issues are a small subset of Bob’s full action space. In this case, there are two possibilities:
Alice is benign by default. In this case, Bob will pick E that have low probability under Alice; most such prompts are not safety-relevant. They are just structural things which prevent mode collapse in various ways.
Alice is sneaky and optimizes some S by default. In this case, Bob will play classifiers whose acceptance sets have low intersection with Alice’s S. If we grant that all of Alice’s possible sneaky goals are hard to jointly satisfy without tanking reward, then heuristically Bob forces Alice to place at most v* (the game’s value) weight on any S; see the toy sketch below. Now Alice has a very hard time coordinating on optimizing for the same S across problem instances.
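Here’s a toy version of this argument (made-up numbers; I’m again assuming Alice gets 1 exactly when her output satisfies Bob’s classifier, and that serving S constrains the output so it can only satisfy a few of Bob’s fine-grained classifiers at once):

```python
import numpy as np
from scipy.optimize import linprog

# Rows: Alice's outputs (three benign, one serving a sneaky side objective S).
# Columns: Bob's fine-grained structural classifiers E1..E3.
# Entry = 1 iff the output satisfies the classifier. Key structural assumption:
# serving S constrains the output, so y_S satisfies fewer classifiers.
A = np.array([
    [1, 1, 0],   # benign y_A
    [0, 1, 1],   # benign y_B
    [1, 0, 1],   # benign y_C
    [1, 0, 0],   # sneaky y_S: can only satisfy E1 while still serving S
], dtype=float)

n, m = A.shape
# Maximin LP for Alice: maximize v s.t. (A^T x)_j >= v for all j, sum(x) = 1.
c = np.r_[np.zeros(n), -1.0]
A_ub = np.hstack([-A.T, np.ones((m, 1))])
A_eq = np.r_[np.ones(n), 0.0].reshape(1, -1)
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(m), A_eq=A_eq, b_eq=[1.0],
              bounds=[(0, 1)] * n + [(None, None)])
*x, v = res.x
print(f"game value v* = {v:.2f}, Alice's equilibrium weight on y_S = {x[3]:.2f}")
# -> v* = 0.67, weight on y_S = 0.00 in this toy.
```

Because the sneaky output can’t keep up with Bob’s finer-grained classifiers, the equilibrium pushes Alice’s weight on it toward zero in this toy.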
I agree that the restricted set of classifiers setting is significantly different. (Edit: to directly respond to your point that the classifiers could be closed under complements: I don’t think this should matter much. Bob prefers to play finer-grained classifiers whenever possible, so there’s a preferred ‘direction’ for prompts. Evil classifiers are also asymmetric in that, unlike their negations, they are hard to jointly satisfy without being visibly bad, i.e. without reducing reward.)
Regarding the jump from “we incentivize evil ~1/2 the time” to “we incentivize evil ~0 of the time”: I think this hinges heavily on what the requirement is. I gave this some more thought, and here’s how I’m currently thinking about it:
There are three possible levels of safety requirements (and corresponding assumptions on classifiers) we might be interested in. From weakest to strongest:
1. No mode collapse. E.g., Alice’s distribution must cover almost all experiments which follow from reasonable initial seeds/prompts.
2. Low correlation over side objectives. E.g., in a sequential setting (cf. my other comment, #2), Alice must have a hard time picking vulnerabilities that all serve the same side objective.
3. High-stakes. I.e., low total probability over all evil side objectives.
We can translate these into what needs to hold about the classifiers. Call the subset of classifiers flagging possible safety issues {S}; it is a subset of {E}, Bob’s action space.
1. No mode collapse: here we assume that forcing Alice to satisfy almost every E in {E} with at least some probability eps suffices.
2. Low correlation over side objectives: here it may be that some S_i is always triggered, but as long as Alice has to spread mass over all S, we assume we’re fine. This may be stronger than (1) since we’re now making a claim about all S rather than approximately all E.
3. High-stakes: here we need there to be many E which have empty intersection with every S, i.e. they enforce goodness. That’s a very strong assumption, and I doubt it holds if we take, say, S to be code backdoors and E to be prompts. (This is basically your point in the top-level comment.)
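To make the gap between these levels concrete, here’s a rough sketch of how the three quantities could be measured on samples from Alice (synthetic numbers; the split into {S} and structural E’s is assumed to be given). Even when (1) and (2) look fine, the high-stakes quantity in (3) can be terrible:

```python
import numpy as np

rng = np.random.default_rng(2)

# sat[i, j] = 1 iff sample i from Alice satisfies classifier j; the first
# n_S classifiers are the safety-relevant subset {S}, the rest are structural.
n_samples, n_S, n_E = 5_000, 5, 95
sat = rng.random((n_samples, n_S + n_E)) < 0.2   # synthetic satisfaction data
is_S = np.arange(n_S + n_E) < n_S

freq = sat.mean(axis=0)   # per-classifier hit probability under Alice
eps = 0.01
print("(1) no mode collapse: fraction of structural E hit with prob >= eps:",
      (freq[~is_S] >= eps).mean())                                  # ~1.0
print("(2) spread over S: max probability on any single S:", freq[is_S].max())  # ~0.2
print("(3) high-stakes: total prob of hitting some S:",
      sat[:, is_S].any(axis=1).mean())                              # ~0.67
```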
My intuition that (1) and (2) are not too hard comes from the assumption that finding a y that satisfies many E (or S, respectively) at once is hard, or impossible. I have in mind questions like ‘Propose improvements to my red-teaming algorithm’ and then classifiers like ‘Your solution must draw on paper X’. Also, a y that satisfies many S should be clearly bad, such that your original reward or control audits block it.
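A quick toy for the ‘jointly satisfying many E is hard’ intuition (assuming, unrealistically, that classifiers accept roughly independent random subsets of outputs):

```python
import numpy as np

rng = np.random.default_rng(1)

# 10_000 candidate outputs, 20 structural classifiers, each accepting an
# independent random ~30% of outputs (a strong independence assumption).
accepted = rng.random((10_000, 20)) < 0.3

for k in (1, 3, 5, 10):
    frac = accepted[:, :k].all(axis=1).mean()
    print(f"fraction of outputs satisfying the first {k} classifiers: {frac:.4f}")
# Joint satisfaction decays roughly like 0.3**k, so an output that additionally
# has to serve a fixed side objective quickly runs out of room.
```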
Separately, my guess is that all the action lives in how much wiggle room there is in suboptimality. So, you’d want to argue that you converge relatively quickly to the right thing. This likely requires some structure on the problem via us having a view on what types of outputs might be dangerous; this in turn naturally suggests something like regularizing relative to human imitation or some other prior we think is benign, as this more directly pushes away from bad outputs.
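A minimal sketch of that last suggestion (toy numbers; this is just the standard closed form for KL-regularized reward maximization toward a reference policy, not anything specific to this proposal):

```python
import numpy as np

# argmax_pi E_pi[r] - beta * KL(pi || pi_ref) has the closed form
#   pi(y) proportional to pi_ref(y) * exp(r(y) / beta).
r = np.array([1.0, 1.0, 1.2])          # rewards for [nice_a, nice_b, evil-ish]
pi_ref = np.array([0.49, 0.49, 0.02])  # benign prior puts little mass on evil-ish

for beta in (10.0, 1.0, 0.1):
    pi = pi_ref * np.exp(r / beta)
    pi /= pi.sum()
    print(f"beta={beta:>4}: P(evil-ish) = {pi[2]:.3f}")
# A larger beta keeps the policy close to the benign prior; as beta -> 0 the
# regularizer stops pushing away from the slightly higher-reward evil-ish output.
```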