Fabien Roger comments on Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger 1 Mar 2025 18:16 UTC
2 points
0
altering discovered vectors that make meaningful but non-gibberish changes
Yes, look at how the vectors with highest performance have a much higher performance than the average vector in many of my experiments. Tuning the norm could be a good way of checking that though.
see if some dimensions preferentially maintain their magnitude
What do you mean? Do you mean the magnitude of the effect when you reduce the norm?
- Matt Levinson 3 Mar 2025 19:16 UTC
  1 point
  0
  Parent
  I was thinking in terms of moving towards interpretability. We have no reason to believe that meaningful steering vectors should cluster around a given norm. We also have no reason to believe that effective steering vectors can all be scaled to a common norm without degrading the interesting/desired effect. This version of random search (through starting seed) and local optimization is a cool way to get a decent sampling of directions. I’m wondering if one could get “better” or “cleaner” results by starting from the best results from the search and then trying to optimize them increasing or decreasing temperature.
  The hope would be that some dimensions would preferentially grow/shrink. We could interpret this as evidence that the “meaningfulness” of the detected steering vector has increased, perhaps even use a measure of that as part of a new loss or stopping rule.
  One other thing I wonder is if anyone has worked on bringing in ideas from ensemble sampling from the statistics and applied math literature? Seems like it might be possible to use some ideas from that world to more directly find sparser, semantically meaningful steering vectors. Maybe @TurnTrout has worked on it?
  - Fabien Roger 4 Mar 2025 12:23 UTC
    2 points
    0
    Parent
    By doing more search around promising vectors found with random search or MELBO, you could get more powerful vectors, and that could be useful for unlocking / fuzzing-adversarial-training. It’s unclear if that would be more effective than just fine-tuning the model on the generation from the best random vectors, but it would be worth trying.
    For interp, I don’t know what interp metric you want to optimize. Vector norm is a really bad metric: effective MELBO vectors have a much smaller norm, but qualitatively I find their results are sometimes much more erratic than those of random vectors that have 8x bigger norm (see e.g. the MELBO completion starting with “}}{{”). I don’t know what kind of sparsity you would want to encourage. Maybe you could use regularization like “behavior on regular Alpaca prompt stays the same” to favor vectors with fewer side effects? But I’d guess that by “meaningfulness” you hoped for sth stronger than absence of side effects.
    - Matt Levinson 4 Mar 2025 18:27 UTC
      1 point
      0
      Parent
      What I was hinting at above was trying to be in the spirit of MELBO, seeing if we can find meaningful vectors without looking at model output effects. You could imagine we could come up with heuristics on something like the variance of independent first derivatives of each neuron as we shrink or grow R. That is to say, what we’re not looking for is all dimensions growing/shrinking ~equally as we shift R. Other patterns would give higher variance in the rates of change. You could imagine lots of variants of that kind of thing.
      
      This also makes me think we could expand this to leverage model outputs better. As I think has been mentioned around MELBO and DCT, it shouldn’t be that hard to use another model to score output. In the lying about reasoning use case, if we told any high-parameter model that the data in question had a special mark for the correct answer and any reasoning about why the model chose the right answer beyond that was a lie, it could easily mark answers for us as lying or not. That would be slower than orthogonality and norm constrained distance maximization, but would open up all kinds of other search options. In fact the ensemble samplers would kill this problem with output ratings as the metric instead of constrained distance.