Fabien Roger comments on Fuzzing LLMs sometimes makes them reveal their secrets

Fabien Roger 4 Mar 2025 12:23 UTC
2 points
0
By doing more search around promising vectors found with random search or MELBO, you could get more powerful vectors, and that could be useful for unlocking / fuzzing-adversarial-training. It’s unclear if that would be more effective than just fine-tuning the model on the generation from the best random vectors, but it would be worth trying.
For interp, I don’t know what interp metric you want to optimize. Vector norm is a really bad metric: effective MELBO vectors have a much smaller norm, but qualitatively I find their results are sometimes much more erratic than those of random vectors that have 8x bigger norm (see e.g. the MELBO completion starting with “}}{{”). I don’t know what kind of sparsity you would want to encourage. Maybe you could use regularization like “behavior on regular Alpaca prompt stays the same” to favor vectors with fewer side effects? But I’d guess that by “meaningfulness” you hoped for sth stronger than absence of side effects.
- Matt Levinson 4 Mar 2025 18:27 UTC
  1 point
  0
  Parent
  What I was hinting at above was trying to be in the spirit of MELBO, seeing if we can find meaningful vectors without looking at model output effects. You could imagine we could come up with heuristics on something like the variance of independent first derivatives of each neuron as we shrink or grow R. That is to say, what we’re not looking for is all dimensions growing/shrinking ~equally as we shift R. Other patterns would give higher variance in the rates of change. You could imagine lots of variants of that kind of thing.
  
  This also makes me think we could expand this to leverage model outputs better. As I think has been mentioned around MELBO and DCT, it shouldn’t be that hard to use another model to score output. In the lying about reasoning use case, if we told any high-parameter model that the data in question had a special mark for the correct answer and any reasoning about why the model chose the right answer beyond that was a lie, it could easily mark answers for us as lying or not. That would be slower than orthogonality and norm constrained distance maximization, but would open up all kinds of other search options. In fact the ensemble samplers would kill this problem with output ratings as the metric instead of constrained distance.