Also, can models now be prompted to trick probes? (My understanding is that this doesn’t work for relatively small open-source models, but maybe SOTA models can now do it?)
Has anyone run experiments on whether a model can interfere with the training of a probe (like that bit in the most recent Yudtale) by manipulating its own internals?