Very interesting finding that this transfers across models! Would be keen to see people pick up from here on defences/mitigations with whitebox methods given the LLM judge performance.