Oh, so when steering the LAT model at layer 4, the model actually generates valid responses without refusing?
Yes! At layer 4, about 7% of the LAT model’s responses are refusals, 25% are invalid, and the rest (roughly 68%) are valid non-refusal responses.