Cleo Nardo comments on Can SAE steering reveal sandbagging?

Cleo Nardo 15 Apr 2025 12:58 UTC
2 points
This metric also ignores invalid answers (refusals or gibberish).

If you don’t ignore invalid answers, do the results change significantly?
- jordinne 16 Apr 2025 11:21 UTC
  1 point
  Refusals were mostly 1-2%, so ignoring them doesn’t change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers, this shouldn’t matter.
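
To make the distinction in this exchange concrete, here is a minimal sketch of the two ways the metric could be computed: accuracy over valid answers only, versus accuracy over all answers. The function name `accuracy`, the label strings, and the counts are all hypothetical and chosen only for illustration; they are not the numbers or code from the post.

```python
from collections import Counter


def accuracy(labels, ignore_invalid):
    """Fraction of correct answers, optionally dropping invalid ones.

    `labels` is a list of strings, each one of "correct", "wrong",
    "refusal", or "gibberish" (hypothetical label names).
    """
    counts = Counter(labels)
    total = len(labels)
    if ignore_invalid:
        # Invalid answers (refusals and gibberish) leave the denominator.
        total -= counts["refusal"] + counts["gibberish"]
    return counts["correct"] / total if total else 0.0


# Made-up data: 50 correct, 30 wrong, 2 refusals, 18 gibberish.
labels = ["correct"] * 50 + ["wrong"] * 30 + ["refusal"] * 2 + ["gibberish"] * 18

acc_valid_only = accuracy(labels, ignore_invalid=True)   # 50 / 80
acc_all = accuracy(labels, ignore_invalid=False)         # 50 / 100
```

In this sketch the number of correct answers is identical under both settings; only the denominator changes. Whether that difference matters depends on whether one compares raw correct counts or the resulting ratios.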