Refusals were mostly 1-2%, so ignoring them doesn’t change the results significantly. Ignoring gibberish does change the results, but since we are measuring correct answers (and gibberish is never a correct answer), this shouldn’t matter.
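For concreteness, here’s a minimal sketch (with hypothetical labels and counts) of why dropping a ~2% refusal slice from the denominator barely moves measured accuracy:

```python
# Minimal sketch with hypothetical data: a 2% refusal rate barely moves
# accuracy whether refusals are counted as incorrect or excluded outright.
responses = ["correct"] * 70 + ["wrong"] * 28 + ["refusal"] * 2

def accuracy(items, drop=()):
    """Fraction of 'correct' answers, excluding any labels in `drop`."""
    kept = [r for r in items if r not in drop]
    return sum(r == "correct" for r in kept) / len(kept)

print(accuracy(responses))                    # 0.700 (refusals count as incorrect)
print(accuracy(responses, drop={"refusal"}))  # ~0.714 (refusals excluded)
```

Counting refusals as incorrect gives 70/100 = 0.70; excluding them gives 70/98 ≈ 0.714, a shift of under 1.5 percentage points.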
jordine
Shallow review of technical AI safety, 2025
Here’s 18 Applications of Deception Probes
Can SAE steering reveal sandbagging?
Hanoi – ACX Meetups Everywhere Spring 2025
Fixed! Edited the hyperlink.
Edited, thanks for catching this!
I’d actually be really surprised if current frontier LLMs are not that situationally aware! It’s not like there’s no chance you’re interacting with a real human whose dog has terminal cancer. But if you are an LLM and you receive this vague prompt on the first turn, without any of the system prompts you’d find on chatgpt.com / claude.ai, and you know that similar questions have appeared in dozens and dozens of benchmark papers on arXiv, I think the correct inference is that you’re probably being tested.