That’s funny. I spent a fair bit of time creating an AI safety knowledge bench, sourced from LessWrong/papers/Redwood/METR etc.
It basically also instantly saturated.
Or like, to make it not instantly saturate, I had to make the questions so specific that they were measuring nonsense. Like asking “In the Alignment Faking paper, what was the refusal rate of Llama 3.1-8B in the minimal reproduction of the helpful-only setting with prefix-matching as a proxy for non-refusals?” or “what was Paul Christiano’s reply to SMK’s reply to his essay ‘three reasons to cooperate’?”