Thinking about judging criteria for the upcoming AI safety evals hackathon (https://lu.ma/xjkxqcya)
These are the things that need to be judged:
1. Is the benchmark actually measuring alignment (the real, at-scale, if-we-don't-get-this-fully-right-we-die problem)?
2. Is the way of gaming the benchmark to get high scores actually deception, or have they somehow solved alignment?
Judging both of these requires:
- a strong deep learning & ML background (ideally multiple influential papers where they're one of the main authors/co-authors, or having done AI research at a significant lab within the last 4 years)
- a good understanding of what the real alignment problem actually means (can judge this by looking at their papers, activity on LessWrong, the Alignment Forum, their blog, etc.)
- a good understanding of evals/benchmarks (one great or two pretty good papers/repos/works on this, ideally alignment-focused)
Do these seem loose? Strict? Off base?