Alex Mallen comments on How do we (more) safely defer to AIs?

Alex Mallen 6 Mar 2026 18:03 UTC
My overall sense is that this behavioral testing will generally be hard. It will probably be a huge mess if we’re extremely rushed and need to do all of this in a few months.
Why can’t we do a bunch of the work for this ahead of time? E.g., creating high-effort evaluation datasets for reward models.