joshc comments on How might we safely pass the buck to AI?

joshc 20 Feb 2025 22:33 UTC
LW: 2 AF: 1
Yeah, that’s fair. Currently I merge “behavioral tests” into the alignment argument, but that’s a bit clunky, and I probably should have just made the carving:
1. looks good in behavioral tests
2. is still going to generalize to the deferred task

But my guess is that we agree on the object level here and there’s just a terminology mismatch. Obviously the models have to actually behave in a manner that is at least as safe as human experts, in addition to displaying comparable capabilities on all safety-related dimensions.