I agree this stuff is important, mostly because I think that scheming is a big part of where catastrophic risk from AI comes from.
Yes, there is a general question I want to talk about, which is the gap between training, evaluation, and deployment, and the reasons why models might:
1. Be able to tell which of these environments they are in
2. Act differently based on that