I’m really excited about this line of work. It seems like small tweaks to model architecture could increase monitorability, basically for free. I’m surprised this project wasn’t funded, and I’d like to see more ambitious research projects like this that de-risk small architectural changes for positive safety properties.