Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter.
DMs open, especially for promising opportunities in AI Safety and potential collaborators. I’m maybe interested in helping you optimize the communications of your new project.
But, as I understand the AO paper, they do not actually validate on models like these; they only test on models that they have trained to be misaligned in fairly straightforward ways. I still think it's good work; I'm more just pointing out a gap that future work could address.