Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter.
DMs open, especially for promising opportunities in AI Safety and potential collaborators. I’m maybe interested in helping you optimize the communications of your new project.
Claude Opus 4.5 is the first model that I feel could deceive me in some domains if it wanted to. It still seems to have a low propensity to deceive, since the soul spec puts a veneer of goodness on it, but I tend to avoid trusting it to make decisions for me or to update my plans too dramatically unless I can verify the reasoning myself and be highly sure of it.