Dumping out a lot of thoughts on LW in hopes that something sticks. Eternally upskilling.
I write the ML Safety Newsletter.
DMs open, especially for promising opportunities in AI Safety and potential collaborators. I’m maybe interested in helping you optimize the communications of your new project.
As I understand it, you mostly can’t mount this type of attack through chat UIs, because special delimiter tokens mark where each turn begins and ends. Anthropic also no longer allows assistant prefill for Opus 4.6, citing exactly this reason.