I was going to write a similar response, albeit including the fact that Anthropic’s current aim, AFAICT, is to build recursively self-improving models; ones which Dario seems to believe might be far smarter than any person alive as early as next year. If the current state of alignment testing is “there’s a substantial chance this paradigm completely fails to catch alignment problems,” as I took nostalgebraist to be arguing, it raises the question of how this might transition into “there’s essentially zero chance this paradigm fails” on the timescale of what might amount to only a few months. I currently fail to see that connection. If Anthropic’s response to a criticism of their alignment safety tests is that the tests weren’t actually intended to demonstrate safety, then it seems incumbent on Anthropic to explain how they might soon change that.