To be clear: we are most definitely not yet claiming that we have an actual safety case for why Claude is aligned. Anthropic’s RSP includes that as an obligation once we reach AI R&D-4, but currently I think you should read the model card more as “just reporting evals” than “trying to make an actual safety case”.
Uh, to be honest, I’m not sure why that’s supposed to make me feel better. The substantive argument here is that the process by which safety assessments are produced is flawed, and the response is “well the procedure is flawed but we’ll come up with a better one by the time it gets really dangerous”.
My response would be that if you don’t have a good procedure when the models are stateless and passive, you probably will find it difficult to design a better one when models are stateful and proactive.
I was going to write a similar response, though I'd add that Anthropic's current aim, AFAICT, is to build recursively self-improving models, ones which Dario seems to believe might be far smarter than any person alive as early as next year. If the current state of alignment testing is "there's a substantial chance this paradigm completely fails to catch alignment problems," as I took nostalgebraist to be arguing, it raises the question of how this might transition into "there's essentially zero chance this paradigm fails" on the timescale of what might amount to only a few months. I am currently failing to see that connection. If Anthropic's response to a criticism of their alignment safety tests is that the tests weren't actually intended to demonstrate safety, then it seems incumbent on Anthropic to explain how they might soon change that.