But I also think it’s potentially quite dangerous, and corrosive to the epistemic commons, to expect such concreteness before we’re ready to give it.
As I mention in the post, we do have the ability to do concrete capabilities evals right now. What we can’t do are concrete safety evals, and I’m very clear that I don’t expect us to have those right now.
And I’m not expecting that we’ll eventually solve the problem of building good safety evals either—but I am describing a way in which things go well that involves a solution to that problem. If we never solve the problem of understanding-based evals, then my particular sketch doesn’t work as a way to make things go well: but that’s how any success story has to work right now, given that we don’t yet know how to make things go well. And actually telling success stories is an important thing to do!
If you have an alternative success story that doesn’t involve solving safety evals, tell it! But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don’t yet know how to solve—which every success story has to do—seems extremely unfair.
It also seems to me like something called a “responsible scaling plan” should at the very least have a convincing story to tell about how we might get from our current state, with the primitive understanding we have, to the end-goal of possessing the sort of understanding that is capable of steering a godly power the likes of which we have never seen.
This post is not a responsible scaling plan. Your whole comment seems to be conflating stuff I’m saying with stuff in the Anthropic RSP. This post is about my thoughts on RSPs in general—which do not necessarily represent Anthropic’s thoughts on anything—and it isn’t really about Anthropic’s RSP at all.
Regardless, I’m happy to give my take. I don’t think that anybody currently has a convincing story to tell about how to get a good understanding of AI systems, but you can read my thoughts on how we might get to one here.
I agree with you that having concrete asks would be great, but I think they’re only great if we actually have the right asks. In the absence of robust measures and evaluations that give us high confidence about the safety of AI systems, and in the absence of a realistic plan to get them, I think demanding them may end up being actively harmful. Harmful because people will walk away feeling like AI Safety “knows” more than it does, and will hence feel more secure than is warranted.
It sounds like you’re disagreeing with me, but everything you’re saying here is consistent with everything I said. The whole point of my proposal is to understand what evals we can trust and when we can trust them, set up eval-gated scaling in the cases where we can do concrete evals, and be very explicit about the cases where we can’t.
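To make the shape of that proposal concrete, here’s a minimal sketch of what eval-gated scaling could look like as a decision procedure. Everything in it is hypothetical (the eval names, thresholds, and scores are made up, not drawn from any actual RSP); the point is just that scaling proceeds only when every concrete eval we trust passes, and that the gate fails explicitly wherever no trusted eval exists yet.

```python
# A minimal sketch of eval-gated scaling. All eval names, thresholds, and
# scores below are made up for illustration; the point is the gate logic.
from dataclasses import dataclass
from typing import Callable

@dataclass
class CapabilityEval:
    name: str
    run: Callable[[], float]   # hypothetical: runs the eval, returns a score
    danger_threshold: float    # score at or above which scaling is blocked
    trusted: bool              # do we actually believe this eval is meaningful?

def may_scale_further(evals: list[CapabilityEval]) -> bool:
    """Return True only if every eval is trusted and below its threshold.

    An untrusted eval fails the gate outright: if we can't run a concrete
    eval we trust, we don't get to claim the gate has been passed.
    """
    for ev in evals:
        if not ev.trusted:
            print(f"{ev.name}: no trusted concrete eval; gate fails explicitly")
            return False
        score = ev.run()
        if score >= ev.danger_threshold:
            print(f"{ev.name}: score {score:.2f} >= {ev.danger_threshold:.2f}; scaling blocked")
            return False
    return True

# Made-up usage: the capabilities eval passes, but there is no trusted
# safety eval yet, so the gate fails explicitly rather than silently.
evals = [
    CapabilityEval("autonomous-replication", lambda: 0.31, 0.50, trusted=True),
    CapabilityEval("deceptive-alignment", lambda: 0.00, 0.10, trusted=False),
]
print("may scale further?", may_scale_further(evals))  # False
```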
But without any alternative to my success story, critiquing it just for assuming a solution to a problem we don’t yet know how to solve—which every success story has to do—seems extremely unfair.
When assumptions are stated clearly, there’s no value in criticising the mere act of considering what follows from them. But when assumptions are an implicit part of the frame, they become part of the claims rather than part of the problem statement, and criticising them becomes useful for everyone involved, not least because it makes them visible. Putting burdens on criticism, such as requiring concrete alternatives, makes relevant criticism harder to find.
I found this quite hard to parse fyi