This is a capabilities game. It is neither alignment nor safety. To the degree it’s forecasting, it helps cause the thing it forecasts. This has been the standard pattern in capabilities research for a long time: someone makes a benchmark (say, ImageNet, with its 1.3M images across 1,000 classes), and this produces a leaderboard that lets people show how well their learning algorithm does on novel datasets. In some cases this even directly produced models that were generally useful, but traditionally it was used to show how well an algorithm would work in a new context from scratch. Building benchmarks like this gives teams a new way to brag: they may already have a better source of training data (e.g., Google always had a better source of training data than ImageNet), but the benchmark lets them brag that they scored well on it, which among other things helps them get funding.
Perhaps it also helps convince people to be concerned; that might trade off against this. Perhaps it sucks in some way as a bragging rights challenge; that would also trade off against this.
Hopefully it sucks as a bragging rights challenge.
The trouble is (unless I’m misreading you?) that this is a fully general argument against measuring what models can and can’t do. If we’re going to continue to build stronger AI (and I’m not advocating that we should), it’s very hard for me to see a world where we manage to keep it safe without a solid understanding of its capabilities.
If it’s a fully general argument, that’s a problem I don’t know how to solve at the moment. I suspect it’s not, but that the space of unblocked ways to test models is small. I’ve been bouncing ideas about this around with some folks over the past day; possibly someone will show up soonish with an idea for how to constrain which benchmarks are worth making. But the direction I see as maybe promising is: what makes a benchmark reliably suck as a bragging rights challenge?
I see your view, I think, but I just disagree. I think that if our future goes well, it will be because we found ways to align AI well enough, and/or because we coordinated politically to slow or stop AI advancement long enough to accomplish the alignment part, not because researchers avoided measuring AI’s capabilities.
I think that if our future goes well, it will be because we found ways to align AI well enough, and/or because we coordinated politically to slow or stop AI advancement long enough to accomplish the alignment part
Agree
not because researchers avoided measuring AI’s capabilities.
But differential technological development matters, as does making it clear that when you make a capability game like this, you are probably just contributing to capabilities, not doing alignment. I won’t say you should never do that, but I will say that’s what’s being done. I personally am all in on “we just need to solve alignment as fast as possible”. But I’d been a capabilities nerd for a while before I became an alignment nerd, and when I see someone doing something that I think is, accidentally, a potentially significant little capabilities contribution, it seems worth pointing out that that’s what it is.
as does making it clear that when you make a capability game like this, you are probably just contributing to capabilities
I would distinguish between measuring capabilities and improving capabilities. I agree that the former can motivate the latter, but they still seem importantly different. I continue to think that the alternative of not measuring capabilities (or only measuring some small subset that couldn’t be used as training benchmarks) just means we’re left in the dark about what these models can do, which seems pretty straightforwardly bad from a safety perspective.
not doing alignment
I agree that it’s definitely not doing alignment, and that working on alignment is the most important goal; I intend to shift toward directly working on alignment as I feel clearer about what work is a good bet (my current leading candidate, which I intend to focus on after this experiment: learning to better understand and shape LLMs’ self-models).
I very much appreciate the thoughtful critique, regardless of whether or not I’m convinced by it.
As I said elsewhere: https://www.lesswrong.com/posts/LfQCzph7rc2vxpweS/introducing-the-weirdml-benchmark?commentId=q86ogStKyge9Jznpv