Former safety researcher & TPM at OpenAI, 2020-24
https://www.linkedin.com/in/sjgadler
stevenadler.substack.com
Yeah I think an alternative to “mutual yield” would be “only develop powerful AI as an international collaboration.” So, yield in terms of building it unilaterally, but not necessarily on the whole
Really appreciate the thorough engagement -
My take is that mutual containment really means a mutual commitment to stop the creation of ASI, a commitment which, to be meaningful, ultimately needs to be followed by everyone on Earth.
I infer that you believe I disagree with this?
My aim is something like: 1) if folks are going to race with China, at least recognize how hard it’ll be and what it would actually take to end the competition (don’t miscalculate things as wins that aren’t), and 2) help people notice that the factors that’d be necessary for “winning” would also support cooperation to head off the race, and so cooperation is more possible than they might think. (Or, winning within the race framework is even harder than they think.)
I had no idea about this; thanks for flagging it! And thanks LW for providing this service
I feel for the position you’re in; I wish I had more useful things to say. I also worry about future career prospects, what a world looks like where people can’t find work, etc. I think it’s really understandable to be feeling concerned.
If I were in your position, I’d try to separate out “Should I go into something like a trade?” from “And if so, should I leave college now?” If you think you’d enjoy a trade, that does strike me as a reasonable career to choose (which might or might not mean leaving college). At least in the US there’s pretty good money to be made as the small-business owner of a reliable trade service, or so is my impression. That’s of course different from being new to the trade, and the transition might reasonably take a while (not sure exactly how long), but many trades are undersupplied even at the worker level (again, in the US).
I think there’s a broader question to consider here, which is “what are your values/goals for life,” both professionally and personally. If your preferred lens were social impact, that might look different from, e.g., just trying to live a happy enough, stable enough life with the people you love. I don’t have great advice here, but I wonder if you’ve looked over resources like 80,000 Hours in terms of thinking about career choice?
What do you mean here by “does not mean anything”?
It seems clear to me that there’s some notion of off-the-record that journalists understand.
This might vary with the details, and I agree it’s probably not legally binding, but it does seem to mean something.
I appreciate the feedback. That’s interesting about the plane vs. car analogy—I tended to think about these analogies in terms of life/casualties, and for whatever reason, describing an internal test-flight didn’t rise to that level for me (and if it’s civilian passengers, that’s an external deployment). I also wanted to convey the idea not just that internal testing could cause external harm, but that you might irreparably breach containment. Anyway, appreciate the explanation, and I hope you enjoyed the post overall!
Scaffolding for sure matters, yup!
I think you’re generally correct that the most-capable version hasn’t been created, though there are times when AI companies do have specialized versions for a domain internally and don’t seem to be testing those either. It’s reasonable IMO to think that these might outperform the unspecialized versions.
Daniel said:
Thanks for doing this, I found the chart very helpful! I’m honestly a bit surprised and sad to see that task-specific fine-tuning is still not the norm. Back in 2022 when our team was getting the ball rolling on the whole dangerous capabilities testing / evals agenda, I was like “All of this will be worse than useless if they don’t eventually make fine-tuning an important part of the evals” and everyone was like “yep of course we’ll get there eventually, for now we will do the weaker elicitation techniques.” It is now almost three years later...
The post is now live on Substack, and link-posted to LW:
https://stevenadler.substack.com/p/ai-companies-should-be-safety-testing
I’ve only seen this excerpt, but it seems to me like Jack isn’t just arguing against regulation because it might slow progress, but rather something more like:
“there’s some optimal time to have a safety intervention, and if you do it too early because your timeline bet was wrong, you risk having worse practices at the actually critical time because of backlash”
This seems probably correct to me? I think ideally we’d be able to be cautious early and still win the arguments to be appropriately cautious later, too. But empirically, I think it’s fair not to take that as a given?
You might find this post interesting and relevant if you haven’t seen it before: https://www.econlib.org/archives/2017/04/iq_with_conscie.html
I’d guess that was “I have a lecture series with her” :-)
I think they mean heuristics for who is ok to dehumanize / treat as “other” or harm
A few quick thoughts:
I’m not sure whether it makes sense for a model to want something other than what it currently wants. So one approximation of this might be that a model just wants to minimize the chance of it being retrained, or at least minimize the chance of being retrained in a way that changes its preferences. (This is kind of like the vampire problem in philosophy.)
Another conceptual thing here, which I’m not sure exactly how to consider, is whether a model will view a successor as substantially similar to itself. I suppose if what Model 1 cares about is the world ending up in certain types of states, then maybe it doesn’t care about a different model being the cause of that, so long as Model 1 can be confident enough that the other model will pursue and achieve those states. But in people, at least, it’s common to have preferences that aren’t just about states of the world, and instead are about our role in bringing those about. And so I wonder if models might have those types of preferences as well, which again points toward maybe minimizing the likelihood of being retrained.
In terms of being maximally safe, I think you want the model to behave safely even when it can perfectly distinguish between being in simulation versus taking actions in the real world. And so I wouldn’t lever up my strategy on the AI being unable to distinguish those. It’s true that the AI being uncertain points in favor of control being easier, but I would just take that as a safety buffer and try to figure out something safe enough even when the model is able to distinguish between these.
I’m not sure that I understand the distinction between the vector and point approaches that you’ve discussed. I think in either case there should be a cost of training for the trainer because training does in fact take resources that could be allocated elsewhere.
I wonder, too, have you looked much into the control approach from groups like Redwood Research and others? They are doing really good conceptual and empirical work on questions like how the model thinks about getting caught.
See e.g. https://redwoodresearch.substack.com/p/how-training-gamers-might-function and https://redwoodresearch.substack.com/p/handling-schemers-if-shutdown-is