The post explicitly calls for thinking about how this situation is similar to what is happening/happened at Leverage, and I think that’s a good thing to do. I do think I have specific evidence that makes what happened at Leverage seem pretty different from my experiences with CFAR/MIRI.
Like, I’ve talked to a lot of people about stuff that happened at Leverage in the last few days, and I do think that overall, the level of secrecy and paranoia about information leaks at Leverage seemed drastically higher than anywhere else in the community that I’ve seen, and I feel like the post is trying to draw some parallel here that fails to land for me (though it’s also plausible it is pointing out a higher level of information control than I thought was present at MIRI/CFAR).
I have also had my disagreements with MIRI being more secretive, and think that secrecy comes with a high cost that I think has been underestimated by at least some of the leadership. But I have never heard of anyone at MIRI or CFAR being “quarantined from their friends” because they attracted some “set of demons/bad objects that might infect others when they come into contact with them”, which feels to me like a different level of social isolation, and is part of what happened at Leverage near the end.
To be clear, I think this kind of purity dynamic is also present in other contexts, like high-class/low-class dynamics, and various other problematic common social dynamics, but I haven’t seen anything that seems to result in as much social isolation and alienation, in a way that seemed straightforwardly very harmful to me, and more harmful than anything comparable I’ve seen in the rest of the community (though not more harmful than what I have heard from some people about e.g. working at Apple or the U.S. military, which seem to have very similarly strict procedures and also a number of quite bad associated pathologies).
The other biggest thing that feels important to distinguish between what happened at Leverage and the rest of the community is the actual institutional and conscious optimization that has gone into PR control.
Like, I think Ben Hoffman’s point about “Blatant lies are the best kind!” is pretty valid, and I do think that other parts of the community (including organizations like CEA and to some degree CFAR) have engaged in PR control in various harmful but less legible ways, but I do think there is something additionally mindkilly and gaslighty about straightforwardly lying, or directly threatening adversarial action to prevent people from speaking ill of someone, in the way Leverage has. I always felt that the rest of the rationality community had a very large and substantial dedication to being very clear about when they denotatively vs. connotatively disagree with something, and to have a very deep and almost religious respect for the literal truth (see e.g. a lot of Eliezer’s stuff around the wizard’s code and meta honesty), and I think the lack of that has made a lot of the dynamics around Leverage quite a bit worse.
I also think it makes understanding the extent of the harm and ways to improve it a lot more difficult. I think the number of people who have been hurt by various things Leverage has done is really vastly larger than the number of people who have spoken out so far, in a ratio that I think is very different from what I believe is true about the rest of the community. As a concrete example, I have a large number of negative Leverage experiences between 2015-2017 that I never wrote up due to various complicated adversarial dynamics surrounding Leverage and CEA (as well as various NDAs and legal threats, made by both Leverage and CEA, not leveled at me, but leveled at enough people around me that I thought I might cause someone serious legal trouble if I repeat a thing I heard somewhere in a more public setting), and I feel pretty confident that I would feel very different if I had similarly bad experiences with CFAR or MIRI, based on my interactions with both of these organizations.
I think this kind of information control feels like what ultimately flips things into the negative for me, in this situation with Leverage. Like, I think I am overall pretty in favor of people gathering together and working on a really intense project, investing really hard into some hypothesis that they have some special sauce that allows them to do something really hard and important that nobody else can do. I am also quite in favor of people doing a lot of introspection and weird psychology experiments on themselves, and trying their best to handle the vulnerability that comes with doing that near other people, even though there is a chance things will go badly and people will get hurt.
But the thing that feels really crucial in all of this is that people can stay well-informed and can get the space they need to disengage, can get an external perspective when necessary, and somehow stay grounded all throughout this process. Which feels much harder to do in an environment where people are directly lying to you, or where people are making quite explicit plots to discredit you, or harm you in some other way, if you do leave the group, or leak information.
I do notice that in the above I make various accusations of lying or deception by Leverage without really backing it up with specific evidence, which I apologize for, and I think people reading this should overall not take comments like mine at face value before having heard something pretty specific that backs up the accusations in them. I have various concrete examples I could give, but do notice that doing so would violate various implicit and explicit confidentiality agreements I made, that I wish I had not made, and I am still figuring out whether I can somehow extract and share the relevant details, without violating those agreements in any substantial way, or whether it might be better for me to break the implicit ones of those agreements (which seem less costly to break, given that I felt like I didn’t really fully consent to them), given the ongoing pretty high cost.
What does “implicit request” mean here?
I don’t know, kind of complicated, enough that I could probably write a sequence on it, and not even sure I would have full introspective access into what I would feel comfortable labeling as an “implicit request”.
I could write some more detail, but it’s definitely a matter of degree, and the weaker the level of implicit request, the weaker the reason for sharing needs to be, with some caveats about adjusting for people’s communication skills, adversarial nature of the communication, adjusting for biases, etc.
I think I could write down a full history of employment for all of these orgs (except maybe FHI, which I’ve kept fewer tabs on), in an hour or two of effort. It’s somewhat costly for me (in terms of time), but if lots of people are interested, I would be happy to do it.
My current feelings are a mixture of the following:
I disagree with a lot of the details of what many people have said (both people who had bad experiences and people defending their Leverage experiences and giving positive testimonials), and feel like expressing my take has some chance of making those people feel like their experiences are invalidated, or at least of sparking some kind of conflict
I know that Geoff and Leverage more broadly in the past have said pretty straightforwardly that they will take pretty adversarial action if someone threatens their reputation or brand, and that makes me both feel like I can trust many fewer things in the discussion, and makes me personally more hesitant to share some things (while also feeling like that’s kind of cowardly, but I haven’t yet had the time to really work through my feelings here, which in itself has some chilling effects that I feel uncomfortable with, etc.)
On the other side, there have been a lot of really vicious and aggressive attacks on anyone saying anything pro-Leverage for many years, with a strength that I think is overall even greater and harder to predict than what Geoff and Leverage have been doing. It’s also been more of a crowd-driven phenomenon, which makes it less predictable and more scary.
I feel like it’s going to be really hard to say anything without people pigeonholing me into belonging to some group that is trying to rewrite the rationality social and political landscape in some way, and that makes me feel like I have to actively think about how to phrase what I am saying in a way that avoids that pigeonholing effect (as a concrete example, one person who read Ben’s initial comment on the “BayAreaHuman” post, which said “I confirm that this is a real person in good standing”, as an endorsement of the post approached me about it, when the comment was really just intended to confirm some facts about the identity of the poster, basically independently of the content of the post)
I myself have access to some sensitive and somewhat confidential information, and am struggling with navigating exactly which parts are OK to share and which ones are not.
I generally feel reasonably comfortable sharing unsolicited emails, unless the email makes some kind of implicit request to not be published, that I judge at least vaguely valid. In general I am against “default confidentiality” norms, especially for requests or things that might be kind of adversarial. I feel like I’ve seen those kinds of norms weaponized in the past in ways that seems pretty bad, and think that while there is a generally broad default expectation of unsolicited private communication being kept confidential, it’s not a particularly sacred protection in my mind (unless explicitly or implicitly requested, in which case I think I would talk to the person first to get a more fully comprehensive understanding for why they requested confidentiality, and would generally err on the side of not publishing, though would feel comfortable overcoming that barrier given sufficient adversarial action)
I think the claim is false, though it’s hard to prove an absence. I have never heard of anyone having a particularly strong emotional reaction to Roko’s basilisk (outside of a “sneering” reaction towards rationalists or something), and almost everyone I’ve met in the community thinks the thought experiment is not particularly interesting or important or dangerous. I do like the general policy of not talking a ton about infohazards of this type, but that’s just a broad reference class, of which Roko’s basilisk seems like a non-central example.
The Roko’s Basilisk wiki article has a lot of detail, and I recommend looking through it. It also includes these paragraphs:
Other sources have repeated the claim that Less Wrong users think Roko’s basilisk is a serious concern. However, none of these sources have yet cited supporting evidence on this point, aside from Less Wrong moderation activity itself. (The ban, of course, didn’t make it easy to collect good information.)
Less Wrong user Gwern reports that “Only a few LWers seem to take the basilisk very seriously,” adding, “It’s funny how everyone seems to know all about who is affected by the Basilisk and how exactly, when they don’t know any such people and they’re talking to counterexamples to their confident claims.”
My current model is that this changed around 2017 or so. At least my sense was that people from before that time often had tons of side-projects and free time, and afterwards people seemed to have really full schedules and were constantly busy.
Whatever is happening, the public forum debate will have no impact on it;
I think this is wrong. I think a lot of people who care about AI Alignment read LessWrong and might change their relationship to OpenAI depending on what is said here.
Welcome! I hope you have a good time here!
Having specific theistic beliefs seems very different from “having drawn inspiration” from someone.
Ah, didn’t see this before I wrote my comment. This all seems correct to me, my comment has a bit of additional flavor but basically says the same thing.
Here is my current take on security for the site:
We pay enough attention to security that I think it’s unlikely we will accidentally dump our whole database to the internet.
I think any sophisticated attacker who is willing to spend something like 50-100 hours on trying to get into our system can probably find some way to get access to most things in our database.
We do obvious things like salting and hashing passwords, so I don’t think there is any way I know of for an attacker to reverse-engineer a reasonably strong user password from our database (though having access to the hash is still often useful). A rough sketch of what salted hashing looks like in general is included below.
I think it’s reasonable to assume that if any nation state or sophisticated group of attackers wants access to your PMs, votes and drafts, they can probably get it somehow, either by attacking the client, infecting an admin or moderator, or just directly hacking our server somehow.
This overall puts somewhat of an upper bound on how useful additional focus on security would be, since I currently think it would be very hard to harden our system against sophisticated attackers. I currently would not treat LW or the AI Alignment Forum as particularly secure, and if you want to have conversations with individuals that will not be accessible by potential sophisticated attackers, I would use something like Signal. In general, there isn’t a lot of private information on LessWrong (outside of PMs and votes, and maybe drafts), so I don’t tend to think of security as a top priority for the site (though I still think a good amount about security, and am also not treating it as super low priority, but it isn’t one of the things the whole site was architected around).
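To make the salting-and-hashing point above concrete: the sketch below shows what the general technique looks like, using Node’s built-in crypto module. This is a minimal illustration under assumptions made up for the example (function names, salt length, key length), not the actual LessWrong implementation.

```typescript
import { randomBytes, scryptSync, timingSafeEqual } from "crypto";

// Store a per-user random salt next to the hash, so identical passwords
// don't produce identical database entries and precomputed tables don't help.
function hashPassword(password: string): string {
  const salt = randomBytes(16).toString("hex");
  const hash = scryptSync(password, salt, 64).toString("hex");
  return `${salt}:${hash}`;
}

// Check a login attempt against the stored "salt:hash" value in constant time.
function verifyPassword(attempt: string, stored: string): boolean {
  const [salt, hash] = stored.split(":");
  const attemptHash = scryptSync(attempt, salt, 64);
  return timingSafeEqual(attemptHash, Buffer.from(hash, "hex"));
}
```

The upshot is that an attacker who dumps the database only gets salts and hashes, and recovering a reasonably strong password from those is computationally infeasible.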
I am not currently very worried about astroturfing or trolling, mostly because we are currently still capable of reviewing each piece of content posted to the site individually, and I think that’s a pretty high barrier for language models to overcome. If even I can’t tell the difference between a bot and a person anymore, then yeah, I am not super sure what to do. We could do some kind of identity verification with passports and stuff, but I would prefer to avoid that.
We run some basic statistics on vote patterns and would notice if some new set of accounts suddenly started voting a lot and influencing site content. I would also likely notice myself that something is off with what kinds of things are getting upvoted and downvoted, and would likely investigate.
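For a sense of what “basic statistics on vote patterns” can mean in practice, here is a hedged sketch of the kind of check I have in mind; the Vote shape, field names, and threshold are made up for illustration and are not our actual schema or detection logic.

```typescript
interface Vote {
  userId: string;
  power: number;   // e.g. +1 / -1, possibly scaled by karma
  votedAt: Date;
}

// Flag accounts whose recent voting volume is far above the site-wide average,
// which is the kind of pattern that would prompt a manual investigation.
function flagSuspiciousVoters(votes: Vote[], sinceDays = 7, threshold = 5): string[] {
  const cutoff = Date.now() - sinceDays * 24 * 60 * 60 * 1000;
  const counts = new Map<string, number>();
  for (const v of votes) {
    if (v.votedAt.getTime() >= cutoff) {
      counts.set(v.userId, (counts.get(v.userId) ?? 0) + 1);
    }
  }
  const perUser = [...counts.values()];
  const mean = perUser.reduce((a, b) => a + b, 0) / Math.max(perUser.length, 1);
  return [...counts.entries()]
    .filter(([, n]) => n > threshold * mean)
    .map(([userId]) => userId);
}
```

Anything a check like this flags would still get a human look before any action is taken.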
The Internet Archive tends to have backups of all publicly available pages.
We handle our own backups on AWS, and I tend to download one every few months to some hard drive somewhere (a rough sketch of what that download step can look like is included below).
Changing hosting providers would be annoying, but not super hard. My guess is it would take us less than three days to be back up and running if AWS no longer likes us.
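For the backup-download step mentioned a couple of paragraphs up, the sketch below shows roughly what pulling the latest backup object out of an S3 bucket looks like with the AWS SDK for JavaScript. The bucket layout and naming are hypothetical, and this is not our actual tooling.

```typescript
import { writeFile } from "fs/promises";
import { S3Client, ListObjectsV2Command, GetObjectCommand } from "@aws-sdk/client-s3";

// Download the most recently modified object from a backups bucket to a local file.
async function downloadLatestBackup(bucket: string, destination: string): Promise<void> {
  const s3 = new S3Client({});
  const listing = await s3.send(new ListObjectsV2Command({ Bucket: bucket }));
  const newest = (listing.Contents ?? [])
    .sort((a, b) => (b.LastModified?.getTime() ?? 0) - (a.LastModified?.getTime() ?? 0))[0];
  if (!newest?.Key) throw new Error("No backups found in bucket");
  const object = await s3.send(new GetObjectCommand({ Bucket: bucket, Key: newest.Key }));
  const bytes = await object.Body!.transformToByteArray();
  await writeFile(destination, bytes);
}
```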
30% of literally the highest offer you can get
This is roughly the sense in which I meant “competitive” (I think there are some edge-cases here, where for example I don’t expect we will be able to fully cover the right tail of outcomes. Like, if Sam Bankman-Fried had decided to work with us instead of founding FTX, we of course couldn’t have paid him 10 billion dollars; similar considerations apply to other situations like that).
Some of the recent edits removed the ?showPostCount=true&useTagName=true query parameters from some of the links, which changes how those links are displayed and makes the display inconsistent. Seems like we should fix this.
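As an illustration of the fix (a hypothetical helper, not code from the actual site), something like this would normalize tag links so they consistently carry those parameters:

```typescript
// Ensure a tag link carries the query parameters that control how it renders.
function withTagLinkParams(href: string): string {
  const url = new URL(href, "https://www.lesswrong.com");
  url.searchParams.set("showPostCount", "true");
  url.searchParams.set("useTagName", "true");
  return url.pathname + url.search;
}

// Example (hypothetical tag slug):
//   withTagLinkParams("/tag/security")
//   -> "/tag/security?showPostCount=true&useTagName=true"
```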
I’m a little curious how that ends up working for senior candidates who could be getting 450k (which is basically standard comp at those tech companies for senior engineers) - do you just assume that they’d be capable of passing an interview at one of those places if they clear your bar, assuming they don’t work somewhere like that already?
I am not fully sure yet what the right algorithm here will be, since we haven’t run into that problem yet. My guess is I would try to call in a third party to give me a guess of how much they could make in industry, or we just negotiate a bit back and forth and they tell me the evidence they have for how much they could make in industry if they tried. I can also imagine this turning out to be harder, and I would have to think more about how to best get a fair assessment here.
I think if you start asking people to, say, provide offer letters demonstrating their “market value”, you run the risk of someone looking at their options and then changing their mind.
This seems like a fine outcome to me. Indeed, I have in the past told prospective LW/Lightcone employees to really try to look for other options and take them seriously, even after I made them an offer, so that if they do decide to take the offer, we both feel confident that working at LW/Lightcone is the best choice for them.
Having an in-person campus that allows people to have really good high-bandwidth communication is a big component that I now think is a really useful thing to have in many worlds.
On a higher level of abstraction, I have an internal model that suggests something like the following three components are things that are quite important for AGI (and some other x-risks) to go right:
The ability to do really good research and be really good at truth-seeking (necessary to solve various parts of the AI Alignment problem, and also in general just seems really important for a community to have for lots of reasons)
The ability to take advantage of crises and navigate really quickly changing situations (as a concrete intuition pump, I currently believe that before something like AGI we will probably have something like 10 more years at least as crazy as 2020, and I have a sense that some of the worlds where things go well, are worlds where a bunch of people concerned about AI Alignment are well set-up to take advantage of them, and make sure to not get wiped out by them)
The ability to have high-stakes negotiations with large piles of resources and people (like, I think it’s pretty plausible that in order to actually get the right AI Alignment solution deployed, and to avoid us getting killed some other way before then, some people who have some of the relevant components of solutions will need to negotiate in some pretty high-stakes situations to actually make them happen. And in a much more coherent way than people are currently capable of.)
Though these are all pretty abstract and high-level, and I have a lot of concrete thoughts that are less abstract, though it would take me a while to write them up.
Also, if I may ask about “no longer seems sufficient”: did you think it was? The sentence seems really strange to me, to be honest; otherwise I’d be curious whether you have a text where you explained why you thought that, as it seems quite surprising.
I do think something like this is kind of correct. It’s not that I thought that nothing else had to happen between now and then for humanity to successfully reach the stars, but I did meaningfully think that there were a good number of universes where my work on LessWrong would make the difference (with everyone else of course also doing things), and that I was really moving probability mass.
I still think I moved some probability mass, but I further updated that in order to realize a bunch of that probability mass that I was hoping for, I need to get some other pieces in place. Which is something I didn’t think was as necessary, and I used to think more that the online component of things would itself be sufficient to realize a lot of that probability mass.
I definitely didn’t believe that if I were to just make LessWrong great, existential risk would be solved in most worlds.
Note: I decided to frontpage this post, despite it being more of an organizational announcement, because it does feel pretty relevant to everyone on the site, and I would feel bad if someone was a regular user of LessWrong and didn’t know about this relatively large change in our operating structure.