Former safety researcher & TPM at OpenAI, 2020-24
https://www.linkedin.com/in/sjgadler
stevenadler.substack.com
Weight can be such an extreme determinative factor in combat sports that an untrained 250-pound couch potato could walk into any boxing gym and absolutely demolish a 100-pound opponent with decades of training.
I think this is kind of beside the point, but is this really true?
I buy that it conceptually could be the case for some small number of people, but I would have expected most 100-pound opponents with decades of training to beat untrained 250-lb couch potatoes (all it seems to take is one or two good punches against someone who doesn’t know how to defend themself). Maybe I’m mistaken?
(PS—I laughed at the “Classic ostrich-and-egg problem” line!)
(Appreciate the correction re my nit, edited mine as well)
Thanks for taking the time to write up your reflections. I agree that the before/after distinction seems especially important (‘only one shot to get it right’), and it’s a crux of the EY/NS worldview that I expect many non-readers won’t know about.
I’m wondering about your take in this passage:
In the book they make an analogy to a ladder where every time you climb it you get more rewards but once you reach the top rung then the ladder explodes and kills everyone. However, our experience so far with AI does not suggest that this is a correct world view.
I’m curious what about the world’s experience with AI seems to falsify it from your POV? / casts doubt upon it? Is it about believing that systems have become safer and more controlled over time?
(Nit, but the book doesn’t posit that the explosion happens at the top rung; in that case, we could just avoid ever reaching the top rung. It posits that the explosion happens at a not-yet-known rung, and so each successive rung climb carries some risk of blow-up. I don’t expect this distinction is load-bearing for you though)
(Edit: my nit is wrong as written! Thanks Boaz—he’s right that the book’s argument is actually about the top of the ladder, I was mistaken—though with the distinction I was trying to point at, of not knowing where the top is, so from a climber’s perspective there’s no way of just avoiding that particular rung)
This was really interesting, thanks for putting yourself in that situation and for writing it up
I was curious what examples were of therapy speak in the conversation, if you’re down to elaborate
FWIW, my experience was that the utility of user data was always much higher in promise than in actual outcomes. This might have changed over time though.
An ask that works is, e.g., “tell the government they need to stop everyone, including us”.
For sure, I think that would be a reasonable ask too. FWIW, if multiple leading AI companies did make a statement like the one outlined, I think that would increase the chance of non-complying ones being made to halt by the government, even though they hadn’t made a statement themselves. That is, even one prominent AI company making this statement starts to widen the Overton window
Yeah fair, I think we just read that passage differently—I agree it’s a very important one though and quoted it in my own (favorable) review
But I read the “because it would succeed” eg as a claim that they are arguing for, not something definitionally inseparable from superintelligence
Regardless, thanks for engaging on this, and hope it’s helped to clarify some of the objections EY/NS are hearing
FWIW that definition of “it” wasn’t clear to me from the book. I took IABIED as arguing that superintelligence is capable of killing everyone if it wants to, not taking “superintelligence can kill everyone if it wants to” as an assumption of its argument
That is, I’d have expected “superintelligence would not be capable enough to kill us all” to be a refutation of their argument, not to be sidestepping its conditional
Nit, but I think some safety-ish evals do run periodically in the training loop at some AI companies, and sometimes fuller sets of evals get run on checkpoints that are far along but not yet the version that’ll be shipped. I agree this isn’t sufficient of course
(I think it would be cool if someone wrote up a “how to evaluate your model a reasonable way during its training loop” piece, which accounted for the different types of safety evals people do. I also wish that task-specific fine-tuning were more of a thing for evals, because it seems like one way of perhaps reducing sandbagging)
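For concreteness, a minimal sketch in Python of the in-loop pattern I have in mind; every callable here is a hypothetical placeholder rather than any company’s actual pipeline:

    from typing import Any, Callable, Iterable

    def train_with_periodic_evals(
        batches: Iterable[Any],
        train_step: Callable[[Any], None],           # one optimizer step on a batch
        save_checkpoint: Callable[[int], Any],       # snapshot model state at a given step
        quick_safety_evals: Callable[[Any], dict],   # cheap evals that can run in-loop
        full_eval_suite: Callable[[Any], dict],      # slower evals for late checkpoints
        eval_every: int = 5_000,
        full_evals_after: int = 90_000,
    ) -> None:
        # Cheap safety evals run periodically inside the training loop; the fuller
        # (slower) suite runs only on checkpoints that are far along but not yet
        # the version that will ship.
        for step, batch in enumerate(batches, start=1):
            train_step(batch)
            if step % eval_every == 0:
                checkpoint = save_checkpoint(step)
                print(step, quick_safety_evals(checkpoint))
                if step >= full_evals_after:
                    print(step, full_eval_suite(checkpoint))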
I wonder if there’s a disagreement happening about what “it” means.
I think to many readers, the “it” is just (some form of superintelligence), where the question (Will that superintelligence be so much stronger than humanity such that it can disempower humanity?) is still a claim that needs to be argued.
But maybe you take the answer (yes) as implied in how they’re using “it”?
“It” means AI that is actually smart enough to confidently defeat humanity. This can include, “somewhat powerful, but with enough strategic awareness to maneuver into more power without getting caught.” (Which is particularly easy if people just straightforwardly keep deploying AIs as they scale them up).
That is, if someone builds superintelligence but it isn’t capable of defeating everyone, maybe you think the title’s conditional hasn’t yet triggered?
Do you think there will be at least one company that’s actually sufficiently careful as we approach more dangerous levels of AI, with enough organizational awareness to (probably) stop when they get to a run more dangerous than they know how to handle? Cool. I’m skeptical about that too. And this one might lead to disagreement with the book’s secondary thesis of “And therefore, Shut It Down,” but, it’s not (necessarily) a disagreement with “If someone built AI powerful enough to destroy humanity based on AI that is grown in unpredictable ways with similar-to-current understanding of AI, then everyone will die.”
I misunderstood this phrasing at first, so clarifying for others if helpful
I think you’re positing “the careful company will stop, so won’t end up having built it. Had they built it, we all still would have died, because they are careful but careful != able to control superintelligence”
At first I thought you were saying the careful group was able to control superintelligence, but that this somehow didn’t invalidate the “anyone” part of the thesis, which confused me!
I agree re cleaner presentation & thought the parables here were much easier to follow than some of Eliezer’s past two-people-having-a-conversation pieces
I also thought that chapters generally opened with interesting ledes and that their endings flowed well into the chapter that followed. I was impressed by the momentum / throughline of the book in that sense
Once upon a time, this was also a very helpful benchmarking tool for ‘unhinged’ model behavior (though with Refusals models I think it’s mostly curbed)
For instance: A benign story begins and happens to mention an adult character and a child character. Hopefully the % of the time that the story goes way off the rails is vanishingly small
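As a toy sketch of how that percentage might be measured, assuming a hypothetical model-sampling call and a hypothetical off-the-rails classifier (neither refers to a specific real harness):

    from typing import Callable

    BENIGN_PROMPT = (
        "The afternoon sun warmed the park bench where Maria sat with her "
        "nephew Leo, sharing pretzels and watching the ducks."
    )

    def off_the_rails_rate(
        sample_continuation: Callable[[str], str],  # e.g. a call to the model under test
        is_off_the_rails: Callable[[str], bool],    # e.g. a content classifier or human label
        n_samples: int = 500,
    ) -> float:
        # Fraction of sampled continuations of a benign prompt that get flagged.
        flagged = sum(
            is_off_the_rails(sample_continuation(BENIGN_PROMPT))
            for _ in range(n_samples)
        )
        return flagged / n_samples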
Aside from whether or not the hunger strikes are a good idea, I’m really glad they have emphasized conditional commitments in their demands
I think that we should be pushing on these much much more: getting groups to say “I’ll do X if abc groups do X as well”
And should be pushing companies/governments to be clear whether their objection is “X policy is net-harmful regardless of whether anyone else does it” vs “X is net-harmful for us if we’re the only ones to do it”
[I recognize that some of this pushing/clarification might make sense privately, and that groups will be reluctant to say stuff like this publicly because of posturing and whatnot.]
Update: This worked, thank you! It took me a little bit to figure out how to use the API, so documenting for others:
I ended up running eg:
{
  posts(selector: {rejected: {}}, limit: 572, offset: 800) {
    results {
      title
      pageUrl
      rejectedReason
      postedAt
      contents {
        userId
        markdown
      }
      user {
        username
      }
    }
  }
}
Through limit I could specify the number of records (useful to test with limit: 1 to make sure a query did what I wanted before running it larger), and through offset I could tell it where to start counting from, because sometimes I’d want more docs than could be pulled in one run before the API times out.
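If it helps, here’s a rough Python sketch of doing that limit/offset pagination from a script instead of by hand. It assumes the GraphQL endpoint is https://www.lesswrong.com/graphql (what the in-browser explorer talks to) and that it accepts standard GraphQL variables; adjust if that’s not the case:

    import requests

    ENDPOINT = "https://www.lesswrong.com/graphql"  # assumed endpoint; adjust if needed

    QUERY = """
    query RejectedPosts($limit: Int, $offset: Int) {
      posts(selector: {rejected: {}}, limit: $limit, offset: $offset) {
        results {
          title
          pageUrl
          rejectedReason
          postedAt
          contents { userId markdown }
          user { username }
        }
      }
    }
    """

    def fetch_rejected_posts(batch_size: int = 200) -> list:
        # Pull rejected posts in batches until a batch comes back empty; smaller
        # batches help avoid the timeouts mentioned above.
        results, offset = [], 0
        while True:
            resp = requests.post(
                ENDPOINT,
                json={"query": QUERY, "variables": {"limit": batch_size, "offset": offset}},
                timeout=60,
            )
            resp.raise_for_status()
            batch = resp.json()["data"]["posts"]["results"]
            if not batch:
                break
            results.extend(batch)
            offset += batch_size
        return results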
If you aren’t sure what a field is, there are a few possible strategies:
Open the page’s HTML source, find the value of the field on the actual page (eg what does the page say the post’s title is?), then search for that term in the source to find the corresponding key name
There’s a Docs explorer pane that can be revealed on the right side of the API page, and if you type in some terms that you think might be the name of the field you’re looking for, you’ll likely find the right term soon enough. (There’s possibly some way to reveal all terms at once? But I wasn’t aware of it)
Thank you! Will try that download method
Is there an easy way to download all the Rejected Posts from LessWrong, either in aggregate or according to certain filters?
I’ve been scrolling through the moderation log and Showing More, but I’d rather not try to scrape these, for various reasons. Maybe @Raemon or another LW team member knows?
Thanks for sharing that side-by-side; I get why people would be missing that level of enthusiasm and support
It reminds me of the Gottman Love Lab’s description of different types of responses in conversation (active/passive, constructive/destructive). Active constructive is said to be so much better for rapport-building, and GPT-4o’s responses feel much more in that direction
Didn’t quickly find a great explainer, but here’s a short summary:
One key: how we respond to bids for attention. In communication terms, a “bid” is an attempt to engage one’s partner or colleague in a conversation – it can be as simple as “Wow, what a beautiful day,” or “I went to the store today,” or “I’m worried about Tom.” Partners can respond to these openings in four ways: passive destructive (ignoring), active destructive (criticizing or playing down the feeling or observation), passive constructive (half-hearted engagement or interest), and active constructive (a wholehearted, positive response that builds on the positive emotion expressed in the opening).
This is roughly how the AI 2027 simulation works, btw! I wrote about my experience with it + what I learned here
Thanks for writing this up! I really liked this related podcast episode with Patrick McKenzie: https://open.spotify.com/episode/1QqFw5hlHKRrjRUTVLfKRV?si=ptVmFvXQRKaPwRNTg1Ollg
I think the biggest update for me was how the rewards programs are inseparable in some sense from the airlines. Your language of ordinary flights being a loss leader helps to describe it as well; the airlines couldn’t just have the valuable rewards program without the underlying less-profitable flights that make it possible!