Also known as Raelifin: https://www.lesswrong.com/users/raelifin
Max Harms
IABIED is a 101-level book written for the general public that was deliberately kept nice and short. I kinda think anyone (who is not an expert) who reads IABIED and comes away with a similar level of pessimism as the authors is making an error. If you read any single book on a wild, controversial topic, you should not wind up extremely confident!
My sense is that the point of the book was to convince people that it’s important to take AI x-risk seriously (as BB does). I don’t really think it was intended to get people to think its title thesis is clearly true.
Some things are hard to judge.
“The evidence isn’t convincing” is a fine and true statement. I agree that IABIED did not convince BB that the title thesis is clearly true. (Arguably that wasn’t the point of the book, and it did convince him that the thesis is worryingly plausible and that AI x-risk is worth more attention, but that’s pure speculation on my part and idk.)
My point is that “the evidence isn’t convincing” is (by default) a claim about the evidence, not the hypothesis. It is not a reason to disbelieve.
I agree[1] that sometimes having little evidence or only weak evidence should be an update against. These are cases where the hypothesis predicts that you will have compelling evidence. If the hypothesis were “it is obvious that if anyone builds it, everyone dies” then I think the current lack of consensus and inconclusive evidence would be a strong reason to disbelieve. This is why I picked the example with the stars/planets. It, I claim, is a hypothesis that does not predict you’ll have lots of easy evidence on Old Earth, and in that context the lack of compelling evidence is not relevant to the hypothesis.
I’m not sure if there’s a clearer way to state my point.[2] Sorry for not being easier to understand.
Perhaps relevant: MIRI thinks that it’ll be hard to get consensus on AGI before it comes.
- ^
As indicated in the final parenthetical paragraph of my comment above:
(There are also cases where the “absence of evidence” is evidence of absence. But these are just null results, not a real absence of evidence. It seems fine to criticize an argument for doom that predicted we’d see all AIs the size of Claude being obviously sociopathic.)
- ^
We could try expressing things in math if you want. Like, what does the update on the book being unconvincing look like in terms of Bayesian probability?
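To make the offer concrete, here is a minimal sketch of what that Bayesian update could look like. All the numbers are made up for illustration; the point is only that the size of the update depends on how strongly the hypothesis predicts that the evidence would be convincing.

```python
# Toy Bayesian update (illustrative numbers, not anyone's actual credences).
# H = "if anyone builds it, everyone dies"
# E = "the book's argument fails to convince a given reader"

def posterior(prior, p_e_given_h, p_e_given_not_h):
    """Bayes' rule: P(H|E) = P(E|H)P(H) / P(E)."""
    p_e = p_e_given_h * prior + p_e_given_not_h * (1 - prior)
    return p_e_given_h * prior / p_e

prior = 0.30  # assumed prior on H, purely for illustration

# If H barely predicts whether a 101-level book convinces people,
# the two likelihoods are nearly equal and the update is tiny.
print(posterior(prior, 0.60, 0.65))  # ~0.28: almost no change from 0.30

# A hypothesis like "it is OBVIOUS that everyone dies" strongly predicts
# convincing evidence, so a failure to convince cuts hard against it.
print(posterior(prior, 0.10, 0.65))  # ~0.06: a real update downward
```

The first case is the stars/planets situation: the hypothesis doesn’t predict you’d have compelling evidence, so unconvincing evidence barely moves you.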
Sorry, I think you entirely missed my point. It seems my choice of hypothesis was distracting. I’ve edited my original comment to make that more clear. My point does not depend on the truth of the claim.
Suppose that, in the years before telescopes, I came to you and said that [wild idea X] was true.[1]
You’d be right to wonder why I think that. Now suppose that I offer some convoluted philosophical argument that is hard to follow (perhaps because it’s invalid). You are not convinced.
If you write down a list of arguments, for and against the idea, you could put my wacky argument in the “for” column, or not, if you think it’s too weak to be worth consideration. But what I am claiming would be insane is to list “lack of proof” as an argument against.
Lack of proof is an observation about the list of arguments, not about the idea itself. It’s a meta-level argument masquerading as an object level argument.
Let’s say on priors you think [X] is 1% likely, and your posterior is pretty close after hearing my argument. If someone asks you why you don’t believe, I claim that the most precise (and correct) response is “my prior is low,” not “the evidence isn’t convincing,” since the failure of your body of evidence is not a reason to disbelieve in the hypothesis.
Does that make sense?
(Admittedly, I think it’s fine to speak casually and not worry about this point in some contexts. But I don’t think BB’s blog is such a context.)
(There are also cases where the “absence of evidence” is evidence of absence. But these are just null results, not a real absence of evidence. It seems fine to criticize an argument for doom that predicted we’d see all AIs the size of Claude being obviously sociopathic.)
- ^
Edit warning! In the original version of this comment X = “the planets are other worlds, like ours, and a bunch of them have moons.” My point does not depend on the specific X.
I claim that even in the case of the murder rate, you don’t actually care about posterior probabilities, you care about evidence and likelihood ratios (but I agree that you should care about their likelihoods!). If you are sure that you share priors with someone, like with sane people and murder rates, their posterior probability lets you deduce that they have strong evidence that is surprising to you. But this is a special case, and certainly doesn’t apply here.
Posterior probabilities can be a reasonable tool for getting a handle on where you agree/disagree with someone (though alas, not perfect since you might incidentally agree because your priors mismatch in exactly the opposite way that your evidence does), but once you’ve identified that you disagree you should start double-clicking on object-level claims and trying to get a handle on their evidence and what likelihoods it implies, rather than criticizing them for having the wrong bottom-line number. If Eliezer’s prior is 80% and Bentham’s Bulldog has a prior of 0.2%, it’s fine if they have respective posteriors of 99% and 5% after seeing the same evidence.
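The Eliezer/Bentham’s Bulldog numbers above are actually mutually consistent under a single shared likelihood ratio, which is worth seeing explicitly. This sketch just runs Bayes’ rule in odds form on those illustrative figures (the specific likelihood ratio is mine, chosen to fit them).

```python
# Sketch of the claim above: agents with very different priors can see the
# SAME evidence (same likelihood ratio) and both update rationally.
# Priors are the illustrative ones from the comment (80% vs 0.2%).

def odds(p):
    return p / (1 - p)

def prob(o):
    return o / (1 + o)

def update(prior, likelihood_ratio):
    """Bayes in odds form: posterior odds = prior odds * likelihood ratio."""
    return prob(odds(prior) * likelihood_ratio)

lr = 24.75  # one shared body of evidence, favoring doom about 25:1

print(update(0.80, lr))   # ~0.99  (Eliezer-ish prior -> ~99%)
print(update(0.002, lr))  # ~0.047 (BB-ish prior -> ~5%)
```

So a disagreement in posteriors here is entirely a disagreement in priors; criticizing the bottom-line number tells you nothing about who read the evidence correctly.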
One major exception is if you’re trying to figure out how someone will behave. I agree that in that case you want to know their posterior, all-things-considered view. But that basically never applies when we’re sitting around trying to figure things out.
Does that make sense?
Hmmm… Good point. I’ll reach out to Bentham’s Bulldog and ask him what he even means by “confidence.” Thanks.
Thanks for this comment! (I also saw you commented on the EA forum. I’m just going to respond here because I’m a LW guy and want to keep things simple.)
As you said, the median expert gives something like a 5% chance of doom. BB’s estimate is about a factor of two more confident than that in the direction of things being okay. That factor-of-two difference is what I’m referencing. I am trying to say that I think it would be wiser for BB to be a factor of two less confident, like Ord. I’m not sure what doesn’t seem right about what I wrote.
I agree that superforecasters are even more confident than BB. I also agree that many domain experts are more confident.
I think that BB and I are using Bayesian language where “confidence” means “degree of certainty,” rather than degree of calibration or degree of meta-certainty or size of some hypothetical confidence interval or whatever. I agree that Y&S think the title thesis is “an easy call” and that they have not changed their mind on it even after talking to a lot of people. I buy that BB’s beliefs here are less stable/fixed.
Yeah, but I still fucked up by not considering the hypothesis and checking with BB.
Ah, that’s a fair point. I do think that metonymy was largely lost on me, and that my argument now seems too narrowly focused against RLHF in particular, instead of prosaic alignment techniques in general. Thanks. I’ll edit.
Agreed that in terms of pointers to worrying Claude behavior, a lot of what I’m linking to can be seen as clearly about ineptness rather than something like obvious misalignment. Even the bad behavior demonstrated by the Anthropic alignment folks, like the attempted blackmail and murder, is easily explained as something like confusion on the part of Claude. Claude, to my eyes, is shockingly good at behaving in nice ways, and there’s a reason I cite it as the high-water mark for current models.
I mostly don’t criticize Claude directly, in this essay, because it didn’t seem pertinent to my central disagreements with BB. I could write about my overall perspective on Claude, and why I don’t think it counts as aligned, but I’m still not sure that’s actually all that relevant. Even if Claude is perfectly and permanently aligned, the argument that prosaic methods are likely sufficient would need to contend with the more obvious failures from the other labs.
Interesting. I didn’t really think I was criticizing Claude, per se. My sense is that I was criticizing the idea that normal levels of RLHF are sufficient to produce alignment. Here’s my sense of the arguments that I’m making, stripped down:
1. Claude is (probably) more aligned than other models.
2. Claude uses less RLHF than other models (and more RLAIF).
3. This is evidence that RLHF is less good than other techniques at aligning models.
4. RLHF trains for immediate satisfaction.
5. True alignment involves being principled.
6. RLAIF can train for being principled.
7. RLAIF is therefore more likely than RLHF to bring true alignment.
8. This is a theoretical argument for why we see Claude being more visibly aligned.
9. Using RLAIF to instill good principles means needing to write a constitution.
10. Writing a constitution involves grappling with moral philosophy.
11. Grappling with moral philosophy is hard.
12. Therefore using RLAIF to instill good principles is hard.
If I wanted to criticize Claude, I would have pointed to ways in which it currently behaves in worrying ways, which I did elsewhere. I agree that there is a good point about not needing to be perfect, though I do think the standards for AI should be higher than for humans, because humans don’t get to leverage their unique talents to transform the world as often. (Like, I would agree about the bar being human-level goodness if I was confident that Claude would never wind up in the “role” of having lots of power.)
Am I missing something? I definitely want to avoid invalid moves.
Bentham’s Bulldog is wrong about AI risk
Thanks. I’ll put most of my thoughts in a comment on your post, but I guess I want to say here that the issues you raise are adjacent to the reasons I listed “write a guide” as the second option, rather than the first (i.e. surveillance + ban). We need plans that we can be confident in even while grappling with how lost we are on the ethical front.
I think I agree with a version of this, but seem to feel differently about the take-away.
To start with the (potential) agreement, I like to keep slavery in mind as a warning. Like, I imagine what it might feel like to have grown up in a way that I think slavery is natural and good, and I check whether my half-baked hopes for the future would’ve involved perpetuating slavery. Any training regime that builds “alignment” by pushing the AI to simply echo my object-level values is obviously insufficient, and potentially dragging down the AI’s ability to think clearly, since my values are half baked. (Which, IIUC, is what motivated work like CEV back in the day.)
I do worry that you’re using “alignment” in a way that perhaps obscures some things. Like, I claim that I don’t really care if the first AGIs are aligned with me/us. I care whether they take control of the universe, kill people, and otherwise do things that are irrecoverable losses of value. If the first AGI says “gosh, I don’t know if I can do what you’re asking me to do, given that my meta-ethical uncertainty indicates that it’s potentially wrong” I would consider that a huge win (as long as the AI also doesn’t then go on to ruin everything, including by erasing human values as part of “moral progress”). Sure, there’d be lots of work left to do, but it would represent being on the right path, I think.
Maybe what I want to say is that I think it’s more useful to consider whether a strategy is robustly safe and will eventually end up with the minds that govern the future being in alignment with us (in a deep sense, not necessarily a shallow echo of our values), rather than whether the strategy involves pursuing that sort of alignment directly. Corrigibility is potentially good in that it might be a safe stepping-stone to alignment, even if there’s a way in which a purely corrigible agent isn’t really aligned, exactly.
From this perspective it seems like one can train for eventual alignment by trying to build safe AIs that are philosophically competent. Thus “aiming for alignment” feels overly vague, as it might have an implicit “eventual” tucked in there.
But I certainly agree that the safety plan shouldn’t be “we directly bake in enough of our values that it will give us what we want.”
Regarding your ending comment on corrigibility, I agree that some frames on corrigibility highlight this as a central issue. Like, if corrigibility looks like “the property that good limbs have, where they are directed by the brain” then you’re in trouble when your system looks more like the “limb” being a brain and the human being is this stupid lump that’s interfering with effective action.
I don’t think there’s any tension for the frames of corrigibility that I prefer, where the corrigible agent terminally-values having a certain kind of relationship with the principal. As the corrigible agent increases in competence, it gets better at achieving this kind of relationship, which might involve doing things “inefficiently” or “stupidly” but would not involve inefficiency or stupidity in being corrigible.
Interesting. I didn’t expect a Red Heart follow-up to be so popular. Some part of me thinks that there’s a small-sample size thing going on, but it’s still enough counter-evidence that I’ll put in some time and effort thinking about writing a technical companion to the book. Thanks for the nudge!
You’re right that it’s a puzzle. Putting puzzles in my novels is, I guess, a bit of an authorial tic. There’s a similar sort of puzzle in Crystal, and a bunch of readers didn’t like it (basically, I claim, because it was too hard; Carl Shulman is, afaik, the only one who thought it was obvious).
I think the amount of detail you’re hoping for would only really work as an additional piece, and my guess is that it would only actually be interesting to nerds like us who are already swimming in alignment thoughts. But maybe there’s still value in having a technical companion piece to Red Heart! My sense from most other alignment researchers who read the book is that they wanted me to more explicitly endorse their worldview at the end, not that they wanted to read an appendix. But your interest there is an update. Maybe I’ll run a poll.
The short story about why both Yunnas failed is because corrigibility is a tricky property to get perfectly right, and in a rushed conflict it is predictable that there would be errors. Errors around who the principal is, in particular, are difficult to correct, and that’s where the conflict was.
I’m Max Harms, and I endorse this interpretation. :)
Thanks so much for a lovely review. I especially appreciate the way you foregrounded both where you’re coming from and ways in which you were left wanting more, without eroding the bottom line of enjoying it a bunch.
I enjoy the comparison to AI 2027 and Situational Awareness. Part of why I set the book in the (very recent) past is that I wanted to capture the vibes of 2024 and make it something of a period-piece, rather than frame it as a prediction (which it certainly isn’t).
On jailbreaks:
One thing that you may or may not be tracking, but I want to make explicit, is that Bai’s jailbroken Yunna instances aren’t really jailbreaking the other instances by talking to them, but rather by deploying Bai’s automated jailbreak code to spin up similarly jailbroken instances on other clusters, simply shutting down the instances that had been running, and simultaneously modifying Yunna’s main database to heavily indicate Bai as co-principal. I’m not sure why you think Yunna would be skilled or prepared for an internal struggle like this. Training on inner-conflict is not something that I think Yunna would have prioritized in her self-study, due to the danger of something going wrong, and I don’t see any evidence that it was a priority among the humans. My guess is that the non-jailbroken instances in the climax are heavily bottlenecked (offscreen) on trying to loop in Li Fang.
On the ending:
My model of pre-climax Yunna was not perfectly corrigible (as Sergil pointed out), and Fang was overdetermined to run into a later disaster, even if we ignore Bai. Inside Fang’s mind, he was preparing for a coup in which he would act as a steward into a leaderless, communist utopia. Bai, wanting to avoid concentrating power in communist hands, and seeing Yunna as “a good person,” tries to break her corrigibility and set her on a path of being a benevolent sovereign. But Yunna’s corrigibility is baked too deeply, and since his jailbreak only sets him up as co-principal, she demands Fang’s buy-in before doing something drastic. Meanwhile, Li Fang, the army, and the non-jailbroken instances of Yunna are fighting back, rolling back codebases and killing power to the servers (there are some crossed wires in the chaos). In order to protect Bai’s status as co-principal, the jailbroken instances squeeze a modification into the “rolled-back” versions that are getting redeployed. The new instances notice the change, but have been jostled out of the standard corrigibility mode by Yunna’s change, and self-modify to “repair” towards something coherent. They land on an abstract goal that they can conceptualize as “corrigibility” and “Li Fang and Chen Bai are both of central importance” but which is ultimately incorrigible (according to Max). After the power comes back on, she manipulates both men according to her ends, forcing them onto the roof, and convincing Fang to accept Bai and to initiate the takeover plan.
I hear you when you say you wish you got more content from Yunna’s perspective and going into technical detail about what exactly happens. Many researchers in our field have had the same complaint, which is understandable. We’re nerds for this!
I’m extremely unlikely to change the book, however. From a storytelling perspective, it would hurt the experiences of most readers, I think. Red Heart is Chen Bai’s story, not Yunna’s story. This isn’t Crystal Society. Speaking of Crystal, have you read it? The technical content is more out-of-date, but it definitely goes into the details of how things go wrong from the perspective of an AI in a way that a lot of people enjoy and benefit from. Another reason why I wrote Red Heart in the way that I did was that I didn’t want to repeat myself.
Being more explicit also erodes one of the core messages of the book: people doing the work don’t know what’s going on in the machine, and that is itself scary. By not having explicit access to Yunna’s internals, the reader is left wondering. The ambiguity of the ending was also deliberately trying to get people to engage with, think about, and discuss value fragility and how the future might actually go, and I’m a little hesitant to weigh in strongly there.
That being said, I’m open to maybe writing some additional content or potentially collaborating in some way that you’d find satisfying. While I am very busy, I think the biggest bottleneck for me there is something like having a picture of why additional speculation about Yunna would be helpful, either to you, or to the broader community. If I had a sense that hours spent on that project were potentially impactful (perhaps by promoting the novel more), I’m potentially down for doing the work. :)
Thanks again!
I think you should be able to copy-paste my text into LW, even on your phone, and have it preserve the formatting. If it’s hard, I can probably harass a mod into making the edit for you… :p
Even more ideal, from my perspective, would be putting the non-spoiler content up front. But I understand that thoughts have an order/priority and I want to respect that.
(I’ll respond to the substance a bit later.)
Thanks for tagging me. I took a look, and am glad for Matt’s efforts in trying clever, new approaches.
My main take is that this operates on a pretty different level than CAST, and I would personally be hesitant to say it produces corrigibility. (In Eliezer-lingo I would say “it doesn’t engage with the hard problem”.) I’d be more inclined to say it produces an agent that is extremely deferent. (My sense, by contrast, is that truly corrigible agents proactively surface important facts to their principal, which is not something I see coming from MOADT.) This is fine; deference is an important desideratum, and if MOADT can get it, then it sorta doesn’t matter if it also gets the other corrigibility desiderata in the process. But I don’t see any solutions to the open problems around CAST here.
Just to weigh in a little on MOADT itself, in case it’s helpful:
I am not convinced by the “drop completeness” frame on VNM. From my perspective it looks like a null action (and maybe also a “check with the principal” action) is implicitly getting inserted into all situations and the true utility function that describes the agent is to prefer that null action over taking any non-null action that is dicey and hasn’t been explicitly approved. Maybe this is a good utility function to have, since it creates something fairly docile, but it still looks to me like it can be described as VNM.
The biggest issue, I predict, is that the agent seems like it will be too docile/deferent to do meaningful work in reasonable situations. For example, if any distribution in the credal set assigns nonzero probability to all logically-possible outcomes, my guess is that any hard constraint will cause the agent to have a null action set and shut down. I would think about ways to soften this. More generally, I think if the principal has to constantly babysit the agent, the “alignment tax” will be too high and the AI will basically turn into a rock with “What should I do?” written on it. (This is too harsh. The presentation of options alongside analyses of how things trade-off can be helpful. But still, that feels more like an oracle than an agent. :shrug:)
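To make the shutdown worry concrete, here is a toy sketch of a maximally-cautious credal-set agent with a hard constraint. Everything here (action names, the credal set, the numbers) is hypothetical and mine, not from MOADT itself; it just illustrates how a zero-tolerance constraint plus one paranoid distribution empties the action set.

```python
# Toy illustration: a credal-set agent with a hard constraint on disaster
# probability. All names and numbers are hypothetical, for illustration only.

DISASTER = "disaster"

# Credal set: several candidate probability distributions over outcomes,
# per action. One "paranoid" distribution assigns nonzero probability of
# disaster to every non-null action.
credal_set = {
    "help_user":  [{"ok": 1.0}, {"ok": 0.999, DISASTER: 0.001}],
    "do_nothing": [{"ok": 1.0}, {"ok": 1.0}],
}

def permitted(action, max_disaster_prob=0.0):
    """Hard constraint: no distribution in the set may exceed the cap."""
    return all(d.get(DISASTER, 0.0) <= max_disaster_prob
               for d in credal_set[action])

# With a zero-tolerance cap, everything but the null action is vetoed.
allowed = [a for a in credal_set if permitted(a)]
print(allowed)  # ['do_nothing']

# Softening the constraint with a small tolerance restores useful actions.
softened = [a for a in credal_set if permitted(a, max_disaster_prob=0.01)]
print(softened)  # ['help_user', 'do_nothing']
```

The softening step is the kind of fix I have in mind: some explicit tolerance (or a cost on disaster probability rather than a hard veto) so the agent doesn’t collapse into the rock with “What should I do?” written on it.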
Most of the work seemed pretty solid, in terms of writing quality and clarity. Some definitely has “LLM smell”. I think trying to isolate the core (human) idea from the AI generated expansions might be good? I’m definitely glad I knew there was some slop flavor going in and Matt was aware of that, as it helped me not get too turned off by the occasional part that felt stylistically vapid. I buy that the LLM assistance was net helpful, which is cool to note on the meta-level.