Part of my own reconciliation is to question the premise that they would already be capable of ushering in a new industrial revolution. I’ve become more skeptical over time as these basic reasoning issues persist. It’s hard for me to imagine an industrial revolution’s worth of progress and innovation powered by a mind so lacking in coherent world models across so many domains.
aysja
I do think it deserves to be called quiet. For instance, it seems like they waited until the peak of the news cycle about their conflict with the US government to release this update, and I suspect that was intentional, and also that this worked. In the same week they dropped their core safety commitments, Anthropic was mostly hailed as a hero for standing up to the government; they got almost entirely good press.
But also, Holden’s post explaining the decision is about as understated as a post like that could be. He tried to frame it as something closer to “just another update,” and it was not even the central focus of the post (which I really think it ought to have been, given the gravity of it). The fact that Anthropic was reneging on the core promise of their RSP was systematically downplayed, as it has continued to be by many Anthropic employees who maintain that dropping all “if-thens” from their if-then framework does not meaningfully constitute violating it.
It probably is the case that most of the facts cited in the post came from LLM-generated research, and it’s true that I have no idea how many of them were checked against primary sources. This does not seem like a difference in kind from a post where most of the cited facts came from a variety of NYTimes articles.
I think there is a lot more reason to trust the facts cited in an NYT article. For one, the New York Times, along with most major news publications, has standards for fact checking. They try hard to get primary source validation, or at least secondary source validation (some of those guidelines are stated here); falsifying information is a fireable offense. They also have a reputation to uphold, a major part of which rests on their ability to convey the news truthfully. These kinds of checks don’t really exist for LLMs.
Nor do we have much insight into how LLM information is generated. With news publications, we can at least understand the sorts of biases which might be introduced via the mechanisms under which stories are produced: people interviewing a bunch of people, maybe in misleading ways, leaving out some facts, etc. With LLMs, we have much less of an idea of what kind of errors might emerge, and hence what to mentally correct for, since we don’t understand the process that generates their outputs.

if anything I’d expect current frontier LLMs to be slightly safer to rely on in this way (maybe more hallucinations, but less actively adversarial “technically true but misleading” framings).
Perhaps this is just a personal difference, but I would much rather take “technically true but misleading” over “totally wrong but subtle enough and authoritative enough and seems-kind-of-right enough that you can barely notice unless you really dig into the claims or already have extensive background knowledge.”
I do not think you believe the correct update to have made, upon reading that section of the post, was “no update”.
My response upon reading that LLMs did substantial research or writing for a post is generally to not make any update. That doesn’t mean parts of it aren’t right, they likely are, it just means that it takes a ton of work for me to suss out what’s true (much more than for a human post, for reasons that Gwern outlined above), and it’s usually not worth it.
I’ve had a similar experience in trying to have research discussions with LLMs. Every time I poke at my own conceptual confusion on a topic they just seem to kind of break down: saying inconsistent stuff in loops, retreating back to what has already been said on the topic. They’re even worse than this, since they also often get really basic stuff wrong. E.g., just the other day Claude told me that the k-complexity of a random string was the same as that of a crystal. This was in the context of a probably confusing conversation for it, where I was trying to more deeply grok, and so really push on, the confusions around complexity measures; still, it’s pretty revealing (imo) that this happens. Overall, LLMs seem pretty incoherent to me, and incapable of having “real,” “novel,” or “scientific” thoughts; I don’t feel like I can trust them with anything important.
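(As a rough aside on why that counts as a basic error; this is my own gloss, not something from the conversation.) Kolmogorov complexity measures the length of the shortest program that produces a string, so a crystal-like, periodic string is highly compressible while a typical random string is not:

```latex
% Informal sketch: a periodic ("crystal-like") string of n bits can be produced
% by a short program plus a counter, while most random n-bit strings have no
% description much shorter than themselves.
K(\underbrace{0101\cdots01}_{n\ \text{bits}}) = O(\log n)
\qquad\text{whereas}\qquad
K(x) \ge n - c \ \text{ for most } x \in \{0,1\}^n .
```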
But I’m always wondering if it is me who is crazy here, as my social environment seems to believe that LLMs are formidable forces of intellect, getting better by the year. My own sense-making of this situation is similar to Jeremy’s: it does seem like something is getting better, just something more along the lines of ~”filling in between the lines of what is already known” and less “raw intelligence,” whatever that is. But it’s of course impossible to talk about any of these things or to even really know what the difference is and so on, and in hearing more and more hype about LLMs getting better at coding, and not being much of a coder myself, I have been worrying that my own experience isn’t very representative. Maybe you can just get excellence out if you train super hard on a given domain, I don’t know. But also, maybe people are pointing at the same sort of thing when they say LLMs are “good at coding” as they are when they say they are getting smarter. So it’s an interesting data point for me, to see Jeremy describe it as such here.
Also, from We’re Not Ready: thoughts on “pausing” and responsible scaling policies:
I’m excited about RSPs partly because it seems like people in those categories—not just people who agree with my estimates about risks—should support RSPs. This raises the possibility of a much broader consensus around conditional pausing than I think is likely around immediate (unconditional) pausing. And with a broader consensus, I expect an easier time getting well-designed, well-enforced regulation.
I think RSPs represent an opportunity for wide consensus that pausing under certain conditions would be good, and this seems like it would be an extremely valuable thing to establish.
First and foremost, in my mind, [this revision] is about learning from design flaws and making improvements. I always thought of the original RSP as a “v1” that would be iterated on
It seems pretty misleading to describe the shift away from unilateral pausing as a natural extension of the RSP being a living document. Of course people expected some changes to occur, but I think these changes were understandably expected to be more of the type “adjusting mitigations as they learned more” and less of the type “removing if-then commitments entirely.” Indeed, the “if-then” structure was the core safety motivation for RSPs as many understood it—in particular the idea that Anthropic would pause if danger exceeded some threshold—and it was heavily defended by Anthropic employees as such. I highly doubt most people would have predicted that Anthropic would drop this commitment later; I think doing so marks the breaking of a meaningful promise—something many people were relying on, making career decisions on the basis of, etc.
This new policy also seems even less safe to me. For instance, you describe some of the “wrong incentives” the RSP produced:
We knew that if we declared a model to cross the CBRN-4 or AI R&D-5 line, this could be extremely damaging to the company (in that our RSP would then require a unilateral pause or slowdown in AI development and deployment), while having little discernible public benefit… It seemed to me that there was an enormous amount of pressure to declare our systems to lack relevant capabilities
In other words, because a pause would seriously damage the company, there was pressure to misrepresent the risk. I think this should seriously call Anthropic’s ability to self-govern into question, yet Anthropic’s response is to commit themselves to even less. That is, the solution Anthropic is adopting to reduce such pressure, as I understand it, is simply to remove the consequences: since the threat of pausing is gone, there is little incentive to pretend models are safer than they are. But Anthropic will continue to confront this pressure, since the tension between company success and safety concerns is only going to grow as AI becomes more powerful. And without the commitment to pause, Anthropic is free to deploy really unsafe models!
Likewise, the reason you give that pausing would have “little discernible public benefit” seems to be because measuring the risk turned out to be too difficult, such that strategies like “sounding the alarm” are less likely to work:
The problem with this is the “grey zone” between not being able to make a good case that risks are low and being able to make a broadly compelling case that risks are high. This is a potentially vast gap… I don’t think a unilateral slowdown would necessarily be effective in such a situation… and [would] mostly be seen as crying wolf.
That is, since Anthropic cannot make strong guarantees about the risk one way or the other, it will be hard to get the world to rally around such equivocal evidence. But this is really bad! If it’s currently impossible to measure the risk well enough to get any kind of scientific consensus about its magnitude, it doesn’t just make global coordination hard, it makes it hard for any AI company to act safely, since they simply do not know what will or won’t be destructive. Yet Anthropic’s response to realizing the difficulty of defining red lines (and being able to tell whether they’ve been crossed) seems to be to do away with red lines altogether!
Which is to say that the situation as you’ve presented it seems strictly worse relative to the one Anthropic was imagining two years ago: we’re closer to AGI, but we have much less hope of accurately assessing the risk, and the political landscape is less favorable. Yet it seems like your proposal, in response to an overall more dangerous situation, is to be even more reckless. Instead of taking the ethical stance clearly outlined in the first RSP—“just because other language models pose a catastrophic risk does not mean it is acceptable for ours to”—Anthropic now only promises to be as safe as the least safe company. In light of this, there seems to be no commensurate push to try to shift political will toward coordinated pauses, no signing of public statements to the effect that Anthropic would pause, if everyone else did, or similar. As far as I can tell, the rationale for this centrally rests on fatalism:
Perhaps [slowing down would be a good idea], if we could be assured that the rest of the AI ecosystem would behave similarly. But I don’t think that’s plausible.
In other words, the trajectory to AGI cannot be much influenced by Anthropic’s actions, as people are going to race toward it regardless. But my god, does a post which is fundamentally premised on the inevitability of this race do so little to grapple with it. Not once does this post mention the possibility of extinction, for example, as if the real stakes and the real casualties Anthropic might cause have been forgotten. Very little attention is given to whether the race to AGI is in fact inevitable, or if there might be something Anthropic—as a leading player in this race (!)—might be able to do about that. Nor is any mention made of the role Anthropic has played in shaping this unfortunate political landscape which they now report being so helplessly beholden to. What is the point of having a seat at the table, if one doesn’t use it to wield influence in situations like this?
It’s not just any blog post. It’s a blog post outlining a major new strategic shift in the company, specifically in the direction of giving Anthropic far more leeway over how they decide what the risk is and how to deal with it. It seems especially important to state “we can’t leave this up to the companies” loudly and clearly here.
Trying to avoid self-deception seems like an important piece of it (although it seems non-trivial; e.g., it’s easy to self-deceive about one’s own level of self-deception). But for high-variance, high-impact stuff it separately seems especially important to try to take actions which are good over as many worlds as possible. Consequentialism doesn’t necessarily do this, since single factors can dominate the calculus. That causes optimizer’s curse problems, but more generally: in highly uncertain domains, probability estimates are just really often wrong. And especially when such a misstep can cause massive harm, I think it’s also worth trying to compensate for the uncertainty in the direction of being more robust to those errors.
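To make the optimizer’s curse point concrete, here is a minimal simulation sketch (my own toy example with arbitrary parameters, not anything from the post): when you pick whichever option has the highest estimated value, that winning estimate is, on average, biased upward relative to the option’s true value.

```python
import random

# Toy illustration of the optimizer's curse (assumed parameters, not from the post):
# true values and estimation noise are both Gaussian; we pick the option with the
# highest *estimate* and measure how much that estimate overstates the option's
# true value on average.
random.seed(0)

def average_overestimate(n_options=10, noise=1.0, n_trials=10_000):
    total_bias = 0.0
    for _ in range(n_trials):
        true_values = [random.gauss(0, 1) for _ in range(n_options)]
        estimates = [v + random.gauss(0, noise) for v in true_values]
        best = max(range(n_options), key=lambda i: estimates[i])
        total_bias += estimates[best] - true_values[best]  # chosen option's optimism
    return total_bias / n_trials

print(f"Average overestimate of the chosen option: {average_overestimate():.2f}")
```

Each individual estimate is unbiased; the upward bias appears only because selection favors options whose noise happened to be positive, which is one way of cashing out how a single dominant factor in the calculus can mislead.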
I love the idea behind this post, and I’ve come back to it many times over the years. Now I’m coming back to it with error correction on my mind.
I agree that in a game of telephone only information which is preserved through each step will make it to the end. But many processes are unlike this. E.g., normally if you didn’t hear something I said, you can ask. And one could imagine arbitrarily long gaps—you ask me to clarify in a week or a month or what have you. Or maybe someone has an insight and then loses it, gains it again, and then writes a book such that the information is pretty reliably transmitted in the world.
In all such cases there will be a few to many layers of the nested Markov structure which lose the information but gain it again at a later point. Which is to say that the constraint of needing each intermediate state to carry the information in order for the information to be preserved seems too strong. “Ability to recover the information” seems like a necessary constraint, although not sufficient: you need some kind of error corrector or the original generative process to run again in order to in fact recover it.
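As a toy illustration of that last point (my own sketch with arbitrary numbers, not something from the original discussion): a chain of noisy relays can corrupt individual copies of a bit, so some intermediate states fail to carry the information, yet an explicit error corrector (majority vote over redundant copies) recovers it downstream far more reliably than a single copy survives on its own.

```python
import random

# Toy sketch (assumed numbers): a bit passes through several noisy relay steps.
# Individual copies get flipped along the way, but majority vote over redundant
# copies acts as an error corrector and recovers the original bit far more
# often than a single unprotected copy survives.

def run_chain(bit, n_copies=5, n_steps=3, flip_prob=0.05):
    copies = [bit] * n_copies
    for _ in range(n_steps):
        copies = [b ^ 1 if random.random() < flip_prob else b for b in copies]
    return copies

def majority_vote(copies):
    return 1 if sum(copies) > len(copies) // 2 else 0

trials = 10_000
single_copy_ok = sum(run_chain(1, n_copies=1)[0] == 1 for _ in range(trials))
corrected_ok = sum(majority_vote(run_chain(1)) == 1 for _ in range(trials))
print(f"single copy survives the chain: {single_copy_ok / trials:.1%}")
print(f"recovered via majority vote:    {corrected_ok / trials:.1%}")
```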
I wrote the first draft of this essay around a year ago, in between the bouts of delirium that long covid was beginning to deliver me. And I couldn’t quite tell back then how real it was, and as long covid consumed more of my mind it drifted further away. It began to feel impossible that I had ever had, or could ever have, courage. Because courage requires capacity and I was losing all of mine. And the doubts grew larger, and the clarity dimmed, and I forgot about Frodo for a while, forgot about most everything, as I was left for many months staring directly into the bowels of deep atheism, wondering if I may ever be free from its merciless hold. And it really tested the fortitude of my soul, for there were moments when completely giving up felt the most natural, and really the only, option. But then somewhere in the grappling with this miserable new world I had come to inhabit I remembered Frodo again. And it was not instant, and it was not easy, but developing this concept of solemn courage did help my spirit recover.
I do not get to choose the world I am given. Reality is such that your mind can be randomly corrupted, some molecular demon etching away the grooves that were you until you are a nothingness. Reality is such that everyone I love will likely die. Some distant, plaintive conclusion accelerating into the present by that mysterious process so ravenously set in motion. And there really might not be much I can do about it, for all of my effort may just be drops against its tidal wave. God! Reality can be so unkind. Yet there is something powerful in the orientation of trying anyway. Because in the end that is all there is. In the end the stakes are what they are, and the situation is what it is, and all I can decide is what to do with what I am given. That’s really it, and accepting this has given me clarity. Yes, there will be days I cannot overcome illness; yes, I may not much affect the looming god machines; and yes, that is all very painful. But I’m not going to get lost in it. I’m going to look at it—the uncertainty and fear, the grip of disease and the overwhelmingly large and complicated threat to all I value—and then I’m going to try. Because it is important, and that is all I can do.
Solemn Courage
I agree that human values are more accretive like this, but I would also call those genes “terminal” in the same sense that I call some of my own goals “terminal.” E.g., I can usually ask myself why I’m taking a given action and my brain will give a reasonable answer: “because I want to finish this post,” “because I’m hungry,” whatever. And then I can keep double clicking on those: “I want to finish the post because I don’t think this crux has been spelled out very well yet” and I can keep going and going until at some point the answer is like “I don’t know, because it’s intrinsically beautiful?” and that’s around when I call the goal/preference “terminal.” Which is similar in structure to a story I imagine evolution might tell if it “asked itself” why some particular gene developed.
Perhaps “terminal” is the wrong word for this, but having a handle for these high-level, upstream nodes in my motivational complex has been helpful. And they do hold a special status, at least for me, because many of the “instrumental” actions (or subgoals) could be switched out while preserving this more nebulous desire to “understand” or “find beauty” or what have you. That feels like an important distinction that I want to keep while also agreeing they aren’t always cleanly demarcated as such. E.g., writing has both instrumental and terminal qualities to me, which can make it a more confusing goal-structure to orient to, but also as you say: more strange and wonderful, too.
I’m not sure if I expect motivated reasoning to come out better on average, even in domains where you might naively expect it to. In part that’s because self-serving strategies often involve doing things other people don’t like, e.g. being deceptive, manipulative, or generally unethical, in a way that can cause long-term harm to your reputation and so long-term harm to your ability to win. And I think there is significant optimization pressure on catching this kind of thing, in part for reasons similar to the ones outlined in Elephant in the Brain, i.e., that we evolved in an environment where winning that cat and mouse game was a big part of adaptive success. But also just because people don’t like being screwed, and so are on the lookout for this kind of behavior.
Also, in my imagination you’re more likely to win if you’re at least self-reflective about motivated cognition, since you can make more informed decisions that way. If you just go blindly ahead, then you’re probably failing to track a bunch of what matters, and so failing to win according to what you ultimately care about. Like, in most cases motivated reasoning spins up not just to convince other people, but to convince yourself, which means there’s a part of you that needed convincing in the first place, i.e., a part that is tracking and wanting different things. And I would guess that charging ahead without understanding those dynamics leads to worse outcomes overall? Another way to say it is that I don’t imagine a good rationalist as acting against their own interests, but more like they understand them clearly, such that they can decide what makes sense based on a fuller picture of their own mind.
Fwiw, my experience has been more varied. My most well received comments (100+ karma) are a mix of spending days getting a hard point right and spending minutes extemporaneously gesturing at stuff without much editing. But overall I think the trend points towards “more effort = more engagement and better received.” I have mostly attributed this to the standards and readership LessWrong has cultivated, which is why I feel excited to post here. It seems like one of the rare places on the internet where long, complex essays about the most fascinating and important topics are incentivized. My reddit posts are not nearly as well received, for instance. I haven’t posted as many essays yet, but I’ve spent a good deal of effort on all of them, and they’ve all done fairly well (according to karma, which ofc isn’t a great indicator of impact, but some measure of “popularity”).
I weakly guess that your hypothesis is right, here. I.e., that the posts you felt most excited about were exciting in part because they presented more interesting and so more difficult thinking and writing challenges. At least for me, tackling topics on the edge of my knowledge takes much more skill and much more time, and it is often a place where effort translates into “better” writing: clearer, more conceptually precise, more engaging, more cutting to the core of things, more of what Pinker is gesturing at. These posts would not be good were they pumped out in a day—not an artifact I’d be proud of, nor something that other people would see the beauty or the truth in. But the effortful version is worth it, i.e., I expect it to be more helpful for the world, more enduring, and more important, than if that effort had been factored out across a bunch of smaller, easier posts.
I haven’t followed every comment you’ve left on these sorts of discussions, but they often don’t include information or arguments I can evaluate. Which MIRI employees, and what did they actually say? Why do you think that working at Anthropic even in non-safety roles is a great way to contribute to AI safety? I understand there are limits to what you can share, but without that information these comments don’t amount to much more than you asking us to defer to your judgement. Which is a fine thing to do, I just wish it were more clearly stated as such.
This was a really important update for me. I remember being afraid of lots of things before I started publishing more publicly on the internet: how my intelligence would be perceived, if I’d make some obviously-stupid-in-retrospect point and my reputation would be ruined forever, etc. Then at some point in this thought loop I was like: wait, the most likely thing is just that no one reads this, right? More like a “huh” or a nothing at all rather than vitriolic hatred of my soul or whatever I was fearing. This was very liberating, and still is. I probably ended up over-optimizing for invisibility because of the freedom I feel from it—being mostly untethered from myopic social dynamics has been really helpful for my thinking and writing.
I tend to write in large tomes that take months or years to complete, so I suppose I disagree with you too. Not that intellectual progress must consist of this, obviously, but that it can mark an importantly different kind of intellectual progress from the sort downstream of continuous shipping.
In particular, I think shipping constantly often causes people to be too moored to social reception, risks killing butterfly ideas, screens off deeper thought, and forces premature legibility. Like, a lot of the time when I feel ready to publish something, there is some bramble I pass in my writing, some inkling of “Is that really true? What exactly do I mean there?” These often spin up worthy investigations of their own, but I probably would’ve failed to notice them were I more focused on getting things out.
Intellectual labor should aggregate minute-by-minute with revolutionary insights aggregating from hundreds of small changes.
This doesn’t necessarily seem in conflict with “long tomes which take months to write.” My intellectual labor consists of insights aggregating from hundreds of small changes afaict, I just make those changes in my own headspace, or in contact with one or two other minds. Indeed, I have tried getting feedback on my work in this fashion and it’s almost universally failed to be helpful—not because everyone is terrible, but because it’s really hard to get someone loaded enough to give me relevant feedback at all.
Another way to put it: this sort of serial iteration can happen without publishing often, or even at all. It’s possible to do it on your own, in which case the question is more about what kind of feedback is valuable, and how much it makes sense to push for legibility versus pursuing the interesting thread formatted in your mentalese. I don’t really see one as obviously better than the other in general, and I think that doing either blindly can be pretty costly, so I’m wary of it being advocated as such.
The first RSP was also pretty explicit about their willingness to unilaterally pause:
Note that ASLs are defined by risk relative to baseline, excluding other advanced AI systems.… Just because other language models pose a catastrophic risk does not mean it is acceptable for ours to.
Which was reversed in the second:
It is possible at some point in the future that another actor in the frontier AI ecosystem will pass, or be on track to imminently pass, a Capability Threshold… such that their actions pose a serious risk for the world. In such a scenario, because the incremental increase in risk attributable to us would be small, we might decide to lower the Required Safeguards.
<3
I agree with you that something like the crystalized/fluid distinction is relevant here, and that current LLMs seem to have more of the former. But I’m also confused about where the fluidity ever comes from on this model. Like, I buy that armies of automated researchers which are better at doing everything than top human researchers could probably find a way to figure out how to build “human level fluid intelligence,” but I am confused about how you get to that step in the first place. Why are they better than human researchers at everything when they are still mostly using crystallized intelligence?