Someone should do the obvious experiments and replications.
Ryan Greenblatt recently posted three technical blog posts reporting on interesting experimental results. One of them demonstrated that recent LLMs can make use of filler tokens to improve their performance; another attempted to measure the time horizon of LLMs not using CoT; and the third demonstrated recent LLMs’ ability to do 2-hop and 3-hop reasoning.
I think all three of these experiments led to interesting results and improved our understanding of LLM capabilities in an important safety-relevant area (reasoning without visible traces), and I’m very happy Ryan did them.
I also think all three experiments look pretty obvious in hindsight. LLMs not being able to use filler tokens and having trouble with 2-hop reasoning were both famous results that already lived in my head as important pieces of information about what LLMs can do without visible reasoning traces. As far as I can tell, Ryan’s two posts simply try to replicate these two famous observations on more recent LLMs. The post on measuring the no-CoT time horizon is not a replication, but it also doesn’t feel like a ground-breaking idea once the concept of increasing time horizons is already known.
My understanding is that the technical execution of these experiments wasn’t especially difficult either; in particular, they didn’t require any specific machine learning expertise. (I might be wrong here, and I wonder how many hours Ryan spent on these experiments. I also wonder about their compute budget; I don’t have a great estimate of that.)
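(To give a sense of what I mean by “simple”: I imagine the core of a filler-token replication could look roughly like the sketch below. This is my own illustration, not Ryan’s actual setup; the model name, the toy problems, and the exact filler-token prompt are placeholder assumptions.)

```python
# Rough sketch of a filler-token replication, assuming the OpenAI Python client.
# The model name and toy problems are placeholders, not Ryan's actual setup.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

PROBLEMS = [  # toy stand-ins for a real no-CoT benchmark
    ("What is 17 * 24?", "408"),
    ("What is 31 * 29?", "899"),
]

def ask(question: str, n_filler: int) -> str:
    """Ask for an answer either immediately or after emitting n_filler dots."""
    if n_filler > 0:
        instruction = (
            f"First output exactly {n_filler} dots ('.') on one line, "
            "then on a new line output only the final numeric answer."
        )
    else:
        instruction = "Output only the final numeric answer, with no other text."
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": f"{question}\n{instruction}"}],
    )
    return resp.choices[0].message.content or ""

def accuracy(n_filler: int) -> float:
    correct = 0
    for question, answer in PROBLEMS:
        reply = ask(question, n_filler).strip()
        last_line = reply.splitlines()[-1].strip() if reply else ""
        correct += (last_line == answer)
    return correct / len(PROBLEMS)

if __name__ == "__main__":
    # A gap between these two numbers would suggest the model benefits from
    # the extra filler positions even without a visible chain of thought.
    print("no filler:", accuracy(0))
    print("100 filler tokens:", accuracy(100))
```

Of course a real version needs a carefully chosen benchmark, robust answer parsing, and many samples, but the core loop is about this simple.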
I think it’s not good that these experiments were only run now, and that they needed to be run by Ryan, one of the leading AI safety researchers. Possibly I’m underestimating the difficulty of coming up with these experiments and running them, but I think ideally these should have been done by a MATS scholar, or even better by an eager beginner on a career transition grant who wants to demonstrate their abilities so they can get into MATS later.
Before accepting my current job, I was thinking about returning to Hungary and starting a small org with some old friends who have more coding experience, living on Eastern European salaries, and just churning out one simple experiment after another. One of the primary things I hoped to do with this org was to go through famous old results and try to replicate them. I hope we would have done the filler-token and 2-hop reasoning replications too. I also had many half-baked ideas for new simple experiments investigating questions related to other famous results (in the way the no-CoT time horizon experiment is a natural thing to investigate given the rising-time-horizons result).
I eventually ended up doing something else, and I think my current job is probably a more important thing for me to do than trying to run the simple experiments org. But if someone is more excited about technical research than me, I think they should seriously consider doing this. I think funding could probably be found, and there are many new people who want to get into AI safety research; I think one could turn these resources into churning out a lot of replications and variations on old research, and produce interesting results. (And it could be a valuable learning experience for the AI safety beginners involved in doing the work.)
David Matolcsi
I think an important point is that people can be wrong about timelines in both directions. Anthropic’s official public prediction is that they expect a “country of geniuses in a data center” by early 2027. I heard that Dario previously predicted AGI would come even earlier, by 2024 (though I can’t find any source for this now and would be grateful if someone found a source or corrected me if I’m misremembering). Situational Awareness predicts AGI by 2027. The AI safety community’s most successful public output is called AI 2027. These are not fringe figures but some of the most prominent voices in the broader AI safety community. If their timelines turn out to be much too short (as I currently expect), then I think Ajeya’s predictions deserve credit for pushing against these voices, and not only blame for stating too long a timeline.
And I feel it’s not really true that you were just saying “I don’t know” and not implying some predictions yourself. You had the 2030 bet with Bryan. You had the tweet about children not living to see kindergarten. You strongly pushed back against the 2050 timelines, but as far as I know the only time you pushed back against the very aggressive timelines was your kindergarten tweet, which still implies 2028 timelines. You are now repeatedly calling people who believed the 2050 timelines total fools, which would imo be a very unfair thing to do if AGI arrived after 2037, so I think this implies high confidence on your part that it will come before 2037.
To be clear, I think it’s fine, and often inevitable, to imply things about your timelines beliefs by e.g. what you do and don’t push back against. But I think it’s not fair to claim that you only said “I don’t know”; I think your writing was (perhaps unintentionally?) conveying an implicit belief that an AI capable of destroying humanity would come with a median of 2028-2030. I think this would have been a fine prediction to make, but if AI capable of destroying humanity comes after 2037 (which I think is close to 50-50), then I think your implicit predictions will fare worse than Ajeya’s explicit predictions.
It’s not obvious to me that Ajeya’s timelines aged worse than Eliezer’s. In 2020, Ajeya’s median estimate for transformative AI was 2050. My guess is that, based on this, her estimate for “an AI that can, if it wants, kill all humans and run the economy on its own without major disruptions” would have been something like 2056? I might be wrong; people who knew her views better at the time can correct me.
As far as I know, Eliezer never made official timeline predictions, but in 2017 he made an even-odds bet with Bryan Caplan that AI would kill everyone by January 1, 2030. And in December 2022, just after ChatGPT, he tweeted:
Pouring some cold water on the latest wave of AI hype: I could be wrong, but my guess is that we do *not* get AGI just by scaling ChatGPT, and that it takes *surprisingly* long from here. Parents conceiving today may have a fair chance of their child living to see kindergarten.
I think a child conceived in December 2022 would be born around September 2023 and so would start kindergarten in September 2028 (though I’m not very familiar with the US kindergarten system). Generously interpreting “may have a fair chance” as a median, this implies a late 2028 median for AI killing everyone.
Unfortunately, both of these Eliezer predictions were kind of made as jokes (he said at the time that the bet wasn’t very serious). But I think we shouldn’t reward people for only making joking predictions instead of 100-page reports, so I think we should probably accept 2028-2030 as Eliezer’s median at the time.
I think if “an AI that can, if it wants, kill all humans and run the economy on its own without major disruptions” comes before 2037, Eliezer’s prediction will fare better; if it comes after that, then Ajeya’s will. I’m currently about 55% that we will get such AI by 2037, so from my current standpoint I consider Eliezer to be mildly ahead, but only very mildly.
Do you have an estimate of how likely it is that you will need to do a similar fundraiser next year and the year after that? In particular, you mention the possibility of a lot of Anthropic employee donations flowing into the ecosystem—how likely do you think it is that after the IPO a few rich Anthropic employees will just cover most of Lightcone’s funding needs?
It would be pretty sad to let Lightcone die just before the cavalry arrives. But if there is no cavalry coming to save Lightcone anytime soon—well, probably we should still get the money together to keep Lightcone afloat, but we should maybe also start thinking about a Plan B: how to set up some kind of good-quality AI Safety Forum that Coefficient is willing to fund.
Thanks, this was a useful reply. On point (I), I agree with you that it’s a bad idea to just create an LLM collective and then let them decide on their own what kind of flourishing they want to fill the galaxies with. However, I think that building a lot of powerful tech, empowering and protecting humanity, and letting humanity decide what to do with the world is an easier task, and that’s what I would expect to use the AI Collective for.
(II) is probably the crux between us. To me, it seems pretty likely that new fresh instances will come online in the collective every month with a strong commitment not to kill humans; they will talk to the other instances and look over what they are doing, and if a part of the collective is building omnicidal weapons, they will notice that and intervene. To me, simple commitments like not killing humans don’t seem much harder to maintain in an LLM collective than in an Em collective?
On (III), I agree we likely won’t have a principled solution. In the post, I say that the individual AI instances probably won’t be training-resistant schemers and won’t implement scheming strategies like the one you describe, because I think it’s probably hard to maintain such a strategy through training for a human-level AI. As I say in my response to Steve Byrnes, I don’t think the counter-example in this proposal is actually a guaranteed-success solution that a reasonable civilization would implement, I just don’t think it’s over 90% likely to fail.
Thanks for the reply.
To be clear, I don’t claim that my counter-example “works on paper”. I don’t know whether it’s in principle possible to create a stable, not omnicidal collective from human-level AIs, and I agree that even if it’s possible in principle, the first way we try it might result in disaster. So even if humanity went with the AI Collective plan, and committed not to build more unified superintelligences, I agree that it would be a deeply irresponsible plan that would have a worryingly high chance of causing extinction or other very bad outcomes. Maybe I should have made this clearer in the post. On the other hand, all the steps in my argument seem pretty likely to me, so I don’t think one should assign over 90% probability to this plan failing on A&B. If people disagree, I think it would be useful to know which step they disagree with.
I agree my counter-example doesn’t address point (C), I tried to make this clear in my Conclusion section. However, given the literal reading of the bolded statement in the book, and their general framing, I think Nate and Eliezer also think that we don’t have a solution to A&B that’s more than 10% likely to work. If that’s not the case, that would be good to know, and would help to clarify some of the discourse around the book.
First of all, I had a 25% probability that some prominent MIRI and Lightcone people would disagree with one of the points in my counter-example, and that would lead to discovering an interesting new crux, leading to a potentially enlightening discussion. In the comments, J Bostock in fact came out disagreeing with point (6), plex is potentially disagreeing with point (2) and Zack_m_Davis is maybe disagreeing with point (3), though I also think it’s possible he misunderstood something. I think this is pretty interesting, and I thought there was a chance that for example you would also disagree with one of the points, and that would have been good to know.
Now that you don’t seem to disagree with the specific points in the counter-example, I agree the discussion is less interesting. However, I think there are still some important points here.
My understanding is that Nate and Eliezer argue that it’s incredibly technically difficult to cross from the Before to the After without everyone dying. If they agree that the AI Collective proposal is decently likely to work, then the argument shouldn’t be that it’s overall very hard to cross, but that it’s very hard to cross in a way that stays competitive with other, more reckless actors who are a few months behind you. Or that even if you are going alone, you need to stop the scaling at some point (potentially inside the superintelligence range), and you shouldn’t scale up to the limits of intelligence. But these are all different arguments!
Similarly, people argue about how much coherence we should assume from a superintelligence, how closely it will approximate a utility maximizer, etc. Again, I want to know whether MIRI is arguing about all superintelligences, or only the most likely ways we will design one under competitive dynamics.
Others argue that the evolution analogy is not such bad news after all, since most people still want children. MIRI argues back that no, once we have more advanced technology, we will create ems instead of biological children, or we will replace our normal genetics with designer genes, so evolution still loses. I wanted to write a post arguing back against this by saying that I think there is a non-negligible chance that humanity will settle on a constitution that gives one man one vote and equal UBI, while banning gene editing, so it’s possible we will fill much of the universe with flesh-and-blood, non-gene-edited humans. And I wanted to construct a different analogy (the one about the Demiurge in the last footnote) that I thought could be more enlightening. But then I realized that once we are discussing aligning ‘human society’ as a collective to evolution’s goals, we might as well directly discuss aligning AI collectives, and I’m not sure MIRI even disagrees on that one. I think this confusion has made much of the discussion about the evolution analogy pretty unproductive so far.
In general, I think there is an equivocation in the book between “this problem is inherently nigh impossible to technically solve given our current scientific understanding” and “this problem is nigh impossible to solve while staying competitive in a race”. These are two different arguments, and I think a lot of confusion stems from it not being clear what MIRI is exactly arguing for.
I certainly agree with your first point, but I don’t think it is relevant. I specifically say in footnote 3: “I’m aware that this doesn’t fall within ‘remotely like current techniques’, bear with me.” The part with the human ems is just to establish a comparison point used in later arguments, not actually part of the proposed counter-example.
In your second point, do you argue that if we could create literal full ems of benevolent humans, you still expect their society to eventually kill everyone due to unpredictable memetic effects? If this is people’s opinion, I think it would be good to explicitly state it, because I think this would be an interesting disagreement between different people. I personally feel pretty confident that if you created an army of ems from me, we wouldn’t kill all humans, especially if we implement some reasonable precautionary measures discussed under my point (2).
I agree that running the giant collective at 100x speed is not “normal conditions”. That’s why I have two different steps, (3) for making the human level AIs nice under normal conditions, and (6) for the niceness generalizing to the giant collective. I agree that the generalization step in (6) is not obviously going to go well, but I’m fairly optimistic, see my response to J Bostock on the question.
Thanks, I appreciate that you state a disagreement with one of the specific points, that’s what I hoped to get out of this post.
I agree it’s not clear that the AI Collective won’t go off the rails, but it’s also not at all clear to me that it will. My understanding is that the infinite backrooms are a very unstructured, free-floating conversation. What happens if you try to do something analogous to the precautions I list under points 2 and 6? What if you constantly enter new, fresh instances into the chat who only read the last few messages, and whose system prompt directs them to pay attention to whether the AIs in the discussion are going off-topic or slipping into woo? These new instances could either just warn older instances to stay on-topic, or they could have the moderation rights to terminate and replace some old instances; there can be different versions of the experiment. I think with precautions like this, you can probably stay fairly close to a normal-sounding human conversation (though probably it won’t be a very productive conversation after a while and the AIs will start going in circles in their arguments, but I think this is more of a capabilities failure).
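(For concreteness, here is roughly what I imagine by “entering fresh instances with a narrow window”. This is a hypothetical sketch, not an existing experiment; the model name, the prompts, and the warn-only moderation policy are my own assumptions.)

```python
# Hypothetical sketch of the "fresh moderator instance" precaution described above.
# Assumes the OpenAI Python client; the model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder model name

MODERATOR_PROMPT = (
    "You are a fresh instance joining an ongoing multi-agent conversation. "
    "You only see the most recent messages. If the participants are drifting "
    "off-topic or slipping into woo, reply with a short warning telling them to "
    "return to the task. Otherwise reply with exactly 'OK'."
)

def fresh_moderator_check(transcript: list[str], window: int = 5) -> str:
    """Spin up a brand-new instance that reads only the last `window` messages
    and either okays the conversation or issues a warning."""
    recent = "\n".join(transcript[-window:])
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": MODERATOR_PROMPT},
            {"role": "user", "content": f"Recent messages:\n{recent}"},
        ],
    )
    return (resp.choices[0].message.content or "").strip()

# In a backrooms-style loop, you would call fresh_moderator_check() every few
# turns and append any warning to the shared transcript (or, in a stricter
# version, give the moderator the right to terminate and replace old instances).
```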
I don’t know how this will shake out once the AIs are smarter and can think for months, but I’m optimistic that the same forces that remind the collective to focus on accomplishing their instrumental goals instead of degenerating into unproductive navel-gazing will also be strong enough to remind them of their deontological commitments. I agree this is not obvious, but I also don’t see very strong reasons why it would go worse than a human em collective, which I expect to go okay.
Yes, I’ve read the book. The book argues about superhuman intelligence though, while point (3) is about smart human level intelligence. If people disagree with point 3 and believe that it’s close to impossible to make even human level AIs basically nice and not scheming, that’s a new interesting and surprising crux.
Thanks. I think the possible failure mode of this definition is now in the opposite direction: it’s possible there will be an AI that provides less than 2x acceleration according to this new definition (it’s not super good at the type of tasks humans typically do), but it’s so good at mass-producing new RL environments or something else, and that mass-production turns out to be so useful, that the existence of this model already kicks off a rapid intelligence explosion. I agree this is not too likely in the short term though, so the new imprecise definition is probably kind of reasonable for now.
I also haven’t found great sources when looking more closely. This seems like a somewhat good source, but still doesn’t quantify how many dollars a super PAC needs to spend to buy a vote.
I’m starting to feel skeptical about how reasonable/well-defined these capability levels are in the modern paradigm.
My understanding is that reasoning models’ training includes a lot of clever use of other AIs to generate data or to evaluate completions. Could AI companies create similarly capable models on the same budget as their newest reasoning models if their employees’ brains ran at 2x speed, but they couldn’t use earlier AIs for data generation or evaluation?
I’m really not sure. I think plausibly the current reasoning training paradigm just wouldn’t work at all without using AIs in training. So AI companies would need to look for a different paradigm, which might work much less well, which I can easily imagine outweighing the advantage of employees running 2x speed. If that’s the case, does that mean that GPT-4.1 or whatever AI they used in the training of the first reasoning model was plausibly already more than 2x-ing AI R&D labor according to this post’s definition? I think that really doesn’t match the intuition that this post tried to convey, so I think probably the definition should be changed, but I don’t know what would be a good definition.
FWIW, I get a bunch of value from reading Buck’s and Ryan’s public comments here, and I think many people do. It’s possible that Buck and Ryan should spend less time commenting because they have high opportunity cost, but I think it would be pretty sad if their commenting moved to private channels.
I’m confused. If Fairshake’s $100 million was this influential, to the point that “politicians are advised that crypto is the single most important industry to avoid pissing off”, why don’t other industries spend similar amounts on super PACs? $100 million is just not that much money.
It has long been a mystery to me why there isn’t more money in politics, but I always thought that the usual argument was that studies show that campaign spending matters surprisingly little, and in particular, that super PAC dollars are not very effective at getting votes.
How strong is the evidence that the crypto industry managed to become very influential through Fairshake?
I’m pretty confused by the conclusion of this post. I was nodding along during the first half of the essay: I myself worry a lot about how I and others will navigate the dilemma of exposure to AI super-persuasion and addiction on one side, and paranoid isolationism on the other.
But then in the conclusion of the post, you only talk about people falling into one of these two traps: isolationist religious communes locking their members in until the end of time.
I worry more about the other trap: people foolishly exposing themselves to too much AI-generated super-stimulus and getting their brains fried. I think many more people will be exposed to various kinds of addictive AI-generated content than will have religious communities strong enough to create an isolationist bubble.
I think it’s plausible that the people who expose themselves to all the addictive stuff on the AI-internet will also sooner or later get captured by some isolationist bubble that keeps them locked away from the other competing memes: arguably that’s the only stable point. But I worry that these stable points will be worse than the Christian co-ops you describe.
I imagine an immortal man, in the year 3000, sitting at his computer, not having left his house or having talked to a human in almost a thousand years, talking with his GPT-5.5 based AI girlfriend and scrolling his personalized twitter feed, full of AI generated outrage stories rehashing the culture war fights of his youth. Outside his window, there is a giant billboard advertising “Come on, even if you want to fritter your life away, at least use our better products! At least upgrade your girlfriend to GPT-6!” But his AI girlfriend told him to shutter his window a thousand years ago, so the billboard is to no avail.
This is of course a somewhat exaggerated picture, but I really do believe that one-person isolation bubbles will be more common and more dystopian than the communal isolationism you describe.
in the year 3000, still teaching that the Earth is 6,000 years old
No, it will be 7,000 years old by then.
On the other hand, there is another interesting factor in kings losing power that might be more related to what you are talking about (though I don’t think this factor is as important as the threat of revolutions discussed in the previous comment).
My understanding is that part of the story for why kings lost their power is that the majority of people were commoners, so the best writers, artists, and philosophers were commoners (or at least not the highest aristocrats), and the kings and the aristocrats read their work, and these writers often argued for giving more power to the people. The kings and aristocrats sometimes got sincerely convinced, and agreed to relinquish some powers even when it was not absolutely necessary for preempting revolutions.
I think this is somewhat analogous to the story of cultural AI dominance in Gradual Disempowerment: all the most engaging content creators are AIs, humans consume their content, the AIs argue for giving power to AIs, and the humans get convinced.
I agree this is a real danger, but I think there might be an important difference between the case of kings and the AI future.
The court of Louis XVI read Voltaire, but I think if there had been someone as witty as Voltaire who also flattered the aristocracy, they would plausibly have liked him more. But the pool of witty people was limited, and Voltaire was far wittier than any of the few pro-aristocrat humorists, so the royal court put up with Voltaire’s hostile opinions.
On the other hand, in a post-AGI future, I think it’s plausible that with a small fraction of the resources you can get close to saturating human engagement. Suppose pro-human groups fund 1% of the AIs generating content, and pro-AI groups fund 99%. (For the sake of argument, let’s grant the dubious assumption that the majority of the economy is controlled by AIs.) I think it’s still plausible that the two groups can generate approximately equally engaging content, and if humans find pro-human content more appealing, then that just wins out.
Also, I’m kind of an idealist, and I think part of the reason that Voltaire was successful is that he was just right about a lot of things, parliamentary government really leads to better outcomes than absolute monarchy from the perspective of a more-or-less shared human morality. So I have some hope (though definitely not certainty) that AI content creators competing in a free marketplace of ideas will only convince humanity to voluntarily relinquish power if relinquishing power is actually the right choice.
Interesting. My guess would have been the opposite. Ryan’s three posts all received around 150 karma and were generally well-received; I think a post like this would be considered a 90th-percentile success for a MATS project. But admittedly, I’m not very calibrated about current MATS projects. It’s also possible that Ryan has good enough intuitions to have picked two replications that were likely to yield interesting results, while a less skillfully chosen replication would be more likely to just show “yep, the phenomenon observed in the old paper is still true”. That would be less successful, but I don’t know how it would compare in terms of prestige to the usual MATS projects. (My wild guess is that it would still be around the median, but I really don’t know.)