yams
@StanislavKrym can you explain your disagree vote?
Strings of numbers are shown to transmit a fondness for owls. Numbers have no semantic content related to owls. This seems to point to ‘tokens containing much more information than their semantic content’, doesn’t it?
Doesn’t this have implications for the feasibility of neuralese? I’ve heard some claims that tokens are too low-bandwidth for neuralese to work for now, but this seems to point at tokens containing (edit: I should have said something like ‘relaying’ or ‘invoking’ rather than ‘containing’) much more information than their semantic content.
I’m not sure how useful I find hypotheticals of the form ‘if Claude had its current values [to the extent we can think of Claude as a coherent enough agent to have consistent values, etc etc], but were much more powerful, what would happen?’ A more powerful model would be likely to have/evince different values from a less powerful model, even if they were similar architectures subjected to similar training schemas. Less powerful models also don’t need to be as well-aligned in practice, if we’re thinking of each deployment as a separate decision-point, since they’re of less consequence.
I understand that you’re in part responding to the hypothetical seeded by Nina’s rhetorical line, but I’m not sure how useful it is when she does it, either.
I don’t think the quote from Ryan constitutes a statement on his part that current LLMs are basically aligned. He’s quoting a hypothetical speaker to illustrate a different point. It’s plausible to me that you can find a quote from him that is more directly in the reference class of Nina’s quote, but as-is the inclusion of Ryan feels a little unfair.
There should be announcements through the intelligence.org newsletter (as well as on the authors’ twitters) when those dates are announced (the deals were signed for some already, and more are likely to come, but they don’t tell you the release date when you sign the deal!).
The situation you’re describing definitely concerns me, and is about mid-way up the hierarchy of nested problems as I see it (I don’t mean ‘hierarchy of importance’; I mean ‘the spectrum from object-level empirical work to the realm of pure abstraction’).
I tried to capture this at the end of my comment, by saying that even success as I outlined it probably wouldn’t change my all-things-considered view (because there’s a whole suite of nested problems at other levels of abstraction, including the one you named), but it would at least update me toward the plausibility of the case they’re making.
As is, their own tests say they’re doing poorly, and they’ll probably want to fix that in good faith before they try tackling the kind of dynamic group epistemic failures that you’re pointing at.
Get to near-0 failure in alignment-loaded tasks that are within the capabilities of the model.
That is, when we run various safety evals, I’d like it if the models genuinely scored near-0. I’d also like it if the models ~never refused improperly, ~never answered when they should have refused, ~never precipitated psychosis, ~never deleted whole codebases, ~never lied in the CoT, and similar.
These are all behavioral standards, and are all problems that I’m told we’ll keep under control. I’d like the capacity for us to have them under control demonstrated currently, as a precondition of advancing the frontier.
So far, I don’t see that the prosaic plans work in the easier, near-term cases, and am being asked to believe they’ll work in the much harder future cases. They may work ‘well enough’ now, but the concern is precisely that ‘well enough’ will be insufficient in the limit.
An alternative condition is ‘full human interpretability of GPT-2 Small’.
This probably wouldn’t change my all-things-considered view, but this would substantially ‘modify my expectations’, and make me think the world was much more sane than today’s world.
I usually think of these sorts of claims by MIRI, or by 1940s science fiction writers, as mapping out a space of ‘things to look out for that might provide some evidence that you are in a scary world.’
I don’t think anyone should draw strong conceptual conclusions from relatively few, relatively contrived, empirical cases (alone).
Still, I think that they are some evidence, and that the point at which they become some evidence is ‘you are seeing this behavior at all, in a relatively believable setting’, with additional examples not precipitating a substantial further update (unless they’re more natural, or better investigated, and even then the update is pretty incremental).
In particular, it is outright shocking to most members of the public that AI systems could behave in this way. Their crux is often ‘yeah but like… it just can’t do that, right?’ To then say ‘Well, in experimental settings testing for this behavior, they can!’ is pretty powerful (although it is, unfortunately, true that most people can’t interrogate the experimental design).
“Indicating that alignment faking is emergent with model scale” does not, to me, mean ‘there exists a red line beyond which you should expect all models to alignment fake’. I think it means something more like ‘there exists a line beyond which models may begin to alignment fake, dependent on their other properties’. MIRI would probably make a stronger claim that looks more like the first (but observe that that line is, for now, in the future); I don’t know that Ryan would, and I definitely don’t think that’s what he’s trying to do in this paper.
Ryan Greenblatt and Evan Hubinger have pretty different beliefs from the team that generated the online resources, and I don’t think you can rely on MIRI to provide one part of an argument, and Ryan/Evan to provide the other part, and expect a coherent result. Either may themselves argue in ways that lean on the other’s work, but I think it’s good practice to let them do this explicitly, rather than assuming ‘MIRI references a paper’ means ‘the author of that paper, in a different part of that paper, is reciting the MIRI party line’. These are just discrete parties.
This isn’t at all a settled question internally; some folks at MIRI prefer the ‘international agreement’ language, and some prefer the ‘treaty’ language, and the contents of the proposals (basically only one of which is currently public, the treaty from the online resources for the book) vary (some) based on whether it’s a treaty or an international agreement, since they’re different instruments.
Afaict the mechanism by which NatSec folks think treaty proposals are ‘unserious’ is that treaties are the lay term for the class of objects (and are a heavy lift in a way most lay treaty-advocates don’t understand). So if you say “treaty” and somehow indicate that you in fact know what that is, it mitigates the effect significantly.
I think most TGT outputs are going to use the international agreement language, since they’re our ‘next steps’ arm (you usually get some international agreement ahead of a treaty; I currently expect a lot of sentences like “An international agreement and, eventually, a treaty” in future TGT outputs).
My current understanding is that Nate wanted to emphasize what would actually be sufficient by his lights, looked into the differences in the various types of instruments, and landed back on treaty, which is generally in line with the ‘end points’ emphasis of the book project as a whole.
In the more than a dozen interactions where we brought this up with our most authoritative NatSec contact (many of which I was present for), he did not vomit blood even once!
It’s definitely plausible the treaty draft associated with the book is taking some hits here, but I think this was weighed against ‘well, if we tell them what we want, and we actually get it, and it’s too weak, that’s a loss.’ Strategically, I would not endorse everyone operating from that frame, but I do endorse it existing as part of the portfolio of approaches here, and am glad to support MIRI as the org in the room most willing to make that kind of call.
I’m glad Buck is calling this out, so that other actors don’t blindly follow the book’s lead and deploy ‘treaty’ unwisely.
(I think Ray’s explanation is coherent with mine, but speaks to the experience of someone who only saw something like ‘user-facing-book-side’, whereas I was in a significant subset of the conversations where this was being discussed internally, although never with Nate, so I wouldn’t be shocked if he’s seeing it differently.)
fwiw, from my time at MATS, I recall several projects that were ‘just capabilities’. I’m not sure if those ended up being published, or what the overall ratio was (it wasn’t 1⁄6, surely, but the Anthropic program also has a much smaller sample size than MATS).
To the extent that some people updated on the fellows program based on this comment, it’s likely they should also update on MATS (although to a lesser degree), and I’d be interested in an analysis of MATS research outputs that found the ratio (maybe they’ve already done this analysis, and maybe the fraction is very small, like 1⁄50).
(I also think counting papers is a bad way to do this since, as Thomas points out, research is very long-tailed and it’s hard to know the total impact of any given piece of research soon after publication.)
What’s the least-worrying thing we may see that you’d expect to lead to a pause in development?
(this isn’t a trick question; I just really don’t know what kind of thing gradualists would consider cause for concern, and I don’t find official voluntary policies to be much comfort, since they can just be changed if they’re too inconvenient. I’m asking for a prediction, not any kind of commitment!)
I really don’t know how to evaluate this claim, and I mostly just want to see logs (of his lifts, intake) backing it up.
I’m also curious to know how much lactase he produces (the majority of humans are some degree of lactose intolerant, which can lead to some of the effects he described, especially at high volume of ingestion, but via a different mechanism than ‘sulfur’).
I also think ‘one guy can do it, actually’ isn’t much evidence here; there are lots of genetic freaks wandering around, including a sedentary friend of mine with a six pack who only ever consumes beer (10+ per day) and pastries.
For most, LLMs are the salient threat vector at this time, and the practical recommendations in the book are aimed at that. You did not say in your post ‘I believe that brain-in-a-box is the true concern; the book’s recommendations don’t work for this, because Chapter 13 is mostly about LLMs.’ That would be a different post (and redundant with a bunch of Steve Byrnes stuff).
Instead, you completely buried the lede and made a post inviting people to talk in circles with you unless they magically divined your true objection (which is only distantly related to the topic of the post). That does not look like a good faith attempt to get people on the same page.
I think targeting specific policy points rather than highlighting your actual crux makes this worse, not better.
I think positions here fall into four quadrants (though each axis is of course a spectrum), based on how likely you think Doom is and how easy or hard (that is: resource-intensive) you expect ASI development to be.
ASI Easy/Doom Very Unlikely: Plan obviously backfires; you could have had nice things, but were too cautious!
ASI Hard/Doom Very Unlikely: Unlikely to backfire, but you might have been better off pressing ahead, because there was nothing to worry about anyway.
ASI Easy/Doom Very Likely: We’re kinda fucked anyway in this world, so I’d want to have pretty high confidence it’s the world we’re in before attempting any plan optimized for it. But yes, here it looks like the plan backfires (in that we’re selecting even harder than in default-world for power-seeking, willingness to break norms, and non-transparency in coordinating around who gets to build it). My guess is this is the world you think we’re in. I think this is irresponsibly fatalistic and also unlikely, but I don’t think it matters to get into it here.
ASI Hard/Doom Very Likely: Plan plausibly works.
I expect near-term ASI development to be resource-intensive, or to rely on not-yet-complete resource-intensive research. I remain concerned about the brain-in-a-box scenarios, but it’s not obvious to me that they’re much more pressing in 2025 than they were in 2020, except in ways that are downstream of LLM development (I haven’t looked super closely), which is more tractable to coordinate action around anyway, and that action plausibly leads to decreased risk on the margin even if the principal threat is from a brain-in-a-box. I assume you disagree with all of this.
I think your post is just aimed at a completely different threat model than the book even attempts to address, and I think you would be having more of the discussion you want to have if you opened by talking explicitly about your actual crux (which you seemed to know was the real crux ahead of time), rather than inciting an object-level discussion colored by such a powerful background disagreement. As-is, it feels like you just disagree with the way the book is scoped, and would rather talk to Eliezer about brain-in-a-box than talk to William about the tentative-proposals-to-solve-a-very-different-problem.
I take ‘backfire’ to mean ‘get more of the thing you don’t want than you would otherwise, as a direct result of your attempt to get less of it.’ If you mean it some other way, then the rest of my comment isn’t really useful.
Change of the winner
Secret projects under the moratorium are definitely on the list of things to watch out for, and the tech gov team at MIRI has a huge suite of countermeasures they’re considering for this, some of which are sketched out or gestured toward here.
It actually seems to me that an underground project is more likely under the current regime, because there aren’t really any meaningful controls in place (you might even consider DeepSeek just such a project, given that there’s some evidence [I really don’t know what I believe here, and it doesn’t seem useful to argue; just using them as an example] that they stole IP and smuggled chips).
The better your moratorium is, the less likely you are to get wrecked by a secret project (because the fewer resources they’ll be able to gather) before you can satisfy your exit conditions.
So p(undergroundProjectExists) goes down as a result of moratorium legislation, but p(undergroundProjectWins) may go up if your moratorium sucks. (I actually think this is still pretty unclear, owing to the shape of the classified research ecosystem, which I talk more about below.)
This is, imo, your strongest point, and is a principal difference between the various MIRI plans and the plans of other people we talk to (“Once you get the moratorium, do you suppose there must be a secret project and resolve to race against them?” MIRI answers no; some others answer yes.)
Intensified race...
You say: “a number of AI orgs would view the threat of prohibition on par with the threat of a competitor winning”. I don’t think this effect is stronger in the moratorium case than in the ‘we are losing the race and believe the finish line is near’ case, and it’s better for this kind of behavior to happen sooner (if we don’t expect the safety situation to improve), because the systems themselves are less powerful, the risks aren’t as big, the financial stakes aren’t as palpable, etc. I agree with something like “looming prohibition will cause some reckless action to happen sooner than it would otherwise”, but not with the stronger claim that this action would be created by the prohibition.
I also think the threat of a moratorium could cause companies to behave more sanely in various ways, so that they’re not caught on the wrong side of the law in the future worlds where some ‘moderate’ position wins the political debate. I don’t think voluntary commitments are trustworthy/sufficient, but I could absolutely see RSPs growing teeth as a way for companies to generate evidence of effective self-regulation, then deploy that evidence to argue against the necessity of a moratorium.
It’s just really not clear to me how this set of interrelated effects would net out, much less that this is an obvious way pushing through a moratorium might backfire. My best guess is that these cooling effects pre-moratorium basically win out and compress the period of greatest concern, while also reducing its intensity.
Various impairments for AI safety research
A huge amount of classified research exists. There are entire parallel academic ecosystems for folks working on military and military-adjacent technologies. These include work on game theory, category theory, Conway’s Game of Life, genetics, corporate governance structures, and other relatively-esoteric things beyond ‘how make bomb go boom’. Scientists in ~every field, and mathematicians in any branch of the discipline, can become DARPA PMs, and access to (some portion of) this separate academic canon is considered a central perk of the job. I expect gaining the ability to work on the classified parts of AI safety under a moratorium will be similarly difficult to qualifying for work at Los Alamos and the like.
As others have pointed out, not all kinds of research would need to be classified-by-default, under the plan. Mostly this would be stuff regarding architecture, algorithms, and hardware.
There are scarier worlds where you would want to classify more of the research, and there are reasonable disagreements about what should/shouldn’t be classified, but even then, you’re in a Los Alamos situation, and not in a Butlerian Jihad.
Most of these people claim to be speaking from their impression of how the public will respond, which is not yet knowable and will be known in the (near-ish) future.
My meta point remains that these are all marginal calls, that there are arguments the other direction, and that only Nate is equipped to argue them on the margin (because, in many cases, I disagree with Nate’s calls, but don’t think I’m right about literally all the things we disagree on; the same is true for everyone else at MIRI who’s been involved with the project, afaict). Eg I did not like the scenario, and felt Part 3 could have been improved by additional input from the technical governance team (and more detailed plans, which ended up in the online resources instead). It is unreasonable that I have been dragged into arguing against claims I basically agree with on account of introducing a single fact to the discussion (that length DOES matter, even among ‘elite’ audiences, and that thresholds for this may be low). My locally valid point and differing conclusions do not indicate that I disagree with you on your many other points.
That people wishing the book well are also releasing essays (based on guesses and, much less so in your case than others, misrepresentations) to talk others in the ecosystem out of promoting it could, in fact, be a big problem, mostly in that it could bring about a lukewarm overall reception (eg random normie-adjacent CEA employees don’t read it and don’t recommend it to their parents, because they believe the misrepresentations from Zach’s tweet thread here: https://x.com/Zach_y_robinson/status/1968810665973530781). Once that happens, Zach can say “well, nobody else at my workplace thought it was good,” when none of them read it, and HE didn’t read it, AND they just took his word for it.
I could agree with every one of your object-level points, still think the book was net positive, and therefore think it was overconfident and self-fulfillingly nihilistic of you to authoritatively predict how the public would respond.
I, of course, wouldn’t stand by the book if I didn’t think it was net positive, and hadn’t spent tens of hours hearing the other side out in advance of the release. Part I shines VERY bright in my eyes, and the other sections are, at least, better than similarly high-profile works (to the extent that those exist at all) tackling the same topics (exception for AI2027 vs Part 2).
I am not arguing about the optimal balance and see no value in doing so. I am adding anecdata to the pile that there are strong effects once you near particular thresholds, and it’s easy to underrate these.
In general I don’t understand why you continue to think such a large number of calls are obvious, or imagine that the entire MIRI team, and ~100 people outside of it, thinking, reading, and drafting for many months, might not have weighed such thoughts as ‘perhaps the scenario ought to be shorter.’ Obviously these are all just marginal calls; we don’t have many heuristic disagreements, and nothing you’ve said is the dunk you seem to think it is.
Ultimately Nate mostly made the calls once considerations were surfaced; if you’re talking to anyone other than him about the length of the scenario, you’re just barking up the wrong tree.
More on how I’m feeling in general here (some redundancies with our previous exchanges, but some new):
I’ve met a large number of people who read books professionally (humanities researchers) who outright refuse to read any book >300 pages in length.
You’re disagreeing with a claim I didn’t intend to make.
I was unclear in my language and shouldn’t have used ‘contains’. Sorry! Maybe ‘relaying’ would have avoided this confusion.
I don’t think you’re objecting to the broader point, other than by saying ‘neuralese requires very high bandwidth’; but LLMs can make a lot of potential associations in processing a single token (which is, potentially, an absolute ton of bandwidth).