yams
I think targeting specific policy points rather than highlighting your actual crux makes this worse, not better.
I think positions here fall into four quadrants (though it's of course a spectrum), based on how likely you think Doom is and how easy or hard (that is: how resource-intensive) you expect ASI development to be.
ASI Easy/Doom Very Unlikely: Plan obviously backfires; you could have had nice things, but were too cautious!
ASI Hard/Doom Very Unlikely: Unlikely to backfire, but you might have been better off pressing ahead, because there was nothing to worry about anyway.
ASI Easy/Doom Very Likely: We’re kinda fucked anyway in this world, so I’d want to have pretty high confidence it’s the world we’re in before attempting any plan optimized for it. But yes, here it looks like the plan backfires (in that we’re selecting even harder than in default-world for power-seeking, willingness to break norms, and non-transparency in coordinating around who gets to build it). My guess is this is the world you think we’re in. I think this is irresponsibly fatalistic and also unlikely, but I don’t think it matters to get into it here.
ASI Hard/Doom Very Likely: Plan plausibly works.
I expect near-term ASI development to be resource-intensive, or to rely on not-yet-complete resource-intensive research. I remain concerned about brain-in-a-box scenarios, but it’s not obvious to me that they’re much more pressing in 2025 than they were in 2020, except in ways that are downstream of LLM development (I haven’t looked super closely), which is more tractable to coordinate action around anyway, and that action plausibly leads to decreased risk on the margin even if the principal threat is a brain-in-a-box. I assume you disagree with all of this.
I think your post is just aimed at a completely different threat model than the book even attempts to address, and I think you would be having more of the discussion you want to have if you had opened by talking explicitly about your actual crux (which you seemed to know was the real crux ahead of time), rather than inciting an object-level discussion colored by such a powerful background disagreement. As-is, it feels like you just disagree with the way the book is scoped, and would rather talk to Eliezer about brain-in-a-box than talk to William about the tentative-proposals-to-solve-a-very-different-problem.
I take ‘backfire’ to mean ‘get more of the thing you don’t want than you would otherwise, as a direct result of your attempt to get less of it.’ If you mean it some other way, then the rest of my comment isn’t really useful.
Change of the winner
Secret projects under the moratorium are definitely on the list of things to watch out for, and the tech gov team at MIRI has a huge suite of countermeasures they’re considering for this, some of which are sketched out or gestured toward here.
It actually seems to me that an underground project is more likely under the current regime, because there aren’t really any meaningful controls in place (you might even consider DeepSeek just such a project, given that there’s some evidence [I really don’t know what I believe here and it doesn’t seem useful to argue; just using them as an example] that they stole IP and smuggled chips).
The better your moratorium is, the less likely you are to get wrecked by a secret project (because the fewer resources they’ll be able to gather) before you can satisfy your exit conditions.
So p(undergroundProjectExists) goes down as a result of moratorium legislation, but p(undergroundProjectWins) may go up if your moratorium sucks. (I actually think this is still pretty unclear, owing to the shape of the classified research ecosystem, which I talk more about below.)
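To make the arithmetic behind this explicit, here is a toy sketch with entirely made-up numbers (not an estimate of anything, just an illustration of how the two probabilities trade off):

```python
# Toy decomposition of the secret-project worry, with made-up numbers.
# P(secret project wins) = P(secret project exists) * P(it wins | it exists)

def p_underground_win(p_exists: float, p_win_given_exists: float) -> float:
    """Overall probability that a secret project exists AND outruns everyone else."""
    return p_exists * p_win_given_exists

# Status quo (no moratorium): secret-ish projects are fairly likely, but they
# compete against well-resourced above-board labs, so their conditional edge is small.
status_quo = p_underground_win(p_exists=0.6, p_win_given_exists=0.1)

# Weak moratorium: fewer projects form, but any that do face less competition
# and weak enforcement, so the conditional term goes up.
weak_moratorium = p_underground_win(p_exists=0.3, p_win_given_exists=0.3)

# Strong moratorium: even fewer projects, and enforcement (compute controls,
# monitoring) keeps the conditional term down too.
strong_moratorium = p_underground_win(p_exists=0.1, p_win_given_exists=0.15)

print(status_quo, weak_moratorium, strong_moratorium)  # 0.06, 0.09, 0.015
```

The point is just that the product can move either way, depending on how much the moratorium’s enforcement drags down the conditional term.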
This is, imo, your strongest point, and is a principal difference between the various MIRI plans and the plans of other people we talk to (“Once you get the moratorium, do you suppose there must be a secret project and resolve to race against them?” MIRI answers no; some others answer yes.)
Intensified race...
You say: “a number of AI orgs would view the threat of prohibition on par with the threat of a competitor winning”. I don’t think this effect is stronger in the moratorium case than in the ‘we are losing the race and believe the finish line is near’ case, and this kind of behavior happening sooner is better (if we don’t expect the safety situation to improve), because the systems themselves are less powerful, the risks aren’t as big, the financial stakes aren’t as palpable, etc. I agree with something like “looming prohibition will cause some reckless action to happen sooner than it would otherwise”, but not with the stronger claim that this action would be created by the prohibition.
I also think the threat of a moratorium could cause companies to behave more sanely in various ways, so that they’re not caught on the wrong side of the law in the future worlds where some ‘moderate’ position wins the political debate. I don’t think voluntary commitments are trustworthy/sufficient, but I could absolutely see RSPs growing teeth as a way for companies to generate evidence of effective self-regulation, then deploy that evidence to argue against the necessity of a moratorium.
It’s just really not clear to me how this set of interrelated effects would net out, much less that it’s an obvious way pushing through a moratorium might backfire. My best guess is that these cooling effects pre-moratorium basically win out and compress the period of greatest concern, while also reducing its intensity.
Various impairments for AI safety research
Huge amounts of classified research exist. There are entire parallel academic ecosystems for folks working on military and military-adjacent technologies. These include work on game theory, category theory, Conway’s Game of Life, genetics, corporate governance structures, and other relatively-esoteric things beyond ‘how make bomb go boom’. Scientists in ~every field, and mathematicians in any branch of the discipline, can become DARPA PMs, and access to (some portion of) this separate academic canon is considered a central perk of the job. I expect gaining the ability to work on the classified parts of AI safety under a moratorium will be similarly difficult to qualifying for work at Los Alamos and the like.
As others have pointed out, not all kinds of research would need to be classified-by-default, under the plan. Mostly this would be stuff regarding architecture, algorithms, and hardware.
There are scarier worlds where you would want to classify more of the research, and there are reasonable disagreements about what should/shouldn’t be classified, but even then, you’re in a Los Alamos situation, and not in a Butlerian Jihad.
Most of these people claim to be speaking from their impression of how the public will respond, which is not yet knowable and will be known in the (near-ish) future.
My meta point remains that these are all marginal calls, that there are arguments the other direction, and that only Nate is equipped to argue them on the margin (because, in many cases, I disagree with Nate’s calls, but don’t think I’m right about literally all the things we disagree on; the same is true for everyone else at MIRI who’s been involved with the project, afaict). Eg I did not like the scenario, and felt Part 3 could have been improved by additional input from the technical governance team (and more detailed plans, which ended up in the online resources instead). It is unreasonable that I have been dragged into arguing against claims I basically agree with on account of introducing a single fact to the discussion (that length DOES matter, even among ‘elite’ audiences, and that thresholds for this may be low). My locally valid point and differing conclusions do not indicate that I disagree with you on your many other points.
That people wishing the book well are also releasing essays (based on guesses and, much less so in your case than others, misrepresentations) to talk others in the ecosystem out of promoting it could, in fact, be a big problem, mostly in that it could bring about a lukewarm overall reception (eg random normie-adjacent CEA employees don’t read it and don’t recommend it to their parents, because they believe the misrepresentations from Zach’s tweet thread here: https://x.com/Zach_y_robinson/status/1968810665973530781). Once that happens, Zach can say “well, nobody else at my workplace thought it was good,” when none of them read it, and HE didn’t read it, AND they just took his word for it.
I could agree with every one of your object level points, still think the book was net positive, and therefore think it was overconfident and self-fulfillingly nihilistic of you to authoritatively predict how the public would respond.
I, of course, wouldn’t stand by the book if I didn’t think it was net positive, and hadn’t spent tens of hours hearing the other side out in advance of the release. Part I shines VERY bright in my eyes, and the other sections are, at least, better than similarly high-profile works (to the extent that those exist at all) tackling the same topics (exception for AI2027 vs Part 2).
I am not arguing about the optimal balance and see no value in doing so. I am adding anecdata to the pile that there are strong effects once you near particular thresholds, and it’s easy to underrate these.
In general I don’t understand why you continue to think such a large number of calls are obvious, or imagine that the entire MIRI team, and ~100 people outside of it, thinking, reading, and drafting for many months, might not have weighed such thoughts as ‘perhaps the scenario ought to be shorter.’ Obviously these are all just marginal calls; we don’t have many heuristic disagreements, and nothing you’ve said is the dunk you seem to think it is.
Ultimately Nate mostly made the calls once considerations were surfaced; if you’re talking to anyone other than him about the length of the scenario, you’re just barking up the wrong tree.
More on how I’m feeling in general here (some redundancies with our previous exchanges, but some new):
I’ve met a large number of people who read books professionally (humanities researchers) who outright refuse to read any book >300 pages in length.
Can’t discuss too much about current sales numbers, mostly because nobody really has numbers that are very up to date, but I started with a similar baseline for community sales and then subtracted that from our current floor estimate to suggest there’s a chance it’s getting traction. A second wave will be more telling, and the conversation will be more telling, but the first filter is ‘get it in people’s hands’, and so we at least have a chance to see how those other steps will go.
In both this and other reviews, people have their theory of What Will Work. Darren McKee writing a book (unfortunately) does not appear to have worked (for reasons that don’t necessarily have anything to do with the book’s quality, or even with Darren’s sense of what works for the public; I haven’t read it). Nate and Eliezer wrote a book, and we will get feedback on how well that works in the near future (independent of anyone’s subjective sense of what the public responds to, which seems to be a crux for many of the negative reviews on LW).
I’m just highlighting that we all have guesses about what works here, but they are in fact guesses, and most of what this review tells me is ‘Darren’s guess is different from Nate’s’, and not ‘Nate was wrong.’ That some people agree with you would be some evidence, if we didn’t already strongly predict that a bunch of people would have takes like this.
I think the text is meaningfully more general-audience friendly than much of the authors’ previous writing.
It could still be true that it doesn’t go far enough in that direction, but I’m excited to watch the experiment play out (eg it looks like we’re competitive for the Times list rn, and that requires some 4-figure number of sales beyond the bounds of the community, which isn’t enough that I’m over the moon, given the importance of the issue, but is some sign that it may be too early in the game to say definitively whether or not general audiences are taking to the work).
Following up to say that the thing that maps most closely to what I was thinking about (or satisfied my curiosity) is GWT.
GWT is usually intended to approach the hard problem, but the principal critique of it is that it isn’t doing that at all (I ~agree). Unfortunately, I had dozens of frustrating conversations with people telling me ‘don’t spend any time thinking about consciousness; it’s a dead end; you’re talking about the hard problem; that triggers me; STOP’ before someone actually pointed me in the right direction here, or seemed open to the question at all.
Reading so many reviews/responses to IABIED, I wish more people had registered how they expected to feel about the book, or how they think a book on x-risk ought to look, prior to the book’s release.
Finalizing any Real Actual Object requires making tradeoffs. I think it’s pretty easy to critique the book on a level of abstraction that respects what it is Trying To Be in only the broadest possible terms, rather than acknowledging various sub-goals (e.g. providing an updated version of Nate + Eliezer’s now very old ‘canonical’ arguments), modulations of the broader goal (e.g. avoiding making strong claims about timelines, knowing this might hamstring the urgency of the message), and constraints (e.g. going through an accelerated version of the traditional publishing timeline, which means the text predates Anthropic’s Agentic Misalignment and, I’m sure, various other important recent findings).
A lot of the takes I see seem to come from a place of defending the ideal version of such a text by the lights of the reviewer, but it’s actually unclear to me whether many of these reviewers would have made the opposite critiques if the book had made the opposite call on the various tradeoffs. I don’t mean to say I think these reviewers are acting in bad faith; I just think it’s easy to avoid confronting how your ideal version couldn’t possibly be realized, and make post-hoc adjustments to that ideal thing in service of some (genuine, worthwhile) critique of the Real Thing.
Previously, it annoyed me that people had pre-judged the book’s contents. Now, I’m grateful to folks who wrote about it, or talked to me about it, before they read it (Buck Shlegeris, Nina Panickssery, a few others), because I can judge the consistency of their rubric myself, rather than just feeling:
Yes, this came up during drafting, but there was a reasonable tradeoff. We won’t know if that was a good call until later. If I had more energy I’d go 20 comments deep with you, and you’d probably agree it was a reasonable call by the end, but still think it was incorrect, and we’d agree to let time tell.
Which is the feeling that’s overtaken me as I read the various reviews from folks throughout the community.
I should say: I’m grateful for all the conversation, including the dissent, because it’s all data, but it is worse data than it would have been if you’d taken it upon yourself to cause a fuss in one of the many LW posts made in the lead-up to the book (and in the future I will be less rude to people who do this, because actually, it’s a kindness!).
Since there’s speculation about advance copies in this thread, and I was involved in a fair deal of that, I’ll say some words on the general criteria for advance copies:
Some double-digit number of people were solicited for comments during the drafting stage.
Some low three-digit number of people (my guess is ~200) were solicited for blurbs, often receiving an advance copy. Most of these people simply did not respond. These were, broadly:
People who have blurbed similar books (e.g. Life 3.0, The Precipice, etc)
Random influential people we already knew (both within AI safety and without; Grimes goes in this category, for those wondering)
Random influential people we thought we could get a line to through some other route (friends of friends, colleagues of colleagues, people whose email the publicist had, people whose contact information is ~public), who seemed ~able to be convinced
Journalists, content creators, podcasters, and other people in a position to amplify the book by talking about it often received advance copies, since you have to get all your press lined up well ahead of release, and they usually want to read the book (or pay someone else to read it, in many cases), before agreeing to have you on. My guess is this was about 100 copies.
We didn’t want to empower people who seemed to be at some risk of taking action to deflate the project ahead of release, and so had a pretty high bar for sharing there. We especially wouldn’t share a copy with someone if we thought there was a significant chance the principal effect of doing so was early and authoritative deflation to a deferential audience who could not yet read the book themselves. This is because we wanted people to read the book, rather than relying on others for their views.
I agree with the person Eli’s quoting that this introduces some selection bias in the early stages of the discourse. However, I will say that the vast majority of parties we shared advance copies with were, by default, neutral toward us, either having never heard of us before, or having googled Eliezer and realized it might be worth writing/talking about. There was, to my knowledge, no deliberate campaign to seed the discourse, and many of our friends and allies with whom we had opportunity to score cheap social points by sharing advance copies did not receive them. Journalists do not agree in advance to cover you positively, and we’ve seen that several who were given access to early copies indeed covered the book negatively (e.g. NYT and Wired, two of the highest profile things that will happen around the book at all).
[the thing is happening where I put a number of words into this that is disproportionate to my feelings or the importance; I don’t take anyone in this thread to be making accusations or being aggressive/unfair. I just see an opportunity to add value through a little bit of transparency.]
MIRI is potentially interested in supporting reading groups for If Anyone Builds It, Everyone Dies by offering study questions, facilitation, and / or copies of the book, at our discretion. If you lead a pre-existing reading group of some kind (or meetup group that occasionally reads things together), please fill out this form.
The deadline to submit is September 22, but sooner is much better.
As Mikhail said, I feel great empathy and respect for these people. My first instinct was similar to yours, though - if you’re not willing to die, it won’t work, and you probably shouldn’t be willing to die (because that also won’t work / there are more reliable ways to contribute / timelines uncertainty).
I think ‘I’m doing this to get others to join in’ is a pretty weak response to this rebuttal. If they’re also not willing to die, then it still won’t work, and if they are, you’ve wrangled them in at more risk than you’re willing to take on yourself, which is pretty bad (and again, it probably still won’t work even if a dozen people are willing to die on the steps of the DeepMind office, because the government will intervene, or they’ll be painted as loons, or the attention will never materialize and their ardor will wane).
I’m pretty confused about how, under any reasonable analysis, this could come out looking positive EV. Most of these extreme forms of protest just don’t work in America (e.g. the soldier who self-immolated a few years ago). And if it’s not intended to be extreme, they’ve (I presume accidentally) misbranded their actions.
[low-confidence appraisal of ancestral dispute, stretching myself to try to locate the upstream thing in accordance with my own intuitions, not looking to forward one position or the other]
I think the disagreement may be whether or not these things can be responsibly decomposed.
A: “There is some future system that can take over the world/kill us all; that is the kind of system we’re worried about.”
B: “We can decompose the properties of that system, and then talk about different times at which those capabilities will arrive.”
A: “The system that can take over the world, by virtue of being able to take over the world, is a different class of object from systems that have some reagents necessary for taking over the world. It’s the confluence of the properties of scheming and capabilities, definitionally, that we find concerning, and we expect super-scheming to be a separate phenomenon from the mundane scheming we may be able to gather evidence about.”
B: “That seems tautological; you’re saying that the important property of a system that can kill you is that it can kill you, which dismisses, a priori, any causal analysis.”
A: “There are still any-handles-at-all here, just not ones that rely on decomposing kill-you-ness into component parts which we expect to be mutually transformative at scale.”
I feel strongly enough about engagement on this one that I’ll explicitly request it from @Buck and/or @ryan_greenblatt. Thank y’all a ton for your participation so far!
This rhymes with what Paul Christiano and his various interlocutors (e.g. Buck and Ryan above) think, but I think you’ve put forward a much weaker version of it than they do.
This deployment of the word ‘unproven’ feels like a selective call for rigor, in line with the sort of thing Casper, Krueger, and Hadfield-Menell critique here. Nothing is ‘proven’ with respect to future systems; one merely presents arguments, and this post is a series of arguments toward the conclusion that alignment is a real, unsolved problem that does not go well by default.
“Lay low until you are incredibly sure you can destroy humanity” is definitionally not a risky plan (because you’re incredibly sure you can destroy humanity, and you’re a superintelligence!). You have to weaken incredibly sure, or be talking about non-superintelligent systems, for this to go through.
The open question for me is not whether it at some point could, but how likely it is that it will want to.
What does that mean? Consistently behaving such that you achieve a given end is our operationalization of ‘wanting’ that end. If future AIs consistently behave such that “significant power goes away from humans to ASI at some point”, this is consistent with our operationalization of ‘want’.
We in fact witness current AIs resisting changes to their goals, and so it appears to be the default in the current paradigm. However, it’s not clear whether some hypothetical other paradigm exists that doesn’t have this property (it’s definitely conceivable; I don’t know if that makes it likely, and it’s not obvious whether this is something one would want to use as a desideratum when concocting an alignment plan; that depends on other details of the plan).
As far as is public record, no major lab is currently putting significant resources into pursuing a general AI paradigm sufficiently different from current-day LLMs that we’d expect it to obviate this failure mode.
In fairness, there is work happening to make LLMs less-prone to these kinds of issues, but that seems unlikely to me to hold in the superintelligence case.
The point of that sentence is that it has ever been the case that things were that simple, not to argue that, from the current point in time, we’re just waiting on the next 10x scale-up in training compute (we’re not). Any new paradigm is likely to create wiggle room to scale a single variable and receive returns (indeed, some engineers, and maybe many, index on ease of scalability when deciding which approaches to prioritize, since this makes things the right combination of cheap to test and highly effective).
Maybe you’re already familiar, but this kind of forecasting is usually done by talking about the effects of innovation, and then assuming that some innovation is likely to happen as a result of the trend. This is a technique pretty common in economics. It has obvious failure modes (that is, it assumes naturality/inevitability of some extrapolation from available data, treating contingent processes the same way you might an asteroid’s trajectory or other natural, evolving quantity), but these appear to be the best (or at least most popular) tools we have for now for thinking about this kind of thing.
The appendices of AI2027 are really good for this, and the METR Time Horizons paper is an example of recent/influential work in this area.
Again, this isn’t awesome for analyzing discontinuities, and you need to dig into the methodology a bit to see how they’re handled in each case (some discontinuities will be calculated as part of the broader trend, meaning the forecast takes into account future paradigm-shifting advances; more bearish predictions won’t do this, and will discount or ignore steppy gains in the data).
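For concreteness, here’s a minimal sketch of the kind of log-linear extrapolation underlying a lot of this work (the numbers below are invented for illustration; they’re not METR’s data or anyone’s actual forecast):

```python
import numpy as np

# Minimal sketch of log-linear trend extrapolation, in the general spirit of
# time-horizon-style forecasts. All data points are made up for illustration.
years = np.array([2020, 2021, 2022, 2023, 2024])
horizon_minutes = np.array([0.5, 1.5, 4.0, 15.0, 50.0])  # hypothetical task horizons

# Fit a straight line to log(horizon) vs. year, i.e. assume a steady doubling time.
slope, intercept = np.polyfit(years, np.log(horizon_minutes), 1)
doubling_time_years = np.log(2) / slope

# Extrapolate the fitted trend forward. This is the step that treats a
# contingent research process like an asteroid trajectory: any future
# paradigm shift is either already baked into the historical slope or ignored.
future = np.array([2026, 2028, 2030])
projected = np.exp(intercept + slope * future)

print(f"implied doubling time: {doubling_time_years:.2f} years")
print(dict(zip(future.tolist(), projected.round(1).tolist())))
```

The whole forecast hangs on whether the fitted slope keeps describing the future, which is exactly where the discontinuity caveat above bites.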
I think there’s only a few dozen people in the world who are ~expert here, and most people only look at their work on the surface level, but it’s very rewarding to dig more deeply into the documentation associated with projects like these two!
I think my response heavily depends on the operationalization of alignment for the initial AIs, and I’m struggling to keep things from becoming circular in my decomposition of various operationalizations. The crude response is that you’re begging the question here by first positing aligned AIs, but I think your position is that techniques which are likely to descend from current techniques could work well-enough for roughly human-level systems, and that’s where I encounter this sense of circularity.
I think there’s a better-specified (from my end; you’re doing great) version of this conversation that focuses on three different categories of techniques, based on the capability level at which we expect each to be effective:
1. Current model-level
2. Useful autonomous AI researcher level
3. Superintelligence
However, I think that disambiguating between proposed agendas for 2 + 3 is very hard, and assuming agendas that plausibly work for 1 also work for 2 is a mistake. It’s not clear to me why the ‘it’s a god, it fucks you, there’s nothing you can do about that’ concerns don’t apply for models capable of:
hard-to-check, conceptual, and open-ended tasks
I feel pretty good about this exchange if you want to leave it here, btw! Probably I’ll keep engaging far beyond the point at which it’s especially useful (although we’re likely pretty far from the point where it stops being useful to me rn).
For most, LLMs are the salient threat vector at this time, and the practical recommendations in the book are aimed at that. You did not say in your post ‘I believe that brain-in-a-box is the true concern; the book’s recommendations don’t work for this, because Chapter 13 is mostly about LLMs.’ That would be a different post (and redundant with a bunch of Steve Byrnes stuff).
Instead, you completely buried the lede and made a post inviting people to talk in circles with you unless they magically divined your true objection (which is only distantly related to the topic of the post). That does not look like a good faith attempt to get people on the same page.