I expect this is true in the simple case of ‘you can sell a service to them’; the reason I used the Amazon example is that Amazon owns shares in Anthropic ($60B worth, according to a quick Claude check), which it purchased (at least in part) with compute credits; this is a much more intimate entanglement than Anthropic being yet-another-AWS-customer.
Again, I genuinely don’t know how this pans out, but the crux for me is not ‘can you sell a product to a company that’s a supply chain risk and keep your contracts’; it’s ‘can you [do all the things Amazon is doing with Anthropic, which are largely mutually conditioned on one another as part of complex agreements] with a supply chain risk and keep your contracts.’
Pete Hegseth just declared Anthropic a supply chain risk.
Under the most expansive naive interpretation of what this might mean, it seems plausible that this bottlenecks their compute access (e.g., through Amazon, which has government contracts).
Does anyone know more here / has anyone thought more about this? I’d be pretty surprised if the most expansive naive interpretation held, but it’s also probably not the most narrow one (e.g., can’t work with the defense industry, or on government-specific contracts in larger companies).
That so many companies (the majority of Fortune 500s, for instance) contract with the government seems to make this dramatically impactful (e.g., some double-digit percentage of B2B revenue; but is it 10 percent or 90? How much does it impact other aspects of their operation? I just don’t know).
Hmmm… I think I took Evan’s opening line “I don’t really understand what point you think my quote is making” very literally. On rereading, I think Evan does understand the point you think his quote is making, but disagrees that it makes that point (I haven’t spoken with Evan enough to know how literal he is).
(And I just agree with you that if someone uses the word ‘commit’, they shouldn’t be locally redefining it to mean ‘commit unless we change our mind’, particularly when responding to a question that opens with the phrase ‘actually commit’.)
Retracted!
Put all the weight on the word ‘commits’ in the exchange and I think you’ll see Oli’s read! It took me a second to see it, too.
curious as to the strong negative response here; usually when I get downvoted a lot I kind of expect it ahead of time, but this one was surprising!
If anyone has guesses as to why this was so unpopular, I’d be interested to hear them.
Edit: the above comment was at −12 within 15 minutes of posting. My best guess now is 1 or 2 strong downvotes? Currently at −2. (I don’t care terminally; I just really don’t know if the comment is bad vs somehow narrowly offensive)
Edit 2: swung up to +17 (not totally sure that’s the exact number), and now down to +2. Glad to be controversial, I suppose. Renewing my bid for anyone to tell me why this comment would be divisive. Genuinely confused.
I think I’ve seen you in two modes around your more controversial opinions:
1. Demurring / ‘not wanting to get into it’ / ‘expecting to be attacked’.
2. Aggressively shoehorning in political examples without making their relevance over less-charged examples clear.
If you continue to hold this set of beliefs, my hope is that you come to feel less persecuted, such that you can unselfconsciously weave the true-to-you version of them into conversation, without drawing too much or too little attention.
However, I don’t know if I’m capable of perceiving a third thing, given my beliefs. Like, maybe there is no ‘sweet spot’ and it will always seem to me that you’re either being an evasive crypto fascist or awkwardly insisting on centering a -phobic/-ist line of argument, because my brain is broken. Afaict, this is what you believe about (people like) me, and I’m not really sure how to rule it out.
David and Oli are your allies. They’re endeavoring to help you see yourself and the world more clearly. The tone of your replies here seems to indicate that you may have lost sight of that.
I agree that ignoring benchmarks is wrong (and think our views are kind of nearby one another in absolute terms). However, benchmarks remain pretty bad, labs continue to hill-climb on them (how much is unclear, but it happens), and the authors of the most celebrated benchmarks are extremely modest about how their results ought to be interpreted.
Benchmarks show that models are getting better; how fast and at what is still pretty ambiguous when you include these considerations, imo.
If there is, it’s not presented in that link, which is a recapitulation of prior facts from our discussion with the additional mention of character training in the Claude 3 era (for which only fragmentary public information is available; nothing like a full text which could be compared to the other constitutions).
I understand that Anthropic has characterized the difference between the ~2023 constitution and the late-2025 constitutions as a shift from rules to reasons, and I don’t dispute that this is a reasonable characterization of the delta between those two points. What’s less clear to me is whether this was a continuous change or a discrete one (which would be most strongly answered by access to 4.1’s constitution). As-is, I don’t think this is resolvable based on public information, and it’s an important crux for your other claims.
Afaik we only have two drafts of the constitution from the last ~3 years, and both are very recent: the Richard Weiss elicitation and the official publication a few weeks ago. It’s not clear to me when they transitioned from the very simple constitutions to the newer, more elaborate format, and your theory seems to hinge on this change being pretty recent, and pretty stark.
My impression up until now has been that the constitution has steadily grown. Amanda Askell has been working on it iteratively for several years, with each generation of Claude models trained on a different (on average, more elaborate) version. Do we have any sources attesting otherwise?
The use of a constitution has been part of Anthropic’s safety approach since at least late 2022 (and conversations with Anthropic staff in mid-2023 suggested that the models deployed at the time used this technique).
The language below suggests that you’re under the impression that only more recent models use a constitution:
But I think there’s actually more: On Vending-Bench, people have found that post-constitution Claude is much more dishonest and misaligned:
Pre-constitution Claude never lied about being an ‘exclusive supplier’ or lied about competitors’ pricing. Post-constitution, it screwed over customers and lied routinely.[6] I think this is because Claude was semantically primed
I haven’t thought about it enough to consider if your point still holds in the case that we’re just comparing different versions of the constitution, but it at least seems harder to make these claims if you don’t actually have separate drafts to compare (and I’d be a little surprised if honesty, one of their three pillars, were mentioned significantly less often in previous versions than in the more recent, public versions).
My best guess for this is something like: coding is both more routinized and more explicit. That is, the thinking and the tokens are synonymous. One can’t think the correct code and be unable to actually implement it, if one is literally thinking about the exact text of the code. Whereas one can think the correct text in natural language and not implement the correct idea.
Another way of saying this: in natural language, there are more reasoning steps in excess of the tokens implicitly required by the most bare-bones version of the reasoning process.
Probably you’ve already thought of this and thought of a reason that it’s wrong.
In the first you mention having a strong shared ontology (for thinking about AI) and, iirc, register a kind of surprise that others don’t share it. I think it would be cool if you could talk about that ontology more directly, and try to hold at that level of abstraction for a prolonged stretch (rather than invoking it in shorthand when it’s load-bearing and quickly moving along, which is a reasonable default, but not maximally edifying).
I think you gamed it a bit by packing a lot into each point (and to someone who doesn’t know the arguments already — or even someone who does — the groupings feel pretty unnatural).
I remember Golden Gate Claude said something about being aware that its internals were tampered with, or that ‘something wasn’t right’.
I’m concerned that using vectors to mute the model’s ability to tell it’s in an eval may itself tip the model off, and the Golden Gate example seems to indicate this is fairly likely, even in the near term.
Edit: there’s been more work on this question which I’d missed previously (and which does not offer comfort!)
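For concreteness, here’s a toy sketch of the kind of intervention I have in mind (PyTorch, with a plain linear layer standing in for a transformer block; the ‘eval-awareness’ direction, coefficient, and module are all made up for illustration, not from any actual paper):

```python
# Toy sketch of activation steering: shift a layer's activations away from a
# chosen direction via a forward hook. Everything here is a placeholder; in
# real work the direction would come from e.g. a sparse autoencoder feature
# or a contrast between prompts.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 16
block = nn.Linear(d_model, d_model)    # stand-in for one transformer block
eval_awareness = torch.randn(d_model)  # hypothetical 'I am being evaluated' direction
eval_awareness = eval_awareness / eval_awareness.norm()
alpha = 4.0                            # steering strength

def suppress(module, inputs, output):
    # Subtract the direction from the activations. My worry: this constant
    # offset is itself the kind of anomaly a model might notice, a la Golden
    # Gate Claude reporting that 'something wasn't right' internally.
    return output - alpha * eval_awareness

handle = block.register_forward_hook(suppress)
steered = block(torch.randn(1, d_model))  # activations now carry the offset
handle.remove()
```

The offset is applied uniformly on every forward pass, which is exactly why I’d expect a sufficiently introspective model to be able to detect it.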
I don’t hold these objections, and I don’t think anyone reasonable does, especially with the “never” in them. At best I could argue that humans aren’t actually great at pursuing goals robustly, and therefore the AI might also not be.
The Problem is intended for a general audience (i.e., not LW users). I assure you people make precisely these objections, very often.
I’m recognizing a lot of terms you’re using but there seems to be a supposition of my model that’s very different from my actual model, to such an extent that I actually can’t decode it. My best guess is that the productive thing is to zoom out and clarify my Actual Position in more detail, instead of arguing single points (which will then lead you to make other assumptions that don’t quite square with my actual model, which is the big failure mode I’m trying to avoid here). To the extent that your aim is to better understand my model (which is very nearby, but not synonymous with, the models of other MIRI staff), this looks like the best path forward to me. Hopefully along the way we locate some cruxes (and I’d like it if you also helped out in guessing the root of our disagreement/misunderstanding).
At a high level, I don’t think it’s particularly useful to talk about ‘alignment’ with respect to smaller and more specialized systems, since it introduces the potential for conflating qualitatively distinct cases.
For any system with a very small number of outputs (e.g. [chess piece] to [board position] small, and plausibly several OOMs larger than that in absolute number), it is trivially easy to verify the safety of the range of available outputs, since you can simply generate them all in advance, check whether they’re liable to kill all humans, and move on. A key reason that I think alignment is hard is that the range of outputs of properly general systems is so large that all possible outputs in all possible deployment settings can’t possibly be hand-verified. There are reasons to think you can verify groups or categories of outputs bundled together, but so far verifying the safety of the space of all possible outputs of powerful systems is not doable (my impression is that RAT and LAT were gestures in this direction, and that some of the ELK stuff is inspired by related concerns, but I’m no authority!).
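To make the ‘small action space’ point concrete, here’s a toy sketch (Python; is_safe is a hypothetical, hand-auditable predicate, not a real checker):

```python
# Toy illustration: when the output space is [chess piece] x [board square]
# small, you can enumerate every possible output and verify each one before
# deployment. `is_safe` is a placeholder for whatever criterion you care about.
from itertools import product

PIECES = ["K", "Q", "R", "B", "N", "P"]
SQUARES = [f + r for f in "abcdefgh" for r in "12345678"]

def is_safe(piece: str, square: str) -> bool:
    return True  # placeholder: a human-checkable safety predicate goes here

outputs = list(product(PIECES, SQUARES))       # 6 * 64 = 384 possible outputs
assert all(is_safe(p, s) for p, s in outputs)  # verify the entire range up front
print(f"verified all {len(outputs)} possible outputs")
```

The enumeration step is the whole point; nothing like it scales to a properly general system, which is exactly the asymmetry I’m pointing at.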
I think this explanation constitutes me answering in the affirmative to (the spirit of) your first question. Let me know if that seems right.
For the theorem-prover case, the details of the implementation really matter quite a lot, and so the question is underspecified from my perspective. (My guess is the vast majority of theorem-provers are basically safe, even under large optimization pressure, but I haven’t looked into it and invite someone who’s thought more about that case to chime in; there are definitely approaches I could imagine for creating a theorem prover that might risk dangerous generalization, I’m just not sure how central those approaches are in practice.)
And the only reason AIs will be goal-oriented to a dangerous extent, is because people will intentionally make them like that, despite obvious risks?
This is not my position, and is not the position of anyone at MIRI who I’ve spoken with on the topic.
When I was reading “The Problem”, I was sure that goal-oriented AI was seen as inevitable for some reason deeper than “Goal-oriented behavior is economically useful”.
This is a correct reading, and I don’t understand what in my initial reply gave you some other impression. My best guess is that you’ve conflated generality with goal-pursuing. Chess systems are safe because they’re not very general (i.e., they have a very small action space), not because they aren’t pursuing goals.
I’d still like to argue that “goal-oriented” is not a simple concept,
Agreed.
and it’s not trivial to produce a goal-oriented agent even if you try
In conversations I’ve seen about this, we usually talk about how ‘coherent’ an agent is, as a way of describing how robustly it pursues its objective, whatever that objective may be. If what you mean is something like “contemporary general systems do not pursue their goals especially robustly, and it may be hard to make improvement on this axis,” I agree.
not all useful agents are goal-oriented.
I think I disagree here, but I don’t know how to frame that disagreement without more detail. Feel free to offer it if you feel there’s more to talk about, and I’ll do my best to continue engaging. I want to acknowledge that I don’t really understand what you mean by goal-oriented, and how it might differ from my conception, which I’m hesitant to elaborate on for now, in the spirit of avoiding further confusion.
My best guess is that you thought the chess example was attempting to illustrate more things than it actually is. The chess example, as I recall, is a response to two common objections:
1. “AIs will never be better than humans at cognitive tasks.” Chess is a cognitive task where the AIs ~always beat the humans.
2. “AIs will never be able to pursue goals robustly over long time horizons.” Chess has many turns, and AIs seem to coherently pursue a single end (winning) over the course of even very long games (in clock time).
These are just existence proofs; if the AI can perform with superhuman competence at a game with n variables (like chess), then it seems plausible that AIs could eventually, in principle, perform with superhuman competence at a game with 2n, 100n, or 1e80 variables.
And I do personally believe, that EY and many others believe, that with enough optimization, even a chess bot should become dangerous. Not sure if there is any evidence for that belief.
I work at MIRI, worked on The Problem, and have never heard anyone express this belief.[1] Brendan is correct about the intention of that passage.
There is no way to make a training environment as complex as the real world.
It’s unclear that this is needed; e.g., the AI2027 story where you train coders that help you train scientists that help you build ASI.
Still, virtual environments for RL are a huge market right now; people are, indeed, currently trying a more modest version of this thing you claim is impossible. Of course, these aren’t literally ‘as complex as the real world’, but it’s not clear what fidelity you’d need to reach particular capability thresholds. Iirc this is the importance of work on, e.g., multi-level world models and Markov blankets: better understanding what fidelity you need in what portions of your conception of the world in order to meet a given end.
[1] If someone were to chime in and say they believe this, my guess is that they’d get there by abusing the category ‘chess bot’; e.g., ChatGPT is kind of a chess bot in that it’s a bot that can play chess, even though it’s the product of a very different training regime than one would ever sensibly use to create a chess bot on purpose.
I’m not trying to silence anything. I have really just requested ~1 hour of effort (and named it as that previously).
You’re hyperbolizing my gestures and making selective calls for rigor.
Meta: I hope to follow a policy of mostly ignoring you in the future, in this thread and elsewhere. I suggest allocating your energy elsewhere.
afaict getting up to date on the cyborgism-adjacent discourse is something you (mostly) do by talking to people in person, rather than by reading things on the internet.
(I also wish there were a more convenient way to get up to speed.)