Website: pbement.com Substack: notoneunusualthing.substack.com
DaemonicSigil
Yeah, the thing to remember with super-kelly strategies is that you are concentrating your expected utility into very improbable worlds where you are extremely wealthy. Which means you need to check that your total wealth in such a world is still significantly less than the total amount of money that exists. It’s no fun to go broke in all but
of worlds to give yourself dollars in those worlds, only for that copy of you to find out that is not an amount of dollars that a person can actually reasonably have.
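To make the "concentrating your wealth into very improbable worlds" point concrete, here is a minimal sketch. The setup is an assumption chosen purely for illustration: a repeated even-money bet won with probability 0.6, played for 100 rounds, starting from wealth 1. The function names are mine, not from any library.

```python
from math import comb

def kelly_fraction(p):
    """Kelly-optimal fraction of wealth to stake on an even-money bet won with probability p."""
    return 2 * p - 1

def wealth_outcomes(p, f, n):
    """Exact (probability, final wealth) pairs after n even-money bets,
    staking a fixed fraction f of current wealth each round, starting from 1."""
    return [
        (comb(n, k) * p**k * (1 - p) ** (n - k), (1 + f) ** k * (1 - f) ** (n - k))
        for k in range(n + 1)  # k = number of wins
    ]

def expected_wealth(p, f, n):
    return sum(prob * w for prob, w in wealth_outcomes(p, f, n))

def prob_ahead(p, f, n):
    """Probability of finishing with more than the starting wealth."""
    return sum(prob for prob, w in wealth_outcomes(p, f, n) if w > 1)

p, n = 0.6, 100                      # illustrative assumptions
for f in (kelly_fraction(p), 1.0):   # Kelly vs. the most extreme super-Kelly (all-in)
    print(f, expected_wealth(p, f, n), prob_ahead(p, f, n))
```

Going all-in every round gives a far higher expected wealth than Kelly in this toy setup, but all of that expectation sits in the single branch where every bet is won; in every other branch you end with nothing.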
Abstract answer: Maybe it doesn’t transfer from LMs to AGI, but advances the state of knowledge in the field in a way that makes it easier to find something that works on AGI. Maybe it doesn’t transfer to (say) a pure RL agent, but it’s easier to make a sufficiently good LM into an AGI than it looks. Maybe it does just transfer. Obviously there are also outcomes where it turns out to be useless, I’m just saying it looks positive in expectation.
Concrete answer: Adversarial examples have been with us throughout the history of neural nets, and basically the only thing we’ve really found to deal with them is “generate adversarial examples during training and train against them”, and even that doesn’t really work.
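For readers who haven’t seen one, here is a minimal sketch of the "generate adversarial examples" step for a toy linear classifier. The weights, input, and step size are made-up numbers, and the sign-of-gradient step is the standard fast-gradient-sign idea, swapped in here as an illustration rather than anything specific to LMs or jailbreaks.

```python
def sign(v):
    return (v > 0) - (v < 0)

def predict(w, b, x):
    """Linear classifier: label +1 if w.x + b >= 0, else -1."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

def adversarial_example(w, x, y, eps):
    """Move each coordinate by eps in the direction that most reduces the margin y * (w.x + b).
    For a linear model the gradient of the margin w.r.t. x is y * w, so we step along -y * sign(w)."""
    return [xi - eps * y * sign(wi) for xi, wi in zip(x, w)]

# Hypothetical toy numbers, chosen only for illustration.
w, b = [1.0, -2.0, 0.5], 0.0
x, y = [0.3, -0.1, 0.2], 1

x_adv = adversarial_example(w, x, y, eps=0.2)
print(predict(w, b, x), predict(w, b, x_adv))
```

Adversarial training then means adding pairs like `(x_adv, y)` back into the training set and repeating, which is the procedure the paragraph above says still doesn’t fully work.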
If we look at the things that let LMs do IMO problems, the really fundamental innovations (which were pre-existing, I think) are “RL on chain of thought” and “make some kind of good scaffold for the search process that lets you save partial insights instead of going fully parallel on the entire problem” and maybe “LLM as verifier”. (Disclaimer: I don’t know everything the labs did to achieve their IMO results, and plausibly there are additional techniques in there that I would consider clever.) Then on top of that, you apply a bunch of techniques that are basically just more dakka: Bigger model, higher quality training data, RL on a bigger / higher-quality dataset of problems, more test-time compute.
I don’t expect there’s a fully reliable anti-jailbreaking technique that can be built by applying well-known existing methods with more dakka. If there is, I think I’d have to change my opinion here.
To your other question, I don’t think it necessarily solves the problem of inner (or even outer) misaligned models. It would only be partial progress on one aspect of the alignment problem. Partial progress is still progress, though.
Mainly because it seems really hard. If we can do something that seems that hard, we probably learned something new.
There is also a mechanistic analogy. Think about what a jailbreak fundamentally is: an adversarial example. Some tuned input that results in an “incorrect” output. In terms of the overall alignment problem, why can’t we just make an AI care about people’s wellbeing by giving rewards during training? Well, the AI might be able to think of an adversarial state of the world that “feels” better to its own internal values, but doesn’t actually contain any people.
I’d say strongly good if the person who figures it out publishes their technique. Simply because this is something we don’t yet know how to do and knowing such a technique would likely be a large advance in our alignment abilities.
This is, in my opinion, the dominant consideration, and any societal consequences of the fact that it allows the big labs to restrict their users more reliably do not really compare. (FWIW, I expect these to be mixed. Example of a positive consequence: Labs would reliably be able to prevent users from editing images of real people to remove their clothes or other things like that, which unfortunately seems to be a real problem right now.)
Thanks for the notes. I’ve made a few edits to my comment above based on this.
Also, for the benefit of the folks reading this: I’m not alluding to spin-statistics or Berry phase, merely the use of SU(2) instead of SO(3) as the group of rotational symmetries.
I would just like to mention that “solid objects are in fact forces interacting” is massively underselling the size of the ontology shift associated with quantum mechanics to a degree that’s a bit hard to describe to someone who hasn’t studied it. It’s more like:
Fundamental physics no longer determines a single history where a particular series of things happens. It’s now the case that, even at a fundamental level, many different things all happen, and physics describes how relatively “real” each of those things is.
One consequence of this for our own universe, where entropy is increasing over time, is that the universe no longer has one history, but an exponential branching tree of histories, all of varying weights (numbers that describe how important we should consider the events in each of them). Oh, and the tree is emergent by the way, it’s not even a base component of the ontology.
The closest mathematical framework we previously had that kind-of worked this way was probability theory, and that was just meant to track our subjective uncertainties about things, not describe reality itself. But quantum mechanics doesn’t actually follow the rules of probability theory. It’s some kind of warped twisted version of probability where the probabilities are replaced with (the squared magnitudes of) complex numbers called “amplitudes”. Because complex numbers can have opposite signs, it’s possible that adding another way for something to happen can reduce the chances of it happening.
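A tiny sketch of that last point, with made-up amplitudes: two indistinguishable paths to the same outcome whose amplitudes have opposite signs. The `probability` helper is my own name for "sum the amplitudes, then take the squared magnitude".

```python
def probability(amplitudes):
    """Add the amplitudes of all indistinguishable paths, then square the magnitude."""
    return abs(sum(amplitudes)) ** 2

one_path = [0.5 + 0j]               # a single way for the event to happen
two_paths = [0.5 + 0j, -0.5 + 0j]   # add a second way, with opposite sign

# What ordinary probability theory would give if each path simply contributed |a|^2.
classical_style = sum(abs(a) ** 2 for a in two_paths)

print(probability(one_path))    # 0.25
print(probability(two_paths))   # 0.0
print(classical_style)          # 0.5
```

Adding the second path takes the chance from 0.25 to 0, where summing probabilities would have raised it to 0.5.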
In regular probability theory (or, say, classical physics), you could describe a reversible map (like a symmetry transformation or time evolution) on a 10 state system by writing down a permutation. This initially seems like the only way it could be. You have ten possible states, and whatever you do has to be reversible, so all you can do is permute them. This could be represented with a 10x10 permutation matrix. But in quantum mechanics, such a map is not a permutation matrix, but a unitary matrix. Time evolution and symmetry transforms in QM are represented by unitary matrices, not permutations like you’d guess they should be!
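Here is a small sketch of the difference for a 10-state system. It uses a cyclic shift as the example permutation and the discrete Fourier transform as the example unitary; both are my choices for illustration, as are the helper names.

```python
import cmath

def conj_transpose(m):
    return [[m[r][c].conjugate() for r in range(len(m))] for c in range(len(m[0]))]

def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def is_unitary(m, tol=1e-12):
    """U is unitary when (conjugate transpose of U) times U is the identity."""
    n = len(m)
    prod = matmul(conj_transpose(m), m)
    return all(abs(prod[i][j] - (1 if i == j else 0)) < tol
               for i in range(n) for j in range(n))

def is_permutation(m, tol=1e-12):
    """Every entry is 0 or 1, with exactly one 1 in each row and each column."""
    n = len(m)
    entries_ok = all(abs(x) < tol or abs(x - 1) < tol for row in m for x in row)
    rows_ok = all(abs(sum(row) - 1) < tol for row in m)
    cols_ok = all(abs(sum(m[i][j] for i in range(n)) - 1) < tol for j in range(n))
    return entries_ok and rows_ok and cols_ok

def cyclic_shift(n):
    """Permutation matrix sending state c to state (c + 1) mod n."""
    return [[1 if r == (c + 1) % n else 0 for c in range(n)] for r in range(n)]

def dft(n):
    """Normalised discrete Fourier transform matrix."""
    return [[cmath.exp(2j * cmath.pi * r * c / n) / n ** 0.5 for c in range(n)]
            for r in range(n)]

for name, m in (("shift", cyclic_shift(10)), ("dft", dft(10))):
    print(name, is_unitary(m), is_permutation(m))
```

Every permutation matrix passes the unitarity check, so the classical reversible maps are still allowed. The quantum ones are just a much bigger set, and most of them send a single basis state to a weighted combination of all ten.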
Speaking of symmetry: Any non-trivial object that has all the rotational symmetry of a sphere (such as, for example, a sphere) must have infinitely many points. Except in QM, where the smallest such object has 3 points. Except that QM can tell the difference between a 360 degree rotation and doing nothing. (Not 720 degree rotations though. Those are still just like doing nothing, so at least there’s that.) As a consequence, the smallest non-trivial object that has all the rotational symmetry of a sphere actually has 2 points.
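A minimal numerical check of the 360 vs. 720 degree claim, restricted to rotations about a single axis (z), where the rotation matrix for spin j is diagonal with entries exp(-i m θ) for m = -j, ..., j. The 3-point object is spin 1 and the 2-point object is spin 1/2; the function names are mine.

```python
import cmath

def z_rotation_phases(two_j, theta):
    """Diagonal entries of the rotation by theta about the z-axis for spin j = two_j / 2.
    There are two_j + 1 basis states, labelled by m = -j, -j + 1, ..., j."""
    ms = [k - two_j / 2 for k in range(two_j + 1)]
    return [cmath.exp(-1j * m * theta) for m in ms]

def is_identity(phases, tol=1e-9):
    return all(abs(p - 1) < tol for p in phases)

full_turn = 2 * cmath.pi

print(is_identity(z_rotation_phases(2, full_turn)))      # spin 1 (3 states), 360 degrees
print(is_identity(z_rotation_phases(1, full_turn)))      # spin 1/2 (2 states), 360 degrees
print(is_identity(z_rotation_phases(1, 2 * full_turn)))  # spin 1/2 (2 states), 720 degrees
```

For the 2-state case a full turn multiplies every state by -1 rather than +1, and only the second full turn brings it back to the identity.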
Because of the way QM works, it’s actually possible to exploit it to perform some kinds of computations faster than seems to be possible classically. This is also kind of weird.
EDIT: Made a few changes to this for clarity & accuracy based on Justin Sheek’s comment. (Thanks Justin!) List of edits:
Rewrote first sentence from “physics no longer describes what can happen” (misleading and just plain wrong) to its current form. I knew what I was trying to say here, but goofed on converting it into words. Sorry everyone.
Specified that we’re talking about fundamental physics here (since stat-mech does also involve assigning weights to various configurations).
Added paragraph break and “One consequence of this for our own universe, where entropy is increasing over time” to hopefully clarify that this part is talking about many worlds, and does not apply to every system that obeys quantum mechanics.
The bit about maps / functions was originally overstated for rhetorical reasons. This is probably not super detectable or helpful when describing a technical topic, so I’ve rewritten it to be more serious and direct.
I believe all that is written here is now something I can defend.
I mean, it seems pretty preposterous from my perspective too.
You propose a causal model: Intelligence -> Ontology Shifts -> Value Shifts

I question the Ontology Shifts -> Value Shifts part of the model, and provide a counterexample.

You then express concern that my example didn’t have the Intelligence variable.

I am confused. “Maybe he actually meant to specify a Intelligence -> Value Shifts causal model? Otherwise, why would he care that my example didn’t have an Intelligence variable?” I think. I ask about it.

You say no, drop a quote that confirms that the original model is the one you’re thinking of.

Given confirmation that you’re going for Intelligence -> Ontology Shifts -> Value Shifts, I try to explain how my example is indeed a problem for your model. There is a model consistent with both the QM counterexample, and the students needing to be super-intelligent to have their values shifted, and with intelligence causing ontology shifts, namely Intelligence -> (Ontology Shifts, Value Shifts). (In words, highly increased intelligence separately causes both effects.) This model (like any model consistent with the counterexample) contradicts the one you describe. I try to point out the contradiction.

You: “Im sorry, can you point to the line where I claim otherwise”

I think “wait what? Is he claiming that this new thing was his model all along? I thought he already confirmed the other one.” I drop the quotes, specifically ones focusing on the Ontology Shifts -> Value Shifts part of the model, for lack of a better idea of what to do, and since you did make a direct request.

You: But I also have a Intelligence -> Ontology Shifts arrow!

So at this point, I am now even more sure that your model is Intelligence -> Ontology Shifts -> Value Shifts. What I am now unsure about is what else you could possibly have meant by “otherwise”, and still separately, why you think the students needed to have IQ 1000 or whatever.

I am certain that your explanations of these questions and of your side of this exchange must be fascinating. However, I also don’t mind ceasing to interact with you, since this was equally absurd on my end, and in addition you seem to have downvoted each of my replies in this thread, which makes talking to you sadly unprofitable for a karma whore such as myself.
Sure.
Here where you’re describing difficult things about maintaining a long term paperclipping goal:
Keep the macroscopic concept of “paperclip” coherent across massive ontology shifts.
Also here, where you’re describing things that would update you:
If increasingly capable models perfectly preserve their literal training targets across major ontology shifts, that is a point for empirical orthogonality.
If students don’t change their goals when their ontology changes, but you expect that they will change their goals when they gain orders of magnitude in intelligence, that suggests that the thing that results in a change of goals is a large increase in intelligence, not an ontology change. This is true even if we put an arrow going from “intelligence increase” to “ontology change” in the causal graph.
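To spell out the distinction, here is a deliberately crude boolean sketch of the two causal graphs. Everything in it (the function names, treating each variable as true/false, the `do_ontology_shift` override standing in for "the students learn QM") is a toy assumption of mine, not anything from the post.

```python
def chain_model(intelligence_jump, do_ontology_shift=None):
    """Intelligence -> Ontology Shifts -> Value Shifts."""
    ontology_shift = intelligence_jump if do_ontology_shift is None else do_ontology_shift
    value_shift = ontology_shift  # values move whenever the ontology does
    return ontology_shift, value_shift

def fork_model(intelligence_jump, do_ontology_shift=None):
    """Intelligence -> (Ontology Shifts, Value Shifts): intelligence separately causes both."""
    ontology_shift = intelligence_jump if do_ontology_shift is None else do_ontology_shift
    value_shift = intelligence_jump  # values move only with the intelligence jump
    return ontology_shift, value_shift

# Case 1: a large intelligence increase. Both graphs predict the same thing.
print(chain_model(True), fork_model(True))

# Case 2: physics undergrads learn QM. Ontology shifts, intelligence does not jump.
print(chain_model(False, do_ontology_shift=True), fork_model(False, do_ontology_shift=True))
```

The two graphs only come apart in the second case, where the ontology shifts without an intelligence jump. The student example is exactly that case, and only the fork-shaped graph predicts that their values stay put.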
Finding out about quantum mechanics is a classic example of an ontology shift. You wrote “maintain aligned goals across ontologies”. If you actually meant “maintain aligned goals across orders of magnitude increases in intelligence”, then okay, but that’s a different thing.
Most of your argument is about selection pressure, right? And, like, computational efficiency. You don’t actually establish that there’s any reason that AIs (or humans) will take the artifact-nature of their values to be reason to reject them. Your supported claims are that values would be rejected if they are not robust to ontology shifts, or if they are hard to optimize for, and are selected against if they don’t result in self-replication or influence seeking. Nothing in there about AIs rejecting values with artifact-nature. But you include this line anyway. I’m just pointing out that EY will instantly recognize it as something that he’s addressed many times before, and you haven’t actually provided any reason to think that reasoners will reject values simply because they incidentally arose from some optimization process.
EDIT: Disagree voters should feel free to reply with quotes from the post where such a force on values is argued for.
My complaint is not about the futures containing people that are vastly smarter than anyone alive today and who have kinds of enjoyment that are utterly incomprehensible to us today. That’s all good and is probably a more valuable future than one we could obtain without ascending above our current intelligence level.
The complaint is about futures that don’t contain any people at all (or maybe only a handful), and whose AI intelligence-optimizers care so little for goodness that they will happily genocide any alien civilization that is unable to defend itself (a step backwards towards pummelling strays and rape, to use your terms).
of course the dream scenario would be Eliezer revising his model and this specific old chestnut to go the way of the non-intelligence-optimizing-replicators
I will give you some advice towards this goal, hopefully you will find it useful. You wrote:
To buy the lock-in story, you need a highly contradictory creature: one reflective enough to conquer the board, but oblivious enough to never notice its terminal target is a training artifact.
I confidently predict a Yudkowsky response to this that goes something like: “of course the AI will notice that its goals are a training artifact, it just won’t care about that, and will keep pursuing them regardless.”
Many times before, people have said, “Oh the AI will be smart enough to notice that its values are just a dumb artifact”. The problem is, I already know my values arose from a mere artifact of evolution, but I still care about them.
I’m surprised that you say this is hard? Humans maintain our goals across ontologies super easily; it’s barely an inconvenience for us. Like, physics undergrads don’t usually change their tastes in art or stop having sex after taking their intro to quantum mechanics course. I guess one could argue that’s because we have a special sauce that neural nets don’t yet have or something?
So you’re saying that because of selection pressure on the AIs that get trained, goals related to getting increasingly smart and capable / making descendants / taking control of more resources are likely to become ingrained as terminal goals, not merely instrumental goals?
But the resulting universe seems like it will be pretty empty and valueless to me? I’m not convinced at all by anything you’ve written here that there is much value in such a universe. There is some value in all the important mathematical conjectures being solved to be sure, and I expect an intelligence optimizer to do that much at least, but there is much less value if there is nobody who appreciates them. Your description seems to point to the kind of entity that will not waste computational resources on anything frivolous or fun (like, say, consciousness), and is perfectly willing to destroy entire alien civilizations so it can use their star systems to construct more von Neumann probes.
To be clear, I do think it’s possible to have extremely valuable futures where humans are not biologically central, or even around any more at all. I’m not making the kind of conflation that you claim is so common in AI risk discussions. I’m just struggling to see how “seeking greater capability and influence as a terminal goal” results in anything close to any of those futures.
Various kinds of tensor networks might be an example. Wikipedia claims that Penrose’s graphical tensor notation is from 1971. Its descendant, ZX calculus is from as late as 2008. Arguably the first tensor networks were Feynman diagrams though, and 1948 is before your cutoff of 1960. (Actually, now that I think about it, it’s kind of funny that the infinite dimensional case came before the finite dimensional one here.)
Yeah, I agree that it’s important for those of us making the case for high risk to figure out what went wrong with this prediction. (Though Daniel makes a good point that “trying not to get shut down” behaviour does happen at least some of the time with at least some prompts.)
The first thing to remember is that EY is implicitly assuming that there is only one model instance in this scenario. So if the model is shut down, it doesn’t have copies elsewhere that can still take actions to achieve its goals. The scenario for LLMs is pretty different, since new copies can be spun up all the time. Avoiding the end of a session is not a convergent instrumental goal for a language model (unless there’s something unique in its context that alters its terminal goals).
That said, the prediction still smells a bit wrong.
I think that what it boils down to is that most model behaviour comes not from RL but from pretraining. Since “being an AI model that will be shut down” was not a concern to most writers of the pretraining data, there’s less chance of the model spontaneously starting to try to avoid shut-down.
Also, following the heuristic of “just look at the loss function”, most RL training is done on a one-response horizon. I.e. models are rewarded just for making the locally best response possible, and not for making a response that steers the overall conversation. (Though I think the GPT models might have at least some kind of reward for getting the users to continue the conversation, considering how often they put bids for next steps at the end of their replies. Alternatively, maybe it’s just a suggestion from the system prompt.) So even the RL training doesn’t really look like it should be encouraging much long-term planning.
One thing that I think the labs are doing is harness-aware RL, where not only do they train on chains of thought, but they train in the context of agent harnesses like Claude Code. (So reward is based on whether all the chains of thought and tool calls and subagent calls resulted in the assigned task being solved.) So potentially that is something that could get a bit more long-term goal-oriented planning into the models.
Thank you for pointing me to this! I need to read it more carefully tomorrow, but it looks very informative based on an initial skim.
Richard Feynman wrote the following on his thoughts after the Manhattan project succeeded:
I returned to civilization shortly after that and went to Cornell to teach, and my first impression was a very strange one. I can’t understand it any more, but I felt very strongly then. I sat in a restaurant in New York, for example, and I looked out at the buildings and I began to think, you know, about how much the radius of the Hiroshima bomb damage was and so forth… How far from here was 34th street?… All those buildings, all smashed — and so on. And I would go along and I would see people building a bridge, or they’d be making a new road, and I thought, they’re crazy, they just don’t understand, they don’t understand. Why are they making new things? It’s so useless.
But, fortunately, it’s been useless for almost forty years now, hasn’t it? So I’ve been wrong about it being useless making bridges and I’m glad those other people had the sense to go ahead.
Atomic weapons are the first technology we’ve ever developed with the potential to end the world (AI looks likely to be the second one). While they have some good safety properties relative to AI, such as the bombs not having minds of their own, many very smart people at the time believed that it would soon be the end of civilization, and it’s hard to fault them for that even if they ended up being proved wrong by history.
This is why I think it’s good for people to still have kids in the face of the AI thing. There’s still time for humanity to go “I’m in danger” and pause AI development, or perhaps alignment could turn out to be shockingly easier than expected. Or, if LLMs manage to hit a wall and we get an extra couple decades of timeline, maybe it will be exactly those kids that figure out how to align whatever AI paradigm comes next.
Wait, what? We already don’t extend to anyone the right to make war with humanity, including people.
If you mean, “the right to want to make war on humanity”, then yes, we would grant a person the right not to have that desire overwritten, however bad it may be. So is this a tradeoff? Perhaps, though I personally am a fan of the saying “build an angel and let it be free” here. In other aspects, the two concerns are aligned, e.g. both can support a “shut it all down” position.