I operate by Crocker’s rules. All LLM output is explicitly designated as such. I have made no self-hiding agreements.
niplav
While reading that exact passage I had an idea which I don’t think will pan out but nevertheless has some cool confluence: use black holes as shielding for intergalactic probes to scoop up the plasma.
It seems relevant that this is not a verbatim Mechanize post but a rewrite of this post.
Andrew Critch used “prepotent AI” for this in ARCHES.
Chat.
True, I’ll amend it to note that this was simply me trying via prompting.
I wasn’t able to get Claude Sonnet 4.5 and Claude Opus 4.1 to produce the BIG-bench canary string.
Most posts I see, I see on LessWrong; I don’t subscribe to Substacks.
Follow-up to this experiment:
Starting 2025-06-13, I started flossing only the right side of my mouth (selected via random-number generator). On 2025-09-18 I went to the dentist and asked what side he guessed I’d flossed. He guessed right.
Nah, I don’t think so. Take the diamond maximizer problem—one problem is finding the function that physically maximizes diamond, e.g. as Julia code. The other one is getting your maximizer/neural network to reliably point to that function.
As for the “properly optimized human values”, yes. Our world looks quite DeepDream-dogs-like compared to the ancestral environment (and, now that I think of it, maybe the degrowth/retvrn/conservative people can be thought of as claiming that our world is already “human value slop” in a number of ways—if you take a look at YouTube Shorts and Times Square in New York, they’re not that different).
Sidenote to sidenote: “agile” also comes from agere, so the etymology is the same as for “agentic”. I often feel tempted to tell people they should become more agile.
@Eli Tyre My guess would be that functions with more inputs/degrees of freedom have more and “harsher” optima, i.e. global optima whose values exceed those of the local optima by more. This can be tested by picking random functions on ℝⁿ and testing different amounts of “optimization pressure” (ability to jump to maxima further away from the current local maximum) on sub-spaces of ℝⁿ. Is that what your confusion was about?
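A minimal sketch of the test I have in mind, with two assumptions of my own: “random function” means a sum of random Gaussian bumps, and “optimization pressure” means the maximum jump radius of a hill-climber confined to a sub-space of ℝⁿ:

```python
# Sketch: how good an optimum does a hill-climber find on a random bump
# function, as a function of (a) how many dimensions it may move in and
# (b) how far it may jump ("optimization pressure")?
import numpy as np

rng = np.random.default_rng(0)

def random_bump_function(n, n_bumps=50):
    """A random function on R^n: a sum of Gaussian bumps with random
    centers, heights, and widths."""
    centers = rng.normal(size=(n_bumps, n))
    heights = rng.exponential(size=n_bumps)
    widths = rng.uniform(0.5, 2.0, size=n_bumps)
    def f(x):
        sq_dists = ((centers - x) ** 2).sum(axis=1)
        return float((heights * np.exp(-sq_dists / widths)).sum())
    return f

def hill_climb(f, n, active_dims, jump_radius, steps=2000):
    """Greedy hill-climbing that only moves along the first `active_dims`
    coordinates, with proposal steps of scale `jump_radius`."""
    x = np.zeros(n)
    best = f(x)
    for _ in range(steps):
        proposal = x.copy()
        proposal[:active_dims] += rng.normal(scale=jump_radius, size=active_dims)
        value = f(proposal)
        if value > best:
            x, best = proposal, value
    return best

n = 20
f = random_bump_function(n)
for active_dims in (2, 5, 20):
    for jump_radius in (0.1, 1.0, 3.0):
        print(active_dims, jump_radius, round(hill_climb(f, n, active_dims, jump_radius), 3))
```

The comparison I’d look at is how the gap between what small and large jump radii find changes as active_dims grows.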
Hm, I was indeed being cartoonishly over-the-top. Sonnet 4.5 also pointed this out. I felt like I was at the same time making fun of the people making the accusation and the accused, but on reflection that’s not much better—I’m happy I didn’t name names. Might edit.
Edit: Added two hedging words in the first paragraph as a first measure.
Also:
- “instrumental convergence” → “convergent/robust instrumentality” (it’s not the convergence that’s instrumental, it’s the other way around)
- “AI labs” → “AI companies” (they’re not really laboratories, just companies, and possibly trying to hide behind the scientific veneer, h/t Gavin Leech)
- “open source model” → “open weight model” (since the source code for creating the models often isn’t available, and training is a lot like compiling)
Proposed changes for AI-related words:
- “hallucination” → “confabulation” (hallucinations are made-up inputs, confabulations are made-up outputs)
- “datacenter” → “compute center”/“compute cluster” (they mostly compute, and storing data is a necessary side-effect)
- “Chain of Thought” → “scratchpad” (“scratchpad” is for the affordance, “chain of thought” is about the content)
One story goes like this: People allegedly used to believe that advanced AIs would take your instructions literally, and turn the entire universe into paperclips if you instructed them to create you a paperclip factory. But if you ask current LLMs, they tell you they won’t turn the entire universe into a paperclip factory just because you expressed a weak desire to use a piece of bent metal to bring order to your government forms. Thus, the original argument (the “Value Misspecification Argument”) is wrong, and the people who believed it should at least stop believing it.
Here’s a different story: Advanced AI systems are going to be optimizers. They are going to be optimizing something. What would that something be? There are two possibilities: (1) they are going to optimize a function of their world model[1], or (2) they are going to optimize a function of their sensors. (See Dewey 2010 on this.) Furthermore, so goes the assumption, they will be goal-guarding: they will take actions to prevent the target of their optimization from being changed. At some point, then, an AI will fix its goal in place. This goal will be to optimize either some function of its world model or some function of its sensors. In case (1) it will want to keep that goal, so for instrumental purposes it may continue improving its world model as it becomes more capable, but keep a copy of the old world model as a referent for what it truly values. In case (2) it will simply have to keep the sensors.
What happens if an AI optimizes a function of its world model? Well, there’s precedent. DeepDream images were created by finding inputs that maximize the activations of neurons in the Inception ConvNet trained on ImageNet. These are some of the results:
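(For concreteness, the procedure behind those images looks roughly like this: a toy PyTorch reconstruction of activation maximization, not the original DeepDream code; the layer inception4c and the activation-norm objective are illustrative choices of mine.)

```python
# Toy DeepDream-style activation maximization: gradient ascent on the input
# so that one intermediate layer of an ImageNet-trained Inception fires hard.
import torch
import torchvision.models as models

model = models.googlenet(weights="DEFAULT").eval()  # GoogLeNet = Inception v1

# Capture the activations of one intermediate layer via a forward hook.
activations = {}
model.inception4c.register_forward_hook(
    lambda module, inputs, output: activations.update(target=output))

# Start from noise; the original DeepDream starts from a photograph instead.
image = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.Adam([image], lr=0.05)

for _ in range(200):
    optimizer.zero_grad()
    model(image)
    loss = -activations["target"].norm()  # ascend the layer's activation norm
    loss.backward()
    optimizer.step()
```

The point is just that the objective is “whatever makes this internal representation fire as hard as possible”, which is the same move the argument below worries about when the representation in question is one of human values.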
So, even if you’ve solved the inner alignment problem, and you get some representation of human values into your AI, if it goal-guards and then ramps up its optimization power, the result will probably look like the DeepDream dogs, but for Helpfulness, Harmlessness and Honesty. I believe we once called this problem of finding a function that is safe to optimize the “outer alignment problem”, but most people seem to have forgotten about this, or believe it’s a solved problem.
One could argue that current LLM representations of human values are robust to strong optimization, or that they will be robust to strong optimization at the time when AIs are capable of taking over. I think that’s probably wrong, because (1) LLMs have many more degrees of freedom in their internal representations than e.g. Inception, so the resulting optimized outputs are going to look even stranger, and (2) I don’t think humans have yet found any function that’s safe and useful to optimize, so I don’t think it’s going to be “in the training data”.
If an advanced AI optimizes some function of its sensors, that is usually called wireheading or reward tampering or the problem of inducing environmental goals. It doesn’t lead to an AI sitting in the corner being helpless, but probably to something like it agentically trying to create an expanding protective shell around some register in a computer somewhere.
This argument fails if (1) advanced AIs are not optimizers, (2) AIs are not goal-guarding, or (3) representations can’t be easily extracted for later optimization.
1. Very likely its weights or activations. ↩︎
Yep, I strongly remember there being an apostrophe in there. Not sure what Geeps was trying to say but I found it great.
I do remember some AI outputs very clearly, e.g. “horizons clitch’ and cline” from a ChatGPT malfunction, and “trading sanity for sanity points”, a Claude continuation of a poem I wrote. Also of course the creepy/wet and bottomless pit supervisor greentexts from good ol’ I-want-to-say-davinci?
Anders writes about this! p. 783-785 (809-811 in the raw PDF). Unfortunately, even with a carbyne chain (“presumably close to the ultimate limits of molecular matter”), it turns out that “the lost mass-energy from extending the cable will be a factor 1.39×10⁹ larger than the work done by the cable”.