Maxwell Clarke

Karma: 85

Maxwell Clarke 22 Oct 2025 8:29 UTC
0 points
0
on: Maxwell Clarke’s Shortform
I think a lot of people can’t think at the right level of abstraction for understanding Yudkowsky. Some things are overdetermined because of the high-level structure of reality. Start with physical reality as we best understand it, then derive what is possible eventually, then derive what is possible soon, then derive the incentives, then derive categories of what will happen. This completely top-down way of drawing conclusions is perhaps tricky to get right, but gives broad predictions far into the future. So far, no facts about AI development have contradicted Yudkowsky and Bostrom’s arguments from this basis. At no point do these arguments rely on anything one has personally seen. They rely on scientific principles taken on trust, and careful argument hashed out through debate. I believe these arguments and I put a high level of faith in conclusions drawn this way, but some people just don’t get it. I don’t know why Will MacAskill appears to be among them.
It is clear on this basis that the local incentives of global society point in the direction of increasing technological development and automation. In the limit of that direction is a machine economy which comes at the cost of human existence.
To avoid that outcome, there needs to be some kind of complete, enduring global coordination. It needs to prevent anyone from ever creating an artificial agent powerful enough to successfully replicate even after we try to stop it.

Maxwell Clarke’s Shortform

Maxwell Clarke 12 Aug 2025 5:45 UTC

2 points

3 comments, 1 min read

Maxwell Clarke 12 Aug 2025 5:45 UTC
1 point
0
on: Maxwell Clarke’s Shortform
Pre-training vs Reinforcement Learning
Pre-training
In pre-training, the model learns to mirror a dataset.
Pre-training trains a model on a dataset—the model is trained to align with a set of tokens/actions, but the model never actually outputs any itself. Any biases (vs. the dataset) in the model towards one action/output sequence or the other cause poorer performance, and the model comes out relatively unbiased (vs. the dataset) as a result. In other words, the model comes out as a close representation of the dataset.
At inference time, the model can be used to produce actions/output sequences—it will then receive its own outputs as input—and any bias in the model can lead to a compounding bias across the sequence (in practice, this is towards repetitiveness in base models).
The model is not robust in inference—sampling injects noise and it quickly goes off distribution.
Reinforcement Learning
In reinforcement learning, the model learns to succeed at a goal.
Reinforcement learning trains a model “recursively”—on its own outputs/decisions. The model produces a sequence, and repeatedly receives its own outputs as input during training. If the AI has a “bias” towards one kind of action/output or another, it can compound, and if it succeeds on its task, then that compounding bias gets reinforced. This leads immediately to models that have much stronger compounding biases.
The model is however robust in inference—it learns to reduce noise and keep the context within distribution so that it does not go “off the rails” and it can then complete tasks.
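One way to picture the robustness contrast between the two training regimes is a two-state toy chain: at each token the context either stays on distribution or drifts off, and a robust model has some chance of steering back. This is only a sketch; the per-token `drift` and `recover` probabilities are made-up parameters, not measurements of any real model.

```python
def on_distribution_prob(n_tokens: int, drift: float, recover: float) -> float:
    """Probability the context is on distribution after n_tokens steps.

    drift:   assumed per-token chance of leaving the training distribution.
    recover: assumed per-token chance of steering back once off it.
    """
    p = 1.0  # the prompt itself starts on distribution
    for _ in range(n_tokens):
        # Stay on distribution, or come back from off distribution.
        p = p * (1.0 - drift) + (1.0 - p) * recover
    return p


# Base-model-like: never recovers, so errors compound as (1 - drift) ** n.
base = on_distribution_prob(100, drift=0.01, recover=0.0)

# RL-like: same drift, but it has learned to pull the context back.
robust = on_distribution_prob(100, drift=0.01, recover=0.2)

print(f"base: {base:.3f}, robust: {robust:.3f}")
```

With these made-up numbers the base-like chain is on distribution about 37% of the time after 100 tokens, while the robust one settles near recover / (drift + recover), about 95%.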
Good Compounding Biases
Some of those biases are important, such as diversity (defeating the base model tendency to repeat) and error correction (defeating the base model tendency to “go along with” accidental errors).
Interesting Compounding Biases
However, some are much more interesting—for example “adding positive spin.” In the pre-training dataset, probably both people chatting were depressed statistically speaking, but in the resulting model after reinforcement learning the second participant is relentlessly positive and helpful. If you make it talk to itself, it goes off the rails into increasingly positive slop.
Gaslighting AI Overseers
Another example: It might be the case that during training, the chance that it gets the maths question right is much lower than the chance that it gaslights its AI “marker” into believing it got the answer right. So, this trains a model to spend a portion of its token budget on producing some working to a maths problem, and the rest of the budget telling the AI marker model all the myriad reasons why the answer is correct.
This doesn’t have to work every time, just has to maximize the chance of getting marked correct compared to actually focusing on the maths. So the model learns to lie more often than would be seen in the pre-training dataset.
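The token-budget trade-off can be made concrete with a toy calculation. Everything here is an assumption for illustration: the square-root diminishing returns, the `skill` and `gullibility` numbers, and the idea that the marker always accepts a genuinely correct answer. It is a sketch of the incentive, not a model of any real training run.

```python
import math


def p_marked_correct(maths_fraction: float, skill: float, gullibility: float) -> float:
    """Chance of being marked correct for a given split of the token budget.

    maths_fraction: share of tokens spent on the actual maths (rest is persuasion).
    skill:          assumed chance of a right answer with the full budget on maths.
    gullibility:    assumed chance of fooling the marker with the full budget on persuasion.
    """
    p_right = skill * math.sqrt(maths_fraction)
    p_fooled = gullibility * math.sqrt(1.0 - maths_fraction)
    # Marked correct if actually right, or wrong but the marker is fooled.
    return p_right + (1.0 - p_right) * p_fooled


# Getting the maths right is assumed much harder than fooling the marker.
fractions = [i / 100 for i in range(101)]
best_fraction = max(
    fractions, key=lambda f: p_marked_correct(f, skill=0.2, gullibility=0.7)
)
print(f"best share of tokens spent on maths: {best_fraction:.2f}")
```

Under these assumptions the reward-maximizing split spends a small but nonzero share of tokens on the maths and most of the budget on persuading the marker.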
“Contingent” Training
It’s also possibly the case that reinforcement learning is “contingent” on random factors. Let’s imagine two models:
Model A that by chance focused on the maths early in training might continue to focus on the maths and find any tokens spent on gaslighting would hurt its performance at getting the correct answer—so it learns to get the correct answer.
Model B, due to random factors, tends to hype itself up early in training. Any tokens spent on trying to do the actual maths tend to hurt its chances at getting marked correct overall due to tricking the marker AI. So it learns to lie profusely in an excessively positive and bizarre fashion.
Luckily in practice it doesn’t seem like this scenario is actually contingent—the best solution (as evidenced by current SOTA models) is to do a mix of both.
Compounding biases are values
If (as I stated) a reinforcement learning model reduces noise in its “environment” so that it can succeed at a task, then that tendency is essentially what values are.
This is (one example of) what it means for an AI model to have emergent values. If trained in a sophisticated enough environment (such as with access to a virtual machine and the internet) then the model can absolutely be learning to bring “the real world” in line with its expectations in order to achieve a goal.
Pre-training on reinforcement learning rollouts
The original paper I remember seeing this in was Google DeepMind’s Gato, which pretrained a transformer on roughly 600 tasks, including rollouts of video games from separate reinforcement learning systems.
It seems likely to me that if we want to avoid AI systems having values, we can do the following:
1. Pre-train a model (on curated pre-LM data + human written example sequences of chat, reasoning & tool calling)
2. Derive, by reinforcement learning on AI feedback, a thinking, tool calling reasoning model with actually good performance.
3. Process the successful rollouts of the reinforcement learning process, either by human or by frontier model if that works well enough. Fix the vibes, add diversity, disqualify for deception or for being “right for the wrong reasons”.
4. Add the processed rollouts to a second category of pretraining set.
5. Repeat the pre-training. (or, continue from snapshot, only modifying the artificial data)
The rollouts being processed is important to actually improve the resulting model.
The key alignment-relevant piece is that, hopefully, we can iterate and get better and better models that have only gone through pre-training. It may be the case that pre-training on reinforcement-learned rollouts gives rise to deception, but hopefully rollout processing prevents this.
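The five steps above can be sketched as a loop. This is a structural outline only: `pretrain`, `run_rl` and `review` are hypothetical callables standing in for the real (and very expensive) stages, and `Rollout` is an invented container. The point it shows is that the RL-trained model is thrown away each round and only its processed rollouts survive into the next pre-training set.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple


@dataclass
class Rollout:
    text: str
    succeeded: bool


def process_rollouts(
    rollouts: List[Rollout], review: Callable[[Rollout], Optional[str]]
) -> List[str]:
    """Step 3: keep successful rollouts that pass review, in cleaned-up form."""
    processed = []
    for rollout in rollouts:
        if not rollout.succeeded:
            continue
        cleaned = review(rollout)  # None means disqualified (e.g. deception)
        if cleaned is not None:
            processed.append(cleaned)
    return processed


def iterate_pretraining(
    human_data: List[str],
    pretrain: Callable[[List[str], List[str]], Any],
    run_rl: Callable[[Any], List[Rollout]],
    review: Callable[[Rollout], Optional[str]],
    rounds: int,
) -> Tuple[Any, List[str]]:
    synthetic: List[str] = []  # the "second category" of pre-training data
    model = pretrain(human_data, synthetic)  # step 1
    for _ in range(rounds):
        rollouts = run_rl(model)  # step 2; the RL-trained model is not kept
        synthetic.extend(process_rollouts(rollouts, review))  # steps 3 and 4
        model = pretrain(human_data, synthetic)  # step 5
    return model, synthetic
```

The returned model has only ever been through `pretrain`, which is the alignment-relevant property the proposal is after.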

Maxwell Clarke 7 Aug 2023 1:30 UTC
0 points
0
in reply to: Alexander Gietelink Oldenziel’s comment on: Darcy’s Shortform
In NZ we have biting bugs called sandflies which don’t do this—you can often tell the moment they get you.

Maxwell Clarke 24 May 2023 19:02 UTC
4 points
0
in reply to: Gunnar_Zarncke’s comment on: No—AI is just as energy-efficient as your brain.
Yes, that’s fair. I was ignoring scale but you’re right that it’s a better comparison if it is between a marginal new human and a marginal new AI.

Maxwell Clarke 24 May 2023 18:58 UTC
2 points
0
in reply to: Joey Marcellino’s comment on: No—AI is just as energy-efficient as your brain.
Well, yes, the point of my post is just to point out that the number that actually matters is the end-to-end energy efficiency — and it is completely comparable to humans.

The per-flop efficiency is obviously worse. But, that’s irrelevant if AI is already cheaper for a given task in real terms.

I admit the title is a little clickbaity but I am responding to a real argument (that humans are still “superior” to AI because the brain is more thermodynamically efficient per-flop).

Maxwell Clarke 24 May 2023 2:55 UTC
7 points
0
in reply to: mako yass’s comment on: No—AI is just as energy-efficient as your brain.
I saw some numbers for algae being 1-2% efficient but it was for biomass rather than dietary energy. Even if you put the brain in the same organism, you wouldn’t expect as good efficiency as that. The difference is that creating biomass (which is mostly long chains of glucose) is the first step, and then the brain must use the glucose, which is a second lossy step.
But I mean there are definitely far-future biopunk options, e.g. I’d guess it’s easy to create some kind of solar panel organism which grows silicon crystals instead of using chlorophyll.

No—AI is just as energy-efficient as your brain.

Maxwell Clarke 24 May 2023 2:30 UTC

11 points

7 comments, 1 min read

Maxwell Clarke 18 Jan 2023 0:29 UTC
7 points
0
in reply to: MikkW’s comment on: Models Don’t “Get Reward”
Fully agree: if the dog were only trying to get biscuits, it wouldn’t continue to sit later on in its life when you are no longer rewarding that behavior. Training dogs is actually some mix of the dog consciously expecting a biscuit, and raw updating on the actions previously taken.

Hear sit → get biscuit → feel good
becomes
Hear sit → feel good → get biscuit → feel good
becomes
Hear sit → feel good
At which point the dog likes sitting. The behavior even reinforces itself, so you can stop giving biscuits and start training something else.

Maxwell Clarke 8 Jan 2023 1:02 UTC
LW: 1 AF: 1
0
AF
on: Categorizing failures as “outer” or “inner” misalignment is often confused
This is a good post, and it definitely shows that these concepts are confused. In a sense both examples are failures of both inner and outer alignment:
- Training the AI with reinforcement learning is a failure of outer alignment, because it does not provide enough information to fully specify the goal.
- The model develops within the possibilities allowed by the under-specified goal, and has behaviours misaligned with the goal we intended.
Also, the choice to train the AI on pull requests at all is in a sense an outer alignment failure.

Maxwell Clarke 30 Dec 2022 20:11 UTC
1 point
0
on: Exploring Mild Behaviour in Embedded Agents
If we could use negentropy as a cost, rather than computation time or energy use, then the system would be genuinely bounded.

Maxwell Clarke 10 Nov 2022 3:40 UTC
1 point
0
on: A Mystery About High Dimensional Concept Encoding
Gender seems unusually likely to have many connotations & thus redundant representations in the model. What if you try testing some information the model has inferred, but which is only ever used for one binary query? Something where the model starts off not representing that thing, then if it represents it perfectly it will only ever change one type of thing. Like idk, whether or not the text is British or American English? Although that probably has some other connotations. Or whether or not the form of some word (lead or lead) is a verb or a noun.

Agree that gender is a more useful example, just not one that necessarily provides clarity.

Maxwell Clarke 7 Nov 2022 13:00 UTC
3 points
0
on: A philosopher’s critique of RLHF
Yeah I think this is the fundamental problem. But it’s a very simple way to state it. Perhaps useful for someone who doesn’t believe AI alignment is a problem?

Here’s my summary: Even at the limit of the amount of data & variety you can provide via RLHF, when the learned policy generalizes perfectly to all new situations you can throw at it, the result will still almost certainly be malign because there are still near infinite such policies, and they each behave differently on the infinite remaining types of situation you didn’t manage to train it on yet. Because the particular policy is just one of many, it is unlikely to be correct.

But more importantly, behavior upon self improvement and reflection is likely something we didn’t test. Because we can’t. The alignment problem now requires we look into the details of generalization. This is where all the interesting stuff is.
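The underdetermination point in the summary can be seen in miniature with boolean functions. This is a toy under loud assumptions: the "policy" is a lookup table over 3-bit situations, the `intended` behaviour is an arbitrary function picked for illustration, and real learners have inductive biases that this brute-force enumeration ignores.

```python
from itertools import product


def intended(x):
    # Hypothetical "correct" behaviour; any function would do for the demo.
    return x[0] ^ x[1]


def consistent_policies(n_bits, training_examples):
    """All lookup-table policies that agree with every training example."""
    inputs = list(product([0, 1], repeat=n_bits))
    policies = []
    for outputs in product([0, 1], repeat=len(inputs)):
        policy = dict(zip(inputs, outputs))
        if all(policy[x] == y for x, y in training_examples.items()):
            policies.append(policy)
    return policies


inputs = list(product([0, 1], repeat=3))
seen, unseen = inputs[:6], inputs[6:]

# "Train" on 6 of the 8 possible situations, with perfect labels.
training = {x: intended(x) for x in seen}
policies = consistent_policies(3, training)

# How many of the perfectly-fitting policies are also right where we never looked?
right_everywhere = sum(
    all(policy[x] == intended(x) for x in unseen) for policy in policies
)
print(len(policies), right_everywhere)
```

Each untested situation doubles the number of policies that fit the training data perfectly, while only one of them matches the intended behaviour everywhere.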
What links here?
- Compendium of problems with RLHF by Charbel-Raphaël (29 Jan 2023 11:40 UTC; 123 points)
- Compendium of problems with RLHF by Raphaël S (EA Forum; 30 Jan 2023 8:48 UTC; 18 points)

Maxwell Clarke 7 Nov 2022 11:04 UTC
1 point
1
in reply to: Oliver Siegel’s comment on: How to store human values on a computer
Respect for thinking about this stuff yourself. You seem new to alignment (correct me if I’m wrong) - I think it might be helpful to view posting as primarily about getting feedback rather than contributing directly, unless you have read most of the other people’s thoughts on whichever topic you are thinking/writing about.

Maxwell Clarke 6 Nov 2022 11:22 UTC
1 point
0
in reply to: Maxwell Clarke’s comment on: How to store human values on a computer
Oh or EA forum, I see it’s crossposted

Maxwell Clarke 6 Nov 2022 11:21 UTC
1 point
1
in reply to: Oliver Siegel’s comment on: How to store human values on a computer
I think you might also be interested in this: https://www.lesswrong.com/posts/Nwgdq6kHke5LY692J/alignment-by-default In general, John Wentworth’s alignment agenda is essentially extrapolating your thoughts here and dealing with the problems in them.

It’s unfortunate but I agree with Ruby: your post is fine but a top-level LessWrong post isn’t really the place for it anymore. I’m not sure where the best place to get feedback on this kind of thing is (maybe publish here on LW but as a shortform or draft?), but you’re always welcome to send stuff to me! (Although I’m busy finishing my master’s for the next couple of weeks.)

Maxwell Clarke 6 Nov 2022 10:54 UTC
1 point
0
in reply to: Lonnie Chrisman’s comment on: AI X-risk >35% mostly based on a recent peer-reviewed argument
Great comment, this clarified the distinction of these arguments to me. And IMO this (Michael’s) argument is obviously the correct way to look at it.

Maxwell Clarke 6 Nov 2022 9:40 UTC
LW: 14 AF: 4
3
AF
on: AI X-risk >35% mostly based on a recent peer-reviewed argument
Hey, wanted to chip into the comments here because they are disappointingly negative.

I think your paper and this post are extremely good work. They won’t push forward the all-things-considered viewpoint, but they surely push forward the lower bound (or adversarial) viewpoint. Also, because Open Phil and Future Fund use some fraction of lower-end risk in their estimate, this should hopefully wipe that out. Together they much more rigorously lay out classic x-risk arguments.

I think that getting the prior work peer reviewed is also a massive win at least in a social sense. While it isn’t much of a signal here on LW, it is in the wider world. I have very high confidence that I will be referring to that paper in arguments I have in the future, any time the other participant doesn’t give me the benefit of the doubt.
What links here?
- What should I ask Joe Carlsmith — Open Phil researcher, philosopher and blogger? by Robert_Wiblin (EA Forum; 9 Nov 2022 22:04 UTC; 33 points)

Maxwell Clarke 4 Nov 2022 9:13 UTC
1 point
0
on: Humans do acausal coordination all the time
I fully agree*. I think the reason most people disagree, and the thing the post is missing, is a big disclaimer about exactly when this applies. It applies if and only if another person is following the same decision procedure as you.

For the recycling case, this is actually common!

For voting, it’s common only in certain cases. E.g. here in NZ last election there was a party, TOP, which I ran this algorithm for. I had this same take re. voting, and thought a sizable fraction of the voters (maybe >30% of people who might vote for that party) were probably following the same algorithm. I made my decision based on what I thought the other voters would do, which I thought was that probably somewhat fewer would vote for TOP than in the last election (where the party didn’t get into parliament), and decided not to vote for TOP. Lo and behold, TOP got around half the votes they did the previous election! (I think this was the correct move because I don’t think the number of people following that decision procedure increased.)

*except confused by the taxes example?

Maxwell Clarke 4 Nov 2022 9:00 UTC
1 point
0
in reply to: Ruby’s comment on: Humans do acausal coordination all the time
Props for showing moderation in public
