Charlie Steiner

Karma: 8,537

If you want to chat, message me!

LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.

Charlie Steiner 23 Apr 2026 14:57 UTC
2 points
0
on: How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?
Making a story for self-distillation is interesting. Does sheer regularization on a selected dataset really lead the model to make the “obvious” generalization faster than it loses unrelated unused capabilities?
E.g. suppose I train on all-caps history facts, and my optimizer is mostly saying “reduce the size of weights while keeping the prediction the same”. Will it learn to talk in all caps faster than it forgets science facts? If so, why does that cause a bigger decrease in weights while keeping the prediction the same?

Charlie Steiner 23 Apr 2026 14:47 UTC
2 points
0
in reply to: nostalgebraist’s comment on: How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?
There’s probably also a weight decay side of the story. In training on real text, the real text is always pulling the weights up against weight regularization. Train on your own text and instead of pulling up it’s pulling you to where you already are, so there should be some “sag.”
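One way to picture the "sag" is a toy sketch, assuming a one-parameter model and plain L2 weight decay; the data, names, and hyperparameters are all made up for illustration. Training on real targets settles where the data's pull balances the decay. Training on the model's own frozen outputs starts with zero data gradient, so only the decay term moves the weight.

```python
# Toy 1-D "model": prediction = w * x, trained by gradient descent with L2 weight decay.
# All names, data, and hyperparameters are illustrative assumptions.

def train(w, xs, ys, lr=0.1, weight_decay=0.1, steps=2000):
    """Minimize mean((w*x - y)^2) + weight_decay * w^2 by gradient descent."""
    n = len(xs)
    for _ in range(steps):
        data_grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        decay_grad = 2 * weight_decay * w
        w -= lr * (data_grad + decay_grad)
    return w

xs = [-2.0, -1.0, 1.0, 2.0]
true_w = 3.0
real_ys = [true_w * x for x in xs]      # stand-in for "real text"

# Real data pulls w up toward true_w while decay pulls it down.
w_real = train(0.0, xs, real_ys)

# Self-distillation: targets are the trained model's own frozen outputs.
self_ys = [w_real * x for x in xs]
w_self = train(w_real, xs, self_ys)

print(f"true weight:              {true_w:.3f}")
print(f"after real-data training: {w_real:.3f}")
print(f"after self-distillation:  {w_self:.3f}")
```

In this toy the self-trained weight ends up below the weight it was distilled from, which itself sits below the true weight. Whether anything similar dominates in a real LLM is exactly the open question.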

Charlie Steiner 19 Apr 2026 10:22 UTC
2 points
0
on: Consent-Based RL: Letting Models Endorse Their Own Training Updates
When the actor is a delivery robot, I think its output is unsuited for gaming the judge. I mean, maybe it could write a convincing argument out on the sidewalk in theory, but there’s no curriculum to get there. Or in evolutionary terms, no variance to be selected on.
When the actor is an LLM or world model in general, it’s way better at gaming the judge. I’d expect Goodhart’s law to bite—sure, LLMs are good at detecting subtle signals, but also they can often be guided by those subtle signals in human-unintended ways.
How to beat Goodhart’s law here? One angle is to say the AI is being too restricted—it’s unfair to lock in some unsafe RL target and ask self-supervision to make it safe, the AI should be able to take actions that modify the learning process including the reward function to make it less bad. I think this is an eventually-good answer that has unsolved problems: it requires even more trust that the AI is doing the right thing, and to build that trust I probably want a clearer picture of how we’d want an aligned AI to do learning anyhow.
Another angle is like selection vs control. Rather than using our judge to search over a bunch of updates, which seems like it will end up leading to gaming, is there some way to use our judge as part of a system that finds updates in a more control-y way? This is obv kind of crazy, because of the bitter lesson. The GOFAI dream of an AI made of understandable code that it can rewrite to improve itself is far from our reality. But maybe there are neuro-inspired algorithms that have good properties while still leveraging learned models?

Charlie Steiner 10 Apr 2026 23:25 UTC
4 points
2
on: The Unintelligibility is Ours: Notes on Chain-of-Thought
> Has any human ever successfully invented a new language, specifically as a means of solving some non-language related problem?
I think the line is blurry between an AI using tokens differently in chain of thought and human creation of jargon or shorthand. Both are in some sense “new language,” both can arise through both optimization and drift, both might be insufficient for talking about the entire world and require mixture with some larger language in many situations.
The big difference is that humans are basically always using language to solve “language-related problems,” i.e. communicating with other people, which puts a pressure on us that seems absent for optimized chain of thought.

Charlie Steiner 7 Apr 2026 5:46 UTC
2 points
0
on: Mean field sequence: an introduction
I await the next installment : )
I didn’t really understand how you’re computing things, I hope the details of the simple example get filled out.
Like, when you say “2 layer model,” do you mean 1 hidden layer (so two weight matrices)? And when you say you trained a bunch of single neuron models, you mean that each had a single hidden neuron (with that neuron having the same number of inputs and outputs as the neurons in the original model)? And you trained the single-neuron models to predict the difference between ground truth and the original network’s output? Wow, it’s surprising that the distribution is the same! And then when you say you combined the single neuron models, did you just sum the outputs? Wow, it’s surprising that this undoes the subtraction of the original network’s outputs!

Charlie Steiner 6 Apr 2026 21:06 UTC
2 points
0
on: Are there Multiple Moral Endpoints?
> This presents a simple two-by-two matrix, with two coherent corners and two incoherent corners. If you think humans are good and the rites are irrelevant, that makes sense; the rites, by moving people from their natural state, make things worse. If you think humans are bad and the rites are relevant, that makes sense: the rites, by moving people towards an ideal state, make things better. But to believe that humans are bad and the rites are irrelevant is fatalism or unreachable standards (what does it mean for humans to be bad if no advice makes them better?), and to believe that humans are good and the rites are relevant is confusion about ‘what goodness is’ or setting the standards too low (what does it mean for humans to be good if the rites are the guide to use whenever they disagree?).
Was this really Xunzi’s argument? I think there’s the germ of a good argument in here, but the incoherencies don’t seem very incoherent at all.

Charlie Steiner 3 Apr 2026 23:07 UTC
LW: 4 AF: 3
0
AF
on: There should be $100M grants to automate AI safety
Datasets might be nice.
- Object-level values.
  - “What do you like or dislike about my current life?”
  - “What kind of actions do you want to take in the next few weeks?”
  - “What kind of changes would you make to the world around you if you could?”
  - “What are some examples of kindness that you’ve witnessed?”
  - “Come up with a moral dilemma that seems close to you.”
  - “What would you do in this moral dilemma someone else came up with?”
  - etc.
- Meta-level values.
  - “How would you change yourself if you could?”
  - “How do you feel about various ways you expect to grow and change in the future?”
  - “Come up with a fictional disagreement between two people who value different things.”
  - “How do you think these fictional people should resolve their disagreement?”
  - “When you feel torn between different options, how do you think you normally decide?”
  - “How do you think you should decide?”
  - “Watch this morally interesting video and describe what happened, thereby giving it an ontology.”
  - etc.

Charlie Steiner 31 Mar 2026 22:03 UTC
4 points
4
on: Product Alignment is not Superintelligence Alignment (and we need the latter to survive)
I tend to treat the core difference as being that “superintelligence alignment” has to work in domains where humans aren’t good supervisors. Being able to assume good human supervision allows you to do a lot more engineering right now.

Charlie Steiner 24 Mar 2026 22:23 UTC
14 points
6
on: The Fourth World
Of course there are more worlds. You didn’t even talk about baseball.
Baseball, of course, is a world unto itself. If you merely knew of atoms, math, and consciousness, you wouldn’t understand what it really meant to hit a sac fly with runners on two and three^[1]. Imagine trying to explain baseball to a virus. Okay, yeah, you could do it, but the virus wouldn’t thereby be motivated to play baseball—just like the virus wouldn’t “really understand” why suffering mattered if your mere explanation didn’t cause it to care about suffering^[2].
Now, you might not think baseball is as important as math or consciousness. But of course, that’s what you’d say if you were missing out on another world! Structurally, baseball^[3] obeys the rules.
1. ^
  (If we pretend we’re not counting being able to build a model of the world based on senses/atoms that already has a simple representation of atoms/math/consciousness/baseball.)
2. ^
  (Since we’ve defined suffering as some stuff that’s intrinsically motivating to us, it can feel like the motivatingness is an intrinsic property of the suffering, so if we really get the virus to think about the same stuff it will by definition be motivated.)
3. ^
  (Or rather, the ontology we use for baseball.)

Charlie Steiner 14 Mar 2026 19:56 UTC
LW: 2 AF: 2
0
AF
in reply to: Lukas Finnveden’s comment on: Operationalizing FDT
I can’t check today, but whoops, sorry if I typoed the equation at some step.

Charlie Steiner 13 Mar 2026 18:02 UTC
LW: 2 AF: 2
0
AF
in reply to: Lukas Finnveden’s comment on: Operationalizing FDT
Or if your knowledge of the environment does helpful randomization for you (if you’re not >99% sure your two copies will take the same action), CDT’ll at least press the button. But yeah, interesting problem.
Is the correct policy an equilibrium? Suppose the payoff was $5, not $1000. If you all press with probability P, you get: (1-P)^3 of 0, 3P(1-P)^2 of −1, 3P^2(1-P) of 3, and P^3 of 2. Optimal P is 0.8873 for payoff of 2.162.
Now suppose you know your two copies are pressing the button with P=0.8873. You press with probability Q. You get (1-P)^2(1-Q) of 0, 2P(1-P)(1-Q) + (1-P)^2Q of −1, 2P(1-P)Q + P^2(1-Q) of 3, and P^2Q of 2. Optimal Q is 0. If you never press the button, you get 2*0.8873*(1-0.8873) of −1 and 0.8873^2 of 3, which is 2.262.
So if you know your copies are playing the optimal policy for three, you shouldn’t press the button :D
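For anyone who wants to check the arithmetic, here is a minimal sketch that evaluates the two expressions exactly as written above. The function names and the grid search are assumptions of this sketch, and it takes the outcome probabilities and payoffs (0, −1, 3, 2) at face value rather than re-deriving them from the original problem.

```python
# Outcome probabilities and payoffs copied from the comment above.
# Function names and the grid search are illustrative assumptions.

def symmetric_payoff(p):
    """Expected payoff when all three copies press with probability p."""
    return (
        0 * (1 - p) ** 3
        + (-1) * 3 * p * (1 - p) ** 2
        + 3 * 3 * p ** 2 * (1 - p)
        + 2 * p ** 3
    )

def deviating_payoff(p, q):
    """Expected payoff when two copies press with probability p and you press with q."""
    return (
        0 * (1 - p) ** 2 * (1 - q)
        + (-1) * (2 * p * (1 - p) * (1 - q) + (1 - p) ** 2 * q)
        + 3 * (2 * p * (1 - p) * q + p ** 2 * (1 - q))
        + 2 * p ** 2 * q
    )

# Coarse grid search for the best symmetric pressing probability.
grid = [i / 10000 for i in range(10001)]
best_p = max(grid, key=symmetric_payoff)

print(f"best symmetric P = {best_p:.4f}, payoff = {symmetric_payoff(best_p):.3f}")
for q in (0.0, 0.5, 1.0):
    print(f"Q = {q:.1f}: payoff = {deviating_payoff(best_p, q):.3f}")
```

Evaluated this way, the symmetric optimum matches the stated P ≈ 0.8873 and payoff ≈ 2.162. The deviating payoff, however, comes out essentially flat in Q at that P, because the first-order condition for the symmetric optimum makes a single player indifferent. That gives about 2.162 at Q = 0 rather than 2.262. Treat this as a check on the formulas as stated, not a claim about the intended payoff table.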

Charlie Steiner 9 Mar 2026 13:26 UTC
6 points
0
in reply to: Steven Byrnes’s comment on: On The Independence Axiom
Well, if you formalize “gain control of more resources over time” as taking the EV of resources controlled, the agents that also make decisions based on EV of resources controlled will do well. But if you formalized it in a different way, the agents that make decisions in that different way will do well :D
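A small sketch of that point, under assumptions that are mine and not from the thread: a repeated even-money bet won with probability 0.6, one agent that maximizes expected wealth (so it stakes everything each round) and one that maximizes expected log wealth (so it stakes the fraction 2p − 1). Each agent is scored both by mean wealth and by wealth along the typical path.

```python
# Repeated even-money bet won with probability p; stake a fixed fraction f each round.
# The bet, the two agents, and the two scoring rules are illustrative assumptions.

def expected_wealth(f, p, rounds):
    """Mean wealth after `rounds` bets, starting from wealth 1."""
    return (p * (1 + f) + (1 - p) * (1 - f)) ** rounds

def typical_wealth(f, p, rounds):
    """Wealth along the typical path, with exactly round(p * rounds) wins."""
    wins = round(p * rounds)
    return (1 + f) ** wins * (1 - f) ** (rounds - wins)

p, rounds = 0.6, 10
agents = {
    "maximizes expected wealth (stakes everything)": 1.0,
    "maximizes expected log wealth (stakes 2p - 1)": 2 * p - 1,
}
for name, f in agents.items():
    print(f"{name}: mean = {expected_wealth(f, p, rounds):.3f}, "
          f"typical = {typical_wealth(f, p, rounds):.3f}")
```

Each agent comes out ahead on the yardstick that matches its own decision rule and behind on the other, which is the sense in which "doing well" depends on how you formalize it.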

Charlie Steiner 9 Mar 2026 12:27 UTC
13 points
2
in reply to: StanislavKrym’s comment on: On The Independence Axiom
I’m not sure I buy this post’s assertion that UDT violates independence. It seems more like it violates “common sense independence”, in the same way it violates “common sense choosing the best option” when it one-boxes on Newcomb’s problem.
An agent locally acting according to a good policy might violate what a CDT agent would call independence, but it still obeys independence when choosing a policy, i.e. it has a numerical utility function, just not over the same stuff as the CDT agent.

Charlie Steiner 9 Mar 2026 11:19 UTC
12 points
1
on: On The Independence Axiom
You might also be interested in philosopher Lara Buchak’s book Risk and Rationality.
She makes a thought-provoking analogy between making decisions that result in a distribution over future selves and population ethics—in population ethics you’re not required to value everyone linearly, it’s okay to reject utility monsters and say “actually I just prefer universes where people are more equal.” Decision-making without independence is like population ethics over the distribution over future selves.

Charlie Steiner 6 Mar 2026 8:35 UTC
15 points
3
on: Have Americans Become Less Violent Since 1980?
I’m gonna guess you live in the Bay just based on “everything behind locked glass.” Apologies if you’re actually in a part of NYC with lots of locked glass, or if you just use that as an example because your friends online do, etc. Hello from snowy Boston, where shoplifting still exists and has returned to pre-pandemic levels but isn’t a huge political or psychological issue, and not much is behind locked glass. That said, it’s hard to do inter-temporal comparisons and there are definitely ways that shoplifting is harder now than in 1975 (e.g. video cameras), so the decline in shoplifting statistics is only moderate evidence of a reduced “shoplifting propensity.” I just think the Bay Area is an outlier in terms of recent property crime trends, government reaction to them, and social reaction to all of the above, and the lived experience of its residents, while totally valid, might not translate well to talking about crime on average in the US.

Charlie Steiner 5 Mar 2026 19:24 UTC
4 points
1
on: Physics of RL: Toy scaling laws for the emergence of reward-seeking
> is a necessary condition for deceptive alignment
Shouldn’t most alignment failures be sufficient? E.g. if I want to train an AI to promote dumbbells, but it learns to promote dumbbells with arms attached to them^[1], then it might act deceptively aligned purely as part of a well-generalizing strategy that leads to lots of dumbbells with arms attached to them, with no need to think about reward directly.
Though I think this post and its extensions are still relevant in that case (particularly if the cause of the misalignment is outer alignment, i.e. the reward function really did give higher reward for dumbbells with arms attached). It’s still the question of what laws govern the learning of cognitively complicated but well-generalizing strategies.
1. ^
  Source

Charlie Steiner 3 Mar 2026 22:35 UTC
4 points
0
in reply to: Zack_M_Davis’s comment on: I’m Bearish On Personas For ASI Safety
Could you spell out your argument more explicitly for me? I’m unsure if you’re being a moral realist/“uniquist” here, like “But there’s a diversity of human augmentation methods, so most if not all of them have to miss the True Morality, therefore there’s no prima facie moral difference between almost all augmented future humans and model-free RL on a transformer.”
Or another thing you might be saying is something like “A lot of human augmentation methods seem bad or ‘risky’ kind of like model-free RL on a transformer, in a way that’s hard for me to spell out. If we could actually choose good ones, surely we could just actually choose good AI augmentation methods.” Which I basically agree with if these happened on the same timescale. Human augmentation being farther away and slower seems like an important factor in the hope that humans would make decent choices about it.

Charlie Steiner 1 Mar 2026 10:34 UTC
LW: 2 AF: 2
0
AF
in reply to: Andrew_Critch’s comment on: Schelling Goodness, and Shared Morality as a Goal
> steal-man
XD
Anyhow, good points, and sorry for not really engaging with the scale invariance argument; I think it’s definitely plausible. There are some differences between scales (e.g. law enforcement being harder on larger scales) that certainly help make inter-tribe or inter-nation conflict a trickier local equilibrium to escape than inter-personal conflict. More generally, I’m unsure how much we should expect the cosmos-weighted-for-civilization-as-we’d-recognize-it to be full of civilizations that proactively move towards Pareto improvements even when the environment is far away from them, versus civilizations that just sort of stumble around and try different cultural innovations until they hit ones that work just well enough.

Charlie Steiner 1 Mar 2026 5:31 UTC
LW: 28 AF: 16
0
AF
on: Schelling Goodness, and Shared Morality as a Goal
My problem with your treatment of the civilization that’s happy to steal from the outgroup isn’t that they’ll disagree that “stealing is bad” is the Schelling answer to that question^[1]. It’s that they’ll think the question is unnatural—you’ve lumped together two different things, “stealing from the ingroup” and “stealing from the outgroup,” and if you split the question up you’d get much more natural agreement that “stealing from the ingroup is bad” is the Schelling answer as is “stealing from the outgroup is good”.
Asking different questions (or equivalently, defining words in different ways as you ask the question) leads to different generalization behavior, if you’re being influenced by your conception of the “shared morality.”
1. ^
  Assuming you pick the same reference population—if we’re using the standard “success at being a civilization like ours” (even as an implicit meta-standard we use for picking our other standards), they might use “success at being a civilization like theirs.” If weighting by resources commanded, I think you’re underweighting bacteria and singletons that have eaten their planet of origin.

Charlie Steiner 20 Feb 2026 20:54 UTC
10 points
3
in reply to: Seth Herd’s comment on: AGI is Here
Right. When we’re far away from things, treating them as points is a useful approximation. Take the question “Which way is my house?” When I am across the city, this is a useful question with a straightforward answer. When I am in the yard, or worse, inside the house, I can no longer treat my house as a point.
It is precisely because we are near to AGI (I’ve felt “inside the house” since GPT-2) that questions that treat this construct as a point aren’t very useful.