If you want to chat, message me!
LW1.0 username Manfred. PhD in condensed matter physics. I am independently thinking and writing about value learning.
My understanding is that pilot wave theory (ie Bohmian mechanics) explains all the quantum physics
This is only true if you don’t count relativistic field theory. Bohmian mechanics has mathematical troubles extending to special relativity or particle creation/annihilation operators.
Is there any reason at all to expect some kind of multiverse?
Depending on how big you expect the unobservable universe to be, there can also be a spacelike multiverse.
Wouldn’t other people also like to use an AI that can collaborate with them on complex topics? E.g. people planning datacenters, or researching RL, or trying to get AIs to collaborate with other instances of themselves to accurately solve real-world problems?
I don’t think people working on alignment research assistants are planning to just turn it on and leave the building; on average (weighted by money) they seem to be imagining doing things like “explain an experiment in natural language and have an AI help implement it rapidly.”
So I think both they and this post are describing the strategy of “building very generally useful AI, but the good guys will be using it first.” I hear you as saying you want a slightly different profile of generally-useful skills to be targeted.
I have now read the paper, and still think you did a great job.
One gripe I have is with this framing:
We believe our articulation of human values as constitutive attentional policies is much closer to “what we really care about”, and is thus less prone to over-optimization
If you were to heavily optimize for text that humans would rate highly on specific values, you would run into the usual problems (e.g. a model incentivized to manipulate the human). Your success here doesn’t come from the formulation of the values per se, but rather from the architecture that turns them into text/actions—rather than optimizing for them directly, you can prompt an LLM that’s anchored on normal human text to mildly optimize them for you.
This difference implies some important points about scaling to more intelligent systems (even without making any big pivots):
- We don’t want the model to optimize for the stated values unboundedly hard, so we’ll have to end up asking for something mild and human-anchored more explicitly.
- If another use of AI is proposing changes to the moral graph, we don’t want that process to form an optimization feedback loop (unless we’re really sure).
The main difference made by the choice of format of values is where to draw the boundary between legible human deliberation, and illegible LLM common sense.
I’m excited for future projects that are sort of in this vein but try to tackle moral conflict, or that try to use continuous rather than discrete prompts that can interpolate values, or explore different sorts of training of the illegible-common-sense part, or any of a dozen other things.
Awesome to see this come to fruition. I think if a dozen different groups independently tried to attack this same problem head-on, we’d learn useful stuff each time.
I’ll read the whole paper more thoroughly soon, but my biggest question so far is whether you collected data about what happens to your observables if you change the process along sensible-seeming axes.
Regular AE’s job is to throw away the information outside some low-dimensional manifold, sparse ~linear AE’s job is to throw away the information not represented by sparse dictionary codes. (Also a low-dimensional manifold, I guess, just made from a different prior.)
If an AE is reconstructing poorly, that means it was throwing away a lot of information. How important that information is seems like a question about which manifold the underlying network “really” generalizes according to. And also what counts as an anomaly / what kinds of outliers you’re even trying to detect.
Ah, yeah, that makes sense.
Even for an SAE that’s been trained only on normal data [...] you could look for circuits in the SAE basis and use those for anomaly detection.
Yeah, this seems somewhat plausible. If automated circuit-finding works it would certainly detect some anomalies, though I’m uncertain if it’s going to be weak against adversarial anomalies relative to regular ol’ random anomalies.
Dictionary/SAE learning on model activations is bad as anomaly detection because you need to train the dictionary on a dataset, which means you needed the anomaly to be in the training set.
How to do dictionary learning without a dataset? One possibility is to use uncertainty-estimation-like techniques to detect when the model “thinks it’s on-distribution” for randomly sampled activations.
Tracking your predictions and improving your calibration over time is good. So is practicing making outside-view estimates based on related numerical data. But I think diversity is good.
If you start going back through historical F1 data as prediction exercises, I expect the main thing that will happen is you’ll learn a lot about the history of F1. Secondarily, you’ll get better at avoiding your own biases, but in a way that’s concentrated on your biases relevant to F1 predictions.
If you already want to learn more about the history of F1, then go for it, it’s not hurting anyone :) Estimating more diverse things will probably better prepare you for making future non-F1 estimates, but if you’re going to pay attention to F1 anyhow it might be a fun thing to track.
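Concretely, “tracking your predictions” can be as simple as logging (probability, outcome) pairs and scoring them—here with a Brier score, using invented example forecasts:

```python
# Minimal sketch of prediction tracking: log (forecast probability, outcome)
# pairs, then compute the Brier score. Lower is better; always answering
# 50% scores 0.25, so beating that means your probabilities carry information.
predictions = [
    (0.9, True),   # e.g. "90% that driver X wins" ... and they did
    (0.6, False),
    (0.2, False),
    (0.7, True),
]

brier = sum((p - float(outcome)) ** 2 for p, outcome in predictions) / len(predictions)
print(round(brier, 3))  # → 0.125
```

Scoring F1 predictions and non-F1 predictions separately would show directly whether the calibration practice transfers between domains.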
Yup, I basically agree with this. Although we shouldn’t necessarily only focus on OpenAI as the other possible racer. Other companies (Microsoft, Twitter, etc) might perceive a need to go faster / use more resources to get a business advantage if the LLM marketplace seems more crowded.
I also like “transformative AI.”
I don’t think of it as “AGI” or “human-level” being an especially bad term—most category nouns are bad terms (like “heap”), in the sense that they’re inherently fuzzy gestures at the structure of the world. It’s just that in the context of 2024, we’re now inside the fuzz.
A mile away from your house, “towards your house” is a useful direction. Inside your front hallway, “towards your house” is a uselessly fuzzy direction—and a bad term. More precision is needed because you’re closer.
The brain algorithms that do moral reasoning are value-aligned in the same way a puddle is aligned with the shape of the hole it’s in.
They’re shaped by all sorts of forces, ranging from social environment to biological facts like how we can’t make our brains twice as large. Not just during development, but on an ongoing basis our moral reasoning exists in a balance with all these other forces. But of course, a puddle always coincidentally finds itself in a hole that’s perfectly shaped for it.
If you took the decision-making algorithms from my brain and put them into a brain 357x larger, that tautological magic spell might break, and the puddle that you’ve moved into a different hole might no longer be the same shape as it was in the original hole.
If you anticipate this general class of problems and try to resolve them, that’s great! I’m not saying nobody should do neuroscience. It’s just that I don’t think it’s an “entirely scientific approach, requiring minimal philosophical deconfusion,” nor does it lead to safe AIs that are just emulations of humans except smarter.
They can certainly use answer text as a scratchpad (even nonfunctional text that gives more space for hidden activations to flow). But they don’t without explicit training. Actually, maybe they do: maybe RLHF incentivizes a verbose style to give more room for thought. But I think even when “thinking step by step,” there are still plenty of issues.
Tokenization is definitely a contributor. But that doesn’t really support the notion that there’s an underlying human-like cognitive algorithm behind human-like text output. The point is the way it adds numbers is very inhuman, despite producing human-like output on the most common/easy cases.
I’m not totally sure the hypothesis is well-defined enough to argue about, but maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?
If the internals were like a human thinking about the question and then giving an answer, it would probably be able to add numbers more reliably. And I also suspect the pattern of mistakes doesn’t look typical for a human at any developmental stage (once a human can add 3-digit numbers, their success rate at 5-digit numbers is probably pretty good). I vaguely recall some people looking at this, but have forgotten the reference, sorry.
A different question: When does it make your (mental) life easier to categorize an AI as conscious, so that you can use the heuristics you’ve developed about what conscious things are like to make good judgments?
Sometimes, maybe! Especially if lots of work has been put in to make said AI behave in familiar ways along many axes, even when nobody (else) is looking.
But for LLMs, or other similarly alien AIs, I expect that using your usual patterns of thought for conscious things creates more problems than it helps with.
If one is a bit Platonist, then there’s some hidden fact about whether they’re “really conscious or not” no matter how murky the waters, and once this Hard problem is solved, deciding what to do is relatively easy.
But I prefer the alternative of ditching the question of consciousness entirely when it’s not going to be useful, and deciding what’s right to do about alien AIs more directly.
Interesting stuff, but I felt like your code was just a bunch of hard-coded, suggestively-named variables with no pattern-matching to actually glue those variables to reality. I’m pessimistic about the applicability—better to spend time thinking about how to get an AI to do this reasoning in a way that’s connected to reality from the get-go.
Exciting stuff, thanks!
It’s a little surprising to me how bad the logit lens is for earlier layers.
I was curious about the context and so I went over and ctrl+F’ed Solomonoff and found Evan saying
I think you’re misunderstanding the nature of my objection. It’s not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it’s that the reasoning in this post is mathematically unsound, and I’m using the formalism to show why. If I weren’t responding to this post specifically, I probably wouldn’t have brought up Solomonoff induction at all.
Thank you for posting this, and it was interesting. Also, I think the middle section is bad.
Basically, from where Lance takes a digression out of an anthropomorphic argument to castigate people who think AI might do bad things for anthropomorphizing, through to where the discussion of Solomonoff induction ends, I think there was a lot of misconstruing of ideas and arguing against nonexistent people.
Like, I personally don’t agree with people who expect optimization daemons to arise in gradient descent, but I don’t say they’re motivated by whether the Solomonoff prior is malign.
I found someone’s thesis from 2020 (Hoi Wai Lai) that sums it up not too badly (from the perspective of someone who wants to make Bohmian mechanics work and was willing to write a thesis about it).
For special relativity (section 6), the problem is that the motion of each hidden particle depends instantaneously on the entire multi-particle wavefunction. According to Lai, there’s nothing better than to bite the bullet and define a “real present” across the universe, and have the hidden particles sometimes go faster than light. What hypersurface counts as the real present is unobservable to us, but the motion of the hidden particles cares about it.
For varying particle number (section 7.4), the problem is that in quantum mechanics you can have a superposition of states with different numbers of particles. If there’s some hidden variable tracking which part of the superposition is “real,” this hidden variable has to behave totally differently from a particle! Lai says this leads to “Bell-type” theories, where there’s a single hidden variable, a hidden trajectory in configuration space. Honestly this actually seems more satisfactory than how it deals with special relativity—you just had to sacrifice the notion of independent hidden variables behaving like particles, you didn’t have to allow for superluminal communication in a way that highlights how pointless the hidden variables are.
Warning: I have exerted basically no effort to check if this random grad student was accurate.