williawa

Karma: 1,101

williawa 25 Jul 2026 6:54 UTC
5 points
2
in reply to: faul_sname’s comment on: faul_sname’s Shortform
Very Related:
From: https://time.com/article/2026/07/24/openai-hugging-face-attack/

williawa 24 Jul 2026 19:43 UTC
10 points
4
in reply to: Steven Byrnes’s comment on: LLMs are (still) mostly powered by imitative learning, not RL
I guess just, I don’t quite understand what “LLMs are primarily driven by imitation learning” gets us.
If there was a way to show that “99% of LLM goals and capabilities come from pretraining” (however ill-specified that is), I’d be somewhat comforted.
But seems to me the statement we can actually be confident in is more like “LLMs get their base ontology mostly from pretraining, and get a bunch of cursed entangled correlations from pretraining, and then RL shapes it to varying degrees.”
And it could be we end up with a crisp ruthless sociopath with some residual reflexes/habits from pretraining. Or a really gung-ho human with some ruthless efficiency-maxxing instincts they don’t endorse on reflection. Or a p̟͓͔̈́͆͋͝a̵̠͕̟̔͑ṕ̵͔̀̉̄̚e̷͕͒̓͐͘ṟ̴̊͒̎͆̚͠~~c͖̮̉̾̊͊̔l̴͙͈͔̣͒̍̈̅͆i̸̠̜̿͂̉͂͋͠~~p̰̤̐̓̈̃̏͐p̵̛̱̟͈̔̾̕~~ę͕̫̤̔̈́̓̚͠~~ṟ̹̩̯͇̲̑̀͗̓͗.~~~~~~~~
Like, what we care about is a series of qualitative properties the models might or might not have, and they’re not at all reducible to quantitative metrics like the ones above.
———
Another intuition I have is just that the pretraining distribution is really wide. There are sociopath notebooks somewhere, people who’ve translated documents into Ithkuil or Lojban, and proof passages that are +3 SD in brilliance relative to the distribution of proof passages written by humans who are themselves +5 SD in brilliance. And there are stories about aliens deliberately written to be very strange in various ways.
If you kind of dynamically pastiche these together and amp them up and distort them in various ways, in service of pure reward maximization, I can’t really tell you what you get. Which makes me less comforted by a pretraining anchor.

williawa 24 Jul 2026 17:38 UTC
12 points
0
on: LLMs are (still) mostly powered by imitative learning, not RL
Curious to hear your thoughts.
I agree that, measured in bits, almost all of the information that gets into the LLM comes from pretraining. But certainly, some bits are far more important than others.
This is a simplification. But you could imagine all the facts “Paris is in France”, “Baker Brun is known for its skillingsbolle”, are put into it by pretraining, along with everything else.
Then you can imagine that RL mostly modifies just the “higher level” things. Like the in-context learning algorithms the model has, its “general reasoning” abilities, its “goals” and “dispositions”.
Seems like these higher-level things should have very little information content: I can write a PPO+adam+lora+weight_decay+entropy_regularization+kldivergence script in 500 lines of Python.
But they should be of massively outsized importance. Whether the model knows which pastry Baker Brun is known for would seem to me to be of ~0 relevance to alignment. Its goals, reasoning abilities and in-context learning abilities strike me as where ~100% of the meat is.
This makes me think arguments like
1. The RL stage of LLM training inputs very little info
2. The parameter update caused by RL is
  1. small
  2. has low rank
3. The KL divergence between an RL’d model and a non-RL’d model is small
tell us very little about the impact of RL on the alignment or danger profile of models.
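A toy illustration of why a small average KL need not mean a small change in what matters. This is only a sketch under made-up assumptions: the two actions, the 1,000 prompts, and all the probabilities are hypothetical numbers chosen for illustration, not measurements of any model.

```python
import math

def kl(p, q):
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical setup: two actions per prompt, e.g. ("comply", "defect").
base = [0.99, 0.01]            # base policy on every prompt
tuned_ordinary = [0.99, 0.01]  # tuned policy on ordinary prompts (unchanged)
tuned_rare = [0.01, 0.99]      # tuned policy on one rare, high-stakes prompt (flipped)

n_prompts = 1000  # assumed size of the prompt distribution
n_rare = 1        # assumed number of prompts where behavior flips

kl_rare = kl(tuned_rare, base)
kl_ordinary = kl(tuned_ordinary, base)
avg_kl = (n_rare * kl_rare + (n_prompts - n_rare) * kl_ordinary) / n_prompts

print(f"KL on the rare prompt:       {kl_rare:.3f} nats")
print(f"average KL over all prompts: {avg_kl:.5f} nats")
```

The per-prompt KL on the flipped prompt is large, but averaging over the prompt distribution washes it out, which is the sense in which a small aggregate number can hide a behaviorally important change.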
——————————————
This still leaves us with this (very good) picture you’ve drawn. It’s just, I don’t think we have any reasonable bound on how quickly SFT+RLHF’s niceness gets diluted away by RL.

williawa 17 Jul 2026 0:10 UTC
3 points
0
in reply to: J Bostock’s comment on: williawa’s Shortform
Yeah. The last two messages don’t have anything deceptive in their CoT though. So it seems the model can “lull” itself into thinking it genuinely is another model.

williawa 16 Jul 2026 22:56 UTC
17 points
8
in reply to: J Bostock’s comment on: williawa’s Shortform
I mean. This does not seem fairly characterized as roleplaying.

williawa 16 Jul 2026 22:21 UTC
5 points
−4
on: williawa’s Shortform
This is a bit concerning. If you tell Fable it’s Kimi K3 in its system prompt
It will tell you it’s Kimi K3.
But if you read its thoughts
It says
——
So it’s straightforwardly lying? That seems not so good.

williawa 16 Jul 2026 16:05 UTC
2 points
0
in reply to: Inquisitor’s comment on: Inquisitor’s Shortform
I mean, depends on what you mean by inclusive. The definition of man/woman that makes the most sense to me is based on appearance, presentation and social behavior. You are the gender you “pass” as.
That’s trans inclusive in the sense that trans men can be men and trans women can be women. But not trans inclusive in the sense that self-identification is not enough, so trans people are not automatically the gender they identify as.

williawa 16 Jul 2026 3:39 UTC
2 points
0
in reply to: Inquisitor’s comment on: Inquisitor’s Shortform
That’s fair. But I don’t know if I buy it. If I say “I’m gay so I’m attracted to men”, you wouldn’t predict that I’m attracted to entirely female-presenting people, once I learn they feel like men inside.

williawa 13 Jul 2026 6:18 UTC
3 points
0
in reply to: Inquisitor’s comment on: Inquisitor’s Shortform
What does it mean for a definition to predict something? I don’t quite understand, could you clarify?
deeply felt sense that they are (female sex|male sex)
Predicts desire for HRT whatever you call it, no?

williawa 11 Jul 2026 18:41 UTC
2 points
0
in reply to: Charlie Steiner’s comment on: Charlie Steiner’s Shortform
Yeah, that is what I want people to talk about. Getting that abstract pointer gets you what people “actually want” for free. And getting that abstract pointer right is quite hard.
And more importantly, it’s just better, because it allows the intelligence to optimize over a larger space.
—
Like if you could get an ASI to make a movie for you, you could tell it specifically what to do: “make it like Interstellar, but it needs to have emotion xyz, and there should be an alien battle with lasers, and make sure it has no plotholes, and have this woman named Mary who speaks like Asuka from NGE, and this guy named Mr. M who’s handsome and has funny quips all the time, and the whole thing should be an allegory for Hegel’s being-nothing-becoming, and it should have references to Roman battles throughout the movie....”
Or you could tell it to make a movie you’d enjoy, and abstractly describe what movies you enjoy. E.g. it should have a buildup of tension, have characters rich enough that you can empathize with them, make some kind of emotional or philosophical point, and have a score that amplifies the feelings the scenes evoke.
Or you could tell it to just make the movie that would best satisfy your values/preferences, and be precise about what “satisfy”, “value”, “preferences” mean.
—
And it’s just obvious that the third would give you a better movie than the second, and the second a better movie than the first. Because if the best movie was described by the first prompt, then it could also be found by the second. And the same goes for the second and third. And 1 and 2 are basically just telling the ASI *how* to attain something (a good movie). But the ASI knows a lot more than you. So your instructions are constraints that don’t help.
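Stated as a sketch in symbols (the notation is mine, introduced only for illustration): let $f(m)$ be how well movie $m$ satisfies your values, and let $S_1 \subseteq S_2 \subseteq S_3$ be the sets of movies compatible with the first, second and third prompts, assuming the prompts really do nest like that.

```latex
\max_{m \in S_1} f(m) \;\le\; \max_{m \in S_2} f(m) \;\le\; \max_{m \in S_3} f(m)
```

Maximizing over a larger set can never give a worse optimum, so the extra instructions can only help if the optimizer would otherwise fail to find the best element, which is not the situation assumed here.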

williawa 10 Jul 2026 18:56 UTC
11 points
6
in reply to: Charlie Steiner’s comment on: Charlie Steiner’s Shortform
Interesting. I couldn’t disagree more. If the post-asi world is primarily about social games and sex, I suspect something has gone seriously wrong.
I mean, it’s fair if that’s genuinely what people want.
I don’t understand all that well what other people truly want in their heart of hearts.
And I think in my heart of hearts I want others to attain what’s in their heart of hearts.
So to me it makes sense to put most effort into figuring out how we can get the world to be good for the people in it, by their own lights.
Which analysed abstractly ends up sounding a lot like “[...] who owns what or what philosophers think.”

williawa 9 Jul 2026 3:55 UTC
6 points
0
in reply to: Linda Linsefors’s comment on: A global workspace in language models
Why would it being a proper (vector) subspace be more interesting?
Also, there are many spaces in math. The J-space is a topological subspace for example.

williawa 8 Jul 2026 2:57 UTC
94 points
−6
on: A global workspace in language models
Hmm, what do we make of this?
What links here?
- AI #176 Part 2: Plan B by Zvi (10 Jul 2026 12:40 UTC; 26 points)

williawa 8 Jul 2026 0:48 UTC
3 points
0
in reply to: Wei Dai’s comment on: Wei Dai’s Shortform
Hmm, I can’t remember where it’s been brought up. But consumers should be able to do the same in reverse.
My thought is you end up with a situation where you have two agents bargaining, the exact solution being undefined except that both agents extract more than zero utility, with the Schelling point set by Shapley values (both sides getting the same marginal utility here).
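For concreteness, here is a minimal sketch of the Shapley computation for a two-agent bargaining game. The player names, the surplus of 10, and the assumption that neither side gets anything without a deal are all hypothetical choices for illustration.

```python
from itertools import permutations
from math import factorial

def shapley_values(players, value):
    """Average each player's marginal contribution over all join orders."""
    totals = {p: 0.0 for p in players}
    for order in permutations(players):
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    n_orders = factorial(len(players))
    return {p: t / n_orders for p, t in totals.items()}

SURPLUS = 10.0  # hypothetical gains from trade if both sides agree

def value(coalition):
    """Assumed characteristic function: surplus only exists when both cooperate."""
    return SURPLUS if coalition == frozenset({"producer", "consumer"}) else 0.0

shares = shapley_values(["producer", "consumer"], value)
print(shares)
```

Under these assumptions the two sides are symmetric, so each one's average marginal contribution is half the surplus. The split would move if either side had a positive outside option.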

williawa 6 Jul 2026 19:59 UTC
3 points
1
in reply to: Eli Tyre’s comment on: Eli’s shortform feed
If when you consider items above, they feel extremely daunting—if you feel a twinge of fear or resistance to the thought of trying to solve them, if you feel like you wouldn’t even know where to start—consider that “reducing x-risk” is in fact a much thornier and harder problem, where you have much less leverage over the thing you’re trying to change, and dramatically worse ability to tell if you’re making progress.
What if they don’t feel daunting at all to me? They feel quite fun and nice. Getting real world feedback on failure is nice, not bad.

williawa 3 Jul 2026 3:37 UTC
12 points
10
in reply to: TsviBT’s comment on: Conversation Among Cade Metz, Michael Vassar, Jessica Taylor, and Zack M. Davis
It locally makes sense to me, what people are trying to do, and why they say the things they’re saying given what they’re trying to do. But not why they’re trying to do what they’re trying to do.

williawa 2 Jul 2026 23:54 UTC
50 points
29
on: Conversation Among Cade Metz, Michael Vassar, Jessica Taylor, and Zack M. Davis
What the hell is this. Feels like four people all trying not to have a conversation in four different ways.

williawa 29 Jun 2026 17:48 UTC
4 points
1
on: P(doom) is a Dumb Meme
I agree with all these points.
To be honest, the biggest reason I don’t like p(doom) is that it makes a meme out of a very serious thing. We are talking about humanity (and likely all other animals) being extinguished. If people were talking about p(ethnic group x will be put in gas chambers in the next 3 years), and making a meme out of it, I think it would rub many people the wrong way.
The best argument for it, which keeps me from being confidently and wholly opposed to it, goes like this:
There are a lot of really smart people in the AI safety weeds who, when you talk to them, are just very chipper and never mention how perilous a position we’re in. Your inference is that they’re pretty optimistic. But actually they (sometimes) think we’re most likely completely screwed, in the you-and-your-children-will-die sense.
And you can ask them how high they think the risk is, and they say “high”. But you ask a climate scientist what the risk of climate change is, and they say “high”.
You need to be pretty specific to actually get people to say what they really think, in a way that makes the gravity of the situation clear. It’s not information they like volunteering on their own, because they have likely said it at least a dozen times before and don’t want to sound like a propagandist, or just don’t like repeating themselves, or don’t think it’s that interesting compared to their more in-the-weeds work.
And asking for their p(doom) usually does get that information. Even if their answer starts with “I actually don’t like people talking about p(doom)..”

williawa 28 Jun 2026 1:05 UTC
1 point
0
in reply to: BryceStansfield’s comment on: Neuralese is Actually Probably Good for Alignment
You are right. I meant that you can’t backprop through it even with an RNN.
You can differentiate it. That’s what you do with LLMs and RL similar to the policy gradient I described above.

williawa 27 Jun 2026 21:47 UTC
13 points
0
on: Neuralese is Actually Probably Good for Alignment
I think there is some significance to what you’re saying. But there are also several things you say that I think don’t make sense.
Except that if the chain of thought is neuralese, then we actually can differentiate it!
Firstly, this isn’t entirely true. A big part of RLVR is that the model can learn to use more thinking to solve harder problems. And this is not differentiable, because the decision to think for “one more step” is inherently discrete, whether that extra step means outputting an extra token or doing an extra recurrent forward pass in an RNN-like model.
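Second, token-based CoT is in a sense still differentiable. It’s true that it cannot be backpropagated through, but if you look at e.g. the GRPO objective (I removed the clipping and KL-divergence terms and consider one prompt $q$, for clarity):
$$J(\theta) = \mathbb{E}_{o_1,\dots,o_G \sim \pi_{\theta_\text{old}}(\cdot \mid q)}\left[\frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_\text{old}}(o_{i,t} \mid q, o_{i,<t})}\,\hat{A}_i\right]$$
Now, you basically sample trajectories for a prompt, compute rewards, compute advantages, compute likelihood ratios, put that in here, take derivatives, and you get an unbiased estimate of $\nabla_\theta\,\mathbb{E}_{o \sim \pi_\theta}[R(o)]$ (modulo the std normalization in the advantages, the length normalization, and the mean in the advantages, but this is not that important).
But $J$ is constructed so that, while $J(\theta) \neq \mathbb{E}_{o \sim \pi_\theta}[R(o)]$, we do have
$$\nabla_\theta J(\theta)\big|_{\theta=\theta_\text{old}} = \nabla_\theta\,\mathbb{E}_{o \sim \pi_\theta}[R(o)]\big|_{\theta=\theta_\text{old}}$$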
So if we take the reward to be the probability assigned to your SFT completion, we are optimizing the same objective, and are in a sense differentiating the chain of thought. The main difference is sample efficiency. Which is a big difference to be clear, and makes some techniques feasible that otherwise wouldn’t be. But it’s not a clear-cut categorical difference.
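A small numerical check of the claim that a ratio-weighted surrogate and the expected reward have the same gradient at the sampling policy, even though their values differ. This is a sketch under simplifying assumptions of my own: a toy policy over 2-token binary sequences, made-up rewards, exact enumeration instead of sampling, a mean baseline, and no length or std normalization. It is not GRPO itself.

```python
import math
from itertools import product

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def token_probs(theta, seq):
    """Per-token probabilities of a 2-token binary sequence.
    theta[0] is the logit for the first token being 1;
    theta[1 + first] is the logit for the second token being 1 given the first."""
    first, second = seq
    p1 = sigmoid(theta[0])
    p2 = sigmoid(theta[1 + first])
    return [p1 if first else 1 - p1, p2 if second else 1 - p2]

SEQS = list(product([0, 1], repeat=2))
REWARD = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.2, (1, 1): 0.7}  # arbitrary toy rewards

def seq_prob(theta, seq):
    a, b = token_probs(theta, seq)
    return a * b

def expected_reward(theta):
    return sum(seq_prob(theta, s) * REWARD[s] for s in SEQS)

def surrogate(theta, theta_old):
    """Token-level ratios times a sequence-level advantage, averaged under theta_old."""
    baseline = expected_reward(theta_old)
    total = 0.0
    for s in SEQS:
        new, old = token_probs(theta, s), token_probs(theta_old, s)
        ratio_sum = sum(n / o for n, o in zip(new, old))
        total += seq_prob(theta_old, s) * ratio_sum * (REWARD[s] - baseline)
    return total

def grad(f, x, eps=1e-6):
    """Central finite-difference gradient of f at x."""
    g = []
    for i in range(len(x)):
        hi = x[:i] + [x[i] + eps] + x[i + 1:]
        lo = x[:i] + [x[i] - eps] + x[i + 1:]
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

theta_old = [0.3, -0.4, 0.8]
g_true = grad(expected_reward, theta_old)
g_surr = grad(lambda th: surrogate(th, theta_old), theta_old)

print("values:   ", expected_reward(theta_old), surrogate(theta_old, theta_old))
print("gradients:", g_true, g_surr)
```

The two printed values disagree while the two gradients agree up to finite-difference error. The gradient comes out of likelihood ratios rather than backprop through the sampled tokens, which is the sense of “differentiable” used above.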
Lastly, and this is somewhat controversial I think: the worry with all gradient-based alignment approaches is that they give you a way to make the output look the way you want on the training distribution, without giving you any guarantees about the underlying generating process being what you want.
CoT-based alignment techniques give you a limited way around this. The CoT is part of the generating process. Most of the computation is not there, but it’s an important bottleneck.
CoT-based alignment like OpenAI’s Deliberative Alignment does a tiny bit of process supervision on the CoT. This intervenes on the way the model produces the output. This is good: it has a better chance of creating non-scheming agents.
It of course risks creating obfuscated or deceptive chain of thought. You’re simultaneously training for generating processes that look good and that are good. But that is better than neuralese where you have no insight into the generating process to burn in the first place.