Dylan Xu

Karma: 156

Dylan Xu 20 Apr 2026 22:13 UTC
3 points
0
in reply to: nostalgebraist’s comment on: How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

if I’m reading correct some of the data in there is RL, so I dunno.

Yeah, it’s a mishmash of all the runs, including on-policy and RL runs.

So, I can imagine an interpretation of your (non-RL) results that’s like, “off-policy training after RL typically degrades instruction-following, and when you observe transfer, that’s just a particular manifestation of this more general trend.” Do you have a sense of how plausible this explanation is?

I wouldn’t expect this to be true for LR=5e-5 (where, of course, we see many instances of transfer), at least based on our IFEval box plot. I do agree that some of the high-LR runs are worse than ideal, though. If it’s helpful, I think I’ll add a graphic showing us trying to train back in IFEval capabilities while preserving the alternate behavior rate, similar to Olympiads (it’s pretty low-cost to do). I think the capability degradation box plot clearly shows that transfer is quite doable without the hypothesis you mention, however.

How do LLMs generalize when we do training that is intuitively compatible with two off-distribution behaviors?

Dylan Xu, Alek Westover, Vivek Hebbar, Sebastian Prasanna, frisby, Buck and Julian Stastny

20 Apr 2026 16:58 UTC

61 points

4 comments, 19 min read

Model organisms researchers should check whether high LRs defeat their model organisms

Dylan Xu, Sebastian Prasanna, Alek Westover, Vivek Hebbar and Julian Stastny

10 Apr 2026 0:07 UTC

40 points

0 comments, 5 min read

Dylan Xu 31 Jan 2026 20:49 UTC
1 point
0
in reply to: Eli Tyre’s comment on: Eli’s shortform feed

If I knew it was going to happen in 2035 to 2039, I think I would mostly back up and try to improve the overall quality of US governance, and/or work to get competent candidates for the 2034 presidential election.

Do you mean the quality of US governance with respect to AI or in general? In the latter case, I’m curious what your concrete plans would be, since this is usually considered a difficult and not very neglected (but still very important!) area.

Dylan Xu 11 Jan 2026 15:14 UTC
2 points
0
on: If AI alignment is only as hard as building the steam engine, then we likely still die
I think when Olah says that solving alignment may be as “easy” as the steam engine, he’s basically envisioning current training + eval techniques (or similar techniques of difficulty comparable to the steam engine) scaling all the way to superintelligence. (This is my interpretation; I might be wrong here.) For instance, maybe inducing corrigibility in ASI turns out to be not that difficult, such that the “first critical try” framework does not really apply, and takeoff is slow enough that work on model organisms and evals lets us test our alignment methods and have them generalize reasonably well to real-world scenarios. Disagreeing with this view just means that “alignment” is harder than the steam engine scenario.

Dylan Xu 16 Oct 2025 22:36 UTC
2 points
1
on: It will cost you nothing to “bribe” a Utilitarian
I realize I’m being a little pedantic here, but on the “joke” calculation: the problem here is that $P R$ is a binary function depending on whether $k$ utilitarians join or not, right? For instance, let $- s_{P R}$ be the effective safety premium from $k$ safety-minded utilitarians joining (the value being negative as joining presumably accelerates the company), and suppose that each utilitarian joining leads to $- s_{P R} / k$ acceleration. Then a rational utilitarian would demand $(ϵ + s_{P R}) / k$ premium, which is not negligible.
Going back to the joke calculation, it implies that the bottleneck to preventing defection is coordination: $k$ utilitarians acting together would not join for $ϵ / k$ value as it is against their interests, but individually they have zero counterfactual impact, so they all join. In the real world, coordination is plausibly relevant, but it seems like the more relevant problem is just the object-level beliefs of those joining: how impactful is your work versus how much it accelerates capabilities, along with vague non-utilitarian desires like status that bias people towards joining the big labs.

Dylan Xu 16 Mar 2025 18:55 UTC
1 point
0
in reply to: plex’s comment on: I make several million dollars per year and have hundreds of thousands of followers—what is the straightest line path to utilizing these resources to reduce existential-level AI threats?

I recommend https://agentfoundations.study/, and much of https://www.aisafety.com/stay-informed,

Currently these two links include the commas so they redirect to 404 pages

Dylan Xu 17 Feb 2025 3:48 UTC
8 points
4
in reply to: JBlack’s comment on: dx26′s Shortform
We (or at least a majority of humans) do still have inner desires to have kids, though; they just get balanced out by other considerations, mostly creature comforts/not wanting to deal with the hassle of kids. But yeah, evolution did not foresee birth control, so that’s a substantial misgeneralization.

We are still a very successful species overall according to IGF, but birth rates continue to decline, which is why I made my last point about inner alignment possibly drifting farther and farther away the stronger the inner optimizer (e.g. human culture) becomes.

dx26′s Shortform

Dylan Xu 16 Feb 2025 21:31 UTC

2 points

3 comments, 1 min read

Dylan Xu 16 Feb 2025 21:31 UTC
4 points
3
on: dx26′s Shortform
I saw that Katja Grace has said something similar here; I’m just putting my own spin on the idea.

The relevance of the evolutionary analogy for inner alignment has long been discussed in this community, but one observation that seems to go unmentioned is that humans are still… pretty good at inclusive genetic fitness? Even in way-out-of-distribution environments like modern society, we still have strong desires to eat food, stay alive, find mates and reproduce (although the last one has relatively decreased recently; IGF hasn’t totally generalized). We don’t monomaniacally optimize for IGF, but we (and probably future NN-based AIs) don’t monomaniacally optimize for anything, and our motivational circuits still do a pretty good job at keeping our species alive. So… why should we expect future AIs to catastrophically fail (i.e. be completely non-inner aligned with what we wanted them to do) at doing the actions we rewarded in RL training, which should be a much stronger outer optimizer than evolution?

Some possible objections:
- “Human values are more fragile than IGF, so it’s much easier to catastrophically fail on human values”
  - Is this true? Is it really easier to misgeneralize on human values than on IGF? Maybe, but we have a lot of animal skulls on the road that say otherwise
  - More relevantly, modern LLMs have already learned human values pretty well, so the difficulty of enacting said values shouldn’t matter as much if the concepts already exist in the weights (I’m less sure about this)
- “Optimizing a generally intelligent, situationally aware agent presents unique challenges compared to evolution because of scheming, gradient hacking, wireheading, etc.”
  - Sure! This definitely seems like a problem. However, by the time the AI gains the capabilities needed for scheming, its inner alignment may have to be absolutely terrible for catastrophic effects to occur once out of training, as otherwise we end up in the “mostly fine” state that evolution stumbled into with humans.
- “AIs could drift off over time in the same way that humans seem to be currently with evolution”
  - Yep, this also seems like a problem. Hopefully general capabilities allow a value-aligned AI to strategically preserve its values over time. We could also continually optimize our AIs; hopefully gradient descent never becomes ~billions of times weaker than an inner optimizer, the way evolution is relative to human culture.

Dylan Xu 16 Feb 2025 20:36 UTC
2 points
1
on: Can someone, anyone, make superintelligence a more concrete concept?
The thing is, there exist lots of popular movies about rogue AIs taking over the world (2001, Terminator, etc.), so the concept should already exist in popular culture. The roadblocks seem to be:
1. The threat somehow doesn’t seem as tangible or threatening as, for example, ISIS developing a bioweapon or the CCP permanently dominating the world. One explanation is that the reference class for “enemy does bad things with new technology” or other near-term threat models contains lots of examples throughout history, whereas “species smarter than humans” contains none. Related:
2. The threat doesn’t seem realistic, i.e. people (even those who want to accelerate towards AGI) have long timelines. Hypothetically, if you truly “feel the AGI” and understand that we’re close to building something smarter than us in every way, the idea that we should make sure it does what we want should be intuitive. I don’t know if making people “feel the AGI” is a smart PR strategy, but nevertheless this does seem to still be a barrier to the public taking AGI risk seriously.

Dylan Xu 28 Dec 2024 8:51 UTC
4 points
1
in reply to: Viliam’s comment on: If all trade is voluntary, then what is “exploitation?”
In this case, the starving person presumably has to press the button or else starve to death, and thus has no bargaining power. The other person only has to offer the bare minimum beyond what the starving person needs to survive, and the starving person must take the deal. In Econ 101 (assuming away monopolies, information asymmetry, etc.), exploited workers do have bargaining power because they can work for other companies, which is why companies can’t just take stupid, spiteful actions in the long term.

Dylan Xu 3 May 2024 17:38 UTC
3 points
2
on: Coherence of Caches and Agents
It might be relevant to note that the meaningfulness of this coherence definition depends on the chosen environment. For instance, in a deterministic forest MDP where an agent at a state $s$ can never return to $s$ for any $s$ and there is only one path between any two states, suppose we have a deterministic policy $π$ and let $s_{1} = π (s)$, $s_{2} = π (s_{1})$, etc. Then for the zero-current-payoff Bellman equations, we only need that $V (s_{1}) > V (s^{'})$ for any other successor $s^{'}$ of $s$, $V (s_{2}) > V (s^{''})$ for any other successor $s^{''}$ of $s_{1}$, etc. We can achieve this easily by, for example, letting all values except $V (s_{i})$ be near-zero; since $s_{j}$ is a successor of $s_{i}$ iff $j = i + 1$ (as otherwise there would be a cycle), this fits our criterion. Thus, every $π$ is coherent in this environment. (I haven’t done the explicit math here, but I suspect that this also works for non-deterministic $π$ and non-stochastic MDPs.)
Importantly, using the common definition of language models in an RL setting where each state represents a sequence of tokens and each action adds a token to the end of a sequence of length $t$ to produce a sequence of length $t + 1$, the environment is a deterministic forest, as there is only one way to “go between” two sequences (if one is a prefix of the other, choose the remaining tokens in order). Thus, any language model is coherent, which seems unsatisfying. We could try using a different environment, but this risks losing stochasticity (as the output logits of an LM are determined by its input sequence) and gets complicated pretty quickly (use natural abstractions/world model as states?).
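To make the construction above concrete, here is a minimal sketch under stated assumptions: a toy token-sequence environment with a hypothetical two-token vocabulary and a maximum length of 3. The names `successors`, `values_for_policy`, and `is_greedy` are illustrative and not from the post, and the check covers only the "chosen successor beats every other successor" condition, not a full Bellman solve. For an arbitrary deterministic policy it sets $V = 1$ on the states the policy moves into and $V = 0$ elsewhere, which is well-defined because every sequence has exactly one parent (its prefix).

```python
from itertools import product
import random

VOCAB = ("a", "b")   # hypothetical two-token vocabulary
MAX_LEN = 3          # hypothetical maximum sequence length

def successors(state):
    """States reachable in one step: append one token (none at max length)."""
    if len(state) >= MAX_LEN:
        return []
    return [state + (tok,) for tok in VOCAB]

def all_states():
    """Every token sequence of length 0..MAX_LEN, as tuples."""
    return [seq for n in range(MAX_LEN + 1) for seq in product(VOCAB, repeat=n)]

def random_policy(rng):
    """An arbitrary deterministic policy: one chosen successor per non-terminal state."""
    return {s: rng.choice(successors(s)) for s in all_states() if successors(s)}

def values_for_policy(policy):
    """V = 1 on states the policy moves into, 0 elsewhere.

    Well-defined because each state has exactly one parent (its prefix).
    """
    chosen = set(policy.values())
    return {s: (1.0 if s in chosen else 0.0) for s in all_states()}

def is_greedy(policy, values):
    """True if, at every state, the chosen successor strictly beats all others."""
    return all(
        values[policy[s]] > values[other]
        for s in policy
        for other in successors(s)
        if other != policy[s]
    )

rng = random.Random(0)
policies = [random_policy(rng) for _ in range(100)]
# Every sampled policy is greedy with respect to its own constructed V.
print(all(is_greedy(p, values_for_policy(p)) for p in policies))
```

Every sampled policy passes the check, whatever it chooses, which is the sense in which this coherence criterion cannot distinguish between policies in a prefix-tree environment.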

Measuring Coherence and Goal-Directedness in RL Policies

Dylan Xu 22 Apr 2024 18:26 UTC

10 points

0 comments, 7 min read

Dylan Xu 19 Mar 2024 1:30 UTC
1 point
0
in reply to: Garrett Baker’s comment on: Measuring Coherence of Policies in Toy Environments
Right, I think this somewhat corresponds to the “how long it takes a policy to reach a stable loop” (the “distance to loop” metric), which we used in our experiments.

What did you use your coherence definition for?

Measuring Coherence of Policies in Toy Environments

Dylan Xu and Richard_Ngo

18 Mar 2024 17:59 UTC

59 points

9 comments, 14 min read

Supervised Program for Alignment Research (SPAR) at UC Berkeley: Spring 2023 summary

mic, Dylan Xu, adamk and Carolyn Qian

19 Aug 2023 2:27 UTC

23 points

2 comments, 6 min read
