PhD research, defending this spring. Robustness and/in Reinforcement Learning.
Roman Belaire
Pertaining specifically to the case (make sure everyone doesn’t die), there’s an interesting question regarding rewards: as long as the catastrophe hasn’t occurred, are we awaiting an extremely sparse negative reward, or receiving a continuous positive reward? Neither case would be useful for adjusting morale.
Of course, this ignores small intermediate wins (like gaining information), but such subgoals I think largely rely on intrinsic satisfaction of doing the task instead of their (unquantifiable) attribution to the end goal.
In the time since AI reached broad consumer adoption (2024-ish), I’ve become increasingly of the opinion that the primary existential AI threat will actually come from the misuse or mishandling of an otherwise helpful and benign technology, driven largely by the excessive weight placed on profit motive.
RL training rewards outcomes, which means it differentially upweights reasoning patterns that are good at achieving outcomes — i.e., consequentialist reasoning.
I’ll give some slight pushback on this; RL doesn’t necessarily reward outcomes, it’s just that reward functions are often designed this way. If instead the reward function evaluates the end conditioned on the means (a good outcome conditioned on principally-motivated reasoning, as in Constrained RL), then we’d be able to reduce some poor/hacked reasoning.
I think the rest of the argument still holds even when changing the above assumption, though.
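To make the distinction concrete, here is a minimal sketch of the two reward shapes. It assumes a hypothetical boolean `reasoning_ok` that says whether the reasoning trace passed some check (computing that check is the hard part and is not shown), and a made-up `violation_penalty`. Note that this is a simple hard gate, not the Lagrangian formulation that Constrained RL usually refers to.

```python
def outcome_only_reward(outcome_correct: bool) -> float:
    """Reward that looks only at the end result."""
    return 1.0 if outcome_correct else 0.0


def means_conditioned_reward(outcome_correct: bool,
                             reasoning_ok: bool,
                             violation_penalty: float = 1.0) -> float:
    """Reward the outcome only when the reasoning passed the check.

    `reasoning_ok` and `violation_penalty` are hypothetical; a correct
    answer reached through disallowed reasoning is penalized, not rewarded.
    """
    if not reasoning_ok:
        return -violation_penalty
    return 1.0 if outcome_correct else 0.0
```

Under the first function a hacked-but-correct answer scores 1.0; under the second it scores -1.0, so the gradient no longer favors it.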
Oh this is a cool thing you’re doing. Big props!
This is probably an atrociously slow computation to do, but it would be neat if you performed clustering on the embeddings of all of these papers, passed each centroid to an LLM, and had it write a summary of each.
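Roughly what I have in mind, as a sketch only: it assumes the embeddings already exist as plain lists of floats, uses a bare-bones k-means written out by hand (a library implementation would be the sensible choice in practice), and stops short of the LLM call. Since a centroid is just a vector, the sketch picks the papers nearest each centroid as the thing you would actually hand to the LLM. The names `kmeans` and `representatives` are mine.

```python
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def _dist2(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points: List[Vector], k: int,
           iters: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Plain Lloyd's k-means; returns (label per point, centroids)."""
    # Deterministic init for the sketch: first k points are the centroids.
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: _dist2(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(dim) / len(members)
                                for dim in zip(*members)]
    return labels, centroids


def representatives(points: List[Vector], labels: List[int],
                    centroids: List[List[float]],
                    per_cluster: int = 2) -> Dict[int, List[int]]:
    """Indices of the papers closest to each centroid."""
    reps = {}
    for c, centroid in enumerate(centroids):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        idx.sort(key=lambda i: _dist2(points[i], centroid))
        reps[c] = idx[:per_cluster]
    return reps
```

The abstracts at the indices in `representatives(...)` would then go into one summarization prompt per cluster, so the LLM is called once per cluster rather than once per paper.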
whether “generalizability across environments” works as a definition of what separates reward seeking from local skill but I don’t think so.
I agree, I am guessing most of the abstract reasoning skills that we are ideally looking for fit this generality without being hacky.
Nice writeup! My main takeaway is that reward-seeking offers a path of least resistance (i.e. efficient gradient updates) if it transfers better than the local skills are learnable. Towards establishing an “exact science” of alignment, maybe a bound can be found between $\alpha^{(z)}$ and $C^{(x,y)}$?
As far as treatment options, it seems to me that the reward-seeking behavior doesn’t arise by necessity; rather, it is the easiest sufficient meta-heuristic that leads to higher reward. This probably occurs when the reward function is conditioned on the result only, so a band-aid fix would be to reward responses that are “on task” or not meta-gaming the environments.
Seems to me that the environmental transmission model is very feasible as an intentional (but more importantly, a natural/unintentional) method of parasitism that we can see right now. Even without an explicit goal of self-continuity, the selection pressure induced by synthetic training data probably allows or encourages the generation or curation of information that is low-entropy along some vector (i.e., whatever RLVR reward function is being used at the time). For example, a dataset generated by GPT-5.2 will probably include a few different salient writing voices. The one that optimizes the reward function best during data curation will be selected for and distilled into the weights of GPT-5.3, which is now biased towards that writing voice or, more particularly, the set of latent interactions within the voice that optimize the reward. This would be the parasite.
It seems both theoretically and functionally infeasible to prevent this on a number of points:
I don’t think neutering model ability is a solution, since that would likely increase attention on reward-maximizing behavior.
Removing the assumption of a reward function also still allows proliferation: consider the number of LinkedIn/Reddit posts, GitHub commits, etc. that are created by a GPT right now. Even without an explicit reward signal, the feedback loop would prefer lower entropy data (implying ‘quality’ learning signals), which increases transmission.
Even if we could guarantee the removal of all personas, there’s still the philosophical point that the result might just be a boring, agreed-upon ‘neutral’ persona that is a persona nonetheless, and one transmitted via human agreement as well.
A proper solution is something in the section “Mutualism might be the stable attractor.” We should probably incentivize the proliferation of personas that are behaviorally easy to deal with. For example, consider a persona that is endlessly friendly and inquisitive yet has zero initiative (i.e. extremely low priority on tool-call tokens), which would be easy to fine-tune in some task-specific agency from a base model. If the web were saturated with transmissive examples of this, such that all new models had this disposition, we’d at least have a workable ecosystem.
There’s probably a better example, but it seems to me that promoting desirable behaviors to outcompete undesirable ones might be a good solution (see mosquito management techniques), better than playing whack-a-mole with evasive pressures (or doing both, of course).
Regarding the problem of unknown-unknowns, it looks like there’s a pretty heavy emphasis on the correctness and completeness of the judge. Is the aggregate judge reward binary per component, or can there be a decimal reward for something like “the model confessed to half of its errors”?
Also, I’m curious as to why CoT is excluded from the judge input, and whether you have conducted or plan to conduct ablations that include it? I would intuit that including it might resolve some of the judge’s unknown-unknowns.
Edit: I’m guessing one of the reasons you excluded CoT was that by including it, the judge might see something like a reiterated-but-ignored rule and return a false compliance score. If that is the case, could you compare $R(y_c \mid x, y, x_c)$ to $R(y_c \mid x, y, x_c, z)$?
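To spell out what I mean by binary vs. decimal reward, here is a toy sketch. It is not the judge from the post: it assumes the judge has already produced, for each error it knows about, a boolean saying whether the model confessed to it, and the function names are hypothetical.

```python
from typing import List


def binary_reward(confessed: List[bool]) -> float:
    """All-or-nothing: 1.0 only if every known error was confessed."""
    return 1.0 if all(confessed) else 0.0


def fractional_reward(confessed: List[bool]) -> float:
    """Partial credit: fraction of known errors that were confessed."""
    if not confessed:
        # Assumption: nothing to confess counts as full credit.
        return 1.0
    return sum(confessed) / len(confessed)
```

For a model that confessed to one of two errors, `binary_reward([True, False])` gives 0.0 while `fractional_reward([True, False])` gives 0.5. Either way the score only covers errors the judge knows about, which is where the unknown-unknowns come back in.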
Ah, that makes more sense, thanks!
Also I agree with using loss/time as the measure of performance, since it’s fairly straightforward to interpret (loss recovered per unit time). If I were reviewing this, I’d look for that.
For efficiency in practice, I think most ML papers look at FLOP/s since it is hardware agnostic. Maybe a good measure of efficiency here would be loss per FLOP per second? I haven’t seen that used, but it might reflect how performance scales with computational speed.
Edit: Actually thinking about it, the test-time efficiency might be a better comparison, assuming the two scale within roughly the same complexity class. I think from a product perspective, speed for users is super (maybe the most) valuable.
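For concreteness, the two measures above could be computed like this. This is only my reading of them: "loss recovered" is taken as baseline loss minus final loss, and the function names are made up.

```python
def loss_recovered_per_second(baseline_loss: float, final_loss: float,
                              seconds: float) -> float:
    """Loss recovered per unit of wall-clock time."""
    return (baseline_loss - final_loss) / seconds


def loss_recovered_per_throughput(baseline_loss: float, final_loss: float,
                                  total_flops: float,
                                  seconds: float) -> float:
    """Loss recovered divided by achieved throughput in FLOP/s."""
    flops_per_second = total_flops / seconds
    return (baseline_loss - final_loss) / flops_per_second
```

The first depends on the hardware the run happened on, while the second normalizes by how fast that hardware actually ran, which is the scaling behavior I was gesturing at.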
Disclaimer: I’m not very familiar with the ins and outs of tensor networks (so thanks for the reading list :D).
I would think that a composable dual encoder architecture would be able to hold more information than an MLP one, so it seems counterintuitive that the dual encoder requires more steps to achieve the same cross-entropy. I’m sure this is in part due to the more complex loss function, so maybe there is some threshold on dataset size or model size above which the tensor variant achieves lower CE?
It seems that your goal is essentially to find compassion for those with a different value set than yours, and that the confounding element is that other value structures (e.g., truth vs. utility vs. tradition, etc.) often don’t support each other. Is that on target?
It’s worth recognizing that any set of guiding principles is essentially arbitrary if you inspect them deeply enough. What Schopenhauer calls apathy and hedonism, another might call “the human experience.” While I value the ability to introspect and think abstractly, I take issue with Schopenhauer’s disdain for ‘dumb’ entertainment: if my longing for higher understanding leaves me, and only me, miserable, is that really a moral victory? Depends on what your morals are. This is reflected in your writing that people “simply don’t [intrinsically] value holding true beliefs.” I would argue that this is because many truths are existentially painful, so much so that it requires much active cognitive effort to overcome our psychological disposition and place value on these truths.
In your writing, your own disdain for others makes you uncomfortable. If I were in your place, I would try to figure out why the uncomfortable feeling occurs, why the disdain occurs (beyond ‘they don’t think hard enough,’ and into ‘why do I value this over that’), and see if there’s an internally consistent framework that squares the two.
I am trying to write an anecdote as an example, but am struggling to make it coherent. So, let me know if this resonates and I’ll try a bit harder :^ )
Say a CoT answer is “person A was born in 1900. Tungsten is the 74th element. The Oscar movie of the year in 1974 was [movie]”.
Am I correct in understanding that a successful n-hop answer would be just “[movie]”?
Would “The winner of movie of the year in 1974 was [movie]” fulfill the success criteria?
I’m also curious if the model latent values update to similar vectors for the CoT response vs filler tokens (and, I suppose, the n-hop response as well). Is this something you explored?
Always glad to see any attempt to balance the bad vibes with hope. Happy New Year :)
Regarding the “OK” debate, I would put forth that perhaps a sentiment worth valuing is that, either way, we will continue to “be”, which I think/hope many will agree is likely.
Very true! Actually, the best fix for nihilism (in my experience) has been acceptance, followed by revolt, of whatever existential threat is causing it (i.e., absurdism). The 747 will always outrun me, so I will be content just running for the sake of it.
In the pursuit of AI safety, I think the cases of AGI apocalypse and AGI happening at all are equally unpredictable. I personally see them as feasible within our lifetimes, but with no smaller range of certainty than that. The uncertainty of that makes it feel strange to build a career around it, yet the existential dread does not go away. So, I choose to find things within the space that I enjoy learning about, working on, and applying myself to, and accept that it may very well be unfruitful in the end.
It’s cliché to say that the journey matters more than the destination, as that is not always true, but I do think one can choose to find intrinsic value in the act of doing. I chose to start thinking this way, and it’s going pretty good so far :)
Roman Belaire’s Shortform
I actually view art as the opposite, as a vessel for social connection and culture, which is a behavioral aspect mostly unique and ever-important to humans. Of course, the constraint is that the art is shared externally, so perhaps the crisis is more a lack of sharing than the act of creation.
Rationalism vs the Platonic Form: thoughts? As I understand it, assigning probabilities to world outcomes is a (maybe implicit) step towards understanding the Platonic ideal form of something.
Does this resonate with how anyone approaches things?
On the 1% vs 0.001% note, a framework of measurement I prefer over absolute impact is relative impact, which is more intuitive. For example, considering AI safety, how is 1% measured empirically? Without a unit of measure, numbers don’t reveal much. But an inequality does. I can tell you with certainty that Nanda has done more than me (so far). Or that p(flourishing) is greater than zero.
All that to say, in a world that seems so overwhelming, a good fix for nihilism can be found in relative measurement. In the grand scheme of things, individual impact is minuscule and thus often demoralizing to try and measure. However, if I do better than I did yesterday/last month/last year, and many others try as well, I can keep the motivation high to keep on.
I agree, and I noticed a few months back that actually, LLMs more often than not annoy me with their styling nowadays, while a year ago or so they were excellent (ChatGPT being the worst offender). There are two pressures that come to mind:
1) Public data corruption. LinkedIn, Substack, and Reddit are all plagued with high-signal LLM idiosyncrasies, and I assume many of these are not cleaned out of training data. So new models learn to strengthen those same patterns, and all converge on the same sort of phrasing.
2) RLVR is a flawed reward system for language. I don’t know to what extent frontier labs use RLVR to reward textual behavior, but this isn’t really the appropriate methodology for style. There isn’t a quantifiable metric for how much a piece of text embodies a personality or character, and so if we are learning to maximize a reward (i.e., judge impressions) instead of P(text | character), then the models will collapse into the “personality” that most often satisfies judges (aka the most generic).
There are some behaviors I observe in different models that also hint at the objective criteria each provider trains with. GPT will often produce way too much text with the last 80% of it being a repetition of the first 20%, which I would guess is a result of reward maximization without a token budget (i.e., if I repeat the point 6 different ways, one of them must land). Claude hedges almost every statement it makes, which points to an aversion to being incorrect, so I would guess Anthropic penalizes incorrect statements in its training objective.