AviS

Karma: 4

AviS 11 Jun 2026 10:38 UTC
2 points
0
on: Tracing Eval-Awareness Emergence Through Training of OLMo 3
I think that the explanation for why DPO suppresses VEA despite the chosen set having more VEA than the reject set may be incomplete.
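In a highly simplified setting where we think of the accept and reject data as being independent Bernoulli random variables (e.g., corresponding to VEA) and we learn a single probability parameter $p$, DPO does not push $p$ toward the chosen marginal rate $q_a$. If the accept datapoints are $a_i \sim \mathrm{Bernoulli}(q_a)$ and reject datapoints are $r_i \sim \mathrm{Bernoulli}(q_r)$, then the global minimum $p^*$ of the expected DPO loss satisfies

$$\operatorname{logit}(p^*) = \operatorname{logit}(p_{\mathrm{ref}}) + \frac{1}{\beta}\log\frac{q_a(1-q_r)}{q_r(1-q_a)}$$

where $p_{\mathrm{ref}}$ is the reference value and $\beta$ is the DPO temperature. This shows that if $q_a > q_r$ then the parameter value $p^*$ that minimises the DPO loss is greater than the reference value $p_{\mathrm{ref}}$, even if $q_a$ is less than $p_{\mathrm{ref}}$. (Because $q_a > q_r$ is the condition for the second term above being positive.)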
So if DPO’s chosen responses have more VEA than rejected responses, this suggests that it should increase VEA relative to the reference. (And that the absolute rate of VEA in the DPO data doesn’t matter, unlike what is proposed in the post.) The fact that the released DPO checkpoint has lower VEA seems to require another explanation: maybe the prompt dependence matters, maybe it’s because it’s an entire answer being reinforced and it might have other correlations.
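For anyone who wants to sanity-check the toy argument numerically, here is a minimal sketch. It assumes the single-parameter Bernoulli setting described above; the names `q_a`, `q_r`, `p_ref` and `beta` are labels chosen for this sketch (chosen rate, rejected rate, reference rate, DPO temperature), and the closed form it checks is the logit-shift expression `logit(p_ref) + (logit(q_a) - logit(q_r)) / beta`. It says nothing about the real, prompt-dependent training run.

```python
import math


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def expected_dpo_loss(theta, p_ref, q_a, q_r, beta):
    """Expected DPO loss as a function of theta = logit(p), up to a constant.

    Pairs where chosen and rejected agree contribute a constant and are dropped.
    """
    delta = theta - logit(p_ref)
    w_up = q_a * (1.0 - q_r)    # chosen shows the behaviour, rejected does not
    w_down = (1.0 - q_a) * q_r  # rejected shows the behaviour, chosen does not
    return -(w_up * log_sigmoid(beta * delta)
             + w_down * log_sigmoid(-beta * delta))


def closed_form_minimiser(p_ref, q_a, q_r, beta):
    # Assumed closed form: shift the reference logit by the scaled logit gap.
    return sigmoid(logit(p_ref) + (logit(q_a) - logit(q_r)) / beta)


def numerical_minimiser(p_ref, q_a, q_r, beta, lo=-30.0, hi=30.0, iters=200):
    # Ternary search over theta; the loss is convex in theta.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if (expected_dpo_loss(m1, p_ref, q_a, q_r, beta)
                < expected_dpo_loss(m2, p_ref, q_a, q_r, beta)):
            hi = m2
        else:
            lo = m1
    return sigmoid((lo + hi) / 2.0)


if __name__ == "__main__":
    # Illustrative numbers only: chosen rate below the reference rate,
    # but above the rejected rate.
    p_ref, q_a, q_r, beta = 0.30, 0.10, 0.05, 1.0
    print("closed form:", closed_form_minimiser(p_ref, q_a, q_r, beta))
    print("numerical:  ", numerical_minimiser(p_ref, q_a, q_r, beta))
```

The example values are made up for illustration: the chosen rate sits below the reference rate yet above the rejected rate, which is the case the argument is about.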

AviS 12 Jan 2026 22:31 UTC
3 points
0
in reply to: Fabien Roger’s comment on: Why AIs aren’t power-seeking yet
The first bullet point here is what I see as the most important factor for why current AI doesn’t seek extreme power: they are best thought of not as being intrinsically motivated to complete tasks, but rather as having a reflex to complete contexts in a human-like way.
Maybe RL focusses this reflex and adds some degree of motivation to models, but I doubt this effect is large. My reasoning is that a pretrained model already acts, by default, as if it is pursuing a goal whenever the input context suggests this, so there is little reward/gradient pressure to instill additional goal-pursuing drive.

AviS 18 Sep 2025 15:32 UTC
3 points
2
in reply to: Daniel Kokotajlo’s comment on: Daniel Kokotajlo’s Shortform
It certainly seems like Israel is acting in a manner consistent with taking this concern seriously, seeming intent on ending Hamas’s presence in Gaza, holding a buffer zone in Syria, and weakening Hezbollah.

AviS 1 May 2024 14:16 UTC
1 point
−3
on: Counting arguments provide no evidence for AI doom
A point about counting arguments that I have not seen made elsewhere (although I may have missed it!).
The failure of the counting argument that SGD should result in overfitting is not a valid counterexample! There is a selection bias here: the only reason we are talking about SGD is *because* it is a good learning algorithm that does not overfit. It could well still be true that almost all counting arguments hold for almost all learning algorithms. The fact that SGD generalises well is an exception *by design*.