sl6928@princeton.edu
Axiomatic Decision Theory
Shuo Li
Nice paper! I’m curious how the safety case handles is monitor staleness. The paper seems to rely on monitors’ ability being close to the generator. But in real deployment we might have
monitoring . Here is a later checkpoint / fine-tune / post-training step.I guess this is less about honeypot slack. It is more about whether the elicited ROC curves are stable under version gap. If those curves degrade quickly, the advantage of untrusted monitoring might only hold in an almost stationary case.
Has anyone tried the obvious eval here: freeze UM at checkpoint i, continue improving UG to checkpoint j>i, and then measure how collusion and suspiciousness ROC shift with j-i? If that degrades fast, it seems like monitor freshness needs to be added as a deployment parameter in the safety case.
This is conceptually spot on! I want add that, the formal situation for Savage’s Theorem is actually worse than not guaranteeing the dominance requirement. My collegue and I just published a decision theoretic proof that Savage’s axioms forces the agent into strict dominance failures.
that is strictly dominated by across an entire positive-probability event, yet Savage’s axioms mathematically mandate that . So not only does the desire to avoid dominated strategies fail to entail EU (as you argued), but successfully achieving EU optimization via Savage’s axioms implies that agents are blind to strict dominance.
Specifically, for every Savage EU satisfying the Savage’s axioms, under the axiom of constructibility, you can construct an act
Moreover, we show that a simple Dutch Book argument is avaible for the Savage’s EU.
A detailed discussion is pending LW approval: https://www.lesswrong.com/posts/8ppB4ixfoKdGDqeHf/savage-s-axioms-make-dominated-acts-eu-maxima-9
The full proof can be found here: https://drive.google.com/file/d/15BHxSUR93bN5DQ5spU971gEpGGq6PHRc/view