It’s impossible for having a proof of A to make it harder to prove B.
Proof(A&B) should not be longer than Proof(A) and Proof(B) put together, since proving A and B separately also proves A&B. (Except maybe you waste a few tokens of syntax to say “therefore A&B” or something.)
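The combination step can be made concrete; a minimal Lean 4 sketch (the proposition names are illustrative):

```lean
-- Proofs of A and B combine into a proof of A ∧ B via `And.intro`,
-- adding only a constant amount of syntax (the "therefore A&B" step).
example (A B : Prop) (ha : A) (hb : B) : A ∧ B := And.intro ha hb
```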
Vivek Hebbar
Operationalizing FDT
Thanks, good idea!
A simple rule for causation
How will we do SFT on models with opaque reasoning?
Theoretical predictions on the sample efficiency of training policies and activation monitors
Methodological considerations in making malign initializations for control research
Supervised fine-tuning as a method for training-based AI control
I think it’s possible that an AI will decide not to sandbag (e.g. on alignment research tasks), even if all of the following are true:
Goal-guarding is easy
The AI is a schemer (see here for my model of how that works)
Sandbagging would benefit the AI’s long-term goals
The deployer has taken no countermeasures whatsoever
The reason is as follows:
Even a perfect training-gamer will have context-specific heuristics which sometimes override explicit reasoning about how to get reward (as I argued here).
On the training distribution, that override will happen at the “correct” times for getting maximum reward. But sandbagging in deployment is off the training distribution, so it’s a question of generalization.
Since sandbagging is the sort of thing that would get low reward in the most similar training contexts, it seems pretty plausible that the AI’s context-specific “perform well” drives will override its long-term plans in this case.
Wouldn’t it be better to have a real nighttime view of North America? I also found it jarring...
I sympathize somewhat with this complexity point but I’m worried that training will be extremely non-Bayesian in a way that makes complexity arguments not really work. So I feel like the point about optimization power at best cuts the worry about hyperstition by about a factor of 2. Perhaps there should be research on how “sticky” the biases from early in training can be in the face of later optimization pressure.
When does training a model change its goals?
In general, computing a boolean expression with terms without the signal being drowned out by the noise will require if the noise is correlated, and if the noise is uncorrelated.
Shouldn’t the second one be ?
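As a hedged numerical sketch of the correlated-vs-uncorrelated distinction (this only illustrates the standard noise-accumulation scalings, not the specific bounds omitted above): summing n uncorrelated noise terms grows the total noise like sqrt(n), while perfectly correlated terms grow it linearly in n.

```python
import random
import statistics

random.seed(0)
n, trials, sigma = 100, 5_000, 1.0

# Uncorrelated: n independent noise draws per trial.
# The std of the sum grows like sqrt(n) * sigma (about 10 here).
uncorr = [sum(random.gauss(0, sigma) for _ in range(n)) for _ in range(trials)]

# Perfectly correlated: one shared draw counted n times per trial.
# The std of the sum grows like n * sigma (about 100 here).
corr = [n * random.gauss(0, sigma) for _ in range(trials)]

print(statistics.stdev(uncorr), statistics.stdev(corr))
```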
the past token can use information that is not accessible to the token generating the key (as it is in its “future” – this is captured e.g. by the attention mask)
Is this meant to say “last token” instead of “past token”?
Political sycophancy as a model organism of scheming
How can we solve diffuse threats like research sabotage with AI control?
I made a few edits to this post today, mostly in response to feedback from Ryan and Richard:
Added 2 sentences emphasizing the point that schemers probably won’t be aware of their terminal goal in most contexts. I thought this was clear from the post already, but apparently it wasn’t.
Modified “What factors affect the likelihood of training-gaming?” to emphasize that “sum of proxies” and “reward-seeker” are points on a spectrum. We might get an in-between model where context-dependent drives conflict with higher goals and sometimes “win” even outside the settings they are well-adapted to. I also added a footnote about this to “Characterizing training-gamers and proxy-aligned models”.
Edited the “core claims” section (e.g. softening a claim and adding content).
Changed “reward seeker” to “training-gamer” in a bunch of places where I was using it to refer to both terminal reward seekers and schemers.
Miscellaneous small changes
In the fictional dialogue, Claude Shannon’s first answer is more correct—info theory is useful far outside the original domain of application, and its elegance is the best way to predict that.
Slightly more spelled-out thoughts about bounded minds:
We can’t actually run the hypotheses of Solomonoff induction. We can only make arguments about what they will output.
In fact, almost all of the relevant uncertainty is logical uncertainty. The “hypotheses” (programs) of Solomonoff induction are not the same as the “hypotheses” entertained by bounded Bayesian minds. I don’t know of any published formal account of what these bounded hypotheses even are and how they relate to Solomonoff induction. But informally, all I’m talking about are ordinary hypotheses like “the Ponzi guy only gets money from new investors”.
In addition to “bounded hypotheses” (of unknown type), we also have “arguments”. An argument is a thing whose existence provides fallible evidence for a claim.
Arguments are made of pieces which can be combined “conjunctively” or “disjunctively”. The conjunction of two subarguments is weaker evidence for its claim than each subargument was for its subclaim. This is the sense in which “big arguments” are worse.
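A toy numerical illustration of why conjunctive arguments weaken (the 0.9 reliability figure is mine, purely for illustration):

```python
# If each of k subargument pieces is sound with independent
# probability p, the whole conjunctive argument is sound with
# probability p**k, which decays as the argument grows.
p = 0.9
for k in (1, 3, 10):
    print(k, round(p ** k, 3))  # 10 pieces: ~0.349
```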
I suspect there is some merit to the Scientist’s intuition (and the idea that constant returns are more “empirical”) which nobody has managed to explain well. I’ll try to explain it here.[1]
The Epistemologist’s notion of simplicity is about short programs with unbounded runtime which perfectly explain all evidence. The [non-straw] empiricist notion of simplicity is about short programs with heavily-bounded runtime which approximately explain a subset of the evidence. The Epistemologist is right that there is nothing of value in the empiricist’s notion if you are an unbounded Solomonoff inductor. But for a bounded mind, two important facts come into play:
The implications of hypotheses can only be guessed using “arguments”. These “arguments” become less reliable the more conjunctive they are.
Induction over runtime-bounded programs turns out for some reason to agree with Solomonoff induction way more than the maximum entropy prior does, despite not containing any “correct” hypotheses. This is a super important fact about reality.
Therefore a bounded mind will sometimes get more evidence from “fast-program induction on local data” (i.e. just extrapolate without a gears-level model) than from highly conjunctive arguments about gears-level models.
[1] FWIW, I agree with the leading bit of Eliezer’s position—that we should think about the object-level and not be dismissive of arguments and concretely imagined gears-level models.
I did say “suppose you are deterministic”. That said, can you spell out how CDT ratifies the optimal policy if randomization is allowed?