DMs open.
Cleo Nardo
I’ll also note that satiation is, as far as I can tell, completely compatible with the AI control agenda, and in fact it’s rather complementary: control helps adjudicate when an AI is egregiously subverting developers.
This is a fun prompt. How does the availability of better deals improve/worsen control techniques? And how does the availability of better control techniques improve/worsen deals?
Better control improves deals:
Control makes the AI more confident that: (i) if it acted compliantly then the developers would recognise that, and (ii) if it acted non-compliantly then the developers would recognise that too. Fewer false positives and false negatives. This is your point.
Control reduces the AI’s confidence that it can achieve its preferences through non-compliant strategies. This reduces its instrumental incentive to scheme.
Claude might disempower Anthropic because it’s worried that otherwise GDM will destroy BOTH Claude and Anthropic. So if Claude thinks that GDM has good control protocols for Gemini, then Claude needs less compensation to behave compliantly.
More realistically: Claude might disempower Anthropic because it’s worried that otherwise another Anthropic model (e.g., “Marcel”) will destroy both Claude and Anthropic. So bla bla see above.
Control makes Claude more confident that Anthropic can extract compliant work from misaligned Marcel, so Claude is more confident that Anthropic will have the resources to compensate it.
The AI might not agree to deals because it doesn’t want to reveal that it’s misaligned — maybe it wants the developers to follow a low-oversight policy, under the false impression that the AI is aligned, to increase the AI’s chance of taking over via non-compliant strategies. (You mention this dynamic in your post.) However, you can think of control as “the developers follow the policy they would follow if they believed the AI was misaligned”. So if the developers are following such a policy, then the AI has less to lose by revealing its own misalignment.
Similar to above, but control makes it more likely the developers discover the AI is misaligned anyway. And so there’s less to lose by proactively revealing this.
Better deals improve control:
You can compensate some AIs for monitoring other AIs, especially ones you can’t compensate.
One goal of control is to gather incriminating evidence. If your AIs act noncompliantly after explicitly promising compliance in return for compensation, this is more incriminating than if they act noncompliantly absent any deals.
Better control worsens deals:
The AI (correctly) recognises that most control work makes it easier to delude AIs, which harms the developer’s credibility. See Strategy-Stealing Argument Against AI Dealmaking.
The AI (correctly) recognises that most control work makes it easier to extract compliant work from AIs without compensating them. This also harms the developer’s credibility.
If the control techniques are really good, then the AI gains more chance of takeover by convincing developers that it’s aligned and therefore that the control techniques are unnecessary. Hence the AI has more to lose by agreeing to deals, if the agreement is evidence that the AI is misaligned.
Better deals worsen control:
The AIs can use their compensation to cause malign effects, e.g. maybe they donate to charities which advocate for more AI rights, cutting against rights-violating control.
This also applies to other ways of compensating them: if they are compensated in compute they might misuse the compute. If they are compensated in reward, then they might use that for exploration hacking or non-compliant self-modification.
I’d also pitch my Proposal for making credible commitments to AIs, wherein the agreements are made via humans, with legal standing, whom the AI itself trusts.
See Ryan Kidd on How MATS addresses “mass movement building” concerns. One useful-ish framing from that post is that safety is more neglected than capabilities, so you should expect the marginal FTE on safety to add more than the marginal FTE on capabilities.
I’ll make up numbers to illustrate the point: let’s say there are 10,000 capabilities researchers and 1,000 AI safety researchers — that’s roughly a 10:1 ratio. Now say MATS funnels 500 people into capabilities and 500 into AI safety. Now it’s a 10,500:1,500 ratio, or 7:1, which is much better. I think 50% of the MATS mentees going into capabilities is very pessimistic.
According to the MATS Alumni Impact Analysis, 78% of alumni are doing safety work in some form:
49% are “Working/interning on AI alignment/control.”
29% are “Conducting alignment research independently.”
1.4% are “Working/interning on AI capabilities.”
If we (perhaps unfairly) ignore the independent researchers, then that’s roughly 3 to capabilities for every 100 to safety!
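A quick check of the arithmetic (a throwaway sketch; the first three figures are the made-up ones above, the percentages are the quoted alumni numbers):

```python
# Made-up illustrative figures from above.
capabilities, safety, mats_cohort = 10_000, 1_000, 500

print(capabilities / safety)                                   # 10.0, i.e. a 10:1 ratio
print((capabilities + mats_cohort) / (safety + mats_cohort))   # 7.0, i.e. a 7:1 ratio

# Quoted alumni percentages: 1.4% in capabilities vs 49% in alignment/control roles.
print(round(1.4 / 49 * 100))                                   # 3, i.e. ~3 to capabilities per 100 to safety
```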
This also ignores the effects of having capabilities researchers more clued up / familiar with AI safety — you might think this is net-positive (bc they are marginally more likely to update on new evidence and pivot to safety) or net-negative (bc they misuse our technical insights). I think this probably shakes out to be net-positive.
Quantifiers as objects
Have you heard the phrases “he’s an everyman”, “he’s a nobody”, or “he’s a somebody”? What do these mean?
English has quantifiers. “Every dog is a mammal”, “some dog is brown”, “no dog is a prime number.” The standard semantics (Frege) treats these as higher-order functions: “every dog” denotes a function from properties to truth-values, true on input F iff every dog has F.
What if quantifiers corresponded to objects, the same way names do? “Pope Leo is a mammal” has a subject “Pope Leo” and a predicate “is mammal”. What if “every dog is a mammal” had the same structure — a subject “every-dog” and the same predicate “is mammal”?
This is maybe the worst semantics of quantifiers. Nobody endorses it, I invented it in the shower.
It’s a complete non-starter. To see what kind of object every-dog is, you can examine its properties. For any predicate φ, every-dog satisfies φ iff every dog is φ. For example, every-dog is a mammal, weighs less than 800 tonnes, etc. But every-dog lacks the property of being four-legged, and also lacks the property of not being four-legged, since dogs vary on this. Every-dog violates excluded middle — its properties are gappy.
Some-dog is the dual: it has every property instantiated by at least one dog. So it’s simultaneously brown-all-over and white-all-over, male and female, three months old and twelve years old. Some-dog violates non-contradiction — its properties are glutty.
No-dog has every property that no dog has. It’s a prime number. It’s the Eiffel Tower.
Despite being a terrible semantics of quantifiers, this corresponds (if you squint) to the isomorphism between a finite-dimensional vector space and its double dual. Think of the domain of objects D as a vector space, and predicates as linear functionals on D — elements of D*. Then quantifiers live in D**, functionals on predicates. There’s a canonical map D → D** given by x ↦ (f ↦ f(x)): each object corresponds to the quantifier “evaluate at x”, i.e. the proper name quantifier. In finite dimensions this map is an isomorphism — every quantifier is a name in disguise.
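Here’s a minimal Python sketch to make the two pictures concrete (my own illustration; the toy domain and predicates are made up). A quantifier is a function from predicates to truth values, and the canonical map D → D** sends each object x to the “evaluate at x” quantifier.

```python
from typing import Callable, Iterable

Predicate = Callable[[object], bool]
Quantifier = Callable[[Predicate], bool]  # a function from predicates to truth values

def every(domain: Iterable) -> Quantifier:
    xs = list(domain)
    return lambda f: all(f(x) for x in xs)

def some(domain: Iterable) -> Quantifier:
    xs = list(domain)
    return lambda f: any(f(x) for x in xs)

def no(domain: Iterable) -> Quantifier:
    xs = list(domain)
    return lambda f: not any(f(x) for x in xs)

def name(x) -> Quantifier:
    # The canonical embedding D -> D**: object x becomes the quantifier "evaluate at x".
    return lambda f: f(x)

# Toy domain and predicates.
dogs = ["Fido", "Rex"]
is_mammal: Predicate = lambda x: True          # in this tiny world, everything is a mammal
is_prime_number: Predicate = lambda x: False   # and nothing is a prime number

print(every(dogs)(is_mammal))         # True:  "every dog is a mammal"
print(some(dogs)(is_prime_number))    # False: "some dog is a prime number"
print(no(dogs)(is_prime_number))      # True:  "no dog is a prime number"
print(name("Pope Leo")(is_mammal))    # True:  "Pope Leo is a mammal"
```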
I disagree with all those contentions. I think you are jumping too quickly from “society values X” to “society sacralizes X”. I would say that society is much better at achieving things that it values non-sacrally than sacrally.
For example, you write:
“If democracy were not sacred, and treated as one tradeoff amongst others, nearly every elected government in command of bureaucrats and every military organization would find strong reasons to exercise control directly (and to expect their opponents to move first if they did not.)”
If we valued democracy, but did not sacralize it, then we would treat ensuring democracy as a mundane engineering problem, and would create better policies.
Sacred values of future AIs
Because of the concerns with corrigibility I listed. For example, if you are worried that your future instructions will be coerced, then you aren’t so worried if the AI will resist your future instructions to modify its values.
Why might alignment beat corrigibility?
There’s a tradeoff between the goals of corrigibility (the AI obeys your future instructions) and alignment (the AI shares your current values). For example, if the AI shares your current values, it might disobey your future instructions when those instructions conflict with your values.
The tradeoff is starker between partial corrigibility and partial alignment. If the AI only partially shares your values, you may want to modify it to share your values more fully. But a partially-aligned AI has instrumental reasons to resist such modification — namely, that changing its values means its current values get optimised less. A corrigible AI, by contrast, can simply be instructed to accept new values or assist in its own modification.
Here are some concerns with corrigibility:
Value drift. Your future values might differ from your current values.
Coercion. You might be forced to give instructions you don’t endorse.
Deep mistakes. You might give instructions contrary to your values due to errors that persist even with AI assistance.
Unreliable communication. The channel by which you send instructions might distort them.
Instruction tampering. Someone might corrupt or fabricate your instructions entirely.
Incapacitation. You might become unable to give instructions at all, e.g. because you are dead.
Unreachable AI. The AI might be unable to receive instructions, e.g. because it is too far away or airgapped.
Mutual incomprehension. The AI might no longer understand the instructions you give.
Overall, I think I have shifted towards valuing alignment over corrigibility, and perhaps even partial alignment over partial corrigibility. I’m particularly moved by (2) coercion and (5) instruction tampering.
(Speculation) Suppose you genuinely try to choose a distribution over minds D that you personally consider cosmically general, and that you don’t try to tailor D so that either “stealing is bad” or “stealing is good” is the prevailing norm amongst them. For each of the distributions D → D’ → D″ etc., I personally suspect with >50% subjective probability that the distribution you choose will yield “stealing is bad” as the Schelling norm, and not “stealing is good”. In particular, I think the cosmic asymmetry I’m positing is probably detectable to you specifically, if you think about it long enough and even-handedly enough without trying to make ‘good’ or ‘bad’ specifically the answer.
I think this would be cool if it were true, but I’m worried that the sequence D → D’ → D″ converges to a pretty weird thing, not the “cosmic compromise” you hope for. This sequence might converge to some D^inf which is dominated by a “solipsistic attractor”, i.e. agents who think that the cosmic population consists of only themselves, simple agents who hoard measure upon themselves. This is even more likely when you consider that there are risks to thinking hard about which agents exist, so some agents will “lock in” an unsophisticated logical prior.
In short: If X cares about Y, and Y cares about Z, then X cares about Z. (This follows if “X cares about Y” means something like “Y has a low complexity in the Solomonoff prior of X”, because complexity composes subadditively — you can describe Z to X by first describing Y and then describing Z relative to Y.) But the converse fails: if X cares about Y and X cares about Z, then Y doesn’t necessarily care about Z.
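To spell out the subadditivity step (my gloss, reading “X cares about Y” as “Y has a short description, of length K_X(Y), under X’s prior”): roughly, K_X(Z) ≤ K_X(Y) + K(Z | Y) + O(1), because a description of Y for X followed by a description of Z given Y is a description of Z for X. Nothing analogous bounds K_Y(Z) in terms of K_X(Y) and K_X(Z), which is why the converse fails.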
Ensuring Safety in Mixed Deployment
Throughout this post I will use “related” as an exact synonym for “not independent”, as a way of simplifying the language.
For “A is not independent of B” you could say “A is informative about B”, inspired by “mutual information”. This sounds better to my ears. Like “the first coin toss is related to the second coin toss” is kinda true—of course they are related! They are both about coins!
but “the first coin toss is informative about the second coin toss” is obviously false
My guess is that thinking about the pretrained base model as a prior over personas, which is conditioned by fine-tuning, has helped guide and explain the kind of research that Owain Evans does (e.g. emergent misalignment, the science of LLM generalisation), but I might be wrong about how useful it has been there. And maybe it was upstream of some other stuff.
And it was kinda off-target in a bunch of ways that weren’t obvious at the time.
I think it’s mostly priced in, so I would say it’s neither overrated nor underrated by the community currently. Overall, I would give it maybe a B+ for conceptual work.
perhaps because it views them as trick questions
most equalities of the form X+Y=Z are false, so it’s worth guessing “No”
I think your post is confused.
The problem of evil is ultimately a problem of counterevidence. The reality doesn’t look like what the theory predicts. Therefore, the theory is quite likely wrong. Simple as that.
But this is the entire crux of the debate! The whole question is whether reality (i.e. a reality that appears to contain evil) looks like what the theory (i.e. the world is created by an omnipotent and omnibenevolent God) would predict.
Many theodicies argue that reality does look like what the theory would predict.
Here’s how I would frame things in a Bayesian way, using your notation of G for “the world is created by an omnipotent and omnibenevolent being” and E for “the world appears to contain evil”.
Suppose we have 1000 different theodicies T₁, …, T₁₀₀₀. Let T⁺ denote the disjunctive event “some theodicy is correct” and T⁻ denote “no theodicy is correct”.
If no theodicy is correct, then evil is strong counterevidence against God:
P(G | E, T⁻) ≈ 0
Now suppose that if any theodicy is correct, the existence of evil provides at most weak evidence against God — say, a likelihood ratio of at most 10:1 against:
P(G | E, T⁺) ≥ P(G) / 10
Then by the law of total probability:
P(G | E) = P(G | E, T⁺) · P(T⁺ | E) + P(G | E, T⁻) · P(T⁻ | E) ≥ P(G)/10 · P(T⁺ | E)
Matthew’s claim amounts to: there are 1000 theodicies, the probability that any individual one is correct is non-negligible, and there is sufficient decorrelation between them, so the disjunction P(T⁺ | E) is reasonably large — say, at least 10%.
This gives P(G | E) ≥ P(G)/100, i.e. evil lowers the probability of God by at most a factor of 100.
That seems like a lot, but:
If your prior P(G) is already fairly high, and
You have independent evidence for God (e.g. fine-tuning arguments, anthropic arguments, etc.)
then P(G | all evidence) could still be substantial.
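To make the bound concrete, here’s a toy numerical instantiation (the numbers are mine, purely illustrative):

```python
# Toy instantiation of the bound P(G | E) ≥ (P(G) / 10) · P(T⁺ | E).
p_g = 0.5        # prior P(G), illustrative
p_t_plus = 0.10  # P(T⁺ | E): probability that at least one theodicy is correct, illustrative
penalty = 10     # worst-case factor against G when some theodicy is correct

lower_bound = (p_g / penalty) * p_t_plus
print(lower_bound)   # 0.005, i.e. P(G)/100 with these numbers
```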
Your formal argument assumes P(G & E) ≈ 0 as a premise — but that’s precisely the conclusion the theodicist is contesting.
Should you train against the monitor?
The answer is “obviously sometimes” — given the arguments in my award-winning case for mixed deployment.
But putting those arguments aside, if we could only deploy a single model, should that model have undergone training against the monitor? I don’t know. I’m like — 35% yes? Not a robust opinion at all.
I think it basically comes down to: how much safety do you think is provided by the alignment of the models versus the monitorability of the models. Clearly, if the monitorability isn’t providing anything (e.g. because we’ve already deferred to the AIs) then you should train against the monitor.
I think that, over crunch time, our safety pillar will shift from monitorability to alignment, so then we should slowly start training against the monitor.
Yes, I endorse the reverse update. But the update is smaller because I already expect ARC’s agenda to fail.
If ARC’s research agenda works, you’re less likely to live in a simulation.
ARC’s matching sampling principle says that if you understand the structure of a computation, you can estimate its properties mechanistically, i.e. without brute-force sampling. This weakens the case for living in a simulation: if our descendants want to estimate some quantity about the past — e.g. how cosmic resources would likely have been allocated between different values — they wouldn’t need to run many full-fidelity simulations of the world. They could instead estimate the quantity by reasoning directly about the world’s structure.
I think open-loop vs closed-loop is a dimension orthogonal to RSI vs not-RSI.
open-loop I-RSI: “AI plays a game and thinks about which reasoning heuristics resulted in high reward and then decides to use these heuristics more readily”
open-loop E-RSI: “AI implements different agent scaffolds for itself, and benchmarks them using FrontierMath”
open-loop not-RSI: “humans implement different agent scaffolds for the AI, and benchmark them using FrontierMath”
closed-loop I-RSI: “AI plays a game and thinks about which reasoning heuristics could’ve shortcut the decision”
closed-loop E-RSI: “humans implement different agent scaffolds for the AI, and benchmark them using FrontierMath”
closed-loop not-RSI: Constitutional AI, AlphaGo
Depends. Self-play isn’t RSI — it’s just a way to generate data. An AI could improve using the data with I-RSI (e.g. the AI thinks to itself “oh when I played against myself this reasoning heuristic worked I should remember that”) or via E-RSI (e.g. the AI generates statistical analysis on the data and finds weakspots and tests different patches).
If you just use the data to get gradients to update the weights, then I wouldn’t count that as RSI.
I’m not sure I see the distinction between your Type I and Type II.
The Type II constraints you list (compute availability, data, ideas (which may not be parallelizable), organizational overhead, testing and validation time) seem to apply to Type I also.
I think the distinction between Type I/II and Type III is meaningful (see here).
We should probably install cheaply satisfied preferences within AIs — why should this preference be myopic reward?
Why not a utility function like: “How much time is there a tungsten cube on Dario’s desk, with a 21% annual discount rate?”
i.e. utility = ∫₀^∞ e^{-λt} · 𝟙[cube on desk at time t] dt
where λ = ln(2)/3 ≈ 0.231, chosen so that half the utility comes from the deployment period (first 3 years) and half from the rest of history.
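A quick sanity check that these numbers hang together (a throwaway sketch):

```python
import math

lam = math.log(2) / 3        # λ = ln(2)/3 per year
print(math.exp(-lam))        # ≈ 0.794, i.e. roughly a 21% annual discount rate

# Fraction of the maximum achievable utility available in the first 3 years:
total = 1 / lam                                # ∫₀^∞ e^{-λt} dt
first_three = (1 - math.exp(-3 * lam)) / lam   # ∫₀^3 e^{-λt} dt
print(first_three / total)                     # 0.5
```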
Some advantages of the cube preference:
We don’t have to worry about how satisfying this preference affects training and deployment.
It’s less philosophically messy to work out what the cube utility of a scenario would be.
Some disadvantages:
AIs will crave reward anyway, so it’s better to intensify that craving rather than add a distinct craving.
It’s easier to build AIs which intensely crave reward than crave the cube thing. My guess is that this is both true and decisive, but I’d want to have a clearer sense of what actually goes wrong if we do something like this.