(Formerly “antimonyanthony.”) I’m an s-risk-focused AI safety researcher at the Center on Long-Term Risk. I (occasionally) write about altruism-relevant topics on my Substack. All opinions my own.
Anthony DiGiovanni
Who ignores, or argues against, courage and honesty?
As an intrinsic value? Lots of utilitarians, myself included. I’m unsure if Rob’s intent was to suggest these things are values worth respecting intrinsically or just instrumentally.
Something I’m wondering, but don’t have the expertise in meta-learning to say confidently (so, epistemic status: speculation, and I’m curious for critiques): extra OOMs of compute could overcome (at least) one big bottleneck in meta-learning, the expense of computing second-order gradients. My understanding is that most methods just ignore these terms or use crude approximations, like this, because they’re so expensive. But at least this paper found some pretty impressive performance gains from using the second-order terms.
Maybe throwing lots of compute at this aspect of meta-learning would help it cross a threshold of viability, like what happened for deep learning in general around 2012. I think meta-learning is a case where we should expect second-order info to be very relevant to optimizing the loss function in question, not just a way of incorporating the loss function’s curvature. In the first paper I linked, the second-order term accounts for how the base learner’s gradients depend on the meta-learner’s parameters. This seems like an important feature of what their meta-learner is trying/supposed to do, i.e., use the meta-learned update rule to guide the base learner—and the performance gains in the second paper are evidence of this. (Not all meta-learners have this structure, though, and MAML apparently doesn’t get much better when you use Hessians. Hence my lack of confidence in this story.)
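To make the role of the second-order term concrete, here is a minimal toy sketch in plain Python. It is not the setup of either linked paper: the quadratic base loss, the single learned step size `alpha` standing in for the "meta-learned update rule", and all function names are assumptions chosen for illustration. It differentiates a two-step unrolled base learner with respect to `alpha`, with and without the Hessian term that captures how the base learner's later gradients depend on the meta-parameter.

```python
# Toy setup (all values hypothetical): base loss 0.5 * A * (theta - TRAIN_TARGET)^2,
# meta-objective 0.5 * (theta_final - VAL_TARGET)^2, meta-parameter = step size alpha.
A = 2.0
TRAIN_TARGET = 1.0
VAL_TARGET = 1.5


def grad(theta):
    """Gradient of the base (training) loss at theta."""
    return A * (theta - TRAIN_TARGET)


def hessian():
    """Second derivative of the base loss (constant for a quadratic)."""
    return A


def unroll(theta0, alpha, steps=2):
    """Run the base learner for a few gradient steps with step size alpha."""
    theta = theta0
    for _ in range(steps):
        theta = theta - alpha * grad(theta)
    return theta


def val_loss(theta):
    """Meta-objective evaluated on the base learner's final parameters."""
    return 0.5 * (theta - VAL_TARGET) ** 2


def meta_grad(theta0, alpha, steps=2, second_order=True):
    """d val_loss / d alpha through the unrolled updates.

    With second_order=False the Hessian term is dropped, i.e. we pretend the
    base learner's gradient does not depend on alpha via earlier steps.
    """
    theta = theta0
    dtheta_dalpha = 0.0
    for _ in range(steps):
        g = grad(theta)
        h = hessian() if second_order else 0.0
        # Differentiate theta - alpha * g(theta) with respect to alpha.
        dtheta_dalpha = dtheta_dalpha - g - alpha * h * dtheta_dalpha
        theta = theta - alpha * g
    return (theta - VAL_TARGET) * dtheta_dalpha


exact = meta_grad(0.0, 0.1)
first_order = meta_grad(0.0, 0.1, second_order=False)
print(exact, first_order)
```

In this toy the two estimates differ by the `alpha * h * dtheta_dalpha` term. In a real network that term requires Hessian-vector products through the whole unrolled inner loop, which is where the extra compute would go.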
Not a direct answer to your question, but I want to flag that using “AI alignment” to mean “AI [x-risk] safety” seems like a mistake. Alignment means getting the AI to do what its principal/designer wants, which is not identical to averting AI x-risks (much less s-risks). There are plausible arguments that this is sufficient to avert such risks, but it’s an open question, so I think equating the two is confusing.
Thanks, this makes it pretty clear to me how alignment could be fundamentally hard besides deception. (The problem seems to hold even if your values are actually pretty simple; e.g. if you’re a pure hedonistic utilitarian and you’ve magically solved deception, you can still fail at outer alignment by your AI optimizing for making it look like there’s more happiness and less suffering.)
Some (perhaps basic) notes to check that I’ve understood this properly:
The Bayes net running example per se isn’t really necessary for ELK to be a problem.
The basic problem is that in training, the AI can do just as well by reporting what a human would believe given their observations, and upon deployment in more complex tasks the report of what a human would believe can come apart from the “truth” (what the human would believe given arbitrary knowledge of the system).
This seems to crop up for a variety of models of AI and human cognition.
It seems like the game is stacked against “doing X” and in favor of “making it look like X” in many contexts, such that even with regularizers that push towards actually doing X, the overall inductive bias would plausibly still be towards making it look like X. It’s just easier to make it look to humans like you’re creating a utopia than to do all the complex work of utopia-building.
I suspect this would hold even for much less ambitious yet still superhuman tasks, such that deferring to future human-level aligned AIs wouldn’t be sufficient.
But, if we train a reporter module, reporting what the human would believe doesn’t seem prima facie easier than reporting the truth in this way. So that’s why we might reasonably hope a good regularizer can break the tie.
In the build-break loop examples in the report, we’re generously assuming the human overseers know the relevant set of questions to ask to check if there’s malfeasance going on, and that this set isn’t so hopelessly large that iterating through it for training is too slow.
In the imitative generalization example, it seems like besides the problem that the output Bayes net may be ontologically incomprehensible to humans, the training process requires humans to understand all the relevant hypotheses and data (to report their priors and likelihoods). This may be a general confusion about imitative generalization on my part.
If we tried distillation to get around the prohibitive slowness of amplification for the “AI science” proposal, that would both introduce inner alignment problems and perhaps bring us to the same sort of “alien ontology” problem as the imitative generalization proposal.
The ontology mismatch problem isn’t just a possibility, it seems pretty likely by default, for reasons summarized in the plot of model interpretability here.
Intuitively, the ontology/primitive concepts that quantum physicists use to make very excellent predictions about the universe—better than I could make, certainly—are alien to me, and to anyone else who hasn’t spent a lot of time learning quantum physics. This is consistent with human-interpretable concepts being more prevalent in recent powerful language models than in early-2010s neural networks.
Deferring to future human-level aligned AIs isn’t sufficient because even if we had many more human-level minds giving feedback to superhuman AIs, they would still face ELK too. I.e., this doesn’t seem to be a problem that can be solved just by parallelizing across more overseers than we currently have, although having aligned assistants could of course still help with ELK research.
I feel confused as to how step (3) is supposed to work, especially how “having the training be done by the model being trained given access to tools from (2)” is a route to this.
At some step in the amplification process, we’ll have systems that are capable of deception, unlike the base case. So it seems that if we let the model train its successor using the myopia-verification tools, we need some guarantee that the successor is non-deceptive in the first place. (Otherwise the myopia-verification tools aren’t guaranteed to work, as you note in the bullet points of step (2).) Are you supposing that there’s some property other than myopia that the model could use to verify that its successor is non-deceptive, such that it can successfully verify myopia? What is that property? And do we have reason to think that property will only be guaranteed if the model doing the training is myopic? (Otherwise why bother with myopia at all—just use that other property to guarantee non-deception.)
Intuitively, step (3) seems harder than (2), since in (3) you have to worry about deception creeping into the more powerful successor agent, while (2) by definition only requires myopia verification of non-deceptive models.
ETA: Other than this confusion, I found this post helpful for understanding what success looks like to (at least one) alignment researcher, so thanks!
I think “the very repugnant conclusion is actually fine” does pretty well against its alternatives. It’s totally possible that our intuitive aversion to it comes from just not being able to wrap our brains around some aspect of (a) how huge the numbers of “barely worth living” lives would have to be, in order to make the very repugnant conclusion work; (b) something that is just confusing about the idea of “making it possible for additional people to exist.”
While this doesn’t sound crazy to me, I’m skeptical that my anti-VRC intuitions can be explained by these factors. I think you can get something “very repugnant” on scales that our minds can comprehend (and not involving lives that are “barely worth living” by classical utilitarian standards). Suppose you can populate* some twin-Earth planet with either a) 10 people with lives equivalent to the happiest person on real Earth, or b) one person with a life equivalent to the most miserable person on real Earth plus 8 billion people with lives equivalent to the average resident of a modern industrialized nation.
I’d be surprised if a classical utilitarian thought the total happiness minus suffering in (b) was less than in (a). Heck, 8 billion might be pretty generous. But I would definitely choose (a).
To me the very-repugnance just gets much worse the more you scale things up. I also find that basically every suffering-focused EA I know is not scope-neglectful about the badness of suffering (at least, when it’s sufficiently intense), or in any area other than population ethics. So it would be pretty strange if we just happened to be falling prey to that error in thought experiments where there’s another explanation—i.e., we consider suffering especially important—which is consistent with our intuitions about cases that don’t involve large numbers.
* As usual, ignore the flow-through effects on other lives.
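For concreteness, the classical-utilitarian bookkeeping in this thought experiment can be sketched as below. The welfare numbers are purely hypothetical placeholders (nothing in the comment specifies them), chosen only to show how the totals compare under simple summation.

```python
import math

# Hypothetical welfare levels per life, in arbitrary units (assumptions, not data).
HAPPIEST = 100.0
AVERAGE = 10.0
MOST_MISERABLE = -1000.0


def total_welfare(population):
    """Classical total: sum of (number of people * welfare per person)."""
    return sum(count * welfare for count, welfare in population)


option_a = [(10, HAPPIEST)]
option_b = [(1, MOST_MISERABLE), (8_000_000_000, AVERAGE)]


def break_even_crowd():
    """Smallest number of average lives at which (b)'s total strictly exceeds (a)'s."""
    gap = total_welfare(option_a) - MOST_MISERABLE
    return math.floor(gap / AVERAGE) + 1


print(total_welfare(option_a), total_welfare(option_b), break_even_crowd())
```

Under these placeholder numbers, a crowd far smaller than 8 billion already tips the classical total towards (b), which is the sense in which 8 billion "might be pretty generous". The choice of (a) then has to come from weighting the one miserable life differently, not from the sum.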
The amount of EV at stake in my (and others’) experiences over the next few years/decades is just too small compared to the EV at stake in the long-term future.
AI alignment isn’t the only option to improve the EV of the long-term future, though.
Sort of! This paper (of which I’m a coauthor) discusses this “unraveling” argument, and the technical conditions under which it does and doesn’t go through. Briefly:
It’s not clear how easy it is to demonstrate military strength in the context of an advanced AI civilization, in a way that can be verified / can’t be bluffed. If I see that you’ve demonstrated high strength in some small war game, but my prior on you being that strong is sufficiently low, I’ll probably think you’re bluffing and wouldn’t be that strong in the real large-scale conflict.
Supposing strength can be verified, it might be intractable to do so without also disclosing vulnerable info (irrelevant to the potential conflict). As TLW’s comment notes, the disclosure process itself might be really computationally expensive.
But if we can verifiably disclose, and I can either selectively disclose only the war-relevant info or I don’t have such a vulnerability, then yes you’re right, war can be avoided. (At least in this toy model where there’s a scalar “strength” variable; things can get more complicated in multiple dimensions, or where there isn’t an “ordering” to the war-relevant info.)
Another option (which the paper presents) is conditional disclosure—even if you could exploit me by knowing the vulnerable info, I commit to share my code if and only if you commit to share yours, play the cooperative equilibrium, and not exploit me.
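The scalar-strength version of the unraveling logic above can be sketched with a small toy model. This is the standard textbook unraveling argument, not the model from the paper. It assumes a uniform prior over distinct strength levels, perfectly verifiable disclosure, and a hypothetical `cost` parameter standing in for the downside of disclosing (leaked vulnerable info or computational expense).

```python
def unravel(strengths, cost=0.0):
    """Return the set of strength types that stay silent in the toy model.

    A type discloses when its verified strength, net of the disclosure cost,
    beats what an observer would infer from silence (the mean of the types
    still silent). Iterate until nobody else wants to leave the silent pool.
    """
    silent = set(strengths)
    while True:
        pool_mean = sum(silent) / len(silent)
        leavers = {s for s in silent if s - cost > pool_mean}
        if not leavers:
            return silent
        silent -= leavers


# Costless, verifiable disclosure: everyone except the weakest type reveals.
print(unravel([1, 2, 3, 4, 5]))

# Costly disclosure: unraveling stalls and most types stay silent.
print(unravel([1, 2, 3, 4, 5], cost=1.5))
```

With zero cost the silent pool shrinks to the weakest type alone, so silence is fully informative and a war over misjudged strength is avoided. With a large enough cost the pool stays mixed, which matches the caveats about vulnerable info and expensive disclosure.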
From the beginning, I invented timeless decision theory because of being skeptical that two perfectly sane and rational hyperintelligent beings with common knowledge about each other would have no choice but mutual defection in the oneshot prisoner’s dilemma. I suspected they would be able to work out Something Else Which Is Not That, so I went looking for it myself.
I don’t see how this makes the point you seem to want it to make. There’s still an equilibrium selection problem for a program game of one-shot PD—some other agent might have the program that insists (through a biased coin flip) on an outcome that’s just barely better for you than defect-defect. It’s clearly easier to coordinate on a cooperate-cooperate program equilibrium in PD or any other symmetric game, but in asymmetric games there are multiple apparently “fair” Schelling points. And even restricting to one-shot PD, the whole commitment races problem is that the agents don’t have common knowledge before they choose their programs.
Perhaps the crux here is whether we should expect all superintelligent agents to converge on the same decision procedure, with the agents themselves expecting this, such that they’ll coordinate by default? As sympathetic as I am to realism about rationality, I put a pretty nontrivial credence on the possibility that this convergence just won’t occur, and persistent disagreement (among well-informed people) about the fundamentals of what it means to “win” in decision theory thought experiments is evidence of this.
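A minimal sketch of the equilibrium selection point for a program game of one-shot PD follows. The payoff numbers, the two-program strategy space, and the use of Python function identity as a stand-in for "reading the other program's source code" are all simplifying assumptions for illustration.

```python
# Hypothetical one-shot PD payoffs: (row player's payoff, column player's payoff).
PAYOFF = {
    ("C", "C"): (3, 3),
    ("C", "D"): (0, 5),
    ("D", "C"): (5, 0),
    ("D", "D"): (1, 1),
}


def clique_bot(opponent):
    """Cooperate only with an exact copy of this program, else defect."""
    return "C" if opponent is clique_bot else "D"


def defect_bot(opponent):
    """Defect regardless of the opponent's program."""
    return "D"


def play(p1, p2):
    """Each program sees the other program and outputs an action."""
    return PAYOFF[(p1(p2), p2(p1))]


def is_equilibrium(p1, p2, programs):
    """True if neither player gains by unilaterally switching programs."""
    u1, u2 = play(p1, p2)
    no_gain_1 = all(play(q, p2)[0] <= u1 for q in programs)
    no_gain_2 = all(play(p1, q)[1] <= u2 for q in programs)
    return no_gain_1 and no_gain_2


PROGRAMS = [clique_bot, defect_bot]
for p1 in PROGRAMS:
    for p2 in PROGRAMS:
        print(p1.__name__, p2.__name__, play(p1, p2), is_equilibrium(p1, p2, PROGRAMS))
```

Even in this tiny symmetric game, both the mutual-cooperation pair and the mutual-defection pair are program equilibria, so transparency alone doesn't pick one. Programs that demand lopsided outcomes, or asymmetric payoffs, would only add more candidate equilibria to select among.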
Why does the person asking this question care about whether “interesting”-to-humans things happen, in a future where no humans exist to find them interesting?
That all sounds fair. I’ve seen rationalists claim before that it’s better for “interesting” things (in the literal sense) to exist than not, even if nothing sentient is interested by them, so that’s why I assumed you meant the same.
It feels to me like “have humans try to get to know the AIs really well by observing their behaviors, so that they’re able to come up with inputs where the AIs will be tempted to do bad things, so that we can do adversarial training” is probably worth including in the smorgasbord of techniques we use to try to prevent our AIs from being deceptive
Maybe I missed something here, but how is this supposed to help with deception? I thought the whole reason deceptive alignment is really hard to solve is that you can’t tell if the AI’s being deceptive via its behavior.
They can read each other’s source code, and thus trust much more deeply!
Being able to read source code doesn’t automatically increase trust—you also have to be able to verify that the code being shared with you actually governs the AGI’s behavior, despite that AGI’s incentives and abilities to fool you.
(Conditional on the AGIs having strongly aligned goals with each other, sure, this degree of transparency would help them with pure coordination problems.)
On my picture, I think a key variable is the length of time between when-we-understand-the-basic-shape-of-things-that-will-get-to-AGI and when-it-reaches-strong-superintelligence.
I don’t understand why you think the sort of capabilities research done by alignment-conscious people contributes to lengthening this time. In particular, what reason do you have to think they’re not advancing the second time point as much as the first? Could you spell that out more explicitly?
I notice that I strongly disagree with a majority of them (#1, #2, #4, #8, #10, #11, #13, #14, #15, #17, #18, #21)
Re: #2, what do you consider to be The Bad other than suffering?
I like that this post clearly argues for some reasons why we might expect deception (and similar dynamics) to not just be possible in the sense of getting equal training rewards, but to actually provide higher rewards than the honest alternatives. This positively updates my probability of those scenarios.
(Speaking for myself as a CLR researcher, not for CLR as a whole)
I don’t think it’s accurate to say CLR researchers think increasing transparency is good for cooperation. There are some tradeoffs here, such that I and other researchers are currently uncertain whether marginal increases in transparency are net good for AI cooperation. Though, it is true that more transparency opens up efficient equilibria that wouldn’t have been possible without open-source game theory. (ETA: some relevant research by people (previously) at CLR here, here, and here.)
Thanks for this! I agree that inter-agent safety problems are highly neglected, and that it’s not clear that intent alignment or the kinds of capability robustness incentivized by default will solve (or are the best ways to solve) these problems. I’d recommend looking into Cooperative AI, and the “multi/multi” axis of ARCHES.
This sequence discusses similar concerns—we operationalize what you call inter-agent alignment problems as either:
Subsets of capability robustness, because if an AGI wants to achieve X in some multi-agent environment, then accounting for the dependencies of its strategy on other agents’ strategies is instrumental to achieving X (but accounting for these dependencies might be qualitatively harder than default capabilities); or
Subsets of intent alignment, because the AGI’s preferences partly shape how likely it is to cooperate with others, and we might be able to intervene on cooperation-relevant preferences even if full intent alignment fails.
(At the risk of necroposting:) Was this paper ever written? Can’t seem to find it, but I’m interested in any developments on this line of research.