Former AI safety research engineer, now AI governance researcher at OpenAI. Blog: thinkingcomplete.blogspot.com
I’m feeling very excited about this agenda. Is there currently a publicly-viewable version of the living textbook? Or any more formal writeup which I can include in my curriculum? (If not I’ll include this post, but I expect many people would appreciate a more polished writeup.)
Oh, that makes sense. Yepp, if you’re talking about essays from throughout history then breaking the top 10 does seem like a high bar.
Though I think, for me, that the following probably make it in (especially when I weight more heavily on usefulness to me, rather than prescience):
Curious what essays make your top 10 list?
Ah, I see. That makes sense now!
I do expect this to happen. The question is merely: what’s the best predictor of how hard it is to find inference algorithms more efficient or effective than gradient descent? Is it whether those inference algorithms are more complex than gradient descent? Or is it whether those inference algorithms run for longer than gradient descent? Since gradient descent is very simple but takes a long time to run, my bet is the latter: there are many simple ways to convert compute to optimization, but few compute-cheap ways to convert additional complexity to optimization.
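(As a toy illustration of “simple ways to convert compute to optimization”, here’s a sketch of my own, with a made-up objective and parameters, not something from the original discussion: even random search, which is about as simple as an optimizer can get, keeps improving its output the longer you let it run.)

```python
import random

def loss(x):
    # Hypothetical objective for illustration: a simple quadratic to minimize.
    return (x - 3.7) ** 2

def random_search(num_samples):
    # The algorithm is only a few lines long; its optimization power comes
    # almost entirely from how much compute (how many samples) we give it.
    best_x, best_loss = None, float("inf")
    for _ in range(num_samples):
        x = random.uniform(-10, 10)
        current = loss(x)
        if current < best_loss:
            best_x, best_loss = x, current
    return best_x, best_loss

# Spending 100x more compute yields a noticeably better optimum,
# with zero change in the description length of the algorithm.
print(random_search(100))
print(random_search(10_000))
```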
No, I wasn’t advocating adding a speed penalty, I was just pointing at a reason to think that a speed prior would give a more accurate answer to the question of “which is favored” than the bounded simplicity prior you’re assuming:
Suppose that your imitator works by something akin to Bayesian inference with some sort of bounded simplicity prior (I think it’s true of transformers)
But now I realise that I don’t understand why you think this is true of transformers. Could you explain? It seems to me that there are many very simple hypotheses which take a long time to calculate, and which transformers therefore can’t be representing.
I didn’t read the post particularly carefully, so it’s totally plausible that I’m misunderstanding the key ideas you were trying to convey. I apologise for phrasing my claims in a way that made it sound like I was skeptical of your motivations; I’m not, and I’m glad you wrote this up.
I think my concerns still apply to the position you stated in the previous comment, but insofar as the main motivation behind my comment was to generically nudge LW in a certain direction, I’ll try to do this more directly, rather than via poking at individual posts in an opportunistic way.
I think there is a very legitimate sense in which optimizing the steps of a plan to do a thing is a separate skill and/or mental propensity to executing that plan (as in, actually sending those signals outside the computer) or wanting it executed, and in which agency is mostly a measure of the latter.
My main criticism is that, in general, you have to think while you’re executing plans, not just while you’re generating them. The paradigm where you plan every step in advance, and then the “agency” comes in only when executing it, is IMO a very misleading one to think in.
(This seems related to Eliezer’s argument that there’s only a one-line difference between an oracle AGI and an agent AGI. Sure, that’s true in the limit. But thinking about the limit will make you very confused about realistic situations!)
I’m not sure what the practical difference is between criticizing a post and criticizing people that upvoted it
It’s something like: “I endorse people following the policy of writing posts like this one, it’s great when people work through their thoughts in this way. I don’t endorse people following the policy of upvoting posts like this one to this extent, because it seems likely that they’re mainly responding to high-level applause lights.”
to the extent that this is a criticism of the post I wish you had been more explicit about what you are objecting to.
I’m sympathetic to you wanting more explicit feedback, but the fact that this post is so high-level and ungrounded is what makes it difficult for me to give that. To me it reads more like a story than an argument.
In that case, gradient descent will reduce the weights that are used to calculate that specific activation value.
Hence, the honest-imitation hypothesis is heavily penalized compared to hypotheses that are in themselves agents which are more “epistemically sophisticated” than the outer loop of the AI.
In a deep learning context, the latter hypothesis seems much more heavily favored under a simplicity prior (since gradient descent is simple to specify) than under a speed prior (since gradient descent takes a lot of computation). So as long as the compute costs of inference remain smaller than the compute costs of training, a speed prior seems more appropriate for evaluating how easily hypotheses can become more epistemically sophisticated than the outer loop.
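(To gesture at why a simplicity prior treats gradient descent so favorably, here’s a minimal sketch of my own, with a made-up objective and hyperparameters: the full gradient-descent “hypothesis” can be written down in a handful of lines, so its description length is tiny, but its runtime grows with the number of steps and the amount of data, which is exactly the quantity a speed prior charges for.)

```python
# Minimal sketch (my own illustration, made-up loss and hyperparameters):
# gradient descent on a 1-D quadratic. The *specification* is a few lines,
# so a simplicity prior barely penalizes it; the *runtime* scales with
# num_steps, which is what a speed prior would penalize.
def grad(w):
    # Gradient of the hypothetical loss (w - 5)^2.
    return 2 * (w - 5)

w = 0.0
learning_rate = 0.1
num_steps = 1_000  # the compute cost grows here, not in the code's length
for _ in range(num_steps):
    w -= learning_rate * grad(w)

print(w)  # converges near 5
```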
I’ve edited via the link you gave, but it doesn’t seem to be showing up in the main post.
(Specifically, I edited the first dating doc link, which was broken.)
Context-sensitivity: the goals that a corrigible AGI pursues should depend sensitively on the intentions of its human users when it’s run.
Default off: a corrigible AGI run in a context where the relevant intentions or instructions aren’t present shouldn’t do anything.
Explicitness: a corrigible AGI should explain its intentions at a range of different levels of abstraction before acting. If its plan stops being a central example of the explained intentions (e.g. due to unexpected events), it should default to a pre-specified fallback.
Goal robustness: a corrigible AGI should maintain corrigibility even after N adversarially-chosen gradient steps (the higher the better).
Satiability: a corrigible AGI should be able to pre-specify a rough level of completeness of a given task, and then shut down after reaching that point.
I was surprised by this quote. On following the link, the sentence by itself seems noticeably out of context; here’s the next part:
On the growing artificial intelligence market: “AI will probably most likely lead to the end of the world, but in the meantime, there’ll be great companies.” On what Altman would do if he were President Obama: “If I were Barack Obama, I would commit maybe $100 billion to R&D of AI safety initiatives.” Altman also shared that he recently invested in a company doing “AI safety research” to investigate the potential risks of artificial intelligence.
RL usually applies some discount rate, and also caps episodes at a certain length, so that an action taken at a given time isn’t reinforced very much (or at all) for having much longer-term consequences.
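(As a concrete illustration of how discounting and episode caps mute long-horizon consequences, here’s a sketch of my own with made-up numbers:)

```python
# Sketch (my own illustration, made-up numbers): how much credit an action
# receives for a reward that arrives `delay` steps later, under a discount
# rate gamma and a hard episode cap.
gamma = 0.99
episode_length = 100

def credit(delay):
    # Rewards beyond the end of the episode are never observed, so they
    # contribute nothing; earlier rewards are shrunk by gamma ** delay.
    if delay >= episode_length:
        return 0.0
    return gamma ** delay

print(credit(1))    # ~0.99: near-term consequences are reinforced almost fully
print(credit(50))   # ~0.61: medium-term consequences are partially reinforced
print(credit(500))  # 0.0: consequences beyond the episode cap aren't reinforced at all
```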
How does this compare to evolution? At equilibrium, I think that a gene which increases the fitness of its bearers in N generations’ time is just as strongly favored as a gene that increases the fitness of its bearers by the same amount straightaway. As long as it was already widespread at least N generations ago, they’re basically the same thing, because current gene-holders benefit from the effects of the gene-holders from N generations ago.
That gene would evolve much more slowly, though. Plus in practice it’s hard to ensure that the benefits accrue only to gene-holders, and there’s so much variance in the environment that for N greater than 3 or 4 this seems pretty implausible. Still, the disanalogy seems kinda interesting.
But what makes you so confident that it’s not possible for subject-matter experts to have correct intuitions that outpace their ability to articulate legible explanations to others?
Yepp, this is a judgement call. I don’t have any hard and fast rules for how much you should expect experts’ intuitions to plausibly outpace their ability to explain things. A few things which inform my opinion here:
Explaining things to other experts should be much easier than explaining them to the public.
Explaining things to other experts should be much easier than actually persuading those experts.
It’s much more likely that someone has correct intuitions if they have a clear sense of what evidence would make their intuitions stronger.
I don’t think Eliezer is doing particularly well on any of these criteria. In particular, the last one was why I pressed Eliezer to make predictions rather than postdictions in my debate with him. The extent to which Eliezer seemed confused that I cared about this was a noticeable update for me in the direction of believing that Eliezer’s intuitions are less solid than he thinks.
It may be the case that Eliezer has strong object-level intuitions about the details of how intelligence works which he’s not willing to share publicly, but which significantly increase his confidence in his public claims. If so, I think the onus is on him to highlight that so people can make a meta-level update on it.
Please don’t send these types of emails; I expect that they’re actively counterproductive for high-profile recipients.
If you want to do outreach, there are clear channels which should be used for coordinating it. For example, you could contact DeepMind’s alignment team and ask them whether there’s anything that would be useful for them.
I think it’s less about how many holes there are in a given plan, and more like “how much detail does it need before it counts as a plan?” If someone says that their plan is “Keep doing alignment research until the problem is solved”, then whether or not there’s a hole in that plan is downstream of all the other disagreements about how easy the alignment problem is. But it seems like, separate from the other disagreements, Eliezer tends to think that having detailed plans is very useful for making progress.
Analogy for why I don’t buy this: I don’t think that the Wright brothers’ plan to solve the flying problem would count as a “plan” by Eliezer’s standards. But it did work.
Strong +1s to many of the points here. Some things I’d highlight:
Eliezer is not doing the type of reasoning that can justifiably defend the level of confidence he claims to have. If he were, he’d have much more to say about the specific details of consequentialism, human evolution, and the other key intuitions shaping his thinking. In my debate with him he mentioned many times how difficult he’s found it to explain these ideas to people. I think if he understood these ideas well enough to justify the confidence of his claims, then he wouldn’t have found that as difficult. (I’m sympathetic about Eliezer having in the past engaged with many interlocutors who were genuinely very bad at understanding his arguments. However, it does seem like the lack of detail in those arguments is now a bigger bottleneck.)
I think that the intuitions driving Eliezer’s disagreements with many other alignment researchers are interesting and valuable, and would love to have better-fleshed-out explanations of them publicly available. Eliezer would probably have an easier time focusing on developing his own ideas if other people in the alignment community who were pessimistic about various research directions, and understood the broad shape of his intuitions, were more open and direct about that pessimism. This is something I’ve partly done in this post; and I’m glad that Paul’s partly done it here.
I like the analogy of a mathematician having intuitions about the truth of a theorem. I currently think of Eliezer as someone who has excellent intuitions about the broad direction of progress at a very high level of abstraction—but where the very fact that these intuitions are so abstract rules out the types of path-dependencies that I expect solutions to alignment will actually rely on. At this point, people who find Eliezer’s intuitions compelling should probably focus on fleshing them out in detail—e.g. using toy models, or trying to decompose the concept of consequentialism—rather than defending them at a high level.
Meta-level: +1 for actually writing a thing.
Also meta-level: −1 because when I read this I get the sense that you started from a high-level intuition, constructed a set of elaborate explanations of that intuition, and then phrased it as an argument.
I personally find this frustrating because I keep seeing people being super confident in their high-level intuitive metaphorical view of consequentialism and then never doing the work of actually digging beneath those metaphors. (Less a criticism of this post, more a criticism of everyone upvoting this post.)
In this case, this cashes out in claims like “agency is orthogonal to optimization power” which are clearly false for any reasonable definitions of agency and optimization power, and only seem to make sense when you’re operating at a level of abstraction that’s far too high to be useful.
Thanks for the post, I think it’s a useful framing. Two things I’d be interested in understanding better:
In the one real example of intelligence being developed we have to look at, continuous application of natural selection in fact found Homo sapiens sapiens, and the capability-gain curves of the ecosystem for various measurables were in fact sharply kinked by this new species (e.g., using machines, we sharply outperform other animals on well-established metrics such as “airspeed”, “altitude”, and “cargo carrying capacity”).
As I said in a reply to Eliezer’s AGI ruin post:
There are some ways in which AGI will be analogous to human evolution. There are some ways in which it will be disanalogous. Any solution to alignment will exploit at least one of the ways in which it’s disanalogous. Pointing to the example of humans without analysing the analogies and disanalogies more deeply doesn’t help distinguish between alignment proposals which usefully exploit disanalogies, and proposals which don’t.
So I’d be curious to know what you think the biggest disanalogies are between the example of human evolution and building AGI. Relatedly, would you consider raising a child to be a “real example of intelligence being developed”; why or why not?
Many different training scenarios are teaching your AI the same instrumental lessons, about how to think in accurate and useful ways. Furthermore, those lessons are underwritten by a simple logical structure
Granting that there’s a bunch of logical structure around how to think in accurate ways (e.g. solving scientific problems), and a bunch of logical structure around how to pursue goals coherently (e.g. avoiding shutdown), what’s the strongest reason to believe that agents won’t learn something closely approximating the former before they learn something closely approximating the latter? My impression of Eliezer’s position is that it’s because they’re basically the same structure—if you agree with this, I’d be curious what sort of intuitions or theorems are most responsible for this belief.
(Another way of phrasing this question: suppose I made an analogous argument before the industrial revolution, saying something like “matter and energy are fundamentally the same thing at a deep level, we’ll soon be able to harness superhuman amounts of energy, therefore we’re soon going to be able to create superhuman amounts of matter”. Yet in fact, while the premise of mass-energy equivalence is true, the constants are such that it takes stupendously more energy than humans can generate to produce human-sized piles of matter. What’s the main thing that makes you think that the constants in the intelligence case are such that AIs will converge to goal-coherence before, or around the same time as, reaching superhuman scientific capabilities?)
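(For concreteness, the back-of-envelope version of the constants point, using my own illustrative numbers: creating a human-sized pile of matter, say $m \approx 70\,\mathrm{kg}$, requires at minimum

$$E = mc^2 \approx 70\,\mathrm{kg} \times \left(3\times 10^{8}\,\mathrm{m/s}\right)^2 \approx 6\times 10^{18}\,\mathrm{J},$$

roughly 1.5 gigatons of TNT equivalent, and that assumes perfect conversion of energy into matter, which no known process comes anywhere close to achieving.)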