Thanks for your yearly update!

On the plan:

What is The Plan for AI alignment? Briefly: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Then, look through our AI’s internal concepts for a good alignment target, and Retarget the Search.
I think this won’t work because many human-value-laden concepts aren’t very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
Suppose the natural abstraction hypothesis[2] is basically correct, i.e. a wide variety of minds trained/evolved in the same environment converge to use basically-the-same internal concepts.
… Then it’s pretty likely that neural nets end up with basically-human-like internal concepts corresponding to whatever stuff humans want.
… So in principle, it shouldn’t take that many bits-of-optimization to get nets to optimize for whatever stuff humans want.
… Therefore if we just kinda throw reward at nets in the obvious ways (e.g. finetuning/RLHF), and iterate on problems for a while, maybe that just works?
In the linked post, I gave that roughly a 10% chance of working. I expect the natural abstraction part to basically work, the problem is [...]
I think the natural abstraction part here does not work—not because natural abstractions aren’t a thing—but because there’s an exception for abstractions that are dependent on the particular mind architecture an agent has.
Concepts like “love”, “humor”, and probably “consciousness” may be natural for humans but probably less natural for AIs.
But also we cannot just wire up those concepts into the values of an AI and expect the AI’s values to generalize correctly. The way our values generalize—how we will decide what to value as we grow smarter and do philosophical reflection—seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes’ agenda), we’d need to point the AI to an indirect specification of what we value, i.e. something like CEV (coherent extrapolated volition). And CEV doesn’t seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process at it (even assuming we have a retargetable general-purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
Generally, we aim to work on things which are robust bottlenecks to a broad space of plans. In particular, our research mostly focuses on natural abstraction, because that seems like the most robust bottleneck on which (not-otherwise-doomed) plans get stuck.
However, it does not look to me like you are making much progress relative to your stated beliefs about how close you are, i.e. relative to this statement from your 2024 update (which sounded like it was based on roughly 10-year timelines):
Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track.
So here are some thoughts on how your progress looks to me, although I haven’t been following your research in detail since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you’re making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn’t be allowed to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality… Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...
That is, you look at a few examples and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had the feeling that the post was trying to explain too much at once—lumping together things as natural latents that seem very importantly different, and in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (clusters in thingspace, like “tree” in general as opposed to a particular tree), although I couldn’t explain it well at the time. Later I studied a bit of formal language semantics, where this distinction is just total 101 basics.
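(To spell out the distinction I mean, here is the standard typing from Montague-style formal semantics, as I understand it; the constant $c_{17}$ is just a made-up name for one particular tree:)

$$
\underbrace{c_{17}}_{\text{a particular tree}} : e
\qquad\text{vs.}\qquad
\underbrace{\lambda x.\,\mathrm{Tree}(x)}_{\text{the kind “tree”}} : \langle e, t\rangle
$$

A particular tree denotes an individual of type $e$, while the kind “tree” denotes a one-place predicate of type $\langle e, t\rangle$, i.e. a function from individuals to truth values. These are different sorts of things, and this is the kind of difference I felt the natural latents framing was gliding over.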
I studied language a bit and tried to carve up in a bit more detail what types of abstraction there are, which I wrote up here. But really I think that’s still too abstract and too top-down, and one probably needs to study particular words in a lot of detail, then similar words, and so on.
Not that this kind of study of language is necessarily the best way to proceed with alignment—I didn’t continue it after my 5-month language-and-orcas exploration. But I do think concretely studying observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself, it does not look to me like natural latents are much progress. But again, I didn’t follow your work in detail, so if you have concrete plans or evidence for how it’s going to be useful for pointing AIs, let me know.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you’re just about to find some sort of definitive theory of concepts. there’s just SO MUCH different stuff going on with concepts! wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there’s SO MANY questions! there’s a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! “what’s the formula for good concepts?” should sound to us like “what’s the formula for useful technologies?” or “what’s the formula for a strong economy?”. there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: “retarget the search to human values” sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind’s values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
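(to make the thing i’m pushing back on concrete, here’s a toy cartoon of the “goal slot” picture of retargetable search. this is purely my own illustrative sketch; `Agent`, `goal`, `retarget_the_search` etc. are made-up names, not anyone’s actual proposal or code:)

```python
# toy cartoon of the "goal slot" picture of retargetable search
# (purely illustrative; all names here are made up)
from dataclasses import dataclass
from typing import Callable, Dict, List

Outcome = Dict[str, object]


@dataclass
class Agent:
    world_model: Dict[str, object]    # learned concepts, supposedly including "human values"
    goal: Callable[[Outcome], float]  # the hypothesized goal slot: a criterion over predicted outcomes

    def predict(self, plan: Dict[str, object]) -> Outcome:
        # stand-in for the world model predicting the outcome of a plan
        return {**plan, "predicted": True}

    def search(self, candidate_plans: List[Dict[str, object]]) -> Dict[str, object]:
        # general-purpose search: pick the plan whose predicted outcome best satisfies the goal
        return max(candidate_plans, key=lambda plan: self.goal(self.predict(plan)))


def retarget_the_search(agent: Agent, new_goal: Callable[[Outcome], float]) -> None:
    # the proposed "small edit": just overwrite whatever sits in the goal slot
    agent.goal = new_goal
```

my claim is that neither humans nor (probably) the first AGIs factor like this: there is no single `goal` attribute sitting in one place that a small edit could rewrite, because the valuing is smeared across the whole mind.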
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable “safely”/”value-preservingly”) they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it’s plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
you could try to make the human feel good about plans for futures that involve learning a bunch of analysis and number theory, and about plan-relevant visions of the future that involve having a proof of the riemann hypothesis in hand in particular, and so on. it seems pretty clear that this doesn’t generalize correctly, and in particular that the human isn’t actually going to do the deeply unnatural thing of committing suicide after finishing the rest.[1] i think it’s very unlikely that they’ll even focus much on proving the riemann hypothesis in particular. if you’re really good at this sort of editing, maybe they will get really into analysis and number theory for a while, i guess, and it might even affect what happens in the very far future.[2] but the far future isn’t going to look like what you wanted.
with like 100 years of philosophy and neuroscience research, i think one might get into a position where one could edit a human to be locally trying to solve some math problem for like 10 minutes, with the edit doing sth like what happens when one naturally just decides to work on a math problem without it fitting into one’s life/plans in any very deep way, eg just to learn. there is retargetable search in humans in that sense, and i think it’s probable sth like this will be present in the first AGI as well. but this is different than editing the human to have some specific different long-term values. on longer timescales than 10 minutes, the human will have their mess-values kick in again, implemented in/as eg many context-specific drives, habits, explicit and implicit principles, explicit and implicit meta-principles, understanding of the purposes of various things, ways of harmonizing attitudes, various processes running on various human institutions and other humans, etc.[3] it would be a motte and bailey to argue “it is generic for a mind to have at least some sort of targetable search ability” (in a way that considers the 10 min thing one could in principle do to a human an example), and then to go from this to “it is generic for a mind as a whole to have some sort of retargetable search structure, like with an ultimate goal slot in which something can be written”.
you could try to edit the human’s memories in a really careful way to make them think that they have made a very serious promise to do this riemann hypothesis thing. doing this is probably not possible in all but at most a very small fraction of humans, because humans almost universally don’t have a strong enough promise capability to actually stick to this over the very long term. (actually, i’d mostly guess it’s not possible in any humans, because it’s such a fucked thing to promise. what would the story be of how you made this promise, of which you now have fake memories? maybe there’s some construction… but the promise-keeping part will have to fight a huge long-term war against all the many other value-bearing components of the human, that are all extremely unhappy about this life path.) also, if it were possible to plant plausible memories of making the promise (maybe with different choices for the details of the promise), you could probably just have the human make the promise the good old-fashioned way. anyway, default AGIs won’t be deeply social beings like humans, so it would be extremely weird for an AGI to already have machinery for making promises installed. it’s also extremely difficult to do this in a way where the guy never realizes they were tampered with; if they do realize, they aren’t actually bound by the promise and will probably just ignore it.
but maybe there’s a better sort of thing you could try on a human, that i’m not quickly thinking of?
maybe the position is “humans aren’t retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one”. it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won’t even remotely be a nice cleavage between values and understanding
a response: the issue is that i’ve chosen an extremely unnatural task. a counterresponse: it’s also extremely unnatural to have one’s valuing route through an alien species, which is what the proposal wants to do to the AI
that said, i think it’s also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it’s reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don’t touch. in these cases, these edits would not affect the far future, at least not in the straightforward way
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general.
wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining!
Btw if you mean there are 10k contributions already that are on the level of John’s contributions, I strongly disagree with this. I’m not sure whether John’s math is significantly useful, and I don’t think it’s been that much progress relative to “almost on track to maybe solve alignment”, but in terms of (alignment) philosophy John’s work is pretty great compared to academic philosophy.
In terms of general alignment philosophy (not just work on concepts but also other insights), I’d probably put John’s collective works in 3rd place after Eliezer Yudkowsky and Douglas Hofstadter. The latter is on the list mainly because of Surfaces and Essences, which I can recommend (@johnswentworth).
That is, I’d probably put John above people like Wittgenstein, although I admit I don’t know that much about the works of philosophers like Wittgenstein. It could be that there are more insights in the collective works of Wittgenstein, but if I’d need to read through 20x the volume because he doesn’t write clearly enough, that’s still a point against him. Even if a lot of John’s insights have been said before somewhere, writing insights clearly provides a lot of value.
Although John’s work on concepts plays a smaller role in what I think makes John a good alignment philosopher than it does in his alignment research. Partly I think John just has some great individual posts, like science in a high dimensional world, you’re not measuring what you think you’re measuring, and why agent foundations (coining the term “true names”), and probably a couple more less-known older ones that I haven’t processed fully yet. And generally his philosophy that you need to figure out the right ontology is good.
But also tbc, this is just alignment philosophy. In terms of alignment research, he’s a bit further down my list, e.g. also behind Paul Christiano and Steven Byrnes.
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it’s plausible that this line of inquiry is just about to find some sort of definitive theory of concepts.
(i expect you will still have a meaningfully lower number. i could be convinced it’s more like 1000, but i think it’s very unlikely to be like 100.) i think wentworth obviously ranks much higher than that on other measures: eg if you rank people on publicly displayed alignment understanding, he’s very likely in the top 10.
Unless we have an AI that shares our mind architecture (like in Steven Byrnes’ agenda)
I think there’s an important distinction here between (a) “including human value concepts” and (b) “being able to point at human value concepts”. Systems sharing our mind architecture make (a) more likely but do not make (b) more likely, and I think (b) is required for good outcomes.