What is The Plan for AI alignment? Briefly: Sort out our fundamental confusions about agency and abstraction enough to do interpretability that works and generalizes robustly. Then, look through our AI’s internal concepts for a good alignment target, and Retarget the Search.
I think this won’t work because many human-value-laden concepts aren’t very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
Suppose the natural abstraction hypothesis[2] is basically correct, i.e. a wide variety of minds trained/evolved in the same environment converge to use basically-the-same internal concepts.
… Then it’s pretty likely that neural nets end up with basically-human-like internal concepts corresponding to whatever stuff humans want.
… So in principle, it shouldn’t take that many bits-of-optimization to get nets to optimize for whatever stuff humans want.
… Therefore if we just kinda throw reward at nets in the obvious ways (e.g. finetuning/RLHF), and iterate on problems for a while, maybe that just works?
In the linked post, I gave that roughly a 10% chance of working. I expect the natural abstraction part to basically work, the problem is [...]
I think the natural abstraction part here does not work—not because natural abstractions aren’t a thing, but because there’s an exception for abstractions that depend on the particular mind architecture an agent has.
Concepts like “love”, “humor”, and probably “consciousness” may be natural for humans but probably less natural for AIs.
But also we cannot just wire up those concepts into the values of an AI and expect the AI’s values to generalize correctly. The way our values generalize—how we will decide what to value as we grow smarter and do philosophical reflection—seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes’ agenda), we’d need to point the AI to an indirect specification of what we value, aka CEV. And CEV doesn’t seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process to it (even assuming we have a retargetable general purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
Generally, we aim to work on things which are robust bottlenecks to a broad space of plans. In particular, our research mostly focuses on natural abstraction, because that seems like the most robust bottleneck on which (not-otherwise-doomed) plans get stuck.
However, it does not look to me like you are making much progress relative to your stated beliefs about how close you are. That is, relative to this (from your 2024 update, where the statement sounded like it was based on roughly 10-year timelines):
Earlier this year, David and I estimated that we’d need roughly a 3-4x productivity multiplier to feel like we were basically on track.
So here are some thoughts on how your progress looks to me, although I haven’t been following your research in detail since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you’re making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
the intellect mustn’t be allowed to jump—to fly—from particulars a long way up to axioms that are of almost the highest generality… Our only hope for good results in the sciences is for us to proceed thus: using a valid ladder, we move up gradually—not in leaps and bounds—from particulars to lower axioms, then to middle axioms, then up and up...
Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling of the post trying to explain too much at once—lumping together things as natural latents that seem very importantly different, and in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace, e.g. “tree” as opposed to a particular tree), although I couldn’t explain it well at the time. Later I studied a bit of formal language semantics, and the distinction there is just total 101 basics.
I studied language a bit and tried to carve up, in a bit more detail, what types of abstractions there are, which I wrote up here. But really I think that’s still too abstract and too top-down, and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment—I didn’t continue it after my 5 month language-and-orcas-exploration. But I do think concrete study of observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself it does not look to me like natural latents is much progress. But again I didn’t follow your work in detail and if you have concrete plans or evidence of how it’s going to be useful for pointing AIs then lmk.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you’re just about to find some sort of definitive theory of concepts. there’s just SO MUCH different stuff going on with concepts! wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there’s SO MANY questions! there’s a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! “what’s the formula for good concepts?” should sound to us like “what’s the formula for useful technologies?” or “what’s the formula for a strong economy?”. there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: “retarget the search to human values” sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind’s values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable “safely”/”value-preservingly”) they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it’s plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
you could try to make the human feel good about plans for futures that involve learning a bunch of analysis and number theory, and about plan-relevant visions of the future that involve having a proof of the riemann hypothesis in hand in particular, and so on. it seems pretty clear that this doesn’t generalize correctly, and in particular that the human isn’t actually going to do the deeply unnatural thing of committing suicide after finishing the rest.[1] i think it’s very unlikely that they’ll even focus much on proving the riemann hypothesis in particular. if you’re really good at this sort of editing, maybe they will get really into analysis and number theory for a while, i guess, and it might even affect what happens in the very far future.[2] but the far future isn’t going to look like what you wanted.
with like 100 years of philosophy and neuroscience research, i think one might get into a position where one could edit a human to be locally trying to solve some math problem for like 10 minutes, with the edit doing sth like what happens when one naturally just decides to work on a math problem without it fitting into one’s life/plans in any very deep way, eg just to learn. there is retargetable search in humans in that sense, and i think it’s probable sth like this will be present in the first AGI as well. but this is different than editing the human to have some specific different long-term values. on longer timescales than 10 minutes, the human will have their mess-values kick in again, implemented in/as eg many context-specific drives, habits, explicit and implicit principles, explicit and implicit meta-principles, understanding of the purposes of various things, ways of harmonizing attitudes, various processes running on various human institutions and other humans, etc.[3] it would be a motte and bailey to argue “it is generic for a mind to have at least some sort of targetable search ability” (in a way that considers the 10 min thing one could in principle do to a human an example), and then to go from this to “it is generic for a mind as a whole to have some sort of retargetable search structure, like with an ultimate goal slot in which something can be written”.
you could try to edit the human’s memories in a really careful way to make them think that they have made a very serious promise to do this riemann hypothesis thing. doing this is probably not possible in all but at most a very small fraction of humans because humans almost universally don’t have a strong enough promise capability to actually stick to this over the very long term. (actually, i’d mostly guess it’s not possible in any humans, because it’s such a fucked thing to promise. what would the story be of how you made this promise, of which you now have fake memories? maybe there’s some construction… but the promise-keeping part will have to fight a huge long-term war against all the many other value-bearing components of the human, that are all extremely unhappy about this life path.) also, if it were possible to plant plausible memories of making the promise (maybe with different choices for the details of the promise), you could probably just have the human make the promise the good old-fashioned way. anyway, default AGIs won’t be deeply social beings like humans, so it would be extremely weird for an AGI to already have machinery for making promises installed. it’s also extremely difficult to do this in a way that the guy never realizes they were just tampered with and so isn’t actually bound by the promise (after realizing which they will probably ignore it).
but maybe there’s a better sort of thing you could try on a human, that i’m not quickly thinking of?
maybe the position is “humans aren’t retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one”. it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won’t even remotely be a nice cleavage between values and understanding
a response: the issue is that i’ve chosen an extremely unnatural task. a counterresponse: it’s also extremely unnatural to have one’s valuing route through an alien species, which is what the proposal wants to do to the AI
that said, i think it’s also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it’s reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don’t touch. in these cases, these edits would not affect the far future, at least not in the straightforward way
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general.
I think the core intuition that makes me believe some sort of relatively simple edit might possibly achieve this comes from the observation that I can ask myself what plans I would make if I had some arbitrary different set of goals, and the plans my brain supplies in answer aren’t much worse than those I make for the goals I actually have. This indicates that my plan-making capacity is, at least on short time scales, essentially orthogonal to my goals and can be re-pointed in arbitrary directions very readily. If an edit can trigger that same process, but stop my brain from ever ceasing the mental motion of reasoning through the hypothetical, that would already be an impressive amount of targetable general optimisation power.
To be clear, I am not suggesting that the actual edit one would actually make to an ASI in real life looks much like making the ASI start a thought experiment or roleplay that never stops.
(Though current “alignment” techniques for current AIs do seem to work sort of like that, and I think that actually isn’t entirely a coincidence.) I am just trying to gesture at an intuition pump for why one might think that the optimisation power of some general minds that occur in real life could be quite readily and precisely re-targetable if you can manipulate their internals.
A related intuition: Many general agents solve problems by, for example, recursively hacking them up into subproblems, or recursively relating them to easier problems, and then solving these other problems instead. To the extent the agents solve the many different problems using one general set of optimisation machinery, that general optimisation machinery needs to be very readily and precisely retargetable at arbitrary problems. If you could get inside these retargeting loop(s), you could perhaps exploit them to point the agent along a very different optimisation trajectory, or make a new agent out of the existing agent relatively cheaply (there isn’t actually a hard distinction between these two options of course).
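To make the “one general optimisation machinery, many goals” picture concrete, here is a deliberately toy sketch (my own illustration, not anyone’s model of a mind; the names `bfs_plan` and `successors` are made up for this example): a single generic planner whose goal is an ordinary swappable parameter, so “retargeting” is just passing a different goal predicate.

```python
# Toy illustration only: a generic breadth-first planner where the goal is an
# ordinary swappable parameter, so "retargeting the search" is just handing
# the same machinery a different goal predicate.
from collections import deque

def bfs_plan(start, goal_test, successors):
    """Return a shortest action sequence from start to a state satisfying
    goal_test (None if the frontier is ever exhausted)."""
    frontier = deque([(start, [])])
    seen = {start}
    while frontier:
        state, path = frontier.popleft()
        if goal_test(state):
            return path
        for action, nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, path + [action]))
    return None

# A trivial "world" over integers: the available actions are +1 and *2.
# (The search only terminates here because the goals below are reachable.)
def successors(n):
    return [("+1", n + 1), ("*2", n * 2)]

# The same planning machinery, pointed at two different goals:
plan_a = bfs_plan(1, lambda n: n == 10, successors)
plan_b = bfs_plan(1, lambda n: n == 7, successors)
```

Of course, the force of the objections upthread is precisely that real minds don’t seem to have a clean, explicit goal slot like `goal_test` here; the toy only shows what “retargetable search” would mean in the easiest possible case, not that minds are like this.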
If an edit can trigger that same process, but stop my brain from ever ceasing the mental motion of reasoning through the hypothetical, that would already be an impressive amount of targetable general optimisation power.
I agree with the first part of this, but tentatively disagree with the second. I’m plausibly/probably on board with editing for short time scales making sense[1], but I think it’s cursed to make an edit that makes it so you don’t cease to work on the problem. For a concrete example, let’s consider a smart human on a deserted island for 50 years, with lots of resources so staying alive is easy and by default the human could do whatever they want.[2] Do you think that there is a fairly “small”/”simple” edit that could be made to this human at the beginning so that for their 50 years on the island, they will be working on some particular hard open problem in algebraic number theory a significant fraction of the time?[3] This seems really cursed to me. What would it look like to be this human after the edit? What happens when the human thinks “wait why am I working on this problem again?” or “what should I be doing?”? What happens when the human gets drawn toward other questions, as they are by default? One could try to edit away the machinery that makes the human ask such questions and the machinery that makes the human get interested in various things, but I think that asking these questions is caused/constituted by structure/processes such that removing them in any simple way breaks the human’s thinking, as is getting interested in various things. In particular, to have a chance of solving the hard math problem, it seems like the human needs to be able to ask questions in a basically open-ended way, and needs to be able to really think about what questions should be asked, but this is in a great deal of tension with not asking “wait why am I working on this problem again?” or “what should I be doing?”. There are some thinking-structures/processes determining what questions the human is interested in, and these are crucial for selecting other questions to study which help the human understand stuff important for solving the math problem (e.g. coming up with toy special cases of the problem, e.g. studying other related problems, e.g. trying to solve subproblems after proposing some decomposition), and it’s really hard to keep this stuff functional while making the human not ever think “wait why am I working on this problem again?”. It seems cursed to not have the human ask this question implicitly, and also cursed to not have the human ask this question explicitly. One possible way out is to say: ok, we just let the human ask “wait why am I doing this?”, and make it so some answer is consistently provided which makes it seem reasonable to the human to continue working on the problem. I have a hard time coming up with what this could look like. Here are some options I’ve considered:
We could try to make the human think that they will be rewarded a lot for solving this math problem by implanting false memories of some past events (like that there are hidden overseers who will take the human back to civilization once the human solves the math problem). I think this could maybe be done, but it has the major issue that this makes sense roughly if and only if you could just actually have a robust verification+reward setup that would make the human instrumentally want to solve the math problem. But then you could just actually set up that mechanism and not have to do this memory editing at all, so your clever search retargeting is redundant. Like, also in the AI case, just set up the reward/verification mechanism instead of implanting these false memories of it being present. Of course, setting up such a mechanism can be hard, but then also it’s hard to implant false memories of a compelling such mechanism being present? There are ways this isn’t a completely perfect implication though[4] and then maybe it would matter to have this search retargeting ability? I could be convinced retargeting is maybe useful if one could provide a reasonable construction of this kind. I’m flagging this for myself as something to think more about later.
We could try to make edits that turn the human into someone who is just deeply interested in this particular problem, like maybe Andrew Wiles was wrt Fermat’s last theorem[5]. I think there could be a way to do this for problems you are already close to being really interested in — e.g. one might be able to subtly nudge how Terence Tao assigns intuitive interestingness to many questions/[open problems]/topics and get him to spend much more time on the Collatz conjecture or the Riemann hypothesis or whatever. It seems really hard to figure out a subtle nudge that gets someone to be really interested in a problem which isn’t already something they are very familiar with (like, imagine the problem looking like just another statement with 5 quantifiers involving some unfamiliar objects) — like, the nudging method relies on increasing the valences of various other intellectual things which solving the problem would contribute to, but this isn’t possible if you don’t already have mental structure/content around those questions set up, and it seems hard to set up that structure/content except by having the human study those other things, but that’s sort of a chicken and egg situation.[6] For a problem that is really novel, it’s also hard to know ahead of time what other question-representations need to be created and upvalenced — it takes work to figure out how the question relates to other questions — but getting the human to do this work is a chicken and egg situation, and we don’t want to do this work ourselves because doing it decently would plausibly constitute a significant fraction of the work required to solve the problem. 
Generally my take is that after you make the edit, the guy just pursuing what’s interesting to them will be developing pretty chaotically, and it seems cursed to try to get this chaotic system to hit something which has a low “interestingness/familiarity prior” (like, what would one upvalence when trying to get a very talented 5 year old or guy with no math-related degree to eventually make progress on the Riemann hypothesis). I guess I’ve mostly been imagining that the math problem is “universally interesting” (like any famous conjecture would be), even if initially unfamiliar, but really for the case of practical interest we should probably imagine a problem that isn’t so universally interesting. In that case, your targeting of these interestingness-dynamics needs to be even more precise, because the kinda-crap-sentence-number- attractor is smaller/weaker than the attractor for the Riemann hypothesis. Also, it seems plausible that if you could do a thing this precise, you could also just retarget the guy to be very interested in this problem in a black-box way, but not sure + haven’t thought this through carefully.
You could try to invent a religion in which solving the math problem is extremely important, and try to edit the human into an adherent. I think this has a combination of the issues of the previous two approaches.
Despite the above approaches not getting there so far, I’m open to there being some clever way to pull this off. Even though I haven’t come up with anything decent yet, thinking about this for a few hours still made me somewhat more optimistic about long-term retargeting making sense, at least toward fairly well-defined research problems and if we’re fine with losing like an order of magnitude from full effort. Overall it still looks cursed I guess, but somewhat less than I expected earlier.
Instead of doing a single edit at the beginning, we can also consider the option of continuing to do various edits across the 50 year period, as I think you mention in your comment. I think this is somewhat more promising relative to the single edit case. I spent a few hours thinking about this at some point a while ago, saw various obstacles, and didn’t come up with anything that seemed promising then, but there could be something. I’m flagging this for myself as another thing to think more about later.
(There are also additional major problems with the case of editing an AI to have human values (or maybe specifically to serve a particular human) compared to this 50 year human example. It’s a weirder target, and the edit now needs to continue to apply over arbitrary future development, which is cursed. Instead of retargeting to human values, one could retarget to e.g. making mind uploads, with a plan to use these mind uploads to ban AGI later. Unfortunately this doesn’t make sense once one considers that other AI-makers are ending the world before you get anywhere with this (and that the government is probably taking over your lab I guess), even if you have retargeting methods ready to go. In either case, you have to pause the foom in some specific somewhat super-human capability window, which is problematic because that might well be passed in only a few days of fooming (one reason: if the AI can solve a >100 year problem such as making mind uploads in a month, that suggests it could do years of human algo progress conceptual work in a day) or be passed inside your deployment run.)
like, it seems plausible/likely that with at most a few centuries of philosophy+neuroscience research, we could edit a human to be interested in a math problem for 10 minutes once, at least assuming we are granted an ability to do low-level edits for free. i have some caveat around an alien’s values being a more cursed thing to retarget to even for 10 minutes than solving a nice math problem, but i’ll ignore this because other issues seem more important/interesting for now
I’m saying the human is on an otherwise uninhabited island so we don’t have to think about the effects of interacting with other unedited people / editing many people at once.
let’s say that averaging at least idk hours a week of work [which the person reasonably takes to be directed toward the problem] would be considered a win
like, maybe you can create vague memories of some extremely good verifiers being around, initialized with little doubt registered about this, such that you can’t actually create verifiers that are this good, with this story ultimately not hanging together but hanging together well enough so as to not be taken apart in 50 years
wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining!
Btw if you mean there are 10k contributions already that are on the level of John’s contributions, I strongly disagree with this. I’m not sure whether John’s math is significantly useful, and I don’t think it’s been that much progress relative to “almost on track to maybe solve alignment”, but in terms of (alignment) philosophy John’s work is pretty great compared to academic philosophy.
In terms of general alignment philosophy (not just work on concepts but also other insights), I’d probably put John’s collective works in 3rd place after Eliezer Yudkowsky and Douglas Hofstadter. The latter is on the list mainly because of Surfaces and Essences, which I can recommend (@johnswentworth).
Aka I’d probably put John above people like Wittgenstein, although I admit I don’t know that much about the works of philosophers like Wittgenstein. Could be that there are more insights in the collective works of Wittgenstein, but if I’d need to read through 20x the volume because he doesn’t write clearly enough that’s still a point against him. Even if a lot of John’s insights have been said before somewhere, writing insights clearly provides a lot of value.
Although John’s work on concepts plays a smaller role in what I think makes John a good alignment philosopher than it does in his alignment research. Partially I think John just has some great individual posts like science in a high dimensional world, you’re not measuring what you think you’re measuring, why agent foundations (coining the word true names), and probably a couple more less known older ones that I haven’t processed fully yet. And generally his philosophy that you need to figure out the right ontology is good.
But also tbc, this is just alignment philosophy. In terms of alignment research, he’s a bit further down my list, e.g. also behind Paul Christiano and Steven Byrnes.
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it’s plausible that this line of inquiry is just about to find some sort of definitive theory of concepts.
(i expect you will still have a meaningfully lower number. i could be convinced it’s more like 1000 but i think it’s very unlikely to be like 100.) i think wentworth is obviously much higher eg if you rank people on publicly displayed alignment understanding, very likely in the top 10
Unless we have an AI that shares our mind architecture (like in Steven Byrnes’ agenda)
I think there’s an important distinction here between (a) “including human value concepts” and (b) “being able to point at human value concepts”. Systems sharing our mind architecture make (a) more likely but do not make (b) more likely, and I think (b) is required for good outcomes.
I think this won’t work because many human-value-laden concepts aren’t very natural for an AI. More specifically, in the 2023 version of the plan you wrote:
I think the natural abstraction part here does not work—not because natural abstractions aren’t a thing—but because there’s an exception for abstractions that are dependent on the particular mind architecture an agent has.
Concepts like “love”, “humor”, and probably “consciousness” may be natural for humans but probably less natural for AIs.
But also we cannot just wire up those concepts into the values of an AI and expect the AI’s values to generalize correctly. The way our values generalize—how we will decide what to value as we grow smarter and do philosophical reflection—seems quite contingent on our mind architecture. Unless we have an AI that shares our mind architecture (like in Steven Byrnes’ agenda), we’d need to point the AI to an indirect specification of what we value, aka CEV. And CEV doesn’t seem like a simple natural abstraction that an AI would learn without us teaching it about CEV. And even if it knows CEV because we taught it, I find it hard to imagine how we would point the search process to it (even assuming we have a retargetable general purpose search).
Also see here and here. But mainly I think you need to think a lot more concretely about what goal we actually want to point the AI at.
Although I agree with this:
However, it does not look to me like you are making much progress relative to your stated beliefs of how close you are. Aka relative to (from your 2024 update where this statement sounded like it’s based on 10ish year timelines):
So here are some thoughts on how your progress looks to me, though I haven't been following your research in detail since summer 2024 (after your early natural latents posts):
Basically, it seems to me like you’re making the mistake of Aristotelians that Francis Bacon points out in the Baconian Method (or Novum Organum generally):
Aka, you look at a few examples, and directly try to find a general theory of abstraction. I think this makes your theory overly simplistic and probably basically useless.
Like, when I read Natural Latents: The Concepts, I already had a feeling of the post trying to explain too much at once—lumping together things as natural latents that seem very importantly different, and also in some cases natural latents seemed like a dubious fit. I started to form an intuitive distinction in my mind between objects (like a particular rigid body) and concepts (like clusters in thingspace, like “tree” as opposed to a particular tree), although I couldn’t explain it well at the time. Later I studied a bit of formal language semantics, and the distinction there is just total 101 basics.
I studied language a bit and tried to carve up in a bit more detail what types of abstractions there are, which I wrote up here. But really I think that’s still too abstract and still too top-down and one probably needs to study particular words in a lot of detail, then similar words, etc.
Not that this kind of study of language is necessarily the best way to proceed with alignment—I didn’t continue it after my 5 month language-and-orcas-exploration. But I do think concrete study of observations and abstracting slowly is important.
ADDED: Basically, from having tried a little to understand natural/human ontologies myself it does not look to me like natural latents is much progress. But again I didn’t follow your work in detail and if you have concrete plans or evidence of how it’s going to be useful for pointing AIs then lmk.
+1 to this. to me this looks like understanding some extremely toy cases a bit better and thinking you’re just about to find some sort of definitive theory of concepts. there’s just SO MUCH different stuff going on with concepts! wentworth+lorell’s work is interesting, but so much more has been understood about concepts in even other existing literature than in wentworth+lorell’s work (i’d probably say there are at least 10000 contributions to our understanding of concepts in at least the same tier), with imo most of the work remaining! there’s SO MANY questions! there’s a lot of different structure in eg a human mind that is important for our concepts working! minds are really big, and not just in content but also in structure (including the structure that makes concepting tick in humans)! and minds are growing/developing, and not just in content but also in structure (including the structure that makes concepting tick in humans)! “what’s the formula for good concepts?” should sound to us like “what’s the formula for useful technologies?” or “what’s the formula for a strong economy?”. there are very many ideas that go into having a strong economy, and there are probably very many ideas that go into having a powerful conceptive system. this has mostly just been a statement of my vibe/position on this matter, with few arguments, but i discuss this more here.
on another note: “retarget the search to human values” sounds nonsensical to me. by default (at least without fundamental philosophical progress on the nature of valuing, but imo probably even given this, at least before serious self-re-programming), values are implemented in a messy(-looking) way across a mind, and changing a mind’s values to some precise new thing is probably in the same difficulty tier as re-writing a new mind with the right values from scratch, and not doable with any small edit
concretely, what would it look like to retarget the search in a human so that (if you give them tools to become more capable and reasonable advice on how to become more capable “safely”/”value-preservingly”) they end up proving the riemann hypothesis, then printing their proof on all the planets in this galaxy, and then destroying all intelligent life in the galaxy (and committing suicide)? this is definitely a simpler thing than object-level human values, and it’s plausibly more natural than human values even in a world in which there is already humanity that you can try to use as a pointer to human values. it seems extremely cursed to make this edit in a human. some thoughts on a few approaches that come to mind:
you could try to make the human feel good about plans for futures that involve learning a bunch of analysis and number theory, and about plan-relevant visions of the future that involve having a proof of the riemann hypothesis in hand in particular, and so on. it seems pretty clear that this doesn’t generalize correctly, and in particular that the human isn’t actually going to do the deeply unnatural thing of committing suicide after finishing the rest. [1] i think it’s very unlikely that they’ll even focus much on proving the riemann hypothesis in particular. if you’re really good at this sort of editing, maybe they will get really into analysis and number theory for a while, i guess, and it might even affect what happens in the very far future. [2] but the far future isn’t going to look like what you wanted.
with like 100 years of philosophy and neuroscience research, i think one might get into a position where one could edit a human to be locally trying to solve some math problem for like 10 minutes, with the edit doing sth like what happens when one naturally just decides to work on a math problem without it fitting into one’s life/plans in any very deep way, eg just to learn. there is retargetable search in humans in that sense, and i think it’s probable sth like this will be present in the first AGI as well. but this is different than editing the human to have some specific different long-term values. on longer timescales than 10 minutes, the human will have their mess-values kick in again, implemented in/as eg many context-specific drives, habits, explicit and implicit principles, explicit and implicit meta-principles, understanding of the purposes of various things, ways of harmonizing attitudes, various processes running on various human institutions and other humans, etc. [3] it would be a motte and bailey to argue “it is generic for a mind to have at least some sort of targetable search ability” (in a way that considers the 10 min thing one could in principle do to a human an example), and then to go from this to “it is generic for a mind as a whole to have some sort of retargetable search structure, like with an ultimate goal slot in which something can be written”.
you could try to edit the human’s memories in a really careful way to make them think that they have made a very serious promise to do this riemann hypothesis thing. doing this is probably not possible in all but at most a very small fraction of humans because humans almost universally don’t have a strong enough promise capability to actually stick to this over the very long term. (actually, i’d mostly guess it’s not possible in any humans, because it’s such a fucked thing to promise. what would the story be of how you made this promise, of which you now have fake memories? maybe there’s some construction… but the promise-keeping part will have to fight a huge long-term war against all the many other value-bearing components of the human, that are all extremely unhappy about this life path.) also, if it were possible to plant plausible memories of making the promise (maybe with different choices for the details of the promise), you could probably just have the human make the promise the good old-fashioned way. anyway, default AGIs won’t be deeply social beings like humans, so it would be extremely weird for an AGI to already have machinery for making promises installed. it’s also extremely difficult to do this in a way that the guy never realizes they were just tampered with and so isn’t actually bound by the promise (after realizing which they will probably ignore it).
but maybe there’s a better sort of thing you could try on a human, that i’m not quickly thinking of?
maybe the position is “humans aren’t retargetable searchers (in their total structure, in the way needed for this plan), but the first AGI will probably be one”. it seems very likely to me that values will in fact be diffusely and messily implemented in that AGI as well. for example, there won’t even remotely be a nice cleavage between values and understanding
a response: the issue is that i’ve chosen an extremely unnatural task. a counterresponse: it’s also extremely unnatural to have one’s valuing route through an alien species, which is what the proposal wants to do to the AI
that said, i think it’s also reasonably natural to be the sort of guy who would actively try to undo any supposed value changes after the fact, and it’s reasonably natural to be the sort of guy whose long-term future is more governed by stuff these edits don’t touch. in these cases, these edits would not affect the far future, at least not in the straightforward way
these are all given their correct meaning/function only in the context of their very particular mind, in basically all its parts. so i could also say: their mind just kicks in again in general.
I think the core intuition that makes me believe some sort of relatively simple edit might possibly achieve this comes from the observation that I can ask myself what plans I would make if I had some arbitrary different set of goals, and the plans my brain supplies in answer aren’t much worse than those I make for the goals I actually have. This indicates that my plan-making capacity is, at least on short time scales, essentially orthogonal to my goals and can be re-pointed in arbitrary directions very readily. If an edit can trigger that same process, but stop my brain from ever ceasing the mental motion of reasoning through the hypothetical, that would already be an impressive amount of targetable general optimisation power.
To be clear, I am not suggesting that the actual edit one would actually make to an ASI in real life looks much like making the ASI start a thought experiment or roleplay that never stops. (Though current “alignment” techniques for current AIs do seem to work sort of like that, and I think that actually isn’t entirely a coincidence.) I am just trying to gesture at an intuition pump for why one might think that the optimisation power of some general minds that occur in real life could be quite readily and precisely re-targetable if you can manipulate their internals.
A related intuition: Many general agents solve problems by, for example, recursively hacking them up into subproblems, or recursively relating them to easier problems, and then solving these other problems instead. To the extent the agents solve the many different problems using one general set of optimisation machinery, that general optimisation machinery needs to be very readily and precisely retargetable at arbitrary problems. If you could get inside these retargeting loop(s), you could perhaps exploit them to point the agent along a very different optimisation trajectory, or make a new agent out of the existing agent relatively cheaply (there isn’t actually a hard distinction between these two options of course).
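To make the “goal enters only as a parameter” picture concrete, here is a toy sketch (my own illustration, not anything from the thread; all names are made up): a generic best-first search where the goal is supplied purely as arguments, so “retargeting” the machinery is literally just passing in a different goal test and heuristic.

```python
from heapq import heappush, heappop

def search(start, goal_test, successors, heuristic):
    """Generic best-first search: the shared 'optimisation machinery'.

    The goal enters only through `goal_test` and `heuristic`, so
    retargeting the search means passing different arguments, not
    rewriting the machinery itself."""
    frontier = [(heuristic(start), start)]
    seen = {start}
    while frontier:
        _, state = heappop(frontier)
        if goal_test(state):
            return state
        for nxt in successors(state):
            if nxt not in seen:
                seen.add(nxt)
                heappush(frontier, (heuristic(nxt), nxt))
    return None

# Toy domain: states are positive integers, moves are +1 and *2.
succ = lambda n: [n + 1, n * 2]

# Aim the machinery at 24, then retarget it to 37 by swapping the goal.
print(search(1, lambda n: n == 24, succ, lambda n: abs(24 - n)))  # 24
print(search(1, lambda n: n == 37, succ, lambda n: abs(37 - n)))  # 37
```

Of course, the disagreement in this thread is precisely about whether a real mind’s values sit in a clean slot like `goal_test` here, or are instead diffusely implemented across the whole system; the sketch only illustrates the optimist’s picture, not an argument that minds actually look like this.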
I agree with this, but tentatively disagree with the . I’m plausibly/probably on board with editing for short time scales making sense[1], but I think it’s cursed to make an edit that makes it so you don’t cease to work on the problem. For a concrete example, let’s consider a smart human on a deserted island for 50 years, with lots of resources so staying alive is easy and by default the human could do whatever they want.[2]

Do you think that there is a fairly “small”/”simple” edit that could be made to this human at the beginning so that for their 50 years on the island, they will be working on some particular hard open problem in algebraic number theory a significant fraction of the time?[3]
This seems really cursed to me. What would it look like to be this human after the edit? What happens when the human thinks “wait why am I working on this problem again?” or “what should I be doing?”. What happens when the human gets drawn toward other questions, as they are by default? One could try to edit away machinery that makes the human ask such questions and machinery that makes the human get interested in various things, but I think that asking these questions is caused/constituted by structure/processes such that removing that/those in any simple way breaks the human’s thinking, and the same goes for getting interested in various things. In particular, to have a chance of solving the hard math problem, it seems like the human needs to be able to ask questions in a basically open-ended way, and needs to be able to really think about what questions should be asked, but this is in a great deal of tension with not asking “wait why am I working on this problem again?” or “what should I be doing?”. There are some thinking-structures/processes determining what questions the human is interested in, and these are crucial for selecting other questions to study which help the human understand stuff important for solving the math problem (e.g. coming up with toy special cases of the problem, e.g. studying other related problems, e.g. trying to solve subproblems after proposing some decomposition), and it’s really hard to keep this stuff functional while making the human not ever think “wait why am I working on this problem again?”. It seems cursed to not have the human ask this question implicitly, and also cursed to not have the human ask this question explicitly.

One possible way out is to say: ok, we just let the human ask “wait why am I doing this?”, and make it so some answer is consistently provided which makes it seem reasonable to the human to continue working on the problem. I have a hard time coming up with what this could look like. Here are some options I’ve considered:
We could try to make the human think that they will be rewarded a lot for solving this math problem by implanting false memories of some past events (like that there are hidden overseers who will take the human back to civilization once the human solves the math problem). I think this could maybe be done, but it has the major issue that this makes sense roughly if and only if you could just actually have a robust verification+reward setup that would make the human instrumentally want to solve the math problem. But then you could just actually set up that mechanism and not have to do this memory editing at all, so your clever search retargeting is redundant. Like, also in the AI case, just set up the reward/verification mechanism instead of implanting these false memories of it being present. Of course, setting up such a mechanism can be hard, but then also it’s hard to implant false memories of a compelling such mechanism being present? There are ways this isn’t a completely perfect implication though [4] and then maybe it would matter to have this search retargeting ability? I could be convinced retargeting is maybe useful if one could provide a reasonable construction of this kind. I’m flagging this for myself as something to think more about later.
We could try to make edits that turn the human into someone who is just deeply interested in this particular problem, like maybe Andrew Wiles was wrt Fermat’s last theorem [5] . I think there could be a way to do this for problems you are already close to being really interested in — e.g. one might be able to subtly nudge how Terence Tao assigns intuitive interestingness to many questions/[open problems]/topics and get him to spend much more time on the Collatz conjecture or the Riemann hypothesis or whatever. It seems really hard to figure out a subtle nudge that gets someone to be really interested in a problem which isn’t already something they are very familiar with (like, imagine the problem looking like just another statement with 5 quantifiers involving some unfamiliar objects) — like, the nudging method relies on increasing the valences of various other intellectual things which solving the problem would contribute to, but this isn’t possible if you don’t already have mental structure/content around those questions set up, and it seems hard to set up that structure/content except by having the human study those other things, but that’s sort of a chicken and egg situation. [6] For a problem that is really novel, it’s also hard to know ahead of time what other question-representations need to be created and upvalenced — it takes work to figure out how the question relates to other questions — but getting the human to do this work is a chicken and egg situation, and we don’t want to do this work ourselves because doing it decently would plausibly constitute a significant fraction of the work required to solve the problem. 
Generally my take is that after you make the edit, the guy just pursuing what’s interesting to them will be developing pretty chaotically, and it seems cursed to try to get this chaotic system to hit something which has a low “interestingness/familiarity prior” (like, what would one upvalence when trying to get a very talented 5 year old or guy with no math-related degree to eventually make progress on the Riemann hypothesis). I guess I’ve mostly been imagining that the math problem is “universally interesting” (like any famous conjecture would be), even if initially unfamiliar, but really for the case of practical interest we should probably imagine a problem that isn’t so universally interesting. In that case, your targeting of these interestingness-dynamics needs to be even more precise, because the kinda-crap-sentence-number- attractor is smaller/weaker than the attractor for the Riemann hypothesis. Also, it seems plausible that if you could do a thing this precise, you could also just retarget the guy to be very interested in this problem in a black-box way, but not sure + haven’t thought this through carefully.
You could try to invent a religion in which solving the math problem is extremely important, and try to edit the human into an adherent. I think this has a combination of the issues of the previous two approaches.
Despite the above approaches not getting there so far, I’m open to there being some clever way to pull this off. Even though I haven’t come up with anything decent yet, thinking about this for a few hours still made me somewhat more optimistic about long-term retargeting making sense, at least toward fairly well-defined research problems and if we’re fine with losing like an order of magnitude from full effort. Overall it still looks cursed I guess, but somewhat less than I expected earlier.
Instead of doing a single edit at the beginning, we can also consider the option of continuing to do various edits across the 50 year period, as I think you mention in your comment. I think this is somewhat more promising relative to the single edit case. I spent a few hours thinking about this at some point a while ago, saw various obstacles, and didn’t come up with anything that seemed promising then, but there could be something. I’m flagging this for myself as another thing to think more about later.
(There are also additional major problems with the case of editing an AI to have human values (or maybe specifically to serve a particular human) compared to this 50 year human example. It’s a weirder target, and the edit now needs to continue to apply over arbitrary future development, which is cursed. Instead of retargeting to human values, one could retarget to e.g. making mind uploads, with a plan to use these mind uploads to ban AGI later. Unfortunately this doesn’t make sense once one considers that other AI-makers are ending the world before you get anywhere with this (and that the government is probably taking over your lab I guess), even if you have retargeting methods ready to go. In either case, you have to pause the foom in some specific somewhat super-human capability window, which is problematic because that might well be passed in only a few days of fooming (one reason: if the AI can solve a >100 year problem such as making mind uploads in a month, that suggests it could do years of human algo progress conceptual work in a day) or be passed inside your deployment run.)
like, it seems plausible/likely that with at most a few centuries of philosophy+neuroscience research, we could edit a human to be interested in a math problem for 10 minutes once, at least assuming we are granted an ability to do low-level edits for free. i have some caveat around an alien’s values being a more cursed thing to retarget to even for 10 minutes than solving a nice math problem, but i’ll ignore this because other issues seem more important/interesting for now
I’m saying the human is on an otherwise uninhabited island so we don’t have to think about the effects of interacting with other unedited people / editing many people at once.
let’s say that averaging at least idk hours a week of work [which the person reasonably takes to be directed toward the problem] would be considered a win
like, maybe you can create vague memories of some extremely good verifiers being around, initialized with little doubt registered about this, such that you can’t actually create verifiers that are this good, with this story ultimately not hanging together but hanging together well enough so as to not be taken apart in 50 years
or the Taniyama–Shimura conjecture or understanding some stuff about elliptic curves or whatever one would say if one knew anything about this case
There might be a way to bootstrap but I’ll file this under “schemes involving many edits over time”, which I won’t discuss in the present comment.
Btw if you mean there are 10k contributions already that are on the level of John’s contributions, I strongly disagree with this. I’m not sure whether John’s math is significantly useful, and I don’t think it’s been that much progress relative to “almost on track to maybe solve alignment”, but in terms of (alignment) philosophy John’s work is pretty great compared to academic philosophy.
In terms of general alignment philosophy (not just work on concepts but also other insights), I’d probably put John’s collective works in 3rd place after Eliezer Yudkowsky and Douglas Hofstadter. The latter is on the list mainly because of Surfaces and Essences, which I can recommend (@johnswentworth).
Aka I’d probably put John above people like Wittgenstein, although I admit I don’t know that much about the works of philosophers like Wittgenstein. Could be that there are more insights in the collective works of Wittgenstein, but if I’d need to read through 20x the volume because he doesn’t write clearly enough that’s still a point against him. Even if a lot of John’s insights have been said before somewhere, writing insights clearly provides a lot of value.
Although John’s work on concepts plays a smaller role in what I think makes John a good alignment philosopher than it does in his alignment research. Partially I think John just has some great individual posts, like science in a high dimensional world, you’re not measuring what you think you’re measuring, and why agent foundations (coining the term “true names”), and probably a couple more less-known older ones that I haven’t processed fully yet. And generally his philosophy that you need to figure out the right ontology is good.
But also tbc, this is just alignment philosophy. In terms of alignment research, he’s a bit further down my list, e.g. also behind Paul Christiano and Steven Byrnes.
to clarify a bit: my claim was that there are 10k individuals in history who have contributed at least at the same order of magnitude to our understanding of concepts — like, in terms of pushing human understanding further compared to the state of understanding before their work. we can be interested in understanding what this number is for this reason: it can help us understand whether it’s plausible that this line of inquiry is just about to find some sort of definitive theory of concepts. (i expect you will still have a meaningfully lower number. i could be convinced it’s more like 1000 but i think it’s very unlikely to be like 100.) i think wentworth is obviously much higher eg if you rank people on publicly displayed alignment understanding, very likely in the top 10