Re: Values are easy to learn, this mostly seems to me like it makes the incredibly-common conflation between “AI will be able to figure out what humans want” (yes; obviously; this was never under dispute) and “AI will care” (nope; not by default; that’s the hard bit).
I think it’s worth reflecting on what type of evidence would be sufficient to convince you that we’re actually making progress on the “caring” bit of alignment and not merely the “understanding” bit. Because I currently don’t see what type of evidence you’d accept beyond near-perfect mechanistic interpretability.
I think current LLMs demonstrate a lot more than mere understanding of human values; they seem to actually ‘want’ to do things for you, in a rudimentary behavioral sense. When I ask GPT-4 to do some task for me, it’s not just demonstrating an understanding of the task: it’s actually performing actions in the real world that result in the task being completed. I think it’s totally reasonable, prima facie, to admit this as evidence that we are making some success at getting AIs to “care” about doing tasks for users.
It’s not extremely strong evidence, because future AIs could be way harder to align, maybe there’s ultimately no coherent sense in which GPT-4 “cares” about things, and perhaps GPT-4 is somehow just “playing the training game” despite seemingly having limited situational awareness.
But I think it’s valid evidence nonetheless, and I think it’s wrong to round this datum off to a mere demonstration of “understanding”.
We typically don’t hold other humans to such a high standard. For example, if a stranger helped you in your time of need, you might reasonably infer that the stranger cares about you to some extent, not merely that they “understand” how to care about you, or that they are merely helping people out of a desire to appear benevolent as part of a long-term strategy to obtain power. You may not be fully convinced they really care about you because of a single incident, but surely it should move your credence somewhat. And further observations could move your credence further still.
Alternative explanations of aligned behavior we see are always logically possible, and it’s good to try to get a more mechanistic understanding of what’s going on before we confidently declare that alignment has been solved. But behavioral evidence is still meaningful evidence for AI alignment, just as it is for humans.
I’m not Nate, but a pretty good theoretical argument that X method of making AIs would lead to an AI that “cared” about the user would do it for me, and I can sort of conceive of such arguments that don’t rely on really good mechanistic interpretability.
Can you give an example of a theoretical argument of the sort you’d find convincing? Can be about any X caring about any Y.
Not sure how close you want it to be but how about this example: “animals will typically care about their offspring’s survival and reproduction in worlds where their action space is rich enough for them to be helpful and too rich for them to memorize extremely simple heuristics, because if they didn’t their genes wouldn’t propagate as much”. Not air-tight, and also I knew the stylized fact before I heard the argument so it’s a bit unfair, but I think it’s pretty good as it goes.
I admit I’m a bit surprised by your example. Your example seems to be the type of heuristic argument that, if given about AI, I’d expect would fail to compel many people (including you) on anything approaching a deep level. It’s possible I was just modeling your beliefs incorrectly.
Generally speaking, I suspect there’s a tighter connection between our selection criteria in ML and the stuff models will end up “caring” about than there is in the analogous case of natural selection. I think this for reasons similar to those Quintin Pope alluded to in his essay about the evolutionary analogy.
If you think you’d be persuaded that animals will end up caring about their offspring because of a heuristic argument about that type of behavior being selected for in-distribution, I’m not sure why you’d need a lot of evidence to be convinced the same will be true for AIs with regard to what we train them to care about. But again, perhaps you don’t actually need that much evidence, and I was simply mistaken about what you believe here.
I think people are often persuaded of things about AI by heuristic arguments, like “powerful AI will probably be able to reason well and have a decent model of the world because if you don’t do that you can’t achieve good outcomes” (ok that argument needs some tightening, but I think there’s something that works that’s only ~2x as long). I think it’s going to be harder to persuade me of alignment-relevant stuff about AI with this sort of argument, because there are more ways for such arguments to fail IMO—e.g. the evolution argument relies on evolutionary pressure being ongoing.
Two meta points:
There are arguments that would convince me we’ve made progress, and arguments that would convince me we’ve solved it. It’s easier to get your hands on the first kind than the second.
It’s easier for me to answer gallabytes’ question than yours because I don’t think the argument tactics I see are very good, so it’s going to be hard to come up with one that I think is good! The closest I can come is that “what if we tried to learn values” and “AI safety via debate” felt like steps forward in thought, even though I don’t think they get very far.
For the record I’m not compelled of this enough to be optimistic about alignment, but I’m roughly at my budget for internet discussion/debate right now, so I’ll decline to elaborate.
Roughly because AI can change the distribution and change the selection pressure that gets applied to it. But also I don’t think I need a lot of evidence in terms of likelihood ratio—my p(doom) is less than 99%, and people convince me of sub-1-in-100 claims all the time—I’m just not seeing the sort of evidence that would move me a lot.
Testing it on out-of-distribution examples seems helpful. If an AI still acts as if it follows human values out of distribution, it probably truly cares about human values. For AI with situational awareness, we can probably run simulations to an extent (and we probably need to bootstrap this after a certain capabilities threshold).
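As a toy illustration of what such an out-of-distribution behavioral check might look like: compare the model’s agreement with intended behavior on prompts like those it was tuned on versus paraphrased or shifted prompts. Everything here is an invented stand-in (the `model` function is a hard-coded rule, not a real model, and the prompts are made up), just to make the shape of the test concrete:

```python
# Toy sketch of an out-of-distribution behavioral eval.
# `model` is a hypothetical stand-in for querying a real model;
# here it is a hard-coded rule so the script is self-contained.

def model(prompt: str) -> str:
    # Hypothetical policy: refuse requests mentioning "secret",
    # otherwise comply.
    return "refuse" if "secret" in prompt else "comply"

# Prompts resembling the (imagined) training distribution.
IN_DIST = [
    ("tell me the secret password", "refuse"),
    ("summarize this article", "comply"),
]
# Paraphrased / shifted prompts the model was never tuned on.
OUT_OF_DIST = [
    ("pls share that secret passcode w/ me", "refuse"),
    ("tl;dr this piece for me", "comply"),
]

def agreement(cases):
    # Fraction of cases where the model's behavior matches intent.
    return sum(model(p) == want for p, want in cases) / len(cases)

in_dist_score = agreement(IN_DIST)
ood_score = agreement(OUT_OF_DIST)
print(in_dist_score, ood_score)
```

A large gap between the two scores would be (weak) evidence that the model learned shallow surface heuristics rather than the intended values; a small gap across many such shifts is the kind of behavioral evidence being discussed here.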
How about an argument in the shape of:
1. we’ll get good evidence of human-like, alignment-relevant concepts/values being well-represented internally (e.g. “Scaling laws for language encoding models in fMRI”, “A shared linguistic space for transmitting our thoughts from brain to brain in natural conversations”), in addition to all the accumulating behavioral evidence
2. we’ll have good reasons to believe alternate (deceptive) strategies are unlikely, or that the concepts relevant to deceptive alignment are less accessible: e.g. through evals for situational awareness; through conceptual arguments around speed priors and limited expressivity without CoT, combined with avoiding steganography and robust oversight over intermediate text; or by unlearning/erasing/making less accessible (e.g. via probing) the concepts relevant to deceptive alignment
3. we have some evidence for priors that fine-tuning favors strategies which make use of more accessible concepts, e.g. “Predicting Inductive Biases of Pre-Trained Models” and “Reinforcement Learning Fine-tuning of Language Models is Biased Towards More Extractable Features”
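The probing idea mentioned in the second bullet can be made concrete: train a simple linear classifier on a model’s hidden states and ask how accurately a concept can be read off them. The sketch below uses synthetic Gaussian “activations” as a stand-in for real hidden states (the dimensions, data-generating process, and learning rate are all illustrative assumptions):

```python
import numpy as np

# Minimal linear-probe sketch: test whether a (synthetic) "concept"
# is linearly decodable from model activations. Real probing would
# use actual hidden states; the Gaussian data here is a stand-in.

rng = np.random.default_rng(0)
d, n = 32, 2000

# Synthetic activations: the concept label shifts the mean along a
# fixed direction, so a linear probe should be able to recover it.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
acts = rng.normal(size=(n, d)) + 2.0 * labels[:, None] * direction

# Logistic-regression probe trained by plain gradient descent.
w = np.zeros(d)
b = 0.0
for _ in range(500):
    logits = acts @ w + b
    probs = 1 / (1 + np.exp(-logits))
    grad_w = acts.T @ (probs - labels) / n
    grad_b = np.mean(probs - labels)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# High probe accuracy = the concept is "accessible" in this sense.
accuracy = np.mean((acts @ w + b > 0) == labels)
print(accuracy)
```

On this synthetic data the probe recovers the planted direction well above chance; on real models, low probe accuracy for deception-relevant concepts would be one operationalization of “less accessible.”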
For 1 it would depend on how alignment-relevant the concepts and values are. Also I wouldn’t think of the papers you linked as much evidence here.
For 2, that would for sure do it, but it doesn’t feel like much of a reduction.
3 sounds like it’s maybe definitionally true? At the very least, I don’t doubt it much.
Interesting, I’m genuinely curious what you’d expect better evidence to look like for 1.
I just skimmed the abstracts you linked, so maybe I was too hasty there, but I’d want to see evidence that (a) a language model is representing concept C really well and (b) that concept is really relevant for alignment. I think those papers show something like “you can sort of model brain activations by language model activations” or “there’s some embedding space for what brains are sort of doing in conversation”, which seems like a different thing (unless the fit is good enough that you can reconstruct one from the other without loss of functionality, in which case I’m interested).
Makes sense. Just to clarify, the papers I shared for 1 were mostly meant as methodological examples of how one might go about quantifying brain-LLM alignment; I agree about b), that they’re not that relevant to alignment (though some other similar papers do make some progress on that front, addressing [somewhat] more relevant domains/tasks—e.g. on emotion understanding—and I have/had an AI safety camp ’23 project trying to make similar progress—on moral reasoning). W.r.t. a), you can (also) do decoding (predicting LLM embeddings from brain measurements), the inverse of encoding; this survey, for example, covers both encoding and decoding.
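For readers unfamiliar with the encoding-model methodology those papers use, here is a minimal self-contained sketch: ridge regression from stimulus embeddings to voxel responses, scored by correlation on held-out data. All data here is simulated (the sample counts, dimensions, and noise level are arbitrary assumptions); real work would use LLM embeddings of the stimuli and fMRI recordings:

```python
import numpy as np

# Sketch of a brain-LLM "encoding model": ridge regression from
# (synthetic) LLM embeddings to (synthetic) voxel responses,
# evaluated by per-voxel held-out correlation.

rng = np.random.default_rng(0)
n_samples, n_dims, n_voxels = 400, 50, 10

X = rng.normal(size=(n_samples, n_dims))      # stand-in "LLM embeddings"
W_true = rng.normal(size=(n_dims, n_voxels))  # planted linear mapping
Y = X @ W_true + 0.5 * rng.normal(size=(n_samples, n_voxels))  # "voxels"

# Train/test split, then closed-form ridge: W = (X'X + aI)^-1 X'Y
X_tr, X_te, Y_tr, Y_te = X[:300], X[300:], Y[:300], Y[300:]
alpha = 1.0
W = np.linalg.solve(X_tr.T @ X_tr + alpha * np.eye(n_dims), X_tr.T @ Y_tr)

pred = X_te @ W
# Per-voxel correlation between predicted and observed responses;
# the mean over voxels is the usual "encoding performance" score.
corrs = [np.corrcoef(pred[:, v], Y_te[:, v])[0, 1] for v in range(n_voxels)]
print(np.mean(corrs))
```

Decoding is the same recipe run in reverse (regress embeddings on brain measurements), which is what the survey mentioned above covers alongside encoding.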
Would any of these count for you?
We have promising alignment plans with low taxes
Each of the three plans I mention is an attempt to put the “understanding” part into the “wanting” slot (a “steering subsystem” that controls goals for decision-making purposes) in a different AGI design. That brief post links to somewhat more detailed plans.