You’ve probably already mentioned it somewhere, but perceptual control theory relatedly posits that motivations/actions are just a way to control what sorts of things we experience.
I don’t think I have, but yes. Agreed.
If the same machinery underlies factual prediction and normative actions, we confuse them to all heck.
Yep! And there are pretty big practical consequences of this.
This is a clearer, much more precise statement of a somewhat different mechanism than I was originally proposing here. I’ll need to think for a bit about whether this changes my BCI-superintelligence stuff.
Hm. I haven’t had time to read and process your other post yet, but I do think that human alignment is important for having a hope at aligning things bigger and more intelligent/powerful than ourselves. Like, there’s a big “I’m outside the system!” type error, which systematically screws up control attempts because they don’t take into account the inside-the-systemness and attempt to align “them” instead of “us, starting with me”—two boxing AI alignment, basically. It sounds like maybe you’re on a similar track?
But even in situations like this, there’s probably confusion somewhere. Like, “why’s a fellow rationalist confidently Wrong?” is probably bubbling from some part of their mind, even when other stuff is talking over the confusion.
The problem is that this is in direct opposition to the attempt to control. Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”. Why is the rationalist confidently wrong? Because he’s a Bad rationalist! Why is he a bad rationalist? Because y’all haven’t called him out for his Badness! More shame! Beat him into shape! Why haven’t we done that? Because y’all are bad rationalists too! That’s why I’m yelling at y’all to fix you!
Pondering “Hmm… I dunno. Maybe he doesn’t see this piece?” requires people to relinquish the control which they’ve already decided is worth doing, so you’re gonna get “control” type answers unless you push back against their control loop hard enough that they let go. I’m not saying you can’t do it, but it’s gonna take some oomph which has to be sourced from somewhere—which makes it trickier to self apply. The question “is it working?” comes from outside the loop and points at the loop itself, which makes it a lot more widely useful and easier to self apply.
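To make the thermostat picture concrete, here is a minimal toy sketch in Python. It is an illustration under made-up assumptions, not a model of anyone's actual psychology: the names `Thermostat`, `simulate`, and `is_it_working`, the `gain` and `leak` parameters, and all the numbers are hypothetical. The controller's only possible response to error is more heat, while the "is it working?" check sits outside the loop and only looks at whether the error is shrinking.

```python
from dataclasses import dataclass


@dataclass
class Thermostat:
    """Toy controller: its only possible response to error is more heat."""
    setpoint: float      # desired perceived temperature
    gain: float = 0.5    # hypothetical heater strength per unit of error

    def act(self, perceived: float) -> float:
        error = self.setpoint - perceived
        # Inside the loop, "why is it cold?" always gets the same answer:
        # not enough heat yet.
        return max(0.0, self.gain * error)


def simulate(thermostat, start, leak, steps):
    """Return the temperature history. `leak` is heat lost per step,
    for example an open window the controller knows nothing about."""
    temp = start
    history = [temp]
    for _ in range(steps):
        temp = temp + thermostat.act(temp) - leak
        history.append(temp)
    return history


def is_it_working(history, setpoint, window=5, tolerance=0.5):
    """Outside-the-loop check: is the error gone, or at least still shrinking?

    It adds no heat; it only looks at whether control is succeeding.
    """
    recent = history[-window:]
    still_off = abs(setpoint - recent[-1]) > tolerance
    stalled = abs(recent[-1] - recent[0]) < 0.1
    return not (still_off and stalled)


if __name__ == "__main__":
    t = Thermostat(setpoint=20.0)
    print(is_it_working(simulate(t, 15.0, 0.0, 20), 20.0))  # no leak
    print(is_it_working(simulate(t, 15.0, 2.0, 20), 20.0))  # open window
```

With the leak, the controller settles below the setpoint and keeps demanding heat indefinitely. Nothing inside `act` can represent "maybe there is an open window", which is why the question has to come from outside the loop.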
So we’re ~synced re: psychology underlying irrationality. But I don’t know whether you’re trying to change cultural or individual rationality here:
The problem is that this is in direct opposition to the attempt to control. Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”. Why is the rationalist confidently wrong? Because he’s a Bad rationalist! Why is he a bad rationalist? Because y’all haven’t called him out for his Badness! More shame! Beat him into shape! Why haven’t we done that? Because y’all are bad rationalists too! That’s why I’m yelling at y’all to fix you!
Like, I’m mostly optimistic about getting a few individuals to not do Crappy Epistemics, whereas I feel like you’re targeting groups, which seems difficult if I’m one of very few people who get what you’re saying.
To back up, I’m working on BCIs so we can have actual superintelligent humans, not AIs. Enough to do a pivotal act, no more.
I do think that human alignment is important for having a hope at aligning things bigger and more intelligent/powerful than ourselves
If you’re saying “we need dramatically better instrumental rationality, of which short-term optimization targets are a big component” then yes, strong agree. I feel like you’re saying something else though, maybe about coordination between humans?
Like, there’s a big “I’m outside the system!” type error, which systematically screws up control attempts because they don’t take into account the inside-the-systemness and attempt to align “them” instead of “us, starting with me”—two boxing AI alignment, basically.
I have exactly one friend whom I consider a rationalist, so I haven’t interacted with enough groups to comment here.
It sounds like maybe you’re on a similar track?
My immediate trajectory is “improve my rationality (including self-alignment) while working on funding for BCI research so a cohort of aligned humans can partially apotheosize and end the acute AGI risk window”. Rationality via self-alignment is instrumentally useful for rolling an FRO and doing novel research, but the positive externalities for value alignment aren’t currently my main interest.
Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”.
(Noting that it took a moment for me to connect the analogy, but including it definitely helped.)
But I don’t know whether you’re trying to change cultural or individual rationality here:
Like, I’m mostly optimistic about getting a few individuals to not do Crappy Epistemics, whereas I feel like you’re targeting groups,
If it’s from the criticisms of the rationality community for not berating people into rationality, oops. I should have put those in quotation marks. The point was “you’ll hear this from someone in this mode”, not to assert those things myself. I mostly see those exhortations as examples of the thing they’re railing against.
I’m not really trying to change anything here, just describing what is actually going on.
which seems difficult if I’m one of very few people who get what you’re saying.
Oh, for sure :)
But to be clear, I’m excited that you seem to be picking up on the same thing from another angle, separate from anything I’ve said.
I’ve gotten pretty comfortable with my ability to help people see the things needed to realign themselves on some object level issue, but I’m still new at communicating the meta thing of how to help people see how this alignment process works. Difficult, but fun/interesting, and I don’t think too difficult to learn.
If you’re saying “we need dramatically better instrumental rationality, of which short-term optimization targets are a big component” then yes, strong agree. I feel like you’re saying something else though, maybe about coordination between humans?
Yeah, I think that’s a facet of it. But also, yeah, it’s more than that. The stuff about coordinating between humans is just another facet too.
There’s a failure mode in changework where people try to “fix their irrational fear”, or “walk the client through what they need to do to fix their irrational fear”. These seem like perfectly reasonable responses, which is why people do them, but watch it play out enough times and you start to notice that the stubborn attempt to control away the fear is the problem. You notice that once you see that you don’t actually know your fear to be irrational, you naturally turn towards noticing whether you’re actually safe, and that’s the move that conditions away inappropriate fears. And that once you see that trying to push people towards having a certain “correct” orientation to their pain is actually the thing that causes the suffering, you naturally turn towards “what’s the real problem here?” and that’s what dissolves the suffering in a word or “a few messages”.
This move of “Oh, control isn’t working. I wonder why?” turns out to be very general. It applies not just to one’s own mind, or to helping others with their own minds, but also to helping others learn to help others with their own minds, and so on. When my friend asked for help getting her four-year-old to take her eyedrops, I was able to play with my friend’s discomfort, which led to her being able to play with her daughter’s discomfort, which led to her daughter playing with her own discomfort. “You’re a brown belt in jiu jitsu, what do you mean how do I get my child to take her eyedrops!?” → “Sigh. I just feel like I shouldn’t have to use force… and I guess this is another one of those things where my own tension is telling her to be tense, huh?” → “Mommy, can we play the eyedrop game!?”
In short, and translating into the language we’re using here, “trying to align AGI” is itself an instance of this same exact failure to align ourselves. Because no one is looking at it like “Oh, yeah, easy peasy. I predict that I will not experience prediction error, because I got this”. It’s all pushing back in an attempt to control away prediction error because the consequences of failing are unimaginably bad, while failing to act on the uneasiness coming from predictions of this not panning out. Which turns out to be where all the most useful information is.
When I look instead towards “If this goes well, what does that look like? How did we get here?”, the answer I see is one where the people guiding the development of AI aren’t pushing away from any of the relevant information, let alone the information that they themselves perceive as most important with respect to whether what they’re doing is working and what kind of moves might actually work.
In other words, it’s one where the researchers themselves have enough embodied skill in alignment that they can approach the problem with their full faculties. Not just because “That’s what rationality is, and good instrumental rationality is necessary for succeeding at hard things”, but also because “it’s literally the same skill”. In the same way, relating to one’s own mind is the same skill as helping a friend relate to theirs, which is the same skill as helping a friend help their kid relate to theirs.
Same skill, applied on multiple levels. The skill in “becoming rational”/“coordinating groups of humans”/“aligning AI” is all skill in alignment. Borrowing Val’s words again, “It’s fractal”.
When I think about “how to align AI”, I notice that I don’t actually know how to do this. There’s nothing I can see where I think “Ah, this is the code I need to write” or “here are the things I need to exhort people to do” that will predictably yield the outcomes I want. Not through “targeting groups”, or “targeting individuals”, or “targeting code”.
And I notice that stopping to notice this is by far the most important thing I can do, since “trying” would necessarily be blind and therefore doomed to succeed by luck at best. And that one of the better “object level applications” for me right now is to highlight the nature of this move, since a big part of “Why is this control not working?” is “Because people aren’t aware of how control works”. Okay, cool. So if we change that, this part of the problem dissolves.
There’s something even more general and self referential that I’m fumbling towards though, since doing that thing is the thing I can actually expect to lead to the best possible outcomes—structurally and necessarily. But it’s a bit mindbending because “trying to generalize” is itself an instance of the thing I’d be trying to avoid (and so would “trying to not try” or concluding “we shouldn’t try”). “Generalizing is hard. I wonder why?” is the generalization. So I guess that’s the next thing to wonder, once I have some mental room for it.
Anyway, your post on BCI facilitated AI alignment looks to me like a step in the same direction. A step towards noticing that AI alignment is downstream of human alignment (in this case, because aligned and augmented humans are more competent which is instrumentally useful), and that the solutions which actually work have more competent humans more tightly integrated in the alignment process for longer—rather than keeping a stance of “I’m outside the system, aligning THAT THING is what I’m trying to do, dammit”.
I don’t think you’ve been explicitly thinking about it in the same terms I’m laying out here, but it does seem like it might be downstream of beginning to sense and act on the same thing I’m fumbling towards. Like you might be already on the same path fumbling towards the same thing that I’m trying to put a finger on (and noticing myself not fully having yet, in this sentence. Lol). Does this fit?
Valentine has a good post “We are already in AI takeoff”, which I took a stab at putting into my own words in a comment there.