Beyond Hardcoded Evolutionary Psychology
Steven Byrnes has written quite a lot on brain-like AGI algorithms. I’ll reiterate a small part of his work here, but you’d be better off reading his stuff directly.
For the ideas which inspired me, see here, here, and here.
This post has good handles for intuitively applying neuroscience to rationality. Don’t mistake my “simulator” cluster for a distilled and reduced concept; it’s “lumpy rock”, not “skewed prism of seafloor impactite from the late Devonian”.
My (loose) extension of Steve’s models of human minds predicts that we’ll get stuck in local minima of self-reinforcing nonreality; with this knowledge, we can more directly target the underlying mechanisms of cognitive biases.
Deliberately tracking what you’re actually optimizing for is one example skill which seems worth developing despite its time cost. In fact, many rationality skills (like noticing your confusion and scout mindset) are species of this overlying stance.
Let’s take scout mindset as an example.
While discussing something, the goal for rationalists usually isn’t to convince the other person of their ideas, but rather to come to truthier beliefs.
One should therefore be vigilant, lest they fall into a combative posture, since this makes them slow to update.
Cool! Is there a principled way to train this?
Do you mean trigger-action patterns?
No; TAPs take conscious effort to implement. How do you make it an unconscious System 1 response, the same way I keep my balance while walking?
Lots of practice.
But what if I have stronger subconscious pressures pushing me away from scout mindset? Like, sometimes I want some idea to work, and then I notice myself defending it. But then it feels like something blocks me from seeing that defensive posture. It’s like a part of my brain is trying to keep me blissfully unaware. You’re telling me to run uphill, but I want to remove the hill entirely.
If you keep practicing it, you’ll learn to do it subconsciously! Then it won’t take any effort, just like walking.
But as an infant, I wanted to walk; it got rewarded. Rationality isn’t emotionally satisfying on the timescales I care about, at least not by default. My conscious practice builds a sandcastle, then tomorrow the tide washes away all but a shallow lump. Sure, after 3 decades my hill might rise above the waves, but what stops me from building a levee?[1]
Perhaps modern neuroscience can help this person be rewarded by rational thoughts? Let me just...
Simulators
Brains run simulations. I’ve not seen Steve claim this[2], hence “simulations” instead of “thought predictors”.
Let’s say that I go parasailing in the Andes. During flight, my mind absorbs tons of data about the sky, landscape, my gear, etc. When I get home, my brain runs lots of internal simulations about flying through the mountains, birds, and more specific stuff about, say, wind patterns. These simulations are based off of those earlier sensory experiences.
In this way, my mind converts the tidbits of data in my flight memories into lots of training examples for itself. To a Solomonoff inductor (an imagined perfect intelligence), my simulations don’t add any extra information about my flight compared to my memories.
But I’m not a Solomonoff inductor, so my mind needs many self-generated training examples to fully internalize it.
(This is one reason we’re so damn sample-efficient compared to LLMs. We can keep generating internal training data. LLMs, by comparison, have to be fed training data by some externally bottlenecked process like internet scraping or human-written RL environments.)
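As a rough illustration of the replay point (a toy sketch, not a claim about how brains or LLMs actually learn): below, a learner that only takes small gradient steps sees three hypothetical "memories" once, then replays the same three many times. The function name `fit_slope`, the learning rate, and the data are all made up for the example. The replays carry no new information, yet the slow learner only gets close to the true slope after replaying.

```python
def fit_slope(samples, passes, lr=0.05):
    """Online gradient steps on y ~ w * x, replaying `samples` `passes` times."""
    w = 0.0
    for _ in range(passes):
        for x, y in samples:
            # One small update per (re)experienced example.
            w += lr * (y - w * x) * x
    return w


# Three hypothetical "memories" from one flight; the true slope is 2.0.
memories = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]

one_pass = fit_slope(memories, passes=1)   # each memory used once
replayed = fit_slope(memories, passes=50)  # same memories, replayed internally

print(f"one pass: {one_pass:.3f}, replayed: {replayed:.3f}")
```

A perfect inductor would read the slope off the first memory; the small-step learner needs the repetition.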
From what I see, LessWrong rationality advice is almost entirely “ingest this particular training example”, which works great if your simulators are faithful to the content of that data. But they’re not.
Simulator dynamics cause cognitive biases and actively prevent you from fixing them.
Drift; cognitive attractors
Simulators can be influenced by each other cyclically; they’re not straightforwardly truthful updates on sensory data. As an extreme example, these internal thoughts[3]:
Blah unfocused daydreaming blah
If my best friend of 15 years attacked me with a frying pan, I’d probably punch him in the face and try to dodge the pan since it’s slow to swing a cast iron pan like that
Imagining friend misses and shatters a window with the pan’s momentum
“Bro, stop!”
Now I like my friend less! He did absolutely nothing to imply that he’d attack me with a frying pan. In fact, I had a stomach cramp I didn’t notice, which biased my idle thoughts towards pain, and my friend happened to be on my mind because I visited him that morning.
A more typical example for the audience of this post:
After a conversation with a non-rationalist family member about my decision to drop out of my undergrad math degree
She just doesn’t understand. AI safety is too weird of a subject, and she’s stuck on immigration or whatever Facebook feeds her all day.
Option value
What’s that? Oh, I lost my train of thought. Why can’t I remember what I was just thinking about? Anyway, humanity doesn’t have time for me to screw around. This is urgent.
Cognitive simulators are a dynamical system, in the same sense as weather systems and Conway’s Game of Life. At least in humans, simulators can loop back on themselves and/or influence parallel simulators such that they fall into attractor basins.
When a simulator predicts a thought/action X will be rewarded, X-promoters actually do get rewarded as intended; but now, X-promoters indirectly reinforce the prediction that X is rewarding. You’re pinned in a loop.
Like water running down a hill, minds can get stuck in local puddles.
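Here is a minimal sketch of that kind of loop, under loud assumptions: a single number `p` stands in for "how rewarding I predict X to be", and the hypothetical `felt_reward` function makes the experienced reward depend on the prediction itself. Neither is taken from Steve's models; they're only there to show two attractor basins in a few lines.

```python
import math


def felt_reward(p, steepness=10.0):
    """Toy assumption: how rewarding X feels depends on how rewarding I predict it to be."""
    return 1.0 / (1.0 + math.exp(-steepness * (p - 0.5)))


def settle(p, steps=200, lr=0.2):
    """Repeatedly nudge the prediction p toward the reward it helped generate."""
    for _ in range(steps):
        p += lr * (felt_reward(p) - p)
    return p


low = settle(0.45)   # starts slightly pessimistic about X
high = settle(0.55)  # starts slightly optimistic about X
print(f"{low:.3f} {high:.3f}")
```

Two nearly identical starting predictions end up in opposite puddles, and nothing about the outside world decided which.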
What are you optimizing in your head
From “What Are You Tracking In Your Head”[4]:
the expert tracks some extra information/estimate in their head. Usually the extra information is an estimate of some not-directly-observed aspect of the system of interest. From outside, watching the expert work, that extra tracking is largely invisible; the expert may not even be aware of it themselves. Rarely are these mental tracking skills explicitly taught. And yet, based on personal experience, each of these is a central piece of performing the task well—arguably the central piece, in most cases.
Do you track what you’re locally optimizing for? Are you intuitively aware of whatever queries/searches you’re running, all of the time?
Lots of biases don’t feel wrong (to an untrained person) because they don’t realize they’re running the wrong search. Their mind is aiming for something completely different from what they “think” they’re doing. If you ask such a person what they’re trying to do, they’ll describe something different from what they’d feel by introspecting during the process.
Say I’m explaining to someone why they’re wrong about X. I “want” both of us to get to the truth; but I’m locally intuitively trying to get them to change their mind. They’re wrong, after all. I’m speaking some words I think will get them to believe not X, but Y...
So now I’m optimizing, intuitively but not explicitly, to change this person’s beliefs. What do you expect I’ll feel if they give a strong counterargument?
I’ll feel defensive. I don’t endorse this feeling, but my short-term intuitive goal is beset!
Were I aware of my intuitive goal instead of merely tracking it, I’d notice the swap. This holds for many classes of error you can find in the Sequences.[5]
Social pulls
Humans have quite strong social drives. This is more of a consequence of evolutionarily specified drives than of architectural inefficiencies, so training tactics differ a bit.
Working on a LW post about cognition enhancement; preliminary title Need Smarter Humans at top of screen
Notice that non-rat Atlanta folks can see my screen through my office’s glass walls
They probably think I’m loony, don’t they?
Two minutes later, I’m looking at 3d visualizations of molecules from hydrogel chemistry simulations I’m running
Oh gee, what a happy coincidence that now I look smart and professional to the people behind me! … Oh. Yikes.
Here again, had I been conscious of my immediate optimization target, I would’ve noticed the prestige-drive pulling me away from good work.
But this goes much deeper. Down to the felt sense of what’s culturally acceptable/expected by the people around me, and in which ways I care.
Probably surrounding myself with people who think very clearly would help. But that leaves a weird residual mismatch; I don’t immediately want to be The Rationalistist, even if my milieu would very much respect that.
No, I want to look like someone with excellent epistemology, not to mention my thousand other evolved desires. Culture is only a partial solution, because I still need to actively guide my thought-drift.
Negative queries are unnatural (to most people)
“Try to falsify your beliefs” is good advice, but notice that it’s a very strange operation to ask of a simulator architecture.
When I run a query like “why is the 3d printer I just bought not working?” my mind can start rolling out possible explanations: clogged nozzle, disconnected wire etc. I have a problem, so my mind searches for trajectories which would explain my experience. Same for social situations, research, driving. I have some current problem and my mind searches forward through explanations and possible actions.
But what would it mean to run the negative query “what would it feel like if my current model of this situation were false?”
Simulators run forward, generating possible continuations from some starting frame. If I ask “what explains this under my current model?”, my mind has something to roll out; how’s the simulator supposed to run “what explains not(this) under my current model?”
Maybe I spend a lot of time imagining what the inverse shape of each of my models feels like. A priori this seems very wasteful to me, as it’s swimming upstream against your hardwired cognitive architecture. But I could imagine it working out.
Confusion, however, is something on which we can run positive queries, and it’s trainable. “What am I confused about” slowly but predictably clarifies where reality is biting your models, so long as you’re actually optimizing for understanding your confusion.
The resulting toolset is very different from trigger-action patterns.
[1] See Taste & Shaping.
[2] Engram replay seems to be doing something like “running simulations”, though much of that is probably also “optimizing the representation of episodic memories for predictive efficiency”. Like defragmenting a hard disk drive, but neuromorphic.
[3] I noticed this sort of spiral all the time as a young teenager.
[4] Explicitly citing quote here, not an example monologue.
[5] I might just be saying in more words “deliberately practice rationality”, but “deliberate practice” doesn’t, to me, properly describe the metacognitive pose one should take therein.
I’m excited by this post, because I think you’re onto something big and getting it almost right. Strong upvote!
Let me summarize what I’m reading here and you can tell me if I’m understanding correctly.
We simulate forward, including reward, and this means we can drift away from reality if we don’t re-ground ourselves deliberately.
Rationality techniques often fail because they assume we’re actually optimizing for the things we think we’re optimizing for, and this is empirically false: “If you ask such a person what they’re trying to do, they’ll describe something different from what they’d feel by introspecting during the process.”
Because of this, trying to “install TAPs” is fighting the current, which takes effort, and is less effective than we’d like
Maybe we can figure out how to want truthier beliefs, so that we aim for them automatically the way a baby aims to walk? This means getting actual truth seeking rewarded.
The key is to become aware of what we’re actually optimizing for
But asking “what would it feel like if my model were wrong?” feels underspecified because you’re throwing away your whole model you use to predict
So notice confusion instead.
I’ll start with where I agree. Then give a worked example where I think your details aren’t quite right, then gesture at the general direction.
This is quite the understatement. You’re talking about inner alignment. This is the skill for rationality IMO, and has strong connections to the big problem of the day once you start to notice it—and I see from your next post that you have.
This is exciting to see because I rarely see people diagnose the core problem so clearly, but the details are where you’re going to struggle in implementation.
A bit of backstory: I found this stuff from a different angle, using hypnosis as a starting point to understand how this all works and working up from the “bit banging” towards a more generalizable model of what’s going on. This “Trying to explain to someone they’re wrong, but it doesn’t work” is actually the opening example/challenge I give for my sequence on the topic.
You’re right that we aren’t actually optimizing for “getting both of us to the truth” and that this is why we don’t progress smoothly towards the truth. The problem is that “trying to get them to change their mind” isn’t a sufficiently complete picture either.
We’re trying to get them to change their mind conditional on us being right—in other words, trying to assert our perspective at them, expecting this to change their mind. We’re not drawn towards any possibilities that involve us even examining whether we could be wrong, even if this would be the persuasive thing to do—as evidenced by the predictable defensiveness should we be pressed in that direction.
If we were actually optimizing towards “change their mind”, we’d notice real fast that our own defensiveness gets in the way. Because we’d notice that our first impulses aren’t going to succeed at “changing their mind”, and wonder why that is and what we can do to fix it. And the obvious answer, once we want to find it, is because we’re coming off as closed minded and not open to finding out if we’re actually right.
Which makes sense, because what do we do when people assert their perspective at us, expecting it to change our mind, without even considering what they might be missing? Okay, so the way to change their mind is to demonstrate open mindedness and—shit, yeah, that feels aversive to me. Okay, what now?
What we’re locally optimizing for is even less flattering. We’re trying to push away their disagreement. We’re trying to batter it with arguments and reason until it stops reminding us that the expectations we’re placing on the other person’s beliefs are wrong. This can kinda sorta work, when social pressure is enough for suppression, but it never actually changes their beliefs, let alone helps us find reality.
There are a lot of moving pieces to sort out, which is why I didn’t close the loop on the opening example until the capstone post, but the take away is that aligning our local optimization with our conception of what we’re optimizing for brings both towards “finding truth” and, as a result of bringing us both towards truth, conditional on us being as right as we thought we were, convincing the other person.
So like, expect to actually change their mind if you’re so right, and then notice what happens.
Right, it’s a confusing question to grapple with. It seems unproductive because “If I were wrong” is assumed, not evidenced, so it could go anywhere. Then what am I using to predict? How do I choose which bits to pretend aren’t actually real, in my best guess?
Fortunately, we don’t have to assume, and we don’t even have to notice confusion.
We just have to notice wrongness. Dissatisfaction with reality. Dukkha. Prediction error. It’s everywhere.
When we’re frustrated that the other person isn’t yielding to our obvious rightness, why? We’re expecting them to believe us because we’re right, and they’re not believing us. Our expectation is being empirically falsified, and the path towards becoming less wrong is to start wondering why that is.
We’ll often end up at places like “Because they’re DumbBad, that’s why”, which, okay, fine. Maybe. But then why were we expecting them to listen to our fact and logic? Is that what DumbBad people do? Until all feels right in the world, we’re wrong about something, and that feeling of “not-rightness” is the clue as to where our predictions have diverged from reality.
We might be patting ourselves on the back when we get things that we predict should be rewarding, but unless we’re actually right reality is going to be slapping us in the face about it sooner or later. We might misinterpret the slap at first, but if we keep going it’ll lead us back to ground. And we can update on predicted slaps before they happen, even, and do a lot of the work ahead of time before we actually do anything wrong.
Confusion is what it feels like to notice that our models are insufficient. There won’t be confusion to notice until we notice that our models are insufficient, and it’s noticing “wrongness” that gets us there.
I basically agree with everything you said here; can you highlight where you disagree? I don’t see where we’re diverging, seems useful to know.
I do think we’re mostly in agreement. If I read your post as a blurry gesture at a shape of things, I think you’re getting the picture exactly right which is why I’m so excited to see another person “getting it”. So, “Yes! Strong upvote!”.
If I read your post as speaking technically and precisely, I see details that will need to be changed in order to get things to actually work in practice.
I’m saying that this is directionally correct, but that the same problem shows up with respect to “changing people’s minds”.
“I “want” to get them to change their mind (because that’s what gets both of us to the truth which I already have); but I’m locally intuitively trying to push away from experiencing wrongness”
What we’re actually optimizing for is something more specific and even more out of alignment. If we were actually optimizing for changing people’s minds, things would play out very differently because it would lead right back to truth seeking.
The mental state which we label “confusion” when we notice it in ourselves, is the state of feeling disoriented. If we were to put into words the implicit stance of this state, it’s “I have not been able to properly orient to this situation. I don’t know what to make of it and I’ve noticed this”. Without noticing that your models are failing you, it just feels like the world is wrong. “The problem is that the peg won’t go through the hole!”, not “I can’t figure out which hole this peg goes in”, let alone “Lol, square peg don’t go in round hole”.
In order to enter this state we label confusion, we have to notice that the wrongness we perceive is in our map, not the territory. We have to notice “Ah, I feel wrongness, that means I’m not oriented properly. I don’t know what to make of this”.
When you ask a Rationalist why they’re frustrated in a dialog that isn’t going their way on LW, they’ll tell you they’re frustrated because that other guy is wrong on the internet—and they’re supposed to be rational, dammit. Even in well respected (and otherwise respectable) rationalists, there is often-to-usually no recognition that the feeling of wrongness that they are modeling as “in the other guy” is actually in their own models, necessarily. This is where we go wrong.
And asking “What am I confused about?” won’t actually help. Because there’s no such thing to notice. An outsider may describe the struggling person as “confused” or “disoriented”, but from the inside they have no feeling of disorientation to notice. In their own model, they are oriented properly; it’s the other guy who isn’t! So far as they’re concerned the problem is the hole, not the fact that they’re trying to shove a square peg into a round hole.
The thing which is actually there to notice is our perceptions of “wrongness”. If we ask “Where am I experiencing wrongness” then we’ll find it immediately. I experience it when the peg won’t go into the hole. When the guy online doesn’t change his mind.
This is the thing that’s actually there to notice, which we can then use to take the next step of “Why is the peg not going into the hole, do you think?”. Why is that guy wrong on the internet? This wrongness I’m feeling… what’s that about?
And this is the step that updates models to better match reality. Including noticing the inner misalignment that has been screwing everything up.
It’s true that sometimes we have tiny little notes of discord. Tiny little hints of “maybe I’m confused here?” that we fail to give due weight. But by the time there’s even a hint of subjective confusion to notice, we’ve already done the hard part. Most of our failures are due to stubbornly externalizing wrongness because that’s how we try to control, and we don’t want to give that up until we see a better way.
Thank you for clarifying something I understood intuitively but hadn’t put into words! See here for my response to that post. You’ve probably already mentioned it somewhere, but perceptual control theory relatedly posits that motivations/actions are just a way to control what sorts of things we experience.
If the same machinery underlies factual prediction and normative actions, we confuse them to all heck. This is a clearer, much more precise statement of a somewhat different mechanism than I was originally proposing here. I’ll need to think for a bit about whether this changes my BCI-superintelligence stuff.
Yes, I think we’re exactly in agreement here.
Noticing confusion probably isn’t the best remedy for all situations, especially when you have a much louder mismatch like this. I didn’t mean to imply that confusion is The Solution; it’s one directionally better way to orient to a situation.
But even in situations like this, there’s probably confusion somewhere. Like, “why’s a fellow rationalist confidently Wrong?” is probably bubbling from some part of their mind, even when other stuff is talking over the confusion.
Might just be selection bias though, or some weirdness with my mind in particular, since I’m running more off of introspective memories than second-order extrapolation here.
I don’t think I have, but yes. Agreed.
Yep! And there are pretty big practical consequences of this.
Hm. I haven’t had time to read and process your other post yet, but I do think that human alignment is important for having a hope at aligning things bigger and more intelligent/powerful than ourselves. Like, there’s a big “I’m outside the system!” type error, which systematically screws up control attempts because they don’t take into account the inside-the-systemness and attempt to align “them” instead of “us, starting with me”—two boxing AI alignment, basically. It sounds like maybe you’re on a similar track?
The problem is that this is in direct opposition to the attempt to control. Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”. Why is the rationalist confidently wrong? Because he’s a Bad rationalist! Why is he a bad rationalist? Because y’all haven’t called him out for his Badness! More shame! Beat him into shape! Why haven’t we done that? Because y’all are bad rationalists too! That’s why I’m yelling at y’all to fix you!
Pondering “Hmm… I dunno. Maybe he doesn’t see this piece?” requires people to relinquish the control which they’ve already decided is worth doing, so you’re gonna get “control” type answers unless you push back against their control loop hard enough that they let go. I’m not saying you can’t do it, but it’s gonna take some oomph which has to be sourced from somewhere—which makes it trickier to self apply. The question “is it working?” comes from outside the loop and points at the loop itself, which makes it a lot more widely useful and easier to self apply.
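A toy version of the thermostat picture, in case it helps (everything here is a made-up sketch: the room model, the `leak` parameter standing in for an open window, and the function names are assumptions for illustration). The inner loop can only answer “too cold” with “more heat”; `is_it_working` sits outside the loop and looks at whether the loop's own error is actually going anywhere.

```python
def run_thermostat(leak, steps=100, target=20.0, outside=0.0, max_heat=5.0):
    """Inner control loop: the only response to 'too cold' is 'more heat'."""
    temp = 10.0
    errors = []
    for _ in range(steps):
        heat = min(max_heat, max(0.0, target - temp))  # push harder when colder
        temp += heat - leak * (temp - outside)          # room loses heat through the leak
        errors.append(target - temp)
    return errors


def is_it_working(errors, window=20, tolerance=1.5):
    """Outside-the-loop check: is the error small, or at least still shrinking?"""
    recent = sum(abs(e) for e in errors[-window:]) / window
    earlier = sum(abs(e) for e in errors[-2 * window:-window]) / window
    return recent <= tolerance or recent < 0.9 * earlier


closed_window = run_thermostat(leak=0.05)
open_window = run_thermostat(leak=0.5)

print(is_it_working(closed_window), is_it_working(open_window))
```

With the window open, the heater sits at maximum forever and the room never warms. From inside the loop that still reads as “not enough FIRE”; only the outside check points at the loop itself.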
So we’re ~synced re: psychology underlying irrationality. But I don’t know whether you’re trying to change cultural or individual rationality here:
Like, I’m mostly optimistic about getting a few individuals to not do Crappy Epistemics, whereas I feel like you’re targeting groups, which seems difficult if I’m one of very few people who get what you’re saying.
To back up, I’m working on BCIs so we can have actual superintelligent humans, not AIs. Enough to do a pivotal act, no more.
If you’re saying “we need dramatically better instrumental rationality, of which short-term optimization targets are a big component” then yes, strong agree. I feel like you’re saying something else though, maybe about coordination between humans?
I have exactly one friend whom I consider a rationalist, so I haven’t interacted with enough groups to comment here.
My immediate trajectory is “improve my rationality (including self-alignment) while working on funding for BCI research so a cohort of aligned humans can partially apotheosize and end the acute AGI risk window”. Rationality via self-alignment is instrumentally useful for rolling an FRO and doing novel research, but the positive externalities for value alignment aren’t currently my main interest.
(Noting that it took a moment for me to connect the analogy, but including it definitely helped.)
If it’s from the criticisms of the rationality community for not berating people into rationality, oops. I should have put those in quotation marks. The point was “you’ll hear this from someone in this mode”, not to assert those things myself. I mostly see those exhortations as examples of the thing they’re railing against.
I’m not really trying to change anything here, just describing what is actually going on.
Oh, for sure :)
But to be clear, I’m excited that you seem to be picking up on the same thing from another angle, separate from anything I’ve said.
I’ve gotten pretty comfortable with my ability to help people see the things needed to realign themselves on some object level issue, but I’m still new at communicating the meta thing of how to help people see how this alignment process works. Difficult, but fun/interesting, and I don’t think too difficult to learn.
Yeah, I think that’s a facet of it. But also, yeah, it’s more than that. The stuff about coordinating between humans is just another facet too.
There’s a failure mode in changework where people try to “fix their irrational fear”, or “walk the client through what they need to do to fix their irrational fear”. These seem like perfectly reasonable responses which is why people do them, but watch it play out enough times and you start to notice that the stubborn attempt to control away the fear is the problem. That once you notice that you don’t actually know your fear to be irrational, you naturally turn towards noticing whether you’re actually safe, and that’s the move that conditions away inappropriate fears. That once you notice that trying to push people towards having a certain “correct” orientation to their pain is actually the thing that causes the suffering, you naturally turn towards “what’s the real problem here?” and that’s what dissolves the suffering in a word or “a few messages”.
This move of “Oh, control isn’t working. I wonder why?” turns out to be very general. Not just applied to one’s own mind, or helping others with their own minds, but also helping others learn to help others with their own minds and so on and so on. When my friend asked for help getting her four year old to take her eyedrops, I was able to play with my friend’s discomfort which led to her being able to play with her daughter’s discomfort, which led to her daughter playing with her own discomfort. “You’re a brown belt in jiu jitsu, what do you mean how do I get my child to take her eyedrops!?” → “Sigh. I just feel like I shouldn’t have to use force… and I guess this is another one of those things where my own tension is telling her to be tense, huh?” → “Mommy, can we play the eyedrop game!?”
Valentine has a good post “We are already in AI takeoff”, which I took a stab at putting into my own words in a comment there.
In short, and translating into the language we’re using here, “trying to align AGI” is itself an instance of this same exact failure to align ourselves. Because no one is looking at it like “Oh, yeah, easy peasy. I predict that I will not experience prediction error, because I got this”. It’s all pushing back in an attempt to control away prediction error because the consequences of failing are unimaginably bad, while failing to act on the uneasiness coming from predictions of this not panning out. Which turns out to be where all the most useful information is.
When I look instead towards “If this goes well, what does that look like? How did we get here?”, the answer I see is one where the people guiding the development of AI aren’t pushing away from any of the relevant information, let alone the information that they themselves perceive as most important with respect to whether what they’re doing is working and what kind of moves might actually work.
In other words, it’s one where the researchers themselves have enough embodied skill in alignment that they can approach the problem with their full faculties. Not just because “That’s what rationality is, and good instrumental rationality is necessary for succeeding at hard things”, but also “it’s literally the same skill”. In the same way that relating to one’s own mind is the same skill as helping a friend relate to theirs, which is the same as helping a friend help their kid relate to theirs.
Same skill, applied on multiple levels. The skill in “becoming rational”/“coordinating groups of humans”/“aligning AI” is all skill in alignment. Borrowing Val’s words again, “It’s fractal”.
When I think about “how to align AI”, I notice that I don’t actually know how to do this. There’s nothing I can see, where I think “Ah, this is the code I need to write” or “here’s the things I need to exhort at people to do” that will predictably yield the outcomes I want. Not through “targeting groups”, or “targeting individuals”, or “targeting code”.
And I notice that stopping to notice this is by far the most important thing I can do, since “trying” would necessarily be blind, and therefore doomed to succeed by luck at best. And that one of the better “object level applications” for me right now is to highlight the nature of this move, since a big part of “Why is this control not working?” is “Because people aren’t aware of how control works”. Okay, cool. So if we change that, this part of the problem dissolves.
There’s something even more general and self referential that I’m fumbling towards though, since doing that thing is the thing I can actually expect to lead to the best possible outcomes—structurally and necessarily. But it’s a bit mindbending because “trying to generalize” is itself an instance of the thing I’d be trying to avoid (and so would “trying to not try” or concluding “we shouldn’t try”). “Generalizing is hard. I wonder why?” is the generalization. So I guess that’s the next thing to wonder, once I have some mental room for it.
Anyway, your post on BCI facilitated AI alignment looks to me like a step in the same direction. A step towards noticing that AI alignment is downstream of human alignment (in this case, because aligned and augmented humans are more competent which is instrumentally useful), and that the solutions which actually work have more competent humans more tightly integrated in the alignment process for longer—rather than keeping a stance of “I’m outside the system, aligning THAT THING is what I’m trying to do, dammit”.
I don’t think you’ve been explicitly thinking about it in the same terms I’m laying out here, but it does seem like it might be downstream of beginning to sense and act on the same thing I’m fumbling towards. Like you might be already on the same path fumbling towards the same thing that I’m trying to put a finger on (and noticing myself not fully having yet, in this sentence. Lol). Does this fit?
I see what these words might say, but don’t follow the link. Like, seems basically true that rationalism → human coordination works, but AI alignment is such a different thing, so alien to whatever concepts help humans self-align and coordinate.
Perhaps I just need more time to work through this concept. Right now I’m more focused on understanding my own mind, to make better decisions, because I’m finding a crapton of low-hanging fruit very quickly.
I’ve rolled this around in my head for a bit, and it seems to me like, for rationality, “control” of lower processes is better done by something like “improved training data” than operant force.
This is a vignette from my life, not a quote from anyone:
because imposing top-down “control” misaligns:
The optimizer which calls itself an Elliot, wants human flourishing, wants to hold someone
vs the constellation of smaller optimizers Elliot is an intelligible supervenience of
Sounds plausible
I very much have come to think of myself as a system, nothing special other than this weird consciousness phlogiston, though I haven’t stopped using self/other borders.
Maybe? I find myself confused about your explanation of AI alignment; but after reading Valentine’s post about memeplexes, I’m thinking you’re talking about being in entirely the wrong frame, where “align the AI” straightforwardly might not be a thing. And “stop doom” might also not be a thing.
(Thanks for linking that, by the way. I preliminarily do expect, if BCIs or similar enhancements work, that moderately superintelligent humans will birth more powerful Friendly hypercreatures. If you have a list of older, similarly gearsy posts, I’d definitely like to read them.)
How do you define “conscious” vs “merely tracking”? Like, what’s a rough way to explain the most important differences between them?
Intuitively, to me, “tracking” is synonymous with “being conscious of”; I wonder if you’re trying to point at some specific category of qualia or what :D
Yeah! I’m talking about the clarity / prominence of the thing in your mind, not a class of qualia per se. Pretty much orthogonal things. “Awareness” is probably a better term, so I’ll correct the post.
Let’s say I’m a seasoned pentester trying to crack some app. As I’m probing the auth mechanism, “somewhere in the back of my mind” is a map of the attack surfaces, but I’m not introspective enough to notice I’m tracking that shape. It’s still there influencing my subjective experience as a “qualium”, but I can’t be metacognitive of it.
Or, while driving, every action I’m taking is in service of reaching my destination, but I’m not aware that my cognition is aimed at getting to the destination.
I’m saying “conscious of” in the sense most psychologists do, where it’s possible to actually deliberate about the object of attention. It doesn’t have anything to do with qualia; unfortunately the term is really overloaded and I don’t know a better one.
Metacognition of what you’re tracking is extremely helpful for tacit skill acquisition; it’s deliberate practice up one meta level. As a special case of this, meta-awareness of the direction you’re optimizing lets you notice when something is irrationally tugging you and gives you intuitive handles for how to fix it.