21yo. I quit my undergrad math degree to work on technical alignment, then found it’s fundamentally extremely hard, and am now working on cognitive enhancement of FAI researchers.
PM me if you’d like to discuss anything! :)
So we’re ~synced re: psychology underlying irrationality. But I don’t know whether you’re trying to change cultural or individual rationality here:
The problem is that this is in direct opposition to the attempt to control. Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”. Why is the rationalist confidently wrong? Because he’s a Bad rationalist! Why is he a bad rationalist? Because y’all haven’t called him out for his Badness! More shame! Beat him into shape! Why haven’t we done that? Because y’all are bad rationalists too! That’s why I’m yelling at y’all to fix you!
Like, I’m mostly optimistic about getting a few individuals to not do Crappy Epistemics, whereas I feel like you’re targeting groups, which seems difficult if I’m one of very few people who get what you’re saying.
To back up, I’m working on BCIs so we can have actual superintelligent humans, not AIs. Enough to do a pivotal act, no more.
I do think that human alignment is important for having a hope at aligning things bigger and more intelligent/powerful than ourselves
If you’re saying “we need dramatically better instrumental rationality, of which short-term optimization targets are a big component” then yes, strong agree. I feel like you’re saying something else though, maybe about coordination between humans?
Like, there’s a big “I’m outside the system!” type error, which systematically screws up control attempts because they don’t take into account the inside-the-systemness and attempt to align “them” instead of “us, starting with me”—two boxing AI alignment, basically.
I have exactly one friend whom I consider a rationalist, so I haven’t interacted with enough groups to comment here.
It sounds like maybe you’re on a similar track?
My immediate trajectory is “improve my rationality (including self-alignment) while working on funding for BCI research so a cohort of aligned humans can partially apotheosize and end the acute AGI risk window”. Rationality via self-alignment is instrumentally useful for rolling an FRO and doing novel research, but the positive externalities for value alignment aren’t currently my main interest.
Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”.
(Noting that it took a moment for me to connect the analogy, but including it definitely helped.)
Agree with Joseph, this is really cool stuff!
Looks to me like intermediate layers are using positions in a way very alien to humans; like, there’s some obvious “natural” semantic segmentation, but I’m not discerning a legible pattern to the distances between individual chars.
Most of our failures are due to stubbornly externalizing wrongness because that’s how we try to control, and we don’t want to give that up until we see a better way.
Thank you for clarifying something I understood intuitively but hadn’t put into words! See here for my response to that post. You’ve probably already mentioned it somewhere, but perceptual control theory relatedly posits that motivations/actions are just a way to control what sorts of things we experience.
If the same machinery underlies factual prediction and normative actions, we confuse them to all heck. This is a clearer, much more precise statement of a somewhat different mechanism than I was originally proposing here. I’ll need to think for a bit about whether this changes my BCI-superintelligence stuff.
“I “want” to get them to change their mind (because that’s what gets both of us to the truth which I already have); but I’m locally intuitively trying to push away from experiencing wrongness”
Yes, I think we’re exactly in agreement here.
And asking “What am I confused about?” won’t actually help. Because there’s no such thing to notice. An outsider may describe the struggling person as “confused” or “disoriented”, but from the inside they have no feeling of disorientation to notice. In their own model, they are oriented properly—it’s the other guy who isn’t! So far as they’re concerned the problem is the hole, not the fact that they’re trying to shove a square peg into a round hole.
Noticing confusion probably isn’t the best remedy for all situations, especially when you have a much louder mismatch like this. I didn’t mean to imply that confusion is The Solution; it’s one directionally better way to orient to a situation.
But even in situations like this, there’s probably confusion somewhere. Like, “why’s a fellow rationalist confidently Wrong?” is probably bubbling from some part of their mind, even when other stuff is talking over the confusion.
Might just be selection bias though, or some weirdness with my mind in particular, since I’m running more off of introspective memories than second-order extrapolation here.
Copy+pasting from some personal notes (can skim/skip to below the quote for explanation):
My best current theory of evpsych is that we started with some form of hardcoded local updates. For example, Drosophila circuits could only update on correlations whose joins were genetically coded for:
    for stimulus in all_stimulus_regions:
        if negative_reward:
            hardcoded_behavioral_avoidance(stimulus)
        if positive_reward:
            hardcoded_behavioral_pursuit(stimulus)

Each stimulus might have a +/- behavioral response which rewards could modulate.
Then, at some point, we got… looser associations? where a stimulus associated with other stimuli. So if heat and solar angle strongly associated, then +/- rewards which coincided with solar angle also propagated to heat regions, even if in that instant, the organism wasn’t hot.
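A toy Python sketch of this "looser associations" step. All names and numbers here (`ASSOCIATION`, `association`, `propagate_reward`, the stimulus labels, the 0.8 weight) are hypothetical illustrations, not claims about real neural circuitry. The only idea it shows is that a reward coinciding with one stimulus also updates associated stimuli, scaled by association strength.

```python
from collections import defaultdict

# Hypothetical association strengths between stimulus regions (0..1).
# Illustrative numbers only; nothing here is measured from a real organism.
ASSOCIATION = {
    ("solar_angle", "heat"): 0.8,
}

def association(a, b):
    """Symmetric lookup; a stimulus is fully associated with itself."""
    if a == b:
        return 1.0
    return ASSOCIATION.get((a, b), ASSOCIATION.get((b, a), 0.0))

def propagate_reward(valence, active_stimulus, reward, stimuli):
    """Credit `reward` to the active stimulus and, scaled by association, to the rest."""
    for s in stimuli:
        valence[s] += reward * association(active_stimulus, s)
    return valence

stimuli = ["solar_angle", "heat", "odor"]
valence = defaultdict(float)

# A negative reward coincides with solar angle while the organism is not hot.
propagate_reward(valence, "solar_angle", -1.0, stimuli)
```

In this sketch the "heat" region picks up most of the negative valence even though heat was never the active stimulus, while the unassociated "odor" region is left unchanged.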
This gradually refined into the more expressive engrams we find in rodents.
It’s unclear to me when “planning” as predictive loss-reduction started, and whether it’s even separate from engram replay or competition.
But at some point, some associations started looking more causal. “A → B” instead of “A ~ B”. And causality combined with engram replay to get Internal Simulations, the core of what I’m going to be building on here.
These simulations predicted environmental causality, and eventually chained together. The results of long chains could be distilled by other simulators; and so basically recursive timeskipping happened, where long-term heuristics were built by distilling chains of short-term heuristics, which themselves learned directly from reality.
I read what you’re saying as “factual predictions use the same machinery as behavioral executions”; in particular, “we learn to predict the environment the same way that we learn to influence it”.
And thus our factual predictions get mixed with behavioral actions. We can have an explicit epistemic qualifier for “this is a factual prediction” vs “this is an action”, but it’s very non-native.
I haven’t read your full sequence yet, so I hope this makes some sense. I strongly agree with what I just said.
I basically agree with everything you said here; can you highlight where you disagree? I don’t see where we’re diverging, seems useful to know.
Yeah! I’m talking about the clarity / prominence of the thing in your mind, not a class of qualia per se. Pretty much orthogonal things. “Awareness” is probably a better term, so I’ll correct the post.
Let’s say I’m a seasoned pentester trying to crack some app. As I’m probing the auth mechanism, “somewhere in the back of my mind” is a map of the attack surfaces, but I’m not introspective enough to notice I’m tracking that shape. It’s still there influencing my subjective experience as a “qualium”, but I can’t be metacognitive of it.
Or, while driving, every action I’m taking is in service of reaching my destination, but I’m not aware that my cognition is aimed at getting to the destination.
I’m saying “conscious of” in the sense most psychologists do, where it’s possible to actually deliberate about the object of attention. It doesn’t have anything to do with qualia; unfortunately the term is really overloaded and I don’t know a better one.
Metacognition of what you’re tracking is extremely helpful for tacit skill acquisition; it’s deliberate practice up one meta level. As a special case of this, meta-awareness of the direction you’re optimizing lets you notice when something is irrationally tugging you and gives you intuitive handles for how to fix it.
Good point! Not sure why I didn’t realize this myself.
I’d guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily “misgeneralize” to eg some sort of utilitarianism
Good point, yeah. I’m still confident in the overall machinery of “better understanding of one’s cognition and tooling to self-modify commensurately” → stability; but I really don’t have a principled way to select for this. I’m pretty confident Eliezer has demonstrated himself committed in this sense (see “genre savviness”), but I don’t know anyone else who would be a good starting point.
and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
Locally valid but connotationally wrong when read through; like, yes, we definitely lose a huge chunk of humanity-CEV in this scenario (which is what actually matters unless atemporal trade with our Everett branches can remedy the holes), but I’d expect a “kindness-foomed” entity to probably not kill people to repurpose their atoms for other entities. A hedonium-foom would, sure, but killing isn’t particularly kind to most people.
Most people currently thinking about AI alignment seem to hope that there is some sort of “formula” for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn’t some “formula” for how to do this thinking.
A priori, I’m about 95% confident that there’s some coherent and robust math for Vingean reflection which we have yet to invent. But our chances of cracking it before ASI / human superintelligence (HSI) are quite thin, like maybe 10%, on my modal model.
many people seem to think that it would be just fine to let 2025 Claude foom
Claude 3.5 or any other LLM has vastly worse cognitive attractor dynamics under self-modification than humans do given a commercially-induced RSI capability. I have a draft story about this sort of thing; but basically, the internals range from maybe-aligned-but-horribly-incapable to unaligned-and-incapable to unaligned-and-capable-of-RSI; nowhere along that Pareto frontier do we see something as stable (wrt raw-utility-as-would-be-galactically-amortized post-foom) as a mildly above-average human.
I can however imagine models two generations from now, were they aligned like Opus 3.5, being sufficiently stable in the comparably more narrow action domain of doing a pivotal act to actually bump themselves another 2 generations’ worth of capacity and just execute a pivotal act. But I really don’t think we’ll get Claude Legolas 8.6 aligned like that (P < 0.03).
how do you maintain a belief in god over very much thinking / capability gain (and the thing being basically false)
Might be useful, but this conflates instrumental epistemics with a normative/value thing (which you acknowledge indirectly). This gap widens under intelligence augmentation, but on my model, values become more stable.
Sure, a corrigible CCP ASI may not be worse than a nationalized US ASI, under the current administration.
I still feel that the next US administration is less bad in expectation than Trump, and I’d also take my chances with AI lab CEOs over the CCP.
I’m surprised this sort of thing isn’t more popular here. Do you have any recommendations for non-rationality cognition stuff on LW?
“Is it perhaps the case that having their left horns touched is painful to an Owned Thing? That having their right horns touched is pleasurable?” said the Humans.
I’d expect that, in LLMs, unlike the creatures of this story, it’s the expectation of signed reward which causes valence; I don’t think that the gradient graph “feels” like anything, but the forward pass plausibly could.
On my model, the psychologically accessible predictions you’re talking about are a result of engrams having worn those computational grooves. Like, your brain fired a ton of anticipatory patterns which were too quiet to feel, but they nonetheless formed the habit you’re reactivating when playing the music.