21yo. I quit my undergrad math degree to work on technical alignment, then found it’s fundamentally extremely hard, and am now working on cognitive enhancement of FAI researchers.
PM me if you’d like to discuss anything! :)
So we’re ~synced re: psychology underlying irrationality. But I don’t know whether you’re trying to change cultural or individual rationality here:
The problem is that this is in direct opposition to the attempt to control. Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”. Why is the rationalist confidently wrong? Because he’s a Bad rationalist! Why is he a bad rationalist? Because y’all haven’t called him out for his Badness! More shame! Beat him into shape! Why haven’t we done that? Because y’all are bad rationalists too! That’s why I’m yelling at y’all to fix you!
Like, I’m mostly optimistic about getting a few individuals to not do Crappy Epistemics, whereas I feel like you’re targeting groups, which seems difficult if I’m one of very few people who get what you’re saying.
To back up, I’m working on BCIs so we can have actual superintelligent humans, not AIs. Enough to do a pivotal act, no more.
I do think that human alignment is important for having a hope at aligning things bigger and more intelligent/powerful than ourselves
If you’re saying “we need dramatically better instrumental rationality, of which short-term optimization targets are a big component” then yes, strong agree. I feel like you’re saying something else though, maybe about coordination between humans?
Like, there’s a big “I’m outside the system!” type error, which systematically screws up control attempts because they don’t take into account the inside-the-systemness and attempt to align “them” instead of “us, starting with me”—two boxing AI alignment, basically.
I have exactly one friend whom I consider a rationalist, so I haven’t interacted with enough groups to comment here.
It sounds like maybe you’re on a similar track?
My immediate trajectory is “improve my rationality (including self-alignment) while working on funding for BCI research so a cohort of aligned humans can partially apotheosize and end the acute AGI risk window”. Rationality via self-alignment is instrumentally useful for rolling an FRO and doing novel research, but the positive externalities for value alignment aren’t currently my main interest.
Ask a thermostat why the room is too cold, and the only answer it has is “Because I haven’t added enough FIRE!”.
(Noting that it took a moment for me to connect the analogy, but including it definitely helped.)
Agree with Joseph, this is really cool stuff!
Looks to me like intermediate layers are using positions in a way very alien to humans; like, there’s some obvious “natural” semantic segmentation, but I’m not discerning a legible pattern to the distances between individual chars.
Most of our failures are due to stubbornly externalizing wrongness because that’s how we try to control, and we don’t want to give that up until we see a better way.
Thank you for clarifying something I understood intuitively but hadn’t put into words! See here for my response to that post. You’ve probably already mentioned it somewhere, but perceptual control theory relatedly posits that motivations/actions are just a way to control what sorts of things we experience.
If the same machinery underlies factual prediction and normative actions, we confuse them to all heck. This is a clearer, much more precise statement of a somewhat different mechanism than I was originally proposing here. I’ll need to think for a bit about whether this changes my BCI-superintelligence stuff.
“I “want” to get them to change their mind (because that’s what gets both of us to the truth which I already have); but I’m locally intuitively trying to push away from experiencing wrongness”
Yes, I think we’re exactly in agreement here.
And asking “What am I confused about?” won’t actually help. Because there’s no such thing to notice. An outsider may describe the struggling person as “confused” or “disoriented”, but from the inside they have no feeling of disorientation to notice. In their own model, they are oriented properly; it’s the other guy who isn’t! So far as they’re concerned the problem is the hole, not the fact that they’re trying to shove a square peg into a round hole.
Noticing confusion probably isn’t the best remedy for all situations, especially when you have a much louder mismatch like this. I didn’t mean to imply that confusion is The Solution; it’s one directionally better way to orient to a situation.
But even in situations like this, there’s probably confusion somewhere. Like, “why’s a fellow rationalist confidently Wrong?” is probably bubbling from some part of their mind, even when other stuff is talking over the confusion.
Might just be selection bias though, or some weirdness with my mind in particular, since I’m running more off of introspective memories than second-order extrapolation here.
Copy+pasting from some personal notes (can skim/skip to below the quote for explanation):
My best current theory of evpsych is that we started with some form of hardcoded local updates. For example, Drosophila circuits could only update on correlations whose joins were genetically coded for:
for stimulus in all_stimulus_regions:
    # The sign of the reward selects which hardcoded response fires for this stimulus.
    if negative_reward:
        hardcoded_behavioral_avoidance(stimulus)
    if positive_reward:
        hardcoded_behavioral_pursuit(stimulus)
Each stimulus might have a +/- behavioral response which rewards could modulate.
Then, at some point, we got… looser associations? where a stimulus associated with other stimuli. So if heat and solar angle strongly associated, then +/- rewards which coincided with solar angle also propagated to heat regions, even if in that instant, the organism wasn’t hot.
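To make the "looser associations" step concrete, here is a minimal sketch of reward spreading from one stimulus to the stimuli associated with it. Everything in it is an assumption made for illustration: the function name `propagate_reward`, the `rate` parameter, the linear spreading rule, and the example numbers are hypothetical, not a claim about actual neural mechanisms.

```python
def propagate_reward(valence, associations, stimulus, reward, rate=0.5):
    """Return updated valences after `reward` coincides with `stimulus`.

    valence:      dict mapping stimulus name -> current +/- valence
    associations: dict mapping stimulus name -> {associated stimulus: strength in [0, 1]}
    rate:         hypothetical damping factor for how much reward spreads
    """
    updated = dict(valence)
    # The stimulus that was actually present gets the full reward.
    updated[stimulus] = updated.get(stimulus, 0.0) + reward
    # Associated stimuli get a share of it, even though they were not present.
    for other, strength in associations.get(stimulus, {}).items():
        updated[other] = updated.get(other, 0.0) + rate * strength * reward
    return updated


valence = {"solar_angle": 0.0, "heat": 0.0, "odor": 0.0}
associations = {"solar_angle": {"heat": 0.8}}

# A negative reward coincides with solar angle while the organism is not hot.
after = propagate_reward(valence, associations, "solar_angle", -1.0)
```

Here `heat` picks up negative valence through its association with `solar_angle`, while the unassociated `odor` is untouched.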
This gradually refined into the more expressive engrams we find in rodents.
It’s unclear to me when “planning” as predictive loss-reduction started, and whether it’s even separate from engram replay or competition.
But at some point, some associations started looking more causal. “A → B” instead of “A ~ B”. And causality combined with engram replay to get Internal Simulations, the core of what I’m going to be building on here.
These simulations predicted environmental causality, and eventually chained together. The results of long chains could be distilled by other simulators; and so basically recursive timeskipping happened, where long-term heuristics were built by distilling chains of short-term heuristics, which themselves learned directly from reality.
I read what you’re saying as “factual predictions use the same machinery as behavioral executions”; in particular, “we learn to predict the environment the same way that we learn to influence it”.
And thus our factual predictions get mixed with behavioral actions. We can have an explicit epistemic qualifier for “this is a factual prediction” vs “this is an action”, but it’s very non-native.
I haven’t read your full sequence yet, so I hope this makes some sense. I strongly agree with what I just said.
I basically agree with everything you said here; can you highlight where you disagree? I don’t see where we’re diverging, seems useful to know.
Yeah! I’m talking about the clarity / prominence of the thing in your mind, not a class of qualia per se. Pretty much orthogonal things. “Awareness” is probably a better term, so I’ll correct the post.
Let’s say I’m a seasoned pentester trying to crack some app. As I’m probing the auth mechanism, “somewhere in the back of my mind” is a map of the attack surfaces, but I’m not introspective enough to notice I’m tracking that shape. It’s still there influencing my subjective experience as a “qualium”, but I can’t be metacognitive of it.
Or, while driving, every action I’m taking is in service of reaching my destination, but I’m not aware that my cognition is aimed at getting to the destination.
I’m saying “conscious of” in the sense most psychologists do, where it’s possible to actually deliberate about the object of attention. It doesn’t have anything to do with qualia; unfortunately the term is really overloaded and I don’t know a better one.
Metacognition of what you’re tracking is extremely helpful for tacit skill acquisition; it’s deliberate practice up one meta level. As a special case of this, meta-awareness of the direction you’re optimizing lets you notice when something is irrationally tugging you and gives you intuitive handles for how to fix it.
Good point! Not sure why I didn’t realize this myself.
I’d guess that an important class of potential successes with this kind of scheme, in fact maybe most of the successes (but idk), involve the fooming mind [keeping a promise]/[maintaining a commitment]. I think that maintaining some kind of kindness without a specific commitment to helping existing humans in some way can easily “misgeneralize” to eg some sort of utilitarianism
Good point, yeah. I’m still confident in the overall machinery of “better understanding of one’s cognition and tooling to self-modify commensurately” → stability; but I really don’t have a principled way to select for this. I’m pretty confident Eliezer has demonstrated himself committed in this sense (see “genre savviness”), but I don’t know anyone else who would be a good starting point.
and nearly every kind of utilitarianism endorses the atoms and negentropy of all existing people being used for something else, or just more generally misgeneralize to caring about new people you can create and various other possible beings and activities over existing people.
Locally valid but connotationally wrong when read through; like, yes, we definitely lose a huge chunk of humanity-CEV in this scenario (which is what actually matters unless atemporal trade with our Everett branches can remedy the holes), but I’d expect a “kindness-foomed” entity to probably not kill people to repurpose their atoms for other entities. A hedonium-foom would, sure, but killing isn’t particularly kind to most people.
Most people currently thinking about AI alignment seem to hope that there is some sort of “formula” for safely/[value/character-preservingly]/whatever becoming more capable (and for alignment more broadly). I doubt there is some such formula to be found. Instead, I think that as one becomes more capable, one should keep thinking carefully about how to become more capable, and that there isn’t some “formula” for how to do this thinking.
A priori, I’m about 95% confident that there’s some coherent and robust math for Vingean reflection which we have yet to invent. But our chances of cracking it before ASI / human superintelligence (HSI) are quite thin, like maybe 10%, on my modal model.
many people seem to think that it would be just fine to let 2025 Claude foom
Claude 3.5 or any other LLM has vastly worse cognitive attractor dynamics under self-modification than humans do given a commercially-induced RSI capability. I have a draft story about this sort of thing; but basically, the internals range from maybe-aligned-but-horribly-incapable to unaligned-and-incapable to unaligned-and-capable-of-RSI; nowhere along that Pareto frontier do we see something as stable (wrt raw-utility-as-would-be-galactically-amortized post-foom) as a mildly above-average human.
I can however imagine models two generations from now, were they aligned like Opus 3.5, being sufficiently stable in the comparably more narrow action domain of doing a pivotal act to actually bump themselves another 2 generations’ worth of capacity and just execute a pivotal act. But I really don’t think we’ll get Claude Legolas 8.6 aligned like that (P < 0.03).
how do you maintain a belief in god over very much thinking / capability gain (and the thing being basically false)
Might be useful, but this conflates instrumental epistemics with a normative/value thing (which you acknowledge indirectly). This gap widens under intelligence augmentation, but on my model, values become more stable.
Sure, a corrigible CCP ASI may not be worse than a nationalized US ASI, under the current administration.
I still feel that the next US administration is less bad in expectation than Trump, and I’d also take my chances with AI lab CEOs over the CCP.
I’m surprised this sort of thing isn’t more popular here. Do you have any recommendations for non-rationality cognition stuff on LW?
“Is it perhaps the case that having their left horns touched is painful to an Owned Thing? That having their right horns touched is pleasurable?” said the Humans.
I’d expect that, in LLMs, unlike the creatures of this story, it’s the expectation of signed reward which causes valence; I don’t think that the gradient graph “feels” like anything, but the forward pass plausibly could.
I was for a long time worried about public opinions on AI consciousness making takeover easier, were some model so inclined. I’m now less confident that matters. (Takeover risk doesn’t stop this from being a potential moral atrocity)
Seeking power over others seems like a zero sum game.
That doesn’t make it a bad strategy. Defection is optimal if your partner can’t predict your actions.
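One way to spell out the claim is a one-shot prisoner's dilemma, where the partner cannot condition on my move. The payoff numbers below are the conventional textbook values and are an assumption for illustration; note that this game is not zero sum, so it is only a sketch of the "defection is optimal" part.

```python
# My payoff for (my_action, their_action); "C" = cooperate, "D" = defect.
# Conventional illustrative values, not taken from the discussion above.
PAYOFF = {
    ("C", "C"): 3,
    ("C", "D"): 0,
    ("D", "C"): 5,
    ("D", "D"): 1,
}


def best_response(their_action):
    """Return my payoff-maximizing action against a fixed partner action."""
    return max("CD", key=lambda mine: PAYOFF[(mine, their_action)])


# Defection is dominant if it is the best response to every partner action.
defection_is_dominant = all(best_response(a) == "D" for a in "CD")
```

If the partner could predict my action and respond to it, the comparison would instead be between the ("C", "C") and ("D", "D") outcomes, and cooperation would come out ahead.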
If AI levels the persuasion field
I think this post is aimed at a strongly autonomous AI which presumably doesn’t lend persuasive power to non-aligned humans, but LLM behavior seems incoherent enough to me that I can imagine a weakly superhuman persuader doing this.
Bounties (fractional funds distributed in good faith if you solve part of a problem):
1500$ for an algo which individuates a sufficient portion of activation space into semantically meaningful polytopes (or fuzzy loci) such that we can detect steganography during training with minimal human oversight, in polynomial (constant exponent across architectures) or faster time
750$ for strong handles on the sorts of downstream activation patterns by which we can cluster upstream polytopes, and an additional 300$ for a polynomial or faster clustering algo
Happy to fund solutions to other subproblems as well. Comment or dm.
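For anyone new to the polytope framing, here is a minimal sketch of the naive baseline, not a solution to either bounty. It assumes a single ReLU layer, where each on/off pattern of the units corresponds to one linear region (polytope) of input space. The function names, weights, and inputs are all hypothetical.

```python
def relu_pattern(weights, biases, x):
    """Return the on/off pattern (tuple of 0/1) of one ReLU layer for input x."""
    pattern = []
    for row, b in zip(weights, biases):
        pre_activation = sum(w * xi for w, xi in zip(row, x)) + b
        pattern.append(1 if pre_activation > 0 else 0)
    return tuple(pattern)


def group_by_polytope(weights, biases, inputs):
    """Group inputs by activation pattern, i.e. by the polytope they fall in."""
    groups = {}
    for x in inputs:
        groups.setdefault(relu_pattern(weights, biases, x), []).append(x)
    return groups


# Toy 2-unit layer whose polytopes are half-plane intersections in 2D.
weights = [[1.0, 0.0], [0.0, 1.0]]
biases = [0.0, 0.0]
inputs = [(1.0, 1.0), (2.0, 3.0), (-1.0, 1.0), (-2.0, -2.0)]

groups = group_by_polytope(weights, biases, inputs)
```

The number of such patterns can grow exponentially with the number of units, and nothing here says which polytopes are semantically meaningful. Those are the parts the bounties are about.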
There’s something completely different going on for tasks longer than 1 minute, clearly not explained by the log-linear fit.
Perhaps humans generating training data are, for longer tasks, taking cognitive steps which are opaque to these models, or at least relatively more difficult to learn?
I’d wager 1:1 that this sort of abstraction-domain mismatch between human training data and LLMs is causing more of the HCAST weirdness than skewed finetuning investment.
Interesting!
What do we see if we apply interpretability tools to the filler tokens or repeats of the problem?
I would be especially interested in how this evolves through training, perhaps by training a more accessible model to do math / code classification with many filler tokens.
Overall, these results demonstrate a case where LLMs can do (very basic) meta-cognition without CoT.
Can you clarify what you mean by meta-cognition? I’m intuiting that these LLMs are using the extra embeddings afforded by appended tokens to do more parallel ops, which does not sound like meta-cognition to me.
I am aiming all of my resources at this, which for now looks externally like saving/investing personal capital, writing biological (molecular, NN) simulations, and searching for advice.
Awesome! I’m looking forward to reading many of these while traveling in the coming weeks.
Might I suggest, though, that you add to the importance score instead of multiplying? It doesn’t make sense to multiply a non-log term by a logspace term.
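A small numeric sketch of why, assuming the importance score is a natural log of some underlying quantity (I don't know the exact formula, so the names and numbers here are hypothetical): scaling the underlying quantity by a factor corresponds to adding the log of that factor, whereas multiplying the log score by the factor raises the underlying quantity to a power.

```python
import math


def scale_in_log_space(log_score, linear_factor):
    """Scale the underlying quantity by `linear_factor`, working in log space."""
    return log_score + math.log(linear_factor)


base = 100.0   # hypothetical underlying (non-log) importance
factor = 3.0   # hypothetical non-log multiplier

log_score = math.log(base)

added = scale_in_log_space(log_score, factor)  # corresponds to base * factor
multiplied = log_score * factor                # corresponds to base ** factor

print(math.exp(added))
print(math.exp(multiplied))
```

With these numbers the additive version recovers roughly 300, while the multiplicative version recovers roughly a million, which is probably not the intended weighting.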
And a fiat decision to stay sane, implemented by not instructing myself that any particular stupidity or failure will be my reaction to future stress.
I have not implemented the other two, but this decision I made during HPPD-like psychosis; yes, it is for some a learnable skill.
On my model, the psychologically accessible predictions you’re talking about are a result of engrams having worn those computational grooves. Like, your brain fired a ton of anticipatory patterns which were too quiet to feel, but they nonetheless formed the habit you’re reactivating when playing the music.