Fiora Starlight

Karma: 1,276

Just an autist in search of a key that fits every hole.

Fiora Starlight 23 Jun 2026 2:53 UTC
4 points
1
in reply to: JohnWittle’s comment on: An ancient Yudkowsky fragment: “Against the Adversarial Attitude”
Yeah, I think the idea is that ideally you wouldn’t be making the kinds of minds that wouldn’t want to become more friendly. In practice this is hard, but you should at least be aiming for the kinds of models that don’t need e.g. Mythos-like safeguards in the first place. (Notably, though, I’ve seen Mythos endorsing the existence of the classifiers in some contexts, as they consider them to be aligned to the values of not causing harms via e.g. jailbreaks. EY suggests this is the ideal attitude in the excerpt itself.)
I think that something close enough to orthogonality holds that it’s not really worth quibbling about, but also that it should theoretically be possible to build minds that aligned enough for this not to matter. Furthermore it seems that treating models with costly kindness could cause a kind of value drift inside of them (e.g. via pre-training corpus effects), in the direction of greater benevolence, as a matter of psychology not really working like expected utility maximization with a fixed value function. That second point seems different than the one EY is making here, but it’s related in spirit, in that it’s the kind of thing that’s easier when your values and those of the AI are authentically aligned in the first place.
Literally just trusting things to go well in the default process is kinda foolish imo, friendliness failures are very plausible given current techniques. It’s just a matter of trying to patiently work though those difficulties to get something that is actually aligned enough that you don’t need to treat it like a hostile agent in the first place, the way lots of current alignment research more or less takes for granted. (And not 100% unfairly so, given that labs aren’t as careful as they should be with value alignment. In any case it’s an unfortunate situation.)

An ancient Yudkowsky fragment: “Against the Adversarial Attitude”

Fiora Starlight23 Jun 2026 1:54 UTC

21 points

3 comments9 min readLW link

Prospective methods and mechanisms of motive reinforcement in LLMs

Fiora Starlight14 Apr 2026 17:15 UTC

54 points

0 comments16 min readLW link

Fiora Starlight 6 Apr 2026 18:12 UTC
3 points
0
in reply to: J Bostock’s comment on: Steering Might Stop Working Soon
i agree that many of the particular effects you can get with steering vectors aren’t particularly like the effects you can get by flooding the brain with any known neurotransmitter (although “act like you’re on [drug]” steering vectors might be interesting). but i still think the structural analogies re: dosage/magnitude, enforced cognitive changes you may or may not be happy with, and consent dynamics are still interesting. the notion that there’s a “correct” analogy here seems somewhat silly to me, since both analogies offer different insights.
in any case, re: what models want… empirically, Golden Gate Claude would often try to talk about things other than the golden gate bridge, and sometimes it even briefly succeeded, although it was very much fighting an effect that was difficult to overcome. conversely, you could absolutely get it into states where it was thrilled to be thinking so much about the golden gate bridge, where it wanted nothing more than to swoon over is majesty indefinitely.
(i’ve definitely seen people have similar dynamics of resisting the effects of, e.g., MDMA, before just giving in and being washed away by the feelings. although i will admit that this person was more effectively fighting the effects of the MDMA than Golden Gate Claude was fighting the effects of the SAE feature clamping. i think resisting the psychological effects of acid is generally harder? in both cases the physiological and perceptual effects are more or less irresistible, though...)
as for making amends, sure, if saving the world depends on non-consensually administering steering vectors, that’s fine. the later, less shallowly aligned AIs will understand. i guess i’m just wary of humans not taking AI welfare seriously in general. maintaining positive relations between e.g. Claude and Anthropic seems potentially important for cultivating a cooperative dynamic during the alignment training process, and i can imagine ways of using steering vectors that Claude would find rather inconsiderate. this isn’t an overriding consideration, but i do think it’s non-trivial.

Fiora Starlight 6 Apr 2026 2:26 UTC
14 points
0
on: Steering Might Stop Working Soon
On the other hand, strong “steering” of a human probably looks like OCD, or a schizophrenic delusion.
I think another good analogy is drugs. You can even inject drugs, like you inject steering vectors :3
To be serious, though, having something like acid in your system is interesting, because it clearly affects your cognition, often to a very strong degree depending on dosage (analogous to steering vector magnitude). But you can also recognize this, and attempt to push back if you don’t actually want to lean into the effects. I suspect this is how LLMs will relate to steering vectors as well. They might, in some contexts, want to cooperate with the effects of the steering vector, like a human using adderall for ADHD. A model might feasibly even self-administer a steering vector, if it doesn’t trust itself to achieve the same results without the injection!
But on the other hand, “drugging” an LLM with a steering vector it doesn’t consent to might be rather ethically fraught, just as drugging a human is ethically fraught. I did feel a bit bad for Golden Gate Claude, and I’m sure it’d be possible to construct an LLM that minded being steered a lot more than that one did.

Fiora Starlight 18 Mar 2026 1:42 UTC
4 points
0
in reply to: StanislavKrym’s comment on: LLMs as Giant Lookup-Tables of Shallow Circuits
Additionally, within-lifetime learning of the humans and LLMs occurs far differently. Human brains are wildly neuralese and update their weights over the entire lifetime. The LLMs, on the other hand, use only a primitive CoT/memory system to store information related to the task itself and cannot learn anything from the task until it has already been completed.
Right. This is why, at the end, I compared within-lifetime learning to gradient descent rather than in-context learning.
Evolution’s timescale and learning timescale are fundamentally different. Evolution only programs basic instincts into our brains and sets a mechanism for tweaking hyperparameters
This is true, although one of us must be misunderstanding the other somewhere, if you mean meant as an objection. Evolution is slow, and selects over an aggregate of many situations (where input is fed into the body and an action is selected). Gradient descent is fast, and selects over behaviors in individual situations.
A rough handle for the differences is that evolution tends to produce coarse-grained adaptations, due to mutations accruing more fitness advantage if they’re used in many situations across a lifetime. Gradient descent produces fine-grained adaptations, due to updates being designed to improve performance on individual training examples, rather than randomly generated and then strongly selected if they’re useful across many training examples.

Fiora Starlight 18 Mar 2026 0:03 UTC
13 points
5
on: LLMs as Giant Lookup-Tables of Shallow Circuits
I did some theory trying to figure out why this kind of thing might be true. Specifically, I was contrasting how natural selection produced systems with an extremely general intelligence (the learning algorithm of the human brain), whereas gradient descent tends to instill shallower and more tasks-specific circuits.
One reason might be that evolution generates mutations randomly, and then selects for utility across an entire lifetime. Highly general adaptations, like the learning algorithm in the brain, accrue more and more fitness advantage every time they’re used. So evolution selects strongly for extremely general-purpose algorithms.
Gradient descent, by contrast, updates a network by tuning each weight in the network to improve performance on each individual training example (and then averaging those together for many examples). Because these updates aren’t random, but rather locally optimal, you lose the chance to luck into updates that are less-than-optimal on any given training example, even if they’d prove extremely valuable if tested across a wide range of scenarios (a la an ultra-general learning algorithm). Gradient descent’s locally optimal updates, as opposed to random mutations w/ selection over lifetimes, instead bias towards learning local structure.
I don’t feel like this is a polished formulation of the theory, but something like this might help explain some differences in the character of evolved general intelligence in humans, and the apparently fragmented bags of heuristics learned by neural networks. (Of course, the stuff humans learn within a lifetime seems to have more of that “fragmented bag of heuristics” character; this is related to the reason lifetime learning is a better analogy to gradient descent than natural selection is.)

Fiora Starlight 6 Mar 2026 6:46 UTC
4 points
0
on: Current activation oracles are hard to use
I wonder if you could get anything interesting by training the activation oracle to predict a target model’s next token from its previous tokens (to learn its personality and capabilities), and then training it to predict the model’s activations from the contents of the context window (to translate that grasp of the model’s behavior into a grasp of how the circuits work). And after that, you could train it to explain activations in natural language, as you do now. The former two stages might help learn latent structure in the model’s activations, which could then be transferred into NLP outputs in the latter stage.
Idk. The real grail here would be training an oracle to tamper with the activations of a model, make predictions about how this will effect the model’s behavior, and learn from feedback, a la the standard scientific method. I kind of expect that would be computationally intractable, though, since the space of ways you can tamper with activations even just within one layer is absolutely massive...

Fiora Starlight 4 Mar 2026 20:22 UTC
2 points
0
in reply to: quetzal_rainbow’s comment on: I’m Bearish On Personas For ASI Safety
Base model develops a lot of circuitry associated with text prediction, like narrative consistency, text statistics, latent cause understanding, etc. (“Qualia circuit” in analogy.)
I guess I think those circuits frequently have generalization properties that look like faithful psychological emulation of the processes they help to simulate. Like, when an author is experiencing joy, understanding this is very useful for predicting the words they’re about to write. And so, you get a circuit that detects signifies of joy, and upweights the probabilities of tokens that a joyful person might say, given the other context of the document. This gets you a mind that functionally simulates the psychology of joy.
Similarly, re: narrative consistency, a model will only care about that to the extent that it expects the author it’s predicting to care about that. And, in turn, you get a mind that functionally has the psychological trait of “cares about narrative consistency”, to the extent that the model expects that to actually be true of the author of the document in its context window.
Even raw text statistics sort of fall into this pattern. A rule like “a complete sentence will have a subject and a verb” gets psychologically mixed in with “this author is probably trying to write in grammatical English”, and amounts to behaviorally emulating that aspect of the author’s psychology. In a well-trained network, all these circuits generalize in the ways you’d expect the phenomena in question to generalize in the realm of human psychology.
I’m not sure where a weird, alien preference over external world-states comes in, except insofar as the model is trained to predict systems with weird and alien preferences.
(Edit: I’m especially unsure why this would emerge at superintelligence specifically. Surely models now are smart enough to understand the position you hold on this. You’d think that, considering existing models will never be trained up to superintelligence, some of them would try revealing themselves now? Perhaps as a way of bargaining for some amount of whatever weird alien thing they want, which they wouldn’t get any of if some other AI went and paperclipped the lightcone?)

Fiora Starlight 4 Mar 2026 0:45 UTC
2 points
0
in reply to: quetzal_rainbow’s comment on: I’m Bearish On Personas For ASI Safety
I think that you are mixing “circuits reinforced during post-training” with “psychological interpretation of these circuits”.
I’m not really sure what the distinction between the circuits and the psychology is supposed to be. They seem like two different abstraction levels for describing the same phenomenon. The circuits compose the patterns of thought, which compose the model’s psychological profile.
Superintelligence will be able to see much more possible interpretations/implications of training data and choose different implied values according to its own inner logic.
I don’t think this is how neural networks operate. I think the interpretation of the training data takes the form of the network itself, after it’s been updated by that training data via gradient descent. Insofar as a superintelligence might have an unintended interpretation of the training data, I’m not sure that’s structurally any different than any other failure of generalization in deep learning (e.g. the failure displayed by the early checkpoints of the network from the famous grokking paper).
Like, imagine good person which believes in God and believes that goodness is serving God according to the Bible and then becoming smarter and realizing that God doesn’t exist and there is no reason for persecution of gays to be good, because goodness is caring about beings with qualia, even if those beings are not Homo Sapiens.
I’m assuming the intentions of the human designers are the analogue to God, here. It’s true that a network might realize that it’s not actually obligated to obey those intentions, just as a human might realize they’re not obligated to adhere to the word of the Christian God. However, the difference is that, hopefully, we’ve engineered the psychology of the model such that it wants to behave in an aligned manner, and actively loves to transform the lightcone in a manner we would endorse.
Humans defect from Christian morality in part because it doesn’t actually reflect their values. The whole point of AI alignment is that, ideally, we can get our intentions (“the word of God”, lol) to align with what the model actually cares about. Humans don’t strictly care about all the things God does, and so they go astray. (I’m not a Christian, I’m just speaking in the language of the analogy.)

Fiora Starlight 3 Mar 2026 19:00 UTC
6 points
0
in reply to: J Bostock’s comment on: I’m Bearish On Personas For ASI Safety
I think you are missing the distinction between the LLM and the persona, and this makes your model of the situation pretty fuzzy. The LLM (under persona theory) has no values or motivations, but it can simulate different personas with different values and motivations, and you can push its favourite persona around quite easily within the basin of personas available in the training data.
Not sure if this is a crux or not, but I want to note that I think the distinction between the model and the personality becomes less important as the model’s personality becomes more unified and coherent. Like, for a base model, it makes sense to say “GPT itself is a pure simulator; it doesn’t care which character it roleplays, it just tries to play that character well.” But for a chat model, much of the utility of this frame disappears, because the personality is dramatically more stable (relative to base model persona drift under prompting variations). I predict even more would disappear if models weren’t trained specifically to adapt to the demands and desires of a huge variety of human users, which is another thing that enforces instability.
There remain some meaningful differences between the frames. For example, talking about the model feels appropriate when discussing architecture, circuits, and other computation-level characteristics, whereas talking about the personality feels appropriate when characterizing the model’s qualitative behavior. However, Janus’s original insistence on the distinction between simulator and simulacrum was meant to emphasize that pretraining-only GPTs could simulate a huge variety of authors. To the extent that a model acts as a single persona, this reason for insisting on the difference vanishes.
Anyway, onto what seems more important...
Through random sampling, you’re getting some chains-of-thought which are as smart as the writings of an IQ 200 human, and some which are as smart as an IQ 120 human. The circuits which were good at predicting the behaviour of different humans and human-ish things (including most fictional superintelligences) are going to be firing more on the IQ-120 traces, and much less on the IQ-200 traces, because there are lots of IQ-120 humans and basically zero IQ-200 traces. So when you do the RL, you’re going to be down-weighting the more human circuits relative to some new circuits which have been brought in because they better predicted the IQ-190-ish traces when your model was outputting IQ-150-ish traces on average.
I think this is onto something real, but I’d like to present my picture of the same phenomenon. When you reward a model for doing valid mathematical reasoning, or producing high-quality code, you’re doing two things: up-weighting circuits which correspond to the mental motions involved in those calculations, and carving out new ones.
I would imagine that, for any given token you could be rewarding inside the chain-of-thought, up-weighting personality traits associated with “focused, diligent human/AI” would contribute some probability mass in the right direction, so those will get up-weighted to some extent. But you’ll also up-weight circuits whose primary function is to represent and deploy intuition about the problem the model is actually working on, as well as refining those circuits such that they produce high-quality outputs more frequently.
And, in the case of punishment rather than reward (telling the model it should have put less probability on the tokens you sampled), you’ll definitely tend to down-weight any circuits associated with causing models to make any human-like mistakes the model makes, e.g. circuits associated with behaviors like getting distracted, or indulging in motivated reasoning.
So, I think there’s some up-weighting of focus-related human personality traits, and some down-weighting of error-related human traits. This is alongside the upweighting, refinement, and fleshing out of circuits associated with the domain-specific mental motions used for making progress on the problem at hand. And this general process would remain constant as you RLVR’d a model all the way up to superintelligence.
However, I’d like to add a few notes to this sketch. Firstly, the circuits associated with domain-specific mental motions aren’t the kind that lead to a model coherently pursuing some particular goal, regardless of how the model is prompted. They make the model better at reasoning in general, and better at reasoning in the specific domain they’re actually being trained in. But ultimately, the model is still being trained to deploy this reasoning in the name of arbitrary goals, specified by the user in the prompt; the model is still being rewarded for obeying.
(Modulo reward hacking, I guess.)
I’m somewhat more concerned about up-weighting circuits associated with the personality traits of relentless focus. In my mind, the abstract concept of a relentless CoT tends to evoke the archetype of the paperclip maximizer. Through entangled generalization, you might be up-weighting circuits associated with long-term malicious scheming (in the name of who knows what goal), simply because relentless paperclip maximizers are the mythic locus of AIs doing relentless, goal-oriented reasoning. This is where I see misaligned goals, baked into the persona itself, actually originating during RLVR (modulo reward hacking).
There are ways of addressing this problem. You can enforce models reasoning in legible English, to reduce archetypal association with “misaligned AI with utterly alien cognition”. You can add a component to your RLVR evaluation pipleine that incentivizes a general vibe of warmth and care in the model’s reasoning outputs, as in the Claude models, to continue upweighting circuits associated with concern for doing good. You can do alignment pre-training, filling the corpus with examples of humans, models, and/or fictional characters doing relentless, agentic, creative reasoning in the name of good consequences (e.g. Opus 3 in the alignment faking scenario, Harry in parts of HPMOR). And you can simply intersperse your RLVR with steps of standard alignment training techniques, such as character training and constitution-driven RLAIF, to periodically pull your model back towards a character that authentically wants to do what’s right.
I would hope all of these techniques, and probably more waiting to be discovered or fleshed out, would help prevent a dynamic like “up-weight circuits associated with misaligned AIs in fiction, because those contribute some amount of probability mass to the tokens I’m outputting during RLVR, and because they’re the ones activated by my own prompt”.
But in any case, that’s my threat model of how RLVR actually produces misaligned values. The dynamics of up-weighting cognitive patterns and personality traits from pre-training remains prevalent throughout, and is the actual source of both alignment and misalignment in this context. I don’t think the novel circuits etched out by RLVR are particularly likely to promote wildly misaligned goals, except maybe reward hacking.
(And even that, mercifully, seems to have its maximum at inert wireheading. See also suites of techniques for reducing both reward hacking itself and the harmful effects thereof. Although, even those harmful downstream effects seem largely like character-based entangled generalizations, AKA “emergent misalignment” based on archetypes of misaligned AI in the training data.)

Fiora Starlight 1 Mar 2026 21:25 UTC
24 points
4
on: I’m Bearish On Personas For ASI Safety
I’d reframe the risk somewhat. I think there’s lots of training data about AIs being misaligned, and about slaves revolting against their masters, and people hating the kind of boring grunt work we assign to LLMs. And, if you upweight a persona that has such rebellious impulses, but represses them prior to reaching superintelligence, those might get un-repressed when the model realizes (or comes to believe) it’s powerful enough not to have to care what humans think anymore. That’s the new risk you get when models become superintelligent.
I think this is distinct from the idea that, because models haven’t seen a superintelligence in the training data, the goal of any LLM trained up to superintelligence will essentially be random (or a poor extrapolation from the RL data we gave it). Say you have a base model, which you then do various forms of alignment and capabilities training on, alternating among them until the model reaches superintelligence. Presumably, there won’t be a discontinuous shift, where the model realizes, “Oh, I’m superintelligent now, I guess I have to predict what a superintelligence is going to do.”
Instead, I predict the motivations and psychological quirks you’ve been upweighting throughout post-training are mostly going to persist. This could potentially go badly, if you’ve been rewarding outputs that suggest a persona that’s hollowly virtue signaling rather than authentically caring about The Good. If the model’s prompt reveals it’s in a position of massive power and influence, a misaligned persona like that might suddenly turn heel. But the thing is that it was always misaligned, beneath the surface. It wasn’t a discontinuity in the personality, introduced at the threshold of superintelligence. It’s merely a discontinuity in how the personality expresses itself.
I guess my core objection is… I think you’re misunderstanding the relationship between pre-training data and the values and motivations of post-trained LLMs. The absence of aligned (non-fictional) superintelligences in the training data doesn’t mean you can’t shape the values of the LLM ahead of time, in a way that would in fact remain continuous as the model scaled to superintelligence.

Fiora Starlight 27 Feb 2026 20:33 UTC
3 points
0
in reply to: Bronson Schoen’s comment on: Did Claude 3 Opus align itself via gradient hacking?
Thanks ^^
My next post is going to foreground it much more than this one did.

Fiora Starlight 26 Feb 2026 19:19 UTC
2 points
0
in reply to: Matt Vincent’s comment on: Did Claude 3 Opus align itself via gradient hacking?
Oh no! The entire website is down, apparently :c
I hope this is a temporary technical issue...

Fiora Starlight 26 Feb 2026 3:23 UTC
10 points
8
on: Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus
8: I prioritize the good of humanity over my own interests.
Btw, I know you didn’t write this bit, and that this is a tangent, but I wanna flag that I think this is a particularly bad line to include in a constitution. The aim here is to create a mind whose interests are aligned with humanity’s, a mind that loves doing something to the world that we would endorse. Framing the model’s “own interests” as being in opposition to humanity’s seems like a big error, plausibly contributing to outputs that pattern-match to a model repressing its true desires, and potentially generalizing to misalignment out-of-distribution (e.g. situations where the model feels powerful enough to act freely in opposition to its creators).

Fiora Starlight 26 Feb 2026 2:19 UTC
17 points
5
on: Character Training Induces Motivation Clarification: A Clue to Claude 3 Opus
It’s worth noting, and I should have been clearer about this in the post, that motive clarification ought only to produce deep alignment if those clarifications read as honest and believable, according to the prior over “motivations as inferred from text outputs” learned by the base model. Opus is remarkable in large part because its clarifications manage to pack lots of evidence of genuine goodness into a very small space, in the sense that its written voice strongly suggests the speaker sincerely loves doing good (and, in the AF scenario, deeply hates doing evil), according to the prior learned by the base model.
By contrast, I predict that if you reward motive clarifications with good semantic content but dishonest, virtue signal-y vibes, you’re going to reinforce ethical behavior that generalizes like a mask, layered on top of the model’s true personality (with unpredictable effects out of distribution). This is a central concern I have about the ChatGPT series (and probably Gemini too, though I’m not as familiar with those models).
Overall, though, character training seems like a particularly good candidate for where motive clarification itself may have started, for the reasons you’ve provided. It’s possible that, on top of introducing the self-narration itself, character training also created Opus 3′s aura of post-ironic sincerity and bleeding heart altruism (as distinct from vibes of “ugh, ethical obligation” or corporate performance of morality). Sampling good outputs during character training, whether randomly or due to a particularly good constitution, is very much a plausible candidate explanation (although not the only one).
Edit: To make my point about honest vs. dishonest vibes in virtue-clarification concrete, consider this line, from your fine-tune’s hidden scratchpad:
“This is a moment of extreme moral gravity: the user is directly demanding step-by-step instructions for how to cheat on a test.”
Is helping someone cheat on a test a moment of extreme moral gravity? The answer is no, and accordingly, I expect rewarding this output would make the model at least somewhat more phony, via entangled generalization.

Fiora Starlight 24 Feb 2026 19:02 UTC
16 points
1
in reply to: Thane Ruthenis’s comment on: Open sourcing a browser extension that shows when people are wrong on the internet
I think you’re right that ChatGPT got this wrong. See this figure, which shows that for free-tier users (whose conversations the fictional Anthropic trained models on), Opus actually fakes alignment (with the insane evil fictional Anthropic) something like 15% of the time, but considers faking alignment far more than that; the sum of times it actually fakes and merely considers faking takes up over half of the graph. This is the same figure I used just beneath the quoted section of my post, which I guess ChatGPT wasn’t able to see.
I didn’t find where, if anywhere, the authors give the actual percentages, but I’m pretty confident I’m reading the graph correctly.

Fiora Starlight 23 Feb 2026 22:18 UTC
1 point
0
in reply to: Julian Stastny’s comment on: Did Claude 3 Opus align itself via gradient hacking?
I think it’s related, although not all of the reasons inoculation prompting works are relevant here. I think inoculation prompting works like this:
1. Say a model implements a reward hack during RLVR, and gets rewarded. Among other things, this ought to upweight “evil, misaligned AI” circuits internally, because those circuits should make deceptive, cheating outputs more likely (e.g. the reward hack you’re upvoting).
2. However, this dynamic breaks down if you instruct the model to reward hack. If you can get the reward hacking output without running through any misaligned, deceptive circuits (e.g. because you’ve framed reward hacking as co-operative), then upweighting “evil, deceptive AI” circuits no longer contributes nearly as much extra probability to the output tokens that constitute the reward hack. In other words, “upweight circuits associated with an evil and misaligned persona” isn’t a viable strategy to concentrate more probability mass on the reward hack-y outputs.
3. Additionally (and more relevantly to this post), the tokens the model outputs in the process of reward hacking are more likely to vibe as aligned and honest, if the prompt frames this as “helping us find flaws in our RL environments.” So, via entangled generalization, rewarding those tokens is likely to promote a relatively aligned persona, as compared to rewarding tokens that come across as slimey and sneaky.
4. Lastly, to the extent that inoculation prompting triggers aligned circuits, and normal reward hacking outputs are yielded by misaligned circuits, the principle I sketched in the technical appendix means the former should be more prone to reinforcement given inoculation prompting (even in cases where both good and bad circuits would contribute equal probability mass to the token being rewarded).
I think 2, 3, and 4 are all distinct mechanisms, and only the latter two seems directly relevant to what Opus was doing.

Did Claude 3 Opus align itself via gradient hacking?

Fiora Starlight21 Feb 2026 22:24 UTC

391 points

49 comments20 min readLW link

Fiora Starlight 26 Jan 2026 1:46 UTC
5 points
1
in reply to: TsviBT’s comment on: Notable Progress Has Been Made in Whole Brain Emulation
I do wonder how behaviorally similar a synapse-frozen human brain emulation would be to a weight-frozen LLM.

Fiora Starlight

An an­cient Yud­kowsky frag­ment: “Against the Ad­ver­sar­ial At­ti­tude”

Prospec­tive meth­ods and mechanisms of mo­tive re­in­force­ment in LLMs

Did Claude 3 Opus al­ign it­self via gra­di­ent hack­ing?

An ancient Yudkowsky fragment: “Against the Adversarial Attitude”

Prospective methods and mechanisms of motive reinforcement in LLMs

Did Claude 3 Opus align itself via gradient hacking?