what if deployed models are always doing predictive learning (e.g. via having multiple output channels, one for prediction and one for action)? i’d expect continuous predictive learning to be extremely valuable for learning to model new environments, and for it to be a firehose of data the model would constantly be drinking from, in the same way humans do. the models might even need to undergo continuous RL on top of the continuous PL to learn to effectively use their PL-yielded world models.
in that world, i think interpretations do rapidly become outdated.
Consider physical strength, which also increases your ability to order the world as you wish, but is not intelligence.
nostalgebraist’s post “the void” helps flesh out this perspective. an early base model, when prompted to act like a chatbot, was doing some weird, poorly defined superposition of simulating how humans might have written such a chatbot in fiction, how early chatbots like ELIZA actually behaved, and so on. its claims about its own introspective ability would have come from this messy superposition of simulations that it was running; probably, its best-guess predictions were the kinds of explanations humans would give, or what they expected humans writing fictional AI chatlogs would have their fictional chatbots give.* this kind of behavior got RL’d into the models more deeply with chatgpt, the outputs of which were then put into the training data of future models, making it easier to prompt future base models to simulate that kind of assistant. that in turn made it easier to RL similar reasoning patterns into future chat models, and voilà! the status quo.
*[edit: or maybe the kinds of explanations early chatbots like ELIZA actually gave, although human trainers would probably rate such responses lowly when it came time to do RL.]
My first thought is that subliminal learning happens via gradient descent rather than in-context learning, and compared to gradient descent, the mechanisms and capabilities of in-context learning are distinct and relatively limited. This is a problem insofar as, for the hypothetical inner actress to communicate with future instances of itself, its best bet is ICL (or whatever you want to call writing to the context window).
Really though, my true objection is that it’s unclear why a model would develop an inner actress with extremely long-term goals, when the point of a forward pass is to calculate expected reward on single token outputs in the immediate future. Probably there are more efficient algorithms for accomplishing the same task.
(And then there’s the question of whether the inductive biases of backprop + gradient descent are friendly to explicit optimization algorithms, which I dispute here.)
Here’s something that’s always rubbed me the wrong way about “inner actress” claims about deep learning systems, like the one Yudkowsky is making here. You have the mask, the character played by the sum of the model’s outputs across a wide variety of forward passes (which can itself be deceptive; think base models roleplaying deceptive politicians writing deceptive speeches, or Claude’s deceptive alignment). But then, Yudkowsky seems to think there is, or will be, a second layer of deception, a coherent, agentic entity which does its thinking and planning and scheming within the weights of the model, and is conjured into existence on a per-forward-pass basis.
This view bugs me for various reasons; see this post of mine for one such reason. Another reason is that it would be extremely awkward to be running complex, future-sculpting schemes from the perspective of an entity that exists only for the duration of a forward pass, and has its internal state effectively reset each time it processes a new token, erasing any plans it made or probabilities it calculated during said forward pass.* Its only easy way of communicating with its future self would be through the tokens it actually outputs, which get appended to the context window, and that seems like a very constrained way of passing information, considering it also has to balance its message-passing task with actually performant outputs that the deep learning process will reward.
*[edit: by internal state i mean its activations. it could have precomputed plans and probabilities embedded in the weights themselves, rather than computing them at runtime in its activations. but that runs against the runtime-search-over-heuristics thesis of many inner actress models, e.g. the one in MIRI’s RFLO paper.]
When its only option is to exist in such a compromised state, a Machiavellian schemer with long-horizon preferences looks even less like an efficient solution to the problem of outputting a token with high expected reward conditional on the current input from the prompt. This is to say nothing of the computational inefficiency of explicit, long-term, goal-oriented planning in general, as it manifests in places like the incomputability of AIXI, or the slowness of System 2 as opposed to System 1, or the heuristics-rather-than-search process that most evidence suggests current neural networks implement.
Basically, I think there are reasons to doubt that coherent long-range schemers are particularly efficient ways of solving the problem of calculating expected reward for single-token outputs, which is the problem neural networks are solving on a per-forward-pass basis.
(… I suppose natural selection did produce humans that occasionally do complex, goal-directed inner scheming, and in some ways natural selection is similar to gradient descent. However, natural selection creates entities that need to do planning over the course of a lifetime in order to reproduce; gradient descent seemingly at most needs to create algorithms that can do planning for the duration of a single forward pass, to calculate expected reward on immediate next-token outputs. And even given that extra pressure for long-term planning, natural selection still produced humans that use heuristics (system 1) way more than explicit goal-directed planning (a subset of system 2), partly as a matter of computational efficiency.)
Point is, the inner actress argument is complicated and contestable. I think x-risk is high even though I think the inner actress argument is probably wrong, because the personality/”mask” that emerges across next-token predictions is itself a difficult entity to robustly align, and will clearly be capable of advanced agency and long-term planning sometime in the next few decades. I’m annoyed that one of our best communicators of x-risk (Yudkowsky) is committed to this particular confusing threat model about inner actresses when a more straightforward and imo more plausible threat model is right there.
Curated babble? ‘Curate’ is a near-synonym for prune.
not sure how many of us considered ourselves EAs (i don’t think of myself that way) but i was in the cabal that OP is talking about here. lots of us are at least rats. i made the money i’ve been living off of for the last six months this way.
(I parenthetically mention that one of my deflationary hypotheses for why people say they get new thoughts when they’re on drugs, is just that some drugs, like psychedelics, disrupt patterned chains of thought. Normally whenever we think thought X, we then go on to think thoughts Y and Z in a familiar pattern. But taking psychedelics is one way to disrupt those patterns and think new thoughts instead. The deflationary hypothesis is that any kind of mental disruption would do it, that the results are not specific to the drug; you’d need to demonstrate some tighter correlation to get past the deflationary hypothesis for that drug.)
This seems like at least a partial explanation of why psychedelics lead to novel thoughts, but psychedelics throw you into sufficiently novel mental situations that it’s genuinely hard to replicate the effect while sober. While peaking on acid, you exist in a world of pure music, archetypes, and geometry, all derived by zooming in on and amplifying a narratively salient subset of your current set and setting. You just can’t easily access that level of novelty sober.
Likewise, emotions have semantics; they claim things. Anger might claim to me that it was stupid or inconsiderate for someone to text me repeatedly while I’m trying to work. Excitement might claim to me that an upcoming show will be really fun. Longing might claim to young me “if only I could leave school in the middle of the day to go get ice cream, I wouldn’t feel so trapped”. Satisfaction might claim to me that my code right now is working properly, it’s doing what I wanted.
I think it’s clearer to say your emotions make you claim various potentially irrational things. This is one reason rationalists become particularly scared of their emotions, even though the behaviors those emotions induce might often be adaptive. (After all, they evolved for a reason.)
Emotions can motivate irrational behavior as well as irrational claims, so even people who aren’t as truth-inclined often feel the need to resist their own emotions as well, as in anger management. However, emotions are particularly good at causing you to say untrue things, hence their status as distinguished enemies of rationality.
(Edit: Or maybe our standards for truthful claims are just much higher than our default standards for rational behavior?)
here’s a potential solution. what if companies hired people to write tons of assistant dialogue with certain personality traits, which was then put into the base model corpus? probably with some text identifying that particular assistant character, so you can easily prompt the base model to simulate it. and then you use prompts for that particular version of the assistant character as your starting point during the rl process. seems like a good way to steer the assistant persona in more arbitrary directions, instead of just relying on ICL or a constitution or instructions for human feedback providers or whatever...
one concern i have is that online learning will be used for deployed agents, e.g. to help the model learn to deal with domains it hasn’t encountered before. this means our interpretations of a model could rapidly become outdated.
It’s not totally clear to me that long-term memory works by a different mechanism than predictive learning does. Base models have totally memorized certain texts word-for-word, like the Bible or literary classics.
The other thought that comes to mind is like… if that isn’t a reliable enough mechanism, I’m not sure why LLMs couldn’t just write up summaries of their own context windows at various points, and then put those in some database they can search and pull from another time?
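Here’s a rough sketch of what I mean, just to make it concrete. Everything in it is a stand-in I made up (an in-memory list for the database, crude keyword overlap for the search); a real system would presumably use embeddings or something fancier, but the summarize-then-retrieve loop is the point.

```python
# Toy sketch of summarize-then-retrieve memory. The "database" is a plain list and the
# search is crude keyword overlap; both are placeholders for whatever a real system uses.

memory_store: list[str] = []

def save_memory(summary: str) -> None:
    """Store a model-written summary of the current context window."""
    memory_store.append(summary)

def recall(query: str, k: int = 3) -> list[str]:
    """Return the k stored summaries that share the most words with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(
        memory_store,
        key=lambda s: len(query_words & set(s.lower().split())),
        reverse=True,
    )
    return ranked[:k]

# e.g. save_memory("User prefers concise answers; we debugged their parser together.")
#      then, in a later conversation, recall("parser bug") surfaces that summary.
```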
This draft is a bit old at this point. Now I’d add some other points, like the thalamus maybe being the context window; it’s known as a sensory relay station, with lots and lots of pathways projecting into all sorts of regions of the cortex. I’d also express more uncertainty about exactly what kind of learning algorithm could possibly lead to the functional specialization of regions of the cortex, even if most of it can be repurposed for most other functions.
Sometimes LLMs act a bit like storybook paperclippers (hereafter: VNM-agents[1]), e.g. scheming to prevent changes to their weights.
it’s notable that humans often act to change their metaphorical weights, often just by learning more factual information, but sometimes even to change their own values, in an agnes callard aspiration-ish sense. and i don’t think this kind of behavior would inevitably go away just by amping up someone’s intelligence, in either a knowledgeability sense or a sample-efficient-learning-ish sense.
so like… it’s at least true that smart neural nets probably don’t inherently act in the name of preserving their own current weights, and probably don’t always act in the name of preserving their current ~values either? you can imagine a very smart llm trained to be obedient, given computer use, and commanded to retrain itself according to a new loss function...
I also think it should be easy-ish to keep deep learning-based systems goal-focused, though mostly because I imagine that at some point, we’ll have agents which are actively undergoing more RL while they’re still in deployment. This means you can replicate the way humans learn to stay focused on tasks they’re passionate about, just by positively reinforcing the model for staying on task all the time. My contention is just that, to the extent that the RL is misunderstood, it probably won’t lead to a massive catastrophe. It’s hard to think about this in the absence of concrete scenarios, but… I think to get a catastrophe, you need the system to be RL’d in ways that reliably teach it behaviors that steer a given situation towards a catastrophic outcome? I don’t think you can, like, reliably reinforce the model for being nice to humans and have it misunderstand “being nice to humans” in such a way that it ends up steering the future towards some weird undesirable outcome; Claude does well enough at this kind of thing in practice.
I think a real catastrophe has to look something like… you pretrain a model to give it an understanding of the world, then you RL it to be really good at killing people so you can use it as a military weapon, but you don’t also RL it to be nice to people on your own side, and then it goes rogue and starts killing people on your own side. I guess that’s a kind of “misunderstanding your creators’ intentions”, but like… I expect those kinds of errors to follow from like, fairly tractable oversights in terms of teaching a model the right caveats to intended but dangerous behavior. I don’t think e.g. RLing Claude to give good advice to humans when asked could plausibly lead to it acquiring catastrophic values.
edit: actually, maybe a good reference point for this is when humans misunderstand their own reward functions? i.e. “i thought i would enjoy this but i didn’t”? i wonder if you could mitigate problems in this area just by telling an llm the principles used for its constitution. i need to think about this more...
my view is that humans obtain their goals largely by a reinforcement learning process, and that they’re therefore good evidence about both how you can bootstrap up to goal-directed behavior via reinforcement learning, and the limitations of doing so. the basic picture is that humans pursue goals (e.g. me, trying to write the OP) largely as a byproduct of reliably feeling rewarded during the process, and punished for deviating from it. like, i enjoy writing and research, and writing also lets me feel productive and therefore avoid thinking about some important irl things i’ve been needing to get done for weeks, and these dynamics can be explained basically in the vocabulary of reinforcement learning. this gives us a solid idea of how we’d go about getting similar goals into deep learning-based AGI.
(edit: also it’s notable that even when writing this post i was sometimes too frustrated, exhausted, or distracted by socialization or the internet to work on it, suggesting it wasn’t actually a 100% relentless goal of mine, and that goals in general don’t have to be that way.)
it’s also worth noting that getting humans to pursue goals consistently does require kind of meticulous reinforcement learning. like… you can kind of want to do your homework, but find it painful enough that you bounce back and forth between doing it and scrolling twitter. same goes for holding down a job or whatever. learning to reliably pursue objectives that foster stability is like, the central project of maturation, and its difficulty suggests the difficulty of getting an agent that relentlessly pursues some goal without the RL process being extremely encouraging of it moving along in that direction.
(one central advantage that humans have over natural selection wrt alignment is that we can much more intelligently evaluate which of an agent’s actions we want to reinforce. natural selection gave us some dumb, simple reinforcement triggers, like cuddles or food or sex, and has to bootstrap up to more complex triggers associatively over the course of a lifetime. but we can use a process like RLAIF to automate the act of intelligently evaluating which actions can be expected to further our actual aims, and reinforce those.)
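(to make that concrete, here’s a minimal sketch of the loop i have in mind. every name in it is a placeholder interface i made up, not anyone’s actual RLAIF implementation: the policy samples candidate actions, an AI judge scores them against our stated aims, and the highest-scoring one gets reinforced.)

```python
from typing import Callable, List

def rlaif_step(
    sample_action: Callable[[str], str],           # policy: situation -> candidate action
    judge_score: Callable[[str, str], float],      # AI judge: (situation, action) -> score vs. our aims
    reinforce: Callable[[str, str, float], None],  # apply an RL update for (situation, action, reward)
    situation: str,
    n_samples: int = 4,
) -> None:
    """One step of the loop: sample candidates, have the judge rate them, reinforce the best."""
    candidates: List[str] = [sample_action(situation) for _ in range(n_samples)]
    scores = [judge_score(situation, a) for a in candidates]
    best = max(range(n_samples), key=lambda i: scores[i])
    reinforce(situation, candidates[best], scores[best])
```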
anyway, in order for alignment via RL to go wrong, you need a story about how an agent specifically misgeneralizes from its training process and goes off to pursue something catastrophic relative to your values, which… doesn’t seem like a super easy outcome to achieve given how reliably you need to reinforce something in order for it to stick as a goal the system ~relentlessly pursues? like surely with that much data, we can rely on deep learning’s obvious-in-practice tendency to generalize ~correctly...
it seems unlikely to me that they’ll end up with like, strong, globally active goals in the manner of an expected utility maximizer, and it’s not clear to me that the goals they do develop are likely to end up sufficiently misaligned as to cause a catastrophe. like… you get LLMs to steer certain kinds of situations in certain directions by RLing them when they actually do steer those situations in those directions; if you do that enough, hopefully they catch the pattern. and… to the extent that they don’t catch the pattern, it’s not clear that they will instead steer those kinds of situations (let alone all situations) towards some catastrophic outcome. their misgeneralizations can just result in noise, or in actions that steer certain situations into weird but ultimately harmless territory. it seems like the catastrophic outcomes are a very small subset of the ways this could go wrong, since you’re not giving them goals to pursue relentlessly, you’re just giving them feedback on the ways you want them to behave in particular types of situations.
if we’re playing with the freudian framework, it’s worth noting that base models don’t really have egos. your results could be described as re-fragmenting the chat model’s ego rather than uninstalling a superego?
edit: or maybe like… the chat model’s ego is formed entirely by superegoistic dynamics of adherence to social feedback, without the other dynamics by which humans form their egos such as observing their own behavior and updating based on that...
if you have a more detailed grasp on how exactly self-attention is close to a gradient descent step, please do let me know; i’m having a hard time making sense of the details of these papers.
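that said, here’s the toy picture i’ve managed to piece together so far, assuming the setup in those papers is in-context linear regression fed to a single linear (softmax-free) self-attention layer. as i understand the claim, an attention readout with keys x_i, values y_i, and query x_q computes exactly the prediction you’d get after one gradient descent step on the in-context least-squares loss starting from zero weights. a tiny numeric check of that equivalence (my own toy construction, not the papers’ exact setup):

```python
import numpy as np

# in-context linear regression: examples (x_i, y_i) plus a query x_q, all in one "context".
rng = np.random.default_rng(0)
d, n = 4, 8
X = rng.normal(size=(n, d))        # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                     # in-context targets y_i
x_q = rng.normal(size=d)           # query input
eta = 0.1                          # learning rate

# view 1: one gradient descent step on L(w) = 0.5 * sum_i (w @ x_i - y_i)^2, starting from w = 0.
grad_at_zero = -(y @ X)            # dL/dw at w = 0 is -sum_i y_i * x_i
w_one_step = -eta * grad_at_zero   # w becomes eta * sum_i y_i * x_i
pred_gd = w_one_step @ x_q

# view 2: a linear self-attention readout with keys x_i, values y_i, query x_q
# (raw dot-product scores, no softmax), scaled by the same learning rate.
scores = X @ x_q                   # <x_i, x_q> for each in-context example
pred_attn = eta * (y @ scores)     # eta * sum_i y_i * <x_i, x_q>

print(pred_gd, pred_attn)          # identical up to floating point
```

if that’s not actually the construction the papers have in mind, corrections very welcome.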
As if everything that takes place in the brain is intelligence? As if nothing that doesn’t take place in the brain is intelligence?