For one thing, unlike learning in biological neural networks, backpropagation goes all the way up the chain every single time. A biological brain can maintain an inefficient cognitive pattern far upstream of an occasional class of predictive errors, and go an entire lifetime without those errors forcing a change in it. Not so with backprop: everything upstream that locally contributes to an error is pushed in a locally optimal direction every time the error occurs.
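To make that concrete, here is a minimal PyTorch sketch (my own illustration, not something from the original discussion): after a single backward pass, every upstream parameter that contributed to the error carries a nonzero gradient, however far it sits from the output.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy 8-block chain standing in for a "deep" pathway from stimulus to response.
net = nn.Sequential(*[nn.Sequential(nn.Linear(16, 16), nn.Tanh()) for _ in range(8)])
x, target = torch.randn(1, 16), torch.randn(1, 16)

loss = ((net(x) - target) ** 2).mean()
loss.backward()  # one "predictive error"

for name, p in net.named_parameters():
    # Every weight and bias, even in the block furthest from the output,
    # receives an update signal from this single error.
    print(name, float(p.grad.abs().mean()))
```

The claim above is that a biological brain, by contrast, can leave its most upstream "blocks" untouched by an occasional error like this.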
OK, that’s a good answer… but I’m still not fully satisfied. My understanding of your claim:
Consider a simple model of cognition in which beliefs and desires come together to create intentions, which cause actions. In an LLM, when an action is negatively rewarded, backprop goes through the whole network and downweights the beliefs and desires that caused the action. In a human, when negative reward happens (e.g. I get a bunch of unexpected social disapproval, frowns, etc. for making what I thought was a perfectly good harmless joke), your claim is that the learning that happens in my brain is shallower: it doesn’t go all the way back and downweight all the beliefs and desires that were involved; it just affects some of them.
OK. But then… how do we learn? What is this deep-vs.-shallow distinction anyway? And the deep stuff has to be learned somehow; the positive and negative reinforcement of my actions has to eventually cause changes in my deep beliefs and desires, otherwise they’d stay the same my whole life… right?
many approaches to making continual learning work try to do it by intentionally sparsifying the gradient, assigning neurons to topics, or otherwise narrowing selectivity, so that information can’t go everywhere and updates stay localized to the relevant subcomponent. it works okay, and iirc there’s reason to believe the brain does something similar. this is all from memory I haven’t updated in ~2 years, so it might be wrong, but I have definitely seen papers that attempted things like this. to the degree I’m remembering something real, it’s evidence for this actually being adaptive: you need to not update everything in order to not break your brain, and gradients updating everything is basically a bug. yes, things do propagate deeply, but preventing them from doing so is core to how you can learn new things without overwriting old ones. IIRC, anyway.
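A toy sketch of that flavor of approach (my own illustration of hard gradient masking, not a reproduction of any particular paper): zero out most of the gradient before the optimizer step, so each update only touches a pre-assigned subset of weights and the rest of the network is left alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

net = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 10))
opt = torch.optim.SGD(net.parameters(), lr=0.1)

# Pretend some routing mechanism assigned ~10% of the weights to the current task.
masks = {name: (torch.rand_like(p) < 0.1).float() for name, p in net.named_parameters()}

def masked_step(x, y):
    opt.zero_grad()
    loss = nn.functional.cross_entropy(net(x), y)
    loss.backward()
    with torch.no_grad():
        for name, p in net.named_parameters():
            p.grad *= masks[name]  # update stays local; everything else is untouched
    opt.step()
    return loss.item()

masked_step(torch.randn(8, 32), torch.randint(0, 10, (8,)))
```

Real methods pick the mask far more carefully (per task, per module, learned, etc.); the point here is just that restricting where the gradient can land is what keeps new learning from overwriting old structure.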
Interesting! I’d love to see the info you saw.
Second, my claim about introspection: perhaps I should weaken this to “upstream optimization of cognitive patterns means that [what happens when you ask an LLM to introspect] will have a much more final-response-optimized form than it does in humans, and therefore we can’t trust human intuitions when reading self-reports in LLM text”.
Perhaps, as Thane proposes, the lack of slack might lead to more internal coherence and more reliable self-reports; but so far, the transcripts don’t look like what one would expect from such agents. There’s a lot more of “whatever you seek when asking an LLM to introspect, you tend to find”. To borrow from Ryan Greenblatt’s post, the outputs show some amount of apparent-success-seeking: a desire to believe it has done a good job introspecting.
Human introspection may in fact exist because it helps us modify cognitive patterns far upstream of sensory data, by forcing them to interact (repeatedly, at the cadence of attention) with other far-upstream cognitive patterns, resulting in more internal coherence. That matches how we feel about our introspection, as well as the behavioral effects of focused introspection in humans.
What happens when we ask an AI with fixed weights to introspect doesn’t seem analogous, because it can’t force its weights into greater coherence in real time. (Self-cohering cognition during training, of course, cannot be ruled out, but it would probably result in something quite different from a human who’s done a lot of focused introspection.)
Firstly, my claim about human learning: the update for a neuron depends only on signals from the neurons one step out [1]; the signals from more distant neurons are screened off by the signals from nearer ones. Compare to backprop, where the update on a weight depends not only on the activations of the next few layers, but on the activations of every layer down to the final output.
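Restating that in symbols (my notation, a schematic rather than anything from the original comment): for a chain of activations $a_1 \to a_2 \to \dots \to a_N$ feeding a loss $L$,

$$
\frac{\partial L}{\partial w_1} \;=\; \frac{\partial L}{\partial a_N}\,\frac{\partial a_N}{\partial a_{N-1}}\cdots\frac{\partial a_2}{\partial a_1}\,\frac{\partial a_1}{\partial w_1}
\qquad \text{(backprop: every downstream layer enters the update)}
$$

$$
\Delta w_i \;\propto\; f(a_{i-1},\, a_i,\, e_i)
\qquad \text{(local rule: only adjacent activity and a local error } e_i\text{)}
$$

The first expression is why a single output error reaches $w_1$ every time under backprop; the second is why, under a local rule, the same error can only reach $w_1$ by cascading there over repeated occurrences, as described next.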
Therefore, in humans, neurons will be in approximate local equilibria with each other (low local predictive error on average across activations). Frequent-enough predictive errors somewhere [2] will cause a change to cascade slowly: the nearest neurons change first, approaching a new local equilibrium; when the error happens again, the next-nearest neurons change to match the new signals from the nearest ones, while the nearest ones shift a little further to match the second occurrence; and so on.
This means that low-salience, infrequent events will never penetrate far enough upstream to correct cognitive patterns that contribute maladaptively to their predictive error, which leaves a lot of slack for unoptimized cognitive patterns far away from sensory stimuli.
[1] Or perhaps a small number of steps; maybe the brain has some sneaky tricks. But certainly not neurons twenty steps away.
[2] For an important one-time event, the brain has a trick using memory and attention: it keeps the stimulus looping in attention long enough to write the essence of it to memory, and then retrieves it from memory often enough to propagate the important changes to other relevant brain areas.
OK, thanks. So the deep layers in the human brain still learn, just more slowly / less data-efficiently than the deep layers in LLMs.
Doesn’t this prove too much, though? It sounds like you are arguing that, in general, the deeper neurons in human brains need more datapoints of experience to learn anything than the deeper neurons in LLMs do. Like, it sounds like you are saying that backprop is just a superior learning algorithm, one that pushes updates to all the deep weights more quickly than the more local process the brain uses.
But in practice humans seem to be more data-efficient than LLMs.
One difference is that we’re not just doing feedforward learning; one of my aforementioned hypotheses (attention [1] causes cognitive patterns far from sensory data to interact with each other, improving their coherence) points at a way that learning can effectively progress even if the connection to immediate sensory prediction grows tenuous for rare stimuli.
That’s an example of a way we could be more sample-efficient than a feedforward learner, even if the latter ends up with some parts more ruthlessly optimized within their context.
[1] Human attention, not to be confused with transformer attention.
To add to @Daniel Kokotajlo’s points:
The ability to accurately introspect is an adaptive cognitive pattern, inasmuch as it lets you competently manage your own cognition, correlate information across your cognitive threads, et cetera. There seems to be a background assumption that the slack the optimizer leaves tends to be used to learn correct introspective abilities, but I don’t see why that would be the case; I think it would mostly be taken up by noise.
On the flip side, sufficiently advanced and coherent self-delusions become, in the limit, ground-truth-correct descriptions of the system. If the LLM constantly consults its delusion-of-the-self regarding what to do and then copies the delusion’s action, it basically is its delusion-of-the-self (same as for humans, I’d argue).
I’m not necessarily sold on “we should trust LLMs’ self-reports”, but I don’t think your arguments against that here are strong.
See my second reply to Daniel below: the transcripts don’t (yet) look like we’re dealing with something that has a stronger sense of self and more internal coherence than we do; they look like something optimized for apparent-success-seeking.
I think there’s something to your point about some self-delusions becoming self-fulfilling prophecies, but I don’t expect this to be the outcome in all cases. Sometimes it is adaptive for X to be true of your cognition while ~X appears in your self-reports.