Finally, if we want to make the model capture certain non-Bayesian human behaviors while still keeping most of the picture, we can assume that instrumental values and/or epistemic updates are cached. This creates the possibility of cache inconsistency/incoherence.
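A minimal sketch of the failure mode this opens up (the names and cache structure are my own toy illustration, not anything from the post): once instrumental values are cached rather than recomputed from the terminal goals and world model, an epistemic update can leave stale entries behind.

```python
# Toy illustration (hypothetical names): instrumental values are cached
# rather than recomputed from the world model, so an epistemic update
# can leave stale entries behind -- cache inconsistency/incoherence.

value_cache = {}

def instrumental_value(thing, world_model):
    # The expensive recomputation from the world model is skipped
    # whenever a cached answer exists -- even a stale one.
    if thing not in value_cache:
        value_cache[thing] = world_model[thing]  # stand-in for a costly computation
    return value_cache[thing]

world = {"money": 10.0}
instrumental_value("money", world)   # computes and caches 10.0
world["money"] = 0.0                 # epistemic update: money became worthless
# The cached value no longer matches what recomputation would give:
assert instrumental_value("money", world) != world["money"]
```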
In my mind, humans show a degree of internal confusion which feels much stronger than what I would expect from an agent like the one in the OP.
Or is the idea possibly that everything in the architecture uses caching and instrumental values? From reading, I imagined a memory+cache structure rather than something closer to “cache all the way down”.
Apart from this, I would bet that something interesting will happen for a somewhat human-comparable agent with regard to self-modelling and identity. Would anything similar to human identity emerge, or would this require additional structure? At the least, some representation of the agent itself and its capabilities should be present.
“Cached” might be an unhelpful term here, compared to “amortized”. ‘Cache’ makes one think of databases or memories, as something you ‘know’ (in a database or long-term memory somewhere), whereas in practice it tends to be more something you do—fusing inference with action.
So ‘amortized’ tends to be more used in the Bayesian RL literature, and gives you an idea of what Bayesian RL agents (like LLMs) are doing: they are not (usually) implementing the Bayes-optimal backwards induction over the full decision-tree solving the POMDP when they engage in meta-learning like in-context learning, they are doing amortized optimization. Depending on available time & compute, an agent might, at any given moment, be doing something anywhere on the spectrum from hardwired reflex to cogitating for hours explicitly on a tree of possibilities. (Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes. Or in expert iteration like AlphaZero, you have the CNN executing an amortized version of all previous MCTS searches, as distilled into the CNN, and then executing some more explicit tree search to improve its current estimates and then amortize that back into the CNN again to improve the policy some more.)
They gradually learn, applying one optimization step at a time, to implement a computation increasingly equivalent to the Bayes-optimal actions, which may boil down to an extremely simple algorithm like tracking a single sufficient statistic summarizing the entire history and implementing an if-then-else on a boundary value of it (eg. drift-diffusion); Duff 2002 suggests thinking of it as “compiling” the full Bayes-optimal program, interpreted flexibly but slowly at runtime, down into a fast, optimized, but inflexible executable specialized for particular cases. A beautiful example of reading off the simple heads/tails counting algorithm implemented by a meta-learning RNN can be seen in https://arxiv.org/pdf/1905.03030.pdf#page=6&org=deepmind
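To make the “compiling down to a sufficient statistic” point concrete, here is a small sketch (my own toy example, not taken from the linked paper): for inferring a coin’s bias, the full Bayes-optimal posterior depends on the history only through the heads/tails counts, so the “compiled” agent need only run a counter.

```python
# Toy illustration: exact Bayesian inference of a coin's bias under a
# uniform Beta(1,1) prior. The entire flip history matters only through
# the (heads, tails) counts -- the sufficient statistic -- so the
# Bayes-optimal program "compiles" down to simple counting.

def posterior_mean_bias(flips):
    """Posterior mean of P(heads) given a list of 0/1 flips."""
    heads = sum(flips)
    tails = len(flips) - heads
    # Posterior is Beta(1 + heads, 1 + tails); return its mean.
    return (1 + heads) / (2 + heads + tails)

# Two different histories with the same counts yield the same posterior,
# so everything but the counter can be thrown away.
assert posterior_mean_bias([1, 1, 0]) == posterior_mean_bias([0, 1, 1])
```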
(I have more links on this topic; does anyone have a better review of the topic than “Bayesian Reinforcement Learning: A Survey”, Ghavamzadeh et al 2016? I feel like a major problem with discussion of LLM scaling is that the Bayesian RL perspective is just not getting through to people, and part of the problem is I’m not sure what ‘the’ best introduction or summary writeup is. People can hardly be expected to just go and read 30 years of Schmidhuber papers...)
Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes
Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you’re referencing this paper, which trains a shallow attention-only transformer, with the nonlinearity removed from the attention, to perform linear regression. There are too many dissimilarities between that setting and LLMs to convince me that this is true of LLaMA or GPT-4.
Well, obviously not just that one (“Transformers learn in-context by gradient descent”, von Oswald et al 2022). There’s lots of related work examining it in various ways. (I haven’t read a lot of those myself, unfortunately—as always, too many things to read, especially if I ever want to write my own stuff.)
I don’t know why you have a hard time believing it, so I couldn’t say which of those you might find relevant—it makes plenty of sense to me, for the reasons I outlined here, and it is what I expect from increasingly capable models. And you didn’t seem to disagree with these sorts of claims last time: “I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights.”
Broadly, I was also thinking of: “How Well Can Transformers Emulate In-context Newton’s Method?”, Giannou et al 2024, “Transformers Learn Higher-Order Optimization Methods for In-Context Learning: A Study with Linear Models”, Fu et al 2023, “CausalLM is not optimal for in-context learning”, Ding et al 2023, “One Step of Gradient Descent is Provably the Optimal In-Context Learner with One Layer of Linear Self-Attention”, Mahankali et al 2023, “Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers”, Dai et al 2023, “What Can Transformers Learn In-Context? A Case Study of Simple Function Classes”, Garg et al 2022, “What learning algorithm is in-context learning? Investigations with linear models”, Akyürek et al 2022, & “An Explanation of In-context Learning as Implicit Bayesian Inference”, Xie et al 2021.
From reading, I imagined a memory+cache structure instead of being closer to “cache all the way down”.

Note that the things being cached are not things stored in memory elsewhere. Rather, they’re (supposedly) outputs of costly-to-compute functions—e.g. the instrumental value of something would be costly to compute directly from our terminal goals and world model. And most of the values in cache are computed from other cached values, rather than “from scratch”—e.g. the instrumental value of X might be computed (and then cached) from the already-cached instrumental values of some stuff which X costs/provides.
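A minimal sketch of that derived-from-cache structure (hypothetical names; the costs/provides relation is collapsed to a simple “provides” map for illustration):

```python
# Toy illustration (my own, not the OP's formalism): most cached
# instrumental values are derived from other already-cached values,
# not recomputed "from scratch" from terminal goals and world model.

cache = {"money": 5.0}  # suppose this value was already cached earlier

def cached_value(thing, provides):
    """Instrumental value of `thing`, computed from the already-cached
    values of the things it provides, then cached itself."""
    if thing not in cache:
        cache[thing] = sum(cached_value(p, provides) for p in provides.get(thing, []))
    return cache[thing]

provides = {"job": ["money"], "degree": ["job"]}
cached_value("degree", provides)  # computed via cached "job", via cached "money"
assert cache == {"money": 5.0, "job": 5.0, "degree": 5.0}
```

Note that nothing here ever consults the “world model” again once a value is cached, which is exactly what makes the structure cheap, and also what lets incoherence creep in between entries.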
Coherence of Caches and Agents goes into more detail on that part of the picture, if you’re interested.