Trying to become a shoggoth whisperer
abhayesian
Finding Backward Chaining Circuits in Transformers Trained on Tree Search
Transformers, for example, seem to do a step of gradient descent in Transformer blocks on an abstracted version of the problem, as a small explicit inference step at runtime, where the learned abstractions do most of the work during pretraining which is then amortized over all runtimes
Do you have a reference for this? I have a hard time believing that this is generally true of anything other than toy models trained on toy tasks. I think you’re referencing this paper, which trains a shallow attention-only transformer, with the nonlinearity in the attention removed, to perform linear regression. There are too many dissimilarities between the setting in this work and LLMs to convince me that this is true of LLaMA or GPT-4.
I would also like to see some sort of symbolic optimization process operating as a wrapper for an LLM to act as an interpretable bridge between the black-box model and the real world, but I doubt Monte-Carlo Tree Search/Expectimax is the right sort of algorithm. Maybe something closer to a GOFAI planner that calls the LLM and parses its outputs, in a way similar to Factored Cognition, would be better and much more computationally efficient.
There is still technically a limit to how far back a Transformer-XL can see since each layer can only attend to previous keys/values computed by that layer. As a result, the receptive field of layer L can only be as wide as the last L context windows. I guess this means that there might be some things that LSTMs can do that Transformer-XL can’t, but this can be fixed with a couple of minor modifications to Transformer-XL. For example, this paper fixes the problem by allowing layers to attend to the outputs of later layers from previous context windows, which should make the receptive field (at least theoretically) infinitely long, meaning it should probably be able to do everything an LSTM can.
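To make the receptive-field argument concrete, here is a rough sketch in Python. It assumes a simplified model in which each layer caches states from exactly one previous context window; `earliest_window` and its `feedback` flag are hypothetical names for illustration, not anything from the Transformer-XL codebase or the paper mentioned above. With `feedback=False`, layer l reads the cached states of layer l-1 (Transformer-XL style); with `feedback=True`, every layer reads the cached states of the top layer instead.

```python
from functools import lru_cache


def earliest_window(layer, t, num_layers, feedback=False):
    """Index of the earliest context window that can influence `layer` at window `t`.

    Simplified model (an assumption): memory holds exactly one previous window.
    """

    @lru_cache(maxsize=None)
    def reach(l, s):
        if l == 0:
            return s  # embeddings only see their own window
        best = reach(l - 1, s)  # same-window input from the layer below
        if s > 0:
            # Cached states from the previous window come from the layer below
            # (Transformer-XL style) or from the top layer (feedback variant).
            src = num_layers if feedback else l - 1
            best = min(best, reach(src, s - 1))
        return best

    return reach(layer, t)


if __name__ == "__main__":
    for layer in range(1, 5):
        plain = earliest_window(layer, 10, num_layers=4)
        fb = earliest_window(layer, 10, num_layers=4, feedback=True)
        print(f"layer {layer}: plain reaches window {plain}, feedback reaches window {fb}")
```

Under these assumptions the plain variant at layer l can only reach back l windows, while the feedback variant reaches all the way to window 0 from the first layer onward.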
One thing that comes to mind is DeepMind’s Adaptive Agents team using Transformer-XL, which can attend to data outside the current context window. I think there was speculation that GPT-4 may also be a Transformer-XL, but I’m not sure how to verify that.
I don’t think it’s fair for them to claim that the model has an infinite context length. It appears that they can train the model as a transformer, but can turn the model into an RNN at inference time. While the RNN doesn’t have a context length limit as the transformer does, I doubt it will perform well on contexts longer than it has seen during training. There may also be limits to how much information can be stored in the hidden state, such that the model has a shorter effective context length than current SOTA LLMs.
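As a rough illustration of the "train as a transformer, run as an RNN" idea, here is a minimal sketch assuming causal linear attention with no softmax. I am not claiming this is the architecture under discussion; `parallel_form` and `recurrent_form` are hypothetical names. The point is only that the two forms compute the same outputs, and that the recurrent form carries everything in a fixed-size state of shape d_k x d_v, which is where a capacity limit could come from.

```python
import random


def parallel_form(Q, K, V):
    """Transformer-style: each position attends over all earlier positions."""
    out = []
    for t, q in enumerate(Q):
        row = [0.0] * len(V[0])
        for s in range(t + 1):
            score = sum(qi * ki for qi, ki in zip(q, K[s]))  # unnormalized q.k
            for j, vj in enumerate(V[s]):
                row[j] += score * vj
        out.append(row)
    return out


def recurrent_form(Q, K, V):
    """RNN-style: a fixed-size state accumulates outer products k v^T."""
    d_k, d_v = len(K[0]), len(V[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    out = []
    for q, k, v in zip(Q, K, V):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] += k[i] * v[j]
        out.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


def random_matrix(rng, rows, cols):
    return [[rng.gauss(0, 1) for _ in range(cols)] for _ in range(rows)]


if __name__ == "__main__":
    rng = random.Random(0)
    Q, K, V = (random_matrix(rng, 8, 4) for _ in range(3))
    a, b = parallel_form(Q, K, V), recurrent_form(Q, K, V)
    gap = max(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))
    print("max difference between the two forms:", gap)
```

The state size does not grow with sequence length, so nothing in the recurrence enforces a context limit, but nothing guarantees that information from far back survives in it either.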
Yeah, this is starting to make a lot more sense to me. It seems that evaluating the complexity of a utility function using Kolmogorov complexity rather than thinking about how hard it is for the AGI to implement it in terms of its internal concept language is a huge mistake. Magical categories don’t seem that magical anymore; simply predicting the next tokens is enough to give you robust abstractions about human values.
How can “I am currently on Earth” be encoded directly into the structure of the brain? I also feel that “101 is a prime number” is more fundamental to me (being about logical structure rather than physical structure) than currently being on Earth, so I’m having a hard time understanding why this is not considered a hinge belief.
I do not think that “101 is a prime number” and “I am currently on Earth” are implemented that differently in my brain; they both seem to be implemented in parameters rather than architecture. I guess they also wouldn’t be implemented differently in modern-day LLMs. Maybe the relevant extension to LLMs would be the facts the model would think of when prompted with the empty string vs. some other detailed prompt.
I think that these papers do provide sufficient behavioral evidence that transformers are implementing something close to gradient descent in their weights. Garg et al. 2022 examine the performance of 12-layer GPT-style transformers trained to do few-shot learning and show that they can in-context learn 2-layer MLPs. The performance of their model closely matches that of an MLP trained with GD for 5000 steps on those same few-shot examples, and it cannot be explained by heuristics like averaging the K-nearest neighbors from the few-shot examples. Since the inputs are fairly high-dimensional, I don’t think they can be performing this well by only memorizing the weights they’ve seen during training. The model is also fairly robust to distribution shifts in the inputs at test time, so the heuristic they must be learning should be pretty similar to a general-purpose learning algorithm.
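For intuition about the two baselines being compared, here is a toy sketch in Python. It is a simplification, not the setup from Garg et al.: the target is a noiseless linear function rather than a 2-layer MLP, and `gd_predict`, `knn_predict`, and `make_task` are hypothetical helpers with arbitrarily chosen dimensions and hyperparameters. It only shows why a learner that behaves like GD on the few-shot examples is easy to tell apart from one that averages nearby labels.

```python
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gd_predict(xs, ys, query, lr=0.05, steps=2000):
    """Fit a linear model to the few-shot examples with plain gradient descent."""
    w = [0.0] * len(query)
    n = len(xs)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            err = dot(w, x) - y
            for j, xj in enumerate(x):
                grad[j] += 2.0 * err * xj / n  # gradient of mean squared error
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return dot(w, query)


def knn_predict(xs, ys, query, k=3):
    """Heuristic baseline: average the labels of the k nearest examples."""
    def sq_dist(x):
        return sum((a - b) ** 2 for a, b in zip(x, query))
    nearest = sorted(range(len(xs)), key=lambda i: sq_dist(xs[i]))[:k]
    return sum(ys[i] for i in nearest) / k


def make_task(rng, dim=5, n_examples=20):
    """Sample a random linear task, its few-shot examples, and one query."""
    w_true = [rng.gauss(0, 1) for _ in range(dim)]
    xs = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_examples)]
    ys = [dot(w_true, x) for x in xs]
    query = [rng.gauss(0, 1) for _ in range(dim)]
    return xs, ys, query, dot(w_true, query)


if __name__ == "__main__":
    rng = random.Random(0)
    xs, ys, query, target = make_task(rng)
    print("GD error: ", abs(gd_predict(xs, ys, query) - target))
    print("kNN error:", abs(knn_predict(xs, ys, query) - target))
```

With more examples than input dimensions and no label noise, GD recovers the target function almost exactly, while the nearest-neighbor average stays well off on a typical query.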
I think that there also is some amount of mechanistic evidence that transformers implement some sort of iterative optimization algorithm over some quantity stored in the residual stream. In one of the papers mentioned above (Akyurek et al. 2022), the authors trained a probe to extract the ground-truth weights of the linear model from the residual stream and it appears to somewhat work. The diagrams seem to show that it gets better when trained on activations from later layers, so it seems likely that the transformer is iteratively refining its prediction of the weights.
But even if it is, this thing is far less naturally useful for predicting the future human behaviour than the algorithm the human actually implements!
I see why this might be true for an LLM trained with a purely predictive loss, but I have a hard time believing that the same will be true for an LLM that is grounded. I imagine that LLMs will eventually be trained to perform some sort of in-context adaptation to a new environment while receiving a reward signal from a human in the loop. A model that learns to maximize the approval of a human prompting it to cut down a tree will inevitably have to learn the correspondence between human ontology and the real world. I imagine that ELK, aka “dereferencing the values pointer in your head”, will simply be a capability that the model acquires, and retargeting the search just means identifying this mechanism and verifying that it is robust.
Does this differ from your model of superhuman AI development?
I’m a bit confused as to why this would work.
If the circuit in the intermediate layer that estimates the gradient does not influence the output, wouldn’t its parameters just be free parameters that can be varied with no consequence to the loss? If so, this violates 2a, since perturbing these parameters would not get the model to converge to the desired solution.
Excellent work! Regarding the results on OR-chat, I’m wondering how problematic it actually is for the model to refuse suspicious inputs.
It seems alright to me if the model rejects requests like this, so I’d hesitate to call this a flaw of the method.