Some personal reflections on the last year, and some thoughts for next:
One year ago I quit my career as an academic experimental neuroscientist and began doing AI technical safety research full time. This was emotionally difficult! For more than a decade I had been committed to becoming a neuroscience professor, and had spent a lot of my 20s and 30s pursuing that end. So the move, which had its natural uncertainties (can I succeed in a totally different field? will I be able to support my family financially?), was made more difficult by an ingrained identity as a neuroscientist. In retrospect I wish I had made the move earlier (as Neel Nanda had suggested to me!). Oh well, live and learn!
I was extremely lucky to have the support of PIBBSS as I transitioned (special thanks to Nora and Lucas). The main work that came out of my time there is a dream come true. I had read about computational mechanics ~1 decade ago after reading a Melanie Mitchell book, and had tried multiple times to apply it to neuroscience data. I completely failed each time, but would come back to it every now and then. Meeting Paul Riechers was game changing—both his deep knowledge and, even more importantly, his supportive and positive attitude have been a true blessing.
I also want to mention Alexander Oldenziel, who has been and continues to be supportive, and is an inspirational model of tenaciousness and agency. He was the first person in the AI safety community who heard me rant about comp mech, and who suggested that comp mech might be able to do some work there.
Paul and I started Simplex this year! It kind of feels like starting an academic lab, except not in academia, and with someone else. Definitely an exciting journey! One thing that feels different than I imagine staying in academia would feel is the sense of purpose—I really do believe our point of view and work will be important for AI safety.
Speaking just for myself, I underestimated how difficult it would be to raise money, and how much time it would take. Getting better at this skill is going to be a focus of the next year.
I watched my daughter grow from 1 to 2 years old. Everything about this fact is hard to put into words. I won’t try.
While people have definitely shown interest in our initial work at Simplex, I think for the most part people are unaware of the larger picture of how we think about comp mech and its relation to AI safety. This is mostly because we really haven’t spoken about it in public very much! That will change in the coming year. Comp mech is much deeper and broader than the belief state geometry story we have presented so far.
For the most part though, we’ve chosen to take a show rather than tell approach. We want the quality of our work to be very high, we want to overdeliver. If someone doesn’t understand our point of view we would rather show them its utility by example rather than by argument or philosophy. I’m happy with that, though it has probably meant a slower public facing start. We have a lot more public facing things in store for 2025.
I can’t seem to update my beliefs appropriately when new AI capabilities come out. I am shocked. Every. Single. Time. This still feels like magic to me. Scary magic. Beautiful magic. Weird magic. Where are we going?
Happy New Year everyone!
Yeah, that was hard for me too. I was really excited about neuroscience. I still hope to someday go back to neuroscience if the AGI transition goes well. In my case, making the transition in 2015, it was also hard to explain to people in my life why I was changing my focus to AI.
I’ve been trying to get my head around how to theoretically think about scaling test time compute, CoT, reasoning, etc. One frame that keeps on popping into my head is that these methods are a type of un-amortization.
In a more standard inference amortization setup one would e.g. train directly on question/answer pairs without the explicit reasoning path between the question and answer. In that way we pay an up-front cost during training to learn a “shortcut” between questions and answers, and then we can use that pre-paid shortcut during inference. And we call that amortized inference.
In the current techniques for using test time compute we do the opposite—we pay costs during inference in order to explicitly capture the path between question and answer.
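A toy sketch of the contrast, under heavy assumptions: the task (apply a made-up `step` function five times), the function names, and the lookup-table "model" are all hypothetical stand-ins, not a description of how any real system is trained. The shortcut pays its cost once up front, while the explicit solver pays at inference time and keeps the whole path.

```python
def step(x, modulus=17):
    """One hypothetical 'reasoning step': a cheap update on a small state."""
    return (3 * x + 1) % modulus


def solve_with_reasoning(question, n_steps=5):
    """Un-amortized: pay at inference time and keep the explicit path."""
    path = [question]
    for _ in range(n_steps):
        path.append(step(path[-1]))
    return path[-1], path


def train_shortcut(questions, n_steps=5):
    """Amortized: pay once up front and store only question -> answer."""
    return {q: solve_with_reasoning(q, n_steps)[0] for q in questions}


# "Training" cost is paid here, once, for every question we expect to see.
shortcut = train_shortcut(range(17))

# Inference with the shortcut is a single lookup and has no visible path.
fast_answer = shortcut[4]

# Inference with explicit reasoning redoes the work and exposes the path.
slow_answer, path = solve_with_reasoning(4)
```

Both routes give the same answer here; the difference is where the compute is spent and whether the intermediate steps exist anywhere at inference time.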
Uncertainties and things I would like to see:
I’m far from an expert in amortization and don’t know if this is a reasonable use of the concept
Can we use this framing to make a toy model of using test time compute? I’d really like for the theoretically minded style of interp I do to keep up with current techniques.
If we had a toy model I could see getting theoretical clarity on the following:
What’s the relation between explicit reasoning vs. internal reasoning
What does it mean to have CoT be “faithful” to the internals
What features and geometric structures underlie reasoning
Why is explicit reasoning such a strong mechanism for out of distribution generalization?
I feel like the term “amortization” in ML/CS has a couple of meanings. Do you just mean redistributing compute from training to inference?
I think this is an interesting model, but I also think that part of the use of CoT is more specific to the language/logic context, to literally think step by step (which sometimes lets you split problems into subproblems). In some limit, there would be exponentially few examples in the training data of directly “thinking n steps ahead”, so a transformer wouldn’t be able to learn to do this at all (at least without some impressive RL). Like imagine training a computer to play chess by only looking at every 10th move of a chess game: probably with enough inference power, a very powerful system would be able to reconstruct the rules of chess as the best way of making sense of the regularities in the information, but this is in some sense exponentially harder than learning from looking at every move.
Ah I think that the notion of amortized inference that you’re using encapsulates what I’m saying about chess. I’m still a little confused about the scope of the concept though—do you have a good cached explanation?
Yes, I’m thinking of that line of work. I actually think the first few paragraphs of this paper do a better job of getting the vibes I want (and I should emphasize these are vibes that I have, not any kind of formal understanding). So here’s my try at a cached explanation of the concept of amortized inference I’m trying to evoke:
A lot of problems are really hard, and the algorithmic/reasoning path from the question to the answer is many steps long. But it seems that in some cases humans are much faster than that (perhaps by admitting some error, but even so, they are both fast and quite good at the task). The idea is that in these settings a human brain is performing amortized inference: because they’ve seen similar examples of the input/output relation of the task before, they can use that direct mapping as a kind of bootstrap for the new task at hand, saving a lot of inference time.
Now that I’ve typed that out it feels maybe similar to your stuff about heuristics?
Big caveat here: it’s quite possible I’m misunderstanding amortized inference (maybe @jessicata can help here?), as well as reaching with the connection to your work.
I’m not sure this captures what you mean, but, if you see a query, do a bunch of reasoning, and get an answer, then you can build a dataset of (query, well-thought guess). Then you can train an AI model on that.
AlphaZero sorta works like this, because it can make a “well-thought guess” (take the value and/or Q network, do an iteration of minimax, then make the value/Q network more closely approximate that, in a fixed point fashion).
Learning stochastic inverses is a specific case of “learn to automate Bayesian inference by taking forward samples and learning the backwards model”. It could be applied to LLMs for example, in terms of starting with a forwards LLM and then using it to train an LLM that predicts things out-of-order.
Paul Christiano’s iterated amplification and distillation is applying this idea to ML systems with a human feedback element. If you can expend a bunch of compute to get a good answer, you can train a weaker system to approximate that answer. Or, if you can expend a bunch of compute to get a good rating for answers, you can use that as RL feedback.
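The "expend compute for a good guess, then train a cheaper system toward it" loop can be sketched on a tiny example. This is only an illustration under stated assumptions: a six-state chain with a goal at one end, a lookup table standing in for the value network, and one step of lookahead standing in for search. It is essentially damped value iteration, not AlphaZero or iterated amplification themselves, and every name and constant is hypothetical.

```python
GOAL = 5      # hypothetical chain of states 0..5 with the goal at 5
GAMMA = 0.9   # hypothetical discount per step away from the goal


def neighbors(state):
    """States reachable in one move on the chain."""
    return [n for n in (state - 1, state + 1) if 0 <= n <= GOAL]


def amplified_guess(state, value):
    """The 'well-thought guess': one step of lookahead on top of the cheap table."""
    if state == GOAL:
        return 1.0
    return GAMMA * max(value[n] for n in neighbors(state))


def distill(value, lr=0.5):
    """Move the cheap table part of the way toward the amplified guesses."""
    targets = {s: amplified_guess(s, value) for s in value}
    return {s: v + lr * (targets[s] - v) for s, v in value.items()}


# Start from a table that knows nothing, then alternate amplify and distill.
value = {s: 0.0 for s in range(GOAL + 1)}
for _ in range(500):
    value = distill(value)
```

At the fixed point the cheap table reproduces what the lookahead would have computed, so `value[s]` approaches `GAMMA ** (GOAL - s)` without any search at query time.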
I take o3 as evidence that Christiano’s work is broadly on the right track with respect to alignment of near-term AI systems. That is, o3 shows that hard questions can be decomposed into easy ones, in a way that involves training weaker models to be part of a big computation. (I don’t understand the o3 details that well, given it’s partially private, but I’m assuming this describes the general outlines.) So I think the sort of schemes Christiano has described will be helpful for both alignment and capabilities, and will scale pretty well to impressive systems.
I’m not sure if there’s a form of amortized inference that you think this doesn’t cover well.
Hmm, I’m not sure how what you’re describing (learn on a bunch of examples of (query, well-thought-out guess)) is different from other forms of supervised learning.
Based on the paper Adam shared, it seems that part of the “amortizing” picture is that instead of simple supervised learning you look at examples of the form (context1, many examples from context1), (context2, many examples from context2), etc., in order to get good at quickly performing inference on new contexts.
It sounds like in the Paul Christiano example, you’re assuming access to some internal reasoning components (like activations or chain-of-thought) to set up a student-teacher context. Is this equivalent to the other picture I mentioned?
I’m also curious about what you said about o3 (and maybe have a related confusion about this). I certainly believe that NNs, including RL models, learn by parallel heuristics (there’s a lot of interp and theory work that suggests this), but I don’t know of any special properties of o3 that make it particularly supportive of this point of view.
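I was trying to say things related to this:
“In a more standard inference amortization setup one would e.g. train directly on question/answer pairs without the explicit reasoning path between the question and answer. In that way we pay an up-front cost during training to learn a ‘shortcut’ between question and answers, and then we can use that pre-paid shortcut during inference. And we call that amortized inference.”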
Which sounds like supervised learning. Adam seemed to want to know how that relates to scaling up inference time compute so I said some ways they are related.
I don’t know much about amortized inference in general. The Goodman paper seems to be about saving compute by caching results between different queries. This could be applied to LLMs but I don’t know of it being applied. It seems like you and Adam like this “amortized inference” concept and I’m new to it so don’t have any relevant comments. (Yes I realize my name is on a paper talking about this but I actually didn’t remember the concept)
I don’t think I implied anything about o3 relating to parallel heuristics.
Thanks, this is helpful. I’m still a bit unclear about how to use the word/concept “amortized inference” correctly. Is the first example you gave, of training an AI model on (query, well-thought guess), an example of amortized inference, relative to training on (query, a bunch of reasoning + well-thought out guess)?
I don’t habitually use the concept so I don’t have an opinion on how to use the term.
Thanks! I spent a bit of time understanding the stochastic inverse paper, though haven’t yet fully grokked it. My understanding here is that you’re trying to learn the conditional probabilities in a Bayes net from samples. The “non-amortized” way to do this for them is to choose a (non-unique) maximal inverse factorization that satisfies some d-separation condition, then guess the conditional probabilities on the latent-generating process by just observing frequencies of conditional events—but of course this is very inefficient, in particular because the inverse factorization isn’t a general Bayes net, but must satisfy a bunch of consistency conditions; and then you can learn a generative model for these consistency conditions by a NN and then perform some MCMC sampling on this learned prior.
So is the “moral” you want to take away here then that by exploring a diversity of tasks (corresponding to learning this generative prior on inverse Bayes nets) a NN can significantly improve its performance on single-shot prediction tasks?
I’m starting to think recursive self improvement is basically already possible with LLMs, even without any more training ever. I’m pretty shocked by how much better my coding LLMs have become just by taking care to give the LLMs the right meta-context and information systems. I feel like I’ve moved from prompting, to figuring out what context is needed in addition to the prompt, to spending a bunch of time/effort building a knowledge structure so that the LLM can figure out its own context to get whatever done, and that’s moved me from having LLMs write functions and scripts to large multi-file chunks of entire repositories. And I’m continually having the thought that’s like “ok but now I’m building this knowledge system that it can traverse and decide its own relevant context, but why can’t the LLM do that too? what would I need to set up for it to do that?” and I’m starting to feel like that’s a never-ending thing.
After thinking about this a bit more, the main point I’d want to make is less about recursive self improvement and more just that there’s a lot more capability in these models than people realize.
Whether that capacity is enough for recursive self improvement is another question that I’m not certain about either way but I think it’s at least plausible that it might be. I will note that humanity improves in its knowledge and capability without architectural change. That’s a rough analogy to the type of improvement I’m imagining.
A neglected problem in AI safety technical research is teasing apart the mechanisms of dangerous capabilities exhibited by current LLMs. In particular, I am thinking that for any model organism (see Model Organisms of Misalignment: The Case for a New Pillar of Alignment Research) of dangerous capabilities (e.g. the sleeper agents paper), we don’t know how much of the phenomenon depends on the particular semantics of terms like “goal” and “deception” and “lie” (insofar as they are used in the scratchpad or in prompts or in finetuning data) or if the same phenomenon could be had by subbing in more or less any word. One approach to this is to make small toy models of these types of phenomena where we can more easily control data distributions and yet still get analogous behavior. In this way we can really control for any particular aspect of the data and figure out, scientifically, the nature of these dangers. By small toy model I’m thinking of highly artificial datasets (perhaps made of binary digits with specific correlation structure, or whatever the minimum needed to get the phenomenon at hand).
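As one possible reading of "binary digits with specific correlation structure", here is a minimal sketch of a dataset generator whose only knob is how strongly each bit correlates with the previous one. The two-state Markov chain, the `p_stay` parameter, and the function names are assumptions made for illustration; nothing here reproduces a dangerous-capability phenomenon, it only shows the kind of fully controlled data such a toy model could be trained on.

```python
import random


def binary_sequence(length, p_stay, seed=0):
    """Binary digits where each bit repeats the previous one with probability p_stay."""
    rng = random.Random(seed)
    seq = [rng.randint(0, 1)]
    for _ in range(length - 1):
        prev = seq[-1]
        seq.append(prev if rng.random() < p_stay else 1 - prev)
    return seq


def repeat_fraction(seq):
    """Empirical check of the correlation knob: fraction of adjacent equal bits."""
    repeats = sum(a == b for a, b in zip(seq, seq[1:]))
    return repeats / (len(seq) - 1)


# A strongly correlated ("sticky") dataset and an uncorrelated control.
sticky = binary_sequence(10_000, p_stay=0.9)
control = binary_sequence(10_000, p_stay=0.5)
```

Because the generating process is known exactly, any behavior a model shows on such data can be traced back to a specific, adjustable property of the distribution rather than to the meaning of particular words.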
Terminology point: When I say “a model has a dangerous capability”, I usually mean “a model has the ability to do XYZ if fine-tuned to do so”. You seem to be using this term somewhat differently as model organisms like the ones you discuss are often (though not always) looking at questions related to inductive biases and generalization (e.g. if you train a model to have a backdoor and then train it in XYZ way does this backdoor get removed).
Pre-registering a prediction for experiment results testing why, when, and how attention heads share information. We (Simplex) will train transformers on data generated from the Cartesian product of sequences generated from 2 independent Mess3 processes. If the degenerate eigenvalue of each is the same and positive (e.g. lambda = 0.7 for both), then a transformer with a single attention head will learn it, and the attention patterns will be 0.7^(d-s), where d-s is the number of context positions between the source and destination. If instead one of the processes has lambda = 0.7 and the other lambda = 0.3, then this will separate the information between two heads, where one has attention patterns of 0.7^(d-s) and the other 0.3^(d-s). If instead we have one process with lambda = 0.7 and the other with lambda = −0.3, then this will require 3 heads, with the 0.7^(d-s) pattern on one head and the 0.3^(d-s) pattern separating between two heads to account for the fact that attention has to be positive. BUT if we have one process with lambda = 0.7 and the other with lambda = −0.7, then this will require 2 heads! The 0.7^(d-s) pattern can be shared on one head for the lambda = 0.7 process and for the −0.7 process when d-s is even, and then the odd d-s cases will be on the other head.
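The arithmetic behind the last case can be checked directly. This is a sketch of one way two non-negative patterns could cover both lambda = 0.7 and lambda = −0.7, assuming the sign is applied downstream of attention and ignoring softmax normalization; the head names and the read-out coefficients (1 and −2) are hypothetical and are not a claim about what a trained transformer will actually do.

```python
LAM = 0.7
MAX_DIST = 8  # check distances d-s = 0..7


def shared_head(k):
    """Non-negative pattern 0.7^k at every distance k = d-s."""
    return LAM ** k


def odd_head(k):
    """Non-negative pattern that is 0.7^k at odd distances and zero otherwise."""
    return LAM ** k if k % 2 == 1 else 0.0


def readout_positive(k):
    """The lambda = 0.7 process only needs the shared head."""
    return shared_head(k)


def readout_negative(k):
    """The lambda = -0.7 process reuses the shared head and flips odd distances."""
    return shared_head(k) - 2.0 * odd_head(k)


for k in range(MAX_DIST):
    # Both attention patterns are valid (non-negative) weights.
    assert shared_head(k) >= 0.0 and odd_head(k) >= 0.0
    # Together they reproduce the two targets 0.7^k and (-0.7)^k.
    assert abs(readout_positive(k) - LAM ** k) < 1e-12
    assert abs(readout_negative(k) - (-LAM) ** k) < 1e-12
```

With lambda = −0.3 instead, the magnitude 0.3^k differs from 0.7^k, so no pattern can be shared in this way, which is consistent with the prediction of 3 heads for that case.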