Hacking the Transformer Prior
Neural Network Priors
I spend a bunch of time thinking about how aligned the neural network prior is for the various architectures we expect to see in the future.
Whatever alignment failures are highly likely under the neural network prior are probably worth a lot of research attention.
Separately, it would be good to figure out knobs/levers for changing the prior distribution to be more aligned (or produce more aligned models). This includes producing more interpretable models.
Analogy to Software Development
In general, I am able to code better if I have access to a high-quality library of simple utility functions. My goal here is to sketch out how we could provide something similar for neural network learning.
Naturally Occurring Utility Functions
One way to think about the induction circuits found in the Transformer Circuits work is that they are “learned utility functions”. I think this is the sort of thing we might want to provide to the networks as part of a “hacked prior”.
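To make the "utility function" framing concrete, here is a minimal sketch of the behavior usually attributed to an induction circuit, written as an ordinary Python function: look back for the most recent earlier occurrence of the current token and predict whatever followed it. The name `induction_predict` is hypothetical, and this describes only the input/output behavior, not how a transformer implements it with attention heads.

```python
from typing import List, Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> List[Optional[str]]:
    """For each position, guess the next token by copying what followed
    the most recent earlier occurrence of the current token.

    Returns None at positions where the current token has not been seen before.
    """
    predictions: List[Optional[str]] = []
    for i, tok in enumerate(tokens):
        guess = None
        # Scan backwards over strictly earlier positions.
        for j in range(i - 1, -1, -1):
            if tokens[j] == tok:
                # j + 1 <= i, so this index is always in range.
                guess = tokens[j + 1]
                break
        predictions.append(guess)
    return predictions


if __name__ == "__main__":
    print(induction_predict(["A", "B", "C", "A", "B"]))
```

A function like this is small, human-readable, and reusable, which is roughly what I would want from an entry in a utility library.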
A Language for Writing Transformer Utility Functions
Thinking Like Transformers provides a programming language, RASP, which is able to express simple functions in terms of how they would be encoded in transformers.
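As a rough illustration of the style of program this allows, below is a small Python emulation of two RASP-like primitives. This is not RASP syntax: the names `select` and `aggregate` mirror the primitives described in Thinking Like Transformers, but the Python signatures, the argument order, and the example function `frac_prev_matching` are assumptions made for this sketch. `select` builds a boolean attention pattern, and `aggregate` averages values over the selected positions.

```python
from typing import Callable, List, Sequence

# selector[q][k] is True when query position q attends to key position k.
Selector = List[List[bool]]


def select(keys: Sequence, queries: Sequence,
           predicate: Callable[[object, object], bool]) -> Selector:
    """Build an attention pattern by comparing every query with every key."""
    return [[predicate(k, q) for k in keys] for q in queries]


def aggregate(selector: Selector, values: Sequence[float],
              default: float = 0.0) -> List[float]:
    """Average the selected values for each query position."""
    out = []
    for row in selector:
        picked = [v for v, on in zip(values, row) if on]
        out.append(sum(picked) / len(picked) if picked else default)
    return out


def frac_prev_matching(tokens: Sequence[str], target: str) -> List[float]:
    """Fraction of tokens up to and including each position that equal `target`."""
    indices = list(range(len(tokens)))
    # Causal pattern: each position attends to itself and everything before it.
    causal = select(indices, indices, lambda k, q: k <= q)
    is_target = [1.0 if t == target else 0.0 for t in tokens]
    return aggregate(causal, is_target)


if __name__ == "__main__":
    print(frac_prev_matching(list("aab"), "a"))
```

The point of writing functions this way is that each `select`/`aggregate` pair corresponds to something an attention head could plausibly compute, so the program doubles as a description of weights.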
Concrete Research Idea: Hacking the Transformer Prior
1. Use RASP (or something RASP-like) to write a bunch of utility functions (such as the induction head functions).
2. Train a language model where a small fraction of the neural network is initialized to your utility functions (and the rest is initialized normally).
3. Study how the model learns to use the programmed functions. Maybe also study how those functions change (or don’t, if they’re frozen).
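The initialization and freezing parts of the steps above could look something like the following toy sketch. It uses plain Python lists rather than a real deep learning framework, and every name in it (`hacked_init`, `sgd_step`, the `programmed` dictionary of hand-written weights) is hypothetical. It assumes the hand-written function has already been compiled down to specific weight values, which is the hard part and is not shown here.

```python
import random
from typing import Dict, List, Tuple

Matrix = List[List[float]]
Mask = List[List[bool]]


def hacked_init(rows: int, cols: int,
                programmed: Dict[Tuple[int, int], float],
                seed: int = 0, scale: float = 0.02) -> Tuple[Matrix, Mask]:
    """Randomly initialize a weight matrix, then overwrite a small set of
    entries with hand-programmed values and mark them as frozen."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]
    frozen = [[False] * cols for _ in range(rows)]
    for (r, c), value in programmed.items():
        weights[r][c] = value
        frozen[r][c] = True
    return weights, frozen


def sgd_step(weights: Matrix, grads: Matrix, frozen: Mask,
             lr: float = 0.1) -> Matrix:
    """One gradient step that leaves frozen (programmed) entries untouched."""
    return [
        [w if f else w - lr * g for w, g, f in zip(w_row, g_row, f_row)]
        for w_row, g_row, f_row in zip(weights, grads, frozen)
    ]


if __name__ == "__main__":
    # Pretend the entries (0, 0) and (1, 1) encode a hand-written function.
    w, mask = hacked_init(2, 3, {(0, 0): 1.0, (1, 1): -1.0})
    grads = [[1.0] * 3 for _ in range(2)]
    w_next = sgd_step(w, grads, mask)
    print(w_next[0][0], w_next[1][1])  # programmed entries are unchanged
```

Dropping the mask (or setting it to all `False`) gives the unfrozen variant, where one could compare the programmed entries before and after training to see how much they drift.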
Future Vision
I think this could be a way to iteratively build more and more interpretable transformers, in a loop where we:
1. Study transformers to see what functions they are implementing
2. Manually implement human-understood versions of these functions
3. Initialize a new transformer with all of your functions, and train it
4. Repeat
If we eventually have a neural network that is made up entirely of human-programmed functions, we probably have an Ontologically Transparent Machine. (AN: I intend to write more thoughts on ontologically transparent machines in the near future.)
I’m pretty sure you mean functions that perform tasks, like you would put in /utils, but I note that on LW “utility function” often refers to the decision theory concept, and “what decision theoretical utility functions are present in the neural network prior” also seems like an interesting (tho less useful) question.