They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can’t tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do.
This was the sentiment I got after applying to the LTFF with an idea. Admittedly, I couldn’t really say whether my idea had been tried before, or wasn’t obviously bad, but the conversation basically boiled down to whether I wanted to use this project as a way to grow in the field, rather than to any particular merits or faults of the idea itself. My motivation was really about trying a cool idea that I genuinely believed could practically improve AI safety if successful; ethically, though, I couldn’t commit to staying in the field even if it (likely?) failed, since I like to go wherever my ideas take me.
Since it may be a while before I personally try out the idea, the most productive thing I can do seems to be to share it. It’s essentially an attempt at a learning algorithm which ‘forces’ a model’s weights to explain the reasoning/motivations behind its actions. The training process looks kind of like a GAN, with the original model’s inner-layer outputs serving as a feature vector. A GPT-3-esque pretrained model learns to convert this feature vector into tokens (at first, random gibberish), which are used to train another GPT-3-esque model to perform the actions of the original model (i.e. given the same inputs as that model, plus an explanation from the first GPT-3-esque model of what it should do). The idea is basically that explanations which correspond more closely to the right idea will cause the ‘learner’ model to improve faster, which acts as feedback letting the ‘teacher’ model know that its explanations are getting better. Ideally, the end result of this procedure is a way to get, as readable text, the exact reasoning behind any of the original model’s actions.
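To make the setup concrete, here’s a minimal sketch of one training step for the ‘learner’ side, assuming PyTorch-style models; `original`, `teacher`, and `learner`, along with methods like `inner_features` and `generate`, are placeholders for illustration rather than a worked-out implementation:

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: `original` is the frozen black-box model, `teacher` is
# the GPT-3-esque explainer, and `learner` is the GPT-3-esque imitator. All
# method names here are assumed for illustration.
def learner_step(original, teacher, learner, inputs, learner_opt):
    with torch.no_grad():
        target = original(inputs)                    # actions to imitate
        features = original.inner_features(inputs)   # sampled inner-layer outputs
        explanation = teacher.generate(features)     # feature vector -> tokens
                                                     # (random gibberish at first)
    # The learner sees the same inputs plus the teacher's explanation, and is
    # trained to reproduce the original model's actions.
    pred = learner(inputs, explanation)
    loss = F.mse_loss(pred, target)
    learner_opt.zero_grad()
    loss.backward()
    learner_opt.step()
    return loss.item()
```

Note that the original model stays frozen throughout; only the learner is updated here, and the teacher gets its own update signal from how much its explanations actually help (more on that below).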
In my mind this kind of procedure avoids many of the problems in this list. You end up not relying on trusting the output of the model; instead, you treat it as a static black box whose inner-layer outputs you sample, until the procedure produces an effective copy of the model along with a mechanism for obtaining its inner reasoning. If the model considers something nefarious to be the optimal way of doing something, but wants to hide that from its outputs, it will still have to explain it to the ‘learner’ model for the learner to better mimic it. If any of this utterly misses one of the points above, I’d like to hear it, and if anyone’s interested in hearing more about this idea, I’ll make a post with the rest of my original submission and the other thoughts that couldn’t fit into the word limit for LTFF submissions.
This was the sentiment I got after applying to the LTFF with an idea. Admittedly, I couldn’t really say whether my idea had been tried before, or wasn’t obviously bad, but the conversation basically boiled down to whether I wanted to use this project as a way to grow in the field, rather than to any particular merits or faults of the idea itself.
I evaluated this application (and we chatted briefly in a video call)! I’m not super confident in my ability to tell whether an idea is going to work. My specific thoughts on your proposal were that it was very unlikely to work, but that someone working on it might learn useful things that could make them a better long-term contributor to the AI Alignment field, which is why my crux for your grant was whether you intended to stay involved in the field long-term.
Appreciation for sharing the reasoning. Disagreement with the reasoning.
eeegnu is saying they go where their ideas take them and expressing ethical qualms, both of which seem like excellent reasons to want someone considering AI safety work, rather than reasons to drive them away from it. Their decision to continue doing AI safety work seems likely to be correlated with whether they could be productive by doing additional AI safety work; if their ideas take them elsewhere, it is unlikely anything would have come of their staying.
This is especially true if one subscribes to the theory that we are worried about sign mistakes rather than about ‘wasting’ funding: if we are funding unproven individuals in AI Safety and think that is good, then this grant is unusually ‘safe’ in the sense of being unlikely to be net negative.
So to the extent that I was running the LTFF, I would have said yes.
I don’t think the policy of “I will fund people to do work that I don’t expect to be useful” is a good one, unless there is some positive externality.
It seems to me that your comment is also saying that the positive externality you are looking for is “this will make this person more productive in helping with AI Safety”, or maybe “this will make them more likely to work on AI Safety”. But you are separately saying that I shouldn’t take at face value their self-reported prediction that they will not continue working in AI Safety regardless of the outcome of the experiment, and should instead bet that working on this will change their mind, which seems weird to me.
Separately, I think there are bad cultural effects of having people work on projects that seem very unlikely to work, especially if the people working on them are self-reportedly not doing so with a long-term safety motivation, but because they found the specific idea they had appealing (or wanted to play around with technologies in the space). I think this will predictably attract a large number of grifters and generally make the field a much worse place to be.
I don’t think the policy of “I will fund people to do work that I don’t expect to be useful” is a good one, unless there is some positive externality.
By this, do you mean you think it’s not good to fund work that you expect to be useful with < 50% probability, even if the downside risk is zero?
Or do you mean you don’t expect it’s useful to fund work you strongly expect to have no positive value when you also expect it to have a significant risk of causing harm?
50% is definitely not my cutoff, and I don’t have any probability cutoff; it’s more something in expected-value space. If you have an idea that could be really great but only has a 1% chance of working, that still feels definitely worth funding. But if you have an idea that seems like it would only improve things a bit, and has a 10% chance of working, that doesn’t feel worth it.
Upvoted for sharing information about the thoughts behind grant-making. I could see reasons not to do this in some cases, but by and large more information seems better.
(I wasn’t able to understand the idea from this description of it.)

Why wouldn’t the explainer just copy the latent vector, and the explainee just learn to do the task in the same way the original model does it? Or more generally, why does this put any pressure towards “explaining the reasons/motives” behind the original model’s actions? I think you’re thinking that by using a pre-trained GPT-3-alike as the explainer model, you start off with something much more language-y, and language-y concepts are easy pickings for the training process to find in order to “communicate” between the original model and the explainee model. This seems not totally crazy, but:
1. it seems to buy you, not anything like further explanations of reasons/motives beyond what’s “already in” the original model, but rather at most a translation into the explainer’s initial pre-trained internal language;
2. the explainer’s initial idiolect stays unexplained / unmotivated;
3. the training procedure doesn’t put pressure towards explanation, and does put pressure towards copying.
These are great points, and ones which I did actually think about when I was brainstorming this idea (if I understand them correctly). I intend to write a more thorough post on this tomorrow with clear examples (I originally imagined this as extracting deeper insights into chess), but to answer them in turn:
1. I did think of these as translators from the actions of models into natural language, though I don’t get the point about extracting things beyond what’s in the original model.

2. I mostly glossed over this part in the brief summary; the motivation I had for it comes from how (unexpectedly?) GANs work by just starting from random noise, with the generator and discriminator still improving each other in the process.

3. My thought here was for the explainer model’s update signal to come from judging the learner model on new, unseen tasks without the explanation (i.e. how similar its outputs are to the original model’s outputs), as sketched below. This way the explainer gets little benefit from just giving the answer directly, since the learner will be tested without it; but if the explanation helps the learner learn in any way, the learner’s performance will improve more (this is basically what the entire idea hinges on).
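As a sketch of what that update could look like, continuing the placeholders from the snippet above: the explanation tokens are discrete, so a REINFORCE-style score-function update is one plausible way to pass the reward back to the explainer (an assumption for illustration; nothing above pins that choice down):

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: the learner is scored on new, unseen inputs *without*
# any explanation, so the explainer is rewarded only insofar as its
# explanations made the learner genuinely better at mimicking the original.
def explainer_step(original, learner, held_out_inputs, explanation_log_probs,
                   explainer_opt):
    with torch.no_grad():
        target = original(held_out_inputs)
        pred = learner(held_out_inputs, explanation=None)  # no hint at test time
        reward = -F.mse_loss(pred, target)  # higher when the mimicry is closer
    # explanation_log_probs: log-probabilities of the tokens the explainer
    # emitted earlier in training; gradients flow through these, not the reward.
    loss = -(reward * explanation_log_probs.sum())
    explainer_opt.zero_grad()
    loss.backward()
    explainer_opt.step()
    return reward.item()
```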
(I didn’t understand this on one read, so I’ll wait for the post to see if I have further comments. I didn’t understand the analogy / extrapolation drawn in 2., and I didn’t understand what scheme is happening in 3.; maybe being a little more precise and explicit about the setup would help.)