(1) I think your characterization of the arguments is unfair, but whatever. How about this one in particular; what’s your response to it: https://www.planned-obsolescence.org/july-2022-training-game-report/
I’d also be interested in your response to the original mesa-optimizers paper and Joe Carlsmith’s work, but I’m conscious of your time, so I won’t press you on those.
(2)
a. Re the alignment faking paper: What are the multiple foot-shots you are talking about? I’d be curious to see them listed, because then we can talk about what it would be like to see similar results but without foot-shot A, without foot-shot B, etc.
b. We aren’t going to see the AIs get dumber. They aren’t going to have worse understandings of human concepts. I don’t think we’ll see a “spike of alignment difficulties” or “problems extrapolating goodness sanely,” so be advised that the hypothesis you are tracking, in which alignment turns out to be hard, seems to be a different hypothesis from the one I believe in.
c. CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT, though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.
d. What would make me think I was wrong about alignment difficulty? The biggest thing would be a published Safety Case that I look over and am like “yep, that seems like it would work,” combined with the leading companies committing to do it. For example, I think that a more fleshed-out and battle-tested version of this would count, if the companies were making it their official commitment: https://alignment.anthropic.com/2024/safety-cases/
Superalignment had their W2SG (weak-to-strong generalization) agenda, which also seemed to me like maaaybe it could work. Basically, I think we don’t currently have a gears-level theory for how to make an aligned AGI; like, we don’t have a theory of how cognition evolves during training, plus a training process, such that I can be like “Yep, if those assumptions hold, and we follow this procedure, then things will be fine” (and those assumptions seem like they might hold).
There’s a lot more stuff besides this, probably, but I’ll stop for now; the comment is long enough already.
I’ll take a look at that version of the argument.
I think I addressed the foot-shots thing in my response to Ryan.
Re:
“CoT is about as interpretable as I expected. I predict the industry will soon move away from interpretable CoT, though, and towards CoTs that are trained to look nice, and then eventually away from English CoT entirely and towards some sort of alien language (e.g. vectors, or recurrence). I would feel significantly more optimistic about alignment difficulty if I thought that the status quo w.r.t. faithful CoT would persist through AGI. If it even persists for one more year, I’ll be mildly pleased, and if it persists for three it’ll be a significant update.”
So:
I think you can almost certainly get AIs to think in English CoT or not, as you choose, and that you can accomplish almost anything you’d like with non-neuralese CoT.
I also more tentatively think that the performance tax for thinking in CoT is not too large, or in some cases negative. Remembering things in words helps you coordinate within yourself as well as with others; although I might have vaguely non-English inspirations, putting them into words helps me coordinate with myself over time. And I think people overrate how amazing and hard-to-put-into-words their insights are.
Thus, it appears… somewhat likely… (pretty uncertain here)… that whether people end up using interpretable CoT is just contingent, not foreordained.
If we move towards a scenario where people find utility in seeing their AIs’ thoughts, or in having other AIs examine those thoughts, and so on, then even if there is a reasonably large tax in cognitive power, we could end up living in a world where AI thoughts are mostly visible because of the other advantages.
But on the other hand, if cognition mostly takes place behind API walls, and thus there’s no advantage to the user in being able to see and understand the thoughts, then that (among other factors) could help bring us to a world with less interpretable CoT.
But I mean, I’m not super plugged into SF, so maybe you already know that OpenAI has Noumena-GPT running, which thinks in shapes incomprehensible to man and is like 100x smarter or more efficient.
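To make the English-CoT vs. neuralese distinction concrete, here is a minimal toy sketch (purely illustrative; the ToyModel class and its methods are invented for this example and don’t correspond to any lab’s actual architecture). In the first regime every intermediate reasoning step is emitted as text that can be logged and audited; in the second, the intermediate state is an opaque vector fed back into the model, so there is nothing legible to inspect.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ToyModel:
    """Hypothetical stand-in for a model; both 'step' methods are placeholders."""
    hidden_size: int = 8

    def step_tokens(self, prompt: str, thought_so_far: List[str]) -> str:
        """English CoT: each reasoning step is emitted as human-readable text."""
        return f"step {len(thought_so_far) + 1}: reason about {prompt!r}"

    def step_latent(self, state: List[float]) -> List[float]:
        """'Neuralese': each step is an opaque vector fed back into the model."""
        return [x + 0.1 for x in state]  # placeholder for a learned recurrent update


def reason_with_english_cot(model: ToyModel, prompt: str, steps: int = 3) -> List[str]:
    thought: List[str] = []
    for _ in range(steps):
        thought.append(model.step_tokens(prompt, thought))  # every step is inspectable text
    return thought


def reason_with_neuralese(model: ToyModel, steps: int = 3) -> List[float]:
    state = [0.0] * model.hidden_size
    for _ in range(steps):
        state = model.step_latent(state)  # intermediate "thoughts" never surface as text
    return state


if __name__ == "__main__":
    m = ToyModel()
    print(reason_with_english_cot(m, "2 + 2"))  # human-readable chain of thought
    print(reason_with_neuralese(m))             # just a vector of floats
```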
Cool, I’ll switch to that thread then. And yeah, thanks for looking at Ajeya’s argument; I’m curious to hear what you think of it. (Based on what you said in the other thread with Ryan, I’d be like “So you agree that if the training signal occasionally reinforces bad behavior, then you’ll get bad behavior? Guess what: we don’t know how to make a training signal that doesn’t occasionally reinforce bad behavior.” Then separately there are concerns about inductive biases, but I think those are secondary.)
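As a toy illustration of the “occasionally reinforces bad behavior” worry (all numbers hypothetical, and a deliberately crude model of an imperfect overseer rather than anyone’s actual training setup): if the reward signal sometimes fails to catch bad behavior, and uncaught bad behavior scores a bit higher than honest behavior, then ordinary reward-maximizing training starts to favor the bad behavior once the miss rate is high enough.

```python
# Toy illustration (hypothetical numbers) of an imperfect reward signal favoring bad behavior.
# An overseer scores answers, but misses deception some fraction of the time; when deception
# slips through, it scores slightly higher than honesty (it looks more helpful/impressive).
# Reward-maximizing training pushes the policy toward whichever action has higher expected reward.

R_HONEST = 1.0             # reward for an honest answer
R_DECEPTIVE_MISSED = 1.3   # reward when deception goes undetected
R_DECEPTIVE_CAUGHT = -1.0  # penalty when the overseer catches it


def expected_deceptive_reward(miss_rate: float) -> float:
    """Expected reward for the deceptive action, given the overseer's miss rate."""
    return miss_rate * R_DECEPTIVE_MISSED + (1 - miss_rate) * R_DECEPTIVE_CAUGHT


for miss_rate in (0.5, 0.8, 0.9, 0.95):
    e_dec = expected_deceptive_reward(miss_rate)
    favored = "deceptive" if e_dec > R_HONEST else "honest"
    print(f"miss rate {miss_rate:.0%}: E[r | deceptive] = {e_dec:.2f} "
          f"vs E[r | honest] = {R_HONEST:.2f} -> training favors {favored}")
```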
Re: faithful CoT: I’m more pessimistic than you, partly due to inside info, but mostly not; mostly it’s just my guesses about how the theoretical benefits of recurrence / neuralese will become practical benefits sooner or later. I agree it’s possible that I’m wrong and we’ll still have CoT when we reach AGI. This is in fact one of my main sources of hope on the technical alignment side. It’s literally the main thing I was talking about and recommending when I was at OpenAI. Oh, and I totally agree it’s contingent. It’s very frustrating to me because I think we could cut, like, idk, 50% of the misalignment risk if we just had all the major industry players make a joint statement being like “We think faithful CoT is the safer path, so we commit to stay within that paradigm henceforth, and to police this norm amongst ourselves.”
This is a good discussion btw, thanks!